Recently I was given two very long lists of scientific publications: one as a BibTeX file and the other as a table in an HTML file. Some of the publications in the BibTeX file were missing from the HTML table, and the task was to find out which ones. An additional challenge was that both lists had been created manually by different people, so author names, titles, etc. did not match character by character. Words with special characters, e.g. 'Jörg', would be spelled as 'J\"org' in BibTeX and 'Jörg' in the HTML table.
A simple script that helps with this tedious problem can be downloaded here. The script reads the .bib and the .html file and compares the title field of every BibTeX entry with every row in the HTML table. The Python standard-library module difflib is used to perform "approximate (sub)string matching": for each pair of title and table row, the script calculates a similarity score from 0.0 (no match at all) to 1.0 (the title is contained verbatim as a substring of the row).
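To give an idea of how such a score can be computed with difflib, here is a minimal sketch. It is not the downloadable script itself: the function name `containment_score`, the lowercasing, and the choice to divide the matched characters by the title length are my assumptions for illustration.

```python
from difflib import SequenceMatcher


def containment_score(title: str, row_text: str) -> float:
    """Fraction of the title's characters that difflib can align with row_text.

    Hypothetical helper: 1.0 means the whole title occurs verbatim in the row,
    0.0 means the two strings share no characters at all.
    """
    a = title.lower()
    b = row_text.lower()
    if not a:
        return 0.0
    # autojunk=False switches off difflib's heuristic that ignores very
    # frequent characters in long strings, so long table rows are compared fully.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # get_matching_blocks() ends with a dummy block of size 0, which adds nothing.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(a)


# Made-up example row, not taken from the real data.
row = 'J. Doe, "Approximate String Matching", 2010'
print(containment_score("Approximate string matching", row))  # prints 1.0
```

A title that differs from the row only by a few characters, such as 'J\"org' versus 'Jörg', still gets a score close to 1.0, which is what makes this kind of comparison tolerant of manual spelling differences.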
Finally, the script generates a report listing all the publications that are most probably missing.
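The reporting step could look roughly like the sketch below. It assumes the titles have already been extracted from both files into plain lists, and it uses `difflib.get_close_matches` with a cutoff of 0.8; the function names, the cutoff, and the sample titles are all assumptions of mine, and the real script may decide differently.

```python
from difflib import get_close_matches


def missing_titles(bib_titles, html_titles, cutoff=0.8):
    """Return the BibTeX titles that have no close match among the HTML titles."""
    candidates = [t.lower() for t in html_titles]
    missing = []
    for title in bib_titles:
        # get_close_matches returns at most n candidates whose similarity
        # ratio is >= cutoff; an empty list means nothing similar was found.
        if not get_close_matches(title.lower(), candidates, n=1, cutoff=cutoff):
            missing.append(title)
    return missing


def format_report(missing):
    """Build a plain-text report of the probably missing publications."""
    lines = [f"{len(missing)} publication(s) probably missing from the HTML table:"]
    lines += [f"  - {title}" for title in missing]
    return "\n".join(lines)


# Made-up sample data for illustration only.
bib = ["Fuzzy string matching in practice", "A survey of graph algorithms"]
html = ["Fuzzy String Matching in Practise", "Deep learning for proteins"]
print(format_report(missing_titles(bib, html)))
```

A lower cutoff reports fewer publications as missing but risks overlooking some, so the value is worth tuning against a handful of entries checked by hand.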