Weird Invention of the Day: R

2013-11-01

Sorting Papers by Keywords

Imagine you are given an inhumanly big electronic pile of publications to read and an early deadline. Even reading the abstracts will cost you a considerable amount of time, and most of the papers are not at all related to what you are working on. How do you select the papers to read first?

A simple approach might be the following: Assume you can come up with a set of keywords, each with an accompanying quality factor. The quality factor indicates how much you are interested in a given keyword. A very important keyword might be given a quality factor of 1.0, and a more general keyword might have a quality factor of just 0.1.

With this set of keywords and quality factors it is quite easy to compute a score for every publication. For every paper and every keyword, the number of occurrences of the keyword is counted, and the score of the paper is increased by that count multiplied by the keyword's quality factor. Sorting the papers by descending score then gives you the order in which to read them. While this may not be a masterpiece of Information Retrieval, it is still a simple and quick approach to find relevant information.
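The scoring rule described above can be sketched in a few lines. The following is a minimal Python illustration, not the R script mentioned below: it works on plain strings instead of .pdf files, the names score_text and rank_papers are made up for this example, and counting case-insensitive whole-word matches is an assumption about what "occurrence" means.

```python
import re


def score_text(text, keywords):
    """Return the sum of (occurrence count * quality factor) over all keywords.

    Assumption: occurrences are counted as case-insensitive whole-word matches.
    """
    lowered = text.lower()
    score = 0.0
    for keyword, quality in keywords.items():
        pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
        score += quality * len(re.findall(pattern, lowered))
    return score


def rank_papers(papers, keywords):
    """Return (score, title) pairs sorted by descending score.

    `papers` maps a title to the already extracted text of that paper.
    """
    scored = [(score_text(text, keywords), title) for title, text in papers.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical keywords and paper texts, for illustration only.
    keywords = {"graphene": 1.0, "material": 0.1}
    papers = {
        "paper_a": "Graphene is a promising material. Graphene sheets are thin.",
        "paper_b": "A new material for batteries.",
    }
    for score, title in rank_papers(papers, keywords):
        print("%-10s %.2f" % (title, score))
```

Whole-word matching keeps a keyword from being counted inside longer words; if partial matches are wanted, the word-boundary markers in the pattern could simply be dropped.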

A simple R script to create a table with paper scores can be downloaded here. It uses the text mining package tm, which conveniently reads .pdf files.
The keyword/quality-factor pairs need to be provided in a separate file, as do the paths to the publications. The script creates a simple .html file for convenient viewing of the scored paper list.

2013-06-13

Octave vs Python

"Don't do loops in Octave" is a well-known truth. But sometimes loops are just too handy or cannot be avoided at all. I was curious whether there is a difference in the execution time of loops in Python and Octave, since both are interpreted languages.

tic;
for i=1:100
  for j=1:100
    for k=1:100
      vain1 = i^2+j^2+k^2;
      vain2 = i^2+j^2+k^2;
      ...
      vain10 = i^2+j^2+k^2;
    endfor
  endfor
endfor
t = toc;
disp(num2str(t));
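The Python script used in the comparison is not reproduced in this post. A minimal sketch of what an equivalent loops.py might look like is given here; the function name run_loops and its parameter n are made up for this example, and the ten assignments assume that the elided lines of the Octave script follow the same pattern as the ones shown.

```python
import time


def run_loops(n=100):
    """Run three nested loops of n iterations each and return the elapsed seconds.

    Assumption: ten identical assignments per iteration, mirroring the
    pattern of the Octave script (whose middle lines are elided).
    """
    start = time.time()
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            for k in range(1, n + 1):
                vain1 = i**2 + j**2 + k**2
                vain2 = i**2 + j**2 + k**2
                vain3 = i**2 + j**2 + k**2
                vain4 = i**2 + j**2 + k**2
                vain5 = i**2 + j**2 + k**2
                vain6 = i**2 + j**2 + k**2
                vain7 = i**2 + j**2 + k**2
                vain8 = i**2 + j**2 + k**2
                vain9 = i**2 + j**2 + k**2
                vain10 = i**2 + j**2 + k**2
    return time.time() - start


if __name__ == "__main__":
    # Print the elapsed time in seconds, like the Octave script does.
    print(run_loops())
```

Wrapping the loops in a function is a choice made for this sketch; a script that runs the same loops at module level would be the more literal counterpart of the Octave listing.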

Both scripts do nothing but execute three nested loops of 100 iterations each, with some useless computation in the innermost loop. One script is written in Python and one in Octave; the Octave script is listed above. Octave version 3.6.4 and Python version 2.7.3 were used for the comparison. The results (execution times in seconds) are devastating.

user@machine:~/scripts/python_vs_octave$ octave loops.m
...
39.48
...
user@machine:~/scripts/python_vs_octave$ python loops.py
3.10390710831

Even in this simple example, the Octave script takes more than 10 times as long as the equivalent Python script (39.48 s versus 3.10 s). The difference becomes even bigger if the amount of computation inside the loops is increased. Since Python also comes with sophisticated matrix processing capabilities (NumPy, SciPy), anyone who cannot accept such severe performance degradation in more sophisticated numerical analysis can simply shorten the proverb above to "Don't do Octave."

PS: Another popular data analysis tool is R. On the same machine, using R version 2.14.1, the execution time of the equivalent script was 18.35 s, which lies between the execution times of the Python and Octave scripts.