2013-11-21

Finding Research Gaps with Google Scholar

Imagine you want to do some very important research and you are desperate to identify a research gap in the current state of the art. Moreover, let's assume you have the intuition that a research gap can be found by combining two concepts from two different fields.
For instance, you might just have read two textbooks, one about freshwater aquarium fish and one about chemicals dissolved in water. Now you want to combine concepts from these two fields. To do this you need an estimate of 'how much' research has been done on the effect of chemical X on fish Y.

To get a rough estimate of how much research has already been done, Google Scholar can be used. For every search, it gives you an approximate number of publications that match your search terms. With this you can build a matrix like the following:


The rows correspond to the keywords from one category (here: different types of fish) and the columns correspond to the keywords from the other category (here: different chemicals). The color of each cell corresponds to the approximate number of publications on Google Scholar that contain both keywords.
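The bookkeeping behind such a matrix can be sketched in a few lines of Python. This is only a minimal sketch and not the script mentioned below: the function names are made up for illustration, and `count_hits` is a hypothetical callable that you would have to supply yourself to turn a query string into an approximate result count. How that number is actually obtained from Google Scholar is deliberately left out.

```python
from typing import Callable, List, Tuple


def build_query(row_term: str, col_term: str) -> str:
    """Quote both keywords so multi-word terms are searched as phrases."""
    return f'"{row_term}" "{col_term}"'


def count_matrix(
    rows: List[str],
    cols: List[str],
    count_hits: Callable[[str], int],  # hypothetical: query string -> hit count
) -> List[List[int]]:
    """Return one list per row keyword, with one hit count per column keyword."""
    return [[count_hits(build_query(r, c)) for c in cols] for r in rows]


def least_studied(
    rows: List[str],
    cols: List[str],
    matrix: List[List[int]],
    n: int = 3,
) -> List[Tuple[int, str, str]]:
    """Return the n keyword pairs with the fewest hits, smallest first."""
    cells = [
        (matrix[i][j], r, c)
        for i, r in enumerate(rows)
        for j, c in enumerate(cols)
    ]
    return sorted(cells)[:n]
```

The pairs returned by `least_studied` are the candidates for a gap. The matrix itself can be handed to any plotting tool to color the cells by count.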

Certainly you cannot gain ultimate wisdom from this. Two keywords might just be a nonsensical pairing, or the keywords might be used in many publications but in a context totally different from what you anticipated. However, it provides a quick and simple way to figure out whether you are entering a crowded field or not.

The script that was used to produce this plot can be downloaded here. The text-based web browser Lynx needs to be installed to run it.

2013-11-01

Sorting Papers by Keywords

Imagine you are given an inhumanly large electronic pile of publications to read and an early deadline. Even reading the abstracts will cost you a considerable amount of time, and most of the papers are not at all related to what you are up to. How do you select the papers to read first?

A simple approach might be the following: Assume you can come up with a set of keywords, each with an accompanying quality factor. The quality factor indicates how much you are interested in a given keyword. A very important keyword might be given a quality factor of 1.0 and a more general keyword might have a quality factor of just 0.1.

With this set of keywords and quality factors it is quite easy to compute a score for every publication. For every paper and every keyword, the number of occurrences of the keyword is counted and the score of the document is increased according to the quality factor. The papers can then be sorted by score, which gives you the order in which to read them. While this may not be a masterpiece of Information Retrieval, it is still a simple and quick approach to find relevant information.
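To make the scoring concrete, here is a minimal Python sketch. It is not the R script mentioned below. It assumes that "increased according to the quality factor" means the quality factor is added once per occurrence, that matching is case-insensitive and on whole words, and that the text of each paper has already been extracted into a string. The function names are made up for illustration.

```python
import re
from typing import Dict, List, Tuple


def score_text(text: str, weights: Dict[str, float]) -> float:
    """Sum of (quality factor x number of occurrences) over all keywords."""
    lowered = text.lower()
    score = 0.0
    for keyword, weight in weights.items():
        # Whole-word, case-insensitive match (an assumption of this sketch).
        pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
        score += weight * len(re.findall(pattern, lowered))
    return score


def rank_papers(
    papers: Dict[str, str],      # paper name -> extracted full text
    weights: Dict[str, float],   # keyword -> quality factor
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs, highest score first."""
    scored = [(name, score_text(text, weights)) for name, text in papers.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

With this weighting, a single occurrence of a keyword with quality factor 1.0 counts as much as ten occurrences of a keyword with quality factor 0.1.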

A simple R script to create a table with paper scores can be downloaded here. The text mining package tm is used, which reads .pdf files conveniently.
The keyword/quality factor pairs need to be provided in a separate file, and so do the paths to the publications. The script creates a simple .html file for convenient viewing of the scored paper list.
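The input and output steps could look roughly like the following Python sketch. The file format used by the actual script is not described here, so the one-pair-per-line "keyword,quality factor" layout and both function names are purely assumptions for illustration.

```python
import csv
import html
from typing import Dict, Iterable, List, Tuple


def parse_keywords(lines: Iterable[str]) -> Dict[str, float]:
    """Read 'keyword,quality factor' lines (assumed format) into a dict."""
    weights: Dict[str, float] = {}
    for row in csv.reader(lines):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        weights[row[0].strip()] = float(row[1])
    return weights


def render_html(ranked: List[Tuple[str, float]]) -> str:
    """Render (path, score) pairs as a plain HTML table, in the given order."""
    rows = "\n".join(
        f"<tr><td>{html.escape(path)}</td><td>{score:.2f}</td></tr>"
        for path, score in ranked
    )
    return "<table>\n<tr><th>Paper</th><th>Score</th></tr>\n" + rows + "\n</table>"
```

Since an open file yields its lines one by one, `parse_keywords` can be given a file object directly. The string returned by `render_html` can be written to an .html file and opened in any browser.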