2014-03-18
Did you ever want to automatically mass-download data from a website that requires a login, e.g. a wiki or a social network? If the website stores a session cookie on your computer, it may be possible to download the content in an automated fashion using Wget.
It is possible to pass Wget a cookie file as a parameter. This might look like the following:
wget --keep-session-cookies --load-cookies=cookies.txt -p -k https://someurl.org/protected/site_01.htm
An example of a cookie file might look as follows. It uses the Netscape cookies.txt format (domain, include-subdomains flag, path, secure flag, expiry as a Unix timestamp, cookie name, cookie value), and the fields must be separated by tabs, not spaces:
# HTTP cookie file.
someurl.org TRUE / FALSE 1391671828 someurlUserID 42
someurl.org TRUE / FALSE 1391671828 someurlUserName Peter
someurl.org TRUE / FALSE 1391671828 someurlToken d3d3fdsere
someurl.org TRUE / FALSE -1 someurl_session g8furfv99dmp1
After logging in on the respective website, you can conveniently view the required cookie values in Firefox.
The date command can be used to convert the expiration time shown by Firefox into the Unix timestamp format used in the Wget cookie file, e.g. by issuing:
date -d "Wed 12 Mar 2014 01:31:42 PM CET" +%s
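To mass-download a whole range of protected pages, the wget call can be wrapped in a small loop. The following is only a minimal Python sketch of such a loop; the URL pattern and the page range are made-up examples, and only the cookie handling mirrors the call shown above.

#!/usr/bin/env python3
# Minimal sketch: download a range of protected pages with wget, reusing the
# session cookies from cookies.txt. URL pattern and page range are hypothetical.
import subprocess

COOKIE_FILE = "cookies.txt"
URL_PATTERN = "https://someurl.org/protected/site_{:02d}.htm"  # assumed URL scheme

for page in range(1, 11):  # pages site_01.htm .. site_10.htm
    url = URL_PATTERN.format(page)
    subprocess.run([
        "wget",
        "--keep-session-cookies",
        "--load-cookies=" + COOKIE_FILE,
        "-p",   # also fetch page requisites (images, css, ...)
        "-k",   # convert links for local viewing
        url,
    ], check=True)  # abort the loop if a download fails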

2014-01-10
Installing Tizen SDK on Ubuntu with OpenJDK
Today I tried installing the Tizen SDK (tizen-sdk-ubuntu64-v2.2.71.bin) on Ubuntu 12.04. The installation script exited complaining that it requires Oracle JDK instead of OpenJDK ("OpenJDK is not supported. Try again with Oracle JDK.").
This is a bit annoying, because OpenJDK comes with Ubuntu by default and Oracle JDK is no longer in the repositories. It seems the installer's requirement is merely a policy and not an actual technical requirement. The installation does succeed even with OpenJDK if the following lines are commented out in the installation script:
# check the default java as OpenJDK ##
if [ "ubuntu" = "${OS_NAME}" ] ; then
    CHECK_OPENJDK=`java -version 2>&1 | egrep -e OpenJDK`
    if [ -n "${CHECK_OPENJDK}" ] ; then
        echo "${CE} OpenJDK is not supported. Try again with Oracle JDK. ${CN}"
        exit 1
    fi
fi
The installation and basic usage of the IDE seem to work without problems after this. The OpenJDK version used was 1.6.0_27.
2013-12-28
Fun with File Systems
Imagine you have a data logging application that writes data to disk continuously. Since the application is not very stable, you want it to write out the data in small files, so it does not lose too much data if it crashes. This creates the need to find a good trade-off between file size and file system in order to avoid wasting too much disk space on file system overhead.
An approach to measure file system overhead and to explore the design space of different file systems and file sizes quickly is as follows:
- Create a ramdisk and in this ramdisk create a bulk file of a given size (using dd).
- For all combinations of file size and file system:
  - Format the bulk file with the desired file system (using mkfs) and mount it.
  - Continuously write files of a fixed size to the mounted bulk file until an exception occurs, and record how many files could be written (using a small script; a sketch is shown after this list).
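The original scripts are available for download below; the following is only a rough Python sketch of that last step, assuming the formatted bulk file is mounted at a hypothetical mount point.

#!/usr/bin/env python3
# Rough sketch of the "fill until full" step: write fixed-size files into the
# mounted image until the file system reports an error, then print the count.
# Mount point, file naming and file size are illustrative assumptions.
import os

MOUNT_POINT = "/mnt/fs_test"   # where the formatted bulk file is mounted
FILE_SIZE = 65536              # bytes per file
payload = b"\0" * FILE_SIZE

count = 0
try:
    while True:
        path = os.path.join(MOUNT_POINT, "data_{:08d}.bin".format(count))
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # make sure the data really reaches the file system
        count += 1
except OSError as exc:  # typically "No space left on device" once the image is full
    print("wrote {} files of {} bytes before: {}".format(count, FILE_SIZE, exc))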
From the resulting figure it can be seen that the file system overhead is excessive for small file sizes. ext2, ext3 and ext4 behave almost identically in terms of overhead. The minimal overhead in this experiment is observed for vfat at a file size of 65536 bytes per file. Strangely, exfat is always outperformed by ntfs.
The scripts that were used to conduct this experiment can be downloaded here.
2013-12-19
Creating Tagclouds with PyTagCloud
Tag clouds are a nice way to visualize textual information. They provide a colorful overview of the frequent terms of a text and might also tell you something about its writing style.
For instance, the following is a tag cloud of the famous paper "Cramming more components onto integrated circuits" by Gordon Moore. The script that was used to create it can be downloaded here.
The script uses PyTagCloud, which gets most of the job done. Cloning the git repository, building, and installing is straightforward. Do not forget to have pygame installed.
Nice tag clouds cannot be created fully automatically. To create beautiful tag clouds, natural language text usually needs a bit of preprocessing. The script provided above uses NLTK for stop word removal and for calculating term frequencies. Moreover, it might be necessary to manually change term frequencies or to remove certain terms entirely.
PyTagCloud supports exporting the tag cloud to .png images. Exporting to HTML/CSS is almost possible as well, but this feature seems a little broken at the time of this writing: PyTagCloud does not correctly export whether a term should be rotated or not, resulting in tag clouds with overlapping terms.
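The core of the pipeline might look roughly like the following Python sketch, assuming PyTagCloud and NLTK (with its 'stopwords' corpus) are installed; the input file name is made up.

#!/usr/bin/env python3
# Sketch of the preprocessing + rendering steps described above.
# The plain-text input file is a hypothetical example.
from nltk.corpus import stopwords           # run nltk.download('stopwords') once beforehand
from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

with open("moore1965.txt") as f:            # plain-text version of the paper (assumed)
    text = f.read().lower()

# Count term frequencies and drop English stop words before building the cloud.
stop = set(stopwords.words("english"))
counts = [(word, n) for word, n in get_tag_counts(text) if word not in stop]

tags = make_tags(counts, maxsize=120)       # scale the most frequent term to 120 pt
create_tag_image(tags, "tagcloud.png", size=(900, 600), fontname="Lobster")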
2013-12-06
Matching Bibtex and HTML
Recently I was given two very long lists of scientific publications: one as a BibTeX file and another as a table in an HTML file. Some of the publications in the BibTeX file were missing from the HTML table, and the task was to find out which ones these were. An additional challenge was that both lists had been created manually by different people, and therefore author names, titles, etc. did not match character by character. Words with special characters, e.g. 'Jörg', would be spelled as 'J\"org' in BibTeX and as 'Jörg' in the HTML table.
A simple script that helps with this tedious problem can be downloaded here. The script reads the .bib and the .html file and compares the title field of every BibTeX entry with every row of the HTML table. The package difflib is used to perform approximate (sub)string matching: based on a string comparison metric, it calculates a value from 0.0 (no match at all) to 1.0 (the identical string is contained as a substring).
Finally, the script generates a report that contains all the publications which are most probably missing.
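The matching step might look roughly like this Python sketch; the parsing of the .bib and .html files is omitted, the sample lists and the 0.8 threshold are assumptions, and the actual script's scoring may differ in detail.

#!/usr/bin/env python3
# Sketch of the matching step: score every BibTeX title against every HTML row
# with difflib and report titles whose best score is low. Sample data and the
# threshold are illustrative assumptions.
from difflib import SequenceMatcher

bibtex_titles = ["Cramming more components onto integrated circuits"]  # extracted from the .bib file
html_rows = ["G. Moore: Cramming more components onto integrated circuits, 1965"]  # extracted table rows

def similarity(title, row):
    """Similarity between 0.0 (no match) and 1.0 (identical), case-insensitive."""
    return SequenceMatcher(None, title.lower(), row.lower()).ratio()

for title in bibtex_titles:
    best = max((similarity(title, row) for row in html_rows), default=0.0)
    if best < 0.8:  # assumed threshold for "probably missing"
        print("probably missing from the HTML table: {!r} (best score {:.2f})".format(title, best))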
2013-11-21
Finding Research Gaps with Google Scholar
Imagine you want to do some very important research and you are desperate to identify a research gap with respect to the current state of the art. Moreover, let's assume you have the intuition that a research gap can be found by combining two concepts from two different fields.
For instance, you might just have read two textbooks, one about freshwater aquarium fish and one about chemicals dissolved in water. Now you want to combine concepts from these two fields. To do this you need an estimate of 'how much' research has been done on the effect of chemical X on fish Y.
To get a rough estimate of 'how much' research has already been done, Google Scholar can be used. For every search, it gives you an approximate number of publications that match your search terms. With this you can build a matrix like the following:
The rows correspond to the keywords from one category (here: different types of fish) and the columns correspond to the other category (here: different chemicals). The color corresponds to the approximate number of publications on Google Scholar that contain both keywords.
Certainly you cannot gain ultimate wisdom from this. Two keywords might just be a nonsensical pairing, or the keywords might be used in many publications but in a context totally different from what you anticipated. However, it provides a quick and simple way to figure out whether you are entering a crowded field or not.
The script that was used to produce this plot can be downloaded here. The text-based web browser Lynx needs to be installed to run it.
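The counting step might look roughly like the following Python sketch. The keyword lists, the Scholar URL and the 'About N results' pattern are assumptions, and Google Scholar may rate-limit or block automated queries, so treat the numbers with care.

#!/usr/bin/env python3
# Rough sketch of the counting step: fetch a Google Scholar result page with
# lynx and extract the reported result count for every keyword pair.
# URL, result pattern and keyword lists are assumptions.
import re
import subprocess
from urllib.parse import quote_plus

fish = ["guppy", "neon tetra"]
chemicals = ["nitrate", "copper"]

def scholar_count(query):
    url = "https://scholar.google.com/scholar?q=" + quote_plus(query)
    page = subprocess.run(["lynx", "-dump", url],
                          capture_output=True, text=True).stdout
    match = re.search(r"About ([\d,.]+) results", page)
    return int(re.sub(r"[,.]", "", match.group(1))) if match else 0

for f in fish:
    row = [scholar_count('"{}" "{}"'.format(f, c)) for c in chemicals]
    print(f, row)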
2013-11-01
Sorting Papers by Keywords
Imagine you are given an inhumanly big electronic pile of publications to read and an early deadline. Even reading the abstracts will cost you a considerable amount of your time, and most of the papers are not at all related to what you are up to. How do you select the papers to read first?
A simple approach might be the following: assume you can come up with a set of keywords, each with an accompanying quality factor. The quality factor indicates how much you are interested in a given keyword. A very important keyword might be given a quality factor of 1.0 and a more general keyword might have a quality factor of just 0.1.
With this set of keywords and quality factors it is quite easy to compute a score for every publication. For every paper and every keyword, the number of occurrences of the keyword is counted and the score of the document is increased according to the quality factor. The papers can then be sorted by score, which gives you the order in which to read them. While this may not be a masterpiece of information retrieval, it is still a simple and quick approach to find relevant information.
A simple R script to create a table with paper scores can be downloaded here. The text mining package tm is used, which reads .pdf files conveniently.
The keyword/quality-factor pairs need to be provided in an extra file, just like the paths to the publications. The script creates a simple .html file for convenient viewing of the scored paper list.
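The original post uses an R script based on the tm package; purely for illustration, the scoring idea itself can be sketched in a few lines of Python (the keywords, quality factors and paper texts below are made up, and reading the actual .pdf files is omitted):

#!/usr/bin/env python3
# Python sketch of the scoring idea only (the original uses R with the tm package).
# Keywords, quality factors and paper texts are illustrative assumptions.
keywords = {"wireless sensor network": 1.0, "energy": 0.1}  # keyword -> quality factor

papers = {
    "paper_a.txt": "energy efficient wireless sensor network deployment ...",
    "paper_b.txt": "a survey on energy harvesting ...",
}  # file name -> extracted plain text (stands in for reading the real .pdf files)

def score(text):
    text = text.lower()
    # Each occurrence of a keyword adds its quality factor to the paper's score.
    return sum(text.count(keyword) * quality for keyword, quality in keywords.items())

# Sort the papers by descending score to get the reading order.
for name in sorted(papers, key=lambda n: score(papers[n]), reverse=True):
    print("{:6.2f}  {}".format(score(papers[name]), name))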