2013-12-28

Fun with File Systems

Imagine you have a data logging application that writes data to disk continuously. Since the application is not very stable, you want it to write out the data in small files, so that not too much data is lost if the application crashes. This creates a need to find a good trade-off between file size and file system in order to avoid wasting too much disk space on file system overhead.

An approach to quickly measure file system overhead and explore the design space of different file systems and file sizes is the following:
  • Create a ramdisk and in this ramdisk create a bulk file of given size (using dd)
  • For all combinations of file size and file system:
    • Format the bulk file with a desired file system (using mkfs) and mount it.
    • Continuously write files of a fixed size to the mounted bulk file until an exception occurs and record how many files could be written (using some script; a sketch of these steps is given after this list).
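A minimal sketch of these steps in Python might look as follows. It assumes a tmpfs is already mounted at /mnt/ramdisk and an empty mount point /mnt/bulk exists (both paths and all other details are made up here, and mkfs and mount require root privileges); the actual scripts used for the experiment may differ.

import os
import subprocess

RAMDISK = "/mnt/ramdisk"      # assumed tmpfs mount
BULK = os.path.join(RAMDISK, "bulk.img")
MOUNTPOINT = "/mnt/bulk"      # assumed empty directory

def run(cmd):
    subprocess.check_call(cmd)

def fill(mountpoint, file_size):
    """Write files of file_size bytes until the file system throws an error."""
    payload = b"x" * file_size
    count = 0
    try:
        while True:
            with open(os.path.join(mountpoint, "f%08d" % count), "wb") as f:
                f.write(payload)
            count += 1
    except (IOError, OSError):
        return count

# 1 GiB bulk file inside the ramdisk
run(["dd", "if=/dev/zero", "of=" + BULK, "bs=1M", "count=1024"])

# mkfs.ntfs and mkfs.exfat need to be installed separately; some mkfs variants
# also need a force flag (e.g. -F) to format a plain file instead of a device
for fs in ["ext2", "ext3", "ext4", "vfat", "ntfs", "exfat"]:
    for file_size in [2 ** e for e in range(21)]:   # 1 byte .. 2^20 bytes
        run(["mkfs", "-t", fs, BULK])
        run(["mount", "-o", "loop", BULK, MOUNTPOINT])
        n = fill(MOUNTPOINT, file_size)
        print(fs, file_size, n, n * file_size)      # usable payload bytes
        run(["umount", MOUNTPOINT])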
Operations on the mounted bulk file are very fast, since the bulk file resides in a ramdisk. An experiment using this approach was conducted for a bulk file of 1 GiB. The considered file systems were ntfs, exfat, vfat, ext2, ext3 and ext4. File sizes were varied from 1 byte to 2^20 bytes. A plot summarizing the relative file system overhead for the different file sizes and file systems is shown below:
From this figure it can be seen that the file system overhead is excessive for small file sizes. ext2, ext3 and ext4 behave almost identically in terms of overhead. The minimal overhead in this experiment is observed for vfat at a file size of 65536 bytes per file. Strangely, exfat is consistently outperformed by ntfs.

The scripts that were used to conduct this experiment can be downloaded here.

2013-12-19

Creating Tagclouds with PyTagCloud

Tag clouds are a nice way to visualize textual information. They provide a colorful overview of the frequent terms of a text and they might also tell you something about its writing style.

For instance, the following is a tag cloud of the famous paper "Cramming more components onto integrated circuits" by Gordon Moore. The script that was used to create it can be downloaded here.
The script uses PyTagCloud, which gets most of the job done. Cloning the git repository, building and installing is straightforward. Do not forget to have pygame installed.

Nice tag clouds cannot be created fully automatically. To create beautiful tag clouds, natural language text usually needs a bit of preprocessing. The script provided above uses NLTK for stop word removal and for calculating term frequencies. Moreover, it might be necessary to manually adjust term frequencies or to remove certain terms entirely.
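The core of such a script might look roughly like this (a sketch from memory of the PyTagCloud and NLTK APIs, with a made-up input file name; the downloadable script differs in the details):

import nltk
from nltk.corpus import stopwords
from pytagcloud import create_tag_image, make_tags

# nltk.download('punkt') and nltk.download('stopwords') may be needed once
text = open("moore1965.txt").read().lower()
words = [w for w in nltk.word_tokenize(text)
         if w.isalpha() and w not in stopwords.words("english")]
freq = nltk.FreqDist(words)

# make_tags expects (term, count) pairs; maxsize limits the largest font size
tags = make_tags(freq.most_common(60), maxsize=100)
create_tag_image(tags, "cloud.png", size=(900, 600))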

PyTagCloud supports exporting the tag cloud to .png images. Exporting to HTML/CSS is also almost possible, but this feature seems a little broken at the time of this writing: PyTagCloud does not correctly export whether a term should be rotated or not, resulting in tag clouds with overlapping terms.

2013-12-06

Matching Bibtex and HTML

Recently I was given two very long lists of scientific publications: one as a BibTeX file and another as a table in an HTML file. Some of the publications in the BibTeX file were missing in the HTML table and the task was to find out which ones these were. An additional challenge was that both lists had been created manually by different people, and therefore author names, titles, etc. did not match character by character. Words with special characters, e.g. 'Jörg', would be spelled as 'J\"org' in BibTeX and 'Jörg' in the HTML table.

A simple script that helps with this tedious problem can be downloaded here. The script reads the .bib and the .html file and compares the title field of every BibTeX entry with every row in the HTML table. The package difflib is used to perform "approximate (sub)string matching": by some string comparison metric, it calculates a value from 0.0 (no match at all) to 1.0 (the identical string is contained as a substring).
Finally, the script generates a report that contains all the publications which are most probably missing.
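The matching step might be sketched like this (the exact difflib calls in the downloadable script may differ):

import difflib

def containment_score(needle, haystack):
    """Roughly 1.0 if needle occurs (almost) verbatim inside haystack."""
    needle, haystack = needle.lower(), haystack.lower()
    matcher = difflib.SequenceMatcher(None, needle, haystack)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / float(max(len(needle), 1))

# e.g. compare a BibTeX title against the text of one HTML table row
print(containment_score(
    "Cramming more components onto integrated circuits",
    "<td>G. Moore: Cramming more components onto integrated circuits, 1965</td>"))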

2013-11-21

Finding Research Gaps with Google Scholar

Imagine you want to do some very important research and you are desperate to identify a research gap in the current state of the art. Moreover, let's assume you have the intuition that a research gap can be found by combining two concepts from two different fields.
For instance, you might just have read two textbooks, one about freshwater aquarium fish and one about chemicals dissolved in water. Now you want to combine concepts from these two fields. To do this you need an estimate of 'how much' research has been done on the effect of chemical X on fish Y.

To get a rough estimate of 'how much' research has already been done, Google Scholar can be used. For every search, it gives you an approximate number of publications that match your search terms. With this you can build a matrix like the following:


The rows correspond to the keywords from one category (here: different types of fish) and the columns correspond to the other category (here: different chemicals). The color corresponds to the approximate number of publications on Google Scholar that contain both keywords.
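A sketch of the counting step, using Lynx as the downloadable script does (the query URL, the "About N results" phrasing and the example keywords are assumptions and may well have changed):

import re
import subprocess
try:
    from urllib.parse import quote_plus   # Python 3
except ImportError:
    from urllib import quote_plus         # Python 2

def scholar_hits(term_a, term_b):
    url = ("http://scholar.google.com/scholar?q=" +
           quote_plus('"%s" "%s"' % (term_a, term_b)))
    page = subprocess.check_output(["lynx", "-dump", url]).decode("utf-8", "ignore")
    match = re.search(r"About ([\d,.]+) results", page)
    return int(re.sub(r"\D", "", match.group(1))) if match else 0

fish = ["guppy", "neon tetra", "angelfish"]       # made-up row keywords
chemicals = ["copper", "nitrate", "chlorine"]     # made-up column keywords
matrix = [[scholar_hits(f, c) for c in chemicals] for f in fish]
print(matrix)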

Certainly you cannot gain ultimate wisdom from this. Two keywords might just be a nonsensical pairing, or the keywords might be used in many publications but in a context totally different from what you anticipated. However, it provides a quick and simple way to figure out whether you are entering a crowded field or not.

The script that was used to produce this plot can be downloaded here. The text-based web browser Lynx needs to be installed to run it.

2013-11-01

Sorting Papers by Keywords

Imagine you are given an inhumanly big electronic pile of publications to read and an early deadline. Even reading the abstracts will cost you a considerable amount of time and most of the papers are not at all related to what you are up to. How do you select the papers to read first?

A simple approach might be the following: Assume you can come up with a set of keywords with an accompanying quality factor. The quality factor indicates how much you are interested in a given keyword. A very important keyword might be given a quality factor of 1.0 and a more general keyword might have a quality factor of just 0.1.

With this set of keywords and quality factors it is quite easy to compute a score for every publication. For every paper and every keyword, the number of occurrences of the keyword is counted and the score of the document is increased according to the quality factor. The papers can then be sorted by score, which gives you the priority in which to read them. While this may not be a masterpiece of information retrieval, it is still a simple and quick way to find relevant information.
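Purely to illustrate the scoring rule, here is a minimal sketch in Python over already-extracted plain text (keywords, quality factors and file names are made up; the downloadable script below is written in R):

keywords = {"thermal": 1.0, "multicore": 0.8, "simulation": 0.1}  # keyword: quality factor

def score(text):
    text = text.lower()
    # every occurrence of a keyword adds its quality factor to the score
    return sum(text.count(k) * q for k, q in keywords.items())

papers = ["paper1.txt", "paper2.txt"]   # e.g. extracted beforehand via pdftotext
for path in sorted(papers, key=lambda p: score(open(p).read()), reverse=True):
    print(path, score(open(path).read()))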

A simple R script to create a table with paper scores can be downloaded here. The text mining package tm is used, which reads .pdf files conveniently.
The keyword/quality-factor pairs need to be provided in an extra file, as do the paths to the publications. The script creates a simple .html file for convenient viewing of the scored paper list.

2013-09-13

'Synchronizing' Podcasts with a Portable Device


Do you also like listening to podcasts? I do, and my usual use case is the following: First I discover a new podcast on the web. Then I use a program like gpodder or Miro to download all the episodes, which end up in one plain directory on the hard drive. Finally I want to 'synchronize' the episodes with a portable device.

A lot of the time the portable device will have less storage than the total size of the downloaded files, or it is not desirable to fill up the portable device with just one podcast. So 'synchronizing' should copy only some episodes at a time to the portable device and remember which episodes have been copied. After listening to/watching an episode, it can be deleted on the portable device to free some space for new episodes. New episodes should then be copied during the next synchronization. No capabilities beyond playback and deleting episodes are assumed on the side of the portable device. My use case is a bit similar to the use cases discussed here.

The following script implements this idea of synchronization. It does so by building a simple SQLite database which records whether an episode has already been copied to the portable device.
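The bookkeeping boils down to something like the following sketch (directory names and the number of episodes per run are made up; the downloadable script is more complete):

import os
import shutil
import sqlite3

PODCAST_DIR = "/home/user/podcasts/some_show"   # made-up paths
DEVICE_DIR = "/media/player/podcasts"
EPISODES_PER_SYNC = 5

db = sqlite3.connect(os.path.join(PODCAST_DIR, "sync.db"))
db.execute("CREATE TABLE IF NOT EXISTS copied (filename TEXT PRIMARY KEY)")
already = {row[0] for row in db.execute("SELECT filename FROM copied")}

new = sorted(f for f in os.listdir(PODCAST_DIR)
             if f.endswith((".mp3", ".ogg")) and f not in already)
for episode in new[:EPISODES_PER_SYNC]:
    shutil.copy(os.path.join(PODCAST_DIR, episode), DEVICE_DIR)
    db.execute("INSERT INTO copied VALUES (?)", (episode,))

db.commit()
db.close()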

2013-09-01

Remarks on Presentations

Here are some simple suggestions that I find useful for improving slides for (scientific) presentations. Many suggestions are subjective and no claim of exhaustiveness is made. Therefore feel free to differ (and comment).

The suggestions can be downloaded here as pdf and odp.

2013-08-22

Very Simple Thermal Simulation Of Tiled Multi-Core System

Let's assume a tiled multi-core architecture, e.g. a two-dimensional grid of n×m compute tiles. There might be several tasks running on every tile, which results in an increase of temperature of the corresponding tiles. Idle tiles, on the other hand, cool down.

The following video shows the result of a very simple thermal simulation with a 2x2 grid and a single task. The toy "thermal management" migrates the task to the tile with the minimum average temperature whenever the average temperature of the task's current tile exceeds a threshold.


The video was created using this script (Python+Numpy+Matplotlib+ffmpeg). A toy model for heat conduction is used. Although the temperature model is not physically accurate, the simulation might still be useful to quickly evaluate more sophisticated thermal management strategies.
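To give an idea of the flavor of such a toy model (this is not the script behind the video, and all constants are made up): each step relaxes the tile temperatures toward their common mean and toward an ambient temperature, the busy tile is heated, and the task is moved to the coolest tile once its own tile gets too hot.

import numpy as np

ALPHA, HEAT, COOL, AMBIENT, THRESHOLD = 0.2, 1.0, 0.02, 25.0, 70.0  # made-up constants

temp = np.full((2, 2), 40.0)      # tile temperatures in deg C
task_tile = (0, 0)

for step in range(1000):
    temp += ALPHA * (temp.mean() - temp)     # crude heat exchange between tiles
    temp -= COOL * (temp - AMBIENT)          # cooling toward ambient temperature
    temp[task_tile] += HEAT                  # the tile running the task heats up
    if temp[task_tile] > THRESHOLD:
        # migrate the task to the coolest tile
        task_tile = tuple(np.unravel_index(np.argmin(temp), temp.shape))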

2013-07-31

Ear training with Anki

This post was created in collaboration with Thomas Fischbach after a discussion about whether it would be possible for a person with only relative pitch to acquire perfect pitch by practicing identifying musical notes with a flash card program.

No conclusive answer to this question will be given within this post, but you may try for yourself using the following script. It can be used to create an Anki deck with sounds. Anki is an excellent flash card program (similar to Mnemosyne). csound, a software synthesizer, is used for sound generation. Installation instructions are provided with the script.
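The provided script uses csound; just to illustrate the sound generation part, a note can also be rendered with plain NumPy and the wave module (a sketch, using the usual equal-temperament formula):

import wave
import numpy as np

def note_to_freq(midi_note):
    # equal temperament, A4 (MIDI note 69) = 440 Hz
    return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)

def write_tone(filename, midi_note, seconds=1.0, rate=44100):
    t = np.arange(int(rate * seconds)) / float(rate)
    samples = 0.5 * np.sin(2 * np.pi * note_to_freq(midi_note) * t)
    w = wave.open(filename, "wb")
    w.setnchannels(1)
    w.setsampwidth(2)                     # 16 bit samples
    w.setframerate(rate)
    w.writeframes((samples * 32767).astype(np.int16).tobytes())
    w.close()

write_tone("c4.wav", 60)   # middle C, ready to be attached to an Anki card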


2013-07-01

Dark Frame Analysis

To improve noisy video shot in low-light conditions, it is useful to measure the distribution of the noise. Therefore I recorded several minutes of video in a setup where no light could enter the camera (see dark-frame subtraction). A script was used to simply sum up a large number of recorded frames. This 'noise accumulation frame' was used for further analysis.

The camera that was used is a consumer grade Panasonic HC-V500. Some strange effects will be unveiled further down and you might be interested in testing whether your camera shows these as well. The simple script that was used to create the plots can be downloaded here (requires Scipy, Numpy, Opencv and Matplotlib).
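The accumulation step itself is simple; a sketch (not the downloadable script, and the file name is made up):

import cv2
import numpy as np

cap = cv2.VideoCapture("dark.mp4")
acc = None
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame = frame.astype(np.float64)
    acc = frame if acc is None else acc + frame
cap.release()
np.save("noise_accumulation_frame.npy", acc)   # analyzed and plotted afterwards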

Unfortunately there is no 1:1 correspondence between the pixels that end up in the video file and the real pixels on the CMOS sensor. It is therefore unknown whether the effects seen further down are a result of the sensor or of the image processing in the camera, especially video compression. It would be preferable to take still images at the highest possible resolution in an automated way, but at least my video camera does not have this feature.

The distribution of the blue channel in the noise accumulation frame looks like this:
The other channels (red, green) are almost indistinguishable from this. The following plot shows the distribution of the red, green and blue channels combined:
It can be seen that a large part of the noise is approximately normally distributed. However, if you look at the noise accumulation frame directly, some structure is visible, which looks a bit like the electric field of a quadrupole:
Even though there is no direct correspondence between the pixels in the video file and the pixels on the CMOS sensor, "hot pixels" are still present (why?). They can easily be spotted by looking at details of the picture above. Keep in mind that mu is around 3.61 and sigma is around 0.47, so all values above 5 should be extremely unlikely. The plots below simply show the same as the plot above, subdivided into 4 parts:

2013-06-13

Octave vs Python

"Don't do loops in Octave." is a well known truth. But sometimes loops are just too handy  or cannot be avoided at all. I was curious whether there is a difference in execution time of loops in Python and Octave since both are interpreted languages.
tic;
for i=1:100
  for j=1:100
    for k=1:100
      vain1 = i^2+j^2+k^2;
      vain2 = i^2+j^2+k^2;
      ...
      vain10 = i^2+j^2+k^2;
    endfor
  endfor
endfor
t = toc;
disp(num2str(t));
The two scripts used for the comparison do nothing but execute three nested loops with 100 iterations each, doing some useless computation inside the loops. One script is in Python and one is in Octave; the Octave script is listed above. Octave version 3.6.4 and Python version 2.7.3 were used. The results are devastating.

user@machine:~/scripts/python_vs_octave$ octave loops.m
...
39.48
...
user@machine:~/scripts/python_vs_octave$ python loops.py
3.10390710831
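
The Python counterpart (loops.py) presumably looks roughly like the following sketch; the downloadable script may differ in its details.

import time

start = time.time()
for i in range(1, 101):
    for j in range(1, 101):
        for k in range(1, 101):
            vain1 = i ** 2 + j ** 2 + k ** 2
            vain2 = i ** 2 + j ** 2 + k ** 2
            # ... vain3 to vain9 analogous ...
            vain10 = i ** 2 + j ** 2 + k ** 2
print(time.time() - start)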


Even in this simple example, the Octave script takes more than 10 times as long as the equivalent Python script. The difference becomes even bigger if the amount of computation inside the loops is increased. Since Python also comes with sophisticated matrix processing capabilities (NumPy, SciPy), and if severe performance degradation for more sophisticated numerical analysis is not acceptable, the proverb above can simply be shortened to "Don't do Octave."


PS: Another popular data analysis tool is R. On the same machine, using R version 2.14.1, the execution time of the equivalent script was 18.35 s, which lies between the execution times of the Python and Octave scripts.

2013-06-02

Characterize CPU Cooling

Did you ever want to find out how well your PC's cooling system does under stress? Maybe you have just bought or built a PC and you want to find out whether it keeps sufficiently cool.

The temperatures of the individual cores can be obtained easily with the lm-sensors package. To put a little stress on the CPU, cpuburn is very helpful.

Here is a little Python script that records the temperatures and frequencies of the CPU over time and another Octave script that visualizes the temperature data. The package cpufrequtils is required to obtain the CPU frequencies. For validation purposes, mpstat is used to record the CPU utilization. If you benchmark your CPU with a program that does not fully utilize the CPU during certain periods of time, this data might be helpful to correct for strange effects in the temperature curve.

Assuming a CPU with 2 cores and using burnP6 to create some stress, the script might be executed as follows:
$ python rec_sensor_log 5000 sensors_log.csv & burnP6 & burnP6

The first parameter is the time between two temperature samples in milliseconds and the second is the name of the output file. Depending on the number of available cores in the CPU, several instances of burnP6 (or an equivalent) should be started. The script parses the output from lm-sensors in a not very sophisticated way. If you have more or fewer cores, you will probably have to make modifications to the script.
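The parsing boils down to a crude regular expression over the output of the sensors command; a sketch (the exact output format differs between machines, so the pattern below is only an example):

import re
import subprocess
import time

def read_core_temps():
    out = subprocess.check_output(["sensors"]).decode("utf-8", "ignore")
    # core lines look roughly like: "Core 0:  +54.0 C  (high = +80.0 C, ...)"
    return [float(t) for t in re.findall(r"Core \d+:\s+\+?([\d.]+)", out)]

log = open("sensors_log.csv", "w")
while True:
    log.write("%f,%s\n" % (time.time(), ",".join(map(str, read_core_temps()))))
    log.flush()
    time.sleep(5.0)       # corresponds to a sampling interval of 5000 ms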

The visualization script (vis_log.m) will prompt for the input filename. Frequency scaling might occur at high temperatures. If this is detected, a vertical line is drawn into the plot.

Here is a sample plot that was created during a quick test on my laptop. Under stress the temperature rises up to approx. 70°C and the fan spins faster to keep the temperature at about this level. As soon as the stress is removed, the temperature quickly falls below 50°C.

2013-03-08

Remarks on Latex Spell Checking

This post focuses on some remarks on how to improve the language of a (bigger) Latex document. A lot of the time, technical issues (compiling, fixing syntax errors, adjusting images, tables, etc.) and the Latex typesetting itself ('How do I ... in Latex?') draw a lot of attention away from the actual content of the document. I think that ordinary word processors like LibreOffice Writer have a significant advantage over Latex here, even though the result will not be as beautiful. This post discusses some techniques I found helpful to mitigate this problem.

You probably want to improve the quality of a document in several stages. There is (or should be) spell checking happening on an everyday basis, and after certain periods of time you will want to do bigger reviews to improve the overall consistency of the document.

The first thing would be to use the editor's integrated spell checking capabilities. In Emacs I found Flyspell mode quite convenient (M-x flyspell-mode). Otherwise, ispell can be run from within Emacs (M-x ispell-buffer). The downside of this is that it will generate lots of false positives if you have a more technical document with lots of acronyms and technical expressions. It might therefore be quite distracting to have lots of words marked on the screen.

Alternatively you might want to generate spelling reports for your whole document once in a while. The following short script can be used to generate a spelling report for several .tex files using Hunspell.
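The script essentially runs Hunspell in TeX mode over every file and collects the unknown words, roughly like this sketch (flags as documented in the Hunspell man page; the chapters/ path is only an example):

import glob
import subprocess

report = open("spelling_report.txt", "w")
for tex in sorted(glob.glob("chapters/*.tex")):
    # -t: TeX/LaTeX input mode, -l: only list misspelled words, -d: dictionary
    out = subprocess.check_output(["hunspell", "-t", "-l", "-d", "en_US", tex])
    words = sorted(set(out.decode("utf-8", "ignore").split()))
    report.write("%s: %s\n" % (tex, ", ".join(words)))
report.close()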

Usually bigger Latex documents will be spread over many different files. Finding some string in several files and opening every file that contains the string can be quickly accomplished by issuing:
$> find . -name "*.tex" | xargs grep "some word" -isl | xargs emacs

More sophisticated spell checking and grammar checking can be done using LanguageTool. Unfortunately it cannot be used with Latex directly. detex can be used to remove TeX commands from Latex files. This is a bit tedious because it gives you lots of false positives, but you will probably discover some new language mistakes this way.
$> find ./chapters/ -name "*.tex" -exec detex -n {} \; >> doc_detexed.txt
$> java -jar LanguageTool.jar -l en-US -c utf-8 doc_detexed.txt > doc_languagetool_report.txt

Microsoft Office's spell checking and grammar checking are superior to the tools mentioned above. To be able to open your Latex document in Microsoft Word, the tool latex2rtf can be used. Instead of compiling your document to PDF, it generates an .rtf file. This is also an alternative to using detex if you want to use LibreOffice and LanguageTool. If you do not have access to a Microsoft Office installation, GDocs might also be an alternative.

Xournal allows you to annotate PDF documents. This is handy because directly writing annotations into the PDF allows you to really focus on the content, and it saves lots of paper if you would otherwise frequently print intermediate stages of your document.

2013-03-02

Print PDFs by Creating Multiple Print Jobs

Printing long PDF documents is sometimes tedious, especially if you have a dull network printer. You probably know the case where, for mysterious reasons, the printer just does nothing for a very long time and thereafter seems to have forgotten about the actual print job.

Sometimes printing can still be accomplished by sending only a few pages of the PDF document in distinct print jobs. Doing this manually is also tedious, so here is a simple Python script that breaks up a PDF document into many PDFs with just 2 pages each (using pdftk) and spools them via lpr (the package cups-pdf is required for this).
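The core of the script is roughly the following (a sketch; the input file name and the page count are made up):

import subprocess

INPUT = "document.pdf"
PAGES = 240            # total number of pages, e.g. taken from pdfinfo
CHUNK = 2

for first in range(1, PAGES + 1, CHUNK):
    last = min(first + CHUNK - 1, PAGES)
    part = "part_%04d.pdf" % first
    # cut out two pages with pdftk and spool them as an individual print job
    subprocess.check_call(["pdftk", INPUT, "cat", "%d-%d" % (first, last),
                           "output", part])
    subprocess.check_call(["lpr", part])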

2013-02-07

Detect Printer Steganography

Recently I learned that many color laser printers print an (almost) invisible pattern of tiny yellow dots on every page. This is called printer steganography and was put in place to prevent forgery. However, it is also a privacy issue, as outlined here, because every page contains some additional information that most people are not aware of.

I was curious whether it was possible to visualize the pattern. In the article mentioned above, a test setup with a microscope and a blue LED is described. Alternatively, I found that it was sufficient to use a simple scanner and do some post-processing with GIMP. Many scanners can scan with equal or higher resolutions and color depths than the printer can print.

The test page was printed at 600 dpi and also scanned at this resolution. The color depth was set to 16 bit. To make the little dots visible, an edge detection filter (in GIMP 2.6 under Filters, Edge-Detect, Edge...) was used. In the resulting image, the dots were best visible in the blue channel. The other two channels were removed as described here, by adding an entirely green and an entirely red layer and setting the blending mode to subtract.
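If you prefer to script the post-processing instead of using GIMP, something along these lines should give a comparable result (a sketch with OpenCV rather than the GIMP steps described above; the file name is made up):

import cv2
import numpy as np

scan = cv2.imread("scan_600dpi.png")
blue = scan[:, :, 0]                       # OpenCV stores channels as B, G, R
edges = cv2.Laplacian(blue, cv2.CV_64F)    # edge detection, similar in spirit to GIMP's filter
dots = np.clip(np.abs(edges) * 8, 0, 255).astype(np.uint8)
cv2.imwrite("dots_enhanced.png", dots)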

An excerpt of the empty part of the page can be seen here:
The pattern is repeated over the entire page. Interpreting the pattern is not so easy. Maybe it contains information like the printer's serial number, the time, etc., but more effort would be required to figure that out exactly.

2013-01-27

Typesetting Word-By-Word Translations

Recently I was asked how to typeset documents with a word-by-word translation like the following:

Welche Farbe hat der gelbe Bus?
Which color has the yellow bus?

As I have learned, linguists have the fancy word Interlinear Gloss for this. There are several Latex packages available for this purpose. Among them is gb4e, which I decided to use.

For simplicity it is assumed that the text to be 'glossed' is provided as a plain text file with sentences delimited by '.  ', '?  ' or '!  ' (two spaces) and words separated by single spaces. The implementation of a small script that creates a document with nicely aligned words is very straightforward. The dictionary needs to be provided as a .csv file.
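The generation step essentially turns every sentence and its dictionary lookups into a gb4e example; a sketch (gb4e commands as I recall them from its documentation, with the dictionary hard-coded here instead of read from the .csv file):

dictionary = {"Welche": "Which", "Farbe": "color", "hat": "has",
              "der": "the", "gelbe": "yellow", "Bus?": "bus?"}   # normally from the .csv file

sentence = "Welche Farbe hat der gelbe Bus?"
words = sentence.split(" ")
gloss = [dictionary.get(w, "???") for w in words]    # '???' marks missing entries

print("\\begin{exe}")
print("\\ex")
print("\\gll " + " ".join(words) + r" \\")
print("     " + " ".join(gloss) + r" \\")
print("\\glt `" + " ".join(gloss) + "'   % free translation, to be edited by hand")
print("\\end{exe}")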

Unfortunately the task cannot be fully automated. Breaking text into sentences requires some knowledge about the specific language, and so does breaking sentences into words. Ideally the dictionary should also have some capability to detect inflections, etc. The script just generates a Latex file that can be modified manually.

The script can be downloaded here.