Wednesday, November 27, 2013

Mapping DNA nucleotides to numbers... the cool way!

So, there are four DNA 'characters', or nucleotides, or bases: A, C, G and T.  Says the Wikipedia: '"A" stands for Adenine and pairs with the "T", which stands for Thymine. The "C" stands for Cytosine and pairs with the "G", Guanine'.

Since there are only four nucleotides, it is wasteful to spend an entire byte (8 bits) on storing these 2 bits of information. And many of the more karmic tools do indeed 'compress' DNA this way, either for storage or for rapid indexing.

When we implemented this in Antonie, we used A=0, C=1, G=2, T=3, which makes some kind of lexicographical sense. However, other software uses a different mapping: A=0, C=1, T=2, G=3. I wondered if this had some kind of biological background, but it doesn't: it is all computer geekery!

Behold:
A = ASCII 65, binary 1000001  -> & 6 -> 00x -> 0 
C = ASCII 67, binary 1000011  -> & 6 -> 01x -> 1 
G = ASCII 71, binary 1000111  -> & 6 -> 11x -> 3 
T = ASCII 84, binary 1010100  -> & 6 -> 10x -> 2
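This is how many tools in fact map: (c&6)>>1. Since 6 is binary 110, this keeps only bits 1 and 2 of the ASCII code and shifts them down into a 2-bit number. It has thus become some kind of standard.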
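
For the curious, here is a minimal sketch of that mapping in C. Only the (c&6)>>1 expression comes from the tools mentioned above; the function names and the packing order (first base in the low bits) are my own assumptions, purely for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Map an ASCII nucleotide to 2 bits: A=0, C=1, T=2, G=3.
   Only bits 1 and 2 of the character are used, so lowercase a/c/t/g
   give the same codes. Any other byte (for example 'N') also yields
   some value in 0..3, so input must be validated separately. */
unsigned nucleotide_to_2bit(char c)
{
    return ((unsigned char)c & 6u) >> 1;
}

/* Hypothetical helper: inverse lookup, indexed by the 2-bit code. */
char twobit_to_nucleotide(unsigned code)
{
    return "ACTG"[code & 3u];
}

/* Hypothetical helper: pack up to 4 bases into one byte,
   first base in the lowest two bits (an arbitrary choice). */
uint8_t pack4(const char *bases, size_t n)
{
    uint8_t out = 0;
    for (size_t i = 0; i < n && i < 4; ++i)
        out |= (uint8_t)(nucleotide_to_2bit(bases[i]) << (2 * i));
    return out;
}
```

With this, pack4("ACGT", 4) stores four bases in a single byte instead of four.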

So now you know. 


Wednesday, November 13, 2013

Getting the correct 'git hash' in your binaries w/o needless recompiling or linking

When processing bug reports, or when people doubt the output of your tools, it is tremendously helpful to know the provenance of the binaries. One way of doing this is to embed the git hash in your code.

A useful hash can be generated using:
$ git describe --always --dirty=+
This outputs something like '4120f32+', where the '+' means there have been local changes with respect to the commit that can be identified with '4120f32':
$ git show 4120f32
commit 4120f32ef7b684eb1ff42d136e37e8733f9811e1
Author: bert hubert
Date:   Wed Nov 13 12:46:43 2013 +0100
    rename git-version wording to git-hash, move the hash to a .o file so we only need to relink on a change
What we want is this:
$ antonie --version
antonie  version: g4120f32+ 
To achieve this, we first need to convince our Makefile to generate something that includes the hash from git-describe. Second, if this output changes, all binaries must be updated to reflect it. Finally, if the git hash did not change, we should not get spurious rebuilds.

Many projects, including PowerDNS, had given up on achieving the last goal. Running 'make' will always relink PowerDNS now, and even recompile a small file, even if nothing changed. And this hurts our sensibilities.

For Antonie, I got stuck on a boring issue today, and decided I needed to solve the git-hash-embedding issue once and for all instead.

As the first component, enter update-git-hash-if-necessary, which will update (or create) githash.h to the latest git hash, but only if it is different from what was in there already - effectively preserving the old timestamp if there were no changes.
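
The heart of that component is a compare-before-write step. The real update-git-hash-if-necessary is a separate helper whose contents are not reproduced here; the sketch below only illustrates the idea in C. The function name write_if_changed, the 256-byte limit and the '#define GIT_HASH "..."' line format are assumptions of mine.

```c
#include <stdio.h>
#include <string.h>

/* Write `content` to `path` only if the file does not already hold
   exactly that content. Returns 1 if the file was (re)written,
   0 if it was left untouched, -1 on error.
   Assumes the content is short (under 256 bytes), which holds for a
   one-line header such as: #define GIT_HASH "4120f32+"            */
int write_if_changed(const char *path, const char *content)
{
    char existing[256];
    size_t want = strlen(content);

    FILE *fp = fopen(path, "rb");
    if (fp) {
        size_t got = fread(existing, 1, sizeof existing, fp);
        fclose(fp);
        /* got < sizeof existing means we saw the whole file */
        if (want < sizeof existing && got == want &&
            memcmp(existing, content, want) == 0)
            return 0;   /* identical: keep the file and its timestamp */
    }

    fp = fopen(path, "wb");
    if (!fp)
        return -1;
    size_t put = fwrite(content, 1, want, fp);
    if (fclose(fp) != 0 || put != want)
        return -1;
    return 1;
}
```

Because the file is never reopened for writing when the hash is unchanged, its modification time stays put and make sees nothing to rebuild.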

Secondly, we need to convince Make to always run update-git-hash-if-necessary before doing anything else. This can be achieved by adding the following to the Makefile:
CHEAT_ARG := $(shell ./update-git-hash-if-necessary)
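Even though we never use CHEAT_ARG, the ':=' assignment is expanded immediately while make reads the Makefile, so the $(shell ...) call runs on every invocation, whatever the target. This ensures that githash.h is updated if required before make decides what to rebuild. This clever trick was found here.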

To complete the story, create githash.c which #includes githash.h and make it define a global variable, possibly like this: 'const char* g_gitHash = GIT_HASH;'. Now include githash.o as an object for all programs needing access to the hash.

Finally, add 'extern const char* g_gitHash;' somewhere so the programs can actually see it. Note that you can't use githash.h for that, since that would trigger recompilation where relinking would suffice.
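
Put together, the pieces might look roughly like the sketch below. It is squeezed into one listing so it compiles on its own; in the real layout the three parts live in githash.h, githash.c and the program source respectively. The contents of githash.h, the helper version_string and the exact output format (including the leading 'g') are assumptions for illustration.

```c
#include <stdio.h>

/* --- githash.h (generated; exact contents are an assumption) --- */
#define GIT_HASH "4120f32+"

/* --- githash.c: the only source file that includes githash.h --- */
const char *g_gitHash = GIT_HASH;

/* --- program source: declares the symbol by hand and never
       includes githash.h, so a new hash only forces a relink --- */
extern const char *g_gitHash;

/* Hypothetical helper that formats a --version line. */
const char *version_string(const char *program)
{
    static char buf[128];
    snprintf(buf, sizeof buf, "%s version: g%s", program, g_gitHash);
    return buf;
}
```

The point of the split is that only githash.o depends on githash.h; every other object file refers to g_gitHash through the linker.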

If you now run 'make' or even 'make antonie', any change in the git hash will trigger a recompile of githash.o, followed by rapid relinking. If there was no change, nothing will appear to have happened.

Please let me know if you find ways to improve on the trick above!