Saturday, February 23, 2013

A horse, a donkey, a cow: a genetic diff

So, continuing the series 'if you are a hammer, everything looks like a nail', here we'll bridge the worlds of genetic sequencing and programming and show the diff between a horse, a cow and a donkey.

DNA is a lot like computer code, except that it is neither an imperative nor a functional programming language. DNA describes which amino acids end up in proteins, and these proteins have shapes and chemical properties which make them interact in a way that we call 'life'. For more context, see my earlier article 'DNA as seen through the eyes of a computer programmer'.

Like code, DNA evolves in the course of the development of life. Some code never changes, because it is so vital and tricky that any change immediately leads to a non-functioning organism. Other code is so uncritical or unimportant that it can (and does) change at a high clip, leading to many useful or perhaps detrimental mutations.

In between are pieces of DNA that are very consistent within a species, but show remarkable change between species. Such code is used to fingerprint organisms, sometimes alive, but mostly dead. Such a fingerprint (or better, a barcode) can quickly and reliably tell whether we are eating horse, donkey or beef.

Huge databases of such barcodes have been established, one of which (BOLDSystems) can be queried online. This effort is called 'the barcode of life', and for animals it has been standardized on the mitochondrial CO-1 gene, which encodes part of the machinery of our aerobic metabolism, powering our cells.

So, what does this look like? Behold, the diff between a horse and a donkey:


As we can see, most DNA is identical, with variations mostly impacting individual nucleotides. In addition, there is one longer stretch that is different.
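
To put a number on 'mostly identical', one could count the positions at which two equal-length sequences differ. The following is a minimal sketch and not part of the original script; the function name mismatch_positions and the two short fragments are made up for illustration and are not real barcode data.

```python
def mismatch_positions(seq_a, seq_b):
    """Return the 0-based positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


if __name__ == '__main__':
    # Made-up fragments for illustration only, not real CO-1 data.
    first = "ACGTACGTAC"
    second = "ACGTTCGTAA"
    positions = mismatch_positions(first, second)
    print(len(positions), "of", len(first), "positions differ:", positions)
```

This only works for sequences of the same length, which is the same restriction the line-by-line diff relies on.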

Now let's make a very current and relevant comparison: a horse and a cow:


Note that we still have lots of single mutations, but we also see a whole line that is mostly different! Clearly, a horse is not a cow. No matter how well you cook it!

If you want to make your own comparison, first look up the scientific (Latin) name of the desired animal. Next, look that name up in the barcode database, and from the list of sequences, pick two with the same CO-1 length (some barcodes contain more DNA than others). Save each sequence to its own text file.

Then use this tiny Python script to generate the nice HTML diffs you see above:

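import sys, difflib, optparse

def main():
    usage = "usage: %prog [options] fromfile tofile"
    parser = optparse.OptionParser(usage)
    (options, args) = parser.parse_args()

    # Exactly two file names are required; otherwise print usage and exit.
    if len(args) != 2:
        parser.error("expected exactly two files: fromfile tofile")
    fromfile, tofile = args

    # Read both sequence files as lists of lines; 'with' closes the files.
    with open(fromfile) as f:
        one = f.readlines()
    with open(tofile) as f:
        theother = f.readlines()

    # make_file() returns a complete HTML page as one string.
    d = difflib.HtmlDiff()
    result = d.make_file(one, theother, fromfile, tofile)
    sys.stdout.write(result)

if __name__ == '__main__':
    main()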
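
Because difflib compares the two files line by line, the diff is most readable when both sequences are broken into lines of the same width. If a sequence comes as one long line, or with irregular line breaks, a small helper along these lines could normalize it first. This is a sketch under assumptions: the name wrap_sequence and the width of 60 are my own choices, not something the database or the script above prescribes.

```python
def wrap_sequence(sequence, width=60):
    """Remove all whitespace, upper-case, and split into fixed-width lines."""
    cleaned = "".join(sequence.split()).upper()
    return [cleaned[i:i + width] + "\n" for i in range(0, len(cleaned), width)]


if __name__ == '__main__':
    # Made-up fragment for illustration only, not real CO-1 data.
    raw = "acgtacgt\nacgtac gtacgt"
    for line in wrap_sequence(raw, width=8):
        print(line, end="")
```

Writing the returned lines to a file for each animal gives two inputs the diff script can compare position by position.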

Good luck!