Pythonic Evolution - Part 1

Date

This post is in different parts. The fact is that it requires quite a lot of time investment, something I really don't have in this period.

A long time ago I wanted to play with the concept of genetic code, and how it represents nothing but a language to code for molecular machines. As the Jacquard loom, it's nothing but a chemical punchcard who gets executed by the ribosome loom, and loomed into a protein. The fact behind evolution at this level is that if a small mutation happen in the punchcard by accident, and the resulting protein turns out to provide a competitive advantage for survival and reproduction, then this wrong punchcard will naturally exploit this advantage by becoming the predominant form. Even a very tiny advantage is sufficient, in the span of million of years, to reach fixation (the old code is completely replaced by the new one).

The math is simple. Here is a program that does it

normal = 500
mutated = 1

normal_survive_prob = 0.90
mutated_survive_prob = 0.91

for generation in xrange(1,1000):
    normal -= int(normal * (1-normal_survive_prob))
    mutated -= int(mutated*(1-mutated_survive_prob))

    normal *= 2
    mutated *= 2

    total = normal + mutated
    print "Generation %d -- normal : %d %%  mutated : %d %% " % \
          (generation, 100*normal/total, 100*mutated/total)

In this program (which has some simplifications, but the concept is there), we have a population of bacteria. We start with 500 normal bacteria, and only one mutant. This mutant has an advantage: its mutation makes him more likely to survive (for example, to an antibiotic, or to UV light from the Sun, or to oxygen). In this case the probability that the standard bacterium can survive is 90 %. In other words, if you have a large quantity of bacteria, some of these will die due to UV light. If you divide the number of survived bacteria by the number of bacteria you had, you will find out that 90 out of 100 survived, and the other 10 out of 100 died. The remaining 90 % do what bacteria do all the time: reproduce by splitting in two, meaning that if 40 bacteria survived the UV light and they reproduce, you will get 80 bacteria. In turn, some of them will die, and other will survive, and reproduce, and this goes on and on.The mutated bacteria has an advantage: its mutation makes it more resistent to UV light, so it has a slightly higher probability of surviving: 91 %

If you run the program, you will find out thatat generation 1, almost everything is made out of normal bacteria

Generation 1 -- normal : 99 %  mutated : 0 %

At generation 100, you see that the mutated ones start to be visible

Generation 107 -- normal : 99 %  mutated : 1 %

At generation 365, the mutated ones can start a political party

Generation 365 -- normal : 85 %  mutated : 14 %

After 525 generations, they are fifty/fifty:

Generation 525 -- normal : 49 %  mutated : 50 %

At generation 700, the mutated ones are the large majority

Generation 700 -- normal : 12 %  mutated : 87 %

And finally, at generation 1000 the normal ones are basically disappeared, and the mutated ones conquered their environmental niche:

Generation 1000 -- normal : 0 %  mutated : 99 %

Now, if you consider that a bacterial generation happens in days, or even hours, you see how it does not take much for bacteria to evolve. If our bacterium here divided every day, we would have a total fixation after just three years. If you assume that the difference in probability is 0.01 % (instead of 1% as in this example) you will get the same result, it will just take more time (and even million of years is a blink of an eye against the age of the Earth).

The best part about evolution is that it works for everything behaving with the same modality. What you need is:

  1. An imperfect replicator, an entity able to produce a copy of itself, for example the DNA of our bacteria in the example above. The replication mechanism must have a certain degree of imperfection, so that some chance for occasional mutation must be possible, and consequently variability.
  2. Action of the replicator on the environment. For example, The DNA molecule, through a complex biological setup is able to produce proteins which, according to the laws of chemistry, influence the external world by digesting a metabolite and obtaining energy, or fighting an aggression to survive (to UV light, in our example) and reproduce.
  3. Variability of the replicator "phenotype" (which means the "final visible result" of the replicator). Small modifications (accidental or not) to the replicator will change its interaction with the external environment (for example, by granting him to survive better under UV light)
  4. Environmental conditions: the environment will propose conditions, for example availabe metabolites, temperature, pressure, UV light irradiation and these conditions could also change as time passes..
  5. Selective pressure: the interaction between the environmental conditions and the replicator phenotype will create a differentiation between replicators. Those better coping with the conditions will have higher reproductive chance than the others, producing a drift toward better replicators (in our case, the bacteria were under selective pressure because the UV light was killing them, but the mutated ones were more likely to survive)

A famous example of evolution at work is antibiotic resistance. The selective pressure of a poisonous substance (an antibiotic) kills the organisms not able to neutralize it, while those less sensitive to it survive and reproduce, creating more copies of so-called antibiotic resistant bacteria. Very soon, the antibiotic becomes ineffective against this new breed, and a new one must be invented. This is also the reason why antibiotics should be used sparingly, and always up to the end of the treatment, so to be sure that even those bacteria that are more resistant received a lethal dose and are therefore unable to start a resistant strain.

But the funny part of the evolutive setup is that it works no matter what the replicator and environment are....

Continued!


Change separator in gnuplot

Date

Gnuplot is a great software. Very useful for easily plotting datapoints without fuss and complicated interface. However, the default accepted format is a table defined by space-separated values. If you have comma separated values, then you have to change the delimiter. But how?

This command

set datafile separator ","

will do the trick.


Hard disk going ballistic on OSX ?

Date

Sometimes it happens to me that my hard drive starts being accessed in a very aggressive and noisy way. I found a couple of commands to see who is responsible:

sudo fs_usage -f filesys

and also

sudo iotop

In the last one, you could get error messages like

dtrace: error on enabled probe ID ....

but if you check the whole output, you will see the bytes moved by some processes.

In general, I saw that these frequent disk access are due to the mds process, who is the responsible for updating the Spotlight indexes. The process will last some minute.


Why is most science programming done in fortran?

Date

I found this interesting question in the referral logs on ForTheScience. Why is most science programming done in Fortran (77 or 95)?

After some thought, I can fill the following reasons:

  • Fortran is simple to understand. Not the code itself maybe, but the style. The learning curve for doing something in Fortran is very low, and after you manage the basic concepts, read and write, you can be proficient enough to write even complex computational application. Most scientists are not programmers, and they would be overwhelmed with the intricacies of C and C++. I would strongly prefer a non-programmer to code Fortran than, say, perl. In other words, Fortran just fits the need computational scientists have: read numbers, do a calculation, write the result, and everything can be taught to a profane in a standard semester course.
  • Fortran is computationally efficient. I will not go into the age old debate about "is Fortran really faster than C?", for which I have a rather articulate opinion I won't delve in. Instead, I will just present the fact that is indeed one of the languages whose compilers and computational libraries have been beaten to death for computational efficiency, being their marketing value.
  • Fortran is old. This has the effect of producing a huge amount of legacy code that have to be maintained or reused. Rewriting this code is normally not possible: who should do it? Even if this task requires just one man month in three years of Ph.D. contract, it will probably not produce a scientific publication, so nobody want the task. Moreover, the rewriting will likely ruin interfacing with other codes, programs and libraries, as well as the group knowledge (if any) of the code, so this move will almost always be opposed.
  • Fortran has a slow release cycle. Backward compatibility has been kept into account. Knowing that the code you wrote in the 80s will still compile today (or eventually you will have to add some compiler switch) make everyone happier. I am not sure you can run a perl or python program written 10 years ago and have it running today. I have no experience with C and C++ and old codes, so I am not completely sure about this point, and I welcome being proven wrong.

There are for sure many other reasons, but I won't go further.

Let's see instead why and when Fortran should not be used. My head goes to python for most comparison:

  • Fortran has very reduced expressivity. You need a lot of code, often redundant, to code something. In some cases, you need to put stuff in temporary variables to pass the information to a subroutine, introducing more variables (difficult to maintain) or recycling old ones (bad).
  • Fortran (77, things are better in 95) makes very difficult to perform modular programming. Namespace pollution can be dramatic on large programs, especially considering the short identifier limit (not an issue on modern compilers, but in violation of the standard). Fortran 95 modules are a step ahead, but you can't group modules into submodules.
  • Fortran (95) does not allow storage of function or subroutine pointers, making callback-oriented programming very hard.
  • Fortran (95) does not allow inheritance. Smart workarounds exist, but they require some skills and the base class develops a dependency towards derived ones.
  • Fortran has no polymorphism nor templating, making very painful to work on generic data types. Again, workarounds exist, but they require external tools.
  • Fortran makes very difficult to keep loose coupling. A very strong dependency network arises. For large programs, the number of modules USEd (or the amount of code in them) may increase considerably. Compare with python, where a module does not need to be imported if you have to call a method on an object inside that module, or with C++, where you have forward declarations.
  • Fortran (95) does not have object orientation, it is very difficult, if not impossible, to use traditional design patterns.
  • Fortran does not have exceptions (F2003 will, but not custom ones, as far as I know).
  • Fortran has IMPLICIT. (Edit: yes, it has IMPLICIT NONE, but the existence of implicit declaration is unfortunately abused still today. It should have been deprecated.)
  • Fortran (77) does not have aggregated data types and dynamic memory allocation (in the standard)
  • Fortran strings are not dynamic in length (unless, if I remember correctly, if you do very weird hacks). A string of Length 100 and another of Length 101 are like being two different datatypes (say an int and a string), unless you use the LEN=* in routine calls, but you cannot make more room to an allocated string if needed.
  • Fortran did not have clear interfacing with C, and every compiler did as it pleased. Apparently this is no longer true with the introduction of BIND.
  • No effective tools exist for documenting the code or easily perform Test Driven Development.
  • Libraries out there are targeted at computational tasks. I haven't seen any good library for GUI programming, networking, db access, and even if you could, would you ?
  • Fortran is full of unusual pitfalls for anyone used to a different language. While pitfalls exist in any language, Fortran has pitfalls coming from compatibility towards older improper use (e.g. automatic SAVE in assignment at declaration). In some other cases though, pitfalls are due to the highly optimized nature of the language. These pitfalls are in general a strong deviation from the behavior of any other language using the similar constructs.
  • Most of the code out there uses old code. Even if the language progressed, you will still find ancient remains of code written when the main writing method was a stick on a clay table. This code will most likely be impossible to refactor.

This is just out of the top of my head, and I am sure there is a lot more. In any case, Fortran 2003 seems to alleviate most of the problems outlined above. In particular, it will have object oriented programming, and function/subroutine pointers. A considerable step forward.

Please note that I wasn't a Fortran fan, but with time I became tolerant to it. It should be used sparingly and only where the need exist, or if a real reason exists: use high level programming languages with good expressivity first, such as Python. Then eventually optimize where needed, sometimes with a drop of Fortran, but only if you really, really (yes, I mean really) need it.


What's the point of inheritance in python ?

Date

Python is a fascinating language. It makes you think. Sometimes it can destroy your beliefs. Sometimes it makes you understand new concepts in a natural way, specifically Generic Programming.

My background is in statically typed languages. I came from C++, camped there for a while, then moved to python. I also explored Java recently.

I would like to present my case by remembering the three pillars of object oriented programming:

  1. Encapsulation
  2. Inheritance
  3. Polymorphism

In particular, I would like to go into the details of two of them: inheritance and polymorphism.

Inheritance is a strategy that allows you to define a new class (hereafter called NEW) as derived of an old one (OLD). The new class gets access to all methods and attributes of the old one, and according to the substitution principle, you can replace objects of type OLD with object of type NEW without altering the properties of the program. For example, suppose you have the following situation

#include <iostream>

class Animal {
public:
    virtual void speak() = 0;
};

class Dog : public Animal {
    void speak() { std::cout << "woff!" <<std::endl; }
};

class Cat : public Animal {
    void speak() { std::cout << "meow!" <<std::endl; }
};

void makeSpeak(Animal &a) {
    a.speak();
}

int main() {
    Dog d;
    Cat c;
    makeSpeak(d);
    makeSpeak(c);
}

As you can see, makeSpeak is a routine that accepts a generic Animal object. In this case, Animal is quite similar to a java interface, as it contains only a pure virtual method. makeSpeak does not know the nature of the Animal it gets passed. It just sends it the signal "speak" and leaves the late binding to take care of which method to call: either Cat::speak() or Dog::speak(). This means that, as far as makeSpeak is concerned, the knowledge of which subclass is actually passed is irrelevant.

But what about python? Let's see the code for the same case in python. Please note that I try to be as similar as possible to the C++ case for a moment:

class Animal(object):
    def speak(self):
        raise NotImplementedError()

class Dog(Animal):
    def speak(self):
        print "woff!"

class Cat(Animal):
    def speak(self):
        print "meow"

def makeSpeak(a):
    a.speak()

d=Dog()
c=Cat()
makeSpeak(d)
makeSpeak(c)

Now, in this example you see the same strategy. You use inheritance to leverage the hierarchical concept of both Dogs and Cats being Animals. Here is the nice thing: in python, there's no need for this hierarchy. This works equally well

class Dog:
    def speak(self):
        print "woff!"

class Cat:
    def speak(self):
        print "meow"

def makeSpeak(a):
    a.speak()

d=Dog()
c=Cat()
makeSpeak(d)
makeSpeak(c)

In python, this kind of inheritance application is irrelevant because there's no concern of the type. you can send the signal "speak" to any object you want. If the object is able to deal with it, it will be executed, otherwise it will raise an exception. Suppose you add a class Airplane to both codes, and submit an Airplane object to makeSpeak. In the C++ case, it won't compile, as Airplane is not a derived class of Animal. In the python case, it will raise an exception at runtime, which could even be an expected behavior.

On the other side, suppose you add a MouthOfTruth class with a method speak(). In the C++ case, either you will have to refactor your hierarchy, or you will have to define a different makeSpeak method to accept MouthOfTruth objects... or in java you could extract the behavior into a CanSpeakIface and implement the interface for each ... There are many solutions, but I wont' go into this direction.

What I'd like to point out is that I haven't found a single reason yet to use inheritance in python: you don't need to implement a base-derived hierarchy to perform polymorphically. See also this post which points out a similar point of view.

So, if interface-only inheritance has no meaning, maybe it could be ok for implementation-based inheritance? What about recycling the code and the data of a base class? You can accomplish the same through a containment relationship, with the added benefit that you can alter it at runtime, and you clearly define the interface of the contained, without risking unintended side effects.

But yes, there's a natural case where a class hierarchy is indeed useful: Exceptions. Suppose your code is like this

class ServiceException(Exception): pass
class TrivialException(Exception): pass

class OutOfPaperException(TrivialException): pass
class OutOfTonerException(TrivialException): pass
class BrokenCircuitryException(ServiceException): pass

def usePrinter():
    raise OutOfTonerException()

def writeLocalAdmin():
    print "writing to local admin"
def writeServiceShop():
    print "writing to service shop"

try:
    usePrinter()
except TrivialException:
    writeLocalAdmin()
except ServiceException:
    writeServiceShop()

Now, in this case, you actually gain something. Suppose the printer raises OutOfTonerException. This exception will be caught by the try/except, and by virtue of this exception being hierarchically a TrivialException, it will write the local admin that something is wrong.

What I have written is the essence and the reason for the existence of Generic Programming in statically typed languages such as C++. As you can see, both approaches, OOP and Generic Programming are natural and smooth in Python.


Random, strange behavior? Check your free HD space first!

Date

Yesterday I was adding a small feature to GenomeAtlas when suddenly something strange happened: in the php script, images are generated dynamically using the gd extension, and saved into a cache. For some reason, suddenly some images were corrupted. At first, I blamed the Apache installation or the php interpreter, because none of the modifications I was doing involved the image generation routines. I was due to something else, so I kept the problem outstanding and left. When I came back, the problem had magically disappeared, but after a while it started again. I checked the cache, and some images were empty. Kind of weird.

I committed part of the code I modified, and the subversion client warned me that the disk was full, and it was unable to commit. Suddenly everything made sense. This happened to me every now and then, and I always forget this golden rule: if something weird and unexplainable starts to happen on your installation, check the free hard disk space first!


From URI to ontologies

Date

It is some sort of well known pattern, the "http://" stuff. It is so frequently used that browsers just fill it in automatically by default. But what does exactly mean? What's behind it? In reality, behind a so called URI, there's more than meets the eye.

What is an URI?

First of all, we need some definition. an URI, or Uniform Resource Identifier, is a string identifier that, as the name implies, identifies some resource. A resource is "something" that can be referred to. It could be an internet document, a book, whatever. The official definition of an URI is detailed in its latest form in RFC 3986.

The URI is made of the following parts:

  1. A scheme
  2. A hierarchical part
  3. Optionally, a query
  4. Optionally, a fragment

Some URIs can be URLs (Locators) or URNs (Names). Both are URIs, but the URL identifies a resource and the place where to find the resource, while a URN describes a name of a resource through the special scheme name urn:. When you type an address in your browser, you are specifying a URL. When you talk about urn:isbn:0-395-36341-1, you are using a URN to talk about a book with a specific ISBN. Please note that the relationship is not bijective: although every URL is a URI, not all URIs are URLs, and the same can be said for URNs instead of URL.

Although many scheme names are named after protocols (eg. http, ftp, ldap) this is mostly incidental. There is no "technical correspondence" between the scheme and the protocol: for example by saying a URI with the scheme name http, trying to get the resource (therefore assuming the URI is also a location) will trigger not only HTTP, but also DNS. Moreover, if you have the document in cache, you will not use HTTP at all, but you will get a resource that is on your hard drive. "Resolving" a URI means finding an access strategy to "use" (in very broad term) the resource identified by that URI, an operation called "dereference". The most common dereferencing is retrieval, like downloading the resource.

The hierarchical part includes either an authority and a path, or just a path. Please note that the "path" concept is very loose: for example, in mailto:user@example.com, the user@example.com is a path. Similarly, in urn:isbn:0-395-36341-1, the path is isbn:0-395-36341-1 (which corresponds to the Webster Dictionary, in case you are wondering). When you have an authority involved for the resolution of the path, and only in this case, then you have the "//" to indicate the authority, for example http://example.com/foo/bar, where the authority is example.com and the path is /foo/bar. An apparent exception is telnet://example.com, but in any case you do have an authority, and the path is always empty. Another apparent strange situation is mailto:user@example.com: in this case, be careful not to confuse a path with an authority (maybe user qualified, as in http://user@host/).

What can you use URIs for ?

So, what does the distinction of URI, URL and URN is really useful for? The fact is that, while a URL refers to a "place" where to find a resource, a URN refers to a "name"of a resource in the urn scheme, and a URI as a name of a resource in any scheme. When we talk about resource, the meaning is very broad. It could be anything, even an abstract concept. An example is the definition of namespaces in XML, something like

<h:html xmlns:h="http://www.w3.org/1999/xhtml">
 <h:head><h:title>Hello!</h:title></h:head>
 <h:body>
   <h:h1>Big Title</h:h1>
 </h:body>
</h:html>

The URI http://www.w3.org/1999/xhtml is just an URI. It is not a location (the fact that W3C is actually using the address to provide an informative page can be seen as a coincidence). It is a symbol to mean something, namely, the fact that some elements in that XML documents belong to the vocabulary of XHTML.

In Chestnut package manager, the XML manifest starts with

<Package xmlns="urn:uuid:d195be0c-200a-40a4-9d05-35fdf42eb29f" version="1.0.0">

Where the URI urn:uuid:d195be0c-200a-40a4-9d05-35fdf42eb29f again means something: the fact that the "grammar" used is the one of Chestnut package manager. I created this URI with the utility uuidgen, and by definition is a unique id. I could have used anything else, even the URI of this post. The important point is that it has to be a URI, and has to be unique for this specific use. In this particular case, the URI is also a URN.

Another interesting usage of URIs is in ontologies and RDF descriptions. Briefly and roughly, an ontology is a "description of a world". Suppose that you want to describe the following information

my guitar is white

and you want a computer to be able to understand it. Moreover, you would like to make the computer understand that the guitar is a musical instrument

The guitar is a musical instrument

and that white is a color

white is a color

Now, if you ask the computer to search all the musical instruments that are white, you would like to get my guitar, because it is white, and because it is a musical instrument. Seems easy, but it's not. The fact is that humans are very smart at interpreting those phrases, but a computer is not. Quoting Bill Clinton, it depends on what the meaning of "is" is. By saying my guitar is white we describe a property of the guitar, specifically its color. We have a subject (my guitar), a verb (is, in terms of has color) and a predicate (white). This is also known as a triplet, and is the basis of RDF.

Of course, if we take the guitar is a musical instrument we have again a case of triplet: the guitar is the subject, is a in terms of is a kind of is the verb, and musical instrument is the predicate. Please note that we have another interesting fact here: while my guitar is a very specific guitar (that is, the guitar I own), the guitar is an abstract concept that applies to any guitar. They are not the same thing, they are two different concepts, but we can say without doubt that my guitar (the first concept) is a guitar (the second concept). This is a case of instance/class relationship.

We can form very complex networks of subject-verb-predicate triplets describing the digital and non-digital world we live in, so that a computer can help us in doing complex search and analysis. How do we differentiate all the concepts and ambiguities we just encountered? You guessed it: with URIs. We will have a URI to express the concept of my guitar, another URI to express the concept of white, another URI to express the concept of guitar, musical instrument, color, and we will also have different URIs expressing the concepts of is (as a color) and is (a kind of). All these concepts are part of the description (also known as ontology) we want to grant to the small world we created in this example. This description is rather simple and loose, but we can define way more complex ontologies. For example, we can create a color ontology, describing the colors and their relationships:

white is a color
red is a color
snow white is a kind of white
blood red is a kind of red
red is a warm color
blue is a cold color

or an instrument ontology describing

guitar is a six-string instrument
violin is a four-string instrument
four-string instrument is a string instrument
six-string instrument is a string instrument
string instrument is a musical instrument

Each of these concepts (guitar, six-string instrument, violin, four-string instrument, string instrument, musical instrument) will then be referred by means of a unique URI.

I hope I gave a reasonable and down-to-earth introduction to URIs and Ontologies, what they mean and represent, and what they can be used for. As usual, you are welcome to comment.


Error 1044 in MySQL: Access denied when using LOCK TABLES

Date

I got an error while using mysqldump

mysqldump: Got error: 1044: Access denied for user x@y to database z when using LOCK TABLES

To solve this problem, either ask you administrator to grant you the lock privileges, or use the following command instead.

mysqldump -u username -p database --single-transaction >dump.sql

This is the help entry for the keyword

--single-transaction
          Creates a consistent snapshot by dumping all tables in a
          single transaction. Works ONLY for tables stored in
          storage engines which support multiversioning (currently
          only InnoDB does); the dump is NOT guaranteed to be
          consistent for other storage engines. While a
          --single-transaction dump is in process, to ensure a
          valid dump file (correct table contents and binary log
          position), no other connection should use the following
          statements: ALTER TABLE, DROP TABLE, RENAME TABLE,
          TRUNCATE TABLE, as consistent snapshot is not isolated
          from them. Option automatically turns off --lock-tables.

Chestnut Package Manager 2.0.0 released

Date

I just released a program I developed: Chestnut Package Manager, a utility to handle executables and resource files in a transparent, platform independent and relocatable way. Its concept is similar to Apple bundles and Java archives. It is implemented in Python.

I also provide a nice tutorial about how to use it and how to deploy your packages. It has been very useful to me, and I guess it will be for other people out there.


Is SVN slow?

Date

I am using SVN to manage my current development repository. As the project grew, the operations became slower and slower. Things like updating or committing could require minutes.

I ran some strace, and apparently SVN takes a lot of time in walking through the repository tree, probably checking for differences between the previous copy (in .svn) and the current copy. Also, it walks through the tree to remove all the lock files it created. These operations grow with the number of subdirectories involved during the operation, so if you don't want to spend your time staring at the ceiling while committing or updating, either you keep the number of subdirs to an acceptable low amount, or you invoke operations involving a low amount of directories (like, committing only a subtree).

My situation is maybe quite extreme, as my repository has 2000 directories (I have lots of small applications to keep track of). I could check out a part of the repository, or split the repository into smaller, independent ones, but I don't want to risk: it could prove potentially dangerous right now, where I have to work on other issues with a quite tight schedule.

Despite this, I like SVN, but I am considering taking a look at git.