Bug fingerprinting and code phylogenies

I recently read this post on a GPL violation occurred for the game engine ScummVM. It appears that Atari released a game who was obtained by a sub-sub-contractor, who used the GPL released code taken from ScummVM, without giving credit or obey to the GPL license. That would be easily solved except that the Nintendo NDA forbids the use of opensource software (something I don’t get. What’s the problem with opensource ?) so a settlement had to be found.

The interesting part is how they recognized the GPL violation, and recognized the exact version of ScummVM used: by verifying bugs. Bugs are characteristics quirks of every software and every version of it. You can use this technique to find exactly which build you have in front.

This can be taken to the extreme.

Phylogenetic analysis is a technique that is used in evolutionary biology. You get the animals of today, with their genetic code, and you compare their code. It won’t be the same: you will have some differences. It is clear that these differences come from modifications from a common genetic ancestor, which has mutated. By using the proper approach, you can find how related are these organisms, and have an idea on how the ancestor code was with a certain degree of probability.

You can do phylogenetic analysis on books as well. Highly copied and translated books, like the bible, or the odissey, are available in many different copies, written by hand in different times from a different starting copy. By doing phylogenetic analysis on the text, you can obtain useful information about when a given copy has been performed, if it’s authentic, and how the original text should have looked like.

Phylogenetic analysis on the source, or on the bugs, to reconstruct the evolutive steps of the code is another very powerful application, in particular when legal issues are present. Is the Felsenstein going to be in every software lawyer’s bookshelf ?

Share This

Chestnut 2.2.0

I just released Chestnut 2.2.0. Just grab it while it’s hot!

Share This

Google Wave - say goodbye to email

I had the chance to see the google wave presentation

I was in awe with the power of this old, new thingie. As soon as wave is released to third parties, email will be a thing of the past in less than three years. Start counting, I can bet on it.

Share This

Pythonic Evolution - part 2

This is the second part of a post relative to evolution. You can find the first part of the post here.

The last argument in the first post was relative to the requirements for evolution to happen. To recall, you need

  1. An imperfect replicator, an entity able to produce a copy of itself, for example the DNA of our bacteria in the example above. The replication mechanism must have a certain degree of imperfection, so that some chance for occasional mutation must be possible, and consequently variability.
  2. Action of the replicator on the environment. For example, The DNA molecule, through a complex biological setup is able to produce proteins which, according to the laws of chemistry, influence the external world by digesting a metabolite and obtaining energy, or fighting an aggression to survive (to UV light, in our example) and reproduce.
  3. Variability of the replicator “phenotype” (which means the “final visible result” of the replicator). Small modifications (accidental or not) to the replicator will change its interaction with the external environment (for example, by granting him to survive better under UV light)
  4. Environmental conditions: the environment will propose conditions, for example availabe metabolites, temperature, pressure, UV light irradiation and these conditions could also change as time passes..
  5. Selective pressure: the interaction between the environmental conditions and the replicator phenotype will create a differentiation between replicators. Those better coping with the conditions will have higher reproductive chance than the others, producing a drift toward better replicators (in our case, the bacteria were under selective pressure because the UV light was killing them, but the mutated ones were more likely to survive)

We offered the example of bacteria where a slight mutation offered to a single organism a very tiny additional chance to survive and produce children. In this very simple model, the mutated (better fit) characteristic became predominant in approx. one thousands generations, not many when you consider bacteria having replication time in the order of hours or days.

We closed the post by stating that this behavior works regardless of the nature of the replicator and the environment. It works in nature with the genetic code, compiled by ribosomes into proteins. It must works in a digital environment with digital bacteria too.

Four years ago, I developed a simple python program to show how “software bacteria” can evolve behavior to solve simple mathematical equations with no active human programming. The idea is as follows: a python program defines a “Creature” class, which represent our “bacterium”. Also, I define an “Environment”, where a bunch of Creatures will live, prosper and hopefully reproduce. The environment will provide “mathematical food” to the bacteria, and the better the bacteria elaborate this food, the better for them, and the higher will be their chance to survive and reproduce.

In real bacteria, the genetic code is converted into protein, and the proteins are responsible for processing metabolites (food) according to the laws of chemistry. In our example, we will make a simplification, skipping the protein step and having the mathematical food directly processed by our genetic code.

The mathematical food will be a simple integer number and the genetic code will be a very simple language able to do sums (of signed numbers) and conditional branching. The result of the genetic code processing the food will be another integer number, a metabolized mathematical result. The environment will promote a condition so that those bacteria whose metabolized product has a particular characteristic (in our case, a particular resulting value) will have a higher chance of reproduction. The condition the environment will propose is the expected result of a simple mathematical equation. Ideally, at the end of this experiment, the evolved bacteria will possess genetic code able to perform the mathematical equation for any given input. In this sense, they evolved to process the food in the best possible way.

Confused? Don’t worry. It will be clearer soon.

Here comes The Creature

Cellular systems are not that different from computer programs. There is a programming language, a compiler, input and output parameters. There’s even memory.

In DNA, the genetic code is made out of four molecular letters, A G C T, forming three letters words, named codons. DNA is first copied to RNA. The RNA then is interpreted by an elaborate mechanism (the ribosome) able to translate each group of three-letter words into a specific amino acid (out of 20) as building block (think LEGO) to be put in a protein. The protein so created is a molecular machine which performs a chemical task, like digesting a metabolite, storing substances (eg. iron), providing structural support and so on. How the system got started is currently not known, but it takes the creation of a replicator molecule to start the process, in particular if this molecule has self-catalytic properties (meaning: it makes easier the creation of a copy of itself). Likely candidates are vulcanic activity, electric discharges (lightnings), or panspermia. I tend to favor the electric discharge, coupled with catalysis by means of metals (found in the rocks). After all, the Miller-Urey experiment demonstrated that if you mix and boil and discharge electricity long enough, you will generate aminoacids out of simple gases like water, methane, nitrogen and ammonia, something that it was present for sure into the Earth primordial atmosphere. So, at the moment is not know which was the replicator and how it got produced, but even if unlikely as an event, once started it grows and is basically unstoppable, provided proper conditions are met.

Once you have aminoacids, you have proteins. Biological systems evolved to use proteins because they are more efficient and disposable, while the genetic code is less efficient, and very important. After all, programmers keep their source code in a backed up subversion repository, while giving out compiled versions. Loosing a compiled version is not a big issue, you can always recreate it from the sources, but having your subversion system go corrupt means that your asset is totally lost, and could mean that your company is out of business. As already said, in our in-silico setup, we skip the translation part for simplicity.

In our mathematical bacterium we have three variables that can be set: A, X and Y. Each of them has a different role and particularity. A is the import/export variable where the value representing the food is first inserted, modified, and released to the environment. You can see it as the unique input/output parameter of our “genetic code function”. X and Y are internal variables. You can make an analogy to internal metabolites needed to support the food consumption. The main difference between the two is that X is more of a counter: it can be set, incremented (or decremented) and tested for being zero and decide if branch (or not). Y is instead auxilliary in processing A. Our bacterium has a genetic code which consist of 10 codons. Like amino acids are able to perform various chemical operations on substrates by their own chemical nature, these operations can perform different mathematical or logical tasks. The codons are:

  • LoadA, LoadX, LoadY: each of them loads a specified value in the variable A, X or Y.
  • IncA, IncX: increments the value of the variable A or X of a given specified amount (can be negative)
  • MoveAtoY: copies the content of A into Y.
  • AddYtoA: performs the sum between the content of A and Y, and stores the result in A.
  • BranchXZero,BranchXNotZero: jumps a specified number of codons forward (or backwards) if the content of the variable X is zero or not-zero, respectively.
  • Return: terminates the execution of the genetic code.

Let’s see some example of genetic code that does something to a mathematical food.

This is the most trivial one

Return

A bacterium with this genetic code will take the metabolite (an input number), and load it automatically in A. Then, it will finish immediately with no processing. The content of A is the result of the metabolic process. In this case, the bacterium returns what it eats. you give him 5, it returns 5. You give him 13, it returns 13.

A more interesting case is the following:

IncA 5
Return

The bacterium with this genetic code in the first instruction will increment 5 to the content of A. The second statement will return whatever it is containted in A. This bacterium eats 4 and returns 9, eats 13 and returns 18, etc. You get the idea.

So, now you can imagine a population of bacteria, and imagine that the genetic code was created with a completely random process. For example, say that we create a population of 3000 bacteria with the following criteria:

  1. When you create each bacterium, you extract a random number of codons (from 2 to 50) which will be used to generate their genetic code.
  2. Given the number of codons for a specific bacterium, you extract that number of randomly chosen codons from the available pool (LoadA, LoadX, LoadY, IncA, IncX, MoveAtoY, AddYtoA, BranchXZero, BranchXNotZero, Return).
  3. For codons accepting a numeric value (LoadA, LoadX, LoadY, IncA, IncX, BranchXZero, BranchXNotZero), extract a random number from, say, -5 to +5 and use it as a numeric value.
  4. What you obtain is a bacterium whose genetic code is a random mess of a random number of random codons with random parameters.
  5. And of course you obtain a population of 3000 bacteria all with random genetic code.

If you feed a number (say 42) to each bacteria, you will expect many different results. Each bacterium will be fed with the number 42 (which will be placed in A) and then the randomly generated set of operations will occur. Nice, but not particularly useful.

But here the cool stuff begins. Suppose you decide to say: if the environment provides 42, those bacteria that produces a result close to 47 are more likely to survive. Those who produce a numeric value very far from 47 are instead more likely to die. With this in mind, you start killing bacteria. Those who return exactly 47 will survive. Those that return 48 have a slight chance of dying, but not much. Those who return 0, or 500 will be probably killed immediately. Out of the starting 3000 bacteria, you will now have a troop of survivors (say 100) whose genetic code produce, by pure random chance, something that is quite near to the expected result (47) out of the food value 42.

Now you allow this bacteria to reproduce. Of course, if you take the 100 survivors and produce exact clones so to repopulate up to 3000, you will obtain no improvement. Here the “imperfect replication” kicks in. You allow a random number of mutations to occur to each bacterium before duplicating. These mutations will change the genetic code, potentially creating a new program that produces something lethal (too far from 47) but also something with better fit (something quite near to 47).

After this event takes place, you allow the bacteria to replicate so that you restore your pool of 3000, and you apply selection again. You feed them 42 and you kill all those bacteria producing results too far from the expectation (47). New survivors, new mutations, new generation, and you go on and on.

As you can see, all the conditions for evolution are met:

  1. An imperfect replicator exists: it’s our genetic code based on mathematical codons. Replication is imperfect because we have random mutation of the genetic code at every new generation.
  2. Action of the replicator on the environment. The genetic code takes a number and process it into another number.
  3. Variability of the replicator “phenotype”. Modifications on the genetic code produce modification in the final resulting value.
  4. Environmental conditions: The environment presents 42 and expects 47 as a good value indicative of a nice processing.
  5. Selective pressure: genetic code responding at best to the environmental conditions will have a higher chance to survive and produce a new generation. Genetic code that is slightly less accurate will have a lower chance to survive, and genetic code producing values too far from what the environment considers a proper response will be killed.

In the next post, we will see how this mechanism has been implemented into a small python program, and we will see what happens for different cases.

Share This

Pirates Pirates!

So I discovered that I apparently entered the sacred shrine of those people whose work is available in the warez underground.

I don’t think I am allowed to state my personal opinion on this regard. Suffice to say I think this means for sure that I reached a fair level of achievement.

Quick, someone get some ninjas, so we can deliver pizzas!

Share This

Pythonic Evolution - Part 1

This post is in different parts. The fact is that it requires quite a lot of time investment, something I really don’t have in this period.

A long time ago I wanted to play with the concept of genetic code, and how it represents nothing but a language to code for molecular machines. As the Jacquard loom, it’s nothing but a chemical punchcard who gets executed by the ribosome loom, and loomed into a protein. The fact behind evolution at this level is that if a small mutation happen in the punchcard by accident, and the resulting protein turns out to provide a competitive advantage for survival and reproduction, then this wrong punchcard will naturally exploit this advantage by becoming the predominant form. Even a very tiny advantage is sufficient, in the span of million of years, to reach fixation (the old code is completely replaced by the new one).

The math is simple. Here is a program that does it

normal = 500
mutated = 1

normal_survive_prob = 0.90
mutated_survive_prob = 0.91

for generation in xrange(1,1000):
    normal -= int(normal * (1-normal_survive_prob))
    mutated -= int(mutated*(1-mutated_survive_prob))

    normal *= 2
    mutated *= 2

    total = normal + mutated
    print "Generation %d -- normal : %d %%  mutated : %d %% " % \
          (generation, 100*normal/total, 100*mutated/total)

In this program (which has some simplifications, but the concept is there), we have a population of bacteria. We start with 500 normal bacteria, and only one mutant. This mutant has an advantage: its mutation makes him more likely to survive (for example, to an antibiotic, or to UV light from the Sun, or to oxygen). In this case the probability that the standard bacterium can survive is 90 %. In other words, if you have a large quantity of bacteria, some of these will die due to UV light. If you divide the number of survived bacteria by the number of bacteria you had, you will find out that 90 out of 100 survived, and the other 10 out of 100 died. The remaining 90 % do what bacteria do all the time: reproduce by splitting in two, meaning that if 40 bacteria survived the UV light and they reproduce, you will get 80 bacteria. In turn, some of them will die, and other will survive, and reproduce, and this goes on and on.The mutated bacteria has an advantage: its mutation makes it more resistent to UV light, so it has a slightly higher probability of surviving: 91 %

If you run the program, you will find out thatat generation 1, almost everything is made out of normal bacteria

Generation 1 -- normal : 99 %  mutated : 0 %

At generation 100, you see that the mutated ones start to be visible

Generation 107 -- normal : 99 %  mutated : 1 %

At generation 365, the mutated ones can start a political party

Generation 365 -- normal : 85 %  mutated : 14 %

After 525 generations, they are fifty/fifty:

Generation 525 -- normal : 49 %  mutated : 50 %

At generation 700, the mutated ones are the large majority

Generation 700 -- normal : 12 %  mutated : 87 %

And finally, at generation 1000 the normal ones are basically disappeared, and the mutated ones conquered their environmental niche:

Generation 1000 -- normal : 0 %  mutated : 99 %

Now, if you consider that a bacterial generation happens in days, or even hours, you see how it does not take much for bacteria to evolve. If our bacterium here divided every day, we would have a total fixation after just three years. If you assume that the difference in probability is 0.01 % (instead of 1% as in this example)  you will get the same result, it will just take more time (and even million of years is a blink of an eye against the age of the Earth).

The best part about evolution is that it works for everything behaving with the same modality. What you need is:

  1. An imperfect replicator, an entity able to produce a copy of itself, for example the DNA of our bacteria in the example above. The replication mechanism must have a certain degree of imperfection, so that some chance for occasional mutation must be possible, and consequently variability.
  2. Action of the replicator on the environment. For example, The DNA molecule, through a complex biological setup is able to produce proteins which, according to the laws of chemistry, influence the external world by digesting a metabolite and obtaining energy, or fighting an aggression to survive (to UV light, in our example) and reproduce.
  3. Variability of the replicator “phenotype” (which means the “final visible result” of the replicator). Small modifications (accidental or not) to the replicator will change its interaction with the external environment (for example, by granting him to survive better under UV light)
  4. Environmental conditions: the environment will propose conditions, for example availabe metabolites, temperature, pressure, UV light irradiation and these conditions could also change as time passes..
  5. Selective pressure: the interaction between the environmental conditions and the replicator phenotype will create a differentiation between replicators. Those better coping with the conditions will have higher reproductive chance than the others, producing a drift toward better replicators (in our case, the bacteria were under selective pressure because the UV light was killing them, but the mutated ones were more likely to survive)

A famous example of evolution at work is antibiotic resistance. The selective pressure of a poisonous substance (an antibiotic) kills the organisms not able to neutralize it, while those less sensitive to it survive and reproduce, creating more copies of so-called antibiotic resistant bacteria. Very soon, the antibiotic becomes ineffective against this new breed, and a new one must be invented. This is also the reason why antibiotics should be used sparingly, and always up to the end of the treatment, so to be sure that even those bacteria that are more resistant received a lethal dose and are therefore unable to start a resistant strain.

But the funny part of the evolutive setup is that it works no matter what the replicator and environment are….

Continued!

Share This

Smashing a car at the speed of sound

Since I moved abroad, I rediscovered good TV. And Mythbuster is definitely an interesting show. Adam and Jamie are so fantastic in exceeding expectations every time. But this time, they did more.

So, the legend says that two trucks smashing onto each other pancake a car so badly that nobody is able to spot it in the remaining of the aftermath. So to either confirm or disprove it, they smashed the trucks for real, but the car was not pancaked as the legend says. As is typical from Adam and Jamie, they push the conditions to the limit so to verify what are the real parameters that would be needed for achieveing the desired effect.

So they shoot a metal missile at the speed of sound against a car, using a two-stages rocket sledge. See it by yourself.

What amazes me the most is the slow motion capture of the event. The car gets literally erased in the process. In one shot toward the end you can also see the air waveshock due to the achieved speed.

Share This

Change separator in gnuplot

Gnuplot is a great software. Very useful for easily plotting datapoints without fuss and complicated interface. However, the default accepted format is a table defined by space-separated values. If you have comma separated values, then you have to change the delimiter. But how?

This command

set datafile separator ","

will do the trick.

Share This

Papers: organizing scientific publications

Part of the job of being a researcher is organizing knowledge, which most of the time comes in the form of scientific papers. Doing a bibliography search can take a large amount of time, and organizing it in a proper way is even more complex.

Recently I spotted this application for MacOSX, Papers. It behaves in a similar way as iTunes, but instead of organizing songs and movies, it organizes scientific papers, allowing you to download them, track them, search them and produce a bibtex file of your collection for the LaTeX/BibTeX citation mechanism. It is not open source, and you have to pay for it, but I found the price reasonable for the features it offers, and the developers are very eager to answer your questions (I already contacted them twice to ask for new features, and they replied in a short time and with enthusiasm).

Please don’t take this post as an advertisement. It is an opinion about a product I use which, honestly, is helping me considerably in doing my job. If you have troubles organizing your paper library, and you have a mac, it is something worth considering. You can also check other Reference management software and leave comments on your experience.

Share This

Restyling!

I decided to restyle the blog a bit. Images and design are from the following websites and people, which I gratefully thank

The restyling is not done yet. I will add more credits to this post if I will use more third party productions.

Share This
Close
E-mail It