About class attributes, semantics and microformats

I just got to this post by Richard le Guen via the referrals to my blog, and I feel it's important to clarify my point.

So, let's describe the problem. In HTML, you describe the structure and content of your information. How the information will look is "presentation", and it is managed via a so-called Cascading Style Sheet, or CSS. Change the CSS, and the visual aspect changes.

To bind the entities you describe in HTML to how they will appear, you use a standardized attribute, the class. An example is:

<div class="info">This is a message</div>

In this example, the info label is the class assigned to the content of the <div> HTML tag. Now you can change its appearance with a CSS directive like

.info {
   color: blue;
   font-size: 13px;
}

if you want the text to appear blue and of a given size. So far so good.

One of the problems of today's web is giving semantic meaning to things, so that a computer can extract the information and do smart things with it. Unfortunately there's no clear way to do this (well, there is, but we would get into an argument too complex to be covered here).

Enter microformats: the point of microformats is to grant semantics through ad-hoc class names that are not only used as presentational hooks for the CSS, but are also assigned a well-defined meaning. Example:

<div>
  <div>Joe Doe</div>
  <div>The Example Company</div>
  <div>604-555-1234</div>
  <a href="http://example.com/">http://example.com/</a>
</div>

can be endowed with semantic meaning if you use the hCard microformat:

<div class="vcard">
  <div class="fn">Joe Doe</div>
  <div class="org">The Example Company</div>
  <div class="tel">604-555-1234</div>
  <a class="url" href="http://example.com/">http://example.com/</a>
</div>

With this solution, a computer can now extract meaning from the class attributes, and do smart things like aggregating the data and allowing searches on it.

Some time ago, on Jeff Atwood's blog, Coding Horror, a point was made about microformats and their problems, in particular the fact that (citing Jeff) "the crux of microformats is overloading CSS classes". I tended to agree, considering it technically wrong to overload CSS classes with meaning. Richard objected that the W3 consortium's very definition of the class attribute also allows "for general purpose processing by user agents", which convincingly includes hCard scraping tools.

On this, Richard is right, and I am reconsidering my previous stance. The point is that the class attribute is indeed specified as a general purpose tool, which specifically and most frequently is used for styling purposes. Microformats are, from the formal point of view, a totally legitimate use of the class attribute. Nevertheless, when you overload classes with meaning you can run into all the problems Jeff Atwood points out, and you should be very, very careful, or chaos will ensue. Formal approval of a usage is not always indicative of a safe practice, and in this case the problem arises from the fact that we tend to think of class names as something that has meaning only within our web application. With microformats, this meaning extends beyond the boundaries of our self-contained world, and these conflicting views can produce problems.

Edit: Another issue worth reporting is namespacing. Granting semantic meaning to a given class attribute means that a given string, say "vcard", conveys a specific meaning which is unique. If you take every possible string in the world, they all belong to a single, flat namespace. RDF-based approaches to semantics use namespacing to distinguish the different meanings granted to the same string: instead of a plain string, they use a URI. Microformats don't have namespacing, but they have containment. For example, the class "tel" in the vcard microformat can be distinguished from another "tel" (for example, one indicating a table cell on the same webpage) by the fact that the one in the microformat matches the selector ".vcard .tel". It's a sort of namespacing, although with a different mechanism, built on the containment relationships among HTML elements.
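
To make this concrete, here is a minimal sketch of how a program could scrape the hCard above (assuming the BeautifulSoup 4 library; the markup and variable names are purely illustrative). The ".vcard .tel" selector shows the containment-based disambiguation in action:

from bs4 import BeautifulSoup

html = """
<div class="vcard">
  <div class="fn">Joe Doe</div>
  <div class="org">The Example Company</div>
  <div class="tel">604-555-1234</div>
  <a class="url" href="http://example.com/">http://example.com/</a>
</div>
<div class="tel">not a phone number, just another element reusing the class</div>
"""

soup = BeautifulSoup(html, "html.parser")

# ".tel" alone matches both elements; ".vcard .tel" only matches the one
# contained in a vcard, which is how microformats disambiguate class names.
print(len(soup.select(".tel")))          # 2
print(len(soup.select(".vcard .tel")))   # 1

card = {
    "fn": soup.select_one(".vcard .fn").get_text(),
    "org": soup.select_one(".vcard .org").get_text(),
    "tel": soup.select_one(".vcard .tel").get_text(),
    "url": soup.select_one(".vcard .url")["href"],
}
print(card)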

It's mind-bending if you think about it, but once you see how it works, it's kind of smart. It's not as complex and demanding as RDF, nor as powerful as OWL-based ontology description, but it can work for semantic data of simple to medium complexity.


The subtle art of writing a code example

One of the most frustrating experiences when learning a new technology is finding useless examples. An example is the most precious thing that comes with a new library, language, or technology. It must be a starting point: a wise, unadulterated explanation of how to achieve a given result. A perfect example must have the following characteristics:

  • Self contained: it should be small enough to be compiled or executed as a single program, without dependencies or complex makefiles. An example is also a strong functional test of whether you installed the new technology correctly. The more issues can arise, the more likely it is that something goes wrong, and the more difficult it is to debug and solve the situation.
  • Pertinent: it should demonstrate one, and only one, specific feature of your software/library, involving the minimal additional behavior from external libraries.
  • Helpful: the code should bring you forward, step by step, using comments or self-documenting code.
  • Extensible: the example code should be a small "framework" or blueprint for additional tinkering. A learner can start by adding features to this blueprint.
  • Recyclable: it should be possible to extract parts of the example to use in your own code.
  • Easy: example code is not the place to show off your code-fu skillz. Keep it easy.

And here comes the helpful acronym: SPHERE. Yes, I looked for the correct synonyms to make it, after the main concepts were in place.

Prototypical examples of violations of those rules:

Violation of self-containedness: an example spanning multiple files without any real need for it. If your example is a python program, keep everything in a single module file. Don't sub-modularize it. In Java, try to keep everything in a single class, unless you really must split some entity into a meaningful object you need to pass around (and Java mandates one public class per file, if I remember correctly).

Violation of Pertinence: when showing how many different shapes you can draw, adding radio buttons and complex controls with all the possible choices for point shapes is a bad idea. You defocus your example code by introducing code for event handling, control initialization and so on, which is not part of the feature you want to demonstrate; it is unnecessary noise in the understanding of the crucial mechanism providing the feature.

Violation of Helpfulness: code containing dubious naming, wrong comments, hacks, and functions longer than one page of code.

Violation of Extensibility: badly factored code that has everything in a single function, with potentially swappable entities embedded within the code. Example: if an example reads data from a file and displays it, create a method getData() returning a useful entity, instead of opening the file and plotting the stuff inline. This way, if the user of the library needs to read data from an HTTP server instead, he just has to modify getData() and use the example almost as-is. Another violation of Extensibility occurs when the example code is not under a fully liberal (e.g. MIT or BSD) license.
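
To make the point concrete, here is a minimal sketch of such a factored example (the names getData and display, and the CSV data source, are purely illustrative):

import csv

def getData(filename="data.csv"):
    # Return the data to display as a list of (x, y) tuples.
    # Swap this function for one that queries an HTTP server, a database,
    # etc., and the rest of the example keeps working unchanged.
    with open(filename) as f:
        return [(float(x), float(y)) for x, y in csv.reader(f)]

def display(points):
    # Stand-in for the library feature the example actually demonstrates.
    for x, y in points:
        print("%g %g" % (x, y))

if __name__ == "__main__":
    display(getData())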

Violation of Recyclability: when the code layout is so intermingled that it is difficult to copy and paste parts of it and recycle them into another program. Again, licensing is also a factor.

Violation of Easiness: yes, you are a functional-programming nerd and want to show how cool you are by doing everything in a single line of map, filter and so on, but that might not be helpful to someone else, who is already under pressure to understand your library and now has to understand your code as well.

And in general, the final rule: if it takes more than 10 minutes to compile the code, run it, read the source, and understand it fully, the example is not a good one.


Image self consistency from xkcd

I love xkcd. A comic combining fun and math has, by definition, to be good and geeky, and its author, Randall Munroe, is a real genius at this. The latest comic is pretty interesting:

[xkcd comic image]

The image is self-descriptive, meaning that each graph represents information about the image itself. For example, the first panel contains a pie chart which says how many pixels of the image are white and how many are black. Clearly, the relative amount of black pixels in the image depends on the size of the slice of that pie chart representing the amount of black pixels: a "chicken and egg" kind of problem. It is apparently difficult to obtain such an image, because the plotted data must be consistent with itself via its own graphical representation. This kind of problem, where the solution depends on itself, is quite common in many scientific problems, and it's solved through self-consistency.

The trick is as follows: we start with a first, approximate solution, called a guess, and we apply a method that gives us a result depending on this guess. Then we take this newly obtained result and reapply the method to obtain a new result, and then again, and again, until, hopefully, the input and the output of the method are the same. When this occurs, we have solved our problem via self-consistency. Of course, this convergence is not guaranteed to occur, but if it does, we have found a solution (there could be more than one).
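
As a toy illustration of the idea (on the equation x = cos(x), not on the image itself), a self-consistent loop looks like this:

import math

x = 0.5                          # initial guess
for step in range(100):
    new_x = math.cos(x)          # apply the method to the current guess
    if abs(new_x - x) < 1e-10:   # input and output agree: self-consistency
        break
    x = new_x

print("converged to x = %.10f after %d iterations" % (x, step))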

Let's see it in action in a simplified form. I wrote two small python programs, using matplotlib and the Python Imaging Library. The first (called piechart.py) creates a pie chart from a given data input:

import sys
from matplotlib import pyplot

# pixel counts passed on the command line
white = int(sys.argv[1])
black = int(sys.argv[2])

# draw a two-slice pie chart and save it to the given output file
pyplot.pie([white, black], colors=('w', 'k'))
pyplot.savefig(sys.argv[3], format="pdf")

If we call this program specifying two values (the absolute values are not important, as the pie chart shows relative amounts), it draws the pie chart accordingly:

python piechart.py 100 400 piechart_100w_400b.pdf
convert -geometry 210x158 piechart_100w_400b.pdf piechart_100w_400b.png
[image: pie chart convergence, iteration 0]

This creates a pie chart where white is 1/5 of the pie chart area and black is 4/5. Please note that due to a setup problem with my matplotlib I can only create PDFs, so I convert the PDF into a PNG of defined size, in our case 210x158, using the convert program. The total size of the image is of course important, as it determines the total number of pixels. I chose a value, good for presentation purposes, that guarantees quick convergence.

The second program is called imagedata.py and extracts the size and the number of white and black pixels from an image:

import sys

from PIL import Image

im = Image.open(sys.argv[1])
white = 0
black = 0
for i in im.getdata():
  if i == (255,255,255):
    white += 1
  else:
    # we assume black everything that is not white:
    black += 1
print im.size[0],im.size[1],white,black

If we run this program on the png image, it will tell us how many pixels are white, and how many are black.

$ python imagedata.py piechart_100w_400b.png
210 158 23988 9192

Of the 33,180 pixels in the full image above (border included, not only the pie chart circle), 23988 are white (72%) and 9192 are black (28%). Hence the image is not representing itself: the plot shows our initial values of 20% white and 80% black.

Now we create a new image, in agreement with the iterative procedure, passing the most recently obtained values

python piechart.py 23988 9192 piechart_23988w_9192b.pdf
convert -geometry 210x158 piechart_23988w_9192b.pdf piechart_23988w_9192b.png

and repeat the process. This becomes tedious very soon, so I wrote a driver (driver.sh) to perform the process for me

# generates the starting guess
python piechart.py 100 400 iter_0.pdf
convert -geometry 210x158 iter_0.pdf iter_0.png

# iterative process
echo "step w   h  white black"
step=1
while true;
do
 data=`python imagedata.py iter_$(($step-1)).png`
 echo "$step - $data"
 python piechart.py `echo $data|awk '{print $3}'` `echo $data|awk '{print $4}'`  iter_$step.pdf
 convert -geometry 210x158 iter_$step.pdf iter_$step.png
 step=$(($step+1))
done

If we run it, we immediately see a very interesting result

step w   h  white black
1 - 210 158 23988 9192
2 - 210 158 29075 4105
3 - 210 158 30551 2629
4 - 210 158 30977 2203
5 - 210 158 31108 2072
6 - 210 158 31158 2022
7 - 210 158 31164 2016
8 - 210 158 31169 2011
9 - 210 158 31172 2008
10 - 210 158 31172 2008
11 - 210 158 31172 2008
12 - 210 158 31172 2008

The number of black pixels decreases, and the number of white ones increases. At every step, the image changes slightly, until it reaches a point where it does not change anymore: it has achieved self-consistency, and it is representing itself. This is a movie of the various steps until convergence:

[animation: pie chart convergence]

What if we had started from the other direction, namely with a guess containing zero as the number of black pixels? The result would have been the same:

1 - 210 158 31750 1430
2 - 210 158 31320 1860
3 - 210 158 31221 1959
4 - 210 158 31184 1996
5 - 210 158 31178 2002
6 - 210 158 31174 2006
7 - 210 158 31172 2008
8 - 210 158 31172 2008
9 - 210 158 31172 2008

Again, even with a different starting guess, we obtain the same result, here depicted as a movie

[animation: pie chart convergence]

I hope this gave a brief explanation of how Randall achieved the self-consistent image. His case was more complex, having three plots. Also, the comic is scribbled, so either he drew it by hand, approximating the computed result, or he performed some scribble-like transformation preserving the pixel count. I assume it is the former.


How much statistics should one know ?

I just wrote an answer to this very interesting question on Stack Overflow. Now, as a disclaimer, I'm not an expert in statistics, but I did enough statistics to "know the beast", or at least to know where the dangers are. I will rearrange my answer for this post, to address the more general case.

The main issue is "How much statistics should any person know?". In our lives, we all deal with statistics, willingly or not. Polls, weather forecasts, drug effectiveness, insurance, and of course some parts of computer science. Being able to critically analyze the presented data draws the line between extracting the right understanding from it and being scammed, tricked, or misdirected.

Technically, the following points are important:

All these points are critical if you want to evaluate any claim with the appropriate grain of salt. Yet they are not the whole story. Let's face it: statistics requires understanding before anything can be inferred, otherwise wrong conclusions will be drawn. I will give you some examples:

  • The evaluation of the null hypothesis is critical for testing the effectiveness of a method: for example, whether a drug works, or whether a fix to your hardware had a concrete result or it's just a matter of chance. Say you want to improve the speed of a machine, and you change the hard drive. Does this change matter? You could sample performance with the old and the new hard disk, and check for differences. Even if you find that the average with the new disk is lower, that does not mean the hard disk has an effect at all. Here enters null hypothesis testing, and it will give you a probability, not a definitive answer, along the lines of: there's a 90% probability that changing the hard drive has a concrete effect on the performance of your machine. Depending on this value, you could decide to upgrade the hard drives of all 10,000 machines in your server farm, or not.
  • Correlation is important to find out whether two entities "change alike". As the internet mantra "correlation is not causation" teaches, it should be taken with care. The fact that two random variables show correlation does not mean that one causes the other, nor that they are related through a third variable (which you are not measuring). They could simply behave in the same way. Look up pirates and global warming to understand the point. A correlation reports the possible presence of a signal; it does not report a finding.
  • Bayesian inference. We all know Bayesian spam filters, but there's more, and it's important to see how human decisions and mood can be influenced by a clear understanding of data analysis. Suppose someone goes for a medical checkup and the result says he/she has cancer. Fact is: most people at this point would think "I have cancer" without any doubt. That's wrong. A positive test for cancer moves your probability of having cancer from the baseline for the population (say, 12% of women develop breast cancer) to a higher value, which is not 100%. How high this number is depends on the accuracy of the test. If the test is lousy, you could just be a false positive. The more accurate the method, the higher the skew, but still not 100%. Of course, if multiple independent tests all confirm cancer, then it's very probably there, but still not 100%; maybe it's 99.999%. This is a point many people don't understand about Bayesian statistics (see the numeric sketch after this list).
  • Plotting methods. That's another thing that is always left unattended. Analysis of data does not mean anything if you cannot effectively convey what it means via a simple plot. Depending on what information you want to bring into focus, or the kind of data you have, you will prefer an xy plot, a histogram, a violin plot, etc. Each data insight has a preferred plot, exactly as each conversation has an appropriate wording.
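
Here is a small numeric sketch of the Bayesian point above. The 12% prevalence comes from the figure mentioned in the text; the test sensitivity and false positive rate are invented purely for illustration:

prior = 0.12           # P(cancer): baseline prevalence, as in the text
sensitivity = 0.90     # P(positive | cancer)     -- assumed, for illustration
false_positive = 0.07  # P(positive | no cancer)  -- assumed, for illustration

# Bayes' theorem: P(cancer | positive) = P(positive | cancer) * P(cancer) / P(positive)
p_positive = sensitivity * prior + false_positive * (1 - prior)
posterior = sensitivity * prior / p_positive

print("P(cancer | positive test) = %.2f" % posterior)   # about 0.64, far from 1.0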

Statistics enters our lives every time we have to distill an answer from, or compare, numerical (or reduced-to-numerical) data coming from unreliable sources: a signal from an instrument, a bunch of pages and the number of words they contain, and so on. Think for example of the algorithm that performs click detection on the iPhone. You are using a trembling, fat stylus (also known as a finger) to point at an icon which is much smaller than the stylus itself. Clearly, the hardware (a capacitive touchscreen) will send a bunch of data about the finger, plus a bunch of random noise from the environment. The driver must make sense of this mess and give you an x,y coordinate on the screen. That needs a lot of statistics.
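
As a toy illustration of that last point (a real driver is far more sophisticated, and every number below is invented), even estimating the intended touch point as an average of noisy samples is already statistics:

import random

true_x, true_y = 120.0, 340.0    # where the finger "really" is
samples = [(true_x + random.gauss(0, 8), true_y + random.gauss(0, 8))
           for _ in range(50)]   # noisy readings from the touchscreen

est_x = sum(x for x, _ in samples) / len(samples)
est_y = sum(y for _, y in samples) / len(samples)
print("estimated touch point: (%.1f, %.1f)" % (est_x, est_y))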

An additional issue is sampling. Sampling actually comes before statistical analysis: you collect a sample, reduce it to a number, and perform statistics on this number (among many others). Sampling is a fine and delicate art, and no statistics will correct, or even point out, incorrect sampling unless you act smart. Sampling introduces bias, either from the sampler, the sampling method, the analysis method, the nature of the sample, or the nature of nature itself. A good sampler knows these things and tries, as much as possible, to turn unwanted bias into a random distribution, so that it can be treated statistically.

As a closing remark, statistics is among the most powerful allies we have to understand the noisy universe we live in, but it's also a very dangerous, backstabbing enemy if not used properly. Willfully misusing it is definitely evil.


Using contexts in rdflib

I am playing with rdflib, a fantastic python library for handling RDF data. I had trouble understanding how to use contexts so as to partition my ConjunctiveGraph into independent subgraphs. Here is the code:

import rdflib
from rdflib.Graph import Graph

conj=rdflib.ConjunctiveGraph()

NS=rdflib.Namespace("http://example.com/#")
NS_CTX=rdflib.Namespace("http://example.com/context/#")

alice=NS.alice
bob=NS.bob
charlie=NS.charlie

pizza=NS.pizza
meat=NS.meat
chocolate=NS.chocolate

loves=NS.loves
hates=NS.hates
likes=NS.likes
dislikes=NS.dislikes

# binding a Graph to the ConjunctiveGraph's store with a context URI
# creates a named subgraph within the conjunctive graph
love_ctx=Graph(conj.store, NS_CTX.love)
food_ctx=Graph(conj.store, NS_CTX.food)

love_ctx.add( (alice, loves, bob) )
love_ctx.add( (alice, loves, charlie) )
love_ctx.add( (bob, hates, charlie) )
love_ctx.add( (charlie, loves, bob) )

food_ctx.add( (alice, likes, chocolate) )
food_ctx.add( (alice, likes, meat) )
food_ctx.add( (alice, dislikes, pizza) )

print "Full context"
for t in conj:
    print t

print ""
print "Contexts"
for c in conj.contexts():
    print c

print "love context"
for t in love_ctx:
    print t

print "food context"
for t in food_ctx:
    print t

Running this script will print the full graph (all the triples in the ConjunctiveGraph), but also the subgraphs assigned to each context.


Backups!

A friend of mine just reported a very tragic event to me: his bag, with his laptop and his backup hard drive, was stolen. I can feel his pain, having experienced the very same situation once, and I am writing down here some hints for him and the rest of the world on how to protect yourself from this eventuality.

Buy cheap. Computers can be expensive to buy again, but you can tolerate the loss of a 13-inch MacBook better than that of a 17-inch MacBook Pro. Of course, you should buy the minimum that satisfies your actual needs, but if you can survive without a feature, or buy it as an external, you can spend less money and therefore lose less in case of theft.

Encrypt everything. As our lives are more and more represented on digital media, it could be a complete and utter disaster if they fell into the wrong hands. Bank access, credit card numbers, all our online passwords, health documents, tax documents, non-open intellectual property, your MP3s, movies, family photos, software licenses, emails... in the hands of a thief? Use TrueCrypt to create an encrypted area on your media. As a key, you can use "something you know", like a passphrase, or "something you have", like a well-defined file acting as your key. I strongly advise against the latter: you can lose your keyfile, meaning that under no circumstances will you be able to access your data again. If the file is accidentally modified (e.g. it's an MP3 file and some music program changes the ID3 metadata), it is no longer a valid key. Ditto. A long passphrase uses your brain, which is less prone to corruption. Choose wisely: you will not use this passphrase often, so it must be something that you cannot forget under any circumstances. Anything chosen on the spur of the moment will be long forgotten by the next time you open the backup. Of course, it is strictly important that nobody (yes, I really mean it: nobody) knows, can spy on, or can guess your passphrase.

For movable storage (like SD cards, USB keys and solid state disks you use for everyday data handling), use TrueCrypt to create an encrypted disk image file on the storage device. Do not encrypt the whole card. Instead, leave a readable area, and put there a text file named "RETURN_LOST_DRIVE_DETAILS.TXT" containing something like "if found please contact me at me@example.com". Remember the TXT extension, so that Windows users are not intimidated by an unknown file (keep in mind, though, that "found" devices like these are a classic social engineering vector). Encryption cannot work for SD cards aimed at cameras or MP3 players, as the hardware device won't be able to access the encrypted area. Complain to your camera/player manufacturer to add TrueCrypt support (don't hold your breath). If you use the SD card as a disk for your computer, there's no problem.

Computer: use the Mac's native FileVault disk encryption, and turn off autologin. Alternatively, create a TrueCrypt volume and open it every time you log in, but remember this won't protect some parts of your personal life, such as your browser cache, from curious eyes. Using TrueCrypt has the advantage of being cross-platform: you can move your encrypted disk image and read it on Linux or Windows as well, while the Mac encrypted storage format can be opened only by a Mac (as far as I know).

Backup media: for optical media, remember that CDs and DVDs tend to become unreadable with time. It makes sense to refresh your backup media every 5 years. In this span of time, you will likely have a larger storage format available, allowing you to consolidate your data (e.g. from 5 CDs to 1 DVD). Of course, you then have a single point of failure in that DVD, so you have to make two exact copies of it. Use different brands for the two physical discs! This protects you from a faulty batch. Optical media in general don't become completely unreadable, but tend to develop areas of unreadability. If this happens, with some doctoring and the second disc you should be able to recreate the original archive completely. It's a good idea to store the md5 of the files you burn. This hint, however, is rather obsolete, as optical media are used less and less.

Backup hard drives: if you prefer to get rid of optical media completely and go with hard disks, my personal solution is to buy a small disk and encrypt it with TrueCrypt, whole. You are not supposed to lose backup disks, and even if you do, it should not be a tragedy. If you want to feel safer, use the text file trick, but it should not be needed for backup disks. Once you have the encrypted partition, put the contents of your computer onto the disk with the following script:

#!/bin/bash
SOURCE_DIRS="$HOME:/another/dir:/and/so/on"
TARGET_DIR="/Volumes/Backup/laptop/"

# if the external drive is not there, complain and stop
if [ ! -e "$TARGET_DIR" ]
then
 echo "Target directory does not exist!"
 exit 1
fi

# SOURCE_DIRS is colon-separated, so split on ':'
IFS=:
date=`date +%Y%m%d-%H%M%S`
pushd .
cd ~/
/usr/bin/rsync --backup --suffix="-backup-$date" --progress -av $SOURCE_DIRS "$TARGET_DIR"
popd

This script automatically backs up everything in your home directory (and any other listed directory) onto the backup disk. New files will be added. Old files you have modified since the last backup will be renamed to append the date, and the new, updated file will be stored alongside. This command never deletes anything from your backup disk; it only adds. Run this script with some frequency, as often as you feel is safe. If the backup disk breaks down or is stolen, you will still have the most recent files on your computer. If the computer is stolen, you can recover the files from the backup. You will have some cleanup to do, but still...

Sooner or later the disk will fill up. Buy a new, larger disk from a different company (again, to protect yourself from a faulty batch). Copy all the contents of your old backup disk onto the new one, and continue adding data to the new one from your computer. Eventually you can remove old, unused stuff from your computer, because you know that at least two copies exist: one on the first disk, one on the second. When the second disk fills up, repeat the procedure with a new, larger hard drive. Store all your backup drives in a safe, fire-resistant place. Banks specialize in this.

As you can see, older files get replicated more than recent ones, but even the most recent file is always present in at least two copies: one on the latest backup disk, one on the computer. Even if one of your backup disks breaks down in 10 years, you will have the same data duplicated on later disks.

Now, there are only two points left to address. The first one is to find the sweet spot in hard disk size that lets you fill a disk in approximately one year, so you can archive it and move on to a newer drive. This depends on your usage intensity. If you put large, transitory files on your computer, remember to delete them before running the backup. The same goes if you move large files around: they will be copied again (remember, the script never deletes anything). The second point concerns the most recent files on your computer: they are not backed up yet, so unless you run your script every day, you risk losing more than a day's worth of work. For this, you can keep a small USB key, whose contents you can trash at the next run of the script.

What about online backups? Well... do you trust them? If you do, go for it. I don't.

Summing up, the basic rules are:

  • Encrypt everything
  • Plan for the worst-case scenario
  • At least duplicate, better n-plicate
  • Keep live data and backups in two different locations
  • Tune the frequency of backups to the amount of work you are willing to redo

Pythonic Evolution - Part 3

You are welcome to take a look at Part 1 and Part 2 of this series.

In this third part of the "silicon-based" bacterial evolution series, we move to the real action. I developed a program (you can download it from here) which performs evolutionary selection based on mathematical criteria. The program has a set of rules to perform selection, but please note that most (if not all) of these rules are totally arbitrary. In this sense, it's nothing more than a toy, and I therefore warn against any scientific analysis of what we obtain. It's just for fun. Also, the code is pretty bad.

There are two classes in the program: Bacterium and Environment. A Bacterium can be fed, and its genetic code can mutate. The Environment is responsible for feeding the bacteria with the mathematical conditions, assessing whether they are worthy of survival, and reaping the unworthy. At every new generation, the Environment performs duplication for each bacterium in the pool, and applies mutations to the new entries. We try to keep the pool filled, so we always replenish it to 4000 bacteria at every generation, out of the available mutated ones. 4000 is also the limit for the number of bacteria simultaneously present. This is a very strong limitation, as it hinders the normal behavior of conventional bacteria: to multiply exponentially. Unfortunately, we don't want to spend weeks in calculations, and in any case the computer memory is limited, so we have to find a proper, if harsh, compromise.

Once the duplication/mutation step is done, the environment performs the selection (reaping). For each of the mathematical conditions, each bacterium is fed the input value. The returned value is then compared against what the environment expects. A score is obtained by summing the relative errors for each condition and then normalizing by the number of conditions. The closer this score is to zero, the better the response of the bacterium. Obtaining zero means that the bacterium's genetic code is able to respond to the conditions exactly. In the more general situation, the error will be a positive floating point number, which is compared against a tolerance. If the result exceeds the tolerance, the bacterium is killed; otherwise it survives and is allowed to reproduce in the next generation. An additional condition tries to favor those bacteria having a more compact genome: a random number between 0 and 50 is extracted, and if the value is greater than the length of the genome of the bacterium, it will survive; otherwise it will be killed. This prevents genomes longer than 50, and strongly favours short genomes, as it is more probable that a random extraction is greater than their length, granting survival.
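
As an illustration, here is a minimal sketch of this selection step, with hypothetical names (score, survives, bacterium.execute, bacterium.genome); the actual downloadable program differs in its details:

import random

def score(bacterium, conditions):
    # Mean relative error of the bacterium's output over all conditions.
    total = 0.0
    for given, expected in conditions:
        produced = bacterium.execute(given)   # run the genetic code on the input
        total += abs((produced - expected) / float(expected))
    return total / len(conditions)

def survives(bacterium, conditions, tolerance):
    if score(bacterium, conditions) > tolerance:
        return False                          # too far from the expected values
    # length pressure: a random draw in [0, 50] must exceed the genome length
    return random.uniform(0, 50) > len(bacterium.genome)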

In order to improve our genetic pool, the Environment defines an "epoch": a number of generations over which selection happens. During each generation of the epoch, the tolerance on the conditions progressively decreases, meaning that, as time passes, the environment becomes more and more selective.

Enough general chatting. Let's see an actual example. In the program you can download, the following conditions are set for the environment

e.setConditions([(0,7),(2,11),(4,15),(5,17)])

As you can see, the environment expects the bacteria to be able to solve the equation 2x+7. The line

e.epoch(1.0,0.0, 20)

starts an epoch made of 20 generations. During these 20 generations, the tolerance decreases from 1.0 to 0.0. In other words, at the end of the epoch the environment expects the bacteria in the pool to solve 2x+7 exactly for all the specified points (x=0, 2, 4, 5).

Suppose we have a bacterium in the pool with the following genetic code:

IncX -3             # a = n | x = -3 | y = 0
AddYtoA             # a = n | x = -3 | y = 0
Return              # returns a = n
LoadX 1             # Never reached
BranchXNotZero 2    #

As you can see, this bacterium will return the input value unchanged. This means that the result and relative error for each condition will be

Given  | Expected | Produced | abs(Relative Error)
0      | 7        | 0        | 7/7 = 1.0
2      | 11       | 2        | 9/11 = 0.81
4      | 15       | 4        | 11/15 = 0.73
5      | 17       | 5        | 12/17 = 0.70
---------------------------------------------------
                       Total = 3.24
                               / 4 (number of conditions)
                             = 0.81

This bacterium (as it is, assuming no mutations) will therefore survive the first two generations (where the tolerance is 1.0 and 0.9), but it will be killed in generation 3, when the tolerance reaches 0.8.

Let's see another example

MoveAtoY     # a = n   | x = 0 | y = n
IncA  3      # a = n+3 | x = 0 | y = n
MoveAtoY     # a = n+3 | x = 0 | y = n+3
Return       # returns a = n+3
LoadX 1      # Never reached

In this case, we get

Given  | Expected | Produced | abs(Relative Error)
0      | 7        | 3        | 4/7 = 0.57
2      | 11       | 5        | 6/11 = 0.54
4      | 15       | 7        | 8/15 = 0.53
5      | 17       | 8        | 9/17 = 0.53
---------------------------------------------------
                       Total = 2.17
                               / 4 (number of conditions)
                             = 0.54

Much better. Of course, the best would be something like this (I chose a long one, just for illustrative purposes):

MoveAtoY    # a = n    | x = 0 | y = n
IncX 3      # a = n    | x = 3 | y = n
AddYtoA     # a = 2n   | x = 3 | y = n
IncA 2      # a = 2n+2 | x = 3 | y = n
IncA 4      # a = 2n+6 | x = 3 | y = n
MoveAtoY    # a = 2n+6 | x = 3 | y = 2n+6
IncA 1      # a = 2n+7 | x = 3 | y = 2n+6
LoadY -3    # a = 2n+7 | x = 3 | y = -3
IncX -2     # a = 2n+7 | x = 1 | y = -3

This bacterium returns exactly what the environment expects. So does this one:

BranchXNotZero 2   # a = n    | x = 0 | y = 0 | no branch
LoadY 1            # a = n    | x = 0 | y = 1
MoveAtoY           # a = n    | x = 0 | y = n
IncA 3             # a = n+3  | x = 0 | y = n
AddYtoA            # a = 2n+3 | x = 0 | y = n
IncA 4             # a = 2n+7 | x = 0 | y = n

At the end of the run of the program, you will obtain bacteria like these. It is interesting to note that selection produced genetic code able to solve the equation outside its defined space of conditions: we never imposed the condition (10, 27), but these two bacteria are able to satisfy it.

In the current setup, I purposely introduced "noise" codons into the available genetic code. As you can see, the solution for 2n+7 can be obtained with a proper combination of MoveAtoY, IncA and AddYtoA. The remaining codons are not used, or if they are, their effect is neutral. You have two contrasting effects here: the need for the genetic code to be small (so as to maximize its chance of survival against the length selection) and the need to have buffer codons that can mutate without particular trouble, in particular if they come after the Return codon. The latter reduces the chance that a mutation will ruin the achieved functionality, making a bacterium with a long genetic code less sensitive to mutation.

Of course, you may be tempted to try similar cases. I can assure you that simple equations will be properly satisfied. However, if you try x^2, your bacteria will always die. Why? x^2 is a rather particular situation. First of all, there's an important codon which is not present: MoveAtoX. Once you have this codon, the space of genetic code combinations allows you to potentially obtain the solution. This is one I wrote by hand:

MoveAtoX            # a = n | x = n | y = 0
MoveAtoY            # a = n | x = n | y = n
LoadA 0             # a = 0 | x = n | y = n
BranchXZero 5       # exit when x reaches zero
AddYtoA             # a = a + n
IncX -1             # x = x - 1
BranchXNotZero -3   # repeat while x is not zero

Obtaining this result from evolution is hard. In some sense, we face the issue of so-called Irreducible Complexity, an argument raised to object to evolution. Indeed, this appears to be the case: the genetic code able to produce the square is irreducible. Either you take it as a whole, or you don't. In our toy program there's no "in-between" that satisfies the constraints and allows the generation of that code in steps. Although it appears to be a sound argument, there are many considerations to make on this point, which substantially weaken this position.

First, as I said, the program shown here is a toy. You cannot read too much into it as a proof. We are running on a very restricted set of bacteria. Even if statistically improbable, the creation of the above genetic code, when millions or even billions of bacteria are produced, suddenly becomes more plausible. Then selection does its job by granting it full survivability, and therefore takeover of the population.

Second: the rules of chemistry are slightly more flexible than the rules provided here. In this sense, this program represents a situation more akin to a ziggurat than to an Egyptian pyramid: the electronic interactions of molecules allow very refined, smooth and nuanced behavior, while our codons do not.

The third point is that we are assuming a single block of code must be able to produce a complex mathematical result. Biological systems do not work this way. Biological systems produce components, and make them interact. For example, a complex (but still pretty simple) biological process like the Krebs cycle is not performed by a single molecular übermachine. There are ten different enzymes involved in carbohydrate consumption, interacting together in the cell. Each enzyme is a small entity which performs a simple operation. Together, they network into a nice and refined mechanism. In other words, selection and evolution move to another level in real biology: not only the evolution of single components (enzymes) but also the evolution of their mutual interaction. In our toy program, we don't allow interaction of "subroutines", nor of bacteria. The very fact that we are made of a system of interacting cells, and not of one huge unicellular bacterium, is a hint that our case is very limited in its possibilities.

Fourth point: there is no chance for so-called "lateral transfer" among bacteria. In biological systems, DNA can be exchanged among bacteria, as it's universal and works in any host. Suppose that a very powerful enzymatic system would be obtained by the concerted presence of enzymes A, B, C and D. One organism happens to have enzymes A and B, but not C and D, because they normally would not evolve under its conditions. Another organism was able to evolve C and D to address its own environment. These two bacteria can come into contact and exchange their genetic material. Suddenly, both organisms have the whole set of A, B, C, and D. This would not have been possible without the universality of the genetic code. It's also not possible in our program.

There are of course many other points and issues to consider. I think I reached my goal of sharing a personal experiment, and I would like to close with some interesting links toward more evolved (no pun intended) software for simulating digital life forms.

Thanks for reading.


MySQL enumeration, strict mode and the troubles of debugging

Suppose you have a MySQL table containing an enum column, where the enumeration allows the values "FOO", "BAR" and "BAZ". This is a production database, used by quite a large number of programs that insert and delete rows in that table.

Now suppose that, for some reason, one of these programs tries to insert the value "HELLO" into the enumeration column. You would expect an error, wouldn't you? Unfortunately not in MySQL: when you try to insert a value not defined in the enumeration, it will instead insert a special value '' (the empty string), which is always present in any enumeration and is associated with the index 0. You just get a warning.

mysql> insert into test (my_enum) values ("FOO");
Query OK, 1 row affected (0.00 sec)
mysql> insert into test (my_enum) values ("HELLO");
Query OK, 1 row affected, 1 warning (0.00 sec)
mysql> select * from test;
+---------+
| my_enum |
+---------+
| FOO     |
|         |
+---------+
2 rows in set (0.00 sec)

This means that you will get unexpected entries in the column (values outside your enum). Moreover, you have no way of tracing back who performed the wrong insertion, and what value they tried to store. Imagine there's just a mistyped "FO" instead of "FOO" somewhere: it will silently produce an empty column entry.

So, what can you do if you really want the enum values enforced? You switch on strict mode, which, unfortunately, is server-wide. Not table-wide, not even database-wide. It's server-wide. Meaning that if your server is dealing with hundreds of databases, all of them will be under strict mode. Argh!

Glad to be proven wrong!



Pythonic Evolution - Part 2

This is the second part of a series of posts on evolution. You can find the first part here.

The last point in the first post concerned the requirements for evolution to happen. To recall, you need:

  1. An imperfect replicator: an entity able to produce a copy of itself, for example the DNA of the bacteria in the previous post. The replication mechanism must have a certain degree of imperfection, so that occasional mutations are possible, and consequently variability.
  2. Action of the replicator on the environment. For example, the DNA molecule, through a complex biological setup, is able to produce proteins which, according to the laws of chemistry, influence the external world: digesting a metabolite to obtain energy, or fighting off an aggression (UV light, in our example) so as to survive and reproduce.
  3. Variability of the replicator “phenotype” (which means the “final visible result” of the replicator). Small modifications (accidental or not) to the replicator will change its interaction with the external environment (for example, by allowing it to survive better under UV light).
  4. Environmental conditions: the environment will impose conditions, for example available metabolites, temperature, pressure, UV light irradiation, and these conditions can also change as time passes.
  5. Selective pressure: the interaction between the environmental conditions and the replicator phenotype will create a differentiation between replicators. Those coping better with the conditions will have a higher reproductive chance than the others, producing a drift toward better replicators (in our case, the bacteria were under selective pressure because the UV light was killing them, but the mutated ones were more likely to survive).

We offered the example of bacteria where a slight mutation gave a single organism a very tiny additional chance to survive and produce children. In this very simple model, the mutated (better fit) characteristic became predominant in approximately one thousand generations, not many when you consider that bacteria have replication times on the order of hours or days.

We closed the post by stating that this behavior works regardless of the nature of the replicator and the environment. It works in nature with the genetic code, compiled by ribosomes into proteins. It must work in a digital environment with digital bacteria too.

Four years ago, I developed a simple python program to show how "software bacteria" can evolve behavior to solve simple mathematical equations with no active human programming. The idea is as follows: a python program defines a "Bacterium" class, which represents our byte-based life form. It also defines an "Environment", where a bunch of Bacterium instances will live, prosper and hopefully reproduce. The environment provides "mathematical food" to the bacteria, and the better the bacteria process this food, the better for them, and the higher their chance to survive and reproduce.

In real bacteria, the genetic code is converted into proteins, and the proteins are responsible for processing metabolites (food) according to the laws of chemistry. In our example, we will make a simplification, skipping the protein step and having the mathematical food processed directly by our genetic code.

The mathematical food will be a simple integer number, and the genetic code will be a very simple language able to do sums (of signed numbers) and conditional branching. The result of the genetic code processing the food will be another integer number, a metabolized mathematical result. The environment will impose a condition such that those bacteria whose metabolized product has a particular characteristic (in our case, a particular resulting value) will have a higher chance of reproduction. The condition the environment imposes is the expected result of a simple mathematical equation. Ideally, at the end of this experiment, the evolved bacteria will possess genetic code able to compute that equation for any given input. In this sense, they will have evolved to process the food in the best possible way.

Confused? Don't worry. It will be clearer soon.

Here comes The Bacterium

Cellular systems are not that different from computer programs. There is a programming language, a compiler, input and output parameters. There's even memory.

In DNA, the genetic code is made out of four molecular letters, A G C T, forming three-letter words named codons. DNA is first copied to RNA. The RNA is then interpreted by an elaborate mechanism (the ribosome) able to translate each three-letter word into a specific amino acid (out of 20), used as a building block (think LEGO) to be put into a protein. The protein so created is a molecular machine which performs a chemical task, like digesting a metabolite, storing substances (e.g. iron), providing structural support, and so on. How the system got started is currently not known, but it takes one replicator molecule to start the process, in particular if this molecule has self-catalytic properties (meaning: it makes the creation of a copy of itself easier). Likely candidates for the trigger are volcanic activity, electric discharges (lightning), or panspermia. I tend to favor the electric discharge, coupled with catalysis by means of metals (found in the rocks). After all, the Miller-Urey experiment demonstrated that if you mix and boil and discharge electricity long enough, you will generate amino acids out of simple compounds like water, methane, nitrogen and ammonia, which were certainly present in the Earth's primordial atmosphere. So, at the moment it is not known what the replicator was or how it was produced, but however unlikely as an event, once started the process grows and is basically unstoppable, provided proper conditions are met.

Once you have amino acids, you have proteins. Biological systems evolved to use proteins because they are efficient and disposable, while the genetic code is less efficient but precious. After all, programmers keep their source code in a backed-up Subversion repository, while giving out compiled versions. Losing a compiled version is not a big issue, as you can always recreate it from the sources, but having your Subversion repository become corrupted means that your asset is totally lost, and could mean that your company is out of business. As already said, in our in-silico setup we skip the translation part for simplicity.

In our mathematical bacterium we have three variables that can be set: A, X and Y. Each of them has a different role and peculiarity. A is the import/export variable where the value representing the food is first inserted, then modified, and finally released to the environment. You can see it as the unique input/output parameter of our "genetic code function". X and Y are internal variables; you can make an analogy with the internal metabolites needed to support food consumption. The main difference between the two is that X is more of a counter: it can be set, incremented (or decremented) and tested for being zero, to decide whether to branch or not. Y is instead an auxiliary variable for processing A. Our bacterium has a genetic code which consists of 10 codons. Just as amino acids are able to perform various chemical operations on substrates by their own chemical nature, these codons can perform different mathematical or logical tasks. The codons are:

  • LoadA, LoadX, LoadY: each of them loads a specified value in the variable A, X or Y.
  • IncA, IncX: increments the value of the variable A or X of a given specified amount (can be negative)
  • MoveAtoY: copies the content of A into Y.
  • AddYtoA: performs the sum between the content of A and Y, and stores the result in A.
  • BranchXZero, BranchXNotZero: jumps a specified number of codons forward (or backwards) if the content of the variable X is zero or not-zero, respectively.
  • Return: terminates the execution of the genetic code.

Let's see some examples of genetic code that does something to the mathematical food.

This is the most trivial one

Return

A bacterium with this genetic code will take the metabolite (an input number), which is loaded automatically into A, and then finish immediately with no processing. The content of A is the result of the metabolic process. In this case, the bacterium returns what it eats: you give it 5, it returns 5. You give it 13, it returns 13.

A more interesting case is the following:

IncA 5
Return

The bacterium with this genetic code will, in the first instruction, add 5 to the content of A. The second statement will return whatever is contained in A. This bacterium eats 4 and returns 9, eats 13 and returns 18, and so on. You get the idea.
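
To make the execution model concrete, here is a minimal sketch of an interpreter for this genetic code. The codon names follow the list above; representing a genome as a list of (codon, argument) pairs, and the exact branch semantics, are my own assumptions, not necessarily what the actual program does:

def execute(genome, food, max_steps=1000):
    a, x, y = food, 0, 0          # A holds the food; X and Y start at zero
    pc = 0                        # index of the codon being executed
    steps = 0
    while pc < len(genome) and steps < max_steps:
        codon, arg = genome[pc]
        steps += 1
        if codon == "LoadA":
            a = arg
        elif codon == "LoadX":
            x = arg
        elif codon == "LoadY":
            y = arg
        elif codon == "IncA":
            a += arg
        elif codon == "IncX":
            x += arg
        elif codon == "MoveAtoY":
            y = a
        elif codon == "AddYtoA":
            a += y
        elif codon == "BranchXZero" and x == 0:
            pc += arg
            continue
        elif codon == "BranchXNotZero" and x != 0:
            pc += arg
            continue
        elif codon == "Return":
            break
        pc += 1
    return a                      # the metabolized result

print(execute([("IncA", 5), ("Return", None)], 4))   # eats 4, returns 9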

So, now you can imagine a population of bacteria, and imagine that the genetic code was created with a completely random process. For example, say that we create a population of 3000 bacteria with the following criteria:

  1. When you create each bacterium, you extract a random number of codons (from 2 to 50) which will be used to generate their genetic code.
  2. Given the number of codons for a specific bacterium, you extract that number of randomly chosen codons from the available pool (LoadA, LoadX, LoadY, IncA, IncX, MoveAtoY, AddYtoA, BranchXZero, BranchXNotZero, Return).
  3. For codons accepting a numeric value (LoadA, LoadX, LoadY, IncA, IncX, BranchXZero, BranchXNotZero), extract a random number from, say, -5 to +5 and use it as a numeric value.
  4. What you obtain is a bacterium whose genetic code is a random mess of a random number of random codons with random parameters.
  5. And of course you obtain a population of 3000 bacteria all with random genetic code.
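
A minimal sketch of this random creation step could look like the following (the codon names follow the post; everything else, including the genome representation, is an illustrative assumption):

import random

CODONS = ["LoadA", "LoadX", "LoadY", "IncA", "IncX", "MoveAtoY",
          "AddYtoA", "BranchXZero", "BranchXNotZero", "Return"]
WITH_ARGUMENT = {"LoadA", "LoadX", "LoadY", "IncA", "IncX",
                 "BranchXZero", "BranchXNotZero"}

def random_bacterium():
    length = random.randint(2, 50)                 # criterion 1
    genome = []
    for _ in range(length):                        # criteria 2 and 3
        codon = random.choice(CODONS)
        arg = random.randint(-5, 5) if codon in WITH_ARGUMENT else None
        genome.append((codon, arg))
    return genome

population = [random_bacterium() for _ in range(3000)]   # criterion 5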

If you feed a number (say 42) to each bacterium, you will expect many different results. Each bacterium will be fed the number 42 (which will be placed in A) and then its randomly generated set of operations will run. Nice, but not particularly useful.

But here the cool stuff begins. Suppose you decide: if the environment provides 42, those bacteria that produce a result close to 47 are more likely to survive, and those that produce a value very far from 47 are more likely to die. With this in mind, you start killing bacteria. Those that return exactly 47 will survive. Those that return 48 have a slight chance of dying, but not much. Those that return 0, or 500, will probably be killed immediately. Out of the starting 3000 bacteria, you will now have a troop of survivors (say 100) whose genetic code produces, by pure random chance, something quite near to the expected result (47) for the food value 42.

Now you allow these bacteria to reproduce. Of course, if you take the 100 survivors and produce exact clones to repopulate up to 3000, you will obtain no improvement. Here the "imperfect replication" kicks in: you allow a random number of mutations to occur in each bacterium before it duplicates. These mutations will change the genetic code, potentially creating a new program that produces something lethal (too far from 47), but sometimes also something with a better fit (something quite near to 47).

After this takes place, you allow the bacteria to replicate so as to restore the pool of 3000, and you apply selection again. You feed them 42 and you kill all the bacteria producing results too far from the expectation (47). New survivors, new mutations, new generation, and on and on you go.
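
Here is a self-contained toy of this select/mutate/replenish cycle. To keep it short it evolves plain integers instead of genetic code: each "bacterium" is simply the number it returns when fed 42, and every number below is invented for illustration:

import random

TARGET = 47
POOL = 3000

population = [random.randint(-100, 100) for _ in range(POOL)]   # random "genomes"

for generation in range(30):
    # selection: the further from 47, the more likely to be killed
    survivors = [b for b in population
                 if random.random() > abs(b - TARGET) / 100.0]
    if not survivors:                 # extremely unlikely, but keep the toy safe
        survivors = population
    # imperfect replication: clones with occasional small mutations
    population = []
    while len(population) < POOL:
        child = random.choice(survivors)
        if random.random() < 0.3:     # made-up mutation rate
            child += random.randint(-2, 2)
        population.append(child)

best = min(population, key=lambda b: abs(b - TARGET))
print("best response after 30 generations: %d" % best)   # typically 47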

As you can see, all the conditions for evolution are met:

  1. An imperfect replicator exists: it's our genetic code based on mathematical codons. Replication is imperfect because we apply random mutations to the genetic code at every new generation.
  2. Action of the replicator on the environment. The genetic code takes a number and processes it into another number.
  3. Variability of the replicator “phenotype”: modifications to the genetic code produce modifications in the final resulting value.
  4. Environmental conditions: The environment presents 42 and expects 47 as a good value indicative of a nice processing.
  5. Selective pressure: genetic code responding best to the environmental conditions will have a higher chance to survive and produce a new generation. Genetic code that is slightly less accurate will have a lower chance to survive, and genetic code producing values too far from what the environment considers a proper response will be killed.

In the next post, we will see how this mechanism has been implemented in a small python program, and we will see what happens in different cases.