Box and Whiskers plot. How?

I am trying to produce box and whiskers plots. Actually, not the plot itself, but the values for the boxes, the whiskers and so on. An example of a box and whiskers plot is the following:

[image: box and whiskers plot of the example dataset]

Browsing the net, I found that many sites just explain that a box plot is made according to the following recipe:

  1. Put the values in ascending order
  2. Find the median (Q2), the first quartile (Q1) and the third quartile (Q3)
  3. Make the box between the Q1 and Q3 values
  4. Put a line in the box where the median is
  5. Make the whiskers including the lowest and highest values

This approach however does not represent outliers. A better recipe can be found on Wikipedia (a code sketch follows the list):

  1. Put the values in ascending order
  2. Find the median (Q2), the first quartile (Q1) and the third quartile (Q3)
  3. Make the box between the Q1 and Q3 values
  4. Put a line in the box where the median is
  5. Calculate the Inter Quartile Range (IQR) as Q3 - Q1
  6. Find the lower and upper fence values as LowFence = Q1 - (1.5 * IQR) and HiFence = Q3 + (1.5 * IQR)
  7. Mark any value below LowFence or above HiFence as an outlier, and represent it with a circle
  8. Find the lowest value not marked as an outlier, and end the low whisker at this value (NOT at the LowFence value)
  9. Find the highest value not marked as an outlier, and end the high whisker at this value
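
To make the recipe concrete, here is a minimal Python sketch (assuming numpy is available; numpy's default quantile interpolation matches R's quantile() on this example, though R's boxplot.stats() uses Tukey's hinges, which can differ slightly on other datasets):

import numpy as np

def boxplot_values(values):
    a = np.sort(np.asarray(values, dtype=float))       # step 1: ascending order
    q1, q2, q3 = np.percentile(a, [25, 50, 75])        # step 2: quartiles
    iqr = q3 - q1                                      # step 5: interquartile range
    low_fence = q1 - 1.5 * iqr                         # step 6: fences
    hi_fence = q3 + 1.5 * iqr
    outliers = a[(a < low_fence) | (a > hi_fence)]     # step 7: outliers
    inside = a[(a >= low_fence) & (a <= hi_fence)]
    return inside.min(), q1, q2, q3, inside.max(), outliers   # steps 8-9: whiskers

low, q1, q2, q3, high, out = boxplot_values([2, 2, 3, 3, 3, 4, 4, 5, 6, 6, 6, 6, 8, 12, 13])
print(low, q1, q2, q3, high, out)   # 2.0 3.0 5.0 6.0 8.0 [12. 13.]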

As an example, the following set of 15 values in R

> a=c(2,2,3,3,3,4,4,5,6,6,6,6,8,12,13)

has Q2 = 5, Q1 = 3 and Q3 = 6

> quantile(a)
  0%  25%  50%  75% 100%
   2    3    5    6   13

The IQR is therefore 6-3 = 3

> IQR(a)
[1] 3

The LowFence and HiFence values are therefore 3 - (3 * 1.5) = -1.5 and 6 + (3 * 1.5) = 10.5. Any value below LowFence (none) or above HiFence (12 and 13) is marked as an outlier. The whiskers are therefore delimited by 2 and 8, the lowest and highest values in the dataset that are not outliers. The result is the plot you see above. You can obtain the values for the box and whiskers as

> boxplot.stats(a)
$stats
[1] 2 3 5 6 8

$n
[1] 15

$conf
[1] 3.776137 6.223863

$out
[1] 12 13

Here stats contains the relevant values for the box and whiskers, as computed by hand above, out is the set of outliers, n is the number of values in the dataset, and conf holds the lower and upper extremes of the notch, computed as median ± 1.58 * IQR / sqrt(n), a rough confidence interval around the median, not displayed in the plot above.


Unraveling Unicode problems in WikkaWiki

While I was setting up the wiki, I noticed some problems with non-English letters, such as ö. I therefore had to find out more about Unicode and encodings (a task that does not come up frequently if you program in languages such as Fortran, where you normally have a different class of problems to handle). I found this very interesting page on JoelOnSoftware, and another FAQ for Unix/Linux.

Basically, if I understood correctly, everything can be summed up as follows:

  • Unicode is a standard that defines how to handle all conceivable symbols (Roman letters, numbers, Japanese and Chinese ideograms, Arabic letters, etc.). Actually, there are both ISO 10646 and Unicode, and they are not the same: Unicode does a lot more, so it is a superset of ISO 10646. For our discussion we will talk about Unicode even where ISO 10646 would suffice, just because it is easier to type.
  • Unicode assigns every symbol a number, its code point. For example, the letter A is assigned code point 0x41, and 食 (taberu), the Japanese character for the verb "to eat", has code point 0x98DF. The code point set can eventually be expanded to include any symbol, even those from invented languages (although unofficially, Tolkien's Tengwar and Klingon are included).
  • The code point specifies an abstract symbol, not a way to draw that symbol. How this symbol is drawn is a task involving the font, which is used to produce a concrete glyph representation for each code point it covers.
  • To write code points into files or memory, you have to use an encoding, basically a technique to represent the code point value in practice. There is an insane number of encodings, but they can be classified into two rough categories: fixed length encodings and variable length encodings.
  • In fixed length encodings, every symbol requires a fixed number of bytes, say 4. This is the approach UTF-32 uses: every code point takes four bytes, always. This solution quadruples the space occupation even if you are only using plain English characters. Moreover, it is not backward compatible, so you cannot read a pure English text encoded in UTF-32 with cat. The advantage of this solution is that you know how much to skip to jump to the next or previous character (four bytes), and you can very easily tell how many characters you have: the occupied size divided by four.
  • In variable length encodings, whose major representative is UTF-8, the number of bytes needed to encode a code point depends on the code point itself. For code points in the interval 0x0-0x7F, UTF-8 uses just one byte. Code points between 0x80 and 0x7FF need two bytes, and code points between 0x800 and 0xFFFF (excluding the surrogate range) need three, and so on (as shown below). With this approach, plain English text stays small and backward compatible, because each code point is represented with a single byte, mapped exactly as in ASCII. Another important advantage is that the space occupation is more efficient. The disadvantage is that it is harder to move forward or backward one symbol: in C, buf++ will not work, because you could land in the middle of a multi-byte character. With a fixed length encoding you know how many bytes to skip; with UTF-8 you have to parse the code point representation to determine how many bytes to move. Normally, however, a proper API is provided to take care of these issues once and for all.
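
The byte counts are easy to verify with Python, whose strings are sequences of code points (a minimal sketch):

for ch in ("A", "ö", "食"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex()} ({len(encoded)} byte(s))")
# U+0041 -> 41 (1 byte(s))
# U+00F6 -> c3b6 (2 byte(s))
# U+98DF -> e9a39f (3 byte(s))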

The problems I had with the ö letter arose from a wrong encoding assumption in WikkaWiki: it declared the document as iso-8859-1 encoded. ISO-8859-1 (also known as Latin-1) is basically a single-byte encoding like ASCII, where values from 0x80 to 0xFF map to most European symbols. However, the data I wrote was UTF-8 encoded, and this introduced a mismatch between the data and the decoding performed by the browser (in ISO-8859-1, as specified in the HTML). Since ö has code point U+00F6 (binary 11110110), it is encoded in UTF-8 as two bytes of the form 110yyyyy 10zzzzzz: 11000011 10110110, or 0xC3 0xB6. Each of these two bytes was interpreted as a single-byte ISO-8859-1 value, leading to a weird Ã¶ (capital A tilde and pilcrow end paragraph) instead of ö (o with umlaut).
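
The mismatch is easy to reproduce (a sketch in Python, an assumption on my part, not what WikkaWiki actually runs):

text = "ö"                        # code point U+00F6
raw = text.encode("utf-8")        # b'\xc3\xb6'
print(raw.decode("iso-8859-1"))   # prints 'Ã¶': each byte decoded on its own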

To solve it, I changed the default meta tag content in WikkaWiki, from

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

to

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

and now it works correctly.


Multithreaded testing

Suppose you have to perform tests. Lots of tests. Functional tests, where each test could take a lot of time. Suppose also that

  • you don't want to wait three days until your tests are done
  • you have a massively parallel architecture available
  • you are using the Python unittest framework
  • you enjoy colors

What would you do?

You install testoob. This thing is life-changing, believe me. So, let's see an example. Suppose you have this test:

import unittest
import time
class MyTestCase(unittest.TestCase):
    def testFoo(self):
        time.sleep(3)
        self.assertEqual(0,0)
    def testBar(self):
        time.sleep(5)
        self.assertEqual(0,0)

if __name__ == '__main__':
    unittest.main()

If you run it, the whole test case will take 8 seconds:

..
----------------------------------------------------------------------
Ran 2 tests in 8.000s

OK

But once you install testoob, you have a nice executable:

stefano$ testoob test.py
..
----------------------------------------------------------------------
Ran 2 tests in 8.001s
OK

Here is the magic: run it with the option --threads=2 and the two tests run concurrently, so the whole run takes only as long as the slowest test, just 5 seconds:

stefano$ testoob --threads=2 test.py
..
----------------------------------------------------------------------
Ran 2 tests in 5.001s
OK

OK, but what about the colors? Well, I like testing suites that print something green for every successful test. testoob does, and it makes testing way more fun!

Testoob does a lot more. If you feel limited by the current unittest module, you should definitely consider taking a look at testoob.


Lines of code

First, a disclaimer: I just want to play with numbers in this post. Lines of code (LOC) do not mean anything in terms of productivity. They can at most be used as a rough estimate of the complexity of a project. MacOSX counts more than 80 million LOC, the Linux kernel a little more than 5 million, OpenOffice is around 1 million, and GIMP is 650,000 LOC (values from Wikipedia). In other words, a huge project like an operating system and its environment goes into the tens of millions. A large application could still be around hundreds of thousands.

I tried to evaluate the maximum number of lines of code a human being could produce in the best case scenario. It turns out: less than 4000 per day, assuming an average typist (240 characters per minute), lines of 30 characters and 8 hours of work: 240 × 60 × 8 / 30 = 3840 lines.

We are of course in a very idealized and practically unrealistic case: no thinking, just typing, eight hours straight, with no breaks. The Linux kernel could be typed by this typist in slightly more than 3.5 years. It actually took 15, with a huge number of people involved.

It turns out, however, that the number of lines of code you can actually write is way lower. Unsurprisingly, I must say. On a small project I did recently, I wrote 2000 lines of code (logic + test) in 40 hours of work. This means 50 lines of code per hour, or 400 lines per day, a factor of 10 lower than the (in)human limit. If you also counted the lines that were modified, debugged and so on, you would see how raw line counting does not reflect productivity. Does my productivity decrease if I refactor the code, removing clutter and therefore reducing the lines of code? Obviously not.

On another, longer (and more complex) project I took part in, I wrote 9200 lines of code (logic + test) in 45 days of work, meaning approximately 200 lines of code per day. Test code, in my experience, equals the logic code in amount.

In other words, taking my cases as a rough guideline, I would expect to produce an average of 300 lines of code per day. Of these 300, only 150 are logic; the other 150 are testing.

This also leads to these rough estimates for a one-man-army approach (a quick arithmetic check follows the list):

  • A small project of 1000 lines of (logic) code can be done in one or two weeks
  • An average complexity project of 10,000 lines of (logic) code could be completed in two or three months
  • A large project of 100,000 lines of (logic) code requires at least a couple of years
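
A minimal sketch of the arithmetic, assuming 150 logic lines per day as estimated above and 21 working days per month:

LOGIC_LINES_PER_DAY = 150

for logic_lines in (1000, 10000, 100000):
    days = logic_lines / LOGIC_LINES_PER_DAY
    print(f"{logic_lines:>6} logic lines: ~{days:.0f} working days (~{days / 21:.1f} months)")
#   1000 logic lines: ~7 working days (~0.3 months)
#  10000 logic lines: ~67 working days (~3.2 months)
# 100000 logic lines: ~667 working days (~31.7 months)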

Note that 100,000 lines of code is still about one sixth of the GIMP's size.



XML Namespace URL (updated)

I am attending the EMBRACE workshop on SOAP clients. A very interesting workshop, I would say, and I'll write more about it when it's finished. During the workshop, the usual issue with XML namespaces was pointed out: the attribute value looks like a URL, but it does not refer to anything in particular. An example from a SOAP envelope:

<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://www.w3.org/2001/12/soap-envelope"
soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding">
</soap:Envelope>

Technically, the attribute value is a string that identifies the namespace. The choice to use a URL stems from the fact that the DNS system guarantees uniqueness and authoritativeness, so if you define a new namespace, using your own domain (sort of) guarantees it to be unique. However, there is no consensus about what this URL should resolve to. In some cases, it refers to the XML schema. In others to a DTD, or to a stylesheet, or more frequently to nothing. You can get a picture of the situation from these articles 1, 2, and 3. The first two articles, in particular, advocate the use of RDDL to solve the problem. Given the ambiguity about what to put there, the answer with RDDL is: none of the above. Instead, provide a RDDL document that says where to find each of them (if provided). Not a bad idea.

Waiting for the community to decide what to put at that address, my personal choice went toward something still standard but deliberately not confusing: I put a UUID URN.

xmlns:foo="urn:uuid:212e2ac7-dc35-4112-ae86-cefd26abb856"

which is valid according to the standard (you can put any URN), is unique, and at least does not pretend to reference something. You can generate one with uuidgen.
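
If uuidgen is not at hand, Python's standard library produces the same kind of URN (a minimal sketch):

import uuid

# A random (version 4) UUID, formatted as a URN suitable for a namespace name.
print(uuid.uuid4().urn)   # e.g. urn:uuid:212e2ac7-dc35-4112-ae86-cefd26abb856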

An objection to this approach is that a UUID is not easy to remember, so you have to copy and paste it every time. Well, the URL approach has more or less the same issue: namespacing does not work correctly unless you specify the namespace URL exactly, so you end up copying and pasting it anyway.

Update: Just today, the W3C released an interesting plea to developers to limit the traffic at w3.org. Apparently, they get an insane amount of traffic, due to attempts by various software around the net to fetch the documents referred to by the addresses in DTDs and namespaces. This is another indication that you should be very careful about putting a URL there, in particular if your format becomes very popular and you don't have big pipes to handle the traffic.


Change MySQL config file

It seems trivial, and indeed it is, to point the mysql client at a configuration file other than the standard .my.cnf. This is generally needed when you don't want the password to be visible in the process list, but at the same time you don't want to use the standard .my.cnf file.

Despite being trivial, I was not able to figure out which option was the correct one, and for some strange reason Google didn't help, so this post is mainly here to fix that.

So basically you can specify a different file with the option --defaults-file=filename. Alternatively, you can add a file to the list with the option --defaults-extra-file=filename; in this case, the file will not take precedence over the already existing my.cnf files. Note that mysql expects these options to come first on the command line.
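
For example (the path and database name here are made up):

stefano$ mysql --defaults-file=/home/stefano/.my-other.cnf mydatabase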


MacOSX Leopard extended ls

Apparently, something changed in the ls command with the release of Leopard. I don't remember seeing this kind of report on Tiger, although it looks like the features have existed for a long time. Now, take this information with a grain of salt, as I am not a MacOSX expert, just a (very busy) occasional tinkerer on this OS.

drwxr-xr-x   5 stefano  stefano     170  9 Ott 10:11 Programs
drwxr-xr-x+  4 stefano  stefano     136  1 Ago 01:48 Public
-rw-r--r--@  1 stefano  stefano  221814 20 Ott 20:12 executable network.eps

This is an extract of an ls run in my home directory. As you can see, the long -l option can produce an additional character in the file mode field: an "at symbol" or a "plus symbol".

According to the manual page, the plus symbol means that the file has extended security information, namely ACLs. It is possible to print out this information with the -e option of the ls command.

stefano:~ stefano$ ls -led Public/
drwxr-xr-x+ 4 stefano  stefano  136  1 Ago 01:48 Public/
 0: group:everyone deny delete

Of course, you can also change the ACL information using chmod

stefano:~ stefano$ chmod +ai "guest allow write" Public/
stefano:~ stefano$ ls -le Public/
total 0
drwxr-xr-x+ 4 stefano  stefano  136  1 Ago 01:48 Public/
 0: group:everyone deny delete
 1: user:Guest inherited allow add_file

And you can remove the ACLs

stefano:~ stefano$ chmod -a# 1 Public
stefano:~ stefano$ chmod -a# 0 Public
stefano:~ stefano$ ls -led Public/
drwxr-xr-x  4 stefano  stefano  136  1 Ago 01:48 Public/

As you can see, the plus symbol disappears. Ideally, you can also make these changes from the Finder's Information window. However, trying to do so led to a crash of the Finder on my machine; I don't know whether this is due to my setup or to an actual bug.

Then, what about the at symbol? This feature is undocumented, but if you add the -@ option to ls -l you obtain its meaning:

stefano:~ stefano$ ls -l@ executable\ network.eps
-rw-r--r--@ 1 stefano  stefano  221814 20 Ott 20:12 executable network.eps
    com.apple.FinderInfo    32
    com.apple.ResourceFork 547102

So, apparently there are extended attributes, and this is the meaning of the at symbol. The first column is the name of the extended attribute; the second is the size of its contents.

You can peek into the contents of the attributes with the command xattr:

stefano:~ stefano$ xattr -p com.apple.ResourceFork executable\ network.eps >unknown_content
stefano:~ stefano$ file unknown_content
unknown_content: MS Windows icon resource

Apparently, the resource fork contains an icon file. ResKnife also confirms it. If I understood correctly, the old file/..namedfork/rsrc resource fork is mapped to the com.apple.ResourceFork attribute, but you are welcome to correct me in the comments. I think I should read this book for a better understanding.



Cairo PostScript rendering

I am trying to render vector graphics with alpha blending. Too bad PostScript, even Level 3, does not support alpha blending. This means that if you want to draw two objects (say, circles) in PostScript and make them look alpha blended, you have to resort to "some" trickery. An example:

  • draw the first, then the second object (which will overlap and overwrite the first) with proper colors (considering the background color)
  • calculate the intersection between the two areas
  • calculate the blended color, considering the transparency level of the two and the background color
  • draw the intersection part with the blended color

This of course gets more and more complex as more objects overlap along the z-axis (say, a line, a circle, two rectangles and a spline curve). For all of them you have to consider the stacking, how they overlap, decompose the overlapping parts properly, compute the proper color of each intersecting region, and draw all the parts in the proper order (so they stack properly when rendered). This is not trivial.

For this reason, I was considering Cairo, assuming that it solved this problem. After a little tinkering I found out that even Cairo gives up on rendering PostScript vector graphics with alpha blending. For example, this code draws non-blended circles (alpha = 1.0), producing vector-based PostScript:

#include <math.h>
#include <cairo.h>
#include <cairo-ps.h>

int main (int argc, char *argv[]) {
        cairo_surface_t *surface;
        cairo_t *cr;
        char *filename = "image.ps";

        surface = (cairo_surface_t *)cairo_ps_surface_create (filename, 80.0, 80.0);
        cr = cairo_create (surface);

        /* first circle: opaque red (alpha = 1.0) */
        cairo_set_source_rgba (cr, 1.0, 0.0, 0.0, 1.0);
        cairo_set_line_width (cr, 1.0);
        cairo_set_fill_rule (cr, CAIRO_FILL_RULE_WINDING);

        cairo_arc (cr, 20.0, 20.0, 10.0, 0.0, 2*M_PI);
        cairo_fill (cr);
        cairo_stroke (cr);

        /* second circle: opaque blue */
        cairo_set_source_rgba (cr, 0.0, 0.0, 1.0, 1.0);
        cairo_arc (cr, 30.0, 20.0, 10.0, 0.0, 2*M_PI);
        cairo_fill (cr);
        cairo_stroke (cr);

        /* third circle: opaque green */
        cairo_set_source_rgba (cr, 0.0, 1.0, 0.0, 1.0);
        cairo_arc (cr, 25.0, 28.6, 10.0, 0.0, 2*M_PI);
        cairo_fill (cr);
        cairo_stroke (cr);

        cairo_destroy (cr);
        cairo_surface_destroy (surface);

        return 0;
}

However, if you replace

cairo_set_source_rgba (cr, 1.0, 0.0, 0.0, 1.0);

with

cairo_set_source_rgba (cr, 1.0, 0.0, 0.0, 0.5);

thus creating the need for alpha blending, things change. The final PostScript file contains a rasterized image, not a vector representation. If you render the same drawing using the PDF backend, the output contains vector-based graphics with blending. I'm not really into the PDF format, but this hints at native alpha blending support.

The picture shows the three different cases. From left to right: PostScript backend without alpha blending, PostScript backend with alpha blending, and PDF backend with alpha blending.

[image: the three renderings, from left to right as described above]

The PDF output was obtained by changing the line

surface = (cairo_surface_t *)cairo_ps_surface_create (filename, 80.0, 80.0);

to

surface = (cairo_surface_t *)cairo_pdf_surface_create (filename, 80.0, 80.0);

and of course changing the file name accordingly (the PDF surface also needs the cairo-pdf.h header). Please note that both PostScript images were converted to PDF using ps2pdf before being displayed: due to some setup problems on my Mac, I am not able to open PostScript files directly (Preview.app converts them to PDF anyway). This however does not affect the conclusion, which is also confirmed by peeking into the generated PostScript files with an editor.

The raster, rather low quality nature of the PostScript with alpha blending is evident. Apparently, Cairo renders what looks like a JPEG image and then includes it via a DataSource (see the PostScript reference manual, Section 4.10).

Concluding, from my findings it looks like Cairo is not able to produce a PostScript vector representation containing alpha blending. It must resort to a raster representation, therefore losing all the advantages of a vector format.

Despite this, I think Cairo is a very nice and powerful library to render vector graphics in a programmatic way. Kudos to the implementors!


Doing nothing in bash

Today I had to "do nothing" in bash. In Python you have "pass" for this task. In C, you can use ";". I found this post from someone having the same issue; he proposes either "sleep 0" or "A=0". However, it looks like the following is even nicer. From the bash manual:

: [arguments]
No effect; the command does nothing beyond expanding arguments and performing any specified redirections. A zero exit code is returned.

So, a simple ":" will do the trick.
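
For example, as a placeholder in a branch that is not written yet (the file path here is made up):

if [ -f /tmp/some_file ]; then
    :   # nothing to do in this case, yet
else
    echo "missing"
fi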


Table namespacing in databases

I was trying to find out whether some database implements namespacing for tables, but it looks like Google produces no useful results for it. This is strange. In MySQL, tables are namespaced by a "database qualification" (as in dbname.tablename), but it is not possible, as far as I can see, to define dbname.namespacename.tablename. This would allow grouping related tables into the same namespace, while still keeping all the tables in the same database. This feature could also be useful during refactoring or when handling different table versions.

I wonder about the rationale behind the lack of this feature. Or maybe the feature exists but the name is different?

As a fix for the task at hand, I will use underscores as separators. This forces me to use camel case for namespace and table names (e.g. NameSpace_TableName). Not a perfect solution, but at least it documents itself.