
Is SVN slow?

I am using SVN to manage my current development repository. As the project grew, the operations became slower and slower. Things like updating or committing could require minutes.

I ran strace, and apparently SVN spends a lot of time walking the repository tree, probably checking for differences between the previous copy (in .svn) and the current one. It also walks the tree to remove all the lock files it created. These operations scale with the number of subdirectories involved, so if you don’t want to spend your time staring at the ceiling while committing or updating, either keep the number of subdirectories acceptably low, or invoke operations that involve only a few directories (for example, committing only a subtree).

My situation is maybe quite extreme, as my repository has 2000 directories (I have lots of small applications to keep track of). I could check out only part of the repository, or split it into smaller, independent ones, but I don’t want to take the risk: it could prove dangerous right now, when I have to work on other issues on a quite tight schedule.

Despite this, I like SVN, but I am considering taking a look at git.

Where are they?

I found a very interesting commentary by Nick Bostrom about the existence of extraterrestrial life and the so-called Fermi Paradox.
The point Nick Bostrom presents is sensible: the current evidence is that life is apparently not very frequent in the Universe. Despite all our efforts toward finding life, intelligent or not, we have failed. Moreover, human progress went from very low technology to space exploration in 10,000 years, a blink of an eye on the time scale of the Universe. There are pieces of our civilization travelling out there: Pioneer 10, the two Voyagers and more. In two or three hundred years, we could be able to manipulate matter so as to create self-assembling space probes to scout the galaxy, the Von Neumann probes. If an intelligent civilization exists or existed in the galaxy, we should be surrounded by Von Neumann probes, or at least able to receive some kind of signal, but this does not happen. Apparently, there has to be a filtering event that prevents human-level intelligent life from reaching a status where galaxy colonization could be started and self-maintained without further intervention.

This filtering event could be before our time, or in front of us. If the filtering is before our time, it must act as a showstopper for the development of life forms, meaning that life is rare, potentially unique even on a universe scale.

However, if we happen to find extraterrestrial life, for example on Mars, it would mean that the conditions needed to form life are rather loose. Life formation would not be uncommon at all, and we could expect to find it on any exoplanet with the right conditions, potentially a huge number in the galaxy. Therefore, to reconcile this with the experimental evidence of no space colonization despite the billions of years that have passed since the boot-up of the Universe, we would be forced to conclude that the filtering event is in front of us: mass extinctions have already happened in the past. Could one happen to humanity as well? Under this perspective, Nick Bostrom states that finding no evidence of life on Mars would be good news, as it would mean that the Great Filter is behind us, and there’s hope (not certainty) for a bright future. But what if we find something?

Are we ourselves the creators of the filtering event in front of us? Very soon, we will be able to manipulate DNA in its finest details, design and create nanomachines from scratch, or fully understand the processes governing our brain and body. Despite the groundbreaking nature of these discoveries, a single accident could wipe out our civilization entirely. It takes a match to start a fire.

Do we need, as Stephen Hawking says, to start colonizing other planetary systems right now, or face the consequences of the “all the eggs in one basket” situation we currently have? Should we accept the fact that our current technology, knowledge of closed biospheres and control of human psychology do not allow us to send a crew to Mars and bring them back? Should we just start sending people there with no chance of coming back? And in any case, once humanity gets there and a colony is started, how can people survive on a planet where water is apparently scarce and there is no breathable air?

So you want to write a book?

So you want to start writing a book? Cool. Here is my first suggestion: give up.

Still want to start writing a book? Second suggestion: give up.

Are you still convinced that you really want to start writing a book? Third suggestion: I’ll give you more suggestions, but remember that starting to write a book is not the same as writing a book. Between the two, there’s a book!


Code or encode? This is the problem.

I am proofreading a textbook and I spotted quite frequent use of the term “protein-encoding genes”. I reported it as an error, preferring “protein-coding genes”.

Why? There is a strong difference between the two words:

  • to encode means to perform an operation that transforms some information from one representational form to another. Something that performs the act of encoding is, for example, an encrypter or a compressor.
  • to code means to express information through a proper standard representation.

Saying “protein-encoding genes” literally means that the gene is an entity performing encoding on proteins, that is, transforming a protein from one representational form to another. Saying “protein-coding genes” instead literally means that the gene expresses information relative to a protein, which is the correct meaning.

We could arguably say that the ribosome performs decoding on an mRNA strand to transform its representation of information into a protein, assuming that, in the Central Dogma picture, the DNA is the encoded information and the protein is the decoded one.

Despite this, it looks like many biologists use the wrong wording as standard nomenclature. On Google, searching for “protein-coding genes” returns 442,000 results, while searching for “protein-encoding genes” returns 56,000: fewer, but still a relevant amount.

Box and whiskers plot. How?

I am trying to produce box and whiskers plots. Actually, not the plots themselves, but the values for the boxes, the whiskers and so on. An example of a box and whiskers plot is the following:

[box_and_whiskers.jpg: an example box and whiskers plot]

Browsing the net, I found that many sites just explain that a box plot is made according to the following recipe:

  1. Put the values in ascending order
  2. Find the median (Q2), the first quartile (Q1) and the third quartile (Q3)
  3. Make the box between the Q1 and Q3 values
  4. Put a line in the box where the median is
  5. Make the whiskers extend to the lowest and highest values

This approach, however, does not represent outliers. A better recipe can be found on Wikipedia:

  1. Put the values in ascending order
  2. Find the median (Q2), the first quartile (Q1) and the third quartile (Q3)
  3. Make the box between the Q1 and Q3 values
  4. Put a line in the box where the median is
  5. Calculate the Inter Quartile Range (IQR) as Q3 - Q1
  6. Find the lower fence and higher fence values as LowFence = Q1 - (1.5 * IQR) and HiFence = Q3 + (1.5 * IQR)
  7. Mark any value below LowFence or above HiFence as an outlier, and represent it with a circle
  8. Find the lowest value not marked as an outlier, and draw the low whisker at this value (NOT at the LowFence value)
  9. Find the highest value not marked as an outlier, and draw the high whisker at this value

As an example, the following set of 15 values in R

> a=c(2,2,3,3,3,4,4,5,6,6,6,6,8,12,13)

has Q2 = 5, Q1 = 3 and Q3 = 6:

> quantile(a)
  0%  25%  50%  75% 100%
   2    3    5    6   13

The IQR is therefore 6-3 = 3

> IQR(a)
[1] 3

The LowFence and HiFence values are therefore 3 - (3*1.5) = -1.5 and 6 + (3*1.5) = 10.5. Any value below LowFence (none) or above HiFence (12 and 13) is marked as an outlier. The whiskers are therefore delimited by 2 and 8 (the lowest and highest values in the dataset that are not outliers). The result is the plot you see above.
You can obtain the values for the box and whiskers as

> boxplot.stats(a)
$stats
[1] 2 3 5 6 8

$n
[1] 15

$conf
[1] 3.776137 6.223863

$out
[1] 12 13

Here, stats contains the relevant values for the box and whiskers, as computed by hand above, out is the set of outliers, n is the number of values in the dataset, and conf is a pair of values used to draw the notches, which are not displayed in the plot above.
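
For reference, here is a minimal Python sketch of the same recipe (it assumes numpy is available; the function name box_and_whiskers_stats is just illustrative). Note that numpy's default quartile interpolation can differ slightly from the hinges R uses, but for this dataset the results match boxplot.stats:

import numpy as np

def box_and_whiskers_stats(values, whisker_factor=1.5):
    # Steps 1-2: sort the values and find the quartiles
    data = np.sort(np.asarray(values, dtype=float))
    q1, q2, q3 = np.percentile(data, [25, 50, 75])
    # Steps 5-6: inter quartile range and fences
    iqr = q3 - q1
    low_fence = q1 - whisker_factor * iqr
    high_fence = q3 + whisker_factor * iqr
    # Step 7: outliers are the values outside the fences
    outliers = data[(data < low_fence) | (data > high_fence)]
    # Steps 8-9: whiskers go to the most extreme non-outlier values
    inside = data[(data >= low_fence) & (data <= high_fence)]
    return {"stats": [float(x) for x in (inside.min(), q1, q2, q3, inside.max())],
            "out": outliers.tolist()}

a = [2, 2, 3, 3, 3, 4, 4, 5, 6, 6, 6, 6, 8, 12, 13]
print(box_and_whiskers_stats(a))
# {'stats': [2.0, 3.0, 5.0, 6.0, 8.0], 'out': [12.0, 13.0]}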

Let’s burn water

So, apparently you can burn saltwater (1, 2, 3, 4) by irradiating it with radio-frequency waves, weakening the hydrogen-oxygen bonds until they break, and igniting the hydrogen and oxygen gas you obtain. The burning flame can then be used to produce electricity.

My proposal is therefore the following: let’s power the RF generator with the electricity produced by the flame, and let’s condense the water resulting from the hydrogen burning to replenish the tube. Free energy for everybody! After all, who needs the laws of thermodynamics?

Age of the universe

According to recent results from WMAP, the universe is 13.73 ± 0.12 billion years old. Moreover, ordinary matter accounts for less than 5% of the constituents of the universe (energy and matter), and the universe is practically flat, in the sense that its geometrical rules are those of Euclidean geometry.

This seems to provide even more convincing support for the Big Bang model, and the density parameter Ω very close to 1 seems to rule out the big crunch and so a cyclic existence of the universe.

Putting things into perspective with the geological clock, this means that the Earth formed roughly 9 billion years after the Big Bang.

Unraveling Unicode problems in Wikka

While I was setting up the wiki, I noticed some problems with non-English letters, such as ö. I therefore had to find out more about Unicode and encodings (a task that does not come up frequently if you program in languages such as Fortran, where you normally have other classes of problems to handle). I found this very interesting page on JoelOnSoftware, and another FAQ for Unix/Linux.

Basically, if I understood correctly, everything can be summed up to the following:

  • Unicode is a standard that defines how to handle all conceivable symbols (Roman letters, numbers, Japanese and Chinese ideograms, Arabic letters, etc.). Actually, there are both ISO 10646 and Unicode, and they are not the same: Unicode does a lot more. So Unicode is a superset of ISO 10646, but for this discussion we will talk about Unicode even if talking about ISO 10646 would suffice, simply because it is easier to type.
  • Unicode assigns to every symbol a number, the code point. For example, the letter A is assigned code point 0x41, and 食 (taberu), the Japanese character for the verb “to eat”, has code point 0x98DF. The code point set can be expanded to include any symbol, even those from invented languages (Tolkien’s Tengwar and Klingon are included, although unofficially).
  • The code point specifies an abstract symbol, not a way to draw that symbol. How this symbol is drawn is a task involving the font, which is used to produce a concrete glyph representation for each code point it covers.
  • To write code points into files or in memory, you have to use an encoding, basically a technique to represent the code point value in practice. There is an insane number of encodings, but they can be classified into two rough categories: fixed-length encodings and variable-length encodings.
  • In fixed-length encodings, every symbol requires a fixed number of bytes, say 4. This is the approach UTF-32 uses: every code point requires four bytes, always. This solution increases the space occupation four-fold even if you are only using plain English characters. Moreover, it is not backward compatible, so you cannot read a pure English text encoded in UTF-32 with cat. The advantage of this solution is that you know how much you have to skip to jump to the next or previous character (four bytes), and you know very easily how many characters you have: the occupied size divided by four.
  • In variable-length encodings, whose major representative is UTF-8, the number of bytes needed to encode a code point depends on the code point itself. For code points in the interval 0x00-0x7F, UTF-8 uses just one byte. Code points between 0x80 and 0x7FF need two bytes, code points between 0x800 and 0xFFFF require three bytes, and so on. With this approach, plain English text stays small and backward compatible, because each code point is represented with a single byte, mapped as in ASCII. Another important advantage is that the space occupation is more efficient. The disadvantage is that it is more difficult to move forward or backward by one symbol: in C, buf++ will not work, because you could jump into the middle of a two-byte encoded character. With a fixed-length encoding you know how many bytes you have to skip; with UTF-8 you have to parse the code point representation and then determine how many bytes to move. Normally, however, a proper API is provided to take care of these issues once and for all (see the short sketch after this list).
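
As a quick illustration, here is a tiny Python 3 sketch (in Python 3 a str is a sequence of code points, and encode() produces the corresponding UTF-8 bytes) showing how the characters used in this post fall into different length classes:

# Code points and their UTF-8 encodings
for ch in ["A", "ö", "食"]:
    utf8_bytes = ch.encode("utf-8")
    print("U+%04X -> %s (%d byte(s))" % (ord(ch), utf8_bytes.hex(), len(utf8_bytes)))

# U+0041 -> 41 (1 byte(s))
# U+00F6 -> c3b6 (2 byte(s))
# U+98DF -> e9a39f (3 byte(s))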

The problems I had with the ö letter arose from a wrong encoding assumption in Wikka: it declared the document as ISO-8859-1 encoded. ISO-8859-1 (also known as Latin-1) is basically a single-byte encoding like ASCII, where values from 0x80 to 0xFF map to most European symbols. However, the data I wrote was UTF-8 encoded, and this introduced a mismatch between the data and the decoding performed by the browser (in ISO-8859-1, as specified in the HTML). Since ö has code point U+00F6 (binary 11110110), it is encoded in UTF-8 as two bytes of the form 110yyyyy 10zzzzzz: 11000011 10110110, or 0xC3 0xB6. Each of these two bytes was interpreted as a single-byte ISO-8859-1 value, leading to a weird Ã¶ (capital A with tilde followed by a pilcrow sign) instead of an ö (o with umlaut).
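
The mismatch is easy to reproduce with a short Python 3 sketch: encode the character as UTF-8, then decode the resulting bytes as Latin-1.

# UTF-8 bytes decoded as ISO-8859-1 (Latin-1) reproduce the garbled output
text = "ö"                       # code point U+00F6
raw = text.encode("utf-8")       # the two bytes 0xC3 0xB6
print(raw)                       # b'\xc3\xb6'
print(raw.decode("latin-1"))     # Ã¶  <- what the browser displayed
print(raw.decode("utf-8"))       # ö   <- what it should display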

To solve the problem, I changed the default meta tag content in Wikka, from

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

to

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

and now it works correctly.


Multithreaded testing

Suppose you have to perform tests. Lots of tests. Functional tests, where each test could take a lot of time.
Suppose also that

  • you don’t want to wait three days until your tests are done
  • you have a massively parallel architecture available
  • you are using the Python unittest framework
  • you enjoy colors

What would you do?

You install testoob. This thing is life-changing, believe me.
So, let’s see an example. Suppose you have this test:

import unittest
import time
class MyTestCase(unittest.TestCase):
    def testFoo(self):
        # simulate a slow functional test (3 seconds)
        time.sleep(3)
        self.assertEqual(0,0)
    def testBar(self):
        # simulate an even slower functional test (5 seconds)
        time.sleep(5)
        self.assertEqual(0,0)

if __name__ == '__main__':
    unittest.main()

If you run it, the whole testcase will take 8 seconds.

..
----------------------------------------------------------------------
Ran 2 tests in 8.000s

OK

But if you install testoob, you now have a nice executable:

stefano$ testoob test.py
..
----------------------------------------------------------------------
Ran 2 tests in 8.001s
OK

Here is the magic: run it with the option --threads=2 and the result is served in just 5 seconds:

stefano$ testoob --threads=2 test.py
..
----------------------------------------------------------------------
Ran 2 tests in 5.001s
OK

OK, but what about the colors? Well, I like testing suites that print something green for every successful test. testoob does it, so testing becomes way more fun!

Testoob does a lot more. If you feel limited by the current unittest module, you should definitely consider taking a look at testoob.

Lines of code

First, a disclaimer: I just want to play with numbers in this post. Lines of code (LOC) do not mean anything in terms of productivity. They can at best be used as a rough estimate of the complexity of a project. Mac OS X counts more than 80 million LOC, the Linux kernel a little more than 5 million, OpenOffice is around 1 million, and GIMP is 650,000 LOC (values from Wikipedia). In other words, a huge project like an operating system and its environment goes into the tens of millions. A large application could still be around hundreds of thousands.

I tried to evaluate the maximum number of lines of code a human being could produce in the best-case scenario. It turns out to be less than 4000 per day, assuming an average typist (240 characters per minute), lines of 30 characters, and 8 hours of work.

We are of course in a very idealized and practically unrealistic case: no thinking, just typing, eight hours straight, with no breaks. The Linux kernel could be done by this typist in slightly more than 3.5 years. It actually took 15 years, with a huge number of people involved.
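
As a quick sanity check of the arithmetic, here is a throwaway Python sketch with the same figures used above:

# Upper bound for a non-stop "ideal typist"
chars_per_minute = 240
chars_per_line = 30
hours_per_day = 8

lines_per_day = chars_per_minute * 60 * hours_per_day / chars_per_line
print(lines_per_day)                    # 3840.0 lines per day

# Time to type the Linux kernel (~5 million LOC) at that pace
linux_loc = 5_000_000
print(linux_loc / lines_per_day / 365)  # about 3.57 years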

It turns out, however, that the number of lines of code you can actually write is way lower. Unsurprisingly, I must say. On a small project I did recently, I wrote 2000 lines of code (logic + tests) in 40 hours of work. This means 50 lines of code per hour, or 400 lines per day, a factor of 10 lower than the (in)human limit. If you also account for the fact that many lines were modified, debugged and so on, you can see how raw line counting does not reflect productivity. Does my productivity decrease if I refactor the code, removing clutter and therefore reducing the number of lines? Obviously not.

On another, longer (and more complex) project I took part in, I wrote 9200 lines of code (logic + tests) in 45 days of work, meaning approximately 200 lines of code per day. Testing code, in my experience, equals the logic code in terms of amount.

In other words, taking my cases as a rough guideline, I would expect to produce an average of 300 lines of code per day. Of these 300, only 150 are logic; the other 150 are tests.

This leads also to these rough estimates for a one-man-army approach:

  • A small project of 1000 lines of (logic) code can be done in one or two weeks.
  • An average-complexity project of 10,000 lines of (logic) code could be completed in two or three months.
  • A large project of 100,000 lines of (logic) code requires at least a couple of years.

Note that 100,000 lines of code is still only 1/6th of GIMP’s complexity.

