Random number generator hardware

Suppose you want to generate a large number of random dice rolls for a computer program. How do you do it? With a robotic dice-rolling machine, of course.

Why would anyone need such a device, I hear you ask? Well, software random number generators are technically not perfectly random, and if you have to reassure a crowd of people complaining about the randomness of your rolls, this solution is straightforward and creative, and a real pleasure to read about for its gory and unexpected technical issues.
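To see what "not perfectly random" means in practice, here is a minimal R sketch. The seed value and the number of rolls are arbitrary choices made for illustration; the point is only that a software generator started from the same seed replays exactly the same "random" rolls.

```r
# A software generator is deterministic: same seed, same "random" rolls.
set.seed(42)                                 # arbitrary seed, for illustration only
first <- sample(1:6, 10, replace = TRUE)     # ten simulated dice rolls

set.seed(42)                                 # restart from the same seed
second <- sample(1:6, 10, replace = TRUE)

identical(first, second)                     # TRUE: the sequence repeats exactly
```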

Posted in Hardware, Probability, Statistics.

How much statistics should one know?

I just wrote an answer to this very interesting question on Stack Overflow. Now, as a disclaimer, I’m not an expert in statistics, but I have done enough statistics to “know the beast”, or at least to know what the dangers are. I will rearrange my answer for this post to address the more general case.

The main issue is “How much statistics should any person know?”. In our lives, we all deal with statistics, willingly or not: polls, weather forecasts, drug effectiveness, insurance, and of course some parts of computer science. Being able to critically analyze the data presented to you marks the line between drawing the right conclusions from it and being scammed, tricked, or misdirected.

Technically, the following points are important:

All these points are critical if you want to interpret anything with a grain of salt. Yet they are not the whole story. Let’s face it: statistics needs understanding before anything can be inferred; otherwise you will reach the wrong conclusions. I will give you some examples:

  • The evaluation of the null hypothesis is critical for testing the effectiveness of a method: for example, whether a drug works, or whether a fix to your hardware had a concrete result or the difference is just a matter of chance. Say you want to improve the speed of a machine, and you change the hard drive. Does this change matter? You could sample the performance with the old and the new hard disk, and check for differences. Even if you find that the average with the new disk is lower, that does not by itself mean the hard disk has any effect. Here null hypothesis testing enters, and it gives you a probability, not a definitive answer: the p-value, that is, how likely it would be to see a difference at least as large as the one you measured if changing the hard drive had no effect at all. The smaller this value, the harder it is to blame chance alone. Depending on it, you could decide whether or not to upgrade the hard drives of all 10,000 machines in your server farm.
  • Correlation is important to find out whether two quantities “change alike”. As the internet mantra “correlation is not causation” teaches, it should be handled with care. The fact that two random variables show correlation does not mean that one causes the other, nor that they are both driven by a third variable (which you are not measuring). They could just happen to behave in the same way. Look up pirates and global warming to understand the point. A correlation reports the possible presence of a signal; it does not report a finding.
  • Bayesian inference. We all know Bayesian spam filters, but there’s more, and it’s important to see how human decisions and mood can be influenced by a clear understanding of data analysis. Suppose someone goes to a medical checkup and the result says they have cancer. The fact is that most people at this point would think “I have cancer” without any doubt. That’s wrong. A positive test for cancer moves your probability of having cancer from the baseline for the population (say, 12 % of women for breast cancer) to a higher value, which is not 100 %. How high this number is depends on the accuracy of the test. If the test is lousy, you could just be a false positive. The more accurate the method, the larger the shift, but it is still not 100 %. Of course, if multiple independent tests all confirm cancer, then it’s very probable that it is there, but still it’s not 100 %. Maybe it’s 99.999 %. This is a point many people don’t understand about Bayesian statistics.
  • Plotting methods. That’s another thing that is always neglected. An analysis of data does not mean anything if you cannot effectively convey what it means via a simple plot. Depending on what information you want to bring into focus, or the kind of data you have, you will prefer an xy plot, a histogram, a violin plot, etc. Each data insight has a different preferred plot, exactly as each conversation has a different appropriate wording.
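Two of the examples above can be made concrete with a few lines of R. This is only a sketch under stated assumptions: the benchmark timings are simulated rather than measured, the seed is arbitrary, and the 90 % sensitivity and specificity of the medical test are invented for illustration (only the 12 % baseline comes from the text). The first part runs a Welch t-test on the hard-disk scenario; the second applies Bayes' rule to a positive test result.

```r
## Part 1: null hypothesis test for the hard-disk example
set.seed(1)                                     # arbitrary seed, for reproducibility
old_disk <- rnorm(30, mean = 10.0, sd = 1.0)    # simulated timings in seconds (hypothetical)
new_disk <- rnorm(30, mean = 9.6,  sd = 1.0)    # simulated timings in seconds (hypothetical)

# Null hypothesis: both disks give the same mean time.
# The p-value is the chance of a difference at least this large if that were true.
result <- t.test(old_disk, new_disk)            # Welch two-sample t-test
p_value <- result$p.value

## Part 2: Bayes' rule for a positive medical test
posterior <- function(prior, sensitivity, specificity) {
  true_pos  <- sensitivity * prior               # sick and tested positive
  false_pos <- (1 - specificity) * (1 - prior)   # healthy but tested positive
  true_pos / (true_pos + false_pos)
}

prior <- 0.12   # baseline used in the text ("say, 12 %")
sens  <- 0.90   # assumed: P(positive | disease)
spec  <- 0.90   # assumed: P(negative | no disease)

p1 <- posterior(prior, sens, spec)   # one positive test: about 0.55
p2 <- posterior(p1, sens, spec)      # second independent positive test: about 0.92
```

With these assumed numbers, a single positive result leaves the probability only slightly above one half, and each further independent positive test pushes it closer to, but never up to, 100 %.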

Statistics enters our lives every time we have to distill an answer or compare numerical (or reduced to numerical) data from unreliable sources: a signal from an instrument, a bunch of pages and the number of words they contain, and so on. Think, for example, of the algorithm that performs click detection on the iPhone. You are using a trembling, fat stylus (also known as a finger) to point to an icon which is much smaller than the stylus itself. Clearly, the hardware (a capacitive touchscreen) will send a bunch of data about the finger, plus a bunch of data about random noise from the environment. The driver must make sense out of this mess and give you an x,y coordinate on the screen. That needs a lot of statistics.
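As a toy illustration of the idea, and not a description of how any real touchscreen driver works, the R sketch below simulates a broad finger "blob" plus noise on a hypothetical 20 x 20 sensor grid and recovers a single x,y position as a signal-weighted average. The grid size, blob width, noise level, and threshold are all assumptions made up for the example.

```r
set.seed(7)                                   # arbitrary seed, for reproducibility
grid <- expand.grid(x = 1:20, y = 1:20)       # hypothetical 20 x 20 sensor grid
true_x <- 12                                  # where the finger "really" is
true_y <- 8

# Broad Gaussian blob for the fat finger, plus random environmental noise
signal  <- exp(-((grid$x - true_x)^2 + (grid$y - true_y)^2) / (2 * 3^2))
reading <- signal + rnorm(nrow(grid), sd = 0.05)

# Discard weak readings, then take the signal-weighted mean position
w <- pmax(reading - 0.2, 0)
est_x <- sum(w * grid$x) / sum(w)
est_y <- sum(w * grid$y) / sum(w)

c(est_x, est_y)                               # should land close to (12, 8)
```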

An additional issue is sampling. Sampling actually comes before statistical analysis: you collect a sample, reduce it to a number, and perform statistics on this number (among many others). Sampling is a fine and delicate art, and no statistics will correct, or even point out, incorrect sampling unless you act smart. Sampling introduces bias, whether from the sampler, the sampling method, the analysis method, the nature of the sample, or the nature of nature itself. A good sampler knows these things and tries, as much as possible, to turn unwanted bias into random noise, so that it can be treated statistically.
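A small simulation can show how quietly a biased sample misleads. In this sketch the population, the seed, and the "detection threshold" of 45 are all invented for illustration: one sampler draws at random, the other never sees values below the threshold, and nothing inside the biased sample itself signals that something is off.

```r
set.seed(3)                                        # arbitrary seed, for reproducibility
population <- rnorm(10000, mean = 50, sd = 10)     # hypothetical population

# Fair sampler: every member is equally likely to be drawn
random_sample <- sample(population, 200)

# Biased sampler: values below 45 are never seen (e.g. an instrument threshold)
biased_sample <- sample(population[population > 45], 200)

mean(population)       # reference value
mean(random_sample)    # close to the population mean
mean(biased_sample)    # noticeably too high
```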

As a closing remark, statistics is among the most powerful allies we have to understand the noisy universe we live in, but it’s also a very dangerous, backstabbing enemy if not used properly. Willfully misusing it is definitely evil.

Posted in Statistics.

Box and Whiskers plot. How?

I am trying to produce box and whiskers plots. Actually, not the plot itself, but the values for the boxes, the whiskers and so on. An example of a box and whiskers plot is the following:

box_and_whiskers.jpg

Browsing the net, I found that many sites just explain that a box plot is made according to the following recipe:

  1. Put the values in ascending order
  2. Find the median (Q2), the first quartile (Q1) and the third quartile (Q3)
  3. Make the box between the Q1 and Q3 values
  4. Put a line in the box where the median is
  5. Draw the whiskers out to the lowest and highest values
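The simple recipe above can be sketched in R as follows. The data are the same 15 values used in the worked example further down in this post; `simple_box` is just an illustrative name, and `quantile()` is used with its default method, which is an assumption since the recipe does not say how quartiles should be computed.

```r
x <- sort(c(2, 2, 3, 3, 3, 4, 4, 5, 6, 6, 6, 6, 8, 12, 13))   # step 1: ascending order

q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)           # step 2: Q1, Q2, Q3

# Steps 3-5: box from Q1 to Q3, median line at Q2, whiskers at the extremes
simple_box <- c(low_whisker = min(x), Q1 = q[1], Q2 = q[2], Q3 = q[3],
                high_whisker = max(x))
simple_box
```

Note that the high whisker lands on 13, the largest value, which is exactly why this recipe hides outliers.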

This approach however does not represent outliers. A better recipe can be found on Wikipedia:

  1. Put the values in ascending order
  2. Find the median (Q2), the first quartile (Q1) and the third quartile (Q3)
  3. Make the box between the Q1 and Q3 values
  4. Put a line in the box where the median is
  5. Calculate the Inter Quartile Range (IQR) as Q3 - Q1
  6. Find the lower fence and the higher fence values as LowFence = Q1 - (1.5 * IQR) ; HiFence = Q3 + (1.5 * IQR)
  7. Mark any value below LowFence or above HiFence as an outlier, and represent it with a circle
  8. Find the lowest value not marked as an outlier, and make the low whisker using this value (NOT the LowFence value)
  9. Find the highest value not marked as an outlier, and make the high whisker using this value (NOT the HiFence value)
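The steps of this second recipe translate almost line by line into R. The function below is a minimal sketch, not R's own implementation: `box_values` and its argument `k` are names made up for this example, and quartiles are taken from `quantile()` with its default method, which can differ slightly from the hinges R's plotting code uses on other datasets.

```r
box_values <- function(x, k = 1.5) {
  x <- sort(x)                                          # step 1
  q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)   # step 2: Q1, Q2, Q3
  iqr <- q[3] - q[1]                                    # step 5
  low_fence <- q[1] - k * iqr                           # step 6
  hi_fence  <- q[3] + k * iqr
  is_out <- x < low_fence | x > hi_fence                # step 7
  list(
    # steps 8-9: whiskers end at the most extreme values that are not outliers
    stats    = c(min(x[!is_out]), q, max(x[!is_out])),
    fences   = c(low_fence, hi_fence),
    outliers = x[is_out]
  )
}

res <- box_values(c(2, 2, 3, 3, 3, 4, 4, 5, 6, 6, 6, 6, 8, 12, 13))
res$stats      # 2 3 5 6 8
res$fences     # -1.5 10.5
res$outliers   # 12 13
```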

As an example, the following set of 15 values in R

> a=c(2,2,3,3,3,4,4,5,6,6,6,6,8,12,13)

has Q2 = 5, Q1 = 3 and Q3 = 6

> quantile(a)
  0%  25%  50%  75% 100% 
   2    3    5    6   13 

The IQR is therefore 6-3 = 3

> IQR(a)
[1] 3

The LowFence and HiFence values are therefore 3 - (3*1.5) = -1.5 and 6 + (3*1.5) = 10.5. Any value below LowFence (none) or above HiFence (12 and 13) is marked as an outlier. The whiskers are therefore delimited by 2 and 8 (the lowest and highest values in the dataset that are not outliers). The result is the plot you see above.
You can obtain the values for the box and whiskers as

> boxplot.stats(a)
$stats
[1] 2 3 5 6 8

$n
[1] 15

$conf
[1] 3.776137 6.223863

$out
[1] 12 13

Here stats contains the relevant values for the box and whiskers, as computed by hand above, out is the set of outliers, n is the number of values in the dataset, and conf holds the lower and upper ends of the notch around the median, which is not displayed in the plot above.
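The conf values can also be reproduced by hand. According to the R documentation for boxplot.stats, the notch extends to +/- 1.58 * IQR / sqrt(n) around the median; the sketch below applies that formula to the numbers in stats (the variable name `notch` is just for illustration).

```r
a <- c(2, 2, 3, 3, 3, 4, 4, 5, 6, 6, 6, 6, 8, 12, 13)
s <- boxplot.stats(a)

# median +/- 1.58 * (upper hinge - lower hinge) / sqrt(n)
notch <- s$stats[3] + c(-1.58, 1.58) * (s$stats[4] - s$stats[2]) / sqrt(s$n)

notch     # compare with s$conf
```

For this dataset that is 5 +/- 1.58 * 3 / sqrt(15), which gives the two conf values printed above.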

Posted in R, Statistics.