Python and memory fragmentation

If you use CPython on 32 bit architectures, you may encounter a problem called memory fragmentation. It is more likely to happen on Windows, for reasons that will soon be clear, but it's not a Windows exclusive. Nor is it a problem exclusive to python: it just tends to occur more often in CPython due to its intrinsic memory allocation strategy.

When you dynamically allocate memory in C, you do so in a particular area of the virtual memory: the heap. A requirement for this allocation is that the allocated chunk must be contiguous in virtual memory. CPython puts extreme stress on the heap: all objects are allocated dynamically, and when they are freed, a hole is left in the heap. This hole may be filled by a later allocation if the requested memory fits in the hole, but if it doesn't, the hole remains until something that fits is requested. On Linux, you can follow the VM occupation with this small python script (pass the process id of the target as the first argument):

import sys
import subprocess

# one cell per megabyte of address space: 16 rows of 256 MB each
mmap = [' '] * (16 * 256)
out = subprocess.check_output(["/usr/bin/pmap", "-x", "%d" % int(sys.argv[1])])
for line in out.splitlines()[2:-2]:
    # first two columns of pmap -x: start address (hex) and size in KB
    values = line.split()[0:2]
    start = int("0x" + values[0], 16) / 2**20
    end = start + (int(values[1]) * 1024) / 2**20
    for p in xrange(start, end + 1):
        mmap[p] = '*'

for row in xrange(16):
    print hex(row) + " | " + "".join(mmap[row * 256:(row + 1) * 256]) + "|"

On Windows, the great utility VMMap can be used to monitor the same information.

Given the above scenario, the virtual memory space can eventually become extremely fragmented, depending on the size of your objects, their order of allocation, whether your application alternates between large dynamic allocations and small python objects, and so on. As a result, you may not be able to perform a large allocation, not because you are out of memory, but because you are out of contiguous memory in your VM address space. In a recent benchmark I performed on Windows 7, the largest contiguous chunk of memory available was a meager 32 megabytes (ow!), which means that despite having around 1 gigabyte of free memory, the biggest chunk I could request was 32 megabytes. Anything bigger would make the malloc fail.
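If you want to probe this limit on your own process, a crude but effective trick is to binary-search the largest single allocation that still succeeds. This is a minimal sketch using ctypes, my own improvisation rather than the benchmark I actually used:

import ctypes

def largest_alloc(limit=2**31):
    # binary search, at 1 MB resolution, for the largest single
    # allocation that does not raise MemoryError
    lo, hi = 0, limit
    while hi - lo > 2**20:
        mid = (lo + hi) // 2
        try:
            buf = ctypes.create_string_buffer(mid)
            del buf
            lo = mid        # success: try something bigger
        except MemoryError:
            hi = mid        # failure: try something smaller
    return lo

print "largest contiguous block: ~%d MB" % (largest_alloc() / 2**20)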

Additional conditions that can make the problem worse are dynamic library binding and excessive threading. The first invades your VM address space, and the second needs a stack per thread, putting additional unmovable barriers throughout the VM and reducing the real estate available for contiguous blocks. See for example what happens with 10 threads on Linux:

(gdb) thread apply all print $esp

Thread 10 (Thread 10606): $5 = (void *) 0x50d7184
Thread 9 (Thread 10607): $6 = (void *) 0x757ce90
Thread 8 (Thread 10608): $7 = (void *) 0x7b69e90
Thread 7 (Thread 10609): $8 = (void *) 0x7d6ae90
Thread 6 (Thread 10618): $9 = (void *) 0x8a4ae90
Thread 5 (Thread 10619): $10 = (void *) 0xb22fee90
Thread 4 (Thread 10620): $11 = (void *) 0xb20fde90
Thread 3 (Thread 10621): $12 = (void *) 0xb1efce90
Thread 2 (Thread 10806): $13 = (void *) 0xb2ea31c4
Thread 1 (Thread 0xb7f6b6c0 (LWP 10593)): $14 = (void *) 0xbffd1f3c

Fragmentation is eventually made irrelevant by a 64 bit architecture, where the VM address space is huge (for now ;) ). Yet, if you have a 32 bit machine and a long running python process that is juggling large and small allocations, you may eventually run out of contiguous memory and see malloc() fail.

How to solve it? I found this interesting article that details the same issue and provides some mitigation techniques for Windows, because Windows is kind of special: on Windows, the 4 gigabyte address space is divided into two parts of 2 GB each, the first for the process, the second reserved for the kernel. If you must stay on 32 bits, your best bet is to grant your process an additional gigabyte of VM with the following recipe (only valid for Windows 7):

  1. run bcdedit /set IncreaseUserVa 3072 as administrator, then reboot the machine.
  2. mark your executable with EditBin.exe your_program.exe /LARGEADDRESSAWARE

With this setting, the VM subdivision becomes 3 GB + 1 GB, granting one additional gigabyte to your process. This improves the situation, and sometimes it's enough. If it is not, and you still need to work on 32 bit machines, then you are in big trouble: you have to change your allocation strategy and be smarter about handling fragmentation within your code.


Nitpicking on python properties

I like python properties, I really do. Properties allow you to convert explicit setters and getters into lean code while keeping control of the operation. Instead of this

state.setMode(EditorMode.INSERT)
current_mode = state.mode()

with properties you can write a much more pleasant

state.mode = EditorMode.INSERT
current_mode = state.mode

In both cases the syntax suggests direct access to a member, and without the magic of properties that is exactly what you would get. But if you create your class like this

class State(object):
    def __init__(self):
        self._mode = EditorMode.COMMAND

    @property
    def mode(self):
        return self._mode

    @mode.setter
    def mode(self, value):
        self._mode = value

the two decorated methods are called automatically instead. This gives you a chance to validate the passed value, for example, or to use tricks like caching when getting the value is slow: basically the same benefits you would obtain with traditional accessor methods, but with a better syntax and without violating encapsulation.

That said, I noticed a few minor things with properties that are good to point out.

1. Refactoring

Imagine that you find “mode” too poor a name, and you decide to use editor_mode instead. If you had accessor methods, you would refactor setMode() into setEditorMode(). If any part of your code still calls setMode(), you get an exception.

With a property, the same refactoring implies a change from state.mode to state.editor_mode. However, if other parts of your code still perform state.mode = something, you will not get an exception: the assignment is perfectly valid python, and simply creates a new, unrelated attribute on the instance. This can produce hard-to-find bugs, as in the sketch below.
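To make the danger concrete, here is a minimal sketch of the renamed class (EditorMode is stubbed out for illustration):

class EditorMode(object):     # minimal stub for illustration
    COMMAND, INSERT = range(2)

class State(object):
    def __init__(self):
        self._mode = EditorMode.COMMAND

    @property
    def editor_mode(self):    # renamed from "mode"
        return self._mode

    @editor_mode.setter
    def editor_mode(self, value):
        self._mode = value

state = State()
state.mode = EditorMode.INSERT    # no error: this silently creates a
                                  # brand new attribute named "mode"
print state.editor_mode           # still EditorMode.COMMAND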

2. No callbacks

While you can store or pass around state.setEditorMode as a callback, you can't achieve the same effect with a property, at least not trivially. And no, you can't use a lambda, because assignment is a statement, and statements are forbidden in a lambda.
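One possible workaround, sketched here on the State class above, is to partially apply setattr; you lose the pleasant syntax but regain a callable:

import functools

# setattr(state, "editor_mode", x) performs "state.editor_mode = x",
# so the partial application behaves like the missing bound setter
set_editor_mode = functools.partial(setattr, state, "editor_mode")
set_editor_mode(EditorMode.INSERT)   # same as: state.editor_mode = ...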

3. Mocking

You can certainly mock a property, but it requires a bit more care. Nothing impossible, but when you learn the mock module, you have to go that extra bit further if you want to cover properties.
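The key point is that properties live on the class, not on the instance, so they must be patched on the class itself. A minimal sketch using mock.PropertyMock (available in recent versions of the mock module), reusing the State class sketched above:

import mock

# patch the property on the class, not on an instance
with mock.patch.object(State, "editor_mode",
                       new_callable=mock.PropertyMock) as mocked:
    mocked.return_value = EditorMode.INSERT
    state = State()
    print state.editor_mode    # EditorMode.INSERT, served by the mock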

4. Soft-returning details of a set operation

Sometimes you might want your setter to return state information about the set operation, one trivial example being True or False depending on whether the operation was successful. You can certainly throw an exception for this specific case, but your mileage may vary depending on the specifics of your problem and what “looks better” for your design. A property gives you no flexibility to return a value during set.
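The limitation is syntactic: an assignment is a statement, not an expression, so even if the setter returns something, there is no place to capture it. A minimal sketch with a hypothetical example class:

class Validated(object):      # hypothetical example class
    def __init__(self):
        self._value = None

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, v):
        self._value = v
        return True           # silently discarded by python

obj = Validated()
obj.value = 42    # the True returned by the setter is lost;
                  # there is no syntax to capture it here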

5. Only one value at a time

If your setters are connected to some notification system, you might want to set multiple values at once and trigger a single notification. Once again, it's a minor problem: you can use a special property accepting a tuple. For example, if you have values v1 and v2 on a class Foo, you could have something like

class Foo(object):
    def __init__(self):
        self._v1 = None
        self._v2 = None

    @property
    def v1(self):
        return self._v1

    @v1.setter
    def v1(self, value):
        self._v1 = value
        # notify listeners

    @property
    def v2(self):
        return self._v2

    @v2.setter
    def v2(self, value):
        self._v2 = value
        # notify listeners

    @property
    def values(self):
        return (self._v1, self._v2)

    @values.setter
    def values(self, value_tuple):
        self._v1, self._v2 = value_tuple
        # notify listeners

f = Foo()
f.values = (1, 2)

6. Magic!

There's some magic behind properties that you can't perceive when reading client code. For example, code like this

myobj.my_foo = 5

generally makes you think a simple assignment is taking place, but this is not the case if my_foo is a property. Maybe a naming convention could disambiguate? I am not a fan of the strict PEP-8 requirements on the naming of methods, so one could potentially decide for

myobj.myMethod()
myobj.myProperty = 5
myobj.my_member_var = 3

I am thinking out loud here, and I don't have a strong opinion on this issue. That said, properties are definitely cool and will make your interfaces much more pleasant, so I definitely recommend their use, if no other constraints prevent you from doing so.


A raytracer in python – part 6: cameras

In the latest commit for the raytracer, I added cameras. The design changed so that the camera object is now responsible for rendering. Actual cameras are specializations of an abstract class BaseCamera, which holds common information about positioning. BaseCamera is then specialized into two concrete classes:

  1. PinholeCamera is a camera where rays are shot as diverging from a single point, called the eye_point (a sketch of the ray generation follows the list). This allows perspective, which was not present previously, as the rays were emerging from the ViewPlane pixels.
  2. LensCamera is a camera that simulates depth of field, that is, focus/out of focus. Contrary to the PinholeCamera, where everything is in focus, LensCamera allows selective focusing. Objects that happen to be on the “focal plane” are in focus, while objects outside it (either closer to or farther from the camera) show the softer detail proper of an out-of-focus object. To achieve this effect, we need the random sampling on a disk implemented in the previous post.
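A minimal sketch of the PinholeCamera ray generation mentioned in point 1, following the book's conventions; the names u, v, w, eye_point and the view-plane distance d are assumed from the text and from the camera discussion below:

import math

def normalize(v):
    n = math.sqrt(v[0]**2 + v[1]**2 + v[2]**2)
    return (v[0]/n, v[1]/n, v[2]/n)

class PinholeCamera(object):
    # self.u, self.v, self.w: the camera's orthonormal basis
    # self.eye_point: common origin of all rays
    # self.d: distance between the eye and the view plane
    def ray_direction(self, x, y):
        # every ray diverges from the eye through the point (x, y)
        # on the view plane, which is what produces perspective
        return normalize(tuple(x*u_ + y*v_ - self.d*w_
                               for u_, v_, w_ in
                               zip(self.u, self.v, self.w)))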

The following picture shows how LensCamera performs. A set of hemispheres is deployed along a line. The camera is above them, slightly angled, and with a roll appreciable from the tilted horizon. In all three cases, the orange central sphere is in focus, as the focus plane has been set to fall on the sphere's position. Note how the other objects are in focus for a PinholeCamera (left picture), which has no depth of field by construction, and become more and more out of focus as the lens size increases (1.0 in the center picture, 5.0 in the right one).

[Figure: from left to right, PinholeCamera, LensCamera with lens size 1.0, LensCamera with lens size 5.0]

Other cameras may technically be possible: the book goes further in deploying fisheye and stereoscopic cameras, but I am not interested in them. I think the pinhole and lens camera are flexible enough for quality renderings and my desire to learn.

One important feature of the camera system is that it requires the definition of local coordinates on the camera itself. The three vectors defining this set of coordinates, called u, v, w in the book, are obtained by building an orthonormal basis from the observation vector (the vector between the “eye” of the camera and the look_at point) and an “up” vector, whose default points in the same direction as the y axis. The cross product of these two vectors (observation and up) produces the third remaining vector of the orthogonal basis centered on the camera. However, if the camera looks straight up or straight down, the cross product is zero and we obtain a singularity, losing one degree of freedom (a condition also known as gimbal lock). The book proposes to detect this condition and treat it accordingly, either by overriding the specification and setting the vectors to an arbitrary, well defined alternative, or by “juggling” the up vector out of alignment so that the third vector is still defined. I went for a third option: ignore the problem, as I am not going to use gimbal-locked configurations for now, but it's definitely an item for the todo list.
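For reference, a minimal sketch of the basis construction described above, written with tuples; the function names are my own:

import math

def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0])

def normalize(v):
    n = math.sqrt(v[0]**2 + v[1]**2 + v[2]**2)
    return (v[0]/n, v[1]/n, v[2]/n)

def camera_basis(eye, look_at, up=(0.0, 1.0, 0.0)):
    # w points from the look_at point back toward the eye
    w = normalize(tuple(e - l for e, l in zip(eye, look_at)))
    # u completes the basis; this cross product is where the
    # singularity bites: it vanishes when up and w are parallel,
    # i.e. when the camera looks straight up or straight down
    u = normalize(cross(up, w))
    v = cross(w, u)
    return u, v, w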

With this post, I take a temporary break from the raytracing business. I may add optical effects such as reflections, refractions, materials and lights, but the number of rays that must be propagated for these effects to show tends to be very high. I want to venture into CUDA, and therefore I will switch my attention to CUDA programming from now on, integrate it with the raytracer later, then go back to light effects at a later stage. I will implement light effects first in python, then use CUDA to achieve the same results. My aim is to have fun, test CUDA/C/Python integration, compare performance, and provide a fully python raytracer with optional C/CUDA high-performance code achieving the same task. For CUDA tinkering, I will switch back to my old friend, the mandelbrot set.


How to convert a QString to a unicode object in python 2?

I had this problem to solve, and I tried to find the safest way. This program illustrates the solution

from PyQt4 import QtCore, QtGui
import sys

app = QtGui.QApplication(sys.argv)
ed = QtGui.QLineEdit()

def editingFinished():
    # text() returns a QString, which is unicode-aware
    print type(ed.text())
    # toUtf8() returns a QByteArray: the unicode string encoded as utf-8
    print type(ed.text().toUtf8())
    # Since unicode() accepts a sequence of bytes, the safest and fully
    # controlled way of performing the transformation is to pass the
    # encoded sequence of bytes, and let unicode() know it is utf-8 encoded
    print unicode(ed.text().toUtf8(), encoding="UTF-8")

QtCore.QObject.connect(ed, QtCore.SIGNAL("editingFinished()"), editingFinished)
ed.show()

app.exec_()

So the solution is

unicode(qstring_object.toUtf8(), encoding="UTF-8")

Maybe there's another, simpler way that is also safe, but for now this solution is good enough.


A good lesson in python and unicode

Ever wanted to understand more about unicode in python? This talk is a good explanation of how to deal with it properly.


A pythonic way out of the GPL restrictions in MySQL client library

I recently became aware of PyMySQL, a native python package. The package has one important benefit over the other solutions for talking to a MySQL server, such as MySQLdb (AKA mysql-python): it reimplements the MySQL protocol, instead of binding to the MySQL connector library (also known as libmysqlclient). Why is this an issue? Well, because the MySQL connector library is GPL, and you can't bind against GPL code unless your code is under a GPL-compatible license. This excludes all commercial uses, and makes all derivative works of libmysqlclient GPL as well, including the Python binding MySQLdb. If you thought about circumventing the problem using unixodbc, tough luck: the ODBC MySQL connector is also GPL, thus making unixodbc GPL as well.

Despite the low version number, PyMySQL seems to work, has no dependencies, is pure python, and is released under the very liberal MIT license.
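A minimal usage sketch, assuming a local server and the MySQLdb-compatible interface the package exposes (the connection parameters are placeholders):

import pymysql

conn = pymysql.connect(host="localhost", user="someuser",
                       passwd="somepassword", db="testdb")
cur = conn.cursor()
cur.execute("SELECT VERSION()")
print cur.fetchone()    # e.g. a one-element tuple with the server version
cur.close()
conn.close()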


A raytracer in python – part 5: non-planar samplers

In this post we are going to describe and implement non-planar samplers. In the previous post about samplers, we implemented and characterized different planar samplers to make antialiasing possible. The characteristic of these samplers was to produce regular or random points on a plane, with x,y between zero and one. To implement effects such as the simulation of lens behavior, reflections and so on, we also need to be able to shoot rays according to geometrical patterns other than the plane. More specifically, we need to map points onto a disk (to simulate lenses) or onto a hemisphere (to simulate other optical effects such as reflections), while preserving the good characteristics of the random distributions outlined in the planar case.

To achieve this, the samplers now implement two new methods, BaseSampler.map_samples_to_disk() and BaseSampler.map_samples_to_hemisphere(). They are in charge of remapping the planar distribution onto a disk or a hemisphere, but with a couple of twists: in the disk remap, the points in the range [0:1] must be remapped to a full circle spanning [-1:1] in both axes, so as to cover the circle completely while preserving the distribution. This is done through a formulation called Shirley's concentric map.
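A minimal sketch of my reading of Shirley's map; the real method works in place on the sampler's stored sample points:

import math

def concentric_map(x, y):
    """Map (x, y) from the unit square to the unit disk,
    preserving the relative distribution of the samples."""
    # stretch [0:1] to [-1:1] in both axes
    x = 2.0 * x - 1.0
    y = 2.0 * y - 1.0
    # pick the wedge of the square we are in, and derive
    # polar coordinates that cover the disk uniformly
    if x > -y:
        if x > y:
            r, phi = x, y / x                      # right wedge
        else:
            r, phi = y, 2.0 - x / y                # top wedge
    else:
        if x < y:
            r, phi = -x, 4.0 + y / x               # left wedge
        else:
            r = -y                                 # bottom wedge
            phi = 6.0 - x / y if y != 0.0 else 0.0
    phi *= math.pi / 4.0
    return (r * math.cos(phi), r * math.sin(phi))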

In the hemisphere remapping, we also want to introduce a variation in density, so that it changes with the cosine of the polar angle measured from the top of the hemisphere. In other words, we want an adjustable parameter e to focus the point density closer to the top of the hemisphere.
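A minimal sketch of the cosine-power mapping as I understand it from the book; again, the actual method remaps the stored samples:

import math

def hemisphere_map(x, y, e=1.0):
    """Map (x, y) from the unit square onto the unit hemisphere,
    with density falling off as cos(theta)**e from the top."""
    phi = 2.0 * math.pi * x
    cos_theta = (1.0 - y) ** (1.0 / (e + 1.0))
    sin_theta = math.sqrt(1.0 - cos_theta * cos_theta)
    return (sin_theta * math.cos(phi),
            sin_theta * math.sin(phi),
            cos_theta)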

[Figure: hemisphere samples for different values of the parameter e]

We will need the characteristics of these distributions later on, when we have to implement reflections and other optical effects. As you can see from the plot above, higher values of the parameter e produce a higher concentration of the points close to the top of the hemisphere. On the other hand, a low e parameter tends to produce a more uniform distribution over the full hemisphere.

To obtain the points, the sampler object now provides three methods to request an iterator. We are no longer iterating on the object itself, because we need to provide three different iteration strategies: the methods BaseSampler.diskiter(), BaseSampler.hemisphereiter() and BaseSampler.squareiter() each return a generator over the proper set of points. Note that the hemisphere point generator returns 3D points, unlike the other two, which return 2D points.
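The generators themselves can be as simple as lazily applying the two mapping sketches above to the stored square samples; a sketch of the idea, not the actual implementation:

class BaseSampler(object):
    # self.samples: the planar sample points, each in [0:1]x[0:1]
    # self.e: the density parameter for the hemisphere mapping

    def squareiter(self):
        for p in self.samples:
            yield p

    def diskiter(self):
        for x, y in self.samples:
            yield concentric_map(x, y)          # 2D points on the disk

    def hemisphereiter(self):
        for x, y in self.samples:
            yield hemisphere_map(x, y, self.e)  # 3D points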

You can find the code for this post at github.


Calling a C routine from python, the easy way

I think it may be interesting for others to see how to easily call a C routine from python, without implementing a python module. What you need is the ctypes module. Remember, however, that the use of this module is apparently frowned upon in general, at least according to a note I found in PEP 399:

Usage of ctypes to provide an API for a C library will continue to be frowned upon as ctypes lacks compiler guarantees that C code typically relies upon to prevent certain errors from occurring (e.g., API changes).

although, to be honest, that note may apply only in the context of the PEP itself, and not as a general recommendation.

Nevertheless, suppose you want to call the function

double pow(double, double)

in the standard math library.

The first thing to do is to define the prototype of the function. You achieve this via the following:

prototype=ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double, ctypes.c_double)

The call to ctypes.CFUNCTYPE creates a prototype object for a C function that returns a double (the first argument) and accepts two doubles (the second and third arguments).

Now you have a prototype. This entity is a class

>>> prototype
<class 'ctypes.CFunctionType'>

and you can bind this prototype to the actual function in the math library as follows. First, you create an object representing the library

>>> dll=ctypes.cdll.LoadLibrary("libm.dylib") # on Mac. Linux use libm.so

and then you bind to the actual function

>>> pow = prototype(("pow", dll))
>>> pow(3,4)
81.0
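As a side note, ctypes also lets you skip the explicit prototype and annotate the function object directly, which is often more convenient:

>>> dll.pow.restype = ctypes.c_double
>>> dll.pow.argtypes = [ctypes.c_double, ctypes.c_double]
>>> dll.pow(3, 4)
81.0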

This just brushes the surface, but I wanted to make a simple introductory post.


Copying and pasting from the python interpreter

One very powerful feature of python is the interactive interpreter: it allows you to test and quickly evaluate snippets of code. Occasionally, I need to rerun the same code, either during the same interpreter session or in another one. One quick way to achieve this would be to copy and paste the code again, but you quickly realize the prompt makes it hard:

>>> for i in xrange(10):
...     print i+10
...     print i-10
...     print i/2
... 
10
-10
and so on

if you directly copy and paste the above snippet, it clearly won't work, due to the presence of the prompts:

>>> >>> for i in xrange(10):
 File "<stdin>", line 1
 >>> for i in xrange(10):
 ^
SyntaxError: invalid syntax
>>> ...     print i+10

Frustrated, I decided to solve the problem once and for all: I created a .pythonrc file where I override the normal “>>> ” prompt with a header-like prompt, and the continuation prompt with the empty string:

import sys
sys.ps1='--- [Python] ---\n'
sys.ps2=''

Then, I added the PYTHONSTARTUP variable in my .bash_profile to refer to this file:

export PYTHONSTARTUP=$HOME/.pythonrc

Now my interactive session looks like this

Python 2.7.1 (r271:86832, Feb 27 2011, 20:04:04) 
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
--- [Python] ---
for i in xrange(2):
        print i

0
1
--- [Python] ---

I am now free to copy and paste the above code and replay it in the interpreter or, later on, directly into a vim session (though I will have to fix the tab spacing).


A raytracer in python – part 4: profiling

After having finally obtained a raytracer that produces antialiased images, it is now time to take a look at performance. We already saw some numbers in the last post: rendering a 200×200 image with 16 samples per pixel (a grand total of 640,000 rays) takes definitely too long. I want to perform some profiling in python, find the hotspots in the code, and eventually devise a strategy to optimize them.

General profiling with cProfile

To perform basic profiling, I used the cProfile module provided in the standard library. It appears that the longest processing time is in the hit() function:

$ python -m cProfile -s time test3.py
 Ordered by: internal time

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 2560000  85.329    0.000  108.486    0.000 Sphere.py:12(hit)
 1960020  30.157    0.000   30.157    0.000 {numpy.core.multiarray.array}
 7680000  23.115    0.000   23.115    0.000 {numpy.core._dotblas.dot}
 1        19.589   19.589  195.476  195.476 World.py:25(render)
 2560000   7.968    0.000  116.454    0.000 World.py:62(f)
 640000    6.710    0.000  133.563    0.000 World.py:61(hit_bare_bones_object)
 640025    4.438    0.000  120.902    0.000 {map}
 640000    3.347    0.000  136.910    0.000 Tracer.py:5(trace_ray)
 640000    3.009    0.000    3.009    0.000 {numpy.core.multiarray.zeros}
 640000    2.596    0.000    2.613    0.000 {sorted}
 640000    2.502    0.000    3.347    0.000 {filter}
 640000    1.835    0.000   16.784    0.000 Ray.py:4(__init__)

This does not surprise me, as the main computation a raytracer performs is to test each ray for intersection against the objects in the scene, in this case multiple Sphere objects.

Profiling line by line for hot spots

Having established that most of the time is spent in hit(), I wanted to perform line-by-line profiling. This is not possible with the standard cProfile module, so I searched and found an alternative, line_profiler:

$ easy_install-2.7 --prefix=$HOME line_profiler
$ kernprof.py -l test3.py
Wrote profile results to test3.py.lprof
$ python -m line_profiler test3.py.lprof

Before running the commands above, I added the @profile decorator to the method I am interested in. This decorator is injected into the __builtin__ module by line_profiler, so no explicit import statement is needed.

class Sphere(object):
    <...>
    @profile
    def hit(self, ray):
         <...>

The results of this profiling are

Line # Hits  Time    Per Hit  % Time Line Contents
==============================================================
12                                   @profile
13                                   def hit(self, ray):
14 2560000  27956358  10.9     19.2     temp = ray.origin - self.center
15 2560000  17944912   7.0     12.3     a = numpy.dot(ray.direction, ray.direction)
16 2560000  24132737   9.4     16.5     b = 2.0 * numpy.dot(temp, ray.direction)
17 2560000  37113811  14.5     25.4     c = numpy.dot(temp, temp) \
                                              - self.radius * self.radius
18 2560000  20808930   8.1     14.3     disc = b * b - 4.0 * a * c
19
20 2560000  10963318   4.3      7.5     if (disc < 0.0):
21 2539908   5403624   2.1      3.7         return None
22                                      else:
23   20092     75076   3.7      0.1         e = math.sqrt(disc)
24   20092    104950   5.2      0.1         denom = 2.0 * a
25   20092    115956   5.8      0.1         t = (-b - e) / denom
26   20092     83382   4.2      0.1         if (t > 1.0e-7):
27   20092    525272  26.1      0.4            normal = (temp + t * ray.direction)\
                                                           / self.radius
28   20092    333879  16.6      0.2            hit_point = ray.origin + t * \
                                                              ray.direction
29   20092    299494  14.9      0.2            return ShadeRecord.ShadeRecord(
                                                         normal=normal,
                                                         hit_point=hit_point,
                                                         parameter=t,
                                                         color=self.color)

Therefore, it appears that most of the time is spent in this chunk of code:

temp = ray.origin - self.center
a = numpy.dot(ray.direction, ray.direction)
b = 2.0 * numpy.dot(temp, ray.direction)
c = numpy.dot(temp, temp) - self.radius * self.radius
disc = b * b - 4.0 * a * c

We cannot really optimize much. We could precompute self.radius * self.radius, but it does not really have an impact. Something we can observe is the huge number of routine calls. Is the routine call overhead relevant? Maybe: python has a noticeable call overhead, but a very simple program like this

def main():
    def f():
        return 0
    a=0
    for i in xrange(2560000):
        if f():
            a = a+1

    print a

main()

takes 0.6 seconds: not small, but definitely not as huge as the numbers we see. Why is that? And why is the raytracer so slow at performing the same number of calls? I think the bottleneck is somewhere else.

Finding the problem

I decided to profile World.render() to understand what's going on: this is the routine in charge of going through the pixels, shooting the rays, and delegating the task of finding intersections to Tracer.trace_ray, which in turn re-delegates it to World.hit_bare_bone_object. I don't really like this design, but I stick to the book as much as possible, mostly because I don't know how things will evolve later on.

The profiling showed two hot spots in World.render(), in the inner loop:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================

    41    640000     18786192     29.4     29.2  ray = Ray.Ray(origin = origin,
                                                               direction = (0.0,0.0,-1.0))
    42
    43    640000     22414265     35.0     34.9  color += numpy.array(tracer.trace_ray(ray))

Why is it so slow to perform these two operations? It turns out that numpy is incredibly slow at creating small arrays. This may indeed be the reason why it's so slow to instantiate a Ray object (two numpy arrays), to add the color (another instantiation), and to perform the operations in the slow lines of Sphere.hit. At this point I'm not sure I can trust numpy.array for this workload, and I decide to remove it completely, replacing arrays with tuples. The result is pleasing

$ time python test3.py
real	0m31.215s
user	0m29.923s
sys	0m2.355s

This is an important point: tuples are much faster than small arrays. numpy seems to be optimized for large datasets and performs poorly when handling small ones. This applies not only to the creation of arrays, but also to any numpy operation that may create arrays as a consequence, such as calling numpy.dot on two tuples instead of using a trivial implementation such as

def dot(a,b):
    return a[0]*b[0]+a[1]*b[1]+a[2]*b[2]

in fact, if I use numpy.dot on tuples in Sphere.hit():

 a = numpy.dot(ray.direction, ray.direction)
 b = 2.0 * numpy.dot(temp, ray.direction)
 c = numpy.dot(temp, temp) - self.radius * self.radius

the total running time goes from 31 seconds to a staggering 316 seconds (5 minutes). My guess is that the tuples are converted to numpy arrays internally, before the actual vector-vector operation takes place.
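This is easy to verify with a micro-benchmark; a sketch with timeit, comparing numpy.dot on two 3-tuples against the trivial implementation above:

import timeit

setup = """
import numpy
a = (1.0, 2.0, 3.0)
b = (4.0, 5.0, 6.0)
def dot(a, b):
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]
"""

# numpy.dot must first convert each tuple into an array,
# paying the small-array creation cost on every call
print timeit.timeit("numpy.dot(a, b)", setup=setup, number=1000000)
# the tuple version pays only indexing and arithmetic
print timeit.timeit("dot(a, b)", setup=setup, number=1000000)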

I declare myself happy with a runtime of 30 seconds for now, and plan to optimize further when more complex operations are performed. You can find the version for this post at github.
