Utilities for debugging

Part of my daily job involves getting dirty at the assembler level, generally to debug segfaults, FPEs, deadlocks and similar issues. It's a fascinating challenge that makes me feel extremely powerful and in control of the hard, low-level technicalities of the OS, libraries and hardware I use. Most of my work, however, would be impossible without a few tools I learned to appreciate and love through hours of solid dependency. I'm posting them here; the list is certainly not exhaustive, but it's a starter kit for the hardcore debugger.

Linux

c++filt

This nifty utility converts mangled C++ symbols back into their human-readable form. A simple example is the following

$ c++filt _ZNK3MPI4File19Get_position_sharedEv
MPI::File::Get_position_shared() const

More complex symbols can become unreadable in the mangled form, especially when they arise from template instantiation. A quick pass with c++filt will make everything much clearer.

pmap

Another nifty utility in the Linux arsenal, this simple program reports the virtual memory layout of a running process.

$ pmap  21102
21102:   cat
0000000000400000     44K r-x--  /bin/cat
000000000060a000      4K r----  /bin/cat
000000000060b000      4K rw---  /bin/cat
0000000001969000    132K rw---    [ anon ]
00007f3b98aa9000   7068K r----  /usr/lib/locale/locale-archive
00007f3b99190000   1748K r-x--  /lib/x86_64-linux-gnu/libc-2.15.so
00007f3b99345000   2048K -----  /lib/x86_64-linux-gnu/libc-2.15.so
00007f3b99545000     16K r----  /lib/x86_64-linux-gnu/libc-2.15.so
00007f3b99549000      8K rw---  /lib/x86_64-linux-gnu/libc-2.15.so
00007f3b9954b000     20K rw---    [ anon ]
00007f3b99550000    136K r-x--  /lib/x86_64-linux-gnu/ld-2.15.so
00007f3b99749000     12K rw---    [ anon ]
00007f3b99770000      8K rw---    [ anon ]
00007f3b99772000      4K r----  /lib/x86_64-linux-gnu/ld-2.15.so
00007f3b99773000      8K rw---  /lib/x86_64-linux-gnu/ld-2.15.so
00007fff18ae9000    136K rw---    [ stack ]
00007fff18bff000      4K r-x--    [ anon ]
ffffffffff600000      4K r-x--    [ anon ]
 total            11404K

To achieve the same with a post-mortem core file, the gdb commands "info files" and "maintenance info sections" are your friends. On a live process, "info proc mappings" achieves the same.

lddtree

lddtree is like ldd, but it traverses the full hierarchy, building a tree diagram of all dependencies. It is available on Ubuntu in the package pax-utils.

$ lddtree /bin/ls
ls => /bin/ls (interpreter => /lib64/ld-linux-x86-64.so.2)
    libselinux.so.1 => /lib/i386-linux-gnu/libselinux.so.1
        libdl.so.2 => /lib/i386-linux-gnu/libdl.so.2
        ld-linux.so.2 => /lib/i386-linux-gnu/ld-linux.so.2
    librt.so.1 => /lib/i386-linux-gnu/librt.so.1
        libpthread.so.0 => /lib/i386-linux-gnu/libpthread.so.0
    libacl.so.1 => /lib/x86_64-linux-gnu/libacl.so.1
        libattr.so.1 => /lib/x86_64-linux-gnu/libattr.so.1
    libc.so.6 => /lib/i386-linux-gnu/libc.so.6

Windows

I am not a Windows person, but I build and debug on it rather often. Over time, I have gathered a few essential tools to understand the source of problems during build, deployment and execution.

Process monitor

It reports events of various kinds from all the processes on your system, such as disk access, registry access, and so on. I found this utility fundamental to track down a strange crash that occurred in a third-party closed-source library. The problem turned out to be an attempt to write to a directory that was not writable. From the website:

Process Monitor is an advanced monitoring tool for Windows that shows real-time file system, Registry and process/thread activity. It combines the features of two legacy Sysinternals utilities, Filemon and Regmon, and adds an extensive list of enhancements including rich and non-destructive filtering, comprehensive event properties such as session IDs and user names, reliable process information, full thread stacks with integrated symbol support for each operation, simultaneous logging to a file, and much more. Its uniquely powerful features will make Process Monitor a core utility in your system troubleshooting and malware hunting toolkit.

Process Explorer

An equivalent of fuser on Linux. From the website:

Ever wondered which program has a particular file or directory open? Now you can find out. Process Explorer shows you information about which handles and DLLs processes have opened or loaded.

VMMap

An extremely useful tool to inspect the virtual memory layout of a process. It is precious for checking potential fragmentation scenarios that could lead to failed allocations.

From the website:

VMMap is a process virtual and physical memory analysis utility. It shows a breakdown of a process's committed virtual memory types as well as the amount of physical memory (working set) assigned by the operating system to those types. Besides graphical representations of memory usage, VMMap also shows summary information and a detailed process memory map.

Dependency walker

It performs more or less the task of lddtree on Linux and otool on Mac. It's invaluable to figure out which libraries or symbols are unresolved. I soon discovered that Windows is not very verbose when it comes to unresolved dependencies at startup. From the website:

Dependency Walker is a free utility that scans any 32-bit or 64-bit Windows module (exe, dll, ocx, sys, etc.) and builds a hierarchical tree diagram of all dependent modules. For each module found, it lists all the functions that are exported by that module, and which of those functions are actually being called by other modules. Another view displays the minimum set of required files, along with detailed information about each file including a full path to the file, base address, version numbers, machine type, debug information, and more.

Event Viewer

Provided with a standard Windows 7 installation, this is another invaluable tool to understand what's going on with your program: it reports errors and anomalous conditions that prevent an application from starting, and much more.


Undefined symbols for Fortran module variables in static library on OSX. Problem (and solution)

If you program in Fortran on the Mac, you might meet this odd problem. You have a module test.f90 containing nothing but public variables

module Test
    implicit none
    integer :: value
end module

and a program

program main
    use Test
    implicit none

    value = 5
    print *, value
end program

If you compile and link as follows, you have no problems

ifort -c test.f90
ifort -c main.f90
ifort main.o test.o
./a.out

However, if you package test.o into a static library test.a

ar cru test.a test.o
ranlib test.a
ifort main.o test.a

you will get an undefined symbol error for the value symbol. What gives? Well, the problem has to do with how ranlib works on the Mac. You have two solutions to this problem:

  • either you add a module procedure to the test module
  • or you use ranlib -c instead of plain ranlib.

I thank Drew McCormack for figuring this out and going back to post the solution.


Beautiful Git rebasing of version branch

In this post, I want to share with you a technique I recently learned from a colleague. It's a really great trick to keep your history nice and clean, while still being able to work on and push feature branches.

Let's start with the workflow and the problem that generates. You have a master branch that is pushed to a remote, containing the following

* f2bc01d 2014-10-07 17:57:52 +0200 Stefano Borini (HEAD, origin/master, origin/HEAD, master) - added 5 to foo
* 2ecacd7 2014-10-07 17:57:52 +0200 Stefano Borini - added 4 to foo
* d0f53d8 2014-10-07 17:57:52 +0200 Stefano Borini - added 3 to foo
* d26a4db 2014-10-07 17:57:52 +0200 Stefano Borini - added 2 to foo
* f98ef63 2014-10-07 17:57:52 +0200 Stefano Borini - added 1 to foo
* d817b10 2014-10-07 17:57:04 +0200 Stefano Borini - Added foo
* 5fdec88 2014-10-07 17:50:00 +0200 Stefano Borini - Initial commit

You want to work on a feature, so you create a feature branch from the current master

sbo@NAS:~/tmp/git-experiment-1$ git checkout -b feature
Switched to a new branch 'feature'
sbo@NAS:~/tmp/git-experiment-1$ git push --set-upstream origin feature
Total 0 (delta 0), reused 0 (delta 0)
To git@github.com:stefanoborini/git-experiment-1.git
 * [new branch]      feature -> feature

Master keeps receiving patches from other developers. At the same time, you keep pushing your feature changes to your remote, because you don't want to risk losing all your work if your hard drive crashes. The situation ends up like this

* 453bada 2014-10-07 18:06:36 +0200 Stefano Borini (HEAD, origin/master, origin/HEAD, master) - added 30 to foo
* 1546da7 2014-10-07 18:06:36 +0200 Stefano Borini - added 29 to foo
* cdfa47d 2014-10-07 18:06:36 +0200 Stefano Borini - added 28 to foo
* 00a0051 2014-10-07 18:06:36 +0200 Stefano Borini - added 27 to foo
* 8bc15dd 2014-10-07 18:06:36 +0200 Stefano Borini - added 26 to foo
* a3b542a 2014-10-07 18:06:36 +0200 Stefano Borini - added 25 to foo
... more commits ...
* 4610ec5 2014-10-07 18:06:36 +0200 Stefano Borini - added 11 to foo
* bf30e67 2014-10-07 18:06:36 +0200 Stefano Borini - added 10 to foo
* 2404746 2014-10-07 18:06:36 +0200 Stefano Borini - added 9 to foo
* 0452b5c 2014-10-07 18:06:36 +0200 Stefano Borini - added 8 to foo
* 245864a 2014-10-07 18:06:36 +0200 Stefano Borini - added 7 to foo
* 85a9f39 2014-10-07 18:06:36 +0200 Stefano Borini - added 6 to foo
| * 8f25256 2014-10-07 18:04:42 +0200 Stefano Borini (origin/feature, feature) - added 5 to bar
| * 7f5c34f 2014-10-07 18:04:42 +0200 Stefano Borini - added 4 to bar
| * 4e57707 2014-10-07 18:04:42 +0200 Stefano Borini - added 3 to bar
| * 9c12235 2014-10-07 18:04:42 +0200 Stefano Borini - added 2 to bar
| * a714ec3 2014-10-07 18:04:42 +0200 Stefano Borini - added 1 to bar
|/
* f2bc01d 2014-10-07 17:57:52 +0200 Stefano Borini - added 5 to foo
* 2ecacd7 2014-10-07 17:57:52 +0200 Stefano Borini - added 4 to foo
* d0f53d8 2014-10-07 17:57:52 +0200 Stefano Borini - added 3 to foo
* d26a4db 2014-10-07 17:57:52 +0200 Stefano Borini - added 2 to foo
* f98ef63 2014-10-07 17:57:52 +0200 Stefano Borini - added 1 to foo
* d817b10 2014-10-07 17:57:04 +0200 Stefano Borini - Added foo
* 5fdec88 2014-10-07 17:50:00 +0200 Stefano Borini - Initial commit

Now you have a problem. If you merge feature back into master, you will get a very long line connecting your last commit on feature back to the current master. This may seem trivial, but in complex scenarios it can become unpleasant to browse.

You may think that rebasing is the solution, but unfortunately doing so would introduce a lot of problems. The feature branch has already been pushed to the remote, and it's on all the other developers' machines. Rebasing would mean rewriting history and force-pushing, with ugly consequences.

What you would like to do is "cherry pick" all the commits from feature on top of the current master, but there's a more elegant way of doing it. First of all, check out your feature branch

sbo@NAS:~/tmp/git-experiment-1$ git checkout feature
Switched to branch 'feature'

Then, create and check out a local branch from the feature branch

sbo@NAS:~/tmp/git-experiment-1$ git checkout -b feature_rebaser
Switched to a new branch 'feature_rebaser'

Now, rebase this branch against the current master. This replays all the patches from the feature branch on top of the current master, but only locally, on feature_rebaser.

sbo@NAS:~/tmp/git-experiment-1$ git rebase master
First, rewinding head to replay your work on top of it...
Applying: added 1 to bar
Applying: added 2 to bar
Applying: added 3 to bar
Applying: added 4 to bar
Applying: added 5 to bar

And finally, you can merge it into master. You can use --no-ff to keep the patches "visually separated", instead of in continuity with the current master timeline.

sbo@NAS:~/tmp/git-experiment-1$ git checkout master
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.
sbo@NAS:~/tmp/git-experiment-1$ git merge --no-ff feature_rebaser

The resulting tree is the following

*   c15286c 2014-10-07 18:22:53 +0200 Stefano Borini (HEAD, master) - Merge branch 'feature_rebaser'
|\
| * 166bbe0 2014-10-07 18:19:45 +0200 Stefano Borini (feature_rebaser) - added 5 to bar
| * 49eb2ac 2014-10-07 18:19:45 +0200 Stefano Borini - added 4 to bar
| * a5e36d7 2014-10-07 18:19:44 +0200 Stefano Borini - added 3 to bar
| * a8ca8a2 2014-10-07 18:19:44 +0200 Stefano Borini - added 2 to bar
| * 69d5ad3 2014-10-07 18:19:44 +0200 Stefano Borini - added 1 to bar
|/
* 453bada 2014-10-07 18:06:36 +0200 Stefano Borini (origin/master, origin/HEAD) - added 30 to foo
* 1546da7 2014-10-07 18:06:36 +0200 Stefano Borini - added 29 to foo
* cdfa47d 2014-10-07 18:06:36 +0200 Stefano Borini - added 28 to foo
... more commits ...
* bf30e67 2014-10-07 18:06:36 +0200 Stefano Borini - added 10 to foo
* 2404746 2014-10-07 18:06:36 +0200 Stefano Borini - added 9 to foo
* 0452b5c 2014-10-07 18:06:36 +0200 Stefano Borini - added 8 to foo
* 245864a 2014-10-07 18:06:36 +0200 Stefano Borini - added 7 to foo
* 85a9f39 2014-10-07 18:06:36 +0200 Stefano Borini - added 6 to foo
| * 8f25256 2014-10-07 18:04:42 +0200 Stefano Borini (origin/feature, feature) - added 5 to bar
| * 7f5c34f 2014-10-07 18:04:42 +0200 Stefano Borini - added 4 to bar
| * 4e57707 2014-10-07 18:04:42 +0200 Stefano Borini - added 3 to bar
| * 9c12235 2014-10-07 18:04:42 +0200 Stefano Borini - added 2 to bar
| * a714ec3 2014-10-07 18:04:42 +0200 Stefano Borini - added 1 to bar
|/
* f2bc01d 2014-10-07 17:57:52 +0200 Stefano Borini - added 5 to foo
* 2ecacd7 2014-10-07 17:57:52 +0200 Stefano Borini - added 4 to foo
* d0f53d8 2014-10-07 17:57:52 +0200 Stefano Borini - added 3 to foo
* d26a4db 2014-10-07 17:57:52 +0200 Stefano Borini - added 2 to foo

Note how the feature branch has now been "brought to the present day" and merges in nice and easy. The original feature branch can then be left dangling and deleted at your convenience.

Couldn't we just rebase feature, instead of creating a new branch? Unfortunately no. I created the same situation with a new branch "feature2", pushed it to the remote, and rebased feature2. Checking out master and then going back to feature2, I get

sbo@NAS:~/tmp/git-experiment-1$ git checkout feature2
Switched to branch 'feature2'
Your branch and 'origin/feature2' have diverged,
and have 15 and 5 different commits each, respectively.
  (use "git pull" to merge the remote branch into yours)

Oops... I changed history on my local machine, and it now differs from what is on the remote server. If you pull now, you will apply the same patches twice

sbo@NAS:~/tmp/git-experiment-1$ git pull
Merge made by the 'recursive' strategy.

with the following tree

*   a037791 2014-10-07 18:37:03 +0200 Stefano Borini (HEAD, origin/feature2, feature2) - Merge branch 'feature2' of github.com:stefanoborini/git-experiment-1 into fea
|\
| * 6c9e82a 2014-10-07 18:30:45 +0200 Stefano Borini - added 10 to bar
| * 4bcfced 2014-10-07 18:30:45 +0200 Stefano Borini - added 9 to bar
| * 469517d 2014-10-07 18:30:45 +0200 Stefano Borini - added 8 to bar
| * a28a360 2014-10-07 18:30:45 +0200 Stefano Borini - added 7 to bar
| * 894b24c 2014-10-07 18:30:45 +0200 Stefano Borini - added 6 to bar
* | a2b14a4 2014-10-07 18:31:44 +0200 Stefano Borini (origin/master, origin/HEAD, master) - added 10 to bar
* | acefda0 2014-10-07 18:31:44 +0200 Stefano Borini - added 9 to bar
* | 2be9bef 2014-10-07 18:31:44 +0200 Stefano Borini - added 8 to bar
* | 5d528cb 2014-10-07 18:31:44 +0200 Stefano Borini - added 7 to bar
* | a6ec80f 2014-10-07 18:31:44 +0200 Stefano Borini - added 6 to bar
* | d1dc5d2 2014-10-07 18:31:17 +0200 Stefano Borini - added 40 to foo
* | bdafcce 2014-10-07 18:31:17 +0200 Stefano Borini - added 39 to foo
* | 2dd4415 2014-10-07 18:31:17 +0200 Stefano Borini - added 38 to foo
* | 11a8f22 2014-10-07 18:31:17 +0200 Stefano Borini - added 37 to foo
* | f208658 2014-10-07 18:31:17 +0200 Stefano Borini - added 36 to foo
* | ef1e319 2014-10-07 18:31:17 +0200 Stefano Borini - added 35 to foo

Git is unable to recognize the duplicated patches, because rebasing changes their SHAs. In this case it went smoothly, but in a real-world scenario you would get plenty of conflicts.


C++ stdlib std::srand/std::rand are not repeatable across platforms

I recently observed that std::srand/std::rand are not required to yield the same pseudorandom sequence across platforms. This code, for example

#include <cstdlib>
#include <iostream>
#include <ctime>

int main()
{
     std::srand(333);
     int random_variable = std::rand();
     std::cout << "Random value on [0 " << RAND_MAX << "]: " << random_variable << '\n';
     random_variable = std::rand();
     std::cout << "Random value on [0 " << RAND_MAX << "]: " << random_variable << '\n';
     random_variable = std::rand();
     std::cout << "Random value on [0 " << RAND_MAX << "]: " << random_variable << '\n';
}

gives different results on Linux and on the Mac. On Ubuntu Linux 12.04, it gives

Random value on [0 2147483647]: 1756860556
Random value on [0 2147483647]: 1141727289
Random value on [0 2147483647]: 551934435

On Mac OS X 10.8 it gives instead

Random value on [0 2147483647]: 5596731
Random value on [0 2147483647]: 1722461096
Random value on [0 2147483647]: 1324078912

Keep this in mind if you need reproducible cross-platform behavior from your PRNG. If you do, note that the C++11 <random> engines (such as std::mt19937) are exactly specified by the standard and produce the same raw sequence everywhere, although the distributions built on top of them are not similarly specified.


Simulating network problems on OSX

Recently I had to simulate unreliable connections to a remote server for testing purposes. I needed two cases: a fully unreachable host, and a host with considerable packet loss and delay. The first is easy to achieve: just pull the cable. For the second, I had to google around, as my network-fu is a bit rusty on the latest developments, especially on OSX. I am more of a Linux guy when it comes to networking.

A quick search produced a blog post with great details on how to achieve what I needed. I copy the details here for future reference.

Here I add a 1 second delay and a 0.4 packet loss probability in both directions

bash-3.2# ipfw add pipe 1 ip from any to 10.0.0.1
00100 pipe 1 ip from any to 10.0.0.1

bash-3.2# ipfw add pipe 1 ip from 10.0.0.1 to any
00200 pipe 1 ip from any to any src-ip 10.0.0.1

bash-3.2# ipfw pipe 1 config delay 1000ms plr 0.4
bash-3.2# ipfw pipe 2 config delay 1000ms plr 0.4

bash-3.2# ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
Request timeout for icmp_seq 0
64 bytes from 10.0.0.1: icmp_seq=0 ttl=64 time=2000.386 ms

This is really harsh, but it is what I need.

The ipfw utility is the traditional FreeBSD packet filter. The amount of power behind this little command is impressive and reminds me of Linux iptables. ipfw has been deprecated in OSX 10.9 in favor of pfctl. Similarly, iptables is going to be deprecated on Linux in favor of nftables.


New style, new direction

As you can see, I made some deep changes to the website. My hosting at Bluehost is expiring and my needs have changed since I started. I am overall happy with Bluehost, they are a great, reliable host, but I don't need their services anymore. I migrated my WordPress deployment to a static website written in reStructuredText and compiled to HTML with Pelican. I now host both the reST and the HTML on GitHub.

This approach requires a bit more manual work, but I think it's worth the savings and the simpler technology behind it. In the process, I also got rid of the heavily spammed WordPress comment system and introduced Google Plus comments.

In the meantime, I decided to shift focus. In the past I occasionally wrote about science in general. This will no longer happen: I will focus exclusively on scientific programming and related issues, with a general bias toward Python. I will try to keep some regularity in my posting frequency, but considering the increased manual work required, there will be more variability.

Currently, I am working on the following long term projects:

As you can see, I am overcommitted, which is quite typical for me. Since I am only one person, I jump from one project to another in round-robin fashion. Don't assume a project is dead because you don't see changes; I am probably working on something else and I'll go back to it at a later stage.


Scrambling the stack for fun and profit

Like any self-respecting tinkerer, I sometimes like to play with the madness and intricacies of the hardware and software I use. Recently I was tracing problems in a piece of software I won't name, whose objective is to prevent tampering attempts, and while dumping stacks and disassembling routines, I came across this very interesting backtrace

#0  0x0000003f95e0d654 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003f95e08f65 in _L_lock_1127 () from /lib64/libpthread.so.0
#2  0x0000003f95e08e63 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00002b67cbdeaded in ?? ()
#4  0x000000002d0e9608 in ?? ()
#5  0x00002b67cbd1e1f2 in ?? ()
#6  0x000000000000000b in ?? ()
#7  0x00002aaaca08e410 in ?? ()
#8  0x00002aaab405d558 in ?? ()
#9  0x00002aaaadf65f48 in ?? ()
#10 0x00002aaaadf65fa0 in ?? ()
#11 0x00002aaaadf65fc0 in ?? ()
#12 0x00002aaaadf65f40 in ?? ()
#13 0x00002aaaadf65f50 in ?? ()
#14 0x000000002d0e7460 in ?? ()
#15 0x0000000026014330 in ?? ()
#16 0x00002b67cc1d08b0 in ?? ()

Interesting, I said to myself. There was no backtrace information, and the addresses appeared to be incorrect, not pointing to any code section. Take frame #6, for example: I highly doubt that's a valid return address. I had no idea why this was happening.

Nevertheless, I eventually found what I was looking for and moved on to other things, but the weird stack remained in the back of my head. Could it be that the stack had been scrambled on purpose? Maybe... or maybe the library was just stripped? But then why the weird address at frame #6? I still don't know, but the idea of scrambling the stack was appealing, so I wrote this trivial program as a proof of concept, just for fun

#include <stdio.h>

int routine4() {
    printf("hello\n");
}
int routine3() {
    routine4();
}
int routine2() {
    routine3();
}
int routine1() {
    routine2();
}
int main() {
    routine1();
}

then I compiled it with a simple g++ file.cpp, set a breakpoint on routine4, ran it, and asked for a backtrace

#0  0x0000000100000e18 in routine4 ()
#1  0x0000000100000e2f in routine3 ()
#2  0x0000000100000e3a in routine2 ()
#3  0x0000000100000e45 in routine1 ()
#4  0x0000000100000e50 in main ()

A pretty nice backtrace: we see the full information and all the routine names. If we disassemble routine4, we also see the call to printf, which is actually rewritten into a call to puts on OSX 10.6.8 with gcc.

0x0000000100000e14 <_Z8routine4v+0>:        push   %rbp
0x0000000100000e15 <_Z8routine4v+1>:        mov    %rsp,%rbp
0x0000000100000e18 <_Z8routine4v+4>:        lea    0x45(%rip),%rdi        # 0x100000e64
0x0000000100000e1f <_Z8routine4v+11>:       callq  0x100000e5e <dyld_stub_puts>
0x0000000100000e24 <_Z8routine4v+16>:       leaveq
0x0000000100000e25 <_Z8routine4v+17>:       retq

If I instead strip the binary with strip a.out, I can't set a breakpoint on routine4 anymore, and rightly so

(gdb) break routine4
Function "routine4" not defined.
Make breakpoint pending on future shared library load? (y or [n]) n
(gdb) break puts
Breakpoint 1 at 0x3126e978c12ef0

Thanks to strip, all symbols are gone and the debugger can only refer to addresses. I can only guess (though it's not a hard guess) where the code is, by checking which VM pages it is mapped to, and the entry point

(gdb) info file
Symbols from "/Users/sbo/tmp/a.out".
Mac OS X executable:
   /Users/sbo/tmp/a.out, file type mach-o-le.
   Entry point: 0x0000000100000dd8
   0x0000000100000000 - 0x0000000100001000 is LC_SEGMENT.__TEXT in /Users/sbo/tmp/a.out
   0x0000000100000dd8 - 0x0000000100000e57 is LC_SEGMENT.__TEXT.__text in /Users/sbo/tmp/a.out

In any case, with the breakpoint at puts I can get to the printf call and issue a backtrace, reaching our infamous condition

#0  0x00007fff86eb0ef0 in puts ()
#1  0x0000000100000e24 in ?? ()
#2  0x0000000100000e2f in ?? ()
#3  0x0000000100000e3a in ?? ()
#4  0x0000000100000e45 in ?? ()
#5  0x0000000100000e50 in ?? ()
#6  0x0000000100000e0c in ?? ()

Yet, as you can see, the stack still makes sense. I cannot disassemble by function name, but I can at least dump the instructions at those addresses, and they make sense

(gdb) disas 0x0000000100000e24
No function contains specified address.
(gdb) x/30i  0x0000000100000e24
0x100000e24: leaveq
0x100000e25: retq
0x100000e26: push   %rbp
0x100000e27: mov    %rsp,%rbp
0x100000e2a: callq  0x100000e14
0x100000e2f: leaveq
0x100000e30: retq
0x100000e31: push   %rbp
0x100000e32: mov    %rsp,%rbp
0x100000e35: callq  0x100000e26
0x100000e3a: leaveq
0x100000e3b: retq
0x100000e3c: push   %rbp
0x100000e3d: mov    %rsp,%rbp
0x100000e40: callq  0x100000e31
0x100000e45: leaveq
0x100000e46: retq
0x100000e47: push   %rbp
0x100000e48: mov    %rsp,%rbp
0x100000e4b: callq  0x100000e3c
0x100000e50: mov    $0x0,%eax
0x100000e55: leaveq
0x100000e56: retq

In fact, you can see the whole shebang: all the routine calls, the frame pointer setups, and the final zeroing of eax when main ends.

Scrambling the return address

Here is the idea: instead of smashing the stack, I will try to scramble it. What does that mean? Well, let's look at the stack when we are just about to call puts. We select the previous frame

(gdb) frame 1
#1  0x0000000100000e24 in ?? ()

Get the stack pointer at the current frame

(gdb) info registers
...snip...
rbp            0x7fff5fbff680        0x7fff5fbff680
rsp            0x7fff5fbff680        0x7fff5fbff680
...snip...

Then we take a look at what is in there

(gdb) x/10a 0x7fff5fbff680
0x7fff5fbff680:      0x7fff5fbff690  0x100000e2f
0x7fff5fbff690:      0x7fff5fbff6a0  0x100000e3a
0x7fff5fbff6a0:      0x7fff5fbff6b0  0x100000e45
0x7fff5fbff6b0:      0x7fff5fbff6c0  0x100000e50
0x7fff5fbff6c0:      0x7fff5fbff6d8  0x100000e0c
(gdb) bt
#0  0x00007fff86eb0ef0 in puts ()
#1  0x0000000100000e24 in ?? ()
#2  0x0000000100000e2f in ?? ()
#3  0x0000000100000e3a in ?? ()
#4  0x0000000100000e45 in ?? ()
#5  0x0000000100000e50 in ?? ()
#6  0x0000000100000e0c in ?? ()

Nothing unusual: each entry holds the saved frame pointer and the return address, the traditional stack contents for a routine call. When the routine returns, the saved return address will be loaded into rip, and the program will continue where it left off, right after the routine call. If we were to change this address on the stack, the program would jump to a different location, which would be bad and likely lead to a crash. Note however that, in order for the stack to unwind correctly, only the frame below the current one is needed, and it's needed just before the return occurs.

So we can technically scramble the whole stack, set those addresses to something else, and completely break the backtrace even of a non-stripped binary, provided that we restore the frame under the current one just before returning. The process will be:

  1. Inside every routine, we will drop to the assembly level and write a prologue section where we alter the underlying frame's return address.
  2. We do our thing inside the routine.
  3. Again at the assembly level, we write an epilogue section where we restore the return address, just before issuing the return that needs it.

With this strategy in place, if you break anywhere inside a function, all the frames (except the one your code is currently in) will be "scrambled" and point at nonsensical memory areas. Despite the completely trashed stack, the program will behave correctly, because by the time each address is needed at return, the right value has been restored just a few instructions earlier. Let's see:

int routine4() {
    asm("mov 8(%rsp), %rbx");
    asm("lea 0xdeeead(,%rbx,), %rbx");
    asm("mov %rbx, 8(%rsp)");
    printf("hello\n");
    asm("mov 8(%rsp), %rbx");
    asm("lea -0xdeeead(,%rbx,), %rbx");
    asm("mov %rbx, 8(%rsp)");
}

I altered the routine to perform the prologue and the epilogue. In the prologue, I extract the content of the stack at the stack pointer plus 8, which happens to be the return address. I put this value in rbx, as it seems to be unused. Then, with lea, I add a fixed offset (0xdeeead) to the content of rbx. Finally, I write this value back to the stack at %rsp+8. In the epilogue, I simply perform the opposite operation, subtracting 0xdeeead and restoring the correct return address on the stack. If I compile and run, the program works correctly.

The gdb session is really nice

Breakpoint 1, 0x0000000100000df4 in routine4 ()
(gdb) bt
#0  0x0000000100000df4 in routine4 ()
#1  0x0000000100000e2f in routine3 ()
#2  0x0000000100000e3a in routine2 ()
#3  0x0000000100000e45 in routine1 ()
#4  0x0000000100000e50 in main ()

Note how the stack is correct, as we haven't executed the prologue yet.

(gdb) disas

Dump of assembler code for function _Z8routine4v:
0x0000000100000df0 <_Z8routine4v+0>:        push   %rbp
0x0000000100000df1 <_Z8routine4v+1>:        mov    %rsp,%rbp
0x0000000100000df4 <_Z8routine4v+4>:        mov    0x8(%rsp),%rbx           # prologue
0x0000000100000df9 <_Z8routine4v+9>:        lea    0xdeeead(,%rbx,1),%rbx   # prologue
0x0000000100000e01 <_Z8routine4v+17>:       mov    %rbx,0x8(%rsp)           # prologue
0x0000000100000e06 <_Z8routine4v+22>:       lea    0x57(%rip),%rdi          # 0x100000e64
0x0000000100000e0d <_Z8routine4v+29>:       callq  0x100000e5e <dyld_stub_puts>
0x0000000100000e12 <_Z8routine4v+34>:       mov    0x8(%rsp),%rbx           # epilogue
0x0000000100000e17 <_Z8routine4v+39>:       lea    -0xdeeead(,%rbx,1),%rbx  # epilogue
0x0000000100000e1f <_Z8routine4v+47>:       mov    %rbx,0x8(%rsp)           # epilogue
0x0000000100000e24 <_Z8routine4v+52>:       leaveq
0x0000000100000e25 <_Z8routine4v+53>:       retq
End of assembler dump.

The current situation looks like this

(gdb) info register
rbx            0x0   0
rsp            0x7fff5fbff680        0x7fff5fbff680
(gdb) x/10a 0x7fff5fbff680
0x7fff5fbff680:      0x7fff5fbff690  0x100000e2f <_Z8routine3v+9>
0x7fff5fbff690:      0x7fff5fbff6a0  0x100000e3a <_Z8routine2v+9>
0x7fff5fbff6a0:      0x7fff5fbff6b0  0x100000e45 <_Z8routine1v+9>
0x7fff5fbff6b0:      0x7fff5fbff6c0  0x100000e50

Stepping instruction after instruction, we can follow the events: first the rbx register is filled with the return address from the stack

-> mov    0x8(%rsp),%rbx
(gdb) info register rbx
rbx 0x100000e2f 4294970927

Then, we add 0xdeeead

-> lea    0xdeeead(,%rbx,1),%rbx
(gdb) info register rbx
rbx            0x100defcdc   4309581020

and finally, we store it back into the stack

-> mov    %rbx,0x8(%rsp)           # prologue
(gdb) x/10a 0x7fff5fbff680
0x7fff5fbff680:      0x7fff5fbff690  0x100defcdc
0x7fff5fbff690:      0x7fff5fbff6a0  0x100000e3a <_Z8routine2v+9>
0x7fff5fbff6a0:      0x7fff5fbff6b0  0x100000e45 <_Z8routine1v+9>
0x7fff5fbff6b0:      0x7fff5fbff6c0  0x100000e50

Et voilà. The backtrace now points to neverland

(gdb) bt
#0  0x0000000100000e06 in routine4 ()
#1  0x0000000100defcdc in ?? ()
#2  0x0000000100000e3a in routine2 ()
#3  0x0000000100000e45 in routine1 ()
#4  0x0000000100000e50 in main ()

If we were to return now, a segfault would occur: that return address is completely invalid. It's only by performing the reverse operation that we can land safely back into routine3

-> mov    0x8(%rsp),%rbx
rbx            0x100defcdc
-> lea    -0xdeeead(,%rbx,1),%rbx
rbx            0x100000e2f
-> mov    %rbx,0x8(%rsp)
Stack 0x7fff5fbff680:        0x7fff5fbff690  0x100000e2f <_Z8routine3v+9>

Now the backtrace is sane again and we are ready to return

(gdb) bt
#0  0x0000000100000e24 in routine4 ()
#1  0x0000000100000e2f in routine3 ()
#2  0x0000000100000e3a in routine2 ()
#3  0x0000000100000e45 in routine1 ()
#4  0x0000000100000e50 in main ()

Now that we can reliably alter the stack frame, we can apply the same trick to our complete call hierarchy. Here is the full code:

#include <stdio.h>

#define scramble() asm("mov 8(%rsp), %rbx"); \
                    asm("lea 0xdead(,%rbx,), %rbx"); \
                    asm("mov %rbx, 8(%rsp)")

#define unscramble() asm("mov 8(%rsp), %rbx"); \
                     asm("lea -0xdead(,%rbx,), %rbx"); \
                     asm("mov %rbx, 8(%rsp)")
int routine4() {
    scramble();
    printf("hello\n");
    unscramble();
}
int routine3() {
    scramble();
    routine4();
    unscramble();
}
int routine2() {
    scramble();
    routine3();
    unscramble();
}
int routine1() {
    scramble();
    routine2();
    unscramble();
}
int main() {
    scramble();
    routine1();
    unscramble();
}

If you compile it, it runs

sbo@sbos-macbook:~/tmp$ g++ test.cpp
sbo@sbos-macbook:~/tmp$ ./a.out
hello

and if you debug it, break at puts, and backtrace, here is the funny result

(gdb) bt
#0  0x00007fff86eb0ef0 in puts ()
#1  0x0000000100000d82 in routine4 ()
#2  0x0000000100defc5e in ?? ()
#3  0x0000000100defc8d in ?? ()
#4  0x0000000100defcbc in ?? ()
#5  0x0000000100defceb in ?? ()
#6  0x0000000100defc05 in ?? ()
(gdb) x/10a 0x0000000100defc8d
0x100defc8d: Cannot access memory at address 0x100defc8d
(gdb) disas 0x0000000100defc8d
No function contains specified address.
(gdb) cont
Continuing.
hello

Program exited normally.

Now you can get creative. For example, you can

  • Scramble your frames according to a random number that you seed differently at every new run.
  • Scramble the whole frame content, not only the return address.
  • Spread the prologue and epilogue throughout the routine's code, so that it's harder to tell which opcodes are devoted to actual execution and which ones are unscrambling the frame, perhaps through tortuous operations full of indirection.

Of course, this stuff is extremely hard to do correctly. You have to take into account that some stack content could be needed by callees, so you may have to unscramble parts of a frame at any time. It can also quickly turn into a portability nightmare, as different compilers may have different strategies for laying out local variables on the stack.

Yet, it was fun, and I hope you enjoyed it.


Python and memory fragmentation

If you use CPython on 32-bit architectures, you may encounter a problem called memory fragmentation. It is more likely to happen on Windows, for reasons that will soon be clear, but it's not exclusive to Windows. It is also not exclusively a Python problem, but it tends to occur more often with CPython due to its intrinsic memory allocation strategy.

When you dynamically allocate memory in C, you do so in a particular area of virtual memory, the heap. A requirement of this allocation is that the allocated chunk must be contiguous in virtual memory. CPython puts extreme stress on the heap: all objects are allocated dynamically, and when they are freed, a hole is left in the heap. This hole may be filled by a later allocation if the requested memory fits in it; if it doesn't, the hole remains until something that fits is requested. On Linux, you can follow the VM occupation with this small Python script

import sys
import subprocess

# Python 2 script: one character per megabyte of a 4 GB (32 bit) address space,
# arranged as 16 rows of 256 MB each
mmap = [' ']*(16*256)
out = subprocess.check_output(["/usr/bin/pmap", "-x", "%d" % int(sys.argv[1])])
for i in out.splitlines()[2:-2]:        # skip the pmap header and footer lines
    values = i.split()[0:2]             # start address and size (in KB) of each mapping
    start = int("0x"+values[0], 16) / 2**20
    end = start + (int(values[1])*1024) / 2**20
    for p in xrange(start, end+1):
        mmap[p] = '*'                   # mark the megabytes covered by this mapping

for row in xrange(16):
    print hex(row)+" | "+"".join(
                        mmap[row * 256:(row+1)*256]
                        )+"|"

On Windows, the great utility VMMap can be used to monitor the same information.

Given the above scenario, the virtual memory space can eventually become extremely fragmented, depending on the size of your objects, their order of allocation, whether your application alternates between allocating large chunks of memory and small Python objects, and so on. As a result, you may be unable to perform a large allocation, not because you are out of memory, but because you are out of contiguous memory in your VM address space. In a recent benchmark I performed on Windows 7, the largest contiguous chunk of memory available was a meager 32 megabytes (ow!), which means that despite the free memory being around 1 gigabyte, the biggest chunk I could request was only 32 megabytes. Anything bigger would make the malloc fail.
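
If you want to estimate this limit on your own machine, a quick and crude way is to bisect the size of the largest single allocation that currently succeeds. The following is a minimal sketch of mine (the helper name, the 2 GB upper bound and the 1 MB resolution are arbitrary choices), using a bytearray as a stand-in for a single contiguous malloc:

def largest_allocatable(upper=2**31, lower=0):
    # bisect the size of the largest single (hence contiguous) allocation
    # that succeeds right now; stop when the bracket is narrower than 1 MB
    while upper - lower > 2**20:
        mid = (upper + lower) // 2
        try:
            block = bytearray(mid)   # one contiguous chunk
            del block
            lower = mid              # success: try something bigger
        except MemoryError:
            upper = mid              # failure: try something smaller
    return lower

print("largest single allocation: %d MB" % (largest_allocatable() // 2**20))

On a healthy 64-bit machine this will simply report something close to the upper bound; in a fragmented 32-bit process it gives a rough idea of the biggest chunk you can still get.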

Additional conditions that can make the problem worse are dynamic library loading and excessive threading. The first invades your VM address space, and the second needs a stack for each thread, putting additional unmovable barriers throughout the VM and reducing the real estate available for contiguous blocks. See for example what happens with 10 threads on Linux

(gdb) thread apply all print $esp

Thread 10 (Thread 10606): $5 = (void *) 0x50d7184
Thread 9 (Thread 10607): $6 = (void *) 0x757ce90
Thread 8 (Thread 10608): $7 = (void *) 0x7b69e90
Thread 7 (Thread 10609): $8 = (void *) 0x7d6ae90
Thread 6 (Thread 10618): $9 = (void *) 0x8a4ae90
Thread 5 (Thread 10619): $10 = (void *) 0xb22fee90
Thread 4 (Thread 10620): $11 = (void *) 0xb20fde90
Thread 3 (Thread 10621): $12 = (void *) 0xb1efce90
Thread 2 (Thread 10806): $13 = (void *) 0xb2ea31c4
Thread 1 (Thread 0xb7f6b6c0 (LWP 10593)): $14 = (void *) 0xbffd1f3c

Fragmentation is eventually made irrelevant by 64-bit architectures, where the VM address space is huge (for now ;) ). Yet, if you have a 32-bit machine and a long-running Python process that is juggling large and small allocations, you may eventually run out of contiguous memory and see malloc() fail.

How to solve it? I found an interesting article that details the same issue and provides some mitigation techniques for Windows, because Windows is kind of special: there, the 4 gigabyte address space is divided into two parts of 2 GB each, the first for the process and the second reserved for the kernel. If you must stay on 32 bits, your best bet is to grant your process an additional gigabyte of VM with the following recipe (only valid for Windows 7)

  1. run bcdedit /set IncreaseUserVa 3072 as administrator, then reboot the machine.
  2. mark your executable with EditBin.exe yourprogram.exe /LARGEADDRESSAWARE

With this recipe, the VM subdivision is set to 3 GB + 1 GB, granting one additional gigabyte to your process. This improves the situation, and sometimes it's enough. If it is not, and you still need to work on 32-bit machines, then you are in big trouble: you have to change your allocation strategy and be smarter about handling fragmentation within your code.


Nitpicking on python properties

I like Python properties, I really do. Properties allow you to convert explicit setters and getters into lean code while keeping control over the operation. Instead of this

state.setMode(EditorMode.INSERT)
current_mode = state.mode()

with properties you can write a much more pleasant

state.mode = EditorMode.INSERT
current_mode = state.mode

In both cases, were it not for the magic of properties, you would be accessing a member directly; but if you create your class like this

class State(object):
    def __init__(self):
        self._mode = EditorMode.COMMAND

    @property
    def mode(self):
        return self._mode

    @mode.setter
    def mode(self, value):
        self._mode = value

the two decorated methods are called automatically. This gives you a chance to validate the passed value, for example, or to use smart tricks like caching when computing the value is slow: basically the same tricks you would get from traditional accessor methods, but with a better syntax and without violating encapsulation.
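
As a concrete example of the validation case, here is a minimal sketch of mine (the class and the mode names are made up, not part of any library):

class ValidatedState(object):
    _valid_modes = ("command", "insert")

    def __init__(self):
        self._mode = "command"

    @property
    def mode(self):
        return self._mode

    @mode.setter
    def mode(self, value):
        # reject anything that is not a known mode before storing it
        if value not in self._valid_modes:
            raise ValueError("invalid mode: %r" % (value,))
        self._mode = value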

That said, I noticed a few minor things with properties that are good to point out.

1. Refactoring

Imagine that you find "mode" too poor as a name, and you decide to use editor_mode instead. If you had an accessor method, you would refactor setMode() to setEditorMode(). If any part of your code still called setMode(), you would get an exception.

With a property, the same refactoring implies a change from state.mode to state.editor_mode. However, if other parts of your code still use state.mode = something, you will not get an exception. In fact, it's perfectly valid Python: the assignment silently creates a brand new instance attribute. This can produce hard-to-find bugs.
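
To make this concrete, here is a small sketch of mine (EditorMode is just a made-up stand-in) showing the silent failure after the rename:

class EditorMode(object):
    COMMAND = "command"
    INSERT = "insert"

class State(object):
    def __init__(self):
        self._mode = EditorMode.COMMAND

    @property
    def editor_mode(self):              # renamed from "mode"
        return self._mode

    @editor_mode.setter
    def editor_mode(self, value):
        self._mode = value

state = State()
state.mode = EditorMode.INSERT          # no error: silently creates a new attribute "mode"
print(state.editor_mode)                # still "command": the real state was never touched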

2. No callbacks

While you can store or pass around state.setEditorMode as a callback, you can't achieve the same effect with a property, at least not trivially. And no, you can't use a lambda, because assignment is forbidden inside a lambda.
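
One workaround I find acceptable, sketched here under the assumption that a plain callable is good enough for whoever receives the callback, is functools.partial over setattr, which still goes through the property setter:

import functools

class State(object):
    def __init__(self):
        self._mode = "command"

    @property
    def mode(self):
        return self._mode

    @mode.setter
    def mode(self, value):
        self._mode = value

state = State()
# a callable that, when invoked with a value, performs "state.mode = value"
set_mode = functools.partial(setattr, state, "mode")
set_mode("insert")
print(state.mode)   # "insert"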

3. Mocking

You can certainly mock a property, but it requires a bit more care. Nothing impossible, but when you learn the mock module, you have to go that extra bit further if you want to cover properties.
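
For the record, this is the kind of extra step I mean, using PropertyMock (available in unittest.mock on Python 3, or in the standalone mock package on Python 2); the State class below is a made-up example:

try:
    from unittest import mock          # Python 3.3+
except ImportError:
    import mock                        # standalone package on Python 2

class State(object):
    @property
    def mode(self):
        return "command"

# properties live on the class, so they must be patched on the class, not the instance
with mock.patch.object(State, "mode", new_callable=mock.PropertyMock) as mocked_mode:
    mocked_mode.return_value = "insert"
    state = State()
    assert state.mode == "insert"
    mocked_mode.assert_called_once_with()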

4. Soft-returning the outcome of a set operation

Sometimes you might want your setter to return information about the outcome of the set operation. One trivial example is returning True or False depending on whether the operation was successful. You can certainly throw an exception for this specific case, but your mileage may vary depending on the specifics of your problem and what "looks better" for your design. A property does not give you the flexibility to return a value from a set.
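
To illustrate the difference, here is a short sketch of mine comparing the two styles: an explicit setter can hand back a success flag, while a property setter's return value is discarded, so its only way to signal failure is to raise.

class State(object):
    def __init__(self):
        self._mode = "command"

    def set_mode(self, value):
        # explicit setter: can report the outcome to the caller
        if value not in ("command", "insert"):
            return False
        self._mode = value
        return True

    @property
    def mode(self):
        return self._mode

    @mode.setter
    def mode(self, value):
        # property setter: the return value is ignored, so the only way
        # to signal failure is to raise an exception
        if value not in ("command", "insert"):
            raise ValueError("invalid mode: %r" % (value,))
        self._mode = value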

5. Only one value at a time

If your setters are connected to some notification system, you might want to set multiple values at once and trigger a single notification. Once again, it's a minor problem: you can use a special property accepting a tuple. For example, if you have values v1 and v2 on a class Foo, you could have something like

class Foo(object):
    def __init__(self):
        self._v1 = None
        self._v2 = None
    @property
    def v1(self):
        return self._v1
    @v1.setter
    def v1(self, value):
        self._v1 = value
        # notify listeners
    @property
    def v2(self):
        return self._v2
    @v2.setter
    def v2(self, value):
        self._v2 = value
        # notify listeners
    @property
    def values(self):
        return (self._v1, self._v2)
    @values.setter
    def values(self, value_tuple):
        self._v1, self._v2 = value_tuple
        # notify listeners

f=Foo()
f.values = (1,2)

6. Magic!

There's some kind of magic behind properties that you can't perceive when you read client code. For example, code like this

myobj.my_foo = 5

generally makes you think of a simple assignment taking place, but this is not the case if my_foo is a property. Maybe a naming convention could disambiguate? I am not a fan of the strict PEP 8 requirements on method naming, so one could potentially settle on

myobj.myMethod()
myobj.myProperty = 5
myobj.my_member_var = 3

I am thinking out loud here, and I don't have a strong opinion on this issue. That said, properties are definitely cool and will make your interfaces much more pleasant, so I recommend using them, if no other constraints prevent you from doing so.