Saturday, August 28, 2010

Some notes on Solaris 10 x86, 64 bit compilation, bugs and memory allocators

Over the past few months, I've spent a lot of time getting the PowerDNS Recursor to perform well on Solaris 10 on x86. Initially, I thought this could not be a lot of work since there are many happy Recursor users on UltraSPARC. "How hard could it be?"

Turns out that Solaris x86 and Solaris UltraSPARC are different in important respects.

What follows is a rather long-winded story of a stranger in a somewhat strange land. I view the world through Linux glasses. Some of the pain described below can indubitably be ascribed to that. However, some of the bits below are plainly caused by Oracle not doing a good job maintaining Solaris on x86. It appears this situation is unlikely to improve.

Before starting the rant in earnest, I'd like to thank one (so far) anonymous Sun/Oracle employee who helped me through the forest of Solaris bugtrackers and 'IDRs', and without whom this problem would definitely not have been solved. I'd also like to thank Ad, Bert, John, Martijn and Robin over at a big PowerDNS deployment for sticking through this whole adventure, and for pressuring Sun to actually fix the issues.

Here goes.

The first thing we noticed was that the 'Ports' event multiplexer failed to work in x86 applications, as described in long-standing Solaris bug 'CR 6268715 "library/libc port_getn(3C) and port_sendn(3C) not working on Solaris x86"'. Apache, libevent and PowerDNS all contain workarounds for this bug, but that workaround does come with performance implications. At the very least it is worrying.

Secondly, it turns out that Solaris 10 on x86 can't link 64-bit binaries as generated by the system gcc compiler, at least not those binaries that use Thread Local Storage for objects at global scope. This is Solaris bug 'CR 6354160', aka 'Solaris linker includes more than one copy of code in binary when linking gnu object code', which we worked around by changing PowerDNS so it could be compiled as one big C++ file.
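For readers unfamiliar with the construct, here is a minimal sketch of what "Thread Local Storage at global scope" looks like with gcc's `__thread` keyword. The counter `t_queries` and both function names are made up for illustration and are not taken from PowerDNS; `std::thread` is used only to keep the sketch short.

```cpp
#include <thread>
#include <vector>

// A thread-local variable at global (namespace) scope, declared with the
// gcc __thread extension. Every thread gets its own private copy.
// 't_queries' is a hypothetical per-thread counter, not a PowerDNS name.
__thread unsigned int t_queries = 0;

// Each worker increments its own copy 'n' times and reports the final value.
static void tlsWorker(unsigned int n, unsigned int* out)
{
  for (unsigned int i = 0; i < n; ++i)
    ++t_queries;
  *out = t_queries;
}

// Runs 'threads' workers. Returns true if every worker counted exactly 'n'
// in its own copy and the calling thread's copy was left untouched.
bool tlsCountersAreIndependent(unsigned int threads, unsigned int n)
{
  std::vector<unsigned int> results(threads, 0);
  std::vector<std::thread> pool;
  for (unsigned int t = 0; t < threads; ++t)
    pool.emplace_back(tlsWorker, n, &results[t]);
  for (auto& th : pool)
    th.join();
  for (unsigned int r : results)
    if (r != n)
      return false;
  return t_queries == 0;
}
```

When declarations like this are spread over several object files, the linker has to resolve them. Compiling everything as a single translation unit, as we ended up doing, leaves no gcc-generated objects to combine.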

Using the native Sun Studio compiler failed, because it is not compliant enough with the C++ standard to compile PowerDNS, and the changes required were non-trivial.

Although both issues (port_getn() and 64-bit linking) were known, and fixes were available in OpenSolaris, these had not made it into Solaris 10 production releases.

Eventually, PowerDNS was able to work around both bugs, but in the case of 6268715 at a runtime performance cost (note: Sun has now shipped 'IDR145429-01' which fixes this).

Which brings us to performance. For some reason, even though the PowerDNS Recursor uses 'share nothing' threads, there was no scalability when using multiple threads on Solaris. In fact performance was rather dismal anyhow, even with only one thread.

Firstly, we discovered that having multiple threads try to wait on a single socket does not scale beyond a single thread. This was fixed by having only a single thread wait on the socket, and manually distributing queries over threads in a round-robin fashion.
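The idea can be sketched as follows. This is an illustration only and not the actual Recursor code: the class name, the per-worker queues and the use of `std::string` for a query are all assumptions made to keep the example self-contained. The single thread that reads the socket calls `dispatch()`; each worker only ever touches its own queue.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <vector>

// One queue per worker thread, each with its own lock, so workers
// never contend with one another (hypothetical structure).
struct WorkerQueue {
  std::mutex lock;
  std::deque<std::string> queries;
};

class RoundRobinDistributor {
public:
  // 'workers' must be greater than zero.
  explicit RoundRobinDistributor(std::size_t workers)
    : d_queues(workers), d_next(0) {}

  // Called only by the one thread that waits on the socket.
  // Returns the index of the worker that received the query.
  std::size_t dispatch(const std::string& query)
  {
    std::size_t target = d_next;
    d_next = (d_next + 1) % d_queues.size();
    std::lock_guard<std::mutex> guard(d_queues[target].lock);
    d_queues[target].queries.push_back(query);
    return target;
  }

  // Called by worker 'id' to fetch its next query.
  // Returns false if nothing is queued for that worker.
  bool take(std::size_t id, std::string& out)
  {
    std::lock_guard<std::mutex> guard(d_queues[id].lock);
    if (d_queues[id].queries.empty())
      return false;
    out = d_queues[id].queries.front();
    d_queues[id].queries.pop_front();
    return true;
  }

private:
  std::vector<WorkerQueue> d_queues;
  std::size_t d_next; // touched only by the dispatching thread
};
```

A real implementation would also need a way to wake up an idle worker, for example a pipe or condition variable per worker; that part is left out here.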

This turned out to help slightly, but not decisively. We then discovered that the default Solaris x86 memory allocator ('malloc()') is effectively single-threaded (unlike the UltraSPARC variant, which is completely different!). Solaris ships with no fewer than two alternative mallocs, linked in with -lmtmalloc and -lumem respectively. Using libumem improved our benchmark results.
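A rough way to see whether an allocator serializes threads is to time the same malloc()/free() churn with one thread and with several. The sketch below is a hypothetical micro-benchmark, not the test we actually ran; the function names and the block size are arbitrary choices.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <thread>
#include <vector>

// Each thread repeatedly allocates and frees one small block.
static void churn(unsigned int rounds, std::size_t size)
{
  for (unsigned int i = 0; i < rounds; ++i) {
    void* p = std::malloc(size);
    if (p) {
      // Touch the block so the allocation is less likely to be optimized away.
      static_cast<volatile char*>(p)[0] = 1;
      std::free(p);
    }
  }
}

// Returns the wall-clock seconds needed for 'threads' threads to each
// perform 'rounds' malloc/free pairs of 'size' bytes.
double timeMallocChurn(unsigned int threads, unsigned int rounds, std::size_t size)
{
  auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> pool;
  for (unsigned int t = 0; t < threads; ++t)
    pool.emplace_back(churn, rounds, size);
  for (auto& th : pool)
    th.join();
  std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
  return elapsed.count();
}
```

On a machine with at least four cores, compare `timeMallocChurn(1, N, 64)` with `timeMallocChurn(4, N, 64)`. If the allocator scales, the two times should be close, since each thread does the same amount of work. If it takes a global lock, the four-thread run should take noticeably longer. Relinking with -lmtmalloc or -lumem and rerunning gives a quick comparison between allocators.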

Finally, for Solaris, we had to bring back an old favorite, the 'fork-trick' which makes the whole PowerDNS Recursor fork itself into multiple processes, which helped bring Solaris performance up to par with our other major platform, Linux. We don't yet know why our 'share nothing' threads end up interfering with each other.
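In outline, the 'fork-trick' looks like the sketch below. It is a simplified illustration under assumptions: the function names are invented, and the real Recursor children keep serving queries instead of returning right away. The parent forks a number of worker processes, each of which runs the same work function in its own address space (and therefore with its own allocator state), and then waits for them.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

// Forks 'count' worker processes. Each child runs work(id) and exits with
// its return value. The parent waits for all children and returns how many
// of them exited with status 0.
int runForkedWorkers(int count, int (*work)(int))
{
  std::vector<pid_t> children;
  for (int id = 0; id < count; ++id) {
    pid_t pid = fork();
    if (pid == 0) {
      // Child: do the work and leave without running the parent's cleanup.
      _exit(work(id));
    }
    if (pid > 0)
      children.push_back(pid);
    // pid < 0 means fork() failed; this sketch simply skips that worker.
  }

  int clean = 0;
  for (pid_t pid : children) {
    int status = 0;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status) && WEXITSTATUS(status) == 0)
      ++clean;
  }
  return clean;
}

// Placeholder worker: a real one would enter the query-processing loop.
int exampleWorker(int id)
{
  (void)id;
  return 0;
}
```

For example, `runForkedWorkers(4, exampleWorker)` starts four processes and reports how many finished cleanly.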

The resulting work was taken into production... and crashed within 5 minutes under heavy load, with an error indicating it had run out of memory. With a 64 bit binary on an 8 gigabyte machine, a genuine shortage of memory seemed doubtful.

After some further investigation, we found that while libumem certainly was faster for multithreaded code, it also wastes memory on a prodigious scale. To be honest, this may be due to the fact that the g++ C++ runtime libraries are not making optimal use of the allocator, or to our use of get/set/swap/makecontext(), but the amount of memory used was staggering. Think 450MB for storing 10MB of content.

We studied some of the articles available online, among which was 'A Comparison of Memory Allocators' on the 'Oracle Sun Development Network'. This one indeed showed graphs of libumem using large amounts of memory, and a thing called ptmalloc using very little. Oddly enough, ptmalloc is (more or less) the default allocator for Linux too.

We then built a PowerDNS with all the workarounds, plus ptmalloc linked in, and now finally have something that survives production use!

Rounding this off:
  • Solaris x86 is remarkably different from Solaris UltraSPARC (different bugs, different allocators)
  • Do not have n>1 threads wait on a single datagram socket file descriptor; it does not scale
  • There now IS an IDR to get port_getn() working, IDR145429-01, which should also speed up Apache and several other high-performance applications for Solaris
  • To build 64-bit binaries with thread local storage (__thread) at global scope, concatenate all your C++ into one big file, and compile that one
  • Be aware that the default allocator on Solaris 10 x86 is single-threaded
  • Be aware that both mtmalloc and libumem may use prohibitive amounts of memory for some programs
  • Consider ptmalloc3
  • We still have to investigate why fork() scales better than pthread_create()
  • Make sure that you have some friends within Sun engineering ;-)
All in all, we still consider Solaris 10 x86 a 'supported platform' for the PowerDNS Recursor, but along the way we had some doubts.. Solaris 10 on UltraSPARC continues to work very well meanwhile!

Bert