Published on (
 See this if you're having trouble printing code examples

Linux Multithreading Advances

by Jerry Cooperstein

Recent advances in Linux's threading implementation are expected to continue to ease migration from other Unix-like operating systems. These advancements have arrived with intense activity on two fronts. First, thread-handling improvements have greatly enhanced the kernel's scalability even to thousands of threads. Second, there are now two fresh, competing implementations of the POSIX pthreads standard (NGPT and NPTL) set to replace the aging LinuxThreads library.

In typical open source fashion, only time will tell exactly who wins out in which arena. However, both new library implementations should be API-compatible with the standard, so the choice will depend on performance and stability. The required changes will appear in the upcoming Linux 2.6 (or 3.0) kernel and can already be tested in late development versions.

Multithreading on Linux

Threading implementations typically have components in both user and kernel space. It is possible to do everything from the one side or the other, but each approach has problems. With everything on the user side, all related threads are part of one single process (which can only run on one CPU at a time), and multi-processor systems are underutilized. With everything on the kernel side, the kernel scheduler must bear a heavy load.

Related Reading

Understanding the Linux Kernel
By Daniel P. Bovet, Marco Cesati

Approaches have ranged from the 1:1 pure kernel thread model in which each user thread has its own kernel thread, to the M:1 model in which the kernel sees only one normal process, with an arbitrary number of threads with which to schedule in user space. The M:N model falls in between, associating M user threads with each of N kernel threads.

The Linux kernel uses the clone() function to create new processes. Flags control parent/child resource sharing, where resources range from everything (memory, signal handlers, file descriptors, etc.) to nothing. While the usual fork() inherits resources from the parent, it may share nothing. Copy-on-write techniques ensure each process gets its own copy as soon as either one tries to modify a shared resource.

Programs can call the clone() function as a system call, using it directly to produce multithreaded programs. However, it is completely Linux-specific and non-portable. Since there is no external standard, there is no guarantee that its interface will be stable. Threading library implementations do in fact use the clone() system call, and it is the job of library maintainers to keep up with kernel changes.

The LinuxThreads implementation of the POSIX threads standard (pthreads), originally written by Xavier Leroy, has been the dominant one for years and is now incorporated and maintained in glibc. It has two problems on Linux:

Compliance Issues

Virtually all compliance problems can be traced to the decision to use lightweight processes (LWPs, or the 1:1 model described above) as the basis of the implementation. New processes are created by clone(), with a shared-everything approach. While the new process is lighter due to the sharing, fundamentally it is still a process in its own right, with its own process identifier (pid), process descriptor, etc.

This led to the following standards compatibility problems:

If a pthreads application were written for Linux, one could expect easy portability. However, the inverse process, porting to Linux, was more difficult and slowed Linux deployment since important applications were now broken.

Some problems were resolvable by relatively minor kernel adjustments. For example, by modifying the basic data structure describing each process, (struct task_struct) to store a thread group identifier and some other bookkeeping, and then modifying the getpid() system call to return this identifier rather than the process identifier, one problem could be solved.

However, many key kernel developers resisted attempts to modify the kernel for compliance sake. On one hand, their taste runs to technically superior solutions rather than to "cut the toes to fit the shoes" to comply with standards. On the other hand, there was an aversion to creating many threads. Sentiments like "there is no need to create more threads than there are processors" were common. Thread-prolific languages such as Java were looked at with contempt for many reasons.

Performance Issues

The main performance problem has been scalability with growing numbers of threads. These difficulties are not unique to threads, but apply in all cases where the number of processes grows large.

Consider the process of obtaining a new pid. In the 2.4 kernel, Linux has to loop through all processes to ensure a candidate pid is not already assigned. With an outer loop on possible candidates, the time spent may scale quadratically; if there are thousands of processes, the system can slow down to a crawl. Since each thread has its own pid, creating zillions of threads is poisonous.

New Generation POSIX Threads

A group at IBM and Intel, led by Bill Abt at IBM, released the first version of the New Generation POSIX Threads (NGPT) library in May 2001. This consisted of a drop-in replacement for LinuxThreads, together with patches for kernels beginning with 2.4.0.

To ease acceptance, the group made a conscious effort to keep the impact on the kernel small. They worked to get the kernel modifications they needed through patient, piece-by-piece promotion and expected to have NGPT eventually replace LinuxThreads in the glibc system.

NGPT is a derivative of the GNU Pth (GNU Portable Threads) package, which up to now is based on an M:1 model. A user space priority and event-based, non-preemptive scheduler manages the M user threads. This was seen as an improvement over the 1:1 pure kernel thread model used by LinuxThreads where the kernel has to do a lot of scheduling work.

NGPT adopted the M:N hybrid model. Many developers saw this as the best path to good performance: keep all CPU's humming, minimize context switching between kernel threads, and switch mostly between user space threads. However, the M:N model is complex. It requires two cooperating schedulers, one each in user and kernel space. Signal handling is difficult and much work has to be done in user space. It takes fancy footwork to prevent one blocked thread from blocking other threads running in the same process.

Nonetheless, the NGPT team succeeded in implementing the full pthreads standard, and the kernel changes they needed were accepted in the mainline kernel early in the 2.5 development process (at kernel 2.5.4). They were also back-ported into the 2.4.19 kernel. Depending on the metric used, performance gains were claimed of up to 100 percent, and work continues on improvements.

On March 26-27, 2002, Compaq hosted a meeting to discuss the future replacement for the LinuxThreads library. In attendance were members of the NGPT team, some employees of (then distinct) Compaq and Hewlett-Packard, and representatives of the glibc team, including the head maintainer, Ulrich Drepper (a Red Hat employee), who wrote a summary of the meeting.

Pursuing the M:N approach, the report said:

"This is one of the reasons why it is absolutely necessary to think about two-level scheduling for the threads. I.e., the actual user threads are different from the kernel threads (or light-weighted process, or what ever one wants to call them) and scheduled separately. This is generally called the M-on-N model for a thread implementation. The 1-on-1 model dedicates a unique kernel thread for each user-level thread; this is the model used by the current, inadequate thread library implementation."

The report contains detailed analysis of how to get kernel and user-space schedulers to cooperate using the scheduler activations method.

It seemed the replacement for LinuxThreads would be based on NGPT.

Native POSIX Thread Library

On September 19, 2002, Ulrich Drepper and Ingo Molnar (also of Red Hat) released an alternative to NGPT called the Native POSIX Thread Library (NPTL). The project included a new user space library, changes to glibc, and kernel modifications. The initial announcement said in part:

"Unless major flaws in the design are found this code is intended to become the standard POSIX thread library on Linux system, and it will be included in the GNU C library distribution."

NPTL is based squarely on the 1:1 pure kernel thread model. A white paper explains why in detail.

Recent changes to kernel thread handling (mostly due to Ingo Molnar) had vastly improved thread performance. With these changes in place, the relative simplicity of the 1:1 model became very attractive.

Related Reading

Understanding the Linux Kernel
By Daniel P. Bovet, Marco Cesati

There is only one scheduler. Signal handling remains in the kernel's hands. Blocking problems are handled naturally because each kernel thread schedules independently. In addition, the user space implementation becomes fundamentally simpler.

In some sense, one has come full circle; developers who wanted to ensure full Posix compliance were frustrated by the kernel maintainers' unwillingness to adapt the Linux kernel to fit their needs, and thus NGPT was developed in part as a polite end run requiring minimal kernel changes. Then a programming tour de force, mostly by one key kernel programmer, is now claimed to enable reversion to a much simpler approach.

Linux Kernel Improvements

What changes have been made in the Linux kernel to make threads perform and scale better?

Consider the previous example of obtaining a new pid. The potentially quadratic search is gone. Instead, the kernel sets aside a small but dynamic number of memory pages as bitmaps for process identifiers. Obtaining a new pid means finding a page with free entries and then finding and clearing the first set bit. No locking is required, and the search time is very short and almost independent of the number of running processes.

Another key improvement is the O(1) scheduler, which no longer has to cycle through all processes to find the most deserving one. Each CPU has its own queue, a simple priority-sorted bitmask. Once again finding a new process is very fast and scales fantastically.

Where Do We Go From Here?

The NPTL team posted some benchmarks, such as this display of the minimum time needed to create a number of top-level threads.

In general, while NGPT beat the old methods by a factor of two, NPTL could do better by another factor of two.

It remains to be seen exactly how the two implementations will rank against each other. NGPT may not yet be tuned to take advantage of recent kernel improvements the way NPTL has. Furthermore, benchmarks are often used to misrepresent. It will take further development by both teams, independent benchmarks, and real life comparisons to see who really beats whom.

You can test drive NGPT by simply downloading the library and installing it, as long as you have kernel 2.4.19 or later. For NPTL, you can download the library, but you will need a very recent development kernel as well as bleeding edge glibc and gcc. The announcement contains detailed instructions.

While there may be some hard feelings on the socio-political side about how NPTL seemed to come out of the blue, the maintainers of NGTP have not griped in public. It seems that any battle between the two implementations will now be played out in public, in good open source fashion. Either one library will win out over the other, or each will become the preferred tool in some universe for some load. At any rate it will be fun to see what happens. Linux will benefit by having a standards-compliant, and well-performing threads implementation(s).

Jerry Cooperstein is a senior consultant and Linux training specialist at Axian Inc., in Beaverton Oregon, and lives in Corvallis, Oregon.

Return to the Linux DevCenter.

Copyright © 2009 O'Reilly Media, Inc.