LinuxDevCenter.com

When Linux Runs Out of Memory

by Mulyadi Santosa
11/30/2006

Perhaps you rarely face it, but once you do, you surely know what's wrong: lack of free memory, or Out of Memory (OOM). The results are typical: you can no longer allocate more memory and the kernel kills a task (usually the current running one). Heavy swapping usually accompanies this situation, so both screen and disk activity reflect this.

At the bottom of this problem lie other questions: how much memory do you want to allocate? How much does the operating system (OS) allocate for you? The basic reason for OOM is simple: you've asked for more than the available virtual memory space. I say "virtual" because RAM isn't the only place counted as free memory; swap areas count too.

Exploring OOM

To begin exploring OOM, first type and run this code snippet that allocates huge blocks of memory:

#include <stdio.h>
#include <stdlib.h>

#define MEGABYTE (1024*1024)

int main(int argc, char *argv[])
{
        void *myblock = NULL;
        int count = 0;

        while (1)
        {
                myblock = (void *) malloc(MEGABYTE);
                if (!myblock) break;
                printf("Currently allocating %d MB\n", ++count);
        }
        
        exit(0);
}

Compile the program (call it program A), run it, and wait for a moment. Sooner or later it will go OOM. Now compile the next program (program B), which allocates huge blocks and fills them with 1s:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>     /* for memset() */

#define MEGABYTE (1024*1024)

int main(int argc, char *argv[])
{
        void *myblock = NULL;
        int count = 0;

        while(1)
        {
                myblock = (void *) malloc(MEGABYTE);
                if (!myblock) break;
                memset(myblock,1, MEGABYTE);
                printf("Currently allocating %d MB\n",++count);
        }
        exit(0);
        
}

Notice the difference? Likely, program A allocates more memory blocks than program B does. It's also obvious that you will see the word "Killed" not too long after executing program B. Both programs end for the same reason: there is no more space available. More specifically, program A ends gracefully because of a failed malloc(). Program B ends because of the Linux kernel's so-called OOM killer.

The first fact to observe is the number of allocated blocks. Assume that you have 256MB of RAM and 888MB of swap (my current Linux settings). Program B ended at:

Currently allocating 1081 MB

On the other hand, program A ended at:

Currently allocating 3056 MB

Where did A get that extra 1975MB? Did I cheat? Of course not! If you look closely at both listings, you will find that program B fills the allocated memory space with 1s, while A merely allocates without doing anything. This happens because Linux employs deferred page allocation. In other words, the kernel doesn't actually allocate a page until the moment you really use it; for example, by writing data to the block. So, unless you touch the block, you can keep asking for more. The technical term for this is optimistic memory allocation.

Checking /proc/<pid>/status on both programs will reveal the facts. Here's program A:

$ cat /proc/<pid of program A>/status
VmPeak:  3141876 kB
VmSize:  3141876 kB
VmLck:         0 kB
VmHWM:     12556 kB
VmRSS:     12556 kB
VmData:  3140564 kB
VmStk:        88 kB
VmExe:         4 kB
VmLib:      1204 kB
VmPTE:      3072 kB

Here's program B, shortly before the OOM killer struck:

$ cat /proc/<pid of program B>/status 
VmPeak:  1072512 kB
VmSize:  1072512 kB
VmLck:         0 kB
VmHWM:    234636 kB
VmRSS:    204692 kB
VmData:  1071200 kB
VmStk:        88 kB
VmExe:         4 kB
VmLib:      1204 kB
VmPTE:      1064 kB

VmRSS deserves further explanation. RSS stands for "Resident Set Size." It tells you how much of the memory owned by the task currently resides in RAM. Also note that before B reaches OOM, swap usage is almost 100 percent (most of the 888MB), while A uses no swap at all. It's clear that malloc() itself does nothing more than reserve a memory area.

Another question also arises. "Even without touching the pages, why is the allocation limit 3056MB?" This exposes an unseen limit. For every application on a 32-bit system, there is 4GB of address space available. The Linux kernel usually splits the linear address space to provide 0 to 3GB for user space and 3GB to 4GB for kernel space. User space is a room where a task can do anything it wants, while kernel space is solely for the kernel. If you try to cross this 3GB border, you will get a segmentation fault.

(Side note: There is a kernel patch that gives the whole 4GB to user space, at the cost of some context-switching overhead.)

The conclusion is that OOM happens for either or both of two technical reasons:

  1. No more pages are available in the VM.
  2. No more user address space is available.

Thus the strategies to prevent those circumstances are:

  1. Know how large the user address space is.
  2. Know how many pages are available.

When you ask for a memory block, usually by using malloc(), you're asking the runtime C library whether a preallocated block is available. This block's size must at least equal the user request. If there is already a memory block available, malloc() will assign this block to the user and mark it as "used." Otherwise, malloc() must allocate more memory by extending the heap. All requested blocks go in an area called the heap. Do not confuse it with the stack, which stores local variables and function return addresses. These two sections have different jobs.

Where is the heap located in the address space? The process address map can tell you exactly where:

$ cat /proc/self/maps
0039d000-003b2000 r-xp 00000000 16:41 1080084    /lib/ld-2.3.3.so
003b2000-003b3000 r-xp 00014000 16:41 1080084    /lib/ld-2.3.3.so
003b3000-003b4000 rwxp 00015000 16:41 1080084    /lib/ld-2.3.3.so
003b6000-004cb000 r-xp 00000000 16:41 1080085    /lib/tls/libc-2.3.3.so
004cb000-004cd000 r-xp 00115000 16:41 1080085    /lib/tls/libc-2.3.3.so
004cd000-004cf000 rwxp 00117000 16:41 1080085    /lib/tls/libc-2.3.3.so
004cf000-004d1000 rwxp 004cf000 00:00 0
08048000-0804c000 r-xp 00000000 16:41 130592     /bin/cat
0804c000-0804d000 rwxp 00003000 16:41 130592     /bin/cat
0804d000-0806e000 rwxp 0804d000 00:00 0          [heap]
b7d95000-b7f95000 r-xp 00000000 16:41 2239455    /usr/lib/locale/locale-archive
b7f95000-b7f96000 rwxp b7f95000 00:00 0
b7fa9000-b7faa000 r-xp b7fa9000 00:00 0          [vdso]
bfe96000-bfeab000 rw-p bfe96000 00:00 0          [stack]

This is an actual address space layout shown for cat, but you may get different results. It is up to the Linux kernel and the runtime C library to arrange them. Notice that recent Linux kernel versions (2.6.x) kindly label the memory areas, but don't rely completely on those labels.

The heap is basically free space not already given for program mapping and stack; thus, it narrows down the available address space. It's not a full 3GB, but it's 3GB minus everything else that's mapped. The bigger your program's code segment is, the less space you have for heap. The more dynamic libraries you link into your program, the less space you get for the heap. This is important to remember.

How does the map for program A look when it can't allocate more memory blocks? With a trivial change to pause the program (see loop.c and loop-calloc.c) just before it exits, the final map is:

0009a000-0039d000 rwxp 0009a000 00:00 0 ---------> (allocated block)
0039d000-003b2000 r-xp 00000000 16:41 1080084    /lib/ld-2.3.3.so
003b2000-003b3000 r-xp 00014000 16:41 1080084    /lib/ld-2.3.3.so
003b3000-003b4000 rwxp 00015000 16:41 1080084    /lib/ld-2.3.3.so
003b6000-004cb000 r-xp 00000000 16:41 1080085    /lib/tls/libc-2.3.3.so
004cb000-004cd000 r-xp 00115000 16:41 1080085    /lib/tls/libc-2.3.3.so
004cd000-004cf000 rwxp 00117000 16:41 1080085    /lib/tls/libc-2.3.3.so
004cf000-004d1000 rwxp 004cf000 00:00 0
005ce000-08048000 rwxp 005ce000 00:00 0 ---------> (allocated block)
08048000-08049000 r-xp 00000000 16:06 1267       /test-program/loop
08049000-0804a000 rwxp 00000000 16:06 1267       /test-program/loop
0806d000-b7f62000 rwxp 0806d000 00:00 0 ---------> (allocated block)
b7f73000-b7f75000 rwxp b7f73000 00:00 0 ---------> (allocated block)
b7f75000-b7f76000 r-xp b7f75000 00:00 0          [vdso]
b7f76000-bf7ee000 rwxp b7f76000 00:00 0 ---------> (allocated block)
bf80d000-bf822000 rw-p bf80d000 00:00 0          [stack]
bf822000-bff29000 rwxp bf822000 00:00 0 ---------> (allocated block)

Six Virtual Memory Areas, or VMAs, reflect the memory request. A VMA is a memory area that groups pages with the same access permission and/or the same backing file. VMAs can exist anywhere within user space, as long as that space is available.

Now you might think, "Why six? Why not a single big VMA containing all blocks?" There are two reasons. First, it is often impossible to find such a big "hole" to coalesce the blocks into a single VMA. Second, the program does not ask to allocate that approximately 3GB block all at once, but piece by piece. Thus, the glibc allocator has complete freedom to arrange the memory however it wants.

Why do I mention available pages? Memory allocation occurs in page-sized granularity. This is not a limit of the operating system, but a feature of the Memory Management Unit (MMU) itself. Pages have various sizes, but the normal setting for x86 is 4KB. You can discover the page size by using the getpagesize() or sysconf() (with the _SC_PAGESIZE parameter) libc functions. The libc allocator manages each page: slicing them into smaller blocks, assigning them to processes, freeing them, and so on. For example, if your program uses 4097 bytes total, you need to use two pages, even though in reality the allocator gives you somewhere between 4105 and 4109 bytes.

With 256MB of RAM and no swap, you have 65536 available pages. Is that right? Not really. What you don't see is that some memory areas are in use by kernel code and data, so they're unavailable for any other need. There is also a reserved part of memory for emergencies or high-priority needs. dmesg reveals these numbers for you:

$ dmesg | grep -n kernel
36:Memory: 255716k/262080k available (2083k kernel code, 5772k reserved,
    637k data, 172k init, 0k highmem)
171:Freeing unused kernel memory: 172k freed

init refers to kernel code and data that is only necessary for the initialization stage; thus the kernel frees it when it is no longer useful. That leaves 2083 + 5772 + 637 = 8492KB. Practically speaking, 2123 pages are gone from the user's point of view. If you enable more kernel features or insert more kernel modules, you'll use up more pages for exclusive kernel use, so be wise.

Another kernel internal data structure is the page cache. The page cache buffers data recently read from block devices. The more caching work you do, the fewer free pages you actually have--but they are not really occupied, as the kernel will reclaim them when memory is tight.

From the kernel and hardware points of view, these are the important things to remember:

  1. There is no guarantee that an allocated memory area is physically contiguous; it's only virtually contiguous.

    This "illusion" comes from the way address translation works. In a protected mode environment, users always work with virtual addresses, while hardware works with physical addresses. The page directory and page tables translate between these two. For example, with 4KB pages, two blocks with starting virtual addresses 0 and 4096 could map to the page-aligned physical addresses 8192 and 20480.

    This makes allocation easier, because in reality it is unlikely to always get contiguous blocks, especially for large requests (megabytes or even gigabytes). The kernel will look everywhere for free pages to satisfy the request, not just adjacent free blocks. However, it will do a little more work to arrange page tables so that they appear virtually contiguous.

    There is a price. Because memory blocks might be non-contiguous, sometimes the L1 and L2 caches go underused. Virtually adjacent memory blocks may be spread across different physical cache lines; this means slowing down (sequential) memory access.

  2. Memory allocation takes two steps: first extending the length of the memory area and then allocating pages when needed. This is demand paging. During VMA extension, the kernel merely checks whether the request overlaps an existing VMA and whether the range is still inside user space. By default, it skips checking whether actual allocation can occur.

    Thus it is not strange if your program asks for a 1GB block and gets it, even if in reality you have only 16MB of RAM and 64MB of swap. This "optimistic" style might not please everybody, because you might get the false hope of thinking that there are still free pages available. The Linux kernel offers tunable parameters to control this overcommit behavior (overcommit_memory and overcommit_ratio under /proc/sys/vm).

  3. There are two types of pages: anonymous pages and file-backed pages. A file-backed page originates from mmap()-ing a file on disk, whereas an anonymous page is the kind you get when doing malloc(). It has no relationship with any file at all. When RAM becomes tight, the kernel swaps out anonymous pages to swap space and flushes file-backed pages to the file to make room for current requests. In other words, anonymous pages may consume swap area while file-backed pages don't. The only exception is for files mmap()-ed using the MAP_PRIVATE flag. In this case, file modification occurs in RAM only.

    This is where the understanding of swap as RAM extension comes from. Clearly, accessing the page requires bringing it back into RAM.
