Python DevCenter


Understanding Network I/O, Part 2

Asynchronous I/O

Asynchronous I/O is a technique specifically targeted at handling multiple I/O requests efficiently. In contrast, threads are a general concurrency mechanism that can be used in situations not related to I/O. Most modern operating systems, such as Linux and Windows, support asynchronous I/O.



Asynchronous I/O works very differently from threads. Instead of having an application spawn multiple tasks (that can then be used to perform I/O), the operating system performs the I/O on the application's behalf. This makes it possible for just one thread to handle multiple I/O operations concurrently. While the application continues to run, the operating system takes care of the I/O in the background.

Because of potentially more efficient kernel-level I/O processing, a reduction in the total number of threads in the system, and dramatically fewer context switches, asynchronous I/O is sometimes the best method to use. Its major disadvantage is an increase in the complexity of the application's logic, and that increase can be very significant.

Two common ways of asking the operating system to perform asynchronous I/O are the select and poll system calls. While Python provides direct access to these facilities (via the select module), there are easier ways to take advantage of asynchronous I/O in your programs.
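To give a feel for what the select call does, here is a minimal sketch written in modern Python 3 syntax (unlike the Python 2 listings in this article). The helper name wait_for_readable and the use of socket.socketpair() as a stand-in for real network connections are assumptions made for illustration only.

```python
import select
import socket

def wait_for_readable(socks, timeout=1.0):
    """Return the sockets that select() reports as ready for reading."""
    readable, _, _ = select.select(socks, [], [], timeout)
    return readable

# Two connected socket pairs stand in for two network connections.
a_local, a_remote = socket.socketpair()
b_local, b_remote = socket.socketpair()

# Only the first "remote" end sends anything.
a_remote.sendall(b"hello")

# A single thread asks the operating system which sockets have data.
ready = wait_for_readable([a_local, b_local])
messages = [s.recv(1024) for s in ready]

for s in (a_local, a_remote, b_local, b_remote):
    s.close()
```

Only the socket that actually has data is reported, so the recv call does not have to wait.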

In particular, the Twisted framework, as mentioned in the previous article, makes working with asynchronous I/O quite painless in many cases. The next subsection will present a Twisted-based variation on our weather reader example.

The asyncore library provides another alternative to using poll or select directly. This is a lightweight facility that remains sufficiently low-level to give you a good look at the nature of asynchronous I/O. See the subsection The Asyncore Library for details.

The Twisted Framework

Twisted is a large, comprehensive framework. It includes many diverse components such as a web server, a news server, and a web spider client. Achieving I/O concurrency with Twisted is not difficult, as the following example illustrates.

Example 6. A Twisted Framework client

# Import the Twisted network event monitoring loop.
from twisted.internet import reactor
# Import the Twisted web client function for retrieving
# a page using a URL.
from twisted.web.client import getPage

import re      # Library for finding patterns in text.

# Twisted will call this function to process the retrieved web page.
def process_result(webpage,name,url,nrequests):
    # Pattern which matches text like '66.9 F'.  The last
    # argument ('re.S') is a flag, which effectively causes
    # newlines to be treated as ordinary characters.
    match = re.search(r'(-?\d+(?:\.\d+)?) F',webpage,re.S)

    # Print out the matched text and a descriptive message;
    # if there is no match, print an error message.
    if match is None:
        print 'No temperature reading at URL:',url
    else:
        print 'In '+name+', it is now',match.group(1),'degrees.'

    # Keep a shared count of requests (see article text for details).
    nrequests[0] = nrequests[0] - 1 # Just finished a request.
    if nrequests[0] <= 0:  # If this is the last request ...
        reactor.stop()     # ... stop the Twisted event loop.


# Twisted will call this function to indicate an error.
def process_error(error,name,url,nrequests):
    print 'Error getting information for',name,'( URL:',url,'):'
    print error

    # Keep a shared count of requests (see article text for details).
    nrequests[0] = nrequests[0] - 1 # Just finished a request.
    if nrequests[0] <= 0:  # If this is the last request ...
        reactor.stop()     # ... stop the Twisted event loop.


# Three NOAA web pages, showing current conditions in New York,
# London and Tokyo, respectively.
citydata = (('New York','http://weather.noaa.gov/weather/current/KNYC.html'),
            ('London',  'http://weather.noaa.gov/weather/current/EGLC.html'),
            ('Tokyo',   'http://weather.noaa.gov/weather/current/RJTT.html'))

# Initialize the shared count of the number of requests.  This will be
# passed as an argument to the callback functions above.  It cannot
# be a simple integer (see article text for an explanation).
nrequests = [len(citydata)]

# Tell Twisted to get the above pages; also register our
# processing functions, defined previously.
for name,url in citydata:
    getPage(url).addCallbacks(callback = process_result,
                              callbackArgs = (name,url,nrequests),
                              errback = process_error,
                              errbackArgs = (name,url,nrequests))

# Run the Twisted event loop.
reactor.run()

Here is the output produced by the client:

Example 7. Twisted Framework client output

In London, it is now 46 degrees.
In New York, it is now 37.9 degrees.
In Tokyo, it is now 48 degrees.

Note that -- just like with the thread-based examples -- the output may be in a different order from the inputs. After all, asynchronous I/O is also a concurrency technique. As before, when several requests are performed in parallel, the faster ones will tend to pass the slower ones, finishing earlier.

Even with Twisted hiding the low-level details, the event-driven nature of asynchronous I/O is readily apparent in the example. When certain events take place, Twisted calls the functions we have previously supplied. These functions are known as callbacks, because you call the framework to pass the functions to it, and the framework subsequently calls them back. Callbacks are also common in other event-driven systems, such as GUI libraries.
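The callback pattern itself can be sketched without any framework at all. The toy class below is purely hypothetical (TinyFramework and when_ready are invented names, not part of Twisted) and uses Python 3 syntax; it only illustrates the idea of handing a function to a framework that calls it back later.

```python
class TinyFramework:
    """A toy stand-in for an event-driven framework (hypothetical)."""

    def __init__(self):
        self._events = []

    def when_ready(self, value, callback, *callback_args):
        # We call the framework to hand over our function ...
        self._events.append((value, callback, callback_args))

    def run(self):
        # ... and the framework later calls it back, one event at a time.
        while self._events:
            value, callback, callback_args = self._events.pop(0)
            callback(value, *callback_args)

reports = []

def process_result(reading, name):
    reports.append("In %s, it is now %s degrees." % (name, reading))

framework = TinyFramework()
framework.when_ready("46", process_result, "London")
framework.when_ready("37.9", process_result, "New York")
framework.run()
```

Nothing happens until run() is called; from then on, the framework decides when each callback is invoked.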

Do not wait (block) inside your Twisted callbacks; complete the required processing as quickly as possible and return control to the framework. Any waiting will stall the other requests, because only a single thread is doing all of the work.

Twisted defines a special construct, Deferred, for triggering callbacks. In our example, the getPage function actually returns a Deferred object. We then use its addCallbacks method to register our result-processing and error-handling functions.

The last line in our program (reactor.run()) starts the Twisted event loop. This transfer of control is common in event-driven systems; it allows the framework to invoke our callbacks in response to events. Our program will terminate when reactor.run() returns.

Now that we have surrendered control to Twisted, however, how can we make reactor.run() return? The example issues a reactor.stop() from either of our two callbacks. In order to prevent Twisted from exiting prematurely, we keep track of how many requests are left to process and only call reactor.stop() after all the requests have been processed.

We store the count of outstanding requests in a standard Python list that we pass as an argument to our callbacks. In your own code, you may want to create a counter class for this purpose. Alternatively, you can write your callbacks as methods of a class, keeping the count in an attribute. In any case, do not pass a simple integer to the callback. Integers are immutable, so assigning a new value inside the function only rebinds a local name; the change is discarded when the callback returns. See the Python Reference Manual for a deeper understanding of these issues.
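The difference between the two approaches can be seen in a short sketch (Python 3 syntax). The function names and the RequestCounter class are hypothetical, introduced only to illustrate the point; they do not appear in the example program.

```python
def finish_with_int(count):
    # Rebinds the local name only; the caller's integer is untouched.
    count = count - 1

def finish_with_list(count):
    # Mutates the shared list object, so the caller sees the change.
    count[0] = count[0] - 1

class RequestCounter:
    """Hypothetical counter class, as an alternative to the list."""

    def __init__(self, outstanding):
        self.outstanding = outstanding

    def finished_one(self):
        """Record one finished request; return True if it was the last."""
        self.outstanding -= 1
        return self.outstanding <= 0

plain = 3
finish_with_int(plain)       # plain is still 3 afterwards

shared = [3]
finish_with_list(shared)     # shared[0] has been decremented

counter = RequestCounter(3)
last_flags = [counter.finished_one() for _ in range(3)]
```

With a counter object, a callback could call reactor.stop() as soon as finished_one() returns True.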

Although all invocations of the callback share the count, we need no locking to protect the value. Only one thread makes every call, so each invocation must complete before the next one can start. This ensures that the count is always consistent, because no one operation on it may preempt another already in progress.

The Asyncore Library

Asyncore is another Python project for dealing with asynchronous I/O. In contrast to the large, comprehensive Twisted, asyncore is small, lightweight, and included as part of the standard Python distribution. You may also be interested in asynchat (also included with Python), which provides extra functionality on top of asyncore. The well-known Zope project, a powerful and sophisticated web-application server, is built using asyncore.

Asyncore's minimalist approach comes at a price. While this library is higher level than using select or poll directly, it does not provide additional facilities such as HTTP protocol support. Asyncore is a fundamental building block, with a tight focus on just the I/O process itself.

The asyncore documentation includes an easy-to-follow web client example. It is immediately clear from this example that we must do all the work pertaining to the HTTP protocol ourselves. Asyncore provides only the I/O channel. The example also illustrates how to use asyncore in our programs: by writing a class that inherits from a base class supplied by the library.

Now we are ready to reimplement our weather reader with asyncore. Ideally, we would like to reuse the code from the Twisted client. After all, neither the logic of our program nor the underlying I/O method will change. In the following example, we substitute our own (asyncore-derived) CustomDispatcher class for the facilities previously provided by Twisted, leaving the rest of our code virtually intact.

Example 8. An Asyncore client

import asyncore # Lightweight library for asynchronous I/O. 
import re       # Library for finding patterns in text.

# Our asyncore-based dispatcher class.
import CustomDispatcher 

# Function to process the retrieved web page.
def process_result(webpage,name,url):
   
    # Pattern which matches text like '66.9 F'.  The last
    # argument ('re.S') is a flag, which effectively causes
    # newlines to be treated as ordinary characters.
    match = re.search(r'(-?\d+(?:\.\d+)?) F',webpage,re.S)

    # Print out the matched text and a descriptive message;
    # if there is no match, print an error message.
    if match is None:
        print 'No temperature reading at URL:',url
    else:
        print 'In '+name+', it is now',match.group(1),'degrees.'

# Function to indicate an error.
def process_error(error,name,url):
    print 'Error getting information for',name,'( URL:',url,'):'
    print error

# Three NOAA web pages, showing current conditions in New York,
# London and Tokyo, respectively.
citydata = (('New York','http://weather.noaa.gov/weather/current/KNYC.html'),
            ('London',  'http://weather.noaa.gov/weather/current/EGLC.html'),
            ('Tokyo',   'http://weather.noaa.gov/weather/current/RJTT.html'))

# Create one asyncore-based dispatcher for each of the above pages;
# also register our callback functions, defined previously.
for name,url in citydata:

    # No need to save the result of the constructor call, because
    # asyncore keeps a reference to our dispatcher objects.
    CustomDispatcher.CustomDispatcher(url,
                                      process_func = process_result,
                                      process_args = (name,url),
                                      error_func = process_error,
                                      error_args = (name,url))

# Run the asyncore event loop.  The loop will terminate automatically
# once all I/O channels have been closed.
asyncore.loop()

The output is the same as in the Twisted example (of course, the order of the results returned may be different for each run). In addition, the code to stop the Twisted reactor is no longer required; asyncore will automatically exit its loop when all I/O channels have closed.

Most of the work required to create the asyncore example is actually in writing the CustomDispatcher class. Because of the number of low-level details it must handle, CustomDispatcher is quite a long piece of code compared to the other programs shown in this article. You can download it from the previous link or read it in the appendix.

CustomDispatcher strives to be a fairly complete example that is also compatible with several versions of Python and asyncore. In addition, the goal is to write simple code that makes it easier to understand the nature of asynchronous I/O, rather than to come up with an optimal implementation.

As mentioned in the Twisted discussion, programs relying on asynchronous I/O are event-driven by nature. This is certainly different from the threaded examples given earlier. The CustomDispatcher class is sufficiently low-level to clearly bring out these differences.

When using synchronous I/O with threads, the physical layout of the program can correspond closely with its internal logic. For instance, each thread in our multitasking examples performs, in order, the following tasks:

  1. Send a request to the server.
  2. Receive the response.
  3. Process the response and output the results.

All of these operations can be written naturally, from the top down, in the program's source code. While urllib takes care of the first two steps in our examples, it still does so in the context of the threads we create.

As a thread performs the first two steps, it may have to wait an unpredictable amount of time for the network I/O to complete. When a thread is waiting, the operating system will allow other threads to run. Thus, if one of the threads has entered a lengthy sleep (e.g., in step 2), it will not prevent the other threads from performing step 3.

The situation changes completely when we use asynchronous I/O. Now, waiting is not allowed -- there is only one thread doing all the work. Instead, we perform I/O operations only when the operating system tells us that they will succeed immediately.

For each such I/O event, it is entirely likely that we will write less data than is needed to complete our request, or read only part of the incoming reply, etc. The unfinished work will have to be continued when the next I/O event comes. We must therefore store enough state information in order to resume the partially completed operation correctly at a later time.
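The following sketch shows one way such state might be kept for a partially written request. It is not taken from CustomDispatcher; the OutgoingRequest class is hypothetical, a local socket pair stands in for the network, and the code uses Python 3 syntax.

```python
import select
import socket

class OutgoingRequest:
    """Remembers how much of a message is still unsent (hypothetical)."""

    def __init__(self, sock, data):
        self.sock = sock
        self.unsent = data

    def handle_write(self):
        try:
            sent = self.sock.send(self.unsent)  # may accept only part
        except BlockingIOError:
            return                              # try again on the next event
        self.unsent = self.unsent[sent:]        # keep the remainder for later

    def done(self):
        return not self.unsent

sender, receiver = socket.socketpair()
sender.setblocking(False)
receiver.setblocking(False)

payload = b"x" * (256 * 1024)
request = OutgoingRequest(sender, payload)
received = bytearray()

# One thread services both directions, acting only on reported events.
while len(received) < len(payload):
    want_write = [] if request.done() else [sender]
    readable, writable, _ = select.select([receiver], want_write, [], 5.0)
    if not readable and not writable:
        raise RuntimeError("timed out waiting for I/O events")
    if writable:
        request.handle_write()
    if readable:
        received.extend(receiver.recv(65536))

sender.close()
receiver.close()
```

The unsent attribute is the saved state: each write event picks up exactly where the previous one left off.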

Asyncore translates the results of the low-level system call (select or poll) into calls to handle_read, handle_write, and so on. We provide these methods in our CustomDispatcher class.
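As a rough illustration of this translation, the sketch below implements a drastically simplified loop of the same general shape. It does not use asyncore itself; Channel and mini_loop are hypothetical names, only read events are handled, and the code is written in Python 3 syntax.

```python
import select
import socket

class Channel:
    """Hypothetical, much-reduced analogue of a dispatcher object."""

    def __init__(self, sock):
        self.sock = sock
        self.chunks = []
        self.closed = False

    def fileno(self):
        # select() accepts any object that provides a fileno() method.
        return self.sock.fileno()

    def handle_read(self):
        data = self.sock.recv(4096)
        if data:
            self.chunks.append(data)
        else:
            self.handle_close()   # an empty read means the peer closed

    def handle_close(self):
        self.closed = True
        self.sock.close()

def mini_loop(channels, timeout=1.0):
    """Turn select() results into handle_read calls until all channels close."""
    while channels:
        readable, _, _ = select.select(channels, [], [], timeout)
        if not readable:
            break                 # nothing happened within the timeout
        for channel in readable:
            channel.handle_read()
        channels = [c for c in channels if not c.closed]

local, remote = socket.socketpair()
remote.sendall(b"abc")
remote.close()

channel = Channel(local)
mini_loop([channel])
page = b"".join(channel.chunks)
```

As with asyncore.loop(), this loop returns on its own once every channel has been closed.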

Our class is also a great place to keep state information. In particular, note the __is_header member variable. It is used as a flag to indicate that we have not yet finished reading the HTTP header.

Due to the nature of asynchronous I/O, it is likely that handle_read will be called multiple times before the entire web page is read. In addition, one of these read operations will probably wind up reading the last part of the HTTP header and the first part of the body. After all, the low-level asynchronous I/O routines are not familiar with the HTTP protocol. The transition from header to body is meaningless to them. Our handle_read method must carefully preserve any body content as it discards the header; otherwise part of the information we are interested in would be lost. Keep these sorts of issues in mind when working with asynchronous I/O.
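A minimal sketch of this bookkeeping follows. It is not the actual CustomDispatcher code: the HeaderStripper class and its attribute names are assumptions made for illustration, and the chunks are fed in by hand rather than read from a socket (Python 3 syntax).

```python
class HeaderStripper:
    """Discards an HTTP header that may arrive split across several reads."""

    def __init__(self):
        self.is_header = True   # still inside the header?
        self.pending = b""      # header bytes seen so far
        self.body = b""         # everything after the blank line

    def feed(self, chunk):
        if not self.is_header:
            self.body += chunk
            return
        self.pending += chunk
        # The header ends at the first blank line (CRLF CRLF).
        head, separator, rest = self.pending.partition(b"\r\n\r\n")
        if separator:
            self.is_header = False
            self.body += rest   # preserve body bytes read with the header
            self.pending = b""

stripper = HeaderStripper()
# Simulated reads: the second chunk contains the end of the header
# and the beginning of the body.
for chunk in (b"HTTP/1.0 200 OK\r\nContent-Type: text/html\r",
              b"\n\r\n<html>66",
              b".9 F</html>"):
    stripper.feed(chunk)
```

Accumulating the header bytes in pending also covers the case where the blank line itself is split between two reads.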

When writing your own asyncore-based dispatcher classes, you may also want to override the handle_expt, handle_error, and log methods. These methods deal with Out-Of-Band data (OOB), unhandled errors, and logging, respectively. See the asyncore documentation and the library source code itself (file asyncore.py, installed on your hard drive in the same place as the rest of the standard Python library) for more information. The asyncore source code is actually quite easy to read. Also note that OOB is a rarely used feature of the TCP/IP protocol family.

CustomDispatcher uses Python's built-in apply function to call the supplied callback functions. This allows the list of arguments to the callbacks to be generated dynamically. Note, however, that apply has been deprecated in Python version 2.3. Unless you want to support old versions of the language (notably version 1.5), you should use the extended call syntax to achieve the same result. See the documentation of the deprecated apply function for a description of the extended call syntax.
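For comparison, here is a short sketch of the extended call syntax in Python 3, where apply no longer exists. The callback body and the example URL are placeholders chosen for illustration; only the calling convention matters.

```python
def process_result(webpage, name, url):
    return "%s (%s): %d bytes" % (name, url, len(webpage))

process_func = process_result
process_args = ("London", "http://example.invalid/london")

# Old style: apply(process_func, ("<html></html>",) + process_args)
# Extended call syntax: unpack the argument tuple with '*'.
result = process_func("<html></html>", *process_args)
```

Keyword arguments stored in a dictionary can be passed the same way, using '**'.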
