Python DevCenter

Web Client Programming in Python

by Dave Warner
06/21/2000

Web client programming is a powerful technique for querying the Web. A web client is any program that retrieves data from a web server using the Hypertext Transfer Protocol (HTTP, the http in your URLs). A web browser is a client; so are web crawlers, programs that traverse the Web automatically to gather information. You can also use web clients to take advantage of services offered by others on the Web and to add dynamic features to your own web site.

Web client programming belongs in any developer's toolbox. Perl aficionados have employed it for years. In Python, the process reaches even higher levels of convenience and flexibility. Three modules provide most of the functionality you will need: HTTPLIB, URLLIB, and a newer addition, XMLRPCLIB (imported in code under their lowercase names httplib, urllib, and xmlrpclib). In true Pythonesque fashion, each module builds upon its predecessor, providing a solid, well-designed base for your applications. We will cover the first two modules in this article, saving XMLRPCLIB for a later time.

For our examples, we will use Meerkat. If you are like me, you invest time tracking trends and developments in the open source community to give you a competitive edge. Meerkat is a tool that makes that task much easier. It is an open wire service that collects and collates an enormous amount of information on open source computing. Although its browser interface is flexible and customizable, using web client programming we can scan, extract, and even store this information off-line for later use. We will first access Meerkat using HTTPLIB interactively, and then move on to accessing Meerkat's Open API via URLLIB to create a customizable information-collecting tool.

HTTPLIB

HTTPLIB is a lightweight wrapper around the socket module. Of the three libraries I have mentioned, HTTPLIB provides the most control when accessing a web site. That control, however, comes at the cost of more work to accomplish your task. HTTP is "stateless": the server doesn't remember anything about your previous requests, so you must construct a new HTTPLIB object to connect to the web site for each request. The requests form a conversation with the web server, mimicking a web browser. Let's connect interactively to Meerkat through Rael Dornfest's Open API and see what results we get. The conversation begins by building up a series of statements that first state what action you want to take, and then identify you to the web server:

>>> import httplib
>>> host = 'www.oreillynet.com'
>>> h = httplib.HTTP(host)
>>> h.putrequest('GET', '/meerkat/?_fl=minimal')
>>> h.putheader('Host', host)
>>> h.putheader('User-agent', 'python-httplib')
>>> h.endheaders()
>>> 
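
To make the conversation more concrete, here is a minimal sketch of the text those calls put on the wire. The build_request helper is hypothetical and not part of HTTPLIB; the HTTP/1.0 version string is an assumption about what the httplib.HTTP class of this era sent. Nothing is transmitted here; the sketch only assembles the request so you can look at it.

```python
def build_request(method, path, headers, version="HTTP/1.0"):
    """Return the raw bytes of an HTTP request that has no body.

    Hypothetical helper for illustration only; httplib assembles this
    text for you through putrequest(), putheader() and endheaders().
    """
    # Request line: the action, the page, and the protocol version.
    lines = ["%s %s %s" % (method, path, version)]
    # One "Name: value" line per header, in the order given.
    lines.extend("%s: %s" % (name, value) for name, value in headers)
    # A blank line marks the end of the headers (the job of endheaders()).
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


host = "www.oreillynet.com"
request = build_request(
    "GET",
    "/meerkat/?_fl=minimal",
    [("Host", host), ("User-agent", "python-httplib")],
)
print(request.decode("ascii"))
```

Each line ends with a carriage return and line feed, and the empty line at the end is what tells the server that the headers are complete.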

The GET request tells the server which page you want to receive. The Host header tells it the domain name you are querying. Modern servers using HTTP/1.1 can host several domains at the same address. If you don't tell the server which domain name you want, it cannot know which site you mean, and you may get a '302' redirection response as your return code instead of the page you asked for. The User-agent header tells the server what kind of client you are so it knows what it can and cannot send you. This is all the information the web server needs to process your request. Next you ask for the response:

>>> returncode, returnmsg, headers = h.getreply()
>>> if returncode == 200:  #OK
...         f = h.getfile()
...         print f.read()
...

This will print out the current Meerkat page in the minimal flavor. The response headers and content are returned separately, which aids in both troubleshooting and parsing any returned data. If you want to see the response headers, use print headers.
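
For readers on Python 3, where the module lives on as http.client, the same exchange might look like the sketch below. It is not the session shown above: the three values from getreply() become attributes of a response object, and the StandInHandler server and the fetch helper are hypothetical names added so the example runs locally without depending on the Meerkat service.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class StandInHandler(BaseHTTPRequestHandler):
    """Hypothetical local server standing in for the Meerkat site."""

    def do_GET(self):
        body = b"stand-in page for " + self.path.encode("ascii")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the server quiet so only client output is shown


def fetch(host, port, path):
    """Return (status, reason, headers, body) for one GET request."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        # skip_host=True stops http.client from adding its own Host header,
        # so the explicit putheader() call mirrors the session above.
        conn.putrequest("GET", path, skip_host=True)
        conn.putheader("Host", "%s:%d" % (host, port))
        conn.putheader("User-agent", "python-httplib")
        conn.endheaders()
        response = conn.getresponse()
        return response.status, response.reason, response.headers, response.read()
    finally:
        conn.close()


server = HTTPServer(("127.0.0.1", 0), StandInHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    host, port = server.server_address
    status, reason, headers, body = fetch(host, port, "/meerkat/?_fl=minimal")
finally:
    server.shutdown()
    server.server_close()

if status == 200:  # OK
    print(headers["Content-Type"])  # headers arrive separately from the body
    print(body.decode("ascii"))
```

The headers and the body still come back as separate pieces, so you can inspect one without parsing the other.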

HTTPLIB hides the mechanics of socket programming, and its use of a file object for buffering lets you use a familiar approach to manipulating the data. It is, however, best suited as a building block for more powerful web client applications, or for interactive conversations with a troubled web site. To aid in both areas, HTTPLIB has a useful debug capability. You access it by calling the method h.set_debuglevel(1) at any point after object initialization (the line h = httplib.HTTP(host) in our example). With the debug level set to 1, the module will echo requests and the results of any calls to getreply() to the screen.
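
The debug capability survives in Python 3's http.client, and a rough sketch of it follows. The assumptions are the same as before: QuietHandler is a hypothetical local server used in place of a real site, and the trace is captured into a string only so it can be filtered; in an interactive session you would simply let it print to the screen.

```python
import contextlib
import http.client
import io
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class QuietHandler(BaseHTTPRequestHandler):
    """Hypothetical local server; it answers every GET with a short page."""

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress the server's own logging


server = HTTPServer(("127.0.0.1", 0), QuietHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection(*server.server_address, timeout=10)
conn.set_debuglevel(1)  # echo what is sent and what comes back

trace = io.StringIO()
try:
    # http.client writes its debug trace with print(), so redirecting
    # standard output is enough to capture it.
    with contextlib.redirect_stdout(trace):
        conn.request("GET", "/meerkat/?_fl=minimal",
                     headers={"User-agent": "python-httplib"})
        response = conn.getresponse()
        response.read()
finally:
    conn.close()
    server.shutdown()
    server.server_close()

# Show only the request that was sent and the status line of the reply.
for line in trace.getvalue().splitlines():
    if line.startswith(("send:", "reply:")):
        print(line)
```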

The interactive nature of Python makes analyzing web sites using HTTPLIB a joy. Familiarize yourself with this module and you will have a powerful, flexible tool for diagnosing web site problems. Take time to look at the source for HTTPLIB as well. With fewer than 200 lines of code, HTTPLIB is a quick and easy introduction to socket programming using Python.
