URLs and URIs, Proxies and Passwords
Communicating with Server-Side Programs Through GET
The URL class makes it easy for Java applets and
applications to communicate with server-side programs such as CGIs,
servlets, PHP pages, and others that use the GET
method. (Server-side programs that use the POST
method require the URLConnection class and are
discussed in Chapter 15.) All you need to know
is what combination of names and values the program expects to
receive, and cook up a URL with a query string that provides the
requisite names and values. All names and values must be
x-www-form-url-encoded—as by the URLEncoder.encode() method, discussed earlier in this chapter.
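As a rough sketch of the encoding step, the following class builds a query string from alternating names and values, passing each piece through URLEncoder.encode(). The class name QuerySketch, its pair() and build() methods, and the sample field names q and lang are made up for illustration; they are not part of java.net or of any particular server's interface.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration only: assembles
// name=value&name=value strings with every piece encoded.
class QuerySketch {

  // Encodes a single name-value pair as name=value.
  static String pair(String name, String value) {
    try {
      return URLEncoder.encode(name, "UTF-8") + "="
          + URLEncoder.encode(value, "UTF-8");
    } catch (UnsupportedEncodingException ex) {
      // Every Java platform is required to support UTF-8
      throw new IllegalStateException(ex);
    }
  }

  // Joins alternating names and values with ampersands.
  static String build(String... namesAndValues) {
    if (namesAndValues.length % 2 != 0) {
      throw new IllegalArgumentException("Each name needs a value");
    }
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < namesAndValues.length; i += 2) {
      if (i > 0) result.append('&');
      result.append(pair(namesAndValues[i], namesAndValues[i + 1]));
    }
    return result.toString();
  }

  public static void main(String[] args) {
    // prints q=java+network+programming&lang=en%26fr
    System.out.println(build("q", "java network programming", "lang", "en&fr"));
  }
}
```

The ampersand inside the value en&fr comes out as %26, so the server cannot mistake it for a separator between pairs. The resulting string is what you append to the program's URL after a question mark.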
There are a number of ways to determine the exact syntax for a query string that talks to a particular program. If you've written the server-side program yourself, you already know the name-value pairs it expects. If you've installed a third-party program on your own server, the documentation for that program should tell you what it expects.
On the other hand, if you're talking to a program on a third-party server, matters are a little trickier. You can always ask people at the remote server to provide you with the specifications for talking to their site. However, even if they don't mind doing this, there's probably no single person whose job description includes "telling third-party hackers with whom we have no business relationship exactly how to access our servers." Thus, unless you happen upon a particularly friendly or bored individual who has nothing better to do with their time except write long emails detailing exactly how to access their server, you're going to have to do a little reverse engineering.
TIP: This is beginning to change. A number of web sites have realized the value of opening up their systems to third-party developers and have begun publishing developers' kits that provide detailed information on how to construct URLs to access their services. Sites like Safari and Amazon that offer RESTful, URL-based interfaces are easily accessed through the
URL class. SOAP-based services like eBay's and Google's are much more difficult to work with.
Many programs are designed to process
form input. If this is the case, it's
straightforward to figure out what input the program expects. The
method the form uses should be the value of the
METHOD attribute of the FORM
element. This value should be either GET, in which
case you use the process described here, or POST,
in which case you use the process described in Chapter 15. The part of the URL that precedes the
query string is given by the value of the ACTION
attribute of the FORM element. Note that this may
be a relative URL, in which case you'll need to
determine the corresponding absolute URL. Finally, the names in the
name-value pairs are simply the NAME attributes of the
INPUT elements, except for any
INPUT elements whose TYPE
attribute has the value submit.
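If the ACTION attribute is relative, one way to find the absolute URL is to resolve it against the URL of the page that contains the form, using the two-argument URL constructor. The sketch below assumes the page has no BASE element that changes its base URL; the ActionResolver class and the example.com and example.org addresses are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper for illustration only.
class ActionResolver {

  // Resolves a form's ACTION value against the URL of the page
  // on which the form appears and returns the absolute URL.
  static String resolve(String pageUrl, String action) {
    try {
      URL base = new URL(pageUrl);
      return new URL(base, action).toString();
    } catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    String page = "http://www.example.com/forms/search.html";
    // A relative ACTION is resolved against the page's directory:
    // prints http://www.example.com/forms/lookup.php
    System.out.println(resolve(page, "lookup.php"));
    // An ACTION that begins with a slash is resolved against the host:
    // prints http://www.example.com/cgi-bin/search
    System.out.println(resolve(page, "/cgi-bin/search"));
    // An ACTION that is already absolute is returned unchanged:
    // prints http://search.example.org/find
    System.out.println(resolve(page, "http://search.example.org/find"));
  }
}
```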
For example, consider this HTML form for the local search engine on
my Cafe con Leche site. You can see that it uses the
GET method. The program that processes the form is
accessed via the URL http://www.google.com/search. It has four
separate name-value pairs, three of which have default values:
<form name="search" action="http://www.google.com/search" method="get">
<input name="q" />
<input type="hidden" value="cafeconleche.org" name="domains" />
<input type="hidden" name="sitesearch" value="cafeconleche.org" />
<input type="hidden" name="sitesearch2" value="cafeconleche.org" />
<br />
<input type="image" height="22" width="55"
src="images/search_blue.gif" alt="search" border="0"
name="search-image" />
</form>
The type of the INPUT field
doesn't matter—for instance, it
doesn't matter if it's a set of
checkboxes, a pop-up list, or a text field—only the name of
each INPUT field and the value you give it are
significant. The single exception is a submit input that tells the
web browser when to send the data but does not give the server any
extra information. In some cases, you may find hidden
INPUT fields that must have particular required
default values. This form has three hidden INPUT
fields.
In some cases, the program you're talking to may not be able to handle arbitrary text strings for values of particular inputs. However, since the form is meant to be read and filled in by human beings, it should provide sufficient clues to figure out what input is expected; for instance, that a particular field is supposed to be a two-letter state abbreviation or a phone number.
A program that doesn't respond to a form is much harder to reverse engineer. For example, at http://www.ibiblio.org/nywc/bios.phtml, you'll find a lot of links to PHP pages that talk to a database to retrieve a list of musical works by a particular composer. However, there's no form anywhere that corresponds to this program. It's all done by hardcoded URLs. In this case, the best you can do is look at as many of those URLs as possible and see whether you can guess what the server expects. If the designer hasn't tried to be too devious, this information isn't hard to figure out. For example, these URLs are all found on that page:
http://www.ibiblio.org/nywc/compositionsbycomposer.phtml?last=Anderson
&first=Beth&middle=
http://www.ibiblio.org/nywc/compositionsbycomposer.phtml?last=Austin
&first=Dorothea&middle=
http://www.ibiblio.org/nywc/compositionsbycomposer.phtml?last=Bliss
&first=Marilyn&middle=
http://www.ibiblio.org/nywc/compositionsbycomposer.phtml?last=Hart
&first=Jane&middle=Smith
Looking at these, you can guess that this particular program expects three inputs named first, middle, and last, with values that consist of the first, middle, and last names of a composer, respectively. Sometimes the inputs may not have such obvious names. In this case, you have to do some experimenting, first copying some existing values and then tweaking them to see what values are and aren't accepted. You don't need to do this in a Java program. You can simply edit the URL in the Address or Location bar of your web browser window.
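When there are many such URLs to compare, it can help to split each query string into its names and values programmatically. The following is a minimal sketch, assuming the server uses the usual name=value pairs separated by ampersands; the QueryInspector class is invented for this illustration, and it keeps only the last value if a name is repeated.

```java
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only.
class QueryInspector {

  // Returns the decoded name-value pairs of a URL's query string
  // in the order in which they appear.
  static Map<String, String> parse(String url) {
    Map<String, String> pairs = new LinkedHashMap<String, String>();
    try {
      String query = new URL(url).getQuery();
      if (query == null || query.length() == 0) return pairs;
      for (String piece : query.split("&")) {
        int equals = piece.indexOf('=');
        // A piece with no equals sign is treated as a name with an empty value
        String name = equals < 0 ? piece : piece.substring(0, equals);
        String value = equals < 0 ? "" : piece.substring(equals + 1);
        pairs.put(URLDecoder.decode(name, "UTF-8"),
                  URLDecoder.decode(value, "UTF-8"));
      }
    } catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    } catch (UnsupportedEncodingException ex) {
      // Every Java platform is required to support UTF-8
      throw new IllegalStateException(ex);
    }
    return pairs;
  }

  public static void main(String[] args) {
    // prints {last=Hart, first=Jane, middle=Smith}
    System.out.println(parse("http://www.ibiblio.org/nywc/"
        + "compositionsbycomposer.phtml?last=Hart&first=Jane&middle=Smith"));
  }
}
```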
TIP: The likelihood that other hackers may experiment with your own server-side programs in such a fashion is a good reason to make them extremely robust against unexpected input.
Regardless of how you determine the set of name-value pairs the
server expects, communicating with it once you know them is simple.
All you have to do is create a query string that includes the
necessary name-value pairs, then form a URL that includes that query
string. Send the query string to the server and read its response
using the same methods you use to connect to a server and retrieve a
static HTML page. There's no special protocol to
follow once the URL is constructed. (There is a special protocol to
follow for the POST method, however, which is why
discussion of that method will have to wait until Chapter 15.)
To demonstrate this procedure, let's write a very simple command-line program to look up topics in the Netscape Open Directory (http://dmoz.org/). This site is shown in Figure 7-3 and it has the advantage of being really simple.

Figure 7-3. The basic user interface for the Open Directory
The basic Open Directory interface is a simple
form with one input field named search; input
typed in this field is sent to a CGI program at http://search.dmoz.org/cgi-bin/search, which
does the actual search. The HTML for the form looks like this:
<form accept-charset="UTF-8"
action="http://search.dmoz.org/cgi-bin/search" method="GET">
<input size=30 name=search>
<input type=submit value="Search">
<a href="http://search.dmoz.org/cgi-bin/search?a.x=0">
<small><i>advanced</i></small></a>
</form>
There are only two input fields in this form: the Submit button and a text field named search. Thus, to submit a search request to the Open Directory, you just need to collect the search string, encode it in a query string, and send it to http://search.dmoz.org/cgi-bin/search. For example, to search for "java", you would open a connection to the URL http://search.dmoz.org/cgi-bin/search?search=java and read the resulting input stream. Example 7-12 does exactly this.
import com.macfaq.net.*;   // supplies the QueryString class used below
import java.net.*;
import java.io.*;

public class DMoz {

  public static void main(String[] args) {

    // Join the command-line arguments into a single search string
    String target = "";
    for (int i = 0; i < args.length; i++) {
      target += args[i] + " ";
    }
    target = target.trim( );

    // Build the query string for the form's single search field
    QueryString query = new QueryString("search", target);
    try {
      URL u = new URL("http://search.dmoz.org/cgi-bin/search?" + query);
      InputStream in = new BufferedInputStream(u.openStream( ));
      InputStreamReader theHTML = new InputStreamReader(in);
      // Copy the server's response to System.out one character at a time
      int c;
      while ((c = theHTML.read( )) != -1) {
        System.out.print((char) c);
      }
    }
    catch (MalformedURLException ex) {
      System.err.println(ex);
    }
    catch (IOException ex) {
      System.err.println(ex);
    }

  }

}
Of course, a lot more effort could be expended on parsing and
displaying the results. But notice how simple the code was to talk to
this server. Aside from the funky-looking URL and the slightly
greater likelihood that some pieces of it need to be
x-www-form-url-encoded, talking to a server-side program that uses
GET is no harder than retrieving any other HTML
page.
Accessing Password-Protected Sites
Many popular sites, such as
The Wall Street Journal,
require a username and password for access. Some sites, such as the
W3C member pages, implement this correctly through HTTP
authentication. Others, such as the Java Developer Connection,
implement it incorrectly through cookies and HTML forms.
Java's
URL
class can access sites that use HTTP authentication, although
you'll of course need to tell it what username and
password to use. Java does not provide support
for sites that use nonstandard, cookie-based authentication, in part
because Java doesn't really support cookies in Java
1.4 and earlier, in part because this requires parsing and submitting
HTML forms, and, lastly, because cookies are completely contrary to
the architecture of the Web. (Java 1.5 does add some cookie support,
which we'll discuss in the next chapter. However, it
does not treat authentication cookies differently than any other
cookies.) You can provide this support yourself using the
URLConnection class to read and write the HTTP
headers where cookies are set and returned. However, doing so is
decidedly nontrivial and often requires custom code for each site you
want to connect to. It's really hard to do short of
implementing a complete web browser with full HTML forms and cookie
support. Accessing sites protected by standard HTTP authentication
is much easier.
The Authenticator Class
The
java.net package includes an
Authenticator class you can use to provide a
username and password for sites that protect themselves using HTTP
authentication:
public abstract class Authenticator extends Object // Java 1.2
Since Authenticator is an abstract class, you must
subclass it. Different subclasses may retrieve the information in
different ways. For example, a character mode program might just ask
the user to type the username and password on
System.in. A GUI program would likely put up a
dialog box like the one shown in Figure 7-4. An automated robot might
read the username out of an encrypted file.

Figure 7-4. An authentication dialog
To make the URL class use the subclass, install it
as the default authenticator by passing it to the static
Authenticator.setDefault() method:
public static void setDefault(Authenticator a)
For example, if you've written an
Authenticator subclass named
DialogAuthenticator, you'd
install it like this:
Authenticator.setDefault(new DialogAuthenticator( ));
You only need to do this once. From this point forward, when the
URL class needs a username and password, it will
ask the DialogAuthenticator using the static
Authenticator.requestPasswordAuthentication() method:
public static PasswordAuthentication requestPasswordAuthentication(
InetAddress address, int port, String protocol, String prompt, String scheme)
throws SecurityException
The address argument is the host for which
authentication is required. The port argument is
the port on that host, and the protocol argument
is the application layer protocol by which the site is being
accessed. The HTTP server provides the prompt.
It's typically the name of the realm for which
authentication is required. (Some large web servers such as
www.ibiblio.org have multiple realms, each of
which requires different usernames and passwords.) The
scheme is the authentication scheme
being used. (Here the word scheme is not being
used as a synonym for protocol. Rather it is an
HTTP authentication scheme, typically basic.)
Untrusted applets are not allowed to ask the user for a name and
password. Trusted applets can do so, but only if they possess the
requestPasswordAuthentication NetPermission. Otherwise,
Authenticator.requestPasswordAuthentication( )
throws a SecurityException.
The Authenticator subclass must override the
getPasswordAuthentication( ) method. Inside this
method, you collect the username and password from the user or some
other source and return it as an instance of the
java.net.PasswordAuthentication class:
protected PasswordAuthentication getPasswordAuthentication( )
If you don't want to authenticate this request,
return null, and Java will tell the server it
doesn't know how to authenticate the connection. If
you submit an incorrect username or password, Java will call
getPasswordAuthentication( ) again to give you
another chance to provide the right data. You normally have five
tries to get the username and password correct; after that,
openStream( ) throws a
ProtocolException.
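To make the pieces concrete, here is a minimal sketch of an Authenticator subclass that hands back one fixed username and password, or returns null to decline. The FixedAuthenticator class, the username someuser, and the realm name Members Only are invented for this illustration; a real program would more likely ask the user. The main() method calls Authenticator.requestPasswordAuthentication() directly, which is normally done for you by the URL class, so the round trip can be seen without a network connection.

```java
import java.net.Authenticator;
import java.net.InetAddress;
import java.net.PasswordAuthentication;

// Hypothetical subclass for illustration only.
class FixedAuthenticator extends Authenticator {

  private final String username;  // null means "decline to authenticate"
  private final char[] password;

  FixedAuthenticator(String username, char[] password) {
    this.username = username;
    this.password = password == null ? new char[0] : password.clone();
  }

  @Override
  protected PasswordAuthentication getPasswordAuthentication() {
    // Returning null tells Java there are no credentials to offer
    if (username == null) return null;
    return new PasswordAuthentication(username, password);
  }

  public static void main(String[] args) {
    // Install the subclass once, before any URLs are opened
    Authenticator.setDefault(
        new FixedAuthenticator("someuser", "not-a-real-password".toCharArray()));

    // Simulate the request the URL class would make on a 401 response.
    // The address is unknown here, so null is passed for it.
    PasswordAuthentication pa = Authenticator.requestPasswordAuthentication(
        (InetAddress) null, 80, "http", "Members Only", "basic");
    System.out.println(pa == null ? "declined" : "user: " + pa.getUserName());
  }
}
```

Holding the password as a char array rather than a String follows the PasswordAuthentication constructor, which takes a char array so that the caller can overwrite it when it is no longer needed.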
Usernames and passwords are cached within the same virtual machine
session. Once you set the correct password for a realm, you
shouldn't be asked for it again unless
you've explicitly deleted the password by zeroing
out the char array that contains it.
You can get more details about the request by invoking any of these
methods inherited from the
Authenticator
superclass:
protected final InetAddress getRequestingSite( )
protected final int getRequestingPort( )
protected final String getRequestingProtocol( )
protected final String getRequestingPrompt( )
protected final String getRequestingScheme( )
protected final String getRequestingHost( ) // Java 1.4
These methods either return the information as given in the last call
to requestPasswordAuthentication( ) or return
null if that information is not available.
(getRequestingPort( ) returns -1 if the port
isn't available.) The last method,
getRequestingHost( ), is only available in Java
1.4 and later; in earlier releases you can call
getRequestingSite( ).getHostName( ) instead.
Java 1.5 adds two more methods to this class:
protected URL getRequestingURL( ) // Java 1.5
protected Authenticator.RequestorType getRequestorType( )
The getRequestingURL( ) method returns the
complete URL for which authentication has been requested—an
important detail if a site uses different names and passwords for
different files. The getRequestorType( ) method
returns one of the two named constants
Authenticator.RequestorType.PROXY or
Authenticator.RequestorType.SERVER to indicate
whether the server or the proxy server is requesting the
authentication.
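The getter methods let a single Authenticator answer differently depending on who is asking. The following sketch, which assumes Java 1.5 or later, returns one set of credentials to a proxy server and another to a particular host and realm, and declines everything else. The RoutingAuthenticator class, the host www.example.com, the realm Members, and all usernames and passwords are hypothetical.

```java
import java.net.Authenticator;
import java.net.InetAddress;
import java.net.PasswordAuthentication;
import java.net.URL;

// Hypothetical subclass for illustration only.
class RoutingAuthenticator extends Authenticator {

  @Override
  protected PasswordAuthentication getPasswordAuthentication() {
    // A proxy server is asking, not the origin server
    if (getRequestorType() == RequestorType.PROXY) {
      return new PasswordAuthentication("proxyuser", "proxy-secret".toCharArray());
    }
    // The origin server is asking; match on host and realm
    if ("www.example.com".equals(getRequestingHost())
        && "Members".equals(getRequestingPrompt())) {
      return new PasswordAuthentication("member", "member-secret".toCharArray());
    }
    // Anything else is declined
    return null;
  }

  public static void main(String[] args) throws Exception {
    Authenticator.setDefault(new RoutingAuthenticator());
    URL url = new URL("http://www.example.com/members/index.html");

    // Simulate the requests the URL class would normally make itself
    PasswordAuthentication fromServer = Authenticator.requestPasswordAuthentication(
        "www.example.com", (InetAddress) null, 80, "http", "Members", "basic",
        url, RequestorType.SERVER);
    PasswordAuthentication fromProxy = Authenticator.requestPasswordAuthentication(
        "proxy.example.com", (InetAddress) null, 8080, "http", "Proxy", "basic",
        url, RequestorType.PROXY);

    System.out.println(fromServer.getUserName());  // prints member
    System.out.println(fromProxy.getUserName());   // prints proxyuser
  }
}
```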