Published on Linux DevCenter (http://www.linuxdevcenter.com/)


Living Linux

Beyond Browsing the Web

07/05/2000

In addition to viewing URLs in the standard Web browsers, there are other useful ways of getting and using Web data on Linux systems right now. Here are a few of them.

Viewing images from the Web

If you want to view an image file that's on the Web, and you know its URL, you don't have to start a Web browser to do it -- give the URL as an argument to display, part of the ImageMagick suite of imaging tools (available in the Debian imagemagick package or from the ImageMagick Web site).

For example, to view the image at ftp://garbo.uwasa.fi/garbo-gifs/garbo01.gif, type:

display ftp://garbo.uwasa.fi/garbo-gifs/garbo01.gif

Click the right mouse button to get a menu; from there, you can save the image to a file if you want to.
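
If you'd rather save a copy without opening the menu at all, ImageMagick's convert tool can usually read a URL directly and write it to a local file -- this assumes your ImageMagick build can fetch files over the network, which most can:

convert ftp://garbo.uwasa.fi/garbo-gifs/garbo01.gif garbo01.gif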

Reading text from the Web

If I want to read the text of an article that's on the Web, and I just want the text and not the Web design, I'll often grab the URL with the lynx browser using the -dump option. This dumps the text of the given URL to the standard output; then I can pipe the output to less for perusal, or use redirection to save it to a file.

For example, to peruse the text of the URL http://www.sc.edu/fitzgerald/winterd/winter.html, type:

lynx -dump http://www.sc.edu/fitzgerald/winterd/winter.html | less

It's an old Net convention for italicized words in an etext to be set off with underscores, like _this_; use the -underscore option to output any italicized text in this manner.

By default, lynx numbers all the hyperlinks and produces a list of footnoted links at the end of the output. If you don't want them, add the -nolist option, and just the pure text will be returned.
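
For example, to peruse just the article text of the same URL, with no link numbers or footnoted list, type:

lynx -dump -nolist http://www.sc.edu/fitzgerald/winterd/winter.html | less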

To output the pure text, with underscores, of the above URL to the file winter_dreams, type (the trailing backslash lets you continue the command on a second line):

lynx -dump -underscore \
 http://www.sc.edu/fitzgerald/winterd/winter.html > winter_dreams

Or pipe the output to enscript to make a nice printout of it:

lynx -dump -underscore \
 http://www.sc.edu/fitzgerald/winterd/winter.html |
 enscript -B -f "Times-Roman10"
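
If you'd rather keep the formatted text as a PostScript file instead of sending it straight to the printer, enscript's -p option writes the output to the file you name -- for example, to a file called winter_dreams.ps (any name will do):

lynx -dump -underscore \
 http://www.sc.edu/fitzgerald/winterd/winter.html |
 enscript -B -f "Times-Roman10" -p winter_dreams.ps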

Getting files from the Web

When I want to save the contents of a URL to a file, I often use GNU wget to do it. It keeps the file's original timestamp, it's smaller and faster to use than a browser, and it shows a visual display of the download progress. (You can get it from the Debian wget package or direct from any GNU archive).

So if I'm grabbing a webcam image, I'll do something like:

wget http://example.org/cam/cam.jpeg

This will save a copy of the image file as cam.jpeg, which will have the same timestamp attributes as the file on the example.org server.
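
If you want the saved copy to have a different name than the one on the server, wget's -O option lets you name the output file yourself -- for example, to save it as a file called snapshot.jpeg (a name chosen just for illustration):

wget -O snapshot.jpeg http://example.org/cam/cam.jpeg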

If you interrupt a download before it's finished, use the -c option to resume it from the point where it left off:

wget -c http://example.org/cam/image.jpeg

Archiving an entire site

To archive a single Web site, use the -m ("mirror") option, which saves files with the exact timestamp of the originals, if possible, and sets the "recursive retrieval" option to download everything. To specify the number of retries to use when an error occurs in retrieval, use the -t option with a numeric argument -- -t3 is usually good for safely retrieving across the net; use -t0 to allow an infinite number of retries when your network connection is really bad but you really want to archive something, regardless of how long it takes.

Finally, use the -o option with a filename as an argument to write a progress log to the file -- it can be useful to examine in case anything goes wrong. Once the archival process is complete and you've determined that it was successful, you can delete the logfile.

For example, to mirror the Web site at http://www.bloofga.org, giving up to three retries for retrieval of files and putting error messages in a logfile called mirror.log, type:

wget -m -t3 -o mirror.log http://www.bloofga.org/

To continue an archive where you left off, use the -nc ("no clobber") option; it doesn't retrieve files that have already been downloaded. For this option to work the way you want it to, be sure to be in the same directory that you were in when you started to archive the site.

For example, to continue an interrupted mirror of the www.bloofga.org site, while making sure that existing files aren't downloaded and giving up to three retries for retrieval of files, type:

wget -nc -m -t3 http://www.bloofga.org/
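
If you're archiving a site so that you can browse it offline, it can also help to add the -k ("convert links") option, which rewrites the links in the downloaded pages so that they point to your local copies instead of back to the original server; for example:

wget -m -k -t3 -o mirror.log http://www.bloofga.org/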

Next week: Quick tools for command-line image transformations.

Michael Stutz was one of the first reporters to cover Linux and the free software movement in the mainstream press.



Copyright © 2009 O'Reilly Media, Inc.