ONLamp.com

Understanding Newlines
Pages: 1, 2, 3

Behind the Scenes

The previous section sacrificed accuracy in order to give you the whole picture. You are ready now to learn the details.

There are two important facts about \n in Perl that you must clearly understand:

  • The string literal "\n" consists of just one character. Always. Everywhere. "foo\n" has length 4 on all systems. In particular, "\n" is not CRLF on Windows.
  • The string literal "\n" is equal to "\012" on all systems except Mac OS pre-X, where it is equal to "\015".
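
Both facts can be checked directly. The following is a minimal sketch, not part of the original article; the variable names are only illustrative.

```perl
use strict;
use warnings;

# "\n" is a single character, so "foo\n" always has four.
my $length = length "foo\n";
print "length of \"foo\\n\": $length\n";

# Which character "\n" is depends on the platform: 10 (LF) on
# Unix, Windows, and Mac OS X; 13 (CR) on Mac OS pre-X.
my $code = ord "\n";
print "ord of \"\\n\": $code\n";

# "\r\n" is a two-character string, never a synonym for "\n".
my $pair_length = length "\r\n";
print "length of \"\\r\\n\": $pair_length\n";
```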

Given that:

  print $fh "foo\n";

does the right thing on Windows, and given that "\n" is really LF there, you may wonder how the CRLF pair ends up correctly in $fh.

Perl inherits its approach to line terminators from C: there is a layer responsible for all I/O operations in Perl. Since 5.8.0, that layer is, by default, PerlIO. When your script prints, PerlIO intervenes and does some magic: if Perl is running on a CRLF platform, it transforms, on the fly, every "\n" in the stream into a CRLF pair. The translation is totally transparent, so you won't notice it.

The C code that performs that transformation on Windows lives in the function PerlIOCrlf_write(), defined in perlio.c:

  if (*buf == '\n') {
      /* ... */
      *(b->ptr)++ = 0xd;      /* CR */
      *(b->ptr)++ = 0xa;      /* LF */
      /* ... */
  }

Note that the '\n' in that code is a C char, not a Perl string. Fortunately, the semantics coincide, and thus the condition tests what it has to test.
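
You can watch the same translation from Perl on any platform by requesting the :crlf layer explicitly and then reading the file back raw. This is only a sketch: the temporary file and the variable names are assumptions made for the demonstration, and :crlf stands in here for the default text mode of a CRLF platform.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file, removed when the script exits.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp or die "Cannot close $path: $!";

# Write one line through the :crlf layer, as text mode on Windows would.
open my $out, '>:crlf', $path or die "Cannot write $path: $!";
print {$out} "foo\n";
close $out or die "Cannot close $path: $!";

# Read the file back with no translation at all.
open my $in, '<:raw', $path or die "Cannot read $path: $!";
my $bytes = do { local $/; <$in> };
close $in or die "Cannot close $path: $!";

# Print each byte in decimal; the single "\n" was stored as 13 10.
my $dump = join ' ', map { ord($_) } split //, $bytes;
print "$dump\n";
```

The script printed four characters, but the file holds five bytes.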

That transformation goes the other way around when reading text files on Windows or any other CRLF platform. The layer replaces each CRLF pair on the fly with a single '\n' character. That happens in PerlIOCrlf_get_cnt(), also in perlio.c:

  if (nl < b->end && *nl == 0xd) {
      test:
      if (nl + 1 < b->end) {
          if (nl[1] == 0xa) {
              *nl = '\n';
              c->nl = nl;
          }
          /* ... */
      }
      /* ... */
  }

Note that this handles only actual CRLF pairs. Isolated CRs or LFs will remain untouched.

Thus, if you read lines from a text file on Windows using the standard line-oriented while loop:

  while (my $line = <$fh>) {
      # ...
  }

no CRLF pair ever gets into $line, only LFs.
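
The reading side can be tried on any platform as well. The sketch below assumes a scratch file containing two CRLF-terminated lines plus one isolated CR, and an LF platform for the comments about "\n"; the file contents and names are made up for the illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file with two CRLF-terminated lines; the
# second line also contains an isolated CR in the middle.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "one\015\012", "two\015still two\015\012";
close $tmp or die "Cannot close $path: $!";

# Read through :crlf, as text mode on a CRLF platform would.
open my $fh, '<:crlf', $path or die "Cannot read $path: $!";
my @lines = <$fh>;
close $fh or die "Cannot close $path: $!";

# Each CRLF pair arrives as a single "\n"; the lone CR is still there.
for my $line (@lines) {
    my $has_cr = $line =~ /\015/ ? 'yes' : 'no';
    printf "length %2d, contains CR: %s\n", length $line, $has_cr;
}
```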

All that magic happens on file handles associated with files. By default, Perl opens files in text mode on CRLF platforms, which means that exactly those transformations occur, no more and no fewer. You can disable them with binmode(). Other streams--sockets, for example--are in binmode by default.

That's why you need to open images and all other non-text files in binmode on Windows. Those conventions about newlines apply only to regular text files. They have nothing to do with, say, the bytes used in PNG images. A PNG image could in principle contain a CRLF pair by chance somewhere, meaning something else. If you open a PNG image for reading and don't set binmode on its file handle, the I/O layer will perform those on-the-fly substitutions and may corrupt the data by filtering out some bytes. The same happens when writing. If you have a buffer with bytes representing a song in MP3 format and write it to disk in text mode on Windows, every 0x0a byte will get a 0x0d inserted before it, and the MP3 will become garbage.
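
To see the damage without a Windows machine, the sketch below reads the same bytes twice: once with binmode() and once through :crlf, which stands in for text mode on a CRLF platform. The payload is the eight-byte PNG signature, which happens to contain a CRLF pair; the scratch file and variable names are assumptions for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# The first eight bytes of a PNG file; note the CRLF pair inside.
my $signature = "\x89PNG\x0D\x0A\x1A\x0A";

# Hypothetical scratch file, written with no translation.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} $signature;
close $out or die "Cannot close $path: $!";

# Correct for binary data: binmode() leaves every byte alone.
open my $bin, '<', $path or die "Cannot read $path: $!";
binmode $bin;
my $intact = do { local $/; <$bin> };
close $bin or die "Cannot close $path: $!";

# Wrong for binary data: the CRLF pair is collapsed into one "\n".
open my $txt, '<:crlf', $path or die "Cannot read $path: $!";
my $damaged = do { local $/; <$txt> };
close $txt or die "Cannot close $path: $!";

printf "binmode: %d bytes, text mode: %d bytes\n",
    length($intact), length($damaged);
```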

On the other hand, you can activate the translation between CRLF and "\n" no matter what the platform is, thanks to the :crlf PerlIO layer:

  open my $fh, "<:crlf", "alice.txt" or die $!;

With that little trick, your script will understand text with either native conventions or CRLF.

What Does "Portability" Mean?

As far as newlines go, a portable program does its job well on the assumption that the newline convention of text data is that of the runtime platform.

Those conventions are only knowable at runtime, perhaps on some other machine running some unknown operating system. It could be a nightmare, but fortunately, good languages give you idioms to accomplish this effortlessly. It is good practice to always write in a portable way.

In Perl, use "\n" to print newlines. Use <> to do line-based loops over file handles associated with text files, or slurp lines in list context. Use chomp() to delete the newline from a line of text. Use "\n" in regular expressions to match line terminators, and ^ and $ for assertions about line boundaries, with /m if necessary.
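
Those idioms fit in a few lines. The sketch below uses an in-memory file handle so that it is self-contained; the sample text and the variable names are invented for the illustration.

```perl
use strict;
use warnings;

# Hypothetical sample text, built with "\n" only.
my $text = "alpha\nbeta\n\ngamma\n";

# Line-based loop over a file handle, with chomp() removing the newline.
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @lines;
while (my $line = <$fh>) {
    chomp $line;
    push @lines, $line;
}
close $fh or die "Cannot close in-memory file: $!";

# With /m, ^ anchors at the beginning of every line, not just the string.
my @initials = $text =~ /^(\w)/mg;

print scalar(@lines), " lines, initials: @initials\n";
```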

Set binmode on file handles associated with binary files, even if you develop on a system that does not distinguish text and binary files. Usually, binmode is also necessary if you plan to combine seek()/tell() with read(). That's because, in text mode on CRLF platforms, read() also transforms CRLF into "\n", but seek()/tell() do not, so the length of what you have read and the byte offsets in the file may differ.
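
With binmode set, offsets and lengths agree on every platform. The following sketch assumes a scratch file with two CRLF-terminated records; the file and the names are made up for the example.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);
use File::Temp qw(tempfile);

# Hypothetical scratch file: two records, each terminated by CRLF.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "ab\015\012cd\015\012";
close $tmp or die "Cannot close $path: $!";

open my $fh, '<', $path or die "Cannot read $path: $!";
binmode $fh;    # what read() returns now matches the bytes on disk

# The second record starts at byte 4: "ab" plus the CRLF pair.
seek $fh, 4, SEEK_SET or die "Cannot seek in $path: $!";
my $record;
my $got = read $fh, $record, 2;
die "Cannot read from $path: $!" unless defined $got;
my $offset = tell $fh;
close $fh or die "Cannot close $path: $!";

print "read '$record', now at byte offset $offset\n";
```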

Do not assume "\n" is "\012"--a common pitfall in socket programming. That does not necessarily hold true. For instance, if you need a CRLF pair to generate raw HTTP headers, do not use "\r\n" as the terminator. The meaning of that string depends on the system. Hardcode it instead, as in "\cM\cJ". Even better, use the variables provided by Socket.pm via the :crlf export tag: $CR, $LF, and $CRLF. These are portable solutions.
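
As an illustration of the Socket.pm variables, the sketch below assembles a raw request header. The request line and the host are placeholders, and nothing is actually sent over the network.

```perl
use strict;
use warnings;
use Socket qw(:crlf);

# Hypothetical request for illustration; example.com is a placeholder.
my @header_lines = ('GET / HTTP/1.0', 'Host: example.com');

# join() puts $CRLF between the items, so the two empty strings
# terminate the last header line and add the blank line that ends
# the header block.
my $request = join $CRLF, @header_lines, '', '';

print length($request), " bytes, ready to print to a socket\n";
```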

Portability, however, does not mean readiness to accept any kind of text--only text with the convention of the runtime platform. Suppose you need to write a portable Perl script that counts the number of lines of the files passed as arguments:

  my $lines = 0;
  ++$lines while <>;
  print "$lines\n";

That program is correct and portable. It portably handles line terminators on reading, delegating to the diamond operator, and it outputs a newline in a portable way, via "\n".

That does not mean it works in any possible situation, of course. For example, suppose that one day a coworker comes to you with a MacBook and says the script worked flawlessly for weeks, but suddenly some file with multiple lines is reported to have a single line. If you understand how newlines work, you'll debug that in a couple of minutes. Otherwise, you'll be lost.

The problem must be that the input file is not using LF for newlines, which is the convention in Mac OS X. Editors such as Vim, Emacs, and TextMate let the user configure newlines, so there's always a risk. The diamond operator looks for LFs on Mac OS X. If the file uses CRs, the entire file lives in a single line according to Mac OS X conventions. That coincides with the observed behavior and becomes your conjecture.

Note that if the file used CRLF, the number of lines computed would be correct in Mac OS X (by accident, but correct).
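
One way to test that conjecture is to count each kind of terminator in the raw bytes. The sketch below is a hypothetical diagnostic, not part of the original script: newline_census() is an invented helper, and the sample data stands in for the suspect file. The comments assume a platform where "\n" is LF.

```perl
use strict;
use warnings;

# Hypothetical helper: counts each kind of line terminator in raw bytes.
sub newline_census {
    my ($bytes) = @_;
    my $crlf = () = $bytes =~ /\015\012/g;        # CRLF pairs
    my $cr   = () = $bytes =~ /\015(?!\012)/g;    # CRs not followed by LF
    my $lf   = () = $bytes =~ /(?<!\015)\012/g;   # LFs not preceded by CR
    return { crlf => $crlf, cr => $cr, lf => $lf };
}

# Stand-in for the suspect file: three lines terminated by CR only.
my $suspect = "first\015second\015third\015";

my $census = newline_census($suspect);
printf "CRLF: %d, lone CR: %d, lone LF: %d\n", @{$census}{qw(crlf cr lf)};

# The same data seen through the line-oriented loop of the script.
open my $fh, '<', \$suspect or die "Cannot open in-memory file: $!";
my $lines = 0;
++$lines while <$fh>;
close $fh or die "Cannot close in-memory file: $!";
print "lines seen by the loop: $lines\n";
```

For a real file, read its contents from a handle in binmode before passing them to the helper, so that no layer alters the bytes first.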
