ONLamp.com

Understanding Newlines

by Xavier Noria
08/17/2006

Programmers deal with text all the time:

  % perl -wpe1 alice.txt
  There was nothing so very remarkable in that; nor did Alice think
  it so very much out of the way to hear the Rabbit say to itself,
  `Oh dear! Oh dear! I shall be late!'

That's three nicely formatted lines of text. Now look at what's actually in the file alice.txt:

  % perl -w0777e 'print join ".", unpack "C*", <>' alice.txt
  84.104.101.114.101.32.119.97.115.32.110.111.116.104.105.110.103.\
  32.115.111.32.118.101.114.121.32.114.101.109.97.114.107.97.98.\
  108.101.32.105.110.32.116.104.97.116.59.32.110.111.114.32.100.\
  105.100.32.65.108.105.99.101.32.116.104.105.110.107.10.105...

It's just a bunch of codes in a row. Numbers.

What's going on? Computers only understand numbers. That's the way they work; they encode text using numbers. To interpret them as text, software maps between numbers and characters. The map used in the example--ASCII--establishes that number 84 corresponds to letter "T", number 104 to "h", number 101 to "e", and so on. Those mappings are technically called character encodings, and there are many. Most are extensions of ASCII.

Conversely, in:

  print $fh "foo";

... Perl prints the codes that correspond to letters "f", "o", and "o" into $fh. In fact, those letters are numbers already within Perl; that's the way Perl represents strings internally.
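
The two directions of that mapping can be tried directly. The following is only a small sketch: the helper names codes_of and text_from are made up for illustration, and it assumes an ASCII-based platform.

```perl
use strict;
use warnings;

# Hypothetical helpers, named here only for illustration.

# String -> dotted list of numeric codes, as in the alice.txt dump.
sub codes_of  { return join ".", unpack "C*", $_[0] }

# Dotted list of numeric codes -> string.
sub text_from { return join "", map { chr($_) } split /\./, $_[0] }

print codes_of("foo"), "\n";          # 102.111.111
print text_from("84.104.101"), "\n";  # The
```

Both helpers rely only on the built-ins unpack and chr; nothing here is specific to alice.txt.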

Not all codes correspond to letters, though. For instance, you may have noticed that there are no spaces in the list of numbers in the example above, while there are plenty of them in the original text. How do spaces end up in the console? Where do they come from? It turns out that in ASCII, number 32 corresponds to the space character. When printing the text, an empty spot appears whenever the number 32 comes up. This is the same principle that yields a "T" from number 84.

Similarly, when the number 10 appears, a Unix terminal starts a fresh new line and code translation continues. The result of this process is text that renders like a book. Yet, that is an illusion: there are no letters, spaces, or separated lines in files or strings.

Newlines are encoded using numbers, like everything else. Forget about the way they look; they're just codes. That's the key to understanding newlines in computers.
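
As a quick check of that claim, this sketch dumps the codes of a short string that contains a space and a line break. The helper name dump_codes is an assumption made for this example, and the string uses the octal escape "\012" (decimal 10) so that the code is explicit.

```perl
use strict;
use warnings;

# Hypothetical helper: numeric codes of a string, joined with dots.
sub dump_codes { return join ".", unpack "C*", $_[0] }

# "\012" is octal for 10, the code a Unix terminal treats as a line break.
my $text = "a b\012c";

print dump_codes($text), "\n";  # 97.32.98.10.99
print length($text), "\n";      # 5: the space and the line break are one code each
```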

What is "\n"?

For historical reasons, no single code is unequivocally interpreted as a newline. ASCII code 10 is officially named "line feed" and is often called "newline," but unfortunately, the actual representation of newlines depends on the operating system or the application context.

The codes used to represent newlines in ASCII-based systems are:

  • LF: Line Feed, "\cJ", Unicode U+000A, ASCII hexadecimal 0x0A, octal 012, decimal 10.
  • CR: Carriage Return, "\cM", Unicode U+000D, ASCII hexadecimal 0x0D, octal 015, decimal 13.
  • CRLF: The pair CR followed immediately by LF, in that order.

Thus, there are three different conventions. Each ASCII-based platform follows one of them:

  • LF: Unix and Unix-like systems, Mac OS X, Linux, AIX, Xenix, BeOS, Amiga, RISC OS, and others.
  • CR: Apple II family, Mac OS through version 9.
  • CRLF: Microsoft Windows, WinCE, DOS, OS/2, CP/M, MP/M, and others.
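
Given those three conventions, a program can make a rough guess at which one a chunk of text follows. The sketch below is an illustration only: newline_convention is a hypothetical helper, and it simply reports the first convention it finds, so text with mixed line endings is not handled in any special way.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the newline convention used in $text.
# CRLF is tested first because it contains both CR and LF.
sub newline_convention {
    my ($text) = @_;
    return "CRLF" if $text =~ /\015\012/;
    return "LF"   if $text =~ /\012/;
    return "CR"   if $text =~ /\015/;
    return "none";
}

print newline_convention("x\012y"),     "\n";  # LF
print newline_convention("x\015y"),     "\n";  # CR
print newline_convention("x\015\012y"), "\n";  # CRLF
print newline_convention("xy"),         "\n";  # none
```

The octal escapes are used on purpose, so the check is about explicit codes and does not depend on what "\n" means on the machine running the script.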

If you fire up an editor on three computers, each running an operating system from a different one of those three families, then enter x + Return + y in each one and save, the result on disk is different. Following the same notation used earlier, you would see these bytes, respectively:

  • Ubuntu GNU/Linux: 120.10.121
  • Mac OS 9: 120.13.121
  • Windows NT: 120.13.10.121
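
Those three byte sequences can be reproduced on a single machine by joining the two lines with an explicit separator. This is a sketch under stated assumptions: the %NEWLINE table and the bytes_on_disk helper are names invented for this example, not part of Perl.

```perl
use strict;
use warnings;

# Assumed lookup table for this example: convention name -> explicit codes.
my %NEWLINE = (
    LF   => "\012",
    CR   => "\015",
    CRLF => "\015\012",
);

# Hypothetical helper: join @lines with the given convention's newline
# and return the resulting codes in the dotted notation used above.
sub bytes_on_disk {
    my ($convention, @lines) = @_;
    my $data = join $NEWLINE{$convention}, @lines;
    return join ".", unpack "C*", $data;
}

for my $convention (qw(LF CR CRLF)) {
    printf "%-4s %s\n", $convention, bytes_on_disk($convention, "x", "y");
}
# LF   120.10.121
# CR   120.13.121
# CRLF 120.13.10.121
```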

Text editors do that transparently. How do you produce the right code or codes from Perl (or another programming language that works similarly)? Suppose you want to print "foo" followed by a newline on Linux. According to the previous list, you could write:

  print "foo\012";

That's correct. Now, what if you want to print "foo" followed by a newline on Windows? In theory, you would need to write instead:

  print "foo\015\012"; # but not actually true!

What if you don't know in advance the operating system your script is going to run on? Imagine that you are going to publish a program that must be OS-independent. That is, you want your program to be portable. In principle, you need to write something like:

  if ($^O eq "darwin" || ...) {
      print "foo\012";
  } elsif ($^O eq "MSWin32" || ...) {
      print "foo\015\012"; # not actually true, see below
  } else {
      print "foo\015";
  }

That's ugly, but it can be encapsulated somewhere so that you only need to write:

  print "foo", newline_for_runtime_platform();

That is, in a sense, what "\n" means. The actual implementation differs, but the idea is the same: "\n" outputs a newline in a portable way, and Perl knows what to do on each system:

  print "foo\n"; # does the right thing in every system

That's common in C-based languages such as Perl. Other languages have different semantics for "\n". For instance, "\n" is not a portable newline in Java; in order to print foo followed by a newline in a portable way in Java, use method calls such as System.out.println("foo").
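
Back in Perl, one way to see what print "foo\n" really produces on a given machine is to write it to a file through an ordinary filehandle and then read the file back in binary mode. The sketch below does that; codes_written is a hypothetical helper, and the file name out.txt inside a temporary directory is an arbitrary choice for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical helper: write $text through a normal (text-mode) handle,
# then return the codes that actually reached the disk.
sub codes_written {
    my ($text) = @_;
    my $dir  = tempdir(CLEANUP => 1);
    my $path = File::Spec->catfile($dir, "out.txt");

    open my $out, ">", $path or die "cannot write $path: $!";
    print {$out} $text;
    close $out or die "cannot close $path: $!";

    open my $in, "<", $path or die "cannot read $path: $!";
    binmode $in;                       # read raw bytes, no translation
    my $raw = do { local $/; <$in> };  # slurp the whole file
    close $in;

    return join ".", unpack "C*", $raw;
}

# On Linux this prints 102.111.111.10; on Windows the expected
# result is 102.111.111.13.10.
print codes_written("foo\n"), "\n";
```

The same source line yields different bytes depending on the platform, which is the portability that "\n" is meant to provide.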
