ONLamp.com
oreilly.comSafari Books Online.Conferences.

advertisement


How an Accident of Hardware Design Encouraged Open Source
Pages: 1, 2

Portable OS and Incompatible Byte Order Leads to Preference for ASCII

As I think about it, I believe that this problem of byte-order incompatibility ended up being a major driving force that resulted in the openness we find in the descendants of Unix. OSes prior to Unix were written in assembly language and were thus inextricably tied to the instruction set specific to the CPU they were written to run on. Unix was the first major OS written in a higher level language, and so it was easy to compile it to instruction sets for many different kinds of CPUs. Because other OSes ran only on a single architecture, they never encountered the byte-order incompatibility problem. But Unix's portability guaranteed that it would run into the problem, big time. When people tried to move data between big-endian and little-endian machines, bytes within words came through scrambled. The scrambled results became the name of the problem; it was known as either the XINU problem or the NUXI problem, depending on the specific architectures of the two machines.



Programmers tried a variety of approaches to deal with the problem that data formats that depended on byte order didn't transfer well between architectures. A common approach was to build something into the data format to indicate the byte-order. The first two bytes of a TIFF image file, for example, are either "II" if integers in the file are in little-endian byte order or "MM" if integers in the file are in big-endian byte order.

Another way to solve the byte-order problem was to avoid it entirely by only dealing with data in one-byte chunks. People quickly noticed that programs written to fetch their data from files that contained printable ASCII worked fine without any extra programming effort. Representing numeric data as plain ASCII had the disadvantage of taking up a lot more space than a binary representation. Given the memory constraints of the time, this was a much bigger deal back then than it is now. It also had the disadvantage of requiring additional CPU time to convert the data between ASCII and binary. Again, this was a bigger deal back then than it is now, due to the comparatively slow CPU speeds of the time.

But memory and disk capacities were increasing, so the advantage of ASCII quickly came to outweigh its disadvantages. In all but the most storage-intensive applications, file formats that were byte-order dependent tended to wither and die. The experience of cpio vs. tar illustrates this.

In the early `80s, cpio and tar were two common Unix archiving utilities, which was right around when a lot of commercial interest in using Unix began. Because lots of different companies were building their own hardware, each one inventing their own instruction set and addressing scheme, getting Unix to run on their machines required each of them to make the Unix source code compile, link, and then run properly on their hardware--a process known as porting. It was often necessary to copy many files at a time, some ASCII and some binary, back and forth between machines with different architectures. That meant aggregating many files into a single file for transfer among different machines. Both cpio and tar were available to do this job. However, where cpio's file format contained binary integers, the design of tar's file format had avoided binary and instead stored integers as ASCII strings. A cpio backup only worked where the destination machine had the same byte-order and wordsize as the source machine. Tar format, unlike cpio format, could be used to move files between machines with different byte-order and wordsize. Although you still had to deal with byte-order dependencies inside any binary files within the tar archive after they were on the destination machine, this was vastly better than the situation with cpio where you couldn't even transfer the files. Unsurprisingly, people gravitated toward tar. As a result, cpio fell into disuse. tar, on the other hand, is still in common use today, a quarter of a century later.

Storing Data in Binary Fosters Closed Proprietary Systems; ASCII Fosters Openness

Representing data in 1-byte chunks rather than multibyte chunks turned out to have an interesting side-effect. When storing numeric data in multibyte chunks (binary), you have a stream of groups of bytes that may be of varying size. You might have 1-, 2- or 4-byte integers, 4- or 8-byte floating point numbers, and character strings, all mixed in with each other. Unless you have documentation on the file format, there's no way to tell where one piece of data ends and the next begins.

If you're storing your data in 1-byte chunks, the most natural representation is displayable ASCII text. That means that the data typically contains line terminators (LF on Unix, CR/LF on DOS, CR on the Mac) and field delimiters (TAB, ",", etc.). It also means that a sequence of characters containing only digits is probably a number, and a sequence containing non-digits is clearly not a number.

In an environment in which it's easy to make your data and configuration files incomprehensible (unless you make a point of providing detailed documentation on your file format), programmers and the companies they work for expect that everything can be kept secret, even from other programmers, unless they can gain some business advantage by documenting the format of some particular file.

On the other hand, in an environment in which data files, and especially configuration files, are commonly ASCII rather than binary, programmers come to expect that they'll be able to figure out a lot about what's in a file, even in the absence of detailed documentation on the file format.

Although Richard Stallman probably would have founded the GNU project anyway, GNU's progress would have been greatly impeded in an environment where all data was stored in binary. Many of the F/OSS projects that are GNU's progeny may never have happened in an environment in which undocumented binary file formats predominated. As just one example of how companies use undocumented binary formats to protect their turf and impede the progress of others, note that although Linux can read from an NTFS filesystem (the filesystem format that Microsoft has moved to), the ability to reliably create or modify files in an NTFS filesystem has been a longstanding problem for Linux. (See Does Linux Support NTFS? and HOWTO: NTFS with read/write support using the ntfs-3g.) Microsoft has managed to lock Linux out simply by using sufficiently complex structures inside NTFS and keeping the documentation secret.

Conclusion

The IBM 360 and DEC's PDP-11 were the predominant big and little computers of the 1970s. If they had both used the same byte-ordering scheme, it would have been possible to transfer multi-byte chunks of data between the two architectures without encountering byte-ordering problems. If the PDP-11's designers had decided to copy the byte-ordering used by IBM, the Unix world would not have been pushed toward encoding data in ASCII, and the unintended consequential openness in the Unix/GNU/Linux world might never have come about.


Return to ONLamp.com.



Sponsored by: