ONJava.com -- The Independent Source for Enterprise Java

Java RMI: Serialization

Versioning Classes

A few pages back, I described the serialization mechanism:

The serialization mechanism automatically, at runtime, converts class objects into metadata so instances can be serialized with the least amount of programmer work.

This is great as long as the classes don't change. When classes change, the metadata, which was created from obsolete class objects, accurately describes the serialized information. But it might not correspond to the current class implementations.

The Two Types of Versioning Problems

There are two basic types of versioning problems that can occur. The first occurs when a change is made to the class hierarchy (e.g., a superclass is added or removed). Suppose, for example, a personnel application made use of two serializable classes: Employee and Manager (a subclass of Employee). For the next version of the application, two more classes need to be added: Contractor and Consultant. After careful thought, the new hierarchy is based on the abstract superclass Person, which has two direct subclasses: Employee and Contractor. Consultant is defined as a subclass of Contractor, and Manager is a subclass of Employee. See Figure 10-8.

Figure 10-8. Changing the class hierarchy.

While introducing Person is probably good object-oriented design, it breaks serialization. Recall that serialization relies on the class hierarchy to define the data format.

The second type of version problem arises from local changes to a serializable class. Suppose, for example, that in our bank example, we want to add the possibility of handling different currencies. To do so, we define a new class, Currency, and change the definition of Money:

public class Money extends ValueObject {
    public float amount;
    public Currency typeOfMoney;
}

This completely changes the definition of Money but doesn't change the object hierarchy at all.

The important distinction between the two types of versioning problems is that the first type can't really be repaired. If you have old data lying around that was serialized using an older class hierarchy, and you need to use that data, your best option is probably something along the lines of the following:

  1. Using the old class definitions, write an application that deserializes the data into instances and writes the instance data out in a neutral format, say as tab-delimited columns of text.
  2. Using the new class definitions, write a program that reads in the neutral-format data, creates instances of the new classes, and serializes these new instances.
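
The two steps above can be sketched in a few lines. This is only an illustration under stated assumptions: OldEmployee and NewEmployee are hypothetical stand-ins for the old and new definitions of the same class (in a real migration they would share a name and live in separate builds), and the tab-delimited layout of name and salary is invented for the example.

```java
import java.io.*;

class MigrationSketch {
    // Hypothetical stand-in for the class as it was defined in the old hierarchy.
    static class OldEmployee implements Serializable {
        String name;
        int salary;
        OldEmployee(String name, int salary) { this.name = name; this.salary = salary; }
    }

    // Hypothetical new hierarchy: Employee now sits under an abstract Person.
    static abstract class Person implements Serializable {
        String name;
    }

    static class NewEmployee extends Person {
        int salary;
    }

    // Step 1: deserialize with the old definition and emit one tab-delimited line.
    static String exportToTsv(byte[] oldData) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(oldData))) {
            OldEmployee e = (OldEmployee) in.readObject();
            return e.name + "\t" + e.salary;
        }
    }

    // Step 2: parse the neutral line, build a new-style instance, and serialize it.
    static byte[] importFromTsv(String line) throws IOException {
        String[] cols = line.split("\t");
        NewEmployee e = new NewEmployee();
        e.name = cols[0];
        e.salary = Integer.parseInt(cols[1]);
        return serialize(e);
    }

    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] oldData = serialize(new OldEmployee("Bob", 30000));
        String neutral = exportToTsv(oldData);
        byte[] newData = importFromTsv(neutral);
        System.out.println("Migrated " + newData.length + " bytes via neutral text");
    }
}
```

The neutral format is the only thing the two programs share, which is why the old and new class definitions never need to be loaded together.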

The second type of versioning problem, on the other hand, can be handled locally, within the class definition.

How Serialization Detects When a Class Has Changed

In order for serialization to gracefully detect when a versioning problem has occurred, it needs to be able to detect when a class has changed. As with all the other aspects of serialization, there is a default way that serialization does this. And there is a way for you to override the default.

The default involves a hashcode. Serialization creates a single hashcode, of type long, from the following information:

  • The class name and modifiers
  • The names of any interfaces the class implements
  • Descriptions of all methods and constructors except private methods and constructors
  • Descriptions of all fields except private static and private transient fields

This single long, called the class's stream unique identifier (often abbreviated suid), is used to detect when a class changes. It is an extraordinarily sensitive index. For example, suppose we add the following method to Money:

public boolean isBigBucks(  ) {
    return _cents > 5000;
}

We haven't changed, added, or removed any fields; we've simply added a method with no side effects at all. But adding this method changes the suid. Prior to adding it, the suid was 6625436957363978372L; afterwards, it was -3144267589449789474L. Moreover, if we had made isBigBucks( ) a protected method, the suid would have been 4747443272709729176L.

TIP: These numbers can be computed using the serialver program that ships with the JDK. For example, these were all computed by typing serialver com.ora.rmibook.chapter10.Money at the command line for slightly different versions of the Money class.

The default behavior for the serialization mechanism is a classic "better safe than sorry" strategy. The serialization mechanism uses the suid, which defaults to an extremely sensitive index, to tell when a class has changed. If it has, the serialization mechanism refuses to create instances of the new class using data that was serialized with the old classes; deserialization fails with an InvalidClassException.

Implementing Your Own Versioning Scheme

While this is reasonable as a default strategy, it would be painful if serialization didn't provide a way to override the default behavior. Fortunately, it does. Serialization uses the default suid only if a class definition doesn't provide one. That is, if a class definition includes a static final long named serialVersionUID, then serialization will use that static final long value as the suid. In the case of our Money example, if we included the line:

private static final long serialVersionUID = 1;

in our source code, then the suid would be 1, no matter how many changes we made to the rest of the class. Explicitly declaring serialVersionUID allows us to change the class, and add convenience methods such as isBigBucks( ), without losing backwards compatibility.

TIP:   serialVersionUID doesn't have to be private. However, it must be static, final, and long.
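
The suid that serialization will use for a class can also be read from inside a program, through java.io.ObjectStreamClass. The sketch below is illustrative only: DefaultMoney and PinnedMoney are hypothetical classes, one relying on the computed default and one declaring serialVersionUID itself.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

class SuidLookup {
    // Hypothetical class with no serialVersionUID: the suid is computed from its structure.
    static class DefaultMoney implements Serializable {
        protected int _cents;
    }

    // Hypothetical class that pins its suid explicitly.
    static class PinnedMoney implements Serializable {
        private static final long serialVersionUID = 1L;
        protected int _cents;
    }

    // Returns the suid serialization would use for the given serializable class.
    static long suidOf(Class<?> type) {
        return ObjectStreamClass.lookup(type).getSerialVersionUID();
    }

    public static void main(String[] args) {
        System.out.println("computed default: " + suidOf(DefaultMoney.class));
        System.out.println("declared value:   " + suidOf(PinnedMoney.class));
    }
}
```

The declared value stays at 1 however PinnedMoney evolves, while the computed default should be expected to change whenever a non-private member of DefaultMoney is added or removed.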

The downside to using serialVersionUID is that, if a significant change is made (for example, if a field is added to the class definition), the suid will not reflect this difference. This means that the deserialization code might not detect an incompatible version of a class. Again, using Money as an example, suppose we had:

public class Money extends ValueObject {
    private static final long serialVersionUID = 1;
    protected int _cents;
}

and we migrated to:

public class Money extends ValueObject {
    private static final long serialVersionUID = 1;
    public float amount;
    public Currency typeOfMoney;
}

The serialization mechanism won't detect that these are completely incompatible classes. Instead, when it tries to create the new instance, it will throw away all the data it reads in. Recall that, as part of the metadata, the serialization algorithm records the name and type of each field. Since it can't find the fields during deserialization, it simply discards the information.

The solution to this problem is to implement your own versioning inside of readObject( ) and writeObject( ). Your writeObject( ) method should begin by writing a version number:

private void writeObject(java.io.ObjectOutputStream out) throws IOException {
    out.writeInt(VERSION_NUMBER);
    ....
}

In addition, your readObject( ) code should start with a switch statement based on the version number:

private void readObject(java.io.ObjectInputStream in) throws IOException,
    ClassNotFoundException {
    int version = in.readInt( );
    switch(version) {
        // version specific demarshalling code.
        ....
    }
}
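
To make the two fragments concrete, here is one way a fully versioned class might look. It is a sketch rather than the book's Money class: the field names, the choice of 2 as the current VERSION_NUMBER, and the layout assumed for version 1 (an int number of cents with no currency) are all assumptions made for illustration.

```java
import java.io.*;

class VersionedMoney implements Serializable {
    private static final long serialVersionUID = 1L;
    // Assumed current format version; written first in this class's stream data.
    private static final int VERSION_NUMBER = 2;

    // transient: these fields are marshalled by hand below, not by default serialization.
    private transient long cents;
    private transient String currencyCode;

    VersionedMoney(long cents, String currencyCode) {
        this.cents = cents;
        this.currencyCode = currencyCode;
    }

    long getCents() { return cents; }
    String getCurrencyCode() { return currencyCode; }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.writeInt(VERSION_NUMBER);
        out.writeLong(cents);
        out.writeUTF(currencyCode);
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        int version = in.readInt();
        switch (version) {
            case 1: // assumed older layout: int cents, no currency recorded
                cents = in.readInt();
                currencyCode = "USD";
                break;
            case 2: // current layout
                cents = in.readLong();
                currencyCode = in.readUTF();
                break;
            default:
                throw new InvalidObjectException("Unknown version: " + version);
        }
    }

    // Serializes and deserializes an instance in memory.
    static VersionedMoney roundTrip(VersionedMoney m) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(m);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (VersionedMoney) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        VersionedMoney copy = roundTrip(new VersionedMoney(1250L, "EUR"));
        System.out.println(copy.getCents() + " " + copy.getCurrencyCode());
    }
}
```

Note that every field is written and read by hand here, which is exactly the cost described next.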

Doing this will enable you to explicitly control the versioning of your class. The added control comes with an important consequence, which you ought to consider before doing this: as soon as you start to explicitly version your classes, defaultWriteObject( ) and defaultReadObject( ) lose a lot of their usefulness.

Trying to control versioning puts you in the position of explicitly writing all the marshalling and demarshalling code. This is a trade-off you might not want to make.

Performance Issues

Serialization is a generic marshalling and demarshalling algorithm, with many hooks for customization. As an experienced programmer, you should be skeptical--generic algorithms with many hooks for customization tend to be slow. Serialization is not an exception to this rule. It is, at times, both slow and bandwidth-intensive. There are three main performance problems with serialization: it depends on reflection, it has an incredibly verbose data format, and it is very easy to send more data than is required.

Serialization Depends on Reflection

The dependence on reflection is the hardest of these to eliminate. Both serializing and deserializing require the serialization mechanism to discover information about the instance it is serializing. At a minimum, the serialization algorithm needs to find out things such as the value of serialVersionUID, whether writeObject( ) is implemented, and what the superclass structure is. What's more, using the default serialization mechanism (or calling defaultWriteObject( ) from within writeObject( )) will use reflection to discover all the field values. This can be quite slow.

TIP:   Setting serialVersionUID is a simple, and often surprisingly noticeable, performance improvement. If you don't set serialVersionUID, the serialization mechanism has to compute it. This involves going through all the fields and methods and computing a hash. If you set serialVersionUID, on the other hand, the serialization mechanism simply looks up a single value.

Serialization Has a Verbose Data Format

Serialization's data format has two problems. The first is all the class description information included in the stream. To send a single instance of Money, we need to send all of the following:

  • The description of the ValueObject class
  • The description of the Money class
  • The instance data associated with the specific instance of Money.

This isn't a lot of information, but it's information that RMI computes and sends with every method invocation. (Recall that RMI resets the serialization mechanism with every method call.) Even if the first two bullets comprise only 100 extra bytes of information, the cumulative impact is probably significant.

The second problem is that each serialized instance is treated as an individual unit. If we are sending large numbers of instances within a single method invocation, then there is a fairly good chance that we could compress the data by noticing commonalities across the instances being sent.
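
The cost of the class descriptions is easy to measure. The following sketch uses hypothetical ValueObject and Money classes modeled on the ones above and writes the same instances to a stream twice: once letting the stream reuse class descriptions it has already sent, and once calling reset( ) after every object, which approximates the per-call reset described for RMI. Exact byte counts depend on class and package names, so none are quoted here.

```java
import java.io.*;

class StreamSizeDemo {
    // Hypothetical classes modeled on the article's ValueObject and Money.
    static class ValueObject implements Serializable { }

    static class Money extends ValueObject {
        protected int _cents;
        Money(int cents) { _cents = cents; }
    }

    // Writes `count` Money instances to one stream and returns the bytes produced.
    // With resetBetween set, the stream forgets everything after each object,
    // so the class descriptions are written again every time.
    static int sizeOfMany(int count, boolean resetBetween) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            for (int i = 0; i < count; i++) {
                out.writeObject(new Money(i));
                if (resetBetween) {
                    out.reset();
                }
            }
        }
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("one instance (4 bytes of real data): " + sizeOfMany(1, false) + " bytes");
        System.out.println("100 instances, no reset:             " + sizeOfMany(100, false) + " bytes");
        System.out.println("100 instances, reset each time:      " + sizeOfMany(100, true) + " bytes");
    }
}
```

The gap between the last two figures is the class description overhead that is paid again on every reset.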

It Is Easy to Send More Data Than Is Required

Serialization is a recursive algorithm. You pass in a single object, and all the objects that can be reached from that object by following instance variables are also serialized. To see why this can cause problems, suppose we have a simple application that uses the Employee class:

public class Employee implements Serializable {
    public String firstName;
    public String lastName;
    public String socialSecurityNumber;
}

In a later version of the application, someone adds a new piece of functionality. As part of doing so, they add a single additional field to Employee:

public class Employee implements Serializable {
    public String firstName;
    public String lastName;
    public String socialSecurityNumber;
    public Employee manager;
}

What happens as a result of this? On the bright side, the application still works. After everything is recompiled, the entire application, including the remote method invocations, will still work. That's the nice aspect of serialization--we added new fields, and the data format used to send arguments over the wire automatically adapted to handle our changes. We didn't have to do any work at all.

On the other hand, adding a new field redefined the data format associated with Employee. Because serialVersionUID wasn't defined in the first version of the class, none of the old data can be read back in anymore. And there's an even more serious problem: we've just dramatically increased the bandwidth required by remote method calls.

Suppose Bob works in the mailroom, and we serialize the object associated with Bob. In the old version of our application, the data for serialization consisted of:

  • The class information for Employee
  • The instance data for Bob

In the new version, we send:

  • The class information for Employee
  • The instance data for Bob
  • The instance data for Sally (who runs the mailroom and is Bob's manager)
  • The instance data for Henry (who is in charge of building facilities)
  • The instance data for Alison (Director, Corporate Infrastructure)
  • The instance data for Mary (VP in charge of IT)

And so on...

The new version of the application isn't backwards-compatible because our old data can't be read by the new version of the application. In addition, it's slower and is much more likely to cause network congestion.
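
The growth is straightforward to observe. The sketch below rebuilds the chain from the example (Bob, Sally, Henry, Alison, Mary) with a simplified, hypothetical Employee constructor, and compares the serialized size of Bob on his own with the size of Bob plus everyone reachable through manager. Byte counts vary with class names, so only the comparison is meaningful.

```java
import java.io.*;

class ManagerChainDemo {
    static class Employee implements Serializable {
        public String firstName;
        public String lastName;
        public String socialSecurityNumber;
        public Employee manager;

        // Hypothetical convenience constructor; the other fields stay null.
        Employee(String firstName, Employee manager) {
            this.firstName = firstName;
            this.manager = manager;
        }
    }

    // Returns the number of bytes produced by serializing the object graph rooted at o.
    static int serializedSize(Object o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        Employee mary = new Employee("Mary", null);
        Employee alison = new Employee("Alison", mary);
        Employee henry = new Employee("Henry", alison);
        Employee sally = new Employee("Sally", henry);
        Employee bob = new Employee("Bob", sally);
        Employee bobAlone = new Employee("Bob", null);

        System.out.println("Bob alone:      " + serializedSize(bobAlone) + " bytes");
        System.out.println("Bob with chain: " + serializedSize(bob) + " bytes");
    }
}
```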

The Externalizable Interface

To solve the performance problems associated with making a class Serializable, the serialization mechanism allows you to declare that a class is Externalizable instead. When ObjectOutputStream's writeObject( ) method is called, it performs the following sequence of actions:

  1. It tests to see if the object is an instance of Externalizable. If so, it uses externalization to marshall the object.
  2. If the object isn't an instance of Externalizable, it tests to see whether the object is an instance of Serializable. If so, it uses serialization to marshall the object.
  3. If neither of these two cases applies, an exception is thrown.
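
The three outcomes can be observed with a small experiment. The classes Plain, Ser, and Ext below are hypothetical, and the counter in Ext exists only to show that writeExternal( ) was the method invoked. The exception in the third case is java.io.NotSerializableException.

```java
import java.io.*;

class MarshalChoiceDemo {
    // Neither Serializable nor Externalizable: writeObject() should reject it.
    static class Plain { }

    // Serializable only: handled by default serialization.
    static class Ser implements Serializable {
        int x = 1;
    }

    // Externalizable: writeExternal() is called instead of default serialization.
    public static class Ext implements Externalizable {
        static int writeExternalCalls = 0;

        public Ext() { }

        @Override
        public void writeExternal(ObjectOutput out) {
            writeExternalCalls++;
        }

        @Override
        public void readExternal(ObjectInput in) { }
    }

    // Returns true if the stream accepted the object, false if it was rejected.
    static boolean accepts(Object o) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Ext accepted:   " + accepts(new Ext())
                + " (writeExternal calls: " + Ext.writeExternalCalls + ")");
        System.out.println("Ser accepted:   " + accepts(new Ser()));
        System.out.println("Plain accepted: " + accepts(new Plain()));
    }
}
```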

Externalizable is an interface that consists of two methods:

public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException;
public void writeExternal(ObjectOutput out) throws IOException;

These have roughly the same role that readObject( ) and writeObject( ) have for serialization. There are, however, some very important differences. The first, and most obvious, is that readExternal( ) and writeExternal( ) are part of the Externalizable interface. An object cannot be declared to be Externalizable without implementing these methods.

However, the major difference lies in how these methods are used. The serialization mechanism always writes out class descriptions of all the serializable superclasses. And it always writes out the information associated with the instance when viewed as an instance of each individual superclass.

Externalization gets rid of some of this. It writes out the identity of the class (which boils down to the name of the class and the appropriate serialVersionUID). It also stores the superclass structure and all the information about the class hierarchy. But instead of visiting each superclass and using that superclass to store some of the state information, it simply calls writeExternal( ) on the local class definition. In a nutshell: it stores all the metadata, but writes out only the local instance information.

TIP:   This is true even if the superclass implements Serializable. The metadata about the class structure will be written to the stream, but the serialization mechanism will not be invoked. This can be useful if, for some reason, you want to avoid using serialization with the superclass. For example, some of the Swing classes, while they claim to implement Serializable, do so incorrectly (and will throw exceptions during the serialization process). (JTextArea is one of the most egregious offenders.) If you really need to use these classes, and you think serialization would be useful, you may want to think about creating a subclass and declaring it to be Externalizable. Instances of your class will be written out and read in using externalization. Because the superclass is never serialized or deserialized, the incorrect code is never invoked, and the exceptions are never thrown.
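
A short experiment shows the "local instance information only" behavior. In this sketch, ValueObject and Money are hypothetical: ValueObject is Serializable and carries a label field, while Money is Externalizable and writes only its own cents. One detail the example relies on is that an Externalizable class needs a public no-argument constructor, because deserialization creates the instance through it before calling readExternal( ).

```java
import java.io.*;

class ExternalizableSketch {
    // Hypothetical Serializable superclass with state of its own.
    static class ValueObject implements Serializable {
        protected String label = "unset";
    }

    public static class Money extends ValueObject implements Externalizable {
        private int cents;

        // Public no-arg constructor, invoked during deserialization.
        public Money() { }

        public Money(String label, int cents) {
            this.label = label;
            this.cents = cents;
        }

        public int getCents() { return cents; }
        public String getLabel() { return label; }

        // Only the local field is written; the superclass's label is not.
        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeInt(cents);
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            cents = in.readInt();
        }
    }

    // Writes the instance to memory and reads it back.
    static Money roundTrip(Money m) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(m);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Money) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Money copy = roundTrip(new Money("petty cash", 1250));
        System.out.println(copy.getCents() + " / " + copy.getLabel());
    }
}
```

After the round trip, cents survives but label is back to the value assigned by the superclass initializer. If the superclass state matters, writeExternal( ) and readExternal( ) have to handle it explicitly.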
