Process migration was not invented in a vacuum; like many other operating system features, it is an incremental development that builds on, and is motivated by, features found in many existing operating systems. This section reviews some of this initial groundwork.
Perhaps the biggest single idea that is critical to the success of process migration is transparency. Transparency, in its most basic form, is the replication of an environment on more than one computer. This replication gives a user the ability to run the same program on any of a number of different computers, with the expectation that the program will have the same behavior regardless of which computer is actually running it.
Sun's Network File System [1], or NFS, is an early example of a system that supported limited transparency. By carefully mounting the same network filesystems in the same place on each client node, the namespace of the filesystem would appear to be the same on all of the nodes. Other facilities, such as Sun's Network Information Service [2], or NIS, allow easy replication of network configuration data and user authentication information.
The first steps described above led to the development of ``computing clusters,'' or large-scale groups of computers under the same administrative purview that present a homogeneous computing environment to the user. Examples of these cluster projects can be found in MIT's Project Athena [3] and CMU's Andrew [4]. The goal of such clusters was to allow a user to sit down at any workstation on campus and gain access to her familiar environment, including her home directory.
In a cluster of homogeneous workstations such as those described above, enough transparency was provided to allow remote process execution using the simple Berkeley rsh command. Advanced users soon became adept at splitting tasks into multiple, parallel jobs and running each job on a different machine using rsh. While this method was effective in some cases, it suffered from some obvious problems: for example, users running parallel jobs had to manually determine which machines were available to run them.
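As an illustration only, the sketch below shows how such manual distribution might have been coded. The host names and the worker command are hypothetical; note that the user must still choose the hosts by hand.

\begin{verbatim}
/* Hypothetical sketch: farming out a parallel job by hand with rsh.
 * The host names and the "worker" command are illustrative only. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* The user must know (or guess) which machines are free. */
    const char *hosts[] = { "node1", "node2", "node3" };
    const int nhosts = 3;

    for (int i = 0; i < nhosts; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            char cmd[64];
            /* Each remote job works on its own slice of the input. */
            snprintf(cmd, sizeof(cmd), "worker --slice %d", i);
            execlp("rsh", "rsh", hosts[i], cmd, (char *)NULL);
            perror("execlp");
            _exit(1);
        }
    }

    /* Wait for every remote job to finish. */
    while (wait(NULL) > 0)
        ;
    return 0;
}
\end{verbatim}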
The LOCUS operating system [5], developed at UCLA, was another early example of a system developed with transparency as one of its primary goals. It was designed ``from the ground up'' as a distributed operating system, supporting transparent access to data via a network-wide filesystem and transparent remote process execution in a more elegant way than rsh. Although LOCUS arguably presented the user with a better interface than other systems, it still required manual selection of the nodes on which to run processes.
At this point, the state of the art in operating systems provided high levels of transparency between machines: users now had at their disposal dozens or even hundreds of different computers, all of which could run a process with the same results. The next logical step was to develop methods to automate the distribution of processes to processors based on processor usage or load patterns. Systems such as butler [6] helped to solve this problem by monitoring the load on the machines in a cluster and assigning processes to machines with low load. This approach is called static scheduling: once the system decides which machines will receive processes, those processes must run to completion on those machines.
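The sketch below illustrates the flavor of such a static, load-based placement decision. The host list and the query_load routine are hypothetical stand-ins, not the actual butler implementation.

\begin{verbatim}
/* Illustrative sketch of static, load-based placement: poll each
 * candidate host's load once, place the job on the least-loaded host,
 * and never revisit the decision. */
#include <stdio.h>

struct host {
    const char *name;
    double load;               /* e.g., a one-minute load average */
};

/* In a real system this would query the remote machine over the
 * network; here it returns a placeholder value. */
static double query_load(const char *name)
{
    (void)name;
    return 1.0;
}

int main(void)
{
    struct host hosts[] = { { "node1", 0 }, { "node2", 0 }, { "node3", 0 } };
    int n = sizeof(hosts) / sizeof(hosts[0]);
    int best = 0;

    for (int i = 0; i < n; i++) {
        hosts[i].load = query_load(hosts[i].name);
        if (hosts[i].load < hosts[best].load)
            best = i;
    }

    /* Static scheduling: once chosen, the job stays on this host
     * until it completes, even if the load situation changes. */
    printf("placing job on %s\n", hosts[best].name);
    return 0;
}
\end{verbatim}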
In addition, parallel programming environments such as MPI [7], PVM [8], and Linda [9] made it easier for programmers to take advantage of transparent system clusters to solve problems in parallel. Each of these systems consists of two components: a programming language or language library, used by the programmer to specify a parallel computation, and a run-time environment, used to execute the specified computation.
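For illustration, the fragment below shows the shape of a minimal MPI program in C. The program itself only learns its rank and the total number of processes; the run-time system, invoked separately (for example, through a launcher such as mpirun), is what actually creates the processes and places them on processors.

\begin{verbatim}
/* Minimal MPI example, shown for illustration only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?  */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes?  */

    /* Each process works on the part of the problem selected by its
     * rank; the placement of processes on machines is fixed for the
     * lifetime of the computation. */
    printf("process %d of %d starting work\n", rank, size);

    MPI_Finalize();
    return 0;
}
\end{verbatim}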
A parallel run-time environment is responsible for the details of creating processes and assigning them to processors. The run-time systems mentioned above all use static scheduling, although some select processors based on information such as the current load on the set of possible target machines.