Lab 3: Virtual Memory

Introduction

In this lab, you will add paged virtual memory management to PIOS. So far PIOS's "processes" have all been running in the same address space as the kernel, with full ability to read or modify any part of physical memory, which of course makes the whole system vulnerable to bugs or misbehavior in any process even when processes are executing in user mode. We will now use the x86's page-based address translation facilities to give each process an independent user-level address space, providing remaining key ingredient for protection between processes by preventing processes from accessing either the kernel's or other processes' address spaces. We will also enhance PIOS's system call API to allow a process to copy data into and out of child processes, using copy-on-write optimization to minimize the cost of copying, These system call enhancements will allow the parent process not only to "fork" child processes with cloned address spaces as in Unix, but also — moving a step beyond typical Unix APIs — allow the parent to merge results that a child process computes directly back into the parent's own address space without having to communicate indirectly through pipes or files.

This lab contains the following implementation components:

Page Table Management: Building paging structures for the kernel and for user processors, and enabling page translation.
Loading and Running an ELF Executable: Loading an ELF executable image into the first user-space process and preparing it for execution.
User Space Copyin/Copyout: Copying memory-based system call arguments out of or into user space while protecting against invalid arguments or traps.
Memory Management with Copy-on-Write: Efficiently "copying" memory via read-only shared mappings, and lazily copying actual page content only on demand.
Virtual Memory Merge: Merging memory changes made in one process since a memory snapshot into another process's address space.

Software Setup

In this lab you will build on the kernel you started in lab 2. Use the same procedure as in the previous lab to create a lab3 branch in your Git repository, fetch our skeleton source code for lab 3 from the master Git repository, and merge your lab2 solutions into it as your starting point for lab 3:

$ cd lab
$ git commit -am 'my solution to lab2'
$ git pull
$ git checkout -b lab3 origin/lab3
$ git merge lab2

Lab 3 contains the following new source files, which you should browse through and familiarize yourself with:

`kern/pmap.{c,h}`	Page mapping and virtual memory code template
`lib/entry.S`	Entrypoint code in assembly language for user-level processes
`lib/debug.c`	Debugging support code for user-level processes

Part 1: Page Table Management

Before doing anything else, make sure you thoroughly understand the x86's protected-mode memory management architecture for both segmentation and page translation.

Exercise 1. Read chapters 3 of the IA-32 System Programming Guide, if you haven't done so already. Pay especially careful attention to the details in sections 3.6, 3.7, and 3.11.

Next, carefully review all the paging-related definitions in the PIOS source files inc/mmu.h and kern/pmap.h. The former in particular has a variety of macros and constants that will be extremely useful in this lab if used appropriately.

Setting Up the Kernel's Address Space

Once paging is enabled on an x86 processor, it is enabled all the time, whether the processor is executing in kernel or user mode. While some processor architectures give the kernel a "special" way to access physical memory directly while running in kernel mode, the x86 is not such an architecture. Once an x86 kernel enables paging, the only way any running code, including the kernel, can access physical memory or I/O devices is through the paged memory management system. This means that to use page memory management at all, the kernel must first initialize at least one page directory/page table structure with the mappings it will need in order to continue executing, and must load this paging structure into the processor before enabling paging.

In PIOS, the kernel expects to access physical memory and I/O devices at virtual addresses identical to their physical addresses. This means that before enabling paging we will have to create a set of identity mappings, which map a given virtual address to the same physical address.

Exercise 2. Replace the panic in pmap_init() with code to set up identity-mappings for the kernel portion of PIOS's virtual address space, which is all the virtual address space below VM_USERLO and above VM_USERHI. The easiest way to do this is to use 4MB "super-pages", as described in the IA-32 System Programming Guide. Also, since these page mappings will never need to change when the processor context switches among user processes, it is worthwhile to mark these mappings global, again as described in the System Programming Guide: that way, the processor knows not to flush these mappings when the kernel reloads the CR3 register to change address spaces.

You should initialize all page directory entries corresponding to the user-mode address space, between VM_USERLO and VM_USERHI, to the constant PTE_ZERO, defined earlier in pmap.c. Note that this constant is not the same as NULL: it is the (nonzero!) physical address of a particular page that the kernel will keep permanently set to all zeros. Using PTE_ZERO instead of NULL in unmapped page directory and page table entries will slightly simplify code later in this lab, since the kernel can safely enable read permission on PTE_ZERO mappings and allow user code to read (but not write!) this all-zero page.

Hint: Be sure to take full advantage of the many macros we have provided in inc/mmu.h to manipulate paging structures. Appropriate use of macros such as these will make your code much simpler and easier to understand.

We have included code further down in pmap_init() to enable the processor paging features that PIOS uses (namely 4MB pages and Global mappings), load the bootstrap page directory into the CR3 register, and turn on paging. Once you have correctly initialized the bootstrap page directory, you should be able to execute past the final lcr0 instruction. If something is wrong with your page directory, the processor is likely to triple-fault and reset at this instruction, because once paging is enabled with a bad page directory, the processor is unlikely to be able to do just about anything — including dispatch a page fault to the kernel's page fault handler.

Per-Process Page Directories

Now that the kernel can execute with paging enabled, it is time to start building the mapping structures we will need to give each user process its own independent address space. Recall that the x86's basic 32-bit paging system uses a two-level mapping structure: a page directory represents the entire 32-bit address space, and the page tables that the page directory refers to each contain page mappings for a 4MB (2^22 byte) address region.

In PIOS, each process will always have two page directories exclusively for the use of that process: a working page directory (proc.pdir) and a reference page directory (proc.rpdir). The working page directory is the one the kernel loads into the processor when executing that process. We will see the purpose of the reference page directory in part 5 of this lab; for now all you need to know is that both page directories need to be allocated when a new process is allocated, and both get initialized to a copy of the pmap_bootpdir: i.e., with the standard mappings for the kernel part of the address space and an empty user address space region.

Exercise 3. Add code to proc_alloc() to allocate and initialize a process's working and reference page directories when the process is created. We have pmap_newpdir() and pmap_freepdir() functions in pmap.c that may be useful.

Page Table Management

You will now implement the "boilerplate" code the kernel will need to create and manage full two-level page tables for user processes. This basic management code consists of the following three functions:

pte_t *pmap_walk(pde_t *pdir, uint32_t va, bool writing): This is the main function to "walk" the 2-level structure and allocate page tables as needed by the kernel. The purpose of the function is to locate the page table entry (PTE) — the particular 32-bit word in the second-level table — that maps the 4KB page containing virtual address va. Note that va can be an arbitrary address within a page: it need not point to the beginning of a page.
Since the kernel initializes all page directory entries (PDEs) in a new page directory to PTE_ZERO, however, this means that a new directory has no associated page tables: instad the kernel allocates and initialize page tables on demand the first time it needs to map something into the 4MB region covered by a given PDE. This is one of the secondary tasks of pmap_walk().
The behavior of pmap_walk() varies depending on whether it is being used for "reading" or "writing" the paging structures, as indicated by the caller with the writing flag. If writing is zero (false) and pmap_walk() encounters a missing page table, it just returns NULL and assumes the caller will know how to proceed. If writing is nonzero (true), however, then pmap_walk() will attempt to "fill in" a missing PDE by allocating and initializing a new page table, returning NULL only if the mem_alloc() fails. When pmap_walk() allocates and inserts a new page table into the page directory, it initializes all the PTEs in the new page table to PTE_ZERO, but inserts the page table itself into the page directory with permissions allowing the processor to use the new page table, for both read and write accesses and in either user or kernel mode.
pte_t *pmap_insert(pde_t *pdir, pageinfo *pi, uint32_t va, int perm): Given a pointer to a pageinfo structure describing a given physical page, this function maps that page in the paging structure at virtual address va and with PTE permissions perm. If another page was already mapped at the same virtual address, this function first unmaps the previous page.
Both the unmapping and mapping processes adjust the reference counts associated physical pages (in the pageinfo structure) appropriately: incrementing the reference count for the page being mapped, and decrementing the reference count for any page being unmapped. If unmapping the old page releases the last reference to that page, then the old page must be freed as well. We have provided mem_incref() and mem_decref() functions in kern/mem.c to increment and decrement the per-page reference counts atomically: this means that you don't have to use spinlocks to maintain the consistency of the reference counts when multiple processors may be adding and releasing references to the same page concurrently. The mem_decref() function also calls mem_free() automatically once the reference count reaches zero: this means that you must be really done with the page when you call mem_decref(), since another processor could immediately allocate and start using the page the moment you release the last reference.
void pmap_remove(pde_t *pdir, uint32_t va, size_t size): This function removes an arbitrary contiguous range of page mappings from a virtual address space, starting at va (which must be page-aligned), and covering an address region of size size (which must likewise be a multiple of PAGESIZE). Unlike the above two functions, which deal with only a single 4KB page mapping at a time, this function iterates through the virtual address space, possibly removing many page mappings in the process. Also, if the address region to be removed covers the entire 4MB regions represented by one or more page tables, then pmap_remove() not only removes all the mappings in those page tables, but also unmaps and releases the page tables themselves.
As with pmap_insert(), this function decrements the reference counts of all unmapped pages, and frees any page whose last reference has been released. Unmapped page table entries are left in the same state that PTEs in freshly allocated page tables are initialized with: namely, set to PTE_ZERO.

Since both pmap_insert() and pmap_remove() can affect page mappings that have already been used by the processor and loaded into the processor's translation lookaside buffer (TLB), these functions also ensure that all the affected mappings are flushed from the TLB if the address space being modified is the address space of the currently running process.

In most multiprocessor operating systems, operations like pmap_insert() and pmap_remove() might have to flush not only the current processor's TLB, but the TLBs of other concurrently running processors. This procedure is known as TLB shootdown. Why do they have to do this? Think about what happens if multiple user-level threads share the same process's address space space. What characteristics of PIOS's process model make remote TLB shootdown unnecessary in PIOS's case?

Exercise 4. Implement the above three functions in kern/pmap.c. Be careful to maintain all reference counts properly, both when allocating and releasing page tables and when inserting and removing page mappings. Be careful to set all the permission bits correctly at both the page directory and page table levels. Also, when iterating through the address space in pmap_remove, be careful to handle the nontrivial cases properly: e.g., when given an address region that does not start or end on 4MB boundaries represented by particular page tables, but may nevertheless cover entire 4MB regions whose page tables must be unmapped and released.

For now, you can ignore the text in the comment for pmap_walk about copying read-shared page tables: you will deal with that, if necessary, in part 4 of the lab.

When you have completed this exercise correctly, you should be able to get through pmap_check() successfully.

Part 2: Loading and Running an ELF Executable

Now that you have code to set up paging structures, it's time to set up some paging structures: namely those required to run a "root" user-mode process in user address space and from an executable image separately linked from that of the kernel. We have provided a test program, user/testvm.c, which exercises and tests your virtual memory system, but you need to load that executable into the root process's address space and start it running.

You already encountered Executable and Linkable Format (ELF) files earlier while exploring the xv6 and PIOS boot process: the boot loader code in boot/main.c already loaded the kernel into physical memory from an ELF executable. Now you will write a very similar, and not much more complicated, loader in the kernel to load the first user-level program into memory. Now would be a good time to have a close look at the ELF specification and the definitions in inc/elf.h.

The main differences between what the boot loader did for the kernel and what your kernel code needs to do now are:

Whereas physical memory was already "just sitting there" waiting for the boot loader to load something into it, the root process initially contains no accessible memory at all in the user address space area where you need to load the program. The kernel must allocate physical pages and map them into the user virtual memory area as it loads the program.
An ELF executable can contain multiple segments intended to be loaded with different memory access permissions: e.g., read-only for code and constant strings, read/write for initialized and uninitialized data (bss). The flags in the p_flags member of the ELF Program Header (proghdr) structure, in particular, indicates how a segment should be mapped. The boot loader simply ignored these flags since there is no way to assign access permissions to physical memory. While loading the first user process into virtual memory, however, the kernel can and should set its memory permissions correctly, for good measure and aid in debugging. The only permissions you need to worry about are read and write, because until recently the x86 architecture didn't support page-level execute permissions.
Note: The relevant bit values for p_flags are defined in inc/elf.h and are not the same values as the x86 PTE permissions: you have to translate them when setting the page permissions. Also, an ELF program segment may span many virtual memory pages, and will not necessarily start or end on a page boundary. However, no two ELF program segments will load onto the same virtual memory page. (The linker enforces this rule while it is laying out the executable.)
The boot loader only loads the initialized portion of each program segment, represented by the p_filesz in the ELF Program Headers: i.e., the portion of the program segment contained in the ELF file. Any program segment can also contain an uninitialized portion, above p_filesz but below the in-memory segment size indicated by the Program Header's p_memsz member. An ELF program loader is expected to map this uninitialized data area but set its contents to zero. The PIOS boot loader just neglected this duty and instead let the kernel do that itself: that's what the memset(edata, 0, end - edata) does at the beginning of kern/init.c. When the kernel loads the first user-mode process, however, it should do it "the right way", mapping and initializing the entire p_memsz bytes of each program segment correctly. (In practice there is generally only one program segment for which p_memsz is greater than p_filesz, but there is no reason this fact needs to affect the ELF loader.)
Note: Neither the initialized nor uninitialized portions of a program segment will necessarily start or end on a page boundary. This implies that one page of a program segment — the one containing the "boundary" — may contain both initialized (from the ELF file) and uninitialized (cleared to zero) data.

Besides describing the segments to be loaded into memory, the ELF header also indicates where the program should start executing: i.e., the user-level EIP of the first instruction in the program. In addition to loading the program itself as described by the ELF image, the kernel will need to give the root process a stack to execute on. A small one-page stack should be sufficient; the root process can later allocate a bigger stack for itself if it needs one. The high end of the user virtual memory area — the last page just before VM_USERHI — would probably be a suitable place for the root process's stack.

Exercise 5. Implement a root program loader in kern/init.c, by replacing your lab 2 code to start a root process executing at user() in the kernel's address space (which should no longer work now that the kernel's address space is inaccessible when the processor is running in user mode). The PIOS makefiles have linked directly into the kernel a binary copy of the ELF executable for the root process, which you can find using the "magic" linker-defined symbol _binary_obj_user_testvm_start referenced at the top of kern/init.c.

Hint: There are at least two general approaches to the loading process:

For each virtual page affected by a program segment, allocate and map a physical page appropriately, and copy the correct data and/or zeros into that page using the page's physical address as the destination. This approach doesn't (yet) require the mappings to work, but slicing and dicing program segments into pages can require moderately complicated arithmetic.
First allocate and map all the pages a program segment covers, then initialize the segment all at once by accessing it at its virtual address. This approach may make the loading arithmetic simpler, but you'll need to make sure the processor is using the correct page directory — and how do you write to the virtual mapping of a program segment that (eventually) needs to be read-only?

Hint 2: Testing your program loader is likely to reveal bugs in your mapping structure management code. If something is wrong with your mappings, the processor will probably take a page fault, so set a breakpoint at trap() to catch these. Also, make sure your process management code does something sensible when trying to "reflect" a trap that occurs in the root process: since the root process has no parent to reflect the trap to, you might want to dump the trapframe and panic, for example. When the root process "returns" gracefully via sys_ret(), however, your kernel should simply call done(), which will help the grade scripts know when the root process is finished with all tests.

Once you have the program loader working, you should be able to step through proc_run() and into the user mode process. GDB initially won't know where you are, because testvm was linked separately from the kernel and GDB only has symbols for the kernel. You can fix this problem by typing this command into GDB:

add-symbol-file obj/user/testvm 0x40000100

This will augment the debugging symbols GDB already has for the kernel with the debugging symbols contained in the ELF image obj/user/testvm. The 0x40000100 tells GDB where testvm is loaded in virtual memory. GDB requires this argument because this command is normally used to load symbol tables for shared libraries, which are normally position-independent code (PIC). PIOS's root process is not position-independent, so GDB technically doesn't need the load location argument in our case, but GDB apparently doesn't know that.

Challenge: Add proper support for page-level execute permissions when running on AMD processors, via AMD's extension for "No Execute" bits in page table entries. This isn't as easy as it may sound, unfortunately, because at the time AMD introduced this feature there were no more bits available in 32-bit page table entries, so the "No Execute" bit is available only in conjunction with Intel's "Page Address Extensions" to support 64-bit PTEs.

Background: A number of years ago, long before the introduction of true 64-bit x86 processors, the 32-bit physical address space introduced by the 80386 processor started getting cramped, and hardware vendors wanted to build PCs with more than 4GB of RAM — even though users were still running 32-bit operating systems. In response to this demand, Intel created the Page Address Extensions (PAE), which allow 32-bit operating systems to use 64GB (36 bits worth) of physical RAM. All this RAM obviously can't be mapped into the kernel's — or any single user process's — 32-bit address space at once, but it can be used if the kernel doesn't need to have all physical RAM mapped into its address space all the time (as PIOS does) and if this physical RAM is distributed among several user processes.

PAE works by rearranging the paging structures: it increases all page table entries from 32 to 64 bits in size, thus halving the number of entries per page table or page directory, while making room for more physical address bits and other features. But halving the number of entries meant one page table level could translate only 9 bits of virtual address rather than 10, thus necessiating a third (small) level of translation. This "level 3" page table, called the page directory pointer table, contains the four "page directory pointers" necessary to map a full 32-bit virtual address space with 64-bit PTEs. Thus, making use of PAE does not exactly represent a trivial change to the kernel's page table management code, although nothing has changed fundamentally.

AMD later enhanced PAE mode further by adding the ability to disable code execution at page granularity, via a new "No Execute" (NX) bit in each PTE. This was touted as a major security feature, because it makes it more difficult for viruses and other malware to exploit buffer overflow bugs by injecting code into the heap or stack and then causing that code to be executed (from the heap or stack). Both Intel and AMD now support execute-disable in the new 64-bit mode, although only AMD supports it in 32-bit PAE mode (see AMD's latest architecture manuals for details). So if you try this challenge problem, make sure you have an AMD processor to test on (or be prepared to rewrite your kernel to run in 64-bit mode)!

Challenge: Intel processors provide the equivalent technology to AMD's extension for non-executable page. Read the latest version of Intel developer's manual and add support for Intel's non-executable page in PIOS.

Part 3: User Space Copyin/Copyout

Now that virtual memory has provided us some hope of protecting the kernel's state from user-level processes and protecting processes from each other, we need to reconsider how the kernel interacts with user-level code during system calls. System calls generally have arguments, which need to be passed from user space to the kernel. PIOS system calls pass simple arguments in registers: the user-mode system call stubs in lib/syscall.c just load the arguments into the appropriate registers before executing the INT T_SYSCALL instruction, and the kernel's system call handling code in kern/syscall.c retrieves these arguments from the user-level trapframe that the trap entrypoint code pushed on the kernel stack.

Many system calls take arguments that don't fit in registers, however, such as the string argument to sys_cputs and the cpustate pointer to sys_put and sys_get. For such arguments, PIOS's user-space system call stub just leaves the argument data in user space and passes a pointer to the data in a register. The kernel then needs to read the contents of the argument data from user space — or write system call results into user space, in the case of output arguments such as the cpustate structure that sys_get fills in. But what if the user code passes an invalid pointer for such an argument? Consider what would happen in your current system call handling code if a user mode program:

passes a pointer into the kernel's address space region, outside the VM_USERLO and VM_USERHI region.
passes a pointer into user space, but the area of virtual memory pointed to is either fully or partially unmapped (no access permissions).
passes a pointer to a user space data area that the system call will write to, such as the cpustate argument to sys_get, but part or all of that data area is mapped read-only.

The Confused Deputy Problem

This issue is one specific instance of a very general security issue called the confused deputy problem. In short, a trusted "deputy" (the kernel in this case) has multiple sets of authorities — namely the ability to access both kernel space and user virtual memory &mdash. The kernel has only one "namespace" of virtual addresses with which to access memory, however: kernel space and user process space are mixed into one 32-bit address space whose boundaries are defined only by the kernel programming conventions. Because the kernel needs to access memory under multiple authorities (i.e., wearing different "hats"), but has no way to associate the memory accesses it performs with the authority it intends to use when performing that access, without extreme care a user process can "confuse" the kernel into exercising its authority to access kernel space when it only intends to use its authority to access user space, thereby fatally compromising the kernel's security.

Security issues of this kind are a direct consequence of PIOS's use of page-level protection bits to distinguish kernel space from user space, because clearing the PTE_U in a PDE or PTE prevents user code from directly accessing privileged memory, but does nothing to prevent the kernel from accidentally accessing privileged memory when it is reading or writing data on behalf of the user using user-provided pointers.

In at least this one respect, the 80286's old segmentation-based virtual memory system may have actually provided a fundamentally cleaner and more secure design. When segmentation-based software passes pointers among system components, it typically does so using "far pointers", which encapsulate both a segment selector and an offset in a single "pointer object." The x86 processor's rules for handling segment registers ensure that user-level (ring 3) code can never load a segment associated with a higher privilege level. Unprivileged code cannot even "inherit" access to more-privileged segments from more privileged code that might accedentally leave privileged segments loaded in segment registers when returning to less privileged code, because the processor clears any such segment registers whenever it transfers control to a lower privilege level (via IRET for example). Therefore, if user code passes a far pointer to a kernel system call by passing the segment in a segment register and the offset in a normal register, the kernel may safely use that far pointer to access the user's segment, because the segment selector acts as a capability that binds the name of the memory to be accessed to the authority under which it should be accessed. In this design, user code cannot trick the kernel into accessing memory that the user code itself could not access because the user code must explicitly pass the authority with which the memory is to be accessed (the segment selector), it cannot pass an authority to which it does not itself have access (by virtue of being able to load the segment selector), and the kernel uses only the explicitly passed authority (the segment selector passed by the user) when accessing the memory named by that far pointer. See the Confused Deputy paper for other illuminating discussion of this issue.

Protecting PIOS System Calls

Since neither segmentation-based nor capability-based systems took off, however, we unfortunately live in a word in which kernels simply have to be extraordinarily careful whenever they access memory on behalf of a user process — or when any program with special privileges does anything with any kind of name supplied by a less-trusted program, for that matter. To fix PIOS's gaping security vulnerabilities, we need to do two things:

Whenever the kernel is about to access an area of memory that was specified by the user and is supposed to be pointing into user space, the kernel must carefully verify that the entire data area specified by the user indeed lies in the user's part of the address space (between VM_USERLO and VM_USERHI).
In addition, the kernel must anticipate that a user process may accidentally or maliciously pass a pointer to a memory area that is, in full or in part, not mapped with the correct permissions for the intended operation. In this case, the processor will take a trap while running in the kernel (e.g., in a memmove used to copy data into or out of user space), but the kernel must be careful to place the blame for this trap where it belongs: on the user process, not the kernel itself.

PIOS handles memory access errors caused by the user during system calls by behaving as if the INT T_SYSCALL instruction itself had caused the corresponding memory access error. For example, if the user issues a SYS_GET system call requesting that the kernel write a cpustate structure to an invalid or unmapped address, the user process will take a page fault exactly as if it had instead executed a MOV instruction that directly tried to write to the illegal address. This approach to handling memory access faults during system calls is merely PIOS's approach: Unix systems more commonly just cause the system call to return with an error code such as EFAULT in such a situation, which has advantages (it's a bit simpler for the user code to catch a simple error return than a simulated trap) and disadvantages (the user code is more likely just to ignore the return code, even though it likely indicates a serious program failure, and perhaps cause the error cause even more damage).

Exercise 6. Implement systrap, sysrecover, checkva, and usercopy in kern/syscall.c, to provide the logic necessary to access user memory safely using user-provided pointers. Then modify the system call handlers to use usercopy when copying system call argument data into or out of user space.

Part 4: Memory Management with Copy-on-Write

We have created a root process in a fully virtual address space, and finally taken proper measures to protect the kernel from user processes; what's still missing is a way for user processes themselves to change their own virtual address spaces, to set up virtual address spaces for their child processes, and to communicate data inputs and results with their children. Instead of adding more new system calls for this purpose, we will simply extend the existing GET and PUT system calls to enable the caller to perform memory management operations along with the existing functions of these system calls. This way, processes can easily combine several related operations into one GET or PUT system call for efficiency: e.g., a parent can set up both a child's register memory state and start the child running with one PUT system call, and can later synchronize with the child, retrieve its register state, and retrieve results from its virtual address space with one GET system call. The process specifies memory management operations to perform by ORing the following values into the system call command code:

Memory operation code in the SYS_MEMOP bit field. Must be one of the following values:
- SYS_ZERO: Remove all permissions from a range of virtual memory in the destination process (child for PUT, parent for GET), and replace the content of this virtual memory range with all zeros. This operation effectively reverts the given memory range to the "primordial" state all user memory starts out in when a user process is first created.
- SYS_COPY: Copy a virtual memory range from the source process (parent for PUT, child for GET) into the destination process (child for PUT, parent for GET). To make the copy more efficient, the kernel initially copies only memory mappings, making all affected mappings read-only in both the source and destination, and then copies actual page content only on demand as user code tries to write to the copied pages.
- SYS_MERGE (only available with GET): Copy the differences between the child's reference address space snapshot and its current working address space, within a given virtual memory range in the child's space, back into the parent's address space. This operation is described later in Part 5.
In all of these memory operations, for simplicity the memory region affected must start on a 4MB boundary and must be a multiple of 4MB in size: that is, these memory operations always affect complete page tables at a time worth of address space, and are not available to the user at page granularity. Taking advantage of this assumption will greatly simplify your code.
On entry to the GET/PUT system call, user code specifies the relevant memory ranges in the following registers:
- ECX: contains the memory region size. Must be a multiple of 4MB.
- ESI: contains the start of the source region in the relevant process (child for GET, parent for PUT). Must be a multiple of 4MB.
- EDI: contains the start of the destination region in the relevant process (parent for GET, child for PUT). Must be a multiple of 4MB.
SYS_PERM: If the calling process ORs this flag into the GET/PUT command code, then after performing any memory operation specified in the SYS_MEMOP bit field as described above (or after performing no memory operation if the SYS_MEMOP bits are zero), the kernel sets the nominal page permissions on all the pages in the destination memory operation range to the values specified in the SYS_READ and SYS_WRITE bits of the system call command code. Thus:
- SYS_PERM alone: If the caller specifies specifies just SYS_PERM without SYS_READ or SYS_WRITE, then the GET/PUT removes all access permissions from all pages in the destination memory region. This removal of access does not deallocate or zero the content of the destination memory region: the actual page contents remain associated with the relevant locations in the virtual address space, but merely temporarily "hidden" and inaccessible. A subsequent SYS_PERM operation can reinstate access permissions to the inaccessible pages, and the old content in these pages will then be reinstated unchanged. Only the SYS_ZERO operation above actually clears both permissions and page content.
- SYS_PERM | SYS_READ: Sets the permissions on all pages in the destination region to permit read-only access (but not write access), regardless of any of these pages' previous permissions. If the caller performs this operation on a destination memory region that has never been used, or on pages that have been reset with SYS_ZERO, these pages become accessible read-only and filled with zeros.
- SYS_PERM | SYS_READ | SYS_WRITE: Sets the permissions on all pages in the destination region to permit both read and write access, regardless of any of these pages' previous permissions. If the caller performs this operation on a destination memory region that has never been used, or on pages that have been reset with SYS_ZERO, these pages are initially filled with zeros but become accessible read/write, and thus may be used as "newly allocated" virtual memory. Calling SYS_PERM | SYS_READ | SYS_WRITE on a never-before-used or SYS_ZERO'd region is essentially PIOS's equivalent to sbrk() in Unix or xv6.
Although the memory SYS_MEMOP memory operations are restricted to operating on 4MB-aligned memory regions a multiple of 4MB in size, if the caller does not request a memory operation but only specifies SYS_PERM, then SYS_PERM allows the destination memory region to have arbitrary page alignment in both start and size. This allows processes to manage access permissions at page granularity, to set up page permissions when loading child processes from ELF executables for example, even though the kernel's "bulk memory management" facilities only support the much larger 4MB granularity.
SYS_SNAP (only available with PUT): If the calling process ORs in this flag in a PUT system call, then after performing any memory operation and/or permission change specified above, the kernel copies the child's entire working address space into the child's reference address space snapshot, represented by proc->rpdir, again using copy-on-write. This flag is discussed further in Part 5 below.

Copy-on-Write and Nominal versus Actual Page Permissions

When the kernel performs a "bulk" virtual address space copy as requested by a SYS_COPY or SYS_SNAP operation, it uses copy-on-write to make the copy "appear" to happen almost instantaneously, merely after rearranging some virtual memory mappings, while deferring the actual "hard labor" of copying physical memory until the affected processes actually start writing to the copied pages. When the kernel performs a virtual copy, for each "copied" page the kernel effectively just:

copies the page mapping rather than the page itself, so that the source and destination mappings actually refer to the same physical page in memory.
increments the reference count on each page shared this way, to keep track of the number of references that exist to each page.
sets both the source and destination mappings to read-only, to prevent either process from directly modifying the shared page and shattering the illusion that the two processes really have separate and independent copies of the page's content.

When either process tries to write to a page that was writable but is now read-only because of the kernel's copy-on-write optimization, the kernel needs handle this page fault transparently to the application. Specifically, the kernel allocates a new page, copies of the shared page's content into the new page, redirects the faulting mapping from the old, shared page to point to the new, exclusive page, and updates the pages' reference counts accordingly (releasing a reference to the old shared page and adding one to the new page). Since the new page is now owned exclusively by the process that took the page fault, the kernel can now safely reinstate write permission on the new page.

But how does the kernel keep track of which pages were writeable before it did a virtual copy, and which pages were never supposed to be writeable, e.g., because they are part of a read-only segment loaded from an ELF executable that contains program code or constant data? To make this distinction correctly we must associate two sets of permissions with each page: nominal and actual permissions.

The SYS_PERM operation described above affects the nominal page permissions of the destination memory region, which may sometimes be different from the actual page permissions of some or all of those virtual memory pages. The nominal page permissions of a range of virtual memory are the page permissions as specified and seen by user-level processes, whereas the actual page permissions are the "kernel-internal" permissions that the processor sees in the PTE_P and PTE_W bits of page table entries. A page's nominal and actual page permissions are often the same, but may differ when the kernel performs copy-on-write or other virtual memory tricks transparently to user mode code. For example, when the kernel uses copy-on-write to copy a memory page whose nominal permissions are SYS_READ | SYS_WRITE, the resulting two copies of the page mapping continue to have nominal permissions of SYS_READ | SYS_WRITE, but the kernel gives these mappings actual page permissions of only PTE_P but not PTE_W. When one of the user processes attempts to modify that page, the processor takes a page fault because the actual permissions do not allow writes. The kernel then notices that the mapping's nominal do allow writes, so it copies the shared page and raises the mapping's actual permissions once again to equal the nominal permissions.

If the user-specified nominal permissions on a page are only SYS_READ, however, then the kernel's page fault handler does not automatically raise the page's actual permissions on a write fault, since this nominal permission indicates that from the user's perspective the page should not be writable. A write to a page with nominal permissions of SYS_READ probably indicates a logical software bug — or at least means that user-level code does not want writes to happen to that page — so the kernel must reflect a write fault on that page to the parent process just as it would with other exceptions, rather than attempting to handle the fault transparently.

If you look at the definitions of SYS_READ and SYS_WRITE in inc/syscall.h and compare them with the PTE definitions in inc/mmu.h, you will notice that SYS_READ and SYS_WRITE both lie in the PTE_AVAIL part of the page table entry, which the processor ignores and leaves available for software use. This design is intentional: it makes it easy for the kernel to record each page's nominal permissions (as represented by the SYS_READ and SYS_WRITE bits) in the PTE_AVAIL portion of each page table entry, without interfering with the PTE's actual page permissions in the lower bits or with the physical address in bits 12-31.

Implementing PIOS Memory Management Operations

You now should have the information required to implement PIOS's memory management API. When implementing it, keep in mind that although it has a number of features described above, many of those features are fairly orthogonal, and their implementations can be kept relatively simple as long as you take advantage of their orthogonality. For example, even though a single PUT system call can potentially do a SYS_COPY, a SYS_PERM, and a SYS_SNAP (not to mention any or all of the other process management operations you implemented in lab 2), the kernel does these operations one at a time and independently. The implementation of one operation doesn't need to know if or how the other operations were or will be performed, as long as you simply perform them in the correct order as described above.

Exercise 7. Implement the pmap_copy() function in kern/pmap.c, which provides the basis for copy-on-write used by SYS_COPY (and SYS_SNAP, described below, but you don't have to worry about that while implementing pmap_copy()).

Hint: Keep in mind the API restriction stated above that both the source and destination memory regions are always aligned on 4MB boundaries. This API restriction makes the virtual copy operation much simpler: if you need more than about 50 lines of code, you are probably doing something wrong.

Hint: One design choice you will have to make is whether to perform copy-on-write optimization on page tables as well as on pages. Performing virtual copies of entire page tables makes the initial virtual copy operation even simpler and faster, but means that page tables may become shared read-only among multiple page directories, and you will have to modify pmap_walk() to copy and "un-share" a read-shared page table it encounters when called with the writing flag set to true. For simplicity, we recommend that you not share page tables in your initial implementation — just make pmap_copy directly copy all the individual page mappings from the source page table to the destination page table — and then perhaps try implementing copy-on-write for page tables once you have basic copy-on-write working.

Exercise 8. Now implement a kernel page fault handler to copy-on-write faults transparently to user mode code. We have provided a template function pmap_pagefault() in kern/pmap.c; you will need to fill it in and hook it into trap() at the appropriate point.

Hint: Think very carefully about exactly where the kernel's main trap handler should call pmap_pagefault(), especially in relation to the trap handler's recovery mechanism (the part that looks at the cpu_cur()->recover pointer). Suppose that a user-mode process asks the GET system call to deposit the child process's CPU register state into a memory buffer that has recently been copied via copy-on-write, and thus has nominal permissions of SYS_READ | SYS_WRITE but actual permissions of only PTE_P (but not PTE_W). How should the kernel handle the page fault that results during the system call - who is "at fault", so to speak, and who should handle the fault?

Exercise 9. Implement the memory operations SYS_ZERO and SYS_COPY, as described above. But be sure to check the memory region arguments for validity: e.g., they should not allow user processes to modify or copy data out of the kernel part of the address space. If user code attempts such evilness, the system call handler should call systrap() to issue a page fault (T_PGFLT).

Hint: After the above argument validity checking, SYS_ZERO is basically just a pmap_remove(), and SYS_COPY is basically just a pmap_copy().

Exercise 10. Finally, implement the SYS_PERM operation, which happens after the SYS_ZERO or SYS_COPY (if requested) in a GET or PUT system call.

Hint: Take full advantage of the fact that "empty" page table entries in new page tables allocated by pdir_walk(), as well as page table entries removed by pmap_remove(), are set to PTE_ZERO — which is not just NULL but the physical address of a page full of zeros. You can set the nominal permissions in such a PTE according to the caller's request, without actually having to allocate a page for that PTE. You can even set the actual permissions on such a PTE to PTE_P, allowing the user process to read this all-zero page. You cannot enable PTE_W in such a page mapping, of course, since that would allow the user process to scribble on this page that's only ever supposed to hold zeros. But if the user does try to write to a PTE_ZERO page with nominal permissions of SYS_READ | SYS_WRITE, just make sure your page fault handler knows how to make a new, exclusive copy of the zero page just as it would copy a read-shared page from a copy-on-write: the only real difference will be in the reference count handling, since PTE_ZERO mappings do not represent counted references. In effect, you are reusing your copy-on-write code to implement demand zero, or clearing "newly allocated" virtual pages on demand only as the user process actually starts writing to them.

Challenge: Enhance the memory operations above so that they all work on arbitrary 4KB page boundaries, not just 4MB page table boundaries as specified above. Also write some testing code to exercise all of these system calls at non-4MB boundaries, including testing "exotic" boundary cases, such as performing memory operations on a memory range that doesn't start or end on a 4MB boundary but is big enough to cover one or more complete 4MB page tables "in the middle" of the region.

Part 5: Virtual Memory Merge

You have seen how xv6 implements the Unix fork() system call, by simply copying the parent process's entire memory segment into a new memory segment for the child process. And now you have implemented, in the context of PIOS's rather different system call API, the same basic copy-on-write mechanism that all modern Unix kernels use to make their implementations of fork() efficient, especially in the common case where the child process actually writes to very few pages before it exec()s another program. Now we will take a step beyond the functionality Unix kernels offer and provide a mechanism to merge a child's results directly back into the parent's address space after executing independently for some (perhaps long) time period.

Recall that Unix's fork() effectively initializes the child process's state with a "one-time snapshot" of the parent process's state at the time of the fork(), but after that time the parent and child processes evolve separately and can communicate only via the small (typically 8-bit) return code that the child returns to the parent on exit(), or else via separate abstractions such as pipes, sockets, or files in the globally shared file system. In order to keep PIOS as simple as possible, however, we wish to minimize the number of abstractions the PIOS kernel needs to implement. Processes need some way to communicate with each other, and since processes need virtual memory in any case, PIOS uses virtual memory instead of pipes or files as the basic abstraction for inter-process communication.

The GET/PUT system call API you implemented above already provides a "coarse-grained" way for processes to communicate: a parent process can use SYS_PUT | SYS_COPY to copy itself or another program, together with suitable input data, into a child process's address space, run the child process, and then use SYS_GET | SYS_COPY to retrieve results the child left in its address space back into the parent's address space for further use. This is a coarse-grained mechanism, however, operating on a minimum granularity of 4MB address space regions (or 4KB pages even if you implement that challenge problem). And the parent has to "know" in exactly which contiguous region(s) the child will deposit its results and what region(s) in the parent's address space to copy those results to, and avoid clobbering any important data in the parent with the result data copied from the child. If the parent and child processes are closely related, the parent may prefer to have the child produce results at a much finer granularity: e.g., to have the child compute new values of a few particular word-size variables scattered throughout the parent's address space, and perhaps intermixed with memory areas that the child is not expected to modify — but which the parent itself or other children might modify in the meantime. Providing such a capability is the purpose of PIOS's virtual memory merge facility. Don't be afraid of the fancy-sounding name: it's not much more complicated than plain copy-on-write copying, but requires some explanation since it's likely to be an unfamiliar concept.

Threads Without Heisenbugs

In modern Unix kernels, tightly-coupled parallel activities would typically be implemented by multiple threads sharing the same address space, rather than by multiple processes. The motivation for threads is partly one of efficiency, and partly one of convenience: namely, threads provide exactly the capability described above, of allowing a program to "fire off" a parallel activity that computes certain values or updates certain variables that may be at arbitrary, scattered locations in the shared address space. But conventional threads come with the severe danger of heisenbugs. Because the memory access interleaving that the program sees among different threads depends on the relative execution speed of the different threads, which can vary from one run to another depending on a huge variety of subtle nondeterministic factors, the slightest error in synchronization is likely to lead to wild behavior that can be extremely difficult to reproduce and debug.

Instead of providing the heisenbug-ridden thread and shared memory abstractions of conventional kernels, therefore, PIOS instead simply enhances its support for independent processes so as to provide the convenience of threads without the nondeterminism and resulting susceptibility to heisenbugs. PIOS's virtual memory merge facility provides this capability by allowing a parent process to snapshot a child process's address space before execution, run the child process, then merge only the differences between the reference snapshot and the child's final virtual memory state back into the parent process. This mechanism operates not at a 4MB or 4KB granularity but at a machine word granularity. Thus, if the child modified a particular variable in the merged region since the parent forked it off, the new, modified value will appear back in the parent's address space at the corresponding location after the merge, allowing the parent to compute further with the new value or propagate the new data to future child processes it forks off. But for variables that the child did not change during its execution, the merge leaves the corresponding location unaffected in the parent process — even if the parent process itself, or some other child that also merges its results into the parent before or after the child in question, happens to modify that variable.

Thus, as long as the parent and its children are careful to work together and avoid modifying conflicting locations in the merged address region, each child can compute and return results in variables arbitrarily scattered throughout the parent's address space, just as threads could do in conventional operating systems. Unlike threads in conventional systems, however, these computed results won't appear in the parent's address space at apparently random times dependent on the child thread's execution speed, but will instead appear in the parent only at the instant the parent explicitly requests that those results appear, by issuing the merge operation as part of the GET system call. The point at which the parent synchronizes with the child (namely the SYS_GET system call), and the point at which child synchronizes with the parent (due to a SYS_RET system call or an exception from user mode), are both deterministic results of the respective processes' local, single-threaded executions and independent of relative execution speeds, and the virtual memory merge operation is similarly deterministic in its behavior. For this reason, PIOS's equivalent of "threads" does not suffer from the nondeterminism of conventional threads, ensuring that the result of any computation, correct or not, can always be exactly reproduced. PIOS's deterministic "threads" do come with a potential performance cost due to additional virtual memory and copying operations, of course: most convenient, high-level abstractions that kernels provide come with some performance cost, and the question is then whether the abstraction's convenience justifies its cost. For more information about this somewhat unusual programming model, see this short draft paper on Deterministic Consistency.

Merge API

PIOS's API support for virtual memory merge consists of two operations briefly mentioned earlier:

SYS_SNAP, only available in the PUT system call: When the caller specifies this flag, after performing any other requested memory and/or permission management operations, the kernel performs a copy-on-write copy of the child process's entire working user address space (from VM_USERLO to VM_USERHI) into the child process's reference address space (represented by proc->rpdir), overwriting any previous state in the child's reference address space.
SYS_MERGE, only available in the GET system call: When the caller specifies this memory operation (as an alternative to SYS_ZERO or SYS_COPY), the kernel copies data from the child to the parent much as with SYS_COPY, except that the kernel copies only those parts of the child's address space whose values have changed with respect to the child's last reference snapshot. This merge process occurs at 32-bit machine word granularity: if some words in a particular page of the child's working address space differ from the corresponding words in the child's reference space, then the kernel copies only those words into the destination region in the parent's space, leaving other words on the same destination page in the parent's address space unmodified.

When performing a merge, the kernel need not necessarily compare every word of every page in the merged part of the child's address space with the corresponding word in the child's reference space. Since the child's reference space was merely a copy-on-write snapshot of the child's working space from some time before the merge, pages in the working space that the child never wrote to at all will still contain the same, read-shared page mappings as in the reference space. Thus, on such pages, the kernel merely needs to compare page table entries, and need not delve into actual page contents unless the working page has been modified (and thus un-shared) with respect to the original reference page.

Exercise 11. Implement SYS_SNAP and SYS_MERGE as described above. We have provided the skeleton functions pmap_merge() and pmap_mergepage() in kern/pmap.c to provide an outline for how to implement the merge function. The testvm program contains several parallel computations to exercise and test your merge implementation.

Hint: Keep in mind that SYS_SNAP is essentially just a pmap_copy(), and that SYS_MERGE requires the source and destination memory range to be aligned on 4MB boundaries, which should greatly simplify your page directory and page table traversal logic.

Challenge: The obvious way to merge a page of the child's space back into the parent at a word granularity is to iterate through the page a word at a time, reading and comparing each word in the child's working page with the corresponding word in the reference page, and if different, copying the word into the parent's page. This approach is not the most efficient, however, because (a) it merges only one word at a time, and (b) it involves branches, which limit the performance of modern processors by decreasing potential parallelism and introducing pipeline bubbles.

Intel's Streaming SIMD Extensions (SSE) instructions can be used to merge pages in 128-bit rather than word-sized chunks, and without using any branches other than the one that iterates the loop (and that branch is "mostly harmless" because it's highly predictable). Not only that, but using this approach, we can implement not just word-granularity merges but byte-granularity merges with no performance penalty. For example, if a parent process creates a char[] array, and forks one child process to compute even elements of the array and a second child to compute odd elements of the array, a word-granularity merge will yield incorrect results due to false conflicts between changes to bytes within a word, while a byte-granularity merge will provide the desired behavior.

Implement a byte-granularity, branch-free merge using SSE. To use SSE instructions at all, you will need to modify a control register or two during kernel initialization to enable the SSE extensions: see the IA-32 System Programming Guide for details. To implement the merge itself, some particular SSE instructions you will probably want to look at include PCMPEQB, PAND, PANDN, and POR. Use the processor's timestamp counters (RDTSC instruction) or some other timing mechanism to compare the performance of your SSE-based merge against the performance of a naïve word-by-word merge.

This completes the lab.