【MIT6.S081】lab10 mmap

MIT 6.S081 lab10

感觉这个 lab 还是挺有意思的，但是感觉没有前面几个 lab 难，感觉前面几个 lab 的难度太大了，这个 lab 花了2 天左右就完成了，当前中间还是有些点比较有疑问的问题，后续还需要进一步来思考的空间。在这个 lab 中的学习内容比较多，主要学习了以下几个知识重点:

mmap 的实现原理：如果仅仅实现 mmap 的话，原理比较简单。每次发起系统调用 mmap 时，就会在该进程的内核空间中申请一块内存并与之对应 WMA AREA 相对应，并更新当前进程的页表，如果我们使用 lazy alloc，此时并不会马上更新物理内存的内容，也并未申请实际的物理页与之相对应。
当进程读取到该 VMA 对应的内存区域时，会发生缺页中断，当进程捕获到该中断时会检测缺页的地址是否属于某个 VMA，如果该地址合法则申请新的物理页并将其进行 mapping 到 VMA 给定的地址空间中。同时将给定的 fd 中对应的内容读取到该物理页中，此时对于用户空间的应用则可以读取到相应的 fd 对应的内容，我们可以将同一个 fd 中对应的内容 map 到多个进程中。实际上每个 fd 可能对应的是某个文件或者设备，我们如果将其映射到内存的地址空间中就可以很方便和快速的进行读写与共享。
当我们试图对 VMA 对应的内存进行修改时，修改的内容暂时保存在物理内存中，比如可能多个应用程序对某个文件进行多次写操作，但这些执行都保存在内存中并未实际写入到文件中，当我们执行 unmap 操作时，操作系统才可能将这些修改内容一次性的全部写入到文件中，这是一种非常高效的共享机制。
使得用户空间的程序能够处理进程捕获的 page fault，这就需要每个用户进程在内核空间存在一个对应的内核线程来传递这些 trap。用户空间中能够处理这样 trap，并同时能够对物理页进行操纵。用户空间的 handler 处理完成后，再返回内核。

重要的是本章节提到了 Baker's Real-Time Copying Garbage Collector 的运行工作原理:

当然其中的原理还是挺复杂，需要专门的一个章节来完成。

lab10 mmap

本章的 lab 比较起来就不算是特别难的 lab 了。但是 lab 后面的 Optional challenges 难度就非常大了，可能需要花费非常多的时间来完成了。所有的源代码都放在github。

题目

The mmap and munmap system calls allow UNIX programs to exert detailed control over their address spaces. They can be used to share memory among processes, to map files into process address spaces, and as part of user-level page fault schemes such as the garbage-collection algorithms discussed in lecture. In this lab you'll add mmap and munmap to xv6, focusing on memory-mapped files.
Fetch the xv6 source for the lab and check out the mmap branch:

  $ git fetch
  $ git checkout mmap
  $ make clean
The manual page (run man 2 mmap) shows this declaration for mmap:

void *mmap(void *addr, size_t length, int prot, int flags,
           int fd, off_t offset);
mmap can be called in many ways, but this lab requires only a subset of its features relevant to memory-mapping a file. You can assume that addr will always be zero, meaning that the kernel should decide the virtual address at which to map the file. mmap returns that address, or 0xffffffffffffffff if it fails. length is the number of bytes to map; it might not be the same as the file's length. prot indicates whether the memory should be mapped readable, writeable, and/or executable; you can assume that prot is PROT_READ or PROT_WRITE or both. flags will be either MAP_SHARED, meaning that modifications to the mapped memory should be written back to the file, or MAP_PRIVATE, meaning that they should not. You don't have to implement any other bits in flags. fd is the open file descriptor of the file to map. You can assume offset is zero (it's the starting point in the file at which to map).

It's OK if processes that map the same MAP_SHARED file do not share physical pages.

munmap(addr, length) should remove mmap mappings in the indicated address range. If the process has modified the memory and has it mapped MAP_SHARED, the modifications should first be written to the file. An munmap call might cover only a portion of an mmap-ed region, but you can assume that it will either unmap at the start, or at the end, or the whole region (but not punch a hole in the middle of a region).

思考

mmap 的实现: 在这个 lab 中 mmap 的实现就非常简单了，我们首先观察一下该函数的模型

void *mmap(void *addr, size_t length, int prot, int flags,
           int fd, off_t offset);

它接受 6 个参数:
- addr 指定的映射地址;
- length 指定的映射的内存长度;
- prot 指定的读写权限;
- flags 指定的该映射的模式，如果为 MAP_SHARED 表示映射的内存会回写到指定的fd 中, 还有其他的标志位；
- fd 指定的文件的描述符；
- offset 指定的文件的偏移量;

给定的 lab 只要求完成基本的功能即可，即不考虑文件的 偏移量 并且不指定映射的地址空间，实际我们实现时只需要在该进程的虚拟地址空间末尾增减 length 长度，然后将其新增加的尾部地址空间映射到指定 fd 上即可，实现非常简单了。里面有几个难点如下:

首先我们直接将该进程的地址空间增减 length 长度，当然这里涉及到内存对齐的问题，这是由于进程每次只能向内核申请以物理页为最小单位的内存，所以这里需要处理涉及到内存对齐的问题，即映射的地址空间的起始位置要对齐 PG_SIZE。我们可以参考 cow lab 中的实现，直接将内存空间增大即可。
第二步我们需要在该进程中记录映射的区域，即所有的 vma，由于给定的 lab 中最大的 vma 只有 16 个，我们可以采用数组或者链表来实现，实现的 linux 实现就非常复杂了，实际处理应该是用的 reb-black tree 来记录所有的 vma，因为涉及到 vma 的查找，此时利用高效的二分搜索才能提高查找效率。
第三步，比较重要的是我们要完成真正的映射，前面提到过我们仅仅增加了进程的地址空间但并未分配实际的物理页，当存在某些程序读取到该未分配的地址空间时就会触发缺页中断，此时我们就需要在 user trap 中进行处理，此时我们就查找该进程的 vma，查找缺页的地址是否在 vma 记录中，
- 如果在 vma 中，则我们从堆中申请一个新的物理页，并从 vma 中记录的 fd 中读取一个页长度的内容到新加入的物理页中。此时需要注意的是我们计算读取内从的 offset，稍微需要一点技巧。
- 如果该地址不在 vma 中，则将直接返回缺页中断。
以上即完成了基本的 map 功能。
munmap 的实现: 首先我们观察一下 munmap 的函数原型如下：
munmap(addr, length)
给定的参数只有两个指定的地址与长度，我们在进行 umap 操作时，首先会计算物理页的长度，由于题目给定的限定条件为 umanp 操作只会从起始地址到结束地址依次进行 umap 而不会存在从中间进行 umap 的操作，因此就比较容易实现。
- 首先我们从 vma 中查找给定的 addr 是否合法，如果合法则将其从中去除，并查看该 vma 的对应的 flag，如果 flag 为 MAP_SHARED 则将其写回到文件中，当然这里实际应该需要处理一下文件的 offset 的问题，存在一定的瑕疵。
- 上述操作完成后，则将该段地址对应的物理页内存全部释放掉。当然我比较异疑惑的一点，对应的物理内存页释放掉，但是进程的地址空间实际没有进行变化，按道理此时进程的地址空间也应该释放掉一部分，否则每次访问该对应的地址都会出现 page fault 的问题。
对于 fork 子进程时，子进程创建时需要复制父进程的 vma 表项，当进程退出进行 exit 时需要释放 vma，并将其所有的内容可以写回需要写回到文件中。

代码

增加 sys_map 与 sys_umap 的系统调用。

static uint64 (*syscalls[])(void) = {
[SYS_fork]    sys_fork,
[SYS_exit]    sys_exit,
[SYS_wait]    sys_wait,
[SYS_pipe]    sys_pipe,
[SYS_read]    sys_read,
[SYS_kill]    sys_kill,
[SYS_exec]    sys_exec,
[SYS_fstat]   sys_fstat,
[SYS_chdir]   sys_chdir,
[SYS_dup]     sys_dup,
[SYS_getpid]  sys_getpid,
[SYS_sbrk]    sys_sbrk,
[SYS_sleep]   sys_sleep,
[SYS_uptime]  sys_uptime,
[SYS_open]    sys_open,
[SYS_write]   sys_write,
[SYS_mknod]   sys_mknod,
[SYS_unlink]  sys_unlink,
[SYS_link]    sys_link,
[SYS_mkdir]   sys_mkdir,
[SYS_close]   sys_close,
[SYS_mmap]    sys_mmap,
[SYS_munmap]   sys_munmap
};

vma 结构的定义如下，并在 proc 结构体中增加 vma 表项，每次 vma 都对应记录地址映射的内容。

/* vam entry */
struct vmaEntry {
	uint64 addr;
	uint32 length;
	uint32 permissions;
	uint32 flag;
	struct file *pf;
	uint32 offset;
	uint32 valid;
	struct vmaEntry *next;
	struct vmaEntry *prev;
};

// Per-process state
struct proc {
  struct spinlock lock;

  // p->lock must be held when using these:
  enum procstate state;        // Process state
  struct proc *parent;         // Parent process
  void *chan;                  // If non-zero, sleeping on chan
  int killed;                  // If non-zero, have been killed
  int xstate;                  // Exit status to be returned to parent's wait
  int pid;                     // Process ID

  // these are private to the process, so p->lock need not be held.
  uint64 kstack;               // Virtual address of kernel stack
  uint64 sz;                   // Size of process memory (bytes)
  pagetable_t pagetable;       // User page table
  struct trapframe *trapframe; // data page for trampoline.S
  struct context context;      // swtch() here to run process
  struct file *ofile[NOFILE];  // Open files
  struct inode *cwd;           // Current directory
  char name[16];               // Process name (debugging)
  struct vmaEntry vma[16];         // Process vma entry
};

sys_map 与 sys_umap 的实现

#include "types.h"
#include "riscv.h"
#include "defs.h"
#include "param.h"
#include "stat.h"
#include "spinlock.h"
#include "proc.h"
#include "fs.h"
#include "sleeplock.h"
#include "file.h"
#include "fcntl.h"
#include "sysmap.h"

#define max(a, b) ((a) > (b) ? (a) : (b))
#define min(a, b) ((a) < (b) ? (a) : (b))

int map_readfile(int vma_offset, int offset, uint64 addr, int length) 
{
	struct proc * p = myproc();
	struct file *pf = p->vma[vma_offset].pf;
	int ret = 0;
	ilock(pf->ip);
	ret = readi(pf->ip, 0, addr, offset, length);
	iunlock(pf->ip);
	return ret;
}

int map_writefile(struct file *pf, int offset, uint64 addr, int length) 
{
	int ret = 0;
	begin_op();
	ilock(pf->ip);
	ret = writei(pf->ip, 1, addr, offset, length);
	ilock(pf->ip);
	end_op();
	return ret;
}


int map_vma_init(struct vmaEntry *vma) {
   for (int i = 0; i < MAX_ENTRY_SIZE; i++) {
	  vma[i].valid = 0;
   }
   return 0;
}

int map_vma_copy(struct vmaEntry *src, struct vmaEntry *dst) {
   for (int i = 0; i < MAX_ENTRY_SIZE; i++) {
		if (src[i].valid) {
		   dst[i].addr = src[i].addr;
		   dst[i].length = src[i].length;
		   dst[i].permissions = src[i].permissions;
		   dst[i].flag = src[i].flag;
		   dst[i].pf = src[i].pf;
		   dst[i].offset = src[i].offset;
		   dst[i].valid = 1;
		   dst[i].pf = filedup(dst[i].pf);
		} else {
		   dst[i].valid = 0;
		}
	 }
   return 0;
}


int map_vma_query(uint64 va) {
   struct proc *p = myproc();
   for (int i = 0; i < MAX_ENTRY_SIZE; i++) {
   	   if (p->vma[i].valid) {
		   int offset = va - p->vma[i].addr;
		   if (offset >= 0 && offset < p->vma[i].length) {
				return i;
		   }
   	   }
   }
   return -1;
}

int map_vma_add(uint64 addr, struct file *f, int len, int prot, int flag, int offset) {
  struct proc * p = myproc();
  
  /* record the vma of the current process */
  f->ref++;
  for (int i = 0; i < MAX_ENTRY_SIZE; i++) {
  	if (!p->vma[i].valid) {
		p->vma[i].valid = 1;
		p->vma[i].addr = addr;
		p->vma[i].length = len;
		p->vma[i].permissions = prot;
		p->vma[i].flag = flag;
		p->vma[i].pf = f;
		p->vma[i].offset = offset & ~(PGSIZE - 1);
		break;
	}
  }
  return 0;
}


int map_vma_remove(int vma_offset) {
	struct proc *p = myproc();
	
	/* remove all the pages from process */
	if ((p->vma[vma_offset].permissions & PROT_WRITE) && p->vma[vma_offset].flag == MAP_SHARED) {
		filewrite(p->vma[vma_offset].pf, p->vma[vma_offset].addr, p->vma[vma_offset].length);
	}
	uvmunmap(p->pagetable, p->vma[vma_offset].addr, p->vma[vma_offset].length / PGSIZE, 0);
	/* remove all mapped pages */
	p->vma[vma_offset].pf->ref--;
	p->vma[vma_offset].valid = 0;
	return 0;
}

int map_unmap(uint64 addr, int length) 
{
	int vma_offset = map_vma_query(addr);
	if (vma_offset == -1) {
		return -1;
	}
	struct proc *p = myproc();

	/* remove all the pages from process */
	if (p->vma[vma_offset].flag == MAP_SHARED) {
		filewrite(p->vma[vma_offset].pf, p->vma[vma_offset].addr, p->vma[vma_offset].length);
	}
	uvmunmap(p->pagetable, p->vma[vma_offset].addr, length / PGSIZE, 0);
	p->vma[vma_offset].length -= length;
	p->vma[vma_offset].addr += length;
	/* remove all mapped pages */
	if (p->vma[vma_offset].length == 0) {
		p->vma[vma_offset].pf->ref--;
		p->vma[vma_offset].valid = 0;
	}
	return 0;
}


/* map files into process address spaces */
uint64
sys_mmap(void)
{
  uint64 addr;	
  int len;
  int prot;
  int flag;
  int offset;
  int fd;
  struct file *f;

  /* parse syscall args */
  if(argaddr(0, &addr) < 0 || argint(1, &len) < 0 || argint(2, &prot) < 0 || \
  	 argint(3, &flag) < 0 || argint(4, &fd) < 0 || argint(5, &offset) < 0) {
  	 return -1;
  }

  /* fd to file struct */
  if(fd < 0 || fd >= NOFILE || (f = myproc()->ofile[fd]) == 0) {
     return -1;
  }

  /* we check the file readable and writeable */
  if (f->writable == 0 && flag == MAP_SHARED && (prot & PROT_WRITE)) {
	 return -1;
  }
  
  /* we add the map address at the end of virtual address */
  addr = myproc()->sz;
  struct proc * p = myproc();
  if(len > 0){
	if(p->sz + len > MAXVA) {
	  return -1;
	}
	p->sz += len;
  }
  map_vma_add(addr, f, len, prot, flag, offset);
  return addr;
}


/* unmap files into process address spaces */
uint64
sys_munmap(void)
{
	uint64 addr;	
	int len;

	/* parse syscall args */
	if(argaddr(0, &addr) < 0 || argint(1, &len) < 0) {
	 	return -1;
	}
	
	/* we add the map address at the end of virtual address */	
	int vma_offset = map_vma_query(addr);
	if (vma_offset == -1) {
		 return -1;
	}
  
	/* remove all the pages from process */
	map_unmap(addr, len);
	return 0;
}

```  

4. `usertrap` 的实现：需要检测缺页中断，并去按照不同的缺页中断去处理，主要检测缺页的地址是否属于 `vma` 表项。
```C++
uint64 pcode = r_scause();
uint64 va = PGROUNDDOWN(r_stval());
// uint64 epc = r_sepc();

// page fault, valid virtual address, 记录trap发生的原因
if (pcode == 15 || pcode == 13) {
    if(va >= p->sz || va <= p->trapframe->sp || va > MAXVA) {
        p->killed = 1;
        goto fail;
    }
    int vma_offset = map_vma_query(va);
    if (vma_offset >= 0) {
        // alloc a page
        char *mem = kalloc();  
        if(mem == 0){
            p->killed = 1;
            goto fail;
        }
        memset(mem, 0, PGSIZE);
        int offset = p->vma[vma_offset].offset + PGROUNDDOWN(va - p->vma[vma_offset].addr);
        printf("get mmap page fault! va = %p\n", va);
        map_readfile(vma_offset, offset, (uint64)mem, PGSIZE);

        // mapper a page for va
        if(mappages(p->pagetable,PGROUNDDOWN(va), PGSIZE, (uint64)mem, (p->vma[vma_offset].permissions << 1) | PTE_U) != 0){
            kfree(mem);
            printf("get mmap page fault! va = %p\n", va);
            p->killed = 1;
            goto fail;
        }
        // p->trapframe->epc += epc;
    }
}else {
    printf("usertrap(): unexpected scause %p pid=%d\n", r_scause(), p->pid);
    printf("            sepc=%p stval=%p\n", r_sepc(), r_stval());
    p->killed = 1;
}

fork 操作时同时复制 vma 表项。

map_vma_init(np->vma);
map_vma_copy(p->vma, np->vma);
for (int i = 0; i < 16; i++) {
    if (p->vma[i].valid) {
    np->vma[i].pf = filedup(p->vma[i].pf);
    }
}

exit 操作时，同时移除 vma 表项。

// remove all map virtual address
for (int i = 0; i < 16; i++) {
    if (p->vma[i].valid) {
    map_vma_remove(i);
    }
}

总结

总的来说这个 lab 还是非常有挑战性的，是个好的 lab，用户空间的 page fault 处理确实是个非常有效的利器，后期会继续深入学习以下 GC 的处理流程。

欢迎关注和打赏，感谢支持！

关注我的博客: http://mikemeng.org/
关注我的知乎：https://www.zhihu.com/people/da-hua-niu
关注我的微信公众号: 公务程序猿

Tags: 公开课学习

← 【build a computer】 project 10 kickstart 2021 round H →

扫描二维码，分享此文章