The zero-page in the Linux kernel

Published: 2023-12-23 18:02:11  Author: tsecer

zero-page

Memory newly handed to user space by the operating system (via mmap or brk) is always zero-filled, but physical pages for those virtual addresses are usually allocated on demand. The "demand" can be either a read or a write. If a page is only ever read, all that matters is that the reader sees zeroes, so on top of the MMU "all" such virtual addresses can be mapped to a single physical page whose contents are zero.

So if the allocated memory is, in most cases, only read, it does not add to the system's physical memory consumption.
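
A minimal user-space sketch of this effect (my own illustration, not from the original post; it assumes a kernel that provides the zero-page, see the history notes below): reading every page of a large private anonymous mapping barely moves the peak RSS reported by getrusage(), while writing the same pages makes it grow by roughly the mapping size.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long peak_rss_kb(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_maxrss;    /* peak RSS, in kilobytes on Linux */
}

int main(void)
{
    size_t len = 256UL << 20;    /* 256 MiB private anonymous mapping */
    volatile char sink;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
    {
        perror("mmap");
        return 1;
    }

    printf("peak RSS before reads: %ld kB\n", peak_rss_kb());
    for (size_t i = 0; i < len; i += 4096)
        sink = p[i];        /* read fault: PTE points at the shared zero page */
    printf("peak RSS after reads:  %ld kB\n", peak_rss_kb());

    memset(p, 1, len);      /* write fault: COW allocates real pages */
    printf("peak RSS after writes: %ld kB\n", peak_rss_kb());
    (void)sink;
    return 0;
}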

This optimization was added to the Linux kernel very early on, but there was a (fairly short) period, long ago, during which it was removed.

Removal

The removal commit is dated "Oct 17, 2007" and went into v2.6.24-rc1.

One reason given in the commit log: with the zero-page, memory that is later written takes two page faults, one on the initial read and another for the COW at write time. In the measured scenario, clearing one page is clearly cheaper than servicing a thousand faults ("1 page clear is cheaper than a thousand faults").

Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
false optimisation: if an application is performance critical, it would
not be doing many read faults of new memory, or at least it could be
expected to write to that memory soon afterwards. If cache or memory use
is critical, it should not be working with a significant number of
ZERO_PAGEs anyway (a more compact representation of zeroes should be
used).

As a sanity check -- mesuring on my desktop system, there are never many
mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
increase much without it.

When running a make -j4 kernel compile on my dual core system, there are
about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
is torn down without being COWed). So removing ZERO_PAGE will save 1,000
page faults per second when running kbuild, while keeping it only saves
less than 1 page clearing operation per second. 1 page clear is cheaper
than a thousand faults, presumably, so there isn't an obvious loss.

Neither the logical argument nor these basic tests give a guarantee of no
regressions. However, this is a reasonable opportunity to try to remove
the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
we can reintroduce it and just avoid refcounting it.

Reinstate

The reinstating commit is dated "Sep 22, 2009" and went in during the 2.6.32-rc cycle.

KAMEZAWA Hiroyuki has observed customers of earlier kernels taking
advantage of the ZERO_PAGE: which we stopped do_anonymous_page() from
using in 2.6.24. And there were a couple of regression reports on LKML.

Following suggestions from Linus, reinstate do_anonymous_page() use of
the ZERO_PAGE; but this time avoid dirtying its struct page cacheline
with (map)count updates - let vm_normal_page() regard it as abnormal.

Use it only on arches which __HAVE_ARCH_PTE_SPECIAL (x86, s390, sh32,
most powerpc): that's not essential, but minimizes additional branches
(keeping them in the unlikely pte_special case); and incidentally
excludes mips (some models of which needed eight colours of ZERO_PAGE
to avoid costly exceptions).

Don't be fanatical about avoiding ZERO_PAGE updates: get_user_pages()
callers won't want to make exceptions for it, so increment its count
there. Changes to mlock and migration? happily seems not needed.

In most places it's quicker to check pfn than struct page address:
prepare a __read_mostly zero_pfn for that. Does get_dump_page()
still need its ZERO_PAGE check? probably not, but keep it anyway.

In other words, the zero-page was absent for roughly two years in total, all before 2010; every current kernel includes the feature.

The LWN discussion

The LWN article describes the problem caused by the removal: a program allocates a large anonymous mapping and writes to only a small fraction of it, but eventually reads all of it; on the newer kernels this ends up exhausting system memory.

Memory management changes made back in 2007 had the effect of adding reference counting to the zero page. And that turned out to be a problem on multiprocessor machines. Since all processors shared the same zero page (per-CPU differences being unlikely), they also all manipulated the same reference count. That led to serious problems with cache line bouncing, with a measurable performance impact. In response, Nick Piggin evaluated a number of possible fixes, including special hacks to avoid reference-counting the zero page or adding per-CPU zero pages. The patch that got merged, though, simply eliminated most use of the zero page altogether. The change was justified this way:

Inserting a ZERO_PAGE for anonymous read faults appears to be a false optimisation: if an application is performance critical, it would not be doing many read faults of new memory, or at least it could be expected to write to that memory soon afterwards. If cache or memory use is critical, it should not be working with a significant number of ZERO_PAGEs anyway (a more compact representation of zeroes should be used).
There was some nervousness about the patch at the time; Linus grumbled about the changes which created the problem in the first place, and worried:

The kernel has *always* (since pretty much day 1) done that ZERO_PAGE thing. This means that I would not be at all surprised if some application basically depends on it. I've written test-programs that depends on it - maybe people have written other code that basically has been written for and tested with a kernel that has basically always made read-only zero pages extra cheap.
Despite his misgivings, Linus merged the patch for 2.6.24 to see what sort of problems might come to the surface. For the next 18 months, it appeared that such problems were scarce indeed; most people forgot about the zero page altogether. In early June, though, Julian Phillips reported a problem he had observed:

I have a program which creates a reasonably large private anonymous map. The program then writes into a few places in the map, but ends up reading from all of them.
When I run this program on a system running 2.6.20.7 the process only ever seems to use enough memory to hold the data that has actually been written (well - in units of PAGE_SIZE). When I run the program on a system running 2.6.24.5 then as it reads the map the amount of memory used continues to increase until the complete map has actually been allocated (and since the total size is greater than the physically available RAM causes swapping). Basically I seem to be seeing copy-on-read instead of copy-on-write type behaviour.

Implementation

Below is the handler for anonymous page faults:

/*
 * We enter with non-exclusive mmap_sem (to exclude vma changes,
 * but allow concurrent faults), and pte mapped but not yet locked.
 * We return with mmap_sem still held, but pte unmapped and unlocked.
 */
static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
		unsigned long address, pte_t *page_table, pmd_t *pmd,
		unsigned int flags)
{
///...
	/* Use the zero-page for reads */
	if (!(flags & FAULT_FLAG_WRITE)) {
		entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
						vma->vm_page_prot));
		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
		if (!pte_none(*page_table))
			goto unlock;
		goto setpte;
	}
///...
	entry = mk_pte(page, vma->vm_page_prot);
	if (vma->vm_flags & VM_WRITE)
		entry = pte_mkwrite(pte_mkdirty(entry));

	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
	if (!pte_none(*page_table))
		goto release;

	inc_mm_counter_fast(mm, MM_ANONPAGES);
	page_add_new_anon_rmap(page, vma, address);
setpte:
	set_pte_at(mm, address, page_table, entry);
///...
}

vm_page_prot is initialized in mmap_region, via vm_get_page_prot:

unsigned long mmap_region(struct file *file, unsigned long addr,
		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff)
{
///...
	vma->vm_page_prot = vm_get_page_prot(vm_flags);
///...
}

In the vm_get_page_prot calculation, a private mapping does not get the writable bit (_PAGE_RW) even when the mapping was declared writable.

/* description of effects of mapping type and prot in current implementation.
 * this is due to the limited x86 page protection hardware.  The expected
 * behavior is in parens:
 *
 * map_type	prot
 *		PROT_NONE	PROT_READ	PROT_WRITE	PROT_EXEC
 * MAP_SHARED	r: (no) no	r: (yes) yes	r: (no) yes	r: (no) yes
 *		w: (no) no	w: (no) no	w: (yes) yes	w: (no) no
 *		x: (no) no	x: (no) yes	x: (no) yes	x: (yes) yes
 *		
 * MAP_PRIVATE	r: (no) no	r: (yes) yes	r: (no) yes	r: (no) yes
 *		w: (no) no	w: (no) no	w: (copy) copy	w: (no) no
 *		x: (no) no	x: (no) yes	x: (no) yes	x: (yes) yes
 *
 */
pgprot_t protection_map[16] = {
	__P000, __P001, __P010, __P011, __P100, __P101, __P110, __P111,
	__S000, __S001, __S010, __S011, __S100, __S101, __S110, __S111
};

pgprot_t vm_get_page_prot(unsigned long vm_flags)
{
	return __pgprot(pgprot_val(protection_map[vm_flags &
				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]) |
			pgprot_val(arch_vm_get_page_prot(vm_flags)));
}
	

On a typical x86 system, PROT_WRITE on a private mapping does not end up as a writable page protection: the __P01x entries resolve to PAGE_COPY, which does not contain _PAGE_RW.

#define _PAGE_PRESENT	(_AT(pteval_t, 1) << _PAGE_BIT_PRESENT)
#define _PAGE_RW	(_AT(pteval_t, 1) << _PAGE_BIT_RW)
#define _PAGE_USER	(_AT(pteval_t, 1) << _PAGE_BIT_USER)
#define _PAGE_PWT	(_AT(pteval_t, 1) << _PAGE_BIT_PWT)
#define _PAGE_PCD	(_AT(pteval_t, 1) << _PAGE_BIT_PCD)
#define _PAGE_ACCESSED	(_AT(pteval_t, 1) << _PAGE_BIT_ACCESSED)

/*         xwr */
#define __P000	PAGE_NONE
#define __P001	PAGE_READONLY
#define __P010	PAGE_COPY
#define __P011	PAGE_COPY
#define __P100	PAGE_READONLY_EXEC
#define __P101	PAGE_READONLY_EXEC
#define __P110	PAGE_COPY_EXEC
#define __P111	PAGE_COPY_EXEC

#define __S000	PAGE_NONE
#define __S001	PAGE_READONLY
#define __S010	PAGE_SHARED
#define __S011	PAGE_SHARED
#define __S100	PAGE_READONLY_EXEC
#define __S101	PAGE_READONLY_EXEC
#define __S110	PAGE_SHARED_EXEC
#define __S111	PAGE_SHARED_EXEC
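
As a concrete reading of the table above (my own sketch; the VM_* values are copied from include/linux/mm.h rather than included from kernel headers, so this is only an illustration of the indexing, not kernel code): a MAP_PRIVATE mapping created with PROT_READ | PROT_WRITE has VM_READ | VM_WRITE set, which selects protection_map[3], i.e. __P011 == PAGE_COPY, a protection without _PAGE_RW.

#include <stdio.h>

/* local mirrors of the kernel's low vm_flags bits (include/linux/mm.h) */
#define VM_READ   0x0001UL
#define VM_WRITE  0x0002UL
#define VM_EXEC   0x0004UL
#define VM_SHARED 0x0008UL

int main(void)
{
    unsigned long vm_flags = VM_READ | VM_WRITE;    /* MAP_PRIVATE, PROT_READ|PROT_WRITE */
    unsigned long idx = vm_flags & (VM_READ | VM_WRITE | VM_EXEC | VM_SHARED);

    /* idx == 3 -> protection_map[3] == __P011 == PAGE_COPY: readable but
     * not writable, so the first write must take a protection fault. */
    printf("protection_map index = %lu\n", idx);
    return 0;
}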

The writable bit has to be set separately, via pte_mkwrite (as do_anonymous_page does above on the write-fault path):

static inline pte_t pte_mkwrite(pte_t pte)
{
	return pte_set_flags(pte, _PAGE_RW);
}

This also means that, for memory obtained through mmap, the very first access to a page is bound to take a page fault; and when that first access was a read served by the zero-page, the first write still takes an additional protection fault for the COW.
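
A small sketch of this (again mine, not from the post) that counts minor faults with getrusage(): on a kernel with the zero-page, the read of a fresh page should account for one fault and the subsequent write for one more.

#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void)
{
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
    {
        perror("mmap");
        return 1;
    }

    long before = minor_faults();
    volatile char c = p[0];   /* fault #1: zero page installed read-only */
    long after_read = minor_faults();
    p[0] = 1;                 /* fault #2: write-protect fault, COW to a new page */
    long after_write = minor_faults();

    printf("faults for read: %ld, faults for write: %ld\n",
           after_read - before, after_write - after_read);
    (void)c;
    return 0;
}

Both printed differences should typically be 1. On the 2.6.24–2.6.31 kernels discussed above, the read alone would already allocate and clear a real, writable page, so the write would normally not fault again.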

Security considerations

Stepping back, the root of all this is that "memory newly allocated to user space must be zeroed". Could the zeroing simply be skipped? Why is it required at all?

From material available online, this is mainly a security consideration.

Security rationale

Here's the catch: Memory coming from the OS will be zeroed for security reasons.*

When the OS gives you memory, it could have been freed from a different process. So that memory could contain sensitive information such as a password. So to prevent you reading such data, the OS will zero it before it gives it to you.

This explanation is clear and concise: the operating system has to provide strict isolation between processes. If pages were not cleared (setting them all to ones would do just as well), a process could, intentionally or not, see pages freed by other processes, and those pages may still hold extremely sensitive data such as passwords, so a page's contents must be wiped completely before it is handed to another user.

What you describe as 'security' is really confidentiality, meaning that no process may read another processes memory, unless this memory is explicitly shared between these processes. In an operating system, this is one aspect of the isolation of concurrent activities, or processes.

What the operating system is doing to ensure this isolation, is whenever memory is requested by the process for heap or stack allocations, this memory is either coming from a region in physical memory that is filled whith zeroes, or that is filled with junk that is coming from the same process.

This ensures that you're only ever seeing zeroes, or your own junk, so confidentiality is ensured, and both heap and stack are 'secure', albeit not necessarily (zero-)initialized.

You're reading too much into your measurements.

MAP_ANONYMOUS + MAP_SHARED

A MAP_SHARED mapping is backed by the shared-memory filesystem (shmem/tmpfs), so memory mapped with MAP_ANONYMOUS + MAP_SHARED does not use the zero-page:

unsigned long mmap_region(struct file *file, unsigned long addr,
		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
		struct list_head *uf)
{
///...
	if (file) {
	///...
	} else if (vm_flags & VM_SHARED) {
		error = shmem_zero_setup(vma);
		if (error)
			goto free_vma;
	}
///...
}

/**
 * shmem_zero_setup - setup a shared anonymous mapping
 * @vma: the vma to be mmapped is prepared by do_mmap_pgoff
 */
int shmem_zero_setup(struct vm_area_struct *vma)
{
	struct file *file;
	loff_t size = vma->vm_end - vma->vm_start;

	/*
	 * Cloning a new file under mmap_sem leads to a lock ordering conflict
	 * between XFS directory reading and selinux: since this file is only
	 * accessible to the user through its mapping, use S_PRIVATE flag to
	 * bypass file security, in the same way as shmem_kernel_file_setup().
	 */
	file = __shmem_file_setup("dev/zero", size, vma->vm_flags, S_PRIVATE);
	if (IS_ERR(file))
		return PTR_ERR(file);

	if (vma->vm_file)
		fput(vma->vm_file);
	vma->vm_file = file;
	vma->vm_ops = &shmem_vm_ops;

	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE) &&
			((vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK) <
			(vma->vm_end & HPAGE_PMD_MASK)) {
		khugepaged_enter(vma, vma->vm_flags);
	}

	return 0;
}

static int shmem_fault(struct vm_fault *vmf)
{
///...
	error = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp,
				  gfp, vma, vmf, &ret);
	if (error)
		return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS);
	return ret;
}
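
A quick way to see this from user space (my own sketch, building on the excerpts above): map a region MAP_ANONYMOUS | MAP_SHARED, read it, and inspect the process while it sleeps. The region should show up in /proc/<pid>/maps as a deleted "/dev/zero" shmem file, and the pagemap entries should point at distinct pfns rather than one shared zero page.

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 1UL << 20;   /* 1 MiB shared anonymous mapping */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    if (p == MAP_FAILED)
    {
        perror("mmap");
        return 1;
    }

    volatile char c;
    for (size_t i = 0; i < len; i += 4096)
        c = p[i];             /* each fault goes through shmem_fault, not the zero page */
    (void)c;

    printf("pid %d, mapping at %p\n", getpid(), (void *)p);
    sleep(1000);              /* leave time to look at /proc/<pid>/maps and pagemap */
    return 0;
}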

Testing

Test code:

tsecer@harry: cat readanony.c 
#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

/* number of ints touched by the read loop; the mapping itself is
   MAPSIZE * sizeof(int) bytes */
#define MAPSIZE (0x1000000 * sizeof(int))

int main(int argc, const char *argv[])
{
    volatile int x;
    int *addr = (int*)mmap(0, MAPSIZE * sizeof(int), PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED)
    {
        perror("mmap");
        return 1;
    }
    /* read-only pass: every page should end up mapped to the zero page */
    for (size_t i = 0; i < MAPSIZE; i++)
    {
        x = addr[i];
    }
    printf("start addr %p end addr %p\n", addr, addr + MAPSIZE);
    sleep(1000);    /* keep the process around so its PTEs can be inspected */
    return 0;
}

tsecer@harry: ./pagemap 27480 0x7f0367d1e000 0x7f0367d5e000
0x7f0367d1e000     : pfn 2d8f             soft-dirty 0 file/shared 0 swapped 0 present 1
0x7f0367d1f000     : pfn 2d8f             soft-dirty 0 file/shared 0 swapped 0 present 1
0x7f0367d20000     : pfn 2d8f             soft-dirty 0 file/shared 0 swapped 0 present 1
0x7f0367d21000     : pfn 2d8f             soft-dirty 0 file/shared 0 swapped 0 present 1
0x7f0367d22000     : pfn 2d8f             soft-dirty 0 file/shared 0 swapped 0 present 1
0x7f0367d23000     : pfn 2d8f             soft-dirty 0 file/shared 0 swapped 0 present 1
0x7f0367d24000     : pfn 2d8f             soft-dirty 0 file/shared 0 swapped 0 present 1
0x7f0367d25000     : pfn 2d8f             soft-dirty 0 file/shared 0 swapped 0 present 1
0x7f0367d26000     : pfn 2d8f             soft-dirty 0 file/shared 0 swapped 0 present 1
0x7f0367d27000     : pfn 2d8f             soft-dirty 0 file/shared 0 swapped 0 present 1
0x7f0367d28000     : pfn 2d8f             soft-dirty 0 file/shared 0 swapped 0 present 1