I am often asked what “Large Pages” in computer systems are good for. For commodity (x86_64) processors, “small pages” are 4KiB, while “large pages” are (typically) 2MiB.
- The size of the page controls how many bits are translated between virtual and physical addresses, and so represents a trade-off between what the user is able to control (bits that are not translated) and what the operating system is able to control (bits that are translated).
- A very knowledgeable user can use address bits that are not translated to control how data is mapped into the caches and how data is mapped to DRAM banks (see the sketch below).
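As a minimal sketch of that split (assuming the standard x86_64 4KiB and 2MiB page sizes; the example address is arbitrary):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t va = 0x00007f3a12345678ULL;  /* arbitrary example virtual address */

    /* 4KiB pages: bits 11:0 are the untranslated page offset. */
    uint64_t offset_4k = va & ((1ULL << 12) - 1);

    /* 2MiB pages: bits 20:0 are untranslated, so the user controls
       9 more bits of the physical address (e.g., cache set and DRAM
       bank selection bits that fall in that range). */
    uint64_t offset_2m = va & ((1ULL << 21) - 1);

    printf("4KiB offset: 0x%03llx (12 untranslated bits)\n",
           (unsigned long long) offset_4k);
    printf("2MiB offset: 0x%06llx (21 untranslated bits)\n",
           (unsigned long long) offset_2m);
    return 0;
}
```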
The biggest performance benefit of “Large Pages” will come when you are doing widely spaced random accesses to a large region of memory — where “large” means much bigger than the range that can be mapped by all of the small page entries in the TLBs (which typically have multiple levels in modern processors).
To make things more complex, the number of TLB entries for 4KiB pages is often larger than the number of entries for 2MiB pages, but this varies a lot by processor. There is also a lot of variation in how many “large page” entries are available in the Level 2 TLB, and it is often unclear whether the TLB stores entries for 4KiB pages and for 2MiB pages in separate locations or whether they compete for the same underlying buffers.
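A quick way to compare configurations is to compute each TLB level's "reach" (number of entries times page size). A minimal sketch, using the Westmere entry counts listed below as an example:

```c
#include <stdio.h>

/* TLB reach = number of entries * page size.
   Entry counts here are the Westmere values quoted below. */
int main(void)
{
    const long KiB = 1024L, MiB = 1024L * KiB;

    long reach_l1_4k = 64  * 4 * KiB;   /* L1 DTLB, 4KiB pages */
    long reach_l1_2m = 32  * 2 * MiB;   /* L1 DTLB, 2MiB pages */
    long reach_l2_4k = 512 * 4 * KiB;   /* L2 TLB, 4KiB pages  */

    printf("L1 DTLB reach, 4KiB pages: %ld KiB\n", reach_l1_4k / KiB); /* 256 KiB */
    printf("L1 DTLB reach, 2MiB pages: %ld MiB\n", reach_l1_2m / MiB); /* 64 MiB  */
    printf("L2 TLB  reach, 4KiB pages: %ld MiB\n", reach_l2_4k / MiB); /* 2 MiB   */
    return 0;
}
```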
Examples of the differences between processors (using Todd Allen’s very helpful “cpuid” program):
AMD Opteron Family 10h Revision D (“Istanbul”):
- L1 DTLB:
- 4KiB pages: 48 entries;
- 2MiB pages: 48 entries;
- 1GiB pages: 48 entries
- L2 TLB:
- 4KiB pages: 512 entries;
- 2MiB pages: 128 entries;
- 1GiB pages: 16 entries
AMD Opteron Family 15h Model 6220 (“Interlagos”):
- L1 DTLB
- 4KiB, 32 entry, fully associative
- 2MiB, 32 entry, fully associative
- 1GiB, 32 entry, fully associative
- L2 DTLB: (none)
- Unified L2 TLB:
- Data entries: 4KiB/2MiB/4MiB/1GiB, 1024 entries, 8-way associative
- “An entry allocated by one core is not visible to the other core of a compute unit.”
Intel Xeon 56xx (“Westmere”):
- L1 DTLB:
- 4KiB pages: 64 entries;
- 2MiB pages: 32 entries
- L2 TLB:
- 4KiB pages: 512 entries;
- 2MiB pages: none
Intel Xeon E5 26xx (“Sandy Bridge EP”):
- L1 DTLB
- 4KiB, 64 entries
- 2MiB/4MiB, 32 entries
- 1GiB, 4 entries
- STLB (second-level TLB)
- 4KiB, 512 entries
- (There are no entries for 2MiB pages or 1GiB pages in the STLB)
Intel Xeon Phi Coprocessor SE10P (Note 1):
- L1 DTLB
- 4KiB, 64 entries, 4-way associative
- 2MiB, 8 entries, 4-way associative
- L2 TLB
- 4KiB, 64 Page Directory Entries, 4-way associative (Note 2)
- 2MiB, 64 entries, 4-way associative
Most of these cores can map at least 2MiB (512*4KiB) using small pages before suffering level 2 TLB misses, and at least 64MiB (32*2MiB) using large pages. All of these systems should see a performance benefit from large pages when performing random accesses over memory ranges that are much larger than 2MiB and smaller than 64MiB.
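On Linux, one common way to get 2MiB pages is mmap() with MAP_HUGETLB, falling back to madvise(MADV_HUGEPAGE) for transparent huge pages. A minimal sketch (error handling is abbreviated, and MAP_HUGETLB requires hugepages to have been reserved by the administrator):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define REGION (256UL * 1024 * 1024)   /* 256 MiB test region */

int main(void)
{
    /* Try explicitly reserved 2MiB pages first (requires hugepages
       reserved via /proc/sys/vm/nr_hugepages or hugetlbfs). */
    void *p = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        /* Fall back to ordinary pages and ask the kernel to promote
           them to transparent hugepages if it can. */
        p = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        madvise(p, REGION, MADV_HUGEPAGE);
    }
    /* ... use the region ... */
    munmap(p, REGION);
    return 0;
}
```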
What you are trying to avoid in all these cases is the worst-case scenario (Note 3) of traversing all four levels of the x86_64 hierarchical address translation.
If none of the address translation caching mechanisms (Note 4) work, it requires:
- 5 trips to memory to load data mapped on a 4KiB page,
- 4 trips to memory to load data mapped on a 2MiB page, and
- 3 trips to memory to load data mapped on a 1GiB page.
In each case the last trip to memory is to get the requested data, while the other trips are required to obtain the various parts of the page translation information. The best description I have seen is in Section 5.3 of AMD’s “AMD64 Architecture Programmer’s Manual Volume 2: System Programming” (publication 24593). Intel’s documentation is also good once you understand the nomenclature — for 64-bit operation the paging mode is referred to as “IA-32E Paging”, and is described in Section 4.5 of Volume 3 of the “Intel 64 and IA-32 Architectures Software Developer’s Manual” (Intel document 325384 — I use revision 059 from June 2016.)
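To make the walk concrete, this sketch decomposes a 48-bit virtual address into the four 9-bit table indices and the page offset used under IA-32e paging with 4KiB pages (field positions as given in the manuals cited above; the example address is arbitrary):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t va = 0x00007f3a12345678ULL;   /* arbitrary example address */

    /* Each level of the walk uses 9 bits of the virtual address to
       index a 512-entry table; the remaining 12 bits are the offset. */
    unsigned pml4 = (va >> 39) & 0x1ff;    /* level 4: bits 47:39 */
    unsigned pdpt = (va >> 30) & 0x1ff;    /* level 3: bits 38:30 */
    unsigned pd   = (va >> 21) & 0x1ff;    /* level 2: bits 29:21 */
    unsigned pt   = (va >> 12) & 0x1ff;    /* level 1: bits 20:12 (skipped for 2MiB pages) */
    unsigned off  =  va        & 0xfff;    /* page offset: bits 11:0 */

    printf("PML4=%u PDPT=%u PD=%u PT=%u offset=0x%03x\n",
           pml4, pdpt, pd, pt, off);
    return 0;
}
```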
A benchmark designed to test computer performance for random updates to a very large region of memory is the “RandomAccess” benchmark from the HPC Challenge Benchmark suite. Although the HPC Challenge Benchmark configuration is typically used to measure performance when performing updates across the aggregate memory of a cluster, the test can certainly be run on a single node.
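For a single-node experiment, the heart of such a test can be approximated by a loop of pseudo-random read-modify-write updates to a large table. A much-simplified sketch in the spirit of the RandomAccess kernel (the table size and random-number generator here are illustrative, not the benchmark's official ones):

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

#define LOG2_WORDS 24                    /* 2^24 64-bit words = 128 MiB table */
#define TABLE_WORDS (1UL << LOG2_WORDS)

int main(void)
{
    uint64_t *table = calloc(TABLE_WORDS, sizeof(uint64_t));
    if (table == NULL) { perror("calloc"); return 1; }

    /* Pseudo-random updates scattered across the whole table: once the
       table is much larger than the TLB reach, nearly every update
       suffers a TLB miss with 4KiB pages. */
    uint64_t x = 1;
    for (uint64_t i = 0; i < 4 * TABLE_WORDS; i++) {
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;  /* 64-bit LCG step */
        table[x >> (64 - LOG2_WORDS)] ^= x;
    }

    printf("check value: %llu\n", (unsigned long long) table[0]);
    free(table);
    return 0;
}
```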
Note 1:
The first generation Intel Xeon Phi (a.k.a., “Knights Corner” or “KNC”) has several unusual features that combine to make large pages very important for sustained bandwidth as well as random memory latency. The first unusual feature is that the hardware prefetchers in the KNC processor are not very aggressive, so software prefetches are required to obtain the highest levels of sustained bandwidth. The second unusual feature is that, unlike most recent Intel processors, the KNC processor will “drop” software prefetches if the address is not mapped in the Level-1 or Level-2 TLB — i.e., a software prefetch will never trigger the Page Table Walker. The third unusual feature is unusual enough to get a separate discussion in Note 2.
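For reference, software prefetches of the kind discussed here are usually issued via compiler built-ins; a minimal sketch (the prefetch distance is an illustrative value, not one tuned for KNC):

```c
#include <stddef.h>

/* Sum an array while software-prefetching ahead.  On most processors a
   prefetch that misses the TLB triggers the page walker; on KNC a
   prefetch whose address misses both TLB levels is silently dropped,
   so large pages (with their larger TLB reach) keep the stream alive. */
double sum_with_prefetch(const double *a, size_t n)
{
    const size_t dist = 64;   /* prefetch distance in elements (illustrative) */
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&a[i + dist], 0, 3);  /* read, high locality */
        s += a[i];
    }
    return s;
}
```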
Note 2:
Unlike every other recent processor that I know of, the first generation Intel Xeon Phi does not store 4KiB Page Table Entries in the Level-2 TLB. Instead, it stores “Page Directory Entries”, which are the next level “up” in the page translation, responsible for translating virtual address bits 29:21. The benefit is that storing 64 Page Table Entries would only provide the ability to access another 64*4KiB=256KiB of virtual addresses, while storing 64 Page Directory Entries eliminates one memory lookup of the Page Table Walk over an address range of 64*2MiB=128MiB. In this case, a miss to the Level-1 DTLB for an address mapped to 4KiB pages will cause a Page Table Walk, but there is an extremely high chance that the Page Directory Entry will be in the Level-2 TLB. Combined with the caching of the first two levels of the hierarchical address translation (see Note 4) and a high probability of finding the Page Table Entry in the L1 or L2 caches, this approach trades a small increase in latency for a large increase in the address range that can be covered with 4KiB pages.
Note 3:
The values above are not really the worst case. Running under a virtual machine makes these numbers worse. Running in an environment that causes the memory holding the various levels of the page tables to get swapped to disk makes performance much worse.
Note 4:
Unfortunately, even knowing this level of detail is not enough, because all modern processors have additional caches for the upper levels of the page translation hierarchy. As far as I can tell these are very poorly documented in public.