8956 Implement KPTI
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
9208 hati_demap_func should take pagesize into account
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Tim Kordas <tim.kordas@joyent.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
@@ -25,10 +25,11 @@
* Copyright (c) 2010, Intel Corporation.
* All rights reserved.
*/
/*
* Copyright 2011 Nexenta Systems, Inc. All rights reserved.
+ * Copyright 2018 Joyent, Inc. All rights reserved.
* Copyright (c) 2014, 2015 by Delphix. All rights reserved.
*/
/*
* VM - Hardware Address Translation management for i386 and amd64
@@ -40,10 +41,195 @@
* that work in conjunction with this code.
*
* Routines used only inside of i86pc/vm start with hati_ for HAT Internal.
*/
+/*
+ * amd64 HAT Design
+ *
+ * ----------
+ * Background
+ * ----------
+ *
+ * On x86, the address space is shared between a user process and the kernel.
+ * This is different from SPARC. Conventionally, the kernel lives at the top of
+ * the address space and the user process gets to enjoy the rest of it. If you
+ * look at the image of the address map in uts/i86pc/os/startup.c, you'll get a
+ * rough sense of how the address space is laid out and used.
+ *
+ * Every unique address space is represented by an instance of a HAT structure
+ * called a 'hat_t'. In addition to a hat_t structure for each process, there is
+ * also one that is used for the kernel (kas.a_hat), and each CPU ultimately
+ * also has a HAT.
+ *
+ * Each HAT contains a pointer to its root page table. This root page table is
+ * what we call an L3 page table in illumos and Intel calls the PML4. It is the
+ * physical address of the L3 table that we place in the %cr3 register which the
+ * processor uses.
+ *
+ * Each of the many layers of the page table is represented by a structure
+ * called an htable_t. The htable_t manages a set of 512 8-byte entries. The
+ * number of entries in a given page table is constant across all different
+ * level page tables. Note, this is only true on amd64. This has not always been
+ * the case on x86.
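+ *
+ * As an illustrative sketch (htable_va2entry() is the real interface), the
+ * index of the entry that maps a given virtual address at a given level is
+ * roughly:
+ *
+ *     entry = (va >> mmu.level_shift[level]) & (mmu.ptes_per_table - 1);
+ *
+ * where mmu.level_shift[] is 12, 21, 30 and 39 for levels 0 through 3 on
+ * amd64 and mmu.ptes_per_table is 512 (see mmu_init() below).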
+ *
+ * Each entry in a page table, generally referred to as a PTE, may refer to
+ * another page table or a memory location, depending on the level of the page
+ * table and the use of large pages. Importantly, the top-level L3 page table
+ * (PML4) only supports linking to further page tables. This is also true on
+ * systems which support a 5th level page table (which we do not currently
+ * support).
+ *
+ * Historically, on x86, when a process was running on CPU, the root of the page
+ * table was inserted into %cr3 on each CPU on which it was currently running.
+ * When processes would switch (by calling hat_switch()), then the value in %cr3
+ * on that CPU would change to that of the new HAT. While this behavior is still
+ * maintained in the xpv kernel, this is not what is done today.
+ *
+ * -------------------
+ * Per-CPU Page Tables
+ * -------------------
+ *
+ * Throughout the system the 64-bit kernel has a notion of what it calls a
+ * per-CPU page table or PCP. The notion of a per-CPU page table was originally
+ * introduced as part of the original work to support x86 PAE. On the 64-bit
+ * kernel, it was originally used for 32-bit processes running on the 64-bit
+ * kernel. The rationale behind this was that each 32-bit process could have all
+ * of its memory represented in a single L2 page table as each L2 page table
+ * entry represents 1 GiB of memory.
+ *
+ * Following on from this, the idea was that given that all of the L3 page table
+ * entries for 32-bit processes are basically going to be identical with the
+ * exception of the first entry in the page table, why not share those page
+ * table entries. This gave rise to the idea of a per-CPU page table.
+ *
+ * The way this works is that we have a member in the machcpu_t called the
+ * mcpu_hat_info. That structure contains two different 4k pages: one that
+ * represents the L3 page table and one that represents an L2 page table. When
+ * the CPU starts up, the L3 page table entries are copied in from the kernel's
+ * page table. The L3 kernel entries do not change throughout the lifetime of
+ * the kernel. The kernel portion of these L3 pages is identical on every CPU;
+ * the entries point to the same L2 page tables, so all CPUs see a consistent
+ * view of the kernel's address space.
+ *
+ * When a 32-bit process is loaded into this world, we copy the 32-bit process's
+ * four top-level page table entries into the CPU's L2 page table and then set
+ * the CPU's first L3 page table entry to point to the CPU's L2 page.
+ * Specifically, in hat_pcp_update(), we're copying from the process's
+ * HAT_COPIED_32 HAT into the page tables specific to this CPU.
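+ *
+ * Since each of those L2 entries maps 1 GiB, the four copied entries cover
+ * the entire 4 GiB 32-bit address space. A simplified sketch of what
+ * hat_pcp_update() (below) does for such a HAT_COPIED_32 process:
+ *
+ *     bcopy(hat->hat_copied_ptes, hci->hci_pcp_l2ptes,
+ *         4 * sizeof (x86pte_t));
+ *     hci->hci_pcp_l3ptes[0] = MAKEPTP(hci->hci_pcp_l2pfn, 2);
+ *
+ * The real function also zeroes the remaining user L3 slots and applies the
+ * same updates to the user (KPTI) copy of the L3 table.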
+ *
+ * As part of the implementation of kernel page table isolation, this was also
+ * extended to 64-bit processes. When a 64-bit process runs, we'll copy its L3
+ * PTEs across into the current CPU's L3 page table. (As we can't do the
+ * first-L3-entry trick for 64-bit processes, ->hci_pcp_l2ptes is unused in this
+ * case.)
+ *
+ * The use of per-CPU page tables has a lot of implementation ramifications. A
+ * HAT that runs a user process will be flagged with the HAT_COPIED flag to
+ * indicate that it is using the per-CPU page table functionality. In tandem
+ * with the HAT, the top-level htable_t will be flagged with the HTABLE_COPIED
+ * flag. If the HAT represents a 32-bit process, then we will also set the
+ * HAT_COPIED_32 flag on that hat_t.
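+ *
+ * Condensed, this is the decision that hat_alloc() makes on the 64-bit
+ * non-xpv kernel (the authoritative version is in hat_alloc() below):
+ *
+ *     if (ttoproc(curthread)->p_model == DATAMODEL_ILP32)
+ *         hat->hat_flags |= HAT_COPIED | HAT_COPIED_32;
+ *     else if (kpti_enable == 1)
+ *         hat->hat_flags |= HAT_COPIED;
+ *
+ * Otherwise the HAT gets a normal, non-copied top level page table.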
+ *
+ * These two flags work together. The top-level htable_t when using per-CPU page
+ * tables is 'virtual'. We never allocate a ptable for this htable_t (i.e.
+ * ht->ht_pfn is PFN_INVALID). Instead, when we need to modify a PTE in an
+ * HTABLE_COPIED ptable, x86pte_access_pagetable() will redirect any accesses to
+ * ht_hat->hat_copied_ptes.
+ *
+ * Of course, such a modification won't actually modify the HAT_PCP page tables
+ * that were copied from the HAT_COPIED htable. When we change the top level
+ * page table entries (L2 PTEs for a 32-bit process and L3 PTEs for a 64-bit
+ * process), we need to make sure to trigger hat_pcp_update() on all CPUs that
+ * are currently tied to this HAT (including the current CPU).
+ *
+ * To do this, PCP piggy-backs on TLB invalidation, specifically via the
+ * hat_tlb_inval() path from link_ptp() and unlink_ptp().
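+ *
+ * Condensed from hati_demap_func() below, the cross call handler that
+ * flushes a CPU's user mappings refreshes the per-CPU copies first:
+ *
+ *     if (hat->hat_flags & HAT_COPIED)
+ *         hat_pcp_update(CPU, hat);
+ *     mmu_flush_tlb(FLUSH_TLB_NONGLOBAL, NULL);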
+ *
+ * (Importantly, in all such cases, when this is in operation, the top-level
+ * entry should not be able to refer to an actual page table entry that can be
+ * changed and consolidated into a large page. If large page consolidation is
+ * required here, then there will be much that needs to be reconsidered.)
+ *
+ * -----------------------------------------------
+ * Kernel Page Table Isolation and the Per-CPU HAT
+ * -----------------------------------------------
+ *
+ * All Intel CPUs that support speculative execution and paging are subject to a
+ * series of bugs that have been termed 'Meltdown'. These exploits allow a user
+ * process to read kernel memory through cache side channels and speculative
+ * execution. To mitigate this on vulnerable CPUs, we need to use a technique
+ * called kernel page table isolation. What this requires is that we have two
+ * different page table roots. When executing in kernel mode, we will use a %cr3
+ * value that has both the user and kernel pages. However, when executing in
+ * user mode, we will need a %cr3 value that has all of the user pages but
+ * only the subset of kernel pages required to operate.
+ *
+ * These kernel pages that we need mapped are:
+ *
+ * o Kernel Text that allows us to switch between the cr3 values.
+ * o The current global descriptor table (GDT)
+ * o The current interrupt descriptor table (IDT)
+ * o The current task switching state (TSS)
+ * o The current local descriptor table (LDT)
+ * o Stacks and scratch space used by the interrupt handlers
+ *
+ * For more information on the stack switching techniques, construction of the
+ * trampolines, and more, please see i86pc/ml/kpti_trampolines.s. The two most
+ * important constraints on these mappings are:
+ *
+ * o The mappings are all per-CPU (except for read-only text)
+ * o The mappings are static. They are all established before the CPU is
+ * started (with the exception of the boot CPU).
+ *
+ * To facilitate the kernel page table isolation we employ our per-CPU
+ * page tables discussed in the previous section and add the notion of a per-CPU
+ * HAT. Fundamentally we have a second page table root. There is both a kernel
+ * page table (hci_pcp_l3ptes), and a user L3 page table (hci_user_l3ptes).
+ * Both will have the user page table entries copied into them, the same way
+ * that we discussed in the section 'Per-CPU Page Tables'.
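+ *
+ * In terms of %cr3 values, when both KPTI and PCID are in use, hat_switch()
+ * (below) constructs roughly the following pair for a user process:
+ *
+ *     kcr3 = MAKECR3(hci->hci_pcp_l3pfn, PCID_KERNEL) | CR3_NOINVL_BIT;
+ *     ucr3 = MAKECR3(hci->hci_user_l3pfn, PCID_USER) | CR3_NOINVL_BIT;
+ *
+ * kcr3 is used while we remain in the kernel; ucr3 is stashed in the CPU's
+ * kpti_frame structures so that the trampolines can load it on the way back
+ * out to userspace.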
+ *
+ * The complex part of this is how we construct the set of kernel mappings
+ * that should be present when running with the user page table. To answer that,
+ * we add the notion of a per-CPU HAT. This HAT functions like a normal HAT,
+ * except that it's not really associated with an address space the same way
+ * that other HATs are.
+ *
+ * This HAT lives in the hci_user_hat member of the 'struct hat_cpu_info',
+ * which hangs off of the machcpu. We use this per-CPU HAT to create the set
+ * of kernel mappings that should be present on this CPU. The kernel mappings
+ * are added to the per-CPU HAT through the function hati_cpu_punchin(). Once a
+ * mapping has been punched in, it may not be punched out. The reason that we
+ * opt to leverage a HAT structure is that it knows how to allocate and manage
+ * all of the lower level page tables as required.
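+ *
+ * For example, hat_pcp_setup() below punches the CPU's GDT and IDT into
+ * this HAT with read-only protections:
+ *
+ *     hati_cpu_punchin(cpu, (uintptr_t)cpu->cpu_gdt, PROT_READ);
+ *     hati_cpu_punchin(cpu, (uintptr_t)cpu->cpu_idt, PROT_READ);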
+ *
+ * Because all of the mappings are present at the beginning of time for this CPU
+ * and none of the mappings are in the kernel pageable segment, we don't have to
+ * worry about faulting on these HAT structures and thus the notion of the
+ * current HAT that we're using is always the appropriate HAT for the process
+ * (usually a user HAT or the kernel's HAT).
+ *
+ * A further constraint we place on the system with these per-CPU HATs is that
+ * they are not subject to htable_steal(). Because each CPU will have a rather
+ * fixed number of page tables, the same way that we don't steal from the
+ * kernel's HAT, it was determined that we should not steal from this HAT due to
+ * the complications involved and somewhat criminal nature of htable_steal().
+ *
+ * The per-CPU HAT is initialized in hat_pcp_setup() which is called as part of
+ * onlining the CPU, but before the CPU is actually started. The per-CPU HAT is
+ * removed in hat_pcp_teardown() which is called when a CPU is being offlined to
+ * be removed from the system (which is different from what psradm usually
+ * does).
+ *
+ * Finally, once the CPU has been onlined, the set of mappings in the per-CPU
+ * HAT must not change. The HAT related functions that we call are not meant to
+ * be called when we're switching between processes. For example, it is quite
+ * possible that if they were, they would try to grab an htable mutex which
+ * another thread might hold. hat_switch() must be treated as though it runs
+ * above LOCK_LEVEL and therefore _must not_ block under any circumstance.
+ */
+
#include <sys/machparam.h>
#include <sys/machsystm.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/systm.h>
@@ -93,31 +279,34 @@
/*
* The page that is the kernel's top level pagetable.
*
* For 32 bit PAE support on i86pc, the kernel hat will use the 1st 4 entries
* on this 4K page for its top level page table. The remaining groups of
- * 4 entries are used for per processor copies of user VLP pagetables for
+ * 4 entries are used for per processor copies of user PCP pagetables for
* running threads. See hat_switch() and reload_pae32() for details.
*
- * vlp_page[0..3] - level==2 PTEs for kernel HAT
- * vlp_page[4..7] - level==2 PTEs for user thread on cpu 0
- * vlp_page[8..11] - level==2 PTE for user thread on cpu 1
+ * pcp_page[0..3] - level==2 PTEs for kernel HAT
+ * pcp_page[4..7] - level==2 PTEs for user thread on cpu 0
+ * pcp_page[8..11] - level==2 PTEs for user thread on cpu 1
* etc...
+ *
+ * On the 64-bit kernel, this is the normal root of the page table and there is
+ * nothing special about it when used for other CPUs.
*/
-static x86pte_t *vlp_page;
+static x86pte_t *pcp_page;
/*
* forward declaration of internal utility routines
*/
static x86pte_t hati_update_pte(htable_t *ht, uint_t entry, x86pte_t expected,
x86pte_t new);
/*
- * The kernel address space exists in all HATs. To implement this the
- * kernel reserves a fixed number of entries in the topmost level(s) of page
- * tables. The values are setup during startup and then copied to every user
- * hat created by hat_alloc(). This means that kernelbase must be:
+ * The kernel address space exists in all non-HAT_COPIED HATs. To implement this
+ * the kernel reserves a fixed number of entries in the topmost level(s) of page
+ * tables. The values are setup during startup and then copied to every user hat
+ * created by hat_alloc(). This means that kernelbase must be:
*
* 4Meg aligned for 32 bit kernels
* 512Gig aligned for x86_64 64 bit kernel
*
* The hat_kernel_range_ts describe what needs to be copied from kernel hat
@@ -168,11 +357,11 @@
*/
kmutex_t hat_list_lock;
kcondvar_t hat_list_cv;
kmem_cache_t *hat_cache;
kmem_cache_t *hat_hash_cache;
-kmem_cache_t *vlp_hash_cache;
+kmem_cache_t *hat32_hash_cache;
/*
* Simple statistics
*/
struct hatstats hatstat;
@@ -186,16 +375,10 @@
* HAT uses cmpxchg() and the other paths (hypercall etc.) were never
* incorrect.
*/
int pt_kern;
-/*
- * useful stuff for atomic access/clearing/setting REF/MOD/RO bits in page_t's.
- */
-extern void atomic_orb(uchar_t *addr, uchar_t val);
-extern void atomic_andb(uchar_t *addr, uchar_t val);
-
#ifndef __xpv
extern pfn_t memseg_get_start(struct memseg *);
#endif
#define PP_GETRM(pp, rmmask) (pp->p_nrm & rmmask)
@@ -234,26 +417,53 @@
hat->hat_ht_hash = NULL;
return (0);
}
/*
+ * Put it at the start of the global list of all hats (used by stealing)
+ *
+ * kas.a_hat is not in the list but is instead used to find the
+ * first and last items in the list.
+ *
+ * - kas.a_hat->hat_next points to the start of the user hats.
+ * The list ends where hat->hat_next == NULL
+ *
+ * - kas.a_hat->hat_prev points to the last of the user hats.
+ * The list begins where hat->hat_prev == NULL
+ */
+static void
+hat_list_append(hat_t *hat)
+{
+ mutex_enter(&hat_list_lock);
+ hat->hat_prev = NULL;
+ hat->hat_next = kas.a_hat->hat_next;
+ if (hat->hat_next)
+ hat->hat_next->hat_prev = hat;
+ else
+ kas.a_hat->hat_prev = hat;
+ kas.a_hat->hat_next = hat;
+ mutex_exit(&hat_list_lock);
+}
+
+/*
* Allocate a hat structure for as. We also create the top level
* htable and initialize it to contain the kernel hat entries.
*/
hat_t *
hat_alloc(struct as *as)
{
hat_t *hat;
htable_t *ht; /* top level htable */
- uint_t use_vlp;
+ uint_t use_copied;
uint_t r;
hat_kernel_range_t *rp;
uintptr_t va;
uintptr_t eva;
uint_t start;
uint_t cnt;
htable_t *src;
+ boolean_t use_hat32_cache;
/*
* Once we start creating user process HATs we can enable
* the htable_steal() code.
*/
@@ -266,34 +476,63 @@
mutex_init(&hat->hat_mutex, NULL, MUTEX_DEFAULT, NULL);
ASSERT(hat->hat_flags == 0);
#if defined(__xpv)
/*
- * No VLP stuff on the hypervisor due to the 64-bit split top level
+ * No PCP stuff on the hypervisor due to the 64-bit split top level
* page tables. On 32-bit it's not needed as the hypervisor takes
* care of copying the top level PTEs to a below 4Gig page.
*/
- use_vlp = 0;
+ use_copied = 0;
+ use_hat32_cache = B_FALSE;
+ hat->hat_max_level = mmu.max_level;
+ hat->hat_num_copied = 0;
+ hat->hat_flags = 0;
#else /* __xpv */
- /* 32 bit processes uses a VLP style hat when running with PAE */
-#if defined(__amd64)
- use_vlp = (ttoproc(curthread)->p_model == DATAMODEL_ILP32);
-#elif defined(__i386)
- use_vlp = mmu.pae_hat;
-#endif
+
+ /*
+ * All processes use HAT_COPIED on the 64-bit kernel if KPTI is
+ * turned on.
+ */
+ if (ttoproc(curthread)->p_model == DATAMODEL_ILP32) {
+ use_copied = 1;
+ hat->hat_max_level = mmu.max_level32;
+ hat->hat_num_copied = mmu.num_copied_ents32;
+ use_hat32_cache = B_TRUE;
+ hat->hat_flags |= HAT_COPIED_32;
+ HATSTAT_INC(hs_hat_copied32);
+ } else if (kpti_enable == 1) {
+ use_copied = 1;
+ hat->hat_max_level = mmu.max_level;
+ hat->hat_num_copied = mmu.num_copied_ents;
+ use_hat32_cache = B_FALSE;
+ HATSTAT_INC(hs_hat_copied64);
+ } else {
+ use_copied = 0;
+ use_hat32_cache = B_FALSE;
+ hat->hat_max_level = mmu.max_level;
+ hat->hat_num_copied = 0;
+ hat->hat_flags = 0;
+ HATSTAT_INC(hs_hat_normal64);
+ }
#endif /* __xpv */
- if (use_vlp) {
- hat->hat_flags = HAT_VLP;
- bzero(hat->hat_vlp_ptes, VLP_SIZE);
+ if (use_copied) {
+ hat->hat_flags |= HAT_COPIED;
+ bzero(hat->hat_copied_ptes, sizeof (hat->hat_copied_ptes));
}
/*
- * Allocate the htable hash
+ * Allocate the htable hash. For 32-bit PCP processes we use the
+ * hat32_hash_cache. However, for 64-bit PCP processes we do not as the
+ * number of entries that they have to handle is closer to
+ * hat_hash_cache in count (though there will be more wastage when we
+ * have more DRAM in the system and thus push down the user address
+ * range).
*/
- if ((hat->hat_flags & HAT_VLP)) {
- hat->hat_num_hash = mmu.vlp_hash_cnt;
- hat->hat_ht_hash = kmem_cache_alloc(vlp_hash_cache, KM_SLEEP);
+ if (use_hat32_cache) {
+ hat->hat_num_hash = mmu.hat32_hash_cnt;
+ hat->hat_ht_hash = kmem_cache_alloc(hat32_hash_cache, KM_SLEEP);
} else {
hat->hat_num_hash = mmu.hash_cnt;
hat->hat_ht_hash = kmem_cache_alloc(hat_hash_cache, KM_SLEEP);
}
bzero(hat->hat_ht_hash, hat->hat_num_hash * sizeof (htable_t *));
@@ -307,11 +546,11 @@
XPV_DISALLOW_MIGRATE();
ht = htable_create(hat, (uintptr_t)0, TOP_LEVEL(hat), NULL);
hat->hat_htable = ht;
#if defined(__amd64)
- if (hat->hat_flags & HAT_VLP)
+ if (hat->hat_flags & HAT_COPIED)
goto init_done;
#endif
for (r = 0; r < num_kernel_ranges; ++r) {
rp = &kernel_ranges[r];
@@ -332,13 +571,13 @@
(eva > rp->hkr_end_va || eva == 0))
cnt = htable_va2entry(rp->hkr_end_va, ht) -
start;
#if defined(__i386) && !defined(__xpv)
- if (ht->ht_flags & HTABLE_VLP) {
- bcopy(&vlp_page[start],
- &hat->hat_vlp_ptes[start],
+ if (ht->ht_flags & HTABLE_COPIED) {
+ bcopy(&pcp_page[start],
+ &hat->hat_copied_ptes[start],
cnt * sizeof (x86pte_t));
continue;
}
#endif
src = htable_lookup(kas.a_hat, va, rp->hkr_level);
@@ -359,34 +598,58 @@
xen_pin(hat->hat_user_ptable, mmu.max_level);
#endif
#endif
XPV_ALLOW_MIGRATE();
+ hat_list_append(hat);
+
+ return (hat);
+}
+
+#if !defined(__xpv)
+/*
+ * Cons up a HAT for a CPU. This represents the user mappings. This will have
+ * various kernel pages punched into it manually. Importantly, this hat is
+ * ineligible for stealing. We really don't want to deal with this ever
+ * faulting and figuring out that this is happening, much like we don't with
+ * kas.
+ */
+static hat_t *
+hat_cpu_alloc(cpu_t *cpu)
+{
+ hat_t *hat;
+ htable_t *ht;
+
+ hat = kmem_cache_alloc(hat_cache, KM_SLEEP);
+ hat->hat_as = NULL;
+ mutex_init(&hat->hat_mutex, NULL, MUTEX_DEFAULT, NULL);
+ hat->hat_max_level = mmu.max_level;
+ hat->hat_num_copied = 0;
+ hat->hat_flags = HAT_PCP;
+
+ hat->hat_num_hash = mmu.hash_cnt;
+ hat->hat_ht_hash = kmem_cache_alloc(hat_hash_cache, KM_SLEEP);
+ bzero(hat->hat_ht_hash, hat->hat_num_hash * sizeof (htable_t *));
+
+ hat->hat_next = hat->hat_prev = NULL;
+
/*
- * Put it at the start of the global list of all hats (used by stealing)
- *
- * kas.a_hat is not in the list but is instead used to find the
- * first and last items in the list.
- *
- * - kas.a_hat->hat_next points to the start of the user hats.
- * The list ends where hat->hat_next == NULL
- *
- * - kas.a_hat->hat_prev points to the last of the user hats.
- * The list begins where hat->hat_prev == NULL
+ * Because this HAT will only ever be used by the current CPU, we'll go
+ * ahead and set the CPUSET up to only point to the CPU in question.
*/
- mutex_enter(&hat_list_lock);
- hat->hat_prev = NULL;
- hat->hat_next = kas.a_hat->hat_next;
- if (hat->hat_next)
- hat->hat_next->hat_prev = hat;
- else
- kas.a_hat->hat_prev = hat;
- kas.a_hat->hat_next = hat;
- mutex_exit(&hat_list_lock);
+ CPUSET_ADD(hat->hat_cpus, cpu->cpu_id);
+ hat->hat_htable = NULL;
+ hat->hat_ht_cached = NULL;
+ ht = htable_create(hat, (uintptr_t)0, TOP_LEVEL(hat), NULL);
+ hat->hat_htable = ht;
+
+ hat_list_append(hat);
+
return (hat);
}
+#endif /* !__xpv */
/*
* process has finished executing but as has not been cleaned up yet.
*/
/*ARGSUSED*/
@@ -439,10 +702,11 @@
#if defined(__xpv)
/*
* On the hypervisor, unpin top level page table(s)
*/
+ VERIFY3U(hat->hat_flags & HAT_PCP, ==, 0);
xen_unpin(hat->hat_htable->ht_pfn);
#if defined(__amd64)
xen_unpin(hat->hat_user_ptable);
#endif
#endif
@@ -453,18 +717,29 @@
htable_purge_hat(hat);
/*
* Decide which kmem cache the hash table came from, then free it.
*/
- if (hat->hat_flags & HAT_VLP)
- cache = vlp_hash_cache;
- else
+ if (hat->hat_flags & HAT_COPIED) {
+#if defined(__amd64)
+ if (hat->hat_flags & HAT_COPIED_32) {
+ cache = hat32_hash_cache;
+ } else {
cache = hat_hash_cache;
+ }
+#else
+ cache = hat32_hash_cache;
+#endif
+ } else {
+ cache = hat_hash_cache;
+ }
kmem_cache_free(cache, hat->hat_ht_hash);
hat->hat_ht_hash = NULL;
hat->hat_flags = 0;
+ hat->hat_max_level = 0;
+ hat->hat_num_copied = 0;
kmem_cache_free(hat_cache, hat);
}
/*
* round kernelbase down to a supported value to use for _userlimit
@@ -515,10 +790,46 @@
else
mmu.umax_page_level = lvl;
}
/*
+ * Determine the number of slots that are in use in the top-most level page
+ * table for user memory. This is based on _userlimit. In effect this is similar
+ * to htable_va2entry, but without the convenience of having an htable.
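+ *
+ * For example (purely illustrative; the real value of _userlimit is platform
+ * dependent): with mmu.level_shift[mmu.max_level] == 39 and
+ * mmu.top_level_count == 512, a _userlimit of 0x7FFFC0000000 gives
+ *
+ *     ent = (0x7FFFC0000000 >> 39) & 511    -> 255
+ *     mmu.top_level_uslots = ent + 1        -> 256
+ *
+ * i.e. the low 256 top-level slots (each covering 512 GiB) span the user
+ * address range.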
+ */
+void
+mmu_calc_user_slots(void)
+{
+ uint_t ent, nptes;
+ uintptr_t shift;
+
+ nptes = mmu.top_level_count;
+ shift = _userlimit >> mmu.level_shift[mmu.max_level];
+ ent = shift & (nptes - 1);
+
+ /*
+ * Ent tells us the slot that the page for _userlimit would fit in. We
+ * need to add one to this to cover the total number of entries.
+ */
+ mmu.top_level_uslots = ent + 1;
+
+ /*
+ * When running 32-bit compatibility processes on a 64-bit kernel, we
+ * will only need to use one slot.
+ */
+ mmu.top_level_uslots32 = 1;
+
+ /*
+ * Record the number of PCP page table entries that we'll need to copy
+ * around. For 64-bit processes this is the number of user slots. For
+ * 32-bit processes, this is four 1 GiB pages.
+ */
+ mmu.num_copied_ents = mmu.top_level_uslots;
+ mmu.num_copied_ents32 = 4;
+}
+
+/*
* Initialize hat data structures based on processor MMU information.
*/
void
mmu_init(void)
{
@@ -533,11 +844,22 @@
*/
if (is_x86_feature(x86_featureset, X86FSET_PGE) &&
(getcr4() & CR4_PGE) != 0)
mmu.pt_global = PT_GLOBAL;
+#if !defined(__xpv)
/*
+ * The 64-bit x86 kernel has split user/kernel page tables. As such we
+ * cannot have the global bit set. The simplest way for us to deal with
+ * this is to just say that pt_global is zero, so the global bit isn't
+ * present.
+ */
+ if (kpti_enable == 1)
+ mmu.pt_global = 0;
+#endif
+
+ /*
* Detect NX and PAE usage.
*/
mmu.pae_hat = kbm_pae_support;
if (kbm_nx_support)
mmu.pt_nx = PT_NX;
@@ -591,10 +913,15 @@
mmu.num_level = 4;
mmu.max_level = 3;
mmu.ptes_per_table = 512;
mmu.top_level_count = 512;
+ /*
+ * 32-bit processes only use 1 GB ptes.
+ */
+ mmu.max_level32 = 2;
+
mmu.level_shift[0] = 12;
mmu.level_shift[1] = 21;
mmu.level_shift[2] = 30;
mmu.level_shift[3] = 39;
@@ -627,10 +954,11 @@
mmu.level_offset[i] = mmu.level_size[i] - 1;
mmu.level_mask[i] = ~mmu.level_offset[i];
}
set_max_page_level();
+ mmu_calc_user_slots();
mmu_page_sizes = mmu.max_page_level + 1;
mmu_exported_page_sizes = mmu.umax_page_level + 1;
/* restrict legacy applications from using pagesizes 1g and above */
@@ -662,11 +990,11 @@
*/
max_htables = physmax / mmu.ptes_per_table;
mmu.hash_cnt = MMU_PAGESIZE / sizeof (htable_t *);
while (mmu.hash_cnt > 16 && mmu.hash_cnt >= max_htables)
mmu.hash_cnt >>= 1;
- mmu.vlp_hash_cnt = mmu.hash_cnt;
+ mmu.hat32_hash_cnt = mmu.hash_cnt;
#if defined(__amd64)
/*
* If running in 64 bits and physical memory is large,
* increase the size of the cache to cover all of memory for
@@ -711,18 +1039,19 @@
hat_hash_cache = kmem_cache_create("HatHash",
mmu.hash_cnt * sizeof (htable_t *), 0, NULL, NULL, NULL,
NULL, 0, 0);
/*
- * VLP hats can use a smaller hash table size on large memroy machines
+ * 32-bit PCP hats can use a smaller hash table size on large memory
+ * machines
*/
- if (mmu.hash_cnt == mmu.vlp_hash_cnt) {
- vlp_hash_cache = hat_hash_cache;
+ if (mmu.hash_cnt == mmu.hat32_hash_cnt) {
+ hat32_hash_cache = hat_hash_cache;
} else {
- vlp_hash_cache = kmem_cache_create("HatVlpHash",
- mmu.vlp_hash_cnt * sizeof (htable_t *), 0, NULL, NULL, NULL,
- NULL, 0, 0);
+ hat32_hash_cache = kmem_cache_create("Hat32Hash",
+ mmu.hat32_hash_cnt * sizeof (htable_t *), 0, NULL, NULL,
+ NULL, NULL, 0, 0);
}
/*
* Set up the kernel's hat
*/
@@ -735,10 +1064,17 @@
CPUSET_ZERO(khat_cpuset);
CPUSET_ADD(khat_cpuset, CPU->cpu_id);
/*
+ * The kernel HAT doesn't use PCP, regardless of architecture.
+ */
+ ASSERT3U(mmu.max_level, >, 0);
+ kas.a_hat->hat_max_level = mmu.max_level;
+ kas.a_hat->hat_num_copied = 0;
+
+ /*
* The kernel hat's next pointer serves as the head of the hat list .
* The kernel hat's prev pointer tracks the last hat on the list for
* htable_steal() to use.
*/
kas.a_hat->hat_next = NULL;
@@ -766,61 +1102,169 @@
*/
hrm_hashtab = kmem_zalloc(HRM_HASHSIZE * sizeof (struct hrmstat *),
KM_SLEEP);
}
+
+extern void kpti_tramp_start();
+extern void kpti_tramp_end();
+
+extern void kdi_isr_start();
+extern void kdi_isr_end();
+
+extern gate_desc_t kdi_idt[NIDT];
+
/*
- * Prepare CPU specific pagetables for VLP processes on 64 bit kernels.
+ * Prepare per-CPU pagetables for all processes on the 64 bit kernel.
*
* Each CPU has a set of 2 pagetables that are reused for any 32 bit
- * process it runs. They are the top level pagetable, hci_vlp_l3ptes, and
- * the next to top level table for the bottom 512 Gig, hci_vlp_l2ptes.
+ * process it runs. They are the top level pagetable, hci_pcp_l3ptes, and
+ * the next to top level table for the bottom 512 Gig, hci_pcp_l2ptes.
*/
/*ARGSUSED*/
static void
-hat_vlp_setup(struct cpu *cpu)
+hat_pcp_setup(struct cpu *cpu)
{
-#if defined(__amd64) && !defined(__xpv)
+#if !defined(__xpv)
struct hat_cpu_info *hci = cpu->cpu_hat_info;
- pfn_t pfn;
+ uintptr_t va;
+ size_t len;
/*
* allocate the level==2 page table for the bottom most
* 512Gig of address space (this is where 32 bit apps live)
*/
ASSERT(hci != NULL);
- hci->hci_vlp_l2ptes = kmem_zalloc(MMU_PAGESIZE, KM_SLEEP);
+ hci->hci_pcp_l2ptes = kmem_zalloc(MMU_PAGESIZE, KM_SLEEP);
/*
* Allocate a top level pagetable and copy the kernel's
- * entries into it. Then link in hci_vlp_l2ptes in the 1st entry.
+ * entries into it. Then link in hci_pcp_l2ptes in the 1st entry.
*/
- hci->hci_vlp_l3ptes = kmem_zalloc(MMU_PAGESIZE, KM_SLEEP);
- hci->hci_vlp_pfn =
- hat_getpfnum(kas.a_hat, (caddr_t)hci->hci_vlp_l3ptes);
- ASSERT(hci->hci_vlp_pfn != PFN_INVALID);
- bcopy(vlp_page, hci->hci_vlp_l3ptes, MMU_PAGESIZE);
+ hci->hci_pcp_l3ptes = kmem_zalloc(MMU_PAGESIZE, KM_SLEEP);
+ hci->hci_pcp_l3pfn =
+ hat_getpfnum(kas.a_hat, (caddr_t)hci->hci_pcp_l3ptes);
+ ASSERT3U(hci->hci_pcp_l3pfn, !=, PFN_INVALID);
+ bcopy(pcp_page, hci->hci_pcp_l3ptes, MMU_PAGESIZE);
- pfn = hat_getpfnum(kas.a_hat, (caddr_t)hci->hci_vlp_l2ptes);
- ASSERT(pfn != PFN_INVALID);
- hci->hci_vlp_l3ptes[0] = MAKEPTP(pfn, 2);
-#endif /* __amd64 && !__xpv */
+ hci->hci_pcp_l2pfn =
+ hat_getpfnum(kas.a_hat, (caddr_t)hci->hci_pcp_l2ptes);
+ ASSERT3U(hci->hci_pcp_l2pfn, !=, PFN_INVALID);
+
+ /*
+ * Now go through and allocate the user version of these structures.
+ * Unlike with the kernel version, we allocate a hat to represent the
+ * top-level page table as that will make it much simpler when we need
+ * to patch through user entries.
+ */
+ hci->hci_user_hat = hat_cpu_alloc(cpu);
+ hci->hci_user_l3pfn = hci->hci_user_hat->hat_htable->ht_pfn;
+ ASSERT3U(hci->hci_user_l3pfn, !=, PFN_INVALID);
+ hci->hci_user_l3ptes =
+ (x86pte_t *)hat_kpm_mapin_pfn(hci->hci_user_l3pfn);
+
+ /* Skip the rest of this if KPTI is switched off at boot. */
+ if (kpti_enable != 1)
+ return;
+
+ /*
+ * OK, now that we have this we need to go through and punch the normal
+ * holes in the CPU's hat for this. At this point we'll punch in the
+ * following:
+ *
+ * o GDT
+ * o IDT
+ * o LDT
+ * o Trampoline Code
+ * o machcpu KPTI page
+ * o kmdb ISR code page (just trampolines)
+ *
+ * If this is cpu0, then we also can initialize the following because
+ * they'll have already been allocated.
+ *
+ * o TSS for CPU 0
+ * o Double Fault for CPU 0
+ *
+ * The following items have yet to be allocated and have not been
+ * punched in yet. They will be punched in later:
+ *
+ * o TSS (mach_cpucontext_alloc_tables())
+ * o Double Fault Stack (mach_cpucontext_alloc_tables())
+ */
+ hati_cpu_punchin(cpu, (uintptr_t)cpu->cpu_gdt, PROT_READ);
+ hati_cpu_punchin(cpu, (uintptr_t)cpu->cpu_idt, PROT_READ);
+
+ /*
+ * As the KDI IDT is only active during kmdb sessions (including single
+ * stepping), typically we don't actually need this punched in (we
+ * consider the routines that switch to the user cr3 to be toxic). But
+ * if we ever accidentally end up on the user cr3 while on this IDT,
+ * we'd prefer not to triple fault.
+ */
+ hati_cpu_punchin(cpu, (uintptr_t)&kdi_idt, PROT_READ);
+
+ CTASSERT(((uintptr_t)&kpti_tramp_start % MMU_PAGESIZE) == 0);
+ CTASSERT(((uintptr_t)&kpti_tramp_end % MMU_PAGESIZE) == 0);
+ for (va = (uintptr_t)&kpti_tramp_start;
+ va < (uintptr_t)&kpti_tramp_end; va += MMU_PAGESIZE) {
+ hati_cpu_punchin(cpu, va, PROT_READ | PROT_EXEC);
+ }
+
+ VERIFY3U(((uintptr_t)cpu->cpu_m.mcpu_ldt) % MMU_PAGESIZE, ==, 0);
+ for (va = (uintptr_t)cpu->cpu_m.mcpu_ldt, len = LDT_CPU_SIZE;
+ len >= MMU_PAGESIZE; va += MMU_PAGESIZE, len -= MMU_PAGESIZE) {
+ hati_cpu_punchin(cpu, va, PROT_READ);
+ }
+
+ /* mcpu_pad2 is the start of the page containing the kpti_frames. */
+ hati_cpu_punchin(cpu, (uintptr_t)&cpu->cpu_m.mcpu_pad2[0],
+ PROT_READ | PROT_WRITE);
+
+ if (cpu == &cpus[0]) {
+ /*
+ * CPU0 uses a global for its double fault stack to deal with
+ * the chicken and egg problem. We need to punch it into its
+ * user HAT.
+ */
+ extern char dblfault_stack0[];
+
+ hati_cpu_punchin(cpu, (uintptr_t)cpu->cpu_m.mcpu_tss,
+ PROT_READ);
+
+ for (va = (uintptr_t)dblfault_stack0,
+ len = DEFAULTSTKSZ; len >= MMU_PAGESIZE;
+ va += MMU_PAGESIZE, len -= MMU_PAGESIZE) {
+ hati_cpu_punchin(cpu, va, PROT_READ | PROT_WRITE);
+ }
+ }
+
+ CTASSERT(((uintptr_t)&kdi_isr_start % MMU_PAGESIZE) == 0);
+ CTASSERT(((uintptr_t)&kdi_isr_end % MMU_PAGESIZE) == 0);
+ for (va = (uintptr_t)&kdi_isr_start;
+ va < (uintptr_t)&kdi_isr_end; va += MMU_PAGESIZE) {
+ hati_cpu_punchin(cpu, va, PROT_READ | PROT_EXEC);
+ }
+#endif /* !__xpv */
}
/*ARGSUSED*/
static void
-hat_vlp_teardown(cpu_t *cpu)
+hat_pcp_teardown(cpu_t *cpu)
{
-#if defined(__amd64) && !defined(__xpv)
+#if !defined(__xpv)
struct hat_cpu_info *hci;
if ((hci = cpu->cpu_hat_info) == NULL)
return;
- if (hci->hci_vlp_l2ptes)
- kmem_free(hci->hci_vlp_l2ptes, MMU_PAGESIZE);
- if (hci->hci_vlp_l3ptes)
- kmem_free(hci->hci_vlp_l3ptes, MMU_PAGESIZE);
+ if (hci->hci_pcp_l2ptes != NULL)
+ kmem_free(hci->hci_pcp_l2ptes, MMU_PAGESIZE);
+ if (hci->hci_pcp_l3ptes != NULL)
+ kmem_free(hci->hci_pcp_l3ptes, MMU_PAGESIZE);
+ if (hci->hci_user_hat != NULL) {
+ hat_free_start(hci->hci_user_hat);
+ hat_free_end(hci->hci_user_hat);
+ }
#endif
}
#define NEXT_HKR(r, l, s, e) { \
kernel_ranges[r].hkr_level = l; \
@@ -912,25 +1356,28 @@
}
/*
* 32 bit PAE metal kernels use only 4 of the 512 entries in the
* page holding the top level pagetable. We use the remainder for
- * the "per CPU" page tables for VLP processes.
+ * the "per CPU" page tables for PCP processes.
* Map the top level kernel pagetable into the kernel to make
* it easy to use bcopy access these tables.
+ *
+ * PAE is always present on the 64-bit kernel, which also uses this page
+ * to maintain its per-CPU pagetables. See the big theory statement.
*/
if (mmu.pae_hat) {
- vlp_page = vmem_alloc(heap_arena, MMU_PAGESIZE, VM_SLEEP);
- hat_devload(kas.a_hat, (caddr_t)vlp_page, MMU_PAGESIZE,
+ pcp_page = vmem_alloc(heap_arena, MMU_PAGESIZE, VM_SLEEP);
+ hat_devload(kas.a_hat, (caddr_t)pcp_page, MMU_PAGESIZE,
kas.a_hat->hat_htable->ht_pfn,
#if !defined(__xpv)
PROT_WRITE |
#endif
PROT_READ | HAT_NOSYNC | HAT_UNORDERED_OK,
HAT_LOAD | HAT_LOAD_NOCONSIST);
}
- hat_vlp_setup(CPU);
+ hat_pcp_setup(CPU);
/*
* Create kmap (cached mappings of kernel PTEs)
* for 32 bit we map from segmap_start .. ekernelheap
* for 64 bit we map from segmap_start .. segmap_start + segmapsize;
@@ -939,10 +1386,16 @@
size = (uintptr_t)ekernelheap - segmap_start;
#elif defined(__amd64)
size = segmapsize;
#endif
hat_kmap_init((uintptr_t)segmap_start, size);
+
+#if !defined(__xpv)
+ ASSERT3U(kas.a_hat->hat_htable->ht_pfn, !=, PFN_INVALID);
+ ASSERT3U(kpti_safe_cr3, ==,
+ MAKECR3(kas.a_hat->hat_htable->ht_pfn, PCID_KERNEL));
+#endif
}
/*
* On 32 bit PAE mode, PTE's are 64 bits, but ordinary atomic memory references
* are 32 bit, so for safety we must use atomic_cas_64() to install these.
@@ -956,16 +1409,16 @@
x86pte_t pte;
int i;
/*
* Load the 4 entries of the level 2 page table into this
- * cpu's range of the vlp_page and point cr3 at them.
+ * cpu's range of the pcp_page and point cr3 at them.
*/
ASSERT(mmu.pae_hat);
- src = hat->hat_vlp_ptes;
- dest = vlp_page + (cpu->cpu_id + 1) * VLP_NUM_PTES;
- for (i = 0; i < VLP_NUM_PTES; ++i) {
+ src = hat->hat_copied_ptes;
+ dest = pcp_page + (cpu->cpu_id + 1) * MAX_COPIED_PTES;
+ for (i = 0; i < MAX_COPIED_PTES; ++i) {
for (;;) {
pte = dest[i];
if (pte == src[i])
break;
if (atomic_cas_64(dest + i, pte, src[i]) != src[i])
@@ -974,19 +1427,167 @@
}
}
#endif
/*
+ * Update the PCP data on the CPU cpu to the one on the hat. If this is a 32-bit
+ * process, then we must update the L2 pages and then the L3. If this is a
+ * 64-bit process then we must update the L3 entries.
+ */
+static void
+hat_pcp_update(cpu_t *cpu, const hat_t *hat)
+{
+ ASSERT3U(hat->hat_flags & HAT_COPIED, !=, 0);
+
+ if ((hat->hat_flags & HAT_COPIED_32) != 0) {
+ const x86pte_t *l2src;
+ x86pte_t *l2dst, *l3ptes, *l3uptes;
+ /*
+ * This is a 32-bit process. To set this up, we need to do the
+ * following:
+ *
+ * - Copy the 4 L2 PTEs into the dedicated L2 table
+ * - Zero the user L3 PTEs in the user and kernel page table
+ * - Set the first L3 PTE to point to the CPU L2 table
+ */
+ l2src = hat->hat_copied_ptes;
+ l2dst = cpu->cpu_hat_info->hci_pcp_l2ptes;
+ l3ptes = cpu->cpu_hat_info->hci_pcp_l3ptes;
+ l3uptes = cpu->cpu_hat_info->hci_user_l3ptes;
+
+ l2dst[0] = l2src[0];
+ l2dst[1] = l2src[1];
+ l2dst[2] = l2src[2];
+ l2dst[3] = l2src[3];
+
+ /*
+ * Make sure to use the mmu to get the number of slots. The
+ * number of PCP entries that this hat has will always be less, as
+ * it's a 32-bit process.
+ */
+ bzero(l3ptes, sizeof (x86pte_t) * mmu.top_level_uslots);
+ l3ptes[0] = MAKEPTP(cpu->cpu_hat_info->hci_pcp_l2pfn, 2);
+ bzero(l3uptes, sizeof (x86pte_t) * mmu.top_level_uslots);
+ l3uptes[0] = MAKEPTP(cpu->cpu_hat_info->hci_pcp_l2pfn, 2);
+ } else {
+ /*
+ * This is a 64-bit process. To set this up, we need to do the
+ * following:
+ *
+ * - Zero the 4 L2 PTEs in the CPU structure for safety
+ * - Copy over the new user L3 PTEs into the kernel page table
+ * - Copy over the new user L3 PTEs into the user page table
+ */
+ ASSERT3S(kpti_enable, ==, 1);
+ bzero(cpu->cpu_hat_info->hci_pcp_l2ptes, sizeof (x86pte_t) * 4);
+ bcopy(hat->hat_copied_ptes, cpu->cpu_hat_info->hci_pcp_l3ptes,
+ sizeof (x86pte_t) * mmu.top_level_uslots);
+ bcopy(hat->hat_copied_ptes, cpu->cpu_hat_info->hci_user_l3ptes,
+ sizeof (x86pte_t) * mmu.top_level_uslots);
+ }
+}
+
+static void
+reset_kpti(struct kpti_frame *fr, uint64_t kcr3, uint64_t ucr3)
+{
+ ASSERT3U(fr->kf_tr_flag, ==, 0);
+#if DEBUG
+ if (fr->kf_kernel_cr3 != 0) {
+ ASSERT3U(fr->kf_lower_redzone, ==, 0xdeadbeefdeadbeef);
+ ASSERT3U(fr->kf_middle_redzone, ==, 0xdeadbeefdeadbeef);
+ ASSERT3U(fr->kf_upper_redzone, ==, 0xdeadbeefdeadbeef);
+ }
+#endif
+
+ bzero(fr, offsetof(struct kpti_frame, kf_kernel_cr3));
+ bzero(&fr->kf_unused, sizeof (struct kpti_frame) -
+ offsetof(struct kpti_frame, kf_unused));
+
+ fr->kf_kernel_cr3 = kcr3;
+ fr->kf_user_cr3 = ucr3;
+ fr->kf_tr_ret_rsp = (uintptr_t)&fr->kf_tr_rsp;
+
+ fr->kf_lower_redzone = 0xdeadbeefdeadbeef;
+ fr->kf_middle_redzone = 0xdeadbeefdeadbeef;
+ fr->kf_upper_redzone = 0xdeadbeefdeadbeef;
+}
+
+#ifdef __xpv
+static void
+hat_switch_xen(hat_t *hat)
+{
+ struct mmuext_op t[2];
+ uint_t retcnt;
+ uint_t opcnt = 1;
+ uint64_t newcr3;
+
+ ASSERT(!(hat->hat_flags & HAT_COPIED));
+ ASSERT(!(getcr4() & CR4_PCIDE));
+
+ newcr3 = MAKECR3((uint64_t)hat->hat_htable->ht_pfn, PCID_NONE);
+
+ t[0].cmd = MMUEXT_NEW_BASEPTR;
+ t[0].arg1.mfn = mmu_btop(pa_to_ma(newcr3));
+
+ /*
+ * There's an interesting problem here, as to what to actually specify
+ * when switching to the kernel hat. For now we'll reuse the kernel hat
+ * again.
+ */
+ t[1].cmd = MMUEXT_NEW_USER_BASEPTR;
+ if (hat == kas.a_hat)
+ t[1].arg1.mfn = mmu_btop(pa_to_ma(newcr3));
+ else
+ t[1].arg1.mfn = pfn_to_mfn(hat->hat_user_ptable);
+ ++opcnt;
+
+ if (HYPERVISOR_mmuext_op(t, opcnt, &retcnt, DOMID_SELF) < 0)
+ panic("HYPERVISOR_mmu_update() failed");
+ ASSERT(retcnt == opcnt);
+}
+#endif /* __xpv */
+
+/*
* Switch to a new active hat, maintaining bit masks to track active CPUs.
*
- * On the 32-bit PAE hypervisor, %cr3 is a 64-bit value, on metal it
- * remains a 32-bit value.
+ * With KPTI, all our HATs except kas should be using PCP. Thus, to switch
+ * HATs, we need to copy over the new user PTEs, then set our trampoline context
+ * as appropriate.
+ *
+ * If lacking PCID, we then load our new cr3, which will flush the TLB: we may
+ * have established userspace TLB entries via kernel accesses, and these are no
+ * longer valid. We have to do this eagerly, as we just deleted this CPU from
+ * ->hat_cpus, so would no longer see any TLB shootdowns.
+ *
+ * With PCID enabled, things get a little more complicated. We would like to
+ * keep TLB context around when entering and exiting the kernel, and to do this,
+ * we partition the TLB into two different spaces:
+ *
+ * PCID_KERNEL is defined as zero, and used both by kas and all other address
+ * spaces while in the kernel (post-trampoline).
+ *
+ * PCID_USER is used while in userspace. Therefore, userspace cannot use any
+ * lingering PCID_KERNEL entries to kernel addresses it should not be able to
+ * read.
+ *
+ * The trampoline cr3s are set not to invalidate on a mov to %cr3. This means if
+ * we take a journey through the kernel without switching HATs, we have some
+ * hope of keeping our TLB state around.
+ *
+ * On a hat switch, rather than deal with any necessary flushes on the way out
+ * of the trampolines, we do them upfront here. If we're switching from kas, we
+ * shouldn't need any invalidation.
+ *
+ * Otherwise, we can have stale userspace entries for both PCID_USER (what
+ * happened before we move onto the kcr3) and PCID_KERNEL (any subsequent
+ * userspace accesses such as ddi_copyin()). Since setcr3() won't do these
+ * flushes on its own in PCIDE, we'll do a non-flushing load and then
+ * invalidate everything.
*/
void
hat_switch(hat_t *hat)
{
- uint64_t newcr3;
cpu_t *cpu = CPU;
hat_t *old = cpu->cpu_current_hat;
/*
* set up this information first, so we don't miss any cross calls
@@ -1004,56 +1605,67 @@
if (hat != kas.a_hat) {
CPUSET_ATOMIC_ADD(hat->hat_cpus, cpu->cpu_id);
}
cpu->cpu_current_hat = hat;
- /*
- * now go ahead and load cr3
- */
- if (hat->hat_flags & HAT_VLP) {
-#if defined(__amd64)
- x86pte_t *vlpptep = cpu->cpu_hat_info->hci_vlp_l2ptes;
+#if defined(__xpv)
+ hat_switch_xen(hat);
+#else
+ struct hat_cpu_info *info = cpu->cpu_m.mcpu_hat_info;
+ uint64_t pcide = getcr4() & CR4_PCIDE;
+ uint64_t kcr3, ucr3;
+ pfn_t tl_kpfn;
+ ulong_t flag;
- VLP_COPY(hat->hat_vlp_ptes, vlpptep);
- newcr3 = MAKECR3(cpu->cpu_hat_info->hci_vlp_pfn);
-#elif defined(__i386)
- reload_pae32(hat, cpu);
- newcr3 = MAKECR3(kas.a_hat->hat_htable->ht_pfn) +
- (cpu->cpu_id + 1) * VLP_SIZE;
-#endif
+ EQUIV(kpti_enable, !mmu.pt_global);
+
+ if (hat->hat_flags & HAT_COPIED) {
+ hat_pcp_update(cpu, hat);
+ tl_kpfn = info->hci_pcp_l3pfn;
} else {
- newcr3 = MAKECR3((uint64_t)hat->hat_htable->ht_pfn);
+ IMPLY(kpti_enable, hat == kas.a_hat);
+ tl_kpfn = hat->hat_htable->ht_pfn;
}
-#ifdef __xpv
- {
- struct mmuext_op t[2];
- uint_t retcnt;
- uint_t opcnt = 1;
- t[0].cmd = MMUEXT_NEW_BASEPTR;
- t[0].arg1.mfn = mmu_btop(pa_to_ma(newcr3));
-#if defined(__amd64)
+ if (pcide) {
+ ASSERT(kpti_enable);
+
+ kcr3 = MAKECR3(tl_kpfn, PCID_KERNEL) | CR3_NOINVL_BIT;
+ ucr3 = MAKECR3(info->hci_user_l3pfn, PCID_USER) |
+ CR3_NOINVL_BIT;
+
+ setcr3(kcr3);
+ if (old != kas.a_hat)
+ mmu_flush_tlb(FLUSH_TLB_ALL, NULL);
+ } else {
+ kcr3 = MAKECR3(tl_kpfn, PCID_NONE);
+ ucr3 = kpti_enable ?
+ MAKECR3(info->hci_user_l3pfn, PCID_NONE) :
+ 0;
+
+ setcr3(kcr3);
+ }
+
/*
- * There's an interesting problem here, as to what to
- * actually specify when switching to the kernel hat.
- * For now we'll reuse the kernel hat again.
+ * We will already be taking shootdowns for our new HAT, and as KPTI
+ * invpcid emulation needs to use kf_user_cr3, make sure we don't get
+ * any cross calls while we're inconsistent. Note that it's harmless to
+ * have a *stale* kf_user_cr3 (we just did a FLUSH_TLB_ALL), but a
+ * *zero* kf_user_cr3 is not going to go very well.
*/
- t[1].cmd = MMUEXT_NEW_USER_BASEPTR;
- if (hat == kas.a_hat)
- t[1].arg1.mfn = mmu_btop(pa_to_ma(newcr3));
- else
- t[1].arg1.mfn = pfn_to_mfn(hat->hat_user_ptable);
- ++opcnt;
-#endif /* __amd64 */
- if (HYPERVISOR_mmuext_op(t, opcnt, &retcnt, DOMID_SELF) < 0)
- panic("HYPERVISOR_mmu_update() failed");
- ASSERT(retcnt == opcnt);
+ if (pcide)
+ flag = intr_clear();
- }
-#else
- setcr3(newcr3);
-#endif
+ reset_kpti(&cpu->cpu_m.mcpu_kpti, kcr3, ucr3);
+ reset_kpti(&cpu->cpu_m.mcpu_kpti_flt, kcr3, ucr3);
+ reset_kpti(&cpu->cpu_m.mcpu_kpti_dbg, kcr3, ucr3);
+
+ if (pcide)
+ intr_restore(flag);
+
+#endif /* !__xpv */
+
ASSERT(cpu == CPU);
}
/*
* Utility to return a valid x86pte_t from protections, pfn, and level number
@@ -1361,14 +1973,13 @@
x86_hm_exit(pp);
} else {
ASSERT(flags & HAT_LOAD_NOCONSIST);
}
#if defined(__amd64)
- if (ht->ht_flags & HTABLE_VLP) {
+ if (ht->ht_flags & HTABLE_COPIED) {
cpu_t *cpu = CPU;
- x86pte_t *vlpptep = cpu->cpu_hat_info->hci_vlp_l2ptes;
- VLP_COPY(hat->hat_vlp_ptes, vlpptep);
+ hat_pcp_update(cpu, hat);
}
#endif
HTABLE_INC(ht->ht_valid_cnt);
PGCNT_INC(hat, l);
return (rv);
@@ -1436,11 +2047,12 @@
* early before we blow out the kernel stack.
*/
++curthread->t_hatdepth;
ASSERT(curthread->t_hatdepth < 16);
- ASSERT(hat == kas.a_hat || AS_LOCK_HELD(hat->hat_as));
+ ASSERT(hat == kas.a_hat || (hat->hat_flags & HAT_PCP) != 0 ||
+ AS_LOCK_HELD(hat->hat_as));
if (flags & HAT_LOAD_SHARE)
hat->hat_flags |= HAT_SHARED;
/*
@@ -1456,19 +2068,27 @@
if (ht == NULL) {
ht = htable_create(hat, va, level, NULL);
ASSERT(ht != NULL);
}
+ /*
+ * htable_va2entry checks this condition as well, but it won't include
+ * much useful info in the panic. So we do it in advance here to include
+ * all the context.
+ */
+ if (ht->ht_vaddr > va || va > HTABLE_LAST_PAGE(ht)) {
+ panic("hati_load_common: bad htable: va=%p, last page=%p, "
+ "ht->ht_vaddr=%p, ht->ht_level=%d", (void *)va,
+ (void *)HTABLE_LAST_PAGE(ht), (void *)ht->ht_vaddr,
+ (int)ht->ht_level);
+ }
entry = htable_va2entry(va, ht);
/*
* a bunch of paranoid error checking
*/
ASSERT(ht->ht_busy > 0);
- if (ht->ht_vaddr > va || va > HTABLE_LAST_PAGE(ht))
- panic("hati_load_common: bad htable %p, va %p",
- (void *)ht, (void *)va);
ASSERT(ht->ht_level == level);
/*
* construct the new PTE
*/
@@ -1914,90 +2534,63 @@
panic("No shared region support on x86");
}
#if !defined(__xpv)
/*
- * Cross call service routine to demap a virtual page on
- * the current CPU or flush all mappings in TLB.
+ * Cross call service routine to demap a range of virtual
+ * pages on the current CPU or flush all mappings in TLB.
*/
-/*ARGSUSED*/
static int
hati_demap_func(xc_arg_t a1, xc_arg_t a2, xc_arg_t a3)
{
+ _NOTE(ARGUNUSED(a3));
hat_t *hat = (hat_t *)a1;
- caddr_t addr = (caddr_t)a2;
- size_t len = (size_t)a3;
+ tlb_range_t *range = (tlb_range_t *)a2;
/*
* If the target hat isn't the kernel and this CPU isn't operating
* in the target hat, we can ignore the cross call.
*/
if (hat != kas.a_hat && hat != CPU->cpu_current_hat)
return (0);
- /*
- * For a normal address, we flush a range of contiguous mappings
- */
- if ((uintptr_t)addr != DEMAP_ALL_ADDR) {
- for (size_t i = 0; i < len; i += MMU_PAGESIZE)
- mmu_tlbflush_entry(addr + i);
+ if (range->tr_va != DEMAP_ALL_ADDR) {
+ mmu_flush_tlb(FLUSH_TLB_RANGE, range);
return (0);
}
/*
- * Otherwise we reload cr3 to effect a complete TLB flush.
+ * We are flushing all of userspace.
*
- * A reload of cr3 on a VLP process also means we must also recopy in
- * the pte values from the struct hat
+ * When using PCP, we first need to update this CPU's idea of the PCP
+ * PTEs.
*/
- if (hat->hat_flags & HAT_VLP) {
+ if (hat->hat_flags & HAT_COPIED) {
#if defined(__amd64)
- x86pte_t *vlpptep = CPU->cpu_hat_info->hci_vlp_l2ptes;
-
- VLP_COPY(hat->hat_vlp_ptes, vlpptep);
+ hat_pcp_update(CPU, hat);
#elif defined(__i386)
reload_pae32(hat, CPU);
#endif
}
- reload_cr3();
+
+ mmu_flush_tlb(FLUSH_TLB_NONGLOBAL, NULL);
return (0);
}
-/*
- * Flush all TLB entries, including global (ie. kernel) ones.
- */
-static void
-flush_all_tlb_entries(void)
-{
- ulong_t cr4 = getcr4();
-
- if (cr4 & CR4_PGE) {
- setcr4(cr4 & ~(ulong_t)CR4_PGE);
- setcr4(cr4);
-
- /*
- * 32 bit PAE also needs to always reload_cr3()
- */
- if (mmu.max_level == 2)
- reload_cr3();
- } else {
- reload_cr3();
- }
-}
-
-#define TLB_CPU_HALTED (01ul)
-#define TLB_INVAL_ALL (02ul)
+#define TLBIDLE_CPU_HALTED (0x1UL)
+#define TLBIDLE_INVAL_ALL (0x2UL)
#define CAS_TLB_INFO(cpu, old, new) \
atomic_cas_ulong((ulong_t *)&(cpu)->cpu_m.mcpu_tlb_info, (old), (new))
/*
* Record that a CPU is going idle
*/
void
tlb_going_idle(void)
{
- atomic_or_ulong((ulong_t *)&CPU->cpu_m.mcpu_tlb_info, TLB_CPU_HALTED);
+ atomic_or_ulong((ulong_t *)&CPU->cpu_m.mcpu_tlb_info,
+ TLBIDLE_CPU_HALTED);
}
/*
* Service a delayed TLB flush if coming out of being idle.
* It will be called from cpu idle notification with interrupt disabled.
@@ -2010,37 +2603,38 @@
/*
* We only have to do something if coming out of being idle.
*/
tlb_info = CPU->cpu_m.mcpu_tlb_info;
- if (tlb_info & TLB_CPU_HALTED) {
+ if (tlb_info & TLBIDLE_CPU_HALTED) {
ASSERT(CPU->cpu_current_hat == kas.a_hat);
/*
* Atomic clear and fetch of old state.
*/
while ((found = CAS_TLB_INFO(CPU, tlb_info, 0)) != tlb_info) {
- ASSERT(found & TLB_CPU_HALTED);
+ ASSERT(found & TLBIDLE_CPU_HALTED);
tlb_info = found;
SMT_PAUSE();
}
- if (tlb_info & TLB_INVAL_ALL)
- flush_all_tlb_entries();
+ if (tlb_info & TLBIDLE_INVAL_ALL)
+ mmu_flush_tlb(FLUSH_TLB_ALL, NULL);
}
}
#endif /* !__xpv */
/*
* Internal routine to do cross calls to invalidate a range of pages on
* all CPUs using a given hat.
*/
void
-hat_tlb_inval_range(hat_t *hat, uintptr_t va, size_t len)
+hat_tlb_inval_range(hat_t *hat, tlb_range_t *in_range)
{
extern int flushes_require_xcalls; /* from mp_startup.c */
cpuset_t justme;
cpuset_t cpus_to_shootdown;
+ tlb_range_t range = *in_range;
#ifndef __xpv
cpuset_t check_cpus;
cpu_t *cpup;
int c;
#endif
@@ -2057,27 +2651,28 @@
* entire set of user TLBs, since we don't know what addresses
* these were shared at.
*/
if (hat->hat_flags & HAT_SHARED) {
hat = kas.a_hat;
- va = DEMAP_ALL_ADDR;
+ range.tr_va = DEMAP_ALL_ADDR;
}
/*
* if not running with multiple CPUs, don't use cross calls
*/
if (panicstr || !flushes_require_xcalls) {
#ifdef __xpv
- if (va == DEMAP_ALL_ADDR) {
+ if (range.tr_va == DEMAP_ALL_ADDR) {
xen_flush_tlb();
} else {
- for (size_t i = 0; i < len; i += MMU_PAGESIZE)
- xen_flush_va((caddr_t)(va + i));
+ for (size_t i = 0; i < TLB_RANGE_LEN(&range);
+ i += MMU_PAGESIZE) {
+ xen_flush_va((caddr_t)(range.tr_va + i));
}
+ }
#else
- (void) hati_demap_func((xc_arg_t)hat,
- (xc_arg_t)va, (xc_arg_t)len);
+ (void) hati_demap_func((xc_arg_t)hat, (xc_arg_t)&range, 0);
#endif
return;
}
@@ -2107,17 +2702,17 @@
cpup = cpu[c];
if (cpup == NULL)
continue;
tlb_info = cpup->cpu_m.mcpu_tlb_info;
- while (tlb_info == TLB_CPU_HALTED) {
- (void) CAS_TLB_INFO(cpup, TLB_CPU_HALTED,
- TLB_CPU_HALTED | TLB_INVAL_ALL);
+ while (tlb_info == TLBIDLE_CPU_HALTED) {
+ (void) CAS_TLB_INFO(cpup, TLBIDLE_CPU_HALTED,
+ TLBIDLE_CPU_HALTED | TLBIDLE_INVAL_ALL);
SMT_PAUSE();
tlb_info = cpup->cpu_m.mcpu_tlb_info;
}
- if (tlb_info == (TLB_CPU_HALTED | TLB_INVAL_ALL)) {
+ if (tlb_info == (TLBIDLE_CPU_HALTED | TLBIDLE_INVAL_ALL)) {
HATSTAT_INC(hs_tlb_inval_delayed);
CPUSET_DEL(cpus_to_shootdown, c);
}
}
#endif
@@ -2124,35 +2719,37 @@
if (CPUSET_ISNULL(cpus_to_shootdown) ||
CPUSET_ISEQUAL(cpus_to_shootdown, justme)) {
#ifdef __xpv
- if (va == DEMAP_ALL_ADDR) {
+ if (range.tr_va == DEMAP_ALL_ADDR) {
xen_flush_tlb();
} else {
- for (size_t i = 0; i < len; i += MMU_PAGESIZE)
- xen_flush_va((caddr_t)(va + i));
+ for (size_t i = 0; i < TLB_RANGE_LEN(&range);
+ i += MMU_PAGESIZE) {
+ xen_flush_va((caddr_t)(range.tr_va + i));
}
+ }
#else
- (void) hati_demap_func((xc_arg_t)hat,
- (xc_arg_t)va, (xc_arg_t)len);
+ (void) hati_demap_func((xc_arg_t)hat, (xc_arg_t)&range, 0);
#endif
} else {
CPUSET_ADD(cpus_to_shootdown, CPU->cpu_id);
#ifdef __xpv
- if (va == DEMAP_ALL_ADDR) {
+ if (range.tr_va == DEMAP_ALL_ADDR) {
xen_gflush_tlb(cpus_to_shootdown);
} else {
- for (size_t i = 0; i < len; i += MMU_PAGESIZE) {
- xen_gflush_va((caddr_t)(va + i),
+ for (size_t i = 0; i < TLB_RANGE_LEN(&range);
+ i += MMU_PAGESIZE) {
+ xen_gflush_va((caddr_t)(range.tr_va + i),
cpus_to_shootdown);
}
}
#else
- xc_call((xc_arg_t)hat, (xc_arg_t)va, (xc_arg_t)len,
+ xc_call((xc_arg_t)hat, (xc_arg_t)&range, 0,
CPUSET2BV(cpus_to_shootdown), hati_demap_func);
#endif
}
kpreempt_enable();
@@ -2159,11 +2756,19 @@
}
void
hat_tlb_inval(hat_t *hat, uintptr_t va)
{
- hat_tlb_inval_range(hat, va, MMU_PAGESIZE);
+ /*
+ * Create range for a single page.
+ */
+ tlb_range_t range;
+ range.tr_va = va;
+ range.tr_cnt = 1; /* one page */
+ range.tr_level = MIN_PAGE_LEVEL; /* pages are MMU_PAGESIZE */
+
+ hat_tlb_inval_range(hat, &range);
}
/*
* Interior routine for HAT_UNLOADs from hat_unload_callback(),
* hat_kmap_unload() OR from hat_steal() code. This routine doesn't
@@ -2326,36 +2931,25 @@
}
XPV_ALLOW_MIGRATE();
}
/*
- * Do the callbacks for ranges being unloaded.
- */
-typedef struct range_info {
- uintptr_t rng_va;
- ulong_t rng_cnt;
- level_t rng_level;
-} range_info_t;
-
-/*
* Invalidate the TLB, and perform the callback to the upper level VM system,
* for the specified ranges of contiguous pages.
*/
static void
-handle_ranges(hat_t *hat, hat_callback_t *cb, uint_t cnt, range_info_t *range)
+handle_ranges(hat_t *hat, hat_callback_t *cb, uint_t cnt, tlb_range_t *range)
{
while (cnt > 0) {
- size_t len;
-
--cnt;
- len = range[cnt].rng_cnt << LEVEL_SHIFT(range[cnt].rng_level);
- hat_tlb_inval_range(hat, (uintptr_t)range[cnt].rng_va, len);
+ hat_tlb_inval_range(hat, &range[cnt]);
if (cb != NULL) {
- cb->hcb_start_addr = (caddr_t)range[cnt].rng_va;
+ cb->hcb_start_addr = (caddr_t)range[cnt].tr_va;
cb->hcb_end_addr = cb->hcb_start_addr;
- cb->hcb_end_addr += len;
+ cb->hcb_end_addr += range[cnt].tr_cnt <<
+ LEVEL_SHIFT(range[cnt].tr_level);
cb->hcb_function(cb);
}
}
}
@@ -2381,11 +2975,11 @@
uintptr_t vaddr = (uintptr_t)addr;
uintptr_t eaddr = vaddr + len;
htable_t *ht = NULL;
uint_t entry;
uintptr_t contig_va = (uintptr_t)-1L;
- range_info_t r[MAX_UNLOAD_CNT];
+ tlb_range_t r[MAX_UNLOAD_CNT];
uint_t r_cnt = 0;
x86pte_t old_pte;
XPV_DISALLOW_MIGRATE();
ASSERT(hat == kas.a_hat || eaddr <= _userlimit);
@@ -2421,18 +3015,18 @@
/*
* We'll do the call backs for contiguous ranges
*/
if (vaddr != contig_va ||
- (r_cnt > 0 && r[r_cnt - 1].rng_level != ht->ht_level)) {
+ (r_cnt > 0 && r[r_cnt - 1].tr_level != ht->ht_level)) {
if (r_cnt == MAX_UNLOAD_CNT) {
handle_ranges(hat, cb, r_cnt, r);
r_cnt = 0;
}
- r[r_cnt].rng_va = vaddr;
- r[r_cnt].rng_cnt = 0;
- r[r_cnt].rng_level = ht->ht_level;
+ r[r_cnt].tr_va = vaddr;
+ r[r_cnt].tr_cnt = 0;
+ r[r_cnt].tr_level = ht->ht_level;
++r_cnt;
}
/*
* Unload one mapping (for a single page) from the page tables.
@@ -2446,11 +3040,11 @@
entry = htable_va2entry(vaddr, ht);
hat_pte_unmap(ht, entry, flags, old_pte, NULL, B_FALSE);
ASSERT(ht->ht_level <= mmu.max_page_level);
vaddr += LEVEL_SIZE(ht->ht_level);
contig_va = vaddr;
- ++r[r_cnt - 1].rng_cnt;
+ ++r[r_cnt - 1].tr_cnt;
}
if (ht)
htable_release(ht);
/*
@@ -2475,18 +3069,18 @@
sz = hat_getpagesize(hat, va);
if (sz < 0) {
#ifdef __xpv
xen_flush_tlb();
#else
- flush_all_tlb_entries();
+ mmu_flush_tlb(FLUSH_TLB_ALL, NULL);
#endif
break;
}
#ifdef __xpv
xen_flush_va(va);
#else
- mmu_tlbflush_entry(va);
+ mmu_flush_tlb_kpage((uintptr_t)va);
#endif
va += sz;
}
}
@@ -3148,11 +3742,11 @@
}
}
/*
* flush the TLBs - since we're probably dealing with MANY mappings
- * we do just one CR3 reload.
+ * we just do a full invalidation.
*/
if (!(hat->hat_flags & HAT_FREEING) && need_demaps)
hat_tlb_inval(hat, DEMAP_ALL_ADDR);
/*
@@ -3931,11 +4525,11 @@
(pte_pa & MMU_PAGEOFFSET) >> mmu.pte_size_shift, NULL);
if (mmu.pae_hat)
*pteptr = 0;
else
*(x86pte32_t *)pteptr = 0;
- mmu_tlbflush_entry(addr);
+ mmu_flush_tlb_kpage((uintptr_t)addr);
x86pte_mapout();
}
#endif
ht = htable_getpte(kas.a_hat, ALIGN2PAGE(addr), NULL, NULL, 0);
@@ -3992,11 +4586,11 @@
(pte_pa & MMU_PAGEOFFSET) >> mmu.pte_size_shift, NULL);
if (mmu.pae_hat)
*(x86pte_t *)pteptr = pte;
else
*(x86pte32_t *)pteptr = (x86pte32_t)pte;
- mmu_tlbflush_entry(addr);
+ mmu_flush_tlb_kpage((uintptr_t)addr);
x86pte_mapout();
}
#endif
XPV_ALLOW_MIGRATE();
}
@@ -4026,11 +4620,11 @@
void
hat_cpu_online(struct cpu *cpup)
{
if (cpup != CPU) {
x86pte_cpu_init(cpup);
- hat_vlp_setup(cpup);
+ hat_pcp_setup(cpup);
}
CPUSET_ATOMIC_ADD(khat_cpuset, cpup->cpu_id);
}
/*
@@ -4041,11 +4635,11 @@
hat_cpu_offline(struct cpu *cpup)
{
ASSERT(cpup != CPU);
CPUSET_ATOMIC_DEL(khat_cpuset, cpup->cpu_id);
- hat_vlp_teardown(cpup);
+ hat_pcp_teardown(cpup);
x86pte_cpu_fini(cpup);
}
/*
* Function called after all CPUs are brought online.
@@ -4488,5 +5082,34 @@
htable_release(ht);
htable_release(ht);
XPV_ALLOW_MIGRATE();
}
#endif /* __xpv */
+
+/*
+ * Helper function to punch in a mapping that we need with the specified
+ * attributes.
+ */
+void
+hati_cpu_punchin(cpu_t *cpu, uintptr_t va, uint_t attrs)
+{
+ int ret;
+ pfn_t pfn;
+ hat_t *cpu_hat = cpu->cpu_hat_info->hci_user_hat;
+
+ ASSERT3S(kpti_enable, ==, 1);
+ ASSERT3P(cpu_hat, !=, NULL);
+ ASSERT3U(cpu_hat->hat_flags & HAT_PCP, ==, HAT_PCP);
+ ASSERT3U(va & MMU_PAGEOFFSET, ==, 0);
+
+ pfn = hat_getpfnum(kas.a_hat, (caddr_t)va);
+ VERIFY3U(pfn, !=, PFN_INVALID);
+
+ /*
+ * We purposefully don't try to find the page_t. This means that this
+ * will be marked PT_NOCONSIST; however, given that this is pretty much
+ * a static mapping that we're using, we should be relatively OK.
+ */
+ attrs |= HAT_STORECACHING_OK;
+ ret = hati_load_common(cpu_hat, va, NULL, attrs, 0, 0, pfn);
+ VERIFY3S(ret, ==, 0);
+}