8956 Implement KPTI
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
9208 hati_demap_func should take pagesize into account
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Tim Kordas <tim.kordas@joyent.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
*** 25,34 ****
--- 25,35 ----
* Copyright (c) 2010, Intel Corporation.
* All rights reserved.
*/
/*
* Copyright 2011 Nexenta Systems, Inc. All rights reserved.
+ * Copyright 2018 Joyent, Inc. All rights reserved.
* Copyright (c) 2014, 2015 by Delphix. All rights reserved.
*/
/*
* VM - Hardware Address Translation management for i386 and amd64
*** 40,49 ****
--- 41,235 ----
* that work in conjunction with this code.
*
* Routines used only inside of i86pc/vm start with hati_ for HAT Internal.
*/
+ /*
+ * amd64 HAT Design
+ *
+ * ----------
+ * Background
+ * ----------
+ *
+ * On x86, the address space is shared between a user process and the kernel.
+ * This is different from SPARC. Conventionally, the kernel lives at the top of
+ * the address space and the user process gets to enjoy the rest of it. If you
+ * look at the image of the address map in uts/i86pc/os/startup.c, you'll get a
+ * rough sense of how the address space is laid out and used.
+ *
+ * Every unique address space is represented by an instance of a HAT structure
+ * called a 'hat_t'. In addition to a hat_t structure for each process, there is
+ * also one that is used for the kernel (kas.a_hat), and each CPU ultimately
+ * also has a HAT.
+ *
+ * Each HAT contains a pointer to its root page table. This root page table is
+ * what we call an L3 page table in illumos and Intel calls the PML4. It is the
+ * physical address of the L3 table that we place in the %cr3 register which the
+ * processor uses.
+ *
+ * Each of the many layers of the page table is represented by a structure
+ * called an htable_t. The htable_t manages a set of 512 8-byte entries. The
+ * number of entries in a given page table is constant across all different
+ * level page tables. Note, this is only true on amd64. This has not always been
+ * the case on x86.
+ *
+ * Each entry in a page table, generally referred to as a PTE, may refer to
+ * another page table or a memory location, depending on the level of the page
+ * table and the use of large pages. Importantly, the top-level L3 page table
+ * (PML4) only supports linking to further page tables. This is also true on
+ * systems which support a 5th level page table (which we do not currently
+ * support).
+ *
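For readers new to the four-level layout described above, here is a minimal, self-contained sketch (the helper name va_index and the sample address are illustrative, not the kernel's) of how a virtual address decomposes into per-level table indices, using the 512-entry tables and the 12/21/30/39 level shifts that mmu_init() establishes for amd64 later in this file:

#include <stdio.h>
#include <stdint.h>

/*
 * Illustrative only: decompose an amd64 virtual address into its page table
 * index at each level (512 entries per table, shifts 12/21/30/39 for L0..L3).
 */
static unsigned int
va_index(uint64_t va, int level)
{
	static const int shift[] = { 12, 21, 30, 39 };	/* L0..L3 */

	return ((unsigned int)((va >> shift[level]) & 0x1ff));
}

int
main(void)
{
	uint64_t va = 0x00007fffbeef1000ULL;
	int l;

	for (l = 3; l >= 0; l--)
		printf("L%d index = %u\n", l, va_index(va, l));
	return (0);
}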
+ * Historically on x86, when a process was running, the root of its page table
+ * was inserted into %cr3 on each CPU on which it was currently running.
+ * When processes would switch (by calling hat_switch()), then the value in %cr3
+ * on that CPU would change to that of the new HAT. While this behavior is still
+ * maintained in the xpv kernel, this is not what is done today.
+ *
+ * -------------------
+ * Per-CPU Page Tables
+ * -------------------
+ *
+ * Throughout the system the 64-bit kernel has a notion of what it calls a
+ * per-CPU page table or PCP. The notion of a per-CPU page table was originally
+ * introduced as part of the original work to support x86 PAE. On the 64-bit
+ * kernel, it was originally used for 32-bit processes running on the 64-bit
+ * kernel. The rationale behind this was that each 32-bit process could have all
+ * of its memory represented in a single L2 page table as each L2 page table
+ * entry represents 1 GiB of memory.
+ *
+ * Following on from this, the idea was that given that all of the L3 page table
+ * entries for 32-bit processes are basically going to be identical with the
+ * exception of the first entry in the page table, why not share those page
+ * table entries. This gave rise to the idea of a per-CPU page table.
+ *
+ * The way this works is that we have a member in the machcpu_t called the
+ * mcpu_hat_info. That structure contains two different 4k pages: one that
+ * represents the L3 page table and one that represents an L2 page table. When
+ * the CPU starts up, the L3 page table entries are copied in from the kernel's
+ * page table. The L3 kernel entries do not change throughout the lifetime of
+ * the kernel. The kernel portion of these L3 pages for each CPU has the same
+ * records, meaning that they point to the same L2 page tables and thus see a
+ * consistent view of the world.
+ *
+ * When a 32-bit process is loaded into this world, we copy the 32-bit process's
+ * four top-level page table entries into the CPU's L2 page table and then set
+ * the CPU's first L3 page table entry to point to the CPU's L2 page.
+ * Specifically, in hat_pcp_update(), we're copying from the process's
+ * HAT_COPIED_32 HAT into the page tables specific to this CPU.
+ *
+ * As part of the implementation of kernel page table isolation, this was also
+ * extended to 64-bit processes. When a 64-bit process runs, we'll copy their L3
+ * PTEs across into the current CPU's L3 page table. (As we can't do the
+ * first-L3-entry trick for 64-bit processes, ->hci_pcp_l2ptes is unused in this
+ * case.)
+ *
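To keep the pieces introduced above straight, here is a simplified sketch of the per-CPU state being described; the stand-in typedefs exist only to make the sketch self-contained, and the authoritative definition of struct hat_cpu_info (which carries additional members) lives in the i86pc headers. The KPTI additions (hci_user_hat and friends) are covered in the next section.

typedef uint64_t x86pte_t;		/* stand-in for the sketch */
typedef unsigned long pfn_t;		/* stand-in for the sketch */

struct hat_cpu_info_sketch {
	x86pte_t	*hci_pcp_l3ptes;	/* this CPU's kernel-view L3 (PML4) page */
	x86pte_t	*hci_pcp_l2ptes;	/* this CPU's L2 page for 32-bit processes */
	pfn_t		hci_pcp_l3pfn;		/* pfn that goes into %cr3 for HAT_COPIED hats */
	pfn_t		hci_pcp_l2pfn;		/* pfn of the L2 page, linked from L3 slot 0 */
};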
+ * The use of per-CPU page tables has a lot of implementation ramifications. A
+ * HAT that runs a user process will be flagged with the HAT_COPIED flag to
+ * indicate that it is using the per-CPU page table functionality. In tandem
+ * with the HAT, the top-level htable_t will be flagged with the HTABLE_COPIED
+ * flag. If the HAT represents a 32-bit process, then we will also set the
+ * HAT_COPIED_32 flag on that hat_t.
+ *
+ * These two flags work together. The top-level htable_t when using per-CPU page
+ * tables is 'virtual'. We never allocate a ptable for this htable_t (i.e.
+ * ht->ht_pfn is PFN_INVALID). Instead, when we need to modify a PTE in an
+ * HTABLE_COPIED ptable, x86pte_access_pagetable() will redirect any accesses to
+ * ht_hat->hat_copied_ptes.
+ *
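A sketch of the shape of that redirect (hypothetical helper name; the real logic is inside x86pte_access_pagetable() in htable.c, which also handles mapping in ordinary pagetable pages and the hypervisor case):

static x86pte_t *
copied_pte_ptr(htable_t *ht, uint_t entry)
{
	if (ht->ht_flags & HTABLE_COPIED) {
		/* No backing ptable page; use the hat's copied PTE array. */
		ASSERT3U(ht->ht_pfn, ==, PFN_INVALID);
		return (&ht->ht_hat->hat_copied_ptes[entry]);
	}
	return (NULL);	/* ordinary tables: map the page at ht->ht_pfn instead */
}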
+ * Of course, such a modification won't actually modify the HAT_PCP page tables
+ * that were copied from the HAT_COPIED htable. When we change the top level
+ * page table entries (L2 PTEs for a 32-bit process and L3 PTEs for a 64-bit
+ * process), we need to make sure to trigger hat_pcp_update() on all CPUs that
+ * are currently tied to this HAT (including the current CPU).
+ *
+ * To do this, PCP piggy-backs on TLB invalidation, specifically via the
+ * hat_tlb_inval() path from link_ptp() and unlink_ptp().
+ *
+ * (Importantly, in all such cases, when this is in operation, the top-level
+ * entry should not be able to refer to an actual page table entry that can be
+ * changed and consolidated into a large page. If large page consolidation is
+ * required here, then there will be much that needs to be reconsidered.)
+ *
+ * -----------------------------------------------
+ * Kernel Page Table Isolation and the Per-CPU HAT
+ * -----------------------------------------------
+ *
+ * All Intel CPUs that support speculative execution and paging are subject to a
+ * series of bugs that have been termed 'Meltdown'. These exploits allow a user
+ * process to read kernel memory through cache side channels and speculative
+ * execution. To mitigate this on vulnerable CPUs, we need to use a technique
+ * called kernel page table isolation. What this requires is that we have two
+ * different page table roots. When executing in kernel mode, we will use a %cr3
+ * value that has both the user and kernel pages. However, when executing in
+ * user mode, we will need a %cr3 that has all of the user pages but only the
+ * subset of the kernel pages required to operate.
+ *
+ * These kernel pages that we need mapped are:
+ *
+ * o Kernel Text that allows us to switch between the cr3 values.
+ * o The current global descriptor table (GDT)
+ * o The current interrupt descriptor table (IDT)
+ * o The current task switching state (TSS)
+ * o The current local descriptor table (LDT)
+ * o Stacks and scratch space used by the interrupt handlers
+ *
+ * For more information on the stack switching techniques, construction of the
+ * trampolines, and more, please see i86pc/ml/kpti_trampolines.s. The most
+ * important aspects of these mappings are the following two constraints:
+ *
+ * o The mappings are all per-CPU (except for read-only text)
+ * o The mappings are static. They are all established before the CPU is
+ * started (with the exception of the boot CPU).
+ *
+ * To facilitate the kernel page table isolation we employ our per-CPU
+ * page tables discussed in the previous section and add the notion of a per-CPU
+ * HAT. Fundamentally we have a second page table root. There is both a kernel
+ * page table (hci_pcp_l3ptes), and a user L3 page table (hci_user_l3ptes).
+ * Both will have the user page table entries copied into them, the same way
+ * that we discussed in the section 'Per-CPU Page Tables'.
+ *
+ * The complex part of this is how we construct the set of kernel mappings
+ * that should be present when running with the user page table. To answer that,
+ * we add the notion of a per-CPU HAT. This HAT functions like a normal HAT,
+ * except that it's not really associated with an address space the same way
+ * that other HATs are.
+ *
+ * This HAT lives in the hci_user_hat member of the 'struct hat_cpu_info' that
+ * hangs off of the machcpu. We use this per-CPU HAT to create the set
+ * of kernel mappings that should be present on this CPU. The kernel mappings
+ * are added to the per-CPU HAT through the function hati_cpu_punchin(). Once a
+ * mapping has been punched in, it may not be punched out. The reason that we
+ * opt to leverage a HAT structure is that it knows how to allocate and manage
+ * all of the lower level page tables as required.
+ *
+ * Because all of the mappings are present at the beginning of time for this CPU
+ * and none of the mappings are in the kernel pageable segment, we don't have to
+ * worry about faulting on these HAT structures and thus the notion of the
+ * current HAT that we're using is always the appropriate HAT for the process
+ * (usually a user HAT or the kernel's HAT).
+ *
+ * A further constraint we place on the system with these per-CPU HATs is that
+ * they are not subject to htable_steal(). Because each CPU will have a rather
+ * fixed number of page tables, the same way that we don't steal from the
+ * kernel's HAT, it was determined that we should not steal from this HAT due to
+ * the complications involved and somewhat criminal nature of htable_steal().
+ *
+ * The per-CPU HAT is initialized in hat_pcp_setup() which is called as part of
+ * onlining the CPU, but before the CPU is actually started. The per-CPU HAT is
+ * removed in hat_pcp_teardown() which is called when a CPU is being offlined to
+ * be removed from the system (which is different from what psradm usually
+ * does).
+ *
+ * Finally, once the CPU has been onlined, the set of mappings in the per-CPU
+ * HAT must not change. The HAT related functions that we call are not meant to
+ * be called when we're switching between processes. For example, it is quite
+ * possible that if they were, they would try to grab an htable mutex which
+ * another thread might have. One needs to treat hat_switch() as though it runs
+ * above LOCK_LEVEL and therefore it _must not_ block under any circumstance.
+ */
+
#include <sys/machparam.h>
#include <sys/machsystm.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/systm.h>
*** 93,123 ****
/*
* The page that is the kernel's top level pagetable.
*
* For 32 bit PAE support on i86pc, the kernel hat will use the 1st 4 entries
* on this 4K page for its top level page table. The remaining groups of
! * 4 entries are used for per processor copies of user VLP pagetables for
* running threads. See hat_switch() and reload_pae32() for details.
*
! * vlp_page[0..3] - level==2 PTEs for kernel HAT
! * vlp_page[4..7] - level==2 PTEs for user thread on cpu 0
! * vlp_page[8..11] - level==2 PTE for user thread on cpu 1
* etc...
*/
! static x86pte_t *vlp_page;
/*
* forward declaration of internal utility routines
*/
static x86pte_t hati_update_pte(htable_t *ht, uint_t entry, x86pte_t expected,
x86pte_t new);
/*
! * The kernel address space exists in all HATs. To implement this the
! * kernel reserves a fixed number of entries in the topmost level(s) of page
! * tables. The values are setup during startup and then copied to every user
! * hat created by hat_alloc(). This means that kernelbase must be:
*
* 4Meg aligned for 32 bit kernels
* 512Gig aligned for x86_64 64 bit kernel
*
* The hat_kernel_range_ts describe what needs to be copied from kernel hat
--- 279,312 ----
/*
* The page that is the kernel's top level pagetable.
*
* For 32 bit PAE support on i86pc, the kernel hat will use the 1st 4 entries
* on this 4K page for its top level page table. The remaining groups of
! * 4 entries are used for per processor copies of user PCP pagetables for
* running threads. See hat_switch() and reload_pae32() for details.
*
! * pcp_page[0..3] - level==2 PTEs for kernel HAT
! * pcp_page[4..7] - level==2 PTEs for user thread on cpu 0
! * pcp_page[8..11] - level==2 PTE for user thread on cpu 1
* etc...
+ *
+ * On the 64-bit kernel, this is the normal root of the page table and there is
+ * nothing special about it when used for other CPUs.
*/
! static x86pte_t *pcp_page;
/*
* forward declaration of internal utility routines
*/
static x86pte_t hati_update_pte(htable_t *ht, uint_t entry, x86pte_t expected,
x86pte_t new);
/*
! * The kernel address space exists in all non-HAT_COPIED HATs. To implement this
! * the kernel reserves a fixed number of entries in the topmost level(s) of page
! * tables. The values are setup during startup and then copied to every user hat
! * created by hat_alloc(). This means that kernelbase must be:
*
* 4Meg aligned for 32 bit kernels
* 512Gig aligned for x86_64 64 bit kernel
*
* The hat_kernel_range_ts describe what needs to be copied from kernel hat
*** 168,178 ****
*/
kmutex_t hat_list_lock;
kcondvar_t hat_list_cv;
kmem_cache_t *hat_cache;
kmem_cache_t *hat_hash_cache;
! kmem_cache_t *vlp_hash_cache;
/*
* Simple statistics
*/
struct hatstats hatstat;
--- 357,367 ----
*/
kmutex_t hat_list_lock;
kcondvar_t hat_list_cv;
kmem_cache_t *hat_cache;
kmem_cache_t *hat_hash_cache;
! kmem_cache_t *hat32_hash_cache;
/*
* Simple statistics
*/
struct hatstats hatstat;
*** 186,201 ****
* HAT uses cmpxchg() and the other paths (hypercall etc.) were never
* incorrect.
*/
int pt_kern;
- /*
- * useful stuff for atomic access/clearing/setting REF/MOD/RO bits in page_t's.
- */
- extern void atomic_orb(uchar_t *addr, uchar_t val);
- extern void atomic_andb(uchar_t *addr, uchar_t val);
-
#ifndef __xpv
extern pfn_t memseg_get_start(struct memseg *);
#endif
#define PP_GETRM(pp, rmmask) (pp->p_nrm & rmmask)
--- 375,384 ----
*** 234,259 ****
hat->hat_ht_hash = NULL;
return (0);
}
/*
* Allocate a hat structure for as. We also create the top level
* htable and initialize it to contain the kernel hat entries.
*/
hat_t *
hat_alloc(struct as *as)
{
hat_t *hat;
htable_t *ht; /* top level htable */
! uint_t use_vlp;
uint_t r;
hat_kernel_range_t *rp;
uintptr_t va;
uintptr_t eva;
uint_t start;
uint_t cnt;
htable_t *src;
/*
* Once we start creating user process HATs we can enable
* the htable_steal() code.
*/
--- 417,469 ----
hat->hat_ht_hash = NULL;
return (0);
}
/*
+ * Put it at the start of the global list of all hats (used by stealing)
+ *
+ * kas.a_hat is not in the list but is instead used to find the
+ * first and last items in the list.
+ *
+ * - kas.a_hat->hat_next points to the start of the user hats.
+ * The list ends where hat->hat_next == NULL
+ *
+ * - kas.a_hat->hat_prev points to the last of the user hats.
+ * The list begins where hat->hat_prev == NULL
+ */
+ static void
+ hat_list_append(hat_t *hat)
+ {
+ mutex_enter(&hat_list_lock);
+ hat->hat_prev = NULL;
+ hat->hat_next = kas.a_hat->hat_next;
+ if (hat->hat_next)
+ hat->hat_next->hat_prev = hat;
+ else
+ kas.a_hat->hat_prev = hat;
+ kas.a_hat->hat_next = hat;
+ mutex_exit(&hat_list_lock);
+ }
+
+ /*
* Allocate a hat structure for as. We also create the top level
* htable and initialize it to contain the kernel hat entries.
*/
hat_t *
hat_alloc(struct as *as)
{
hat_t *hat;
htable_t *ht; /* top level htable */
! uint_t use_copied;
uint_t r;
hat_kernel_range_t *rp;
uintptr_t va;
uintptr_t eva;
uint_t start;
uint_t cnt;
htable_t *src;
+ boolean_t use_hat32_cache;
/*
* Once we start creating user process HATs we can enable
* the htable_steal() code.
*/
*** 266,299 ****
mutex_init(&hat->hat_mutex, NULL, MUTEX_DEFAULT, NULL);
ASSERT(hat->hat_flags == 0);
#if defined(__xpv)
/*
! * No VLP stuff on the hypervisor due to the 64-bit split top level
* page tables. On 32-bit it's not needed as the hypervisor takes
* care of copying the top level PTEs to a below 4Gig page.
*/
! use_vlp = 0;
#else /* __xpv */
! /* 32 bit processes uses a VLP style hat when running with PAE */
! #if defined(__amd64)
! use_vlp = (ttoproc(curthread)->p_model == DATAMODEL_ILP32);
! #elif defined(__i386)
! use_vlp = mmu.pae_hat;
! #endif
#endif /* __xpv */
! if (use_vlp) {
! hat->hat_flags = HAT_VLP;
! bzero(hat->hat_vlp_ptes, VLP_SIZE);
}
/*
! * Allocate the htable hash
*/
! if ((hat->hat_flags & HAT_VLP)) {
! hat->hat_num_hash = mmu.vlp_hash_cnt;
! hat->hat_ht_hash = kmem_cache_alloc(vlp_hash_cache, KM_SLEEP);
} else {
hat->hat_num_hash = mmu.hash_cnt;
hat->hat_ht_hash = kmem_cache_alloc(hat_hash_cache, KM_SLEEP);
}
bzero(hat->hat_ht_hash, hat->hat_num_hash * sizeof (htable_t *));
--- 476,538 ----
mutex_init(&hat->hat_mutex, NULL, MUTEX_DEFAULT, NULL);
ASSERT(hat->hat_flags == 0);
#if defined(__xpv)
/*
! * No PCP stuff on the hypervisor due to the 64-bit split top level
* page tables. On 32-bit it's not needed as the hypervisor takes
* care of copying the top level PTEs to a below 4Gig page.
*/
! use_copied = 0;
! use_hat32_cache = B_FALSE;
! hat->hat_max_level = mmu.max_level;
! hat->hat_num_copied = 0;
! hat->hat_flags = 0;
#else /* __xpv */
!
! /*
! * All processes use HAT_COPIED on the 64-bit kernel if KPTI is
! * turned on.
! */
! if (ttoproc(curthread)->p_model == DATAMODEL_ILP32) {
! use_copied = 1;
! hat->hat_max_level = mmu.max_level32;
! hat->hat_num_copied = mmu.num_copied_ents32;
! use_hat32_cache = B_TRUE;
! hat->hat_flags |= HAT_COPIED_32;
! HATSTAT_INC(hs_hat_copied32);
! } else if (kpti_enable == 1) {
! use_copied = 1;
! hat->hat_max_level = mmu.max_level;
! hat->hat_num_copied = mmu.num_copied_ents;
! use_hat32_cache = B_FALSE;
! HATSTAT_INC(hs_hat_copied64);
! } else {
! use_copied = 0;
! use_hat32_cache = B_FALSE;
! hat->hat_max_level = mmu.max_level;
! hat->hat_num_copied = 0;
! hat->hat_flags = 0;
! HATSTAT_INC(hs_hat_normal64);
! }
#endif /* __xpv */
! if (use_copied) {
! hat->hat_flags |= HAT_COPIED;
! bzero(hat->hat_copied_ptes, sizeof (hat->hat_copied_ptes));
}
/*
! * Allocate the htable hash. For 32-bit PCP processes we use the
! * hat32_hash_cache. However, for 64-bit PCP processes we do not as the
! * number of entries that they have to handle is closer to
! * hat_hash_cache in count (though there will be more wastage when we
! * have more DRAM in the system and thus push down the user address
! * range).
*/
! if (use_hat32_cache) {
! hat->hat_num_hash = mmu.hat32_hash_cnt;
! hat->hat_ht_hash = kmem_cache_alloc(hat32_hash_cache, KM_SLEEP);
} else {
hat->hat_num_hash = mmu.hash_cnt;
hat->hat_ht_hash = kmem_cache_alloc(hat_hash_cache, KM_SLEEP);
}
bzero(hat->hat_ht_hash, hat->hat_num_hash * sizeof (htable_t *));
*** 307,317 ****
XPV_DISALLOW_MIGRATE();
ht = htable_create(hat, (uintptr_t)0, TOP_LEVEL(hat), NULL);
hat->hat_htable = ht;
#if defined(__amd64)
! if (hat->hat_flags & HAT_VLP)
goto init_done;
#endif
for (r = 0; r < num_kernel_ranges; ++r) {
rp = &kernel_ranges[r];
--- 546,556 ----
XPV_DISALLOW_MIGRATE();
ht = htable_create(hat, (uintptr_t)0, TOP_LEVEL(hat), NULL);
hat->hat_htable = ht;
#if defined(__amd64)
! if (hat->hat_flags & HAT_COPIED)
goto init_done;
#endif
for (r = 0; r < num_kernel_ranges; ++r) {
rp = &kernel_ranges[r];
*** 332,344 ****
(eva > rp->hkr_end_va || eva == 0))
cnt = htable_va2entry(rp->hkr_end_va, ht) -
start;
#if defined(__i386) && !defined(__xpv)
! if (ht->ht_flags & HTABLE_VLP) {
! bcopy(&vlp_page[start],
! &hat->hat_vlp_ptes[start],
cnt * sizeof (x86pte_t));
continue;
}
#endif
src = htable_lookup(kas.a_hat, va, rp->hkr_level);
--- 571,583 ----
(eva > rp->hkr_end_va || eva == 0))
cnt = htable_va2entry(rp->hkr_end_va, ht) -
start;
#if defined(__i386) && !defined(__xpv)
! if (ht->ht_flags & HTABLE_COPIED) {
! bcopy(&pcp_page[start],
! &hat->hat_copied_ptes[start],
cnt * sizeof (x86pte_t));
continue;
}
#endif
src = htable_lookup(kas.a_hat, va, rp->hkr_level);
*** 359,392 ****
xen_pin(hat->hat_user_ptable, mmu.max_level);
#endif
#endif
XPV_ALLOW_MIGRATE();
/*
! * Put it at the start of the global list of all hats (used by stealing)
! *
! * kas.a_hat is not in the list but is instead used to find the
! * first and last items in the list.
! *
! * - kas.a_hat->hat_next points to the start of the user hats.
! * The list ends where hat->hat_next == NULL
! *
! * - kas.a_hat->hat_prev points to the last of the user hats.
! * The list begins where hat->hat_prev == NULL
*/
! mutex_enter(&hat_list_lock);
! hat->hat_prev = NULL;
! hat->hat_next = kas.a_hat->hat_next;
! if (hat->hat_next)
! hat->hat_next->hat_prev = hat;
! else
! kas.a_hat->hat_prev = hat;
! kas.a_hat->hat_next = hat;
! mutex_exit(&hat_list_lock);
return (hat);
}
/*
* process has finished executing but as has not been cleaned up yet.
*/
/*ARGSUSED*/
--- 598,655 ----
xen_pin(hat->hat_user_ptable, mmu.max_level);
#endif
#endif
XPV_ALLOW_MIGRATE();
+ hat_list_append(hat);
+
+ return (hat);
+ }
+
+ #if !defined(__xpv)
+ /*
+ * Cons up a HAT for a CPU. This represents the user mappings. This will have
+ * various kernel pages punched into it manually. Importantly, this hat is
+ * ineligible for stealing. We really don't want to deal with this ever
+ * faulting and figuring out that this is happening, much like we don't with
+ * kas.
+ */
+ static hat_t *
+ hat_cpu_alloc(cpu_t *cpu)
+ {
+ hat_t *hat;
+ htable_t *ht;
+
+ hat = kmem_cache_alloc(hat_cache, KM_SLEEP);
+ hat->hat_as = NULL;
+ mutex_init(&hat->hat_mutex, NULL, MUTEX_DEFAULT, NULL);
+ hat->hat_max_level = mmu.max_level;
+ hat->hat_num_copied = 0;
+ hat->hat_flags = HAT_PCP;
+
+ hat->hat_num_hash = mmu.hash_cnt;
+ hat->hat_ht_hash = kmem_cache_alloc(hat_hash_cache, KM_SLEEP);
+ bzero(hat->hat_ht_hash, hat->hat_num_hash * sizeof (htable_t *));
+
+ hat->hat_next = hat->hat_prev = NULL;
+
/*
! * Because this HAT will only ever be used by the current CPU, we'll go
! * ahead and set the CPUSET up to only point to the CPU in question.
*/
! CPUSET_ADD(hat->hat_cpus, cpu->cpu_id);
+ hat->hat_htable = NULL;
+ hat->hat_ht_cached = NULL;
+ ht = htable_create(hat, (uintptr_t)0, TOP_LEVEL(hat), NULL);
+ hat->hat_htable = ht;
+
+ hat_list_append(hat);
+
return (hat);
}
+ #endif /* !__xpv */
/*
* process has finished executing but as has not been cleaned up yet.
*/
/*ARGSUSED*/
*** 439,448 ****
--- 702,712 ----
#if defined(__xpv)
/*
* On the hypervisor, unpin top level page table(s)
*/
+ VERIFY3U(hat->hat_flags & HAT_PCP, ==, 0);
xen_unpin(hat->hat_htable->ht_pfn);
#if defined(__amd64)
xen_unpin(hat->hat_user_ptable);
#endif
#endif
*** 453,470 ****
htable_purge_hat(hat);
/*
* Decide which kmem cache the hash table came from, then free it.
*/
! if (hat->hat_flags & HAT_VLP)
! cache = vlp_hash_cache;
! else
cache = hat_hash_cache;
kmem_cache_free(cache, hat->hat_ht_hash);
hat->hat_ht_hash = NULL;
hat->hat_flags = 0;
kmem_cache_free(hat_cache, hat);
}
/*
* round kernelbase down to a supported value to use for _userlimit
--- 717,745 ----
htable_purge_hat(hat);
/*
* Decide which kmem cache the hash table came from, then free it.
*/
! if (hat->hat_flags & HAT_COPIED) {
! #if defined(__amd64)
! if (hat->hat_flags & HAT_COPIED_32) {
! cache = hat32_hash_cache;
! } else {
cache = hat_hash_cache;
+ }
+ #else
+ cache = hat32_hash_cache;
+ #endif
+ } else {
+ cache = hat_hash_cache;
+ }
kmem_cache_free(cache, hat->hat_ht_hash);
hat->hat_ht_hash = NULL;
hat->hat_flags = 0;
+ hat->hat_max_level = 0;
+ hat->hat_num_copied = 0;
kmem_cache_free(hat_cache, hat);
}
/*
* round kernelbase down to a supported value to use for _userlimit
*** 515,524 ****
--- 790,835 ----
else
mmu.umax_page_level = lvl;
}
/*
+ * Determine the number of slots that are in use in the top-most level page
+ * table for user memory. This is based on _userlimit. In effect this is similar
+ * to htable_va2entry, but without the convenience of having an htable.
+ */
+ void
+ mmu_calc_user_slots(void)
+ {
+ uint_t ent, nptes;
+ uintptr_t shift;
+
+ nptes = mmu.top_level_count;
+ shift = _userlimit >> mmu.level_shift[mmu.max_level];
+ ent = shift & (nptes - 1);
+
+ /*
+ * Ent tells us the slot that the page for _userlimit would fit in. We
+ * need to add one to this to cover the total number of entries.
+ */
+ mmu.top_level_uslots = ent + 1;
+
+ /*
+ * When running 32-bit compatibility processes on a 64-bit kernel, we
+ * will only need to use one slot.
+ */
+ mmu.top_level_uslots32 = 1;
+
+ /*
+ * Record the number of PCP page table entries that we'll need to copy
+ * around. For 64-bit processes this is the number of user slots. For
+ * 32-bit processes, this is four 1 GiB pages.
+ */
+ mmu.num_copied_ents = mmu.top_level_uslots;
+ mmu.num_copied_ents32 = 4;
+ }
+
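A worked example of the arithmetic above, as a tiny standalone program; the _userlimit value is assumed purely for illustration (the real value is established at startup and moves down as more DRAM pushes the user address range down):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	uint64_t userlimit = 0xfffffd8000000000ULL;	/* assumed, for illustration */
	unsigned int nptes = 512;			/* mmu.top_level_count */
	unsigned int shift = 39;			/* mmu.level_shift[3] on amd64 */
	unsigned int ent = (unsigned int)((userlimit >> shift) & (nptes - 1));

	printf("top_level_uslots = %u\n", ent + 1);	/* prints 508 for this value */
	return (0);
}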
+ /*
* Initialize hat data structures based on processor MMU information.
*/
void
mmu_init(void)
{
*** 533,543 ****
--- 844,865 ----
*/
if (is_x86_feature(x86_featureset, X86FSET_PGE) &&
(getcr4() & CR4_PGE) != 0)
mmu.pt_global = PT_GLOBAL;
+ #if !defined(__xpv)
/*
+ * The 64-bit x86 kernel has split user/kernel page tables. As such we
+ * cannot have the global bit set. The simplest way for us to deal with
+ * this is to just say that pt_global is zero, so the global bit isn't
+ * present.
+ */
+ if (kpti_enable == 1)
+ mmu.pt_global = 0;
+ #endif
+
+ /*
* Detect NX and PAE usage.
*/
mmu.pae_hat = kbm_pae_support;
if (kbm_nx_support)
mmu.pt_nx = PT_NX;
*** 591,600 ****
--- 913,927 ----
mmu.num_level = 4;
mmu.max_level = 3;
mmu.ptes_per_table = 512;
mmu.top_level_count = 512;
+ /*
+ * 32-bit processes only use 1 GB ptes.
+ */
+ mmu.max_level32 = 2;
+
mmu.level_shift[0] = 12;
mmu.level_shift[1] = 21;
mmu.level_shift[2] = 30;
mmu.level_shift[3] = 39;
*** 627,636 ****
--- 954,964 ----
mmu.level_offset[i] = mmu.level_size[i] - 1;
mmu.level_mask[i] = ~mmu.level_offset[i];
}
set_max_page_level();
+ mmu_calc_user_slots();
mmu_page_sizes = mmu.max_page_level + 1;
mmu_exported_page_sizes = mmu.umax_page_level + 1;
/* restrict legacy applications from using pagesizes 1g and above */
*** 662,672 ****
*/
max_htables = physmax / mmu.ptes_per_table;
mmu.hash_cnt = MMU_PAGESIZE / sizeof (htable_t *);
while (mmu.hash_cnt > 16 && mmu.hash_cnt >= max_htables)
mmu.hash_cnt >>= 1;
! mmu.vlp_hash_cnt = mmu.hash_cnt;
#if defined(__amd64)
/*
* If running in 64 bits and physical memory is large,
* increase the size of the cache to cover all of memory for
--- 990,1000 ----
*/
max_htables = physmax / mmu.ptes_per_table;
mmu.hash_cnt = MMU_PAGESIZE / sizeof (htable_t *);
while (mmu.hash_cnt > 16 && mmu.hash_cnt >= max_htables)
mmu.hash_cnt >>= 1;
! mmu.hat32_hash_cnt = mmu.hash_cnt;
#if defined(__amd64)
/*
* If running in 64 bits and physical memory is large,
* increase the size of the cache to cover all of memory for
*** 711,728 ****
hat_hash_cache = kmem_cache_create("HatHash",
mmu.hash_cnt * sizeof (htable_t *), 0, NULL, NULL, NULL,
NULL, 0, 0);
/*
! * VLP hats can use a smaller hash table size on large memroy machines
*/
! if (mmu.hash_cnt == mmu.vlp_hash_cnt) {
! vlp_hash_cache = hat_hash_cache;
} else {
! vlp_hash_cache = kmem_cache_create("HatVlpHash",
! mmu.vlp_hash_cnt * sizeof (htable_t *), 0, NULL, NULL, NULL,
! NULL, 0, 0);
}
/*
* Set up the kernel's hat
*/
--- 1039,1057 ----
hat_hash_cache = kmem_cache_create("HatHash",
mmu.hash_cnt * sizeof (htable_t *), 0, NULL, NULL, NULL,
NULL, 0, 0);
/*
! * 32-bit PCP hats can use a smaller hash table size on large memory
! * machines
*/
! if (mmu.hash_cnt == mmu.hat32_hash_cnt) {
! hat32_hash_cache = hat_hash_cache;
} else {
! hat32_hash_cache = kmem_cache_create("Hat32Hash",
! mmu.hat32_hash_cnt * sizeof (htable_t *), 0, NULL, NULL,
! NULL, NULL, 0, 0);
}
/*
* Set up the kernel's hat
*/
*** 735,744 ****
--- 1064,1080 ----
CPUSET_ZERO(khat_cpuset);
CPUSET_ADD(khat_cpuset, CPU->cpu_id);
/*
+ * The kernel HAT doesn't use PCP, regardless of architecture.
+ */
+ ASSERT3U(mmu.max_level, >, 0);
+ kas.a_hat->hat_max_level = mmu.max_level;
+ kas.a_hat->hat_num_copied = 0;
+
+ /*
* The kernel hat's next pointer serves as the head of the hat list .
* The kernel hat's prev pointer tracks the last hat on the list for
* htable_steal() to use.
*/
kas.a_hat->hat_next = NULL;
*** 766,826 ****
*/
hrm_hashtab = kmem_zalloc(HRM_HASHSIZE * sizeof (struct hrmstat *),
KM_SLEEP);
}
/*
! * Prepare CPU specific pagetables for VLP processes on 64 bit kernels.
*
* Each CPU has a set of 2 pagetables that are reused for any 32 bit
! * process it runs. They are the top level pagetable, hci_vlp_l3ptes, and
! * the next to top level table for the bottom 512 Gig, hci_vlp_l2ptes.
*/
/*ARGSUSED*/
static void
! hat_vlp_setup(struct cpu *cpu)
{
! #if defined(__amd64) && !defined(__xpv)
struct hat_cpu_info *hci = cpu->cpu_hat_info;
! pfn_t pfn;
/*
* allocate the level==2 page table for the bottom most
* 512Gig of address space (this is where 32 bit apps live)
*/
ASSERT(hci != NULL);
! hci->hci_vlp_l2ptes = kmem_zalloc(MMU_PAGESIZE, KM_SLEEP);
/*
* Allocate a top level pagetable and copy the kernel's
! * entries into it. Then link in hci_vlp_l2ptes in the 1st entry.
*/
! hci->hci_vlp_l3ptes = kmem_zalloc(MMU_PAGESIZE, KM_SLEEP);
! hci->hci_vlp_pfn =
! hat_getpfnum(kas.a_hat, (caddr_t)hci->hci_vlp_l3ptes);
! ASSERT(hci->hci_vlp_pfn != PFN_INVALID);
! bcopy(vlp_page, hci->hci_vlp_l3ptes, MMU_PAGESIZE);
! pfn = hat_getpfnum(kas.a_hat, (caddr_t)hci->hci_vlp_l2ptes);
! ASSERT(pfn != PFN_INVALID);
! hci->hci_vlp_l3ptes[0] = MAKEPTP(pfn, 2);
! #endif /* __amd64 && !__xpv */
}
/*ARGSUSED*/
static void
! hat_vlp_teardown(cpu_t *cpu)
{
! #if defined(__amd64) && !defined(__xpv)
struct hat_cpu_info *hci;
if ((hci = cpu->cpu_hat_info) == NULL)
return;
! if (hci->hci_vlp_l2ptes)
! kmem_free(hci->hci_vlp_l2ptes, MMU_PAGESIZE);
! if (hci->hci_vlp_l3ptes)
! kmem_free(hci->hci_vlp_l3ptes, MMU_PAGESIZE);
#endif
}
#define NEXT_HKR(r, l, s, e) { \
kernel_ranges[r].hkr_level = l; \
--- 1102,1270 ----
*/
hrm_hashtab = kmem_zalloc(HRM_HASHSIZE * sizeof (struct hrmstat *),
KM_SLEEP);
}
+
+ extern void kpti_tramp_start();
+ extern void kpti_tramp_end();
+
+ extern void kdi_isr_start();
+ extern void kdi_isr_end();
+
+ extern gate_desc_t kdi_idt[NIDT];
+
/*
! * Prepare per-CPU pagetables for all processes on the 64 bit kernel.
*
* Each CPU has a set of 2 pagetables that are reused for any 32 bit
! * process it runs. They are the top level pagetable, hci_pcp_l3ptes, and
! * the next to top level table for the bottom 512 Gig, hci_pcp_l2ptes.
*/
/*ARGSUSED*/
static void
! hat_pcp_setup(struct cpu *cpu)
{
! #if !defined(__xpv)
struct hat_cpu_info *hci = cpu->cpu_hat_info;
! uintptr_t va;
! size_t len;
/*
* allocate the level==2 page table for the bottom most
* 512Gig of address space (this is where 32 bit apps live)
*/
ASSERT(hci != NULL);
! hci->hci_pcp_l2ptes = kmem_zalloc(MMU_PAGESIZE, KM_SLEEP);
/*
* Allocate a top level pagetable and copy the kernel's
! * entries into it. Then link in hci_pcp_l2ptes in the 1st entry.
*/
! hci->hci_pcp_l3ptes = kmem_zalloc(MMU_PAGESIZE, KM_SLEEP);
! hci->hci_pcp_l3pfn =
! hat_getpfnum(kas.a_hat, (caddr_t)hci->hci_pcp_l3ptes);
! ASSERT3U(hci->hci_pcp_l3pfn, !=, PFN_INVALID);
! bcopy(pcp_page, hci->hci_pcp_l3ptes, MMU_PAGESIZE);
! hci->hci_pcp_l2pfn =
! hat_getpfnum(kas.a_hat, (caddr_t)hci->hci_pcp_l2ptes);
! ASSERT3U(hci->hci_pcp_l2pfn, !=, PFN_INVALID);
!
! /*
! * Now go through and allocate the user version of these structures.
! * Unlike with the kernel version, we allocate a hat to represent the
! * top-level page table as that will make it much simpler when we need
! * to patch through user entries.
! */
! hci->hci_user_hat = hat_cpu_alloc(cpu);
! hci->hci_user_l3pfn = hci->hci_user_hat->hat_htable->ht_pfn;
! ASSERT3U(hci->hci_user_l3pfn, !=, PFN_INVALID);
! hci->hci_user_l3ptes =
! (x86pte_t *)hat_kpm_mapin_pfn(hci->hci_user_l3pfn);
!
! /* Skip the rest of this if KPTI is switched off at boot. */
! if (kpti_enable != 1)
! return;
!
! /*
! * OK, now that we have this we need to go through and punch the normal
! * holes in the CPU's hat for this. At this point we'll punch in the
! * following:
! *
! * o GDT
! * o IDT
! * o LDT
! * o Trampoline Code
! * o machcpu KPTI page
! * o kmdb ISR code page (just trampolines)
! *
! * If this is cpu0, then we also can initialize the following because
! * they'll have already been allocated.
! *
! * o TSS for CPU 0
! * o Double Fault for CPU 0
! *
! * The following items have yet to be allocated and have not been
! * punched in yet. They will be punched in later:
! *
! * o TSS (mach_cpucontext_alloc_tables())
! * o Double Fault Stack (mach_cpucontext_alloc_tables())
! */
! hati_cpu_punchin(cpu, (uintptr_t)cpu->cpu_gdt, PROT_READ);
! hati_cpu_punchin(cpu, (uintptr_t)cpu->cpu_idt, PROT_READ);
!
! /*
! * As the KDI IDT is only active during kmdb sessions (including single
! * stepping), typically we don't actually need this punched in (we
! * consider the routines that switch to the user cr3 to be toxic). But
! * if we ever accidentally end up on the user cr3 while on this IDT,
! * we'd prefer not to triple fault.
! */
! hati_cpu_punchin(cpu, (uintptr_t)&kdi_idt, PROT_READ);
!
! CTASSERT(((uintptr_t)&kpti_tramp_start % MMU_PAGESIZE) == 0);
! CTASSERT(((uintptr_t)&kpti_tramp_end % MMU_PAGESIZE) == 0);
! for (va = (uintptr_t)&kpti_tramp_start;
! va < (uintptr_t)&kpti_tramp_end; va += MMU_PAGESIZE) {
! hati_cpu_punchin(cpu, va, PROT_READ | PROT_EXEC);
! }
!
! VERIFY3U(((uintptr_t)cpu->cpu_m.mcpu_ldt) % MMU_PAGESIZE, ==, 0);
! for (va = (uintptr_t)cpu->cpu_m.mcpu_ldt, len = LDT_CPU_SIZE;
! len >= MMU_PAGESIZE; va += MMU_PAGESIZE, len -= MMU_PAGESIZE) {
! hati_cpu_punchin(cpu, va, PROT_READ);
! }
!
! /* mcpu_pad2 is the start of the page containing the kpti_frames. */
! hati_cpu_punchin(cpu, (uintptr_t)&cpu->cpu_m.mcpu_pad2[0],
! PROT_READ | PROT_WRITE);
!
! if (cpu == &cpus[0]) {
! /*
! * CPU0 uses a global for its double fault stack to deal with
! * the chicken and egg problem. We need to punch it into its
! * user HAT.
! */
! extern char dblfault_stack0[];
!
! hati_cpu_punchin(cpu, (uintptr_t)cpu->cpu_m.mcpu_tss,
! PROT_READ);
!
! for (va = (uintptr_t)dblfault_stack0,
! len = DEFAULTSTKSZ; len >= MMU_PAGESIZE;
! va += MMU_PAGESIZE, len -= MMU_PAGESIZE) {
! hati_cpu_punchin(cpu, va, PROT_READ | PROT_WRITE);
! }
! }
!
! CTASSERT(((uintptr_t)&kdi_isr_start % MMU_PAGESIZE) == 0);
! CTASSERT(((uintptr_t)&kdi_isr_end % MMU_PAGESIZE) == 0);
! for (va = (uintptr_t)&kdi_isr_start;
! va < (uintptr_t)&kdi_isr_end; va += MMU_PAGESIZE) {
! hati_cpu_punchin(cpu, va, PROT_READ | PROT_EXEC);
! }
! #endif /* !__xpv */
}
/*ARGSUSED*/
static void
! hat_pcp_teardown(cpu_t *cpu)
{
! #if !defined(__xpv)
struct hat_cpu_info *hci;
if ((hci = cpu->cpu_hat_info) == NULL)
return;
! if (hci->hci_pcp_l2ptes != NULL)
! kmem_free(hci->hci_pcp_l2ptes, MMU_PAGESIZE);
! if (hci->hci_pcp_l3ptes != NULL)
! kmem_free(hci->hci_pcp_l3ptes, MMU_PAGESIZE);
! if (hci->hci_user_hat != NULL) {
! hat_free_start(hci->hci_user_hat);
! hat_free_end(hci->hci_user_hat);
! }
#endif
}
#define NEXT_HKR(r, l, s, e) { \
kernel_ranges[r].hkr_level = l; \
*** 912,936 ****
}
/*
* 32 bit PAE metal kernels use only 4 of the 512 entries in the
* page holding the top level pagetable. We use the remainder for
! * the "per CPU" page tables for VLP processes.
* Map the top level kernel pagetable into the kernel to make
* it easy to use bcopy to access these tables.
*/
if (mmu.pae_hat) {
! vlp_page = vmem_alloc(heap_arena, MMU_PAGESIZE, VM_SLEEP);
! hat_devload(kas.a_hat, (caddr_t)vlp_page, MMU_PAGESIZE,
kas.a_hat->hat_htable->ht_pfn,
#if !defined(__xpv)
PROT_WRITE |
#endif
PROT_READ | HAT_NOSYNC | HAT_UNORDERED_OK,
HAT_LOAD | HAT_LOAD_NOCONSIST);
}
! hat_vlp_setup(CPU);
/*
* Create kmap (cached mappings of kernel PTEs)
* for 32 bit we map from segmap_start .. ekernelheap
* for 64 bit we map from segmap_start .. segmap_start + segmapsize;
--- 1356,1383 ----
}
/*
* 32 bit PAE metal kernels use only 4 of the 512 entries in the
* page holding the top level pagetable. We use the remainder for
! * the "per CPU" page tables for PCP processes.
* Map the top level kernel pagetable into the kernel to make
* it easy to use bcopy to access these tables.
+ *
+ * PAE is always present on the 64-bit kernel, which also uses this page for
+ * its per-CPU pagetables. See the big theory statement.
*/
if (mmu.pae_hat) {
! pcp_page = vmem_alloc(heap_arena, MMU_PAGESIZE, VM_SLEEP);
! hat_devload(kas.a_hat, (caddr_t)pcp_page, MMU_PAGESIZE,
kas.a_hat->hat_htable->ht_pfn,
#if !defined(__xpv)
PROT_WRITE |
#endif
PROT_READ | HAT_NOSYNC | HAT_UNORDERED_OK,
HAT_LOAD | HAT_LOAD_NOCONSIST);
}
! hat_pcp_setup(CPU);
/*
* Create kmap (cached mappings of kernel PTEs)
* for 32 bit we map from segmap_start .. ekernelheap
* for 64 bit we map from segmap_start .. segmap_start + segmapsize;
*** 939,948 ****
--- 1386,1401 ----
size = (uintptr_t)ekernelheap - segmap_start;
#elif defined(__amd64)
size = segmapsize;
#endif
hat_kmap_init((uintptr_t)segmap_start, size);
+
+ #if !defined(__xpv)
+ ASSERT3U(kas.a_hat->hat_htable->ht_pfn, !=, PFN_INVALID);
+ ASSERT3U(kpti_safe_cr3, ==,
+ MAKECR3(kas.a_hat->hat_htable->ht_pfn, PCID_KERNEL));
+ #endif
}
/*
* On 32 bit PAE mode, PTE's are 64 bits, but ordinary atomic memory references
* are 32 bit, so for safety we must use atomic_cas_64() to install these.
*** 956,971 ****
x86pte_t pte;
int i;
/*
* Load the 4 entries of the level 2 page table into this
! * cpu's range of the vlp_page and point cr3 at them.
*/
ASSERT(mmu.pae_hat);
! src = hat->hat_vlp_ptes;
! dest = vlp_page + (cpu->cpu_id + 1) * VLP_NUM_PTES;
! for (i = 0; i < VLP_NUM_PTES; ++i) {
for (;;) {
pte = dest[i];
if (pte == src[i])
break;
if (atomic_cas_64(dest + i, pte, src[i]) != src[i])
--- 1409,1424 ----
x86pte_t pte;
int i;
/*
* Load the 4 entries of the level 2 page table into this
! * cpu's range of the pcp_page and point cr3 at them.
*/
ASSERT(mmu.pae_hat);
! src = hat->hat_copied_ptes;
! dest = pcp_page + (cpu->cpu_id + 1) * MAX_COPIED_PTES;
! for (i = 0; i < MAX_COPIED_PTES; ++i) {
for (;;) {
pte = dest[i];
if (pte == src[i])
break;
if (atomic_cas_64(dest + i, pte, src[i]) != src[i])
*** 974,992 ****
}
}
#endif
/*
* Switch to a new active hat, maintaining bit masks to track active CPUs.
*
! * On the 32-bit PAE hypervisor, %cr3 is a 64-bit value, on metal it
! * remains a 32-bit value.
*/
void
hat_switch(hat_t *hat)
{
- uint64_t newcr3;
cpu_t *cpu = CPU;
hat_t *old = cpu->cpu_current_hat;
/*
* set up this information first, so we don't miss any cross calls
--- 1427,1593 ----
}
}
#endif
/*
+ * Update the PCP data on CPU 'cpu' to match the hat's. If this is a 32-bit
+ * process, then we must update the L2 pages and then the L3. If this is a
+ * 64-bit process then we must update the L3 entries.
+ */
+ static void
+ hat_pcp_update(cpu_t *cpu, const hat_t *hat)
+ {
+ ASSERT3U(hat->hat_flags & HAT_COPIED, !=, 0);
+
+ if ((hat->hat_flags & HAT_COPIED_32) != 0) {
+ const x86pte_t *l2src;
+ x86pte_t *l2dst, *l3ptes, *l3uptes;
+ /*
+ * This is a 32-bit process. To set this up, we need to do the
+ * following:
+ *
+ * - Copy the 4 L2 PTEs into the dedicated L2 table
+ * - Zero the user L3 PTEs in the user and kernel page table
+ * - Set the first L3 PTE to point to the CPU L2 table
+ */
+ l2src = hat->hat_copied_ptes;
+ l2dst = cpu->cpu_hat_info->hci_pcp_l2ptes;
+ l3ptes = cpu->cpu_hat_info->hci_pcp_l3ptes;
+ l3uptes = cpu->cpu_hat_info->hci_user_l3ptes;
+
+ l2dst[0] = l2src[0];
+ l2dst[1] = l2src[1];
+ l2dst[2] = l2src[2];
+ l2dst[3] = l2src[3];
+
+ /*
+ * Make sure to use the mmu to get the number of slots. The
+ * number of PCP entries that this has will always be less, as
+ * it's a 32-bit process.
+ */
+ bzero(l3ptes, sizeof (x86pte_t) * mmu.top_level_uslots);
+ l3ptes[0] = MAKEPTP(cpu->cpu_hat_info->hci_pcp_l2pfn, 2);
+ bzero(l3uptes, sizeof (x86pte_t) * mmu.top_level_uslots);
+ l3uptes[0] = MAKEPTP(cpu->cpu_hat_info->hci_pcp_l2pfn, 2);
+ } else {
+ /*
+ * This is a 64-bit process. To set this up, we need to do the
+ * following:
+ *
+ * - Zero the 4 L2 PTEs in the CPU structure for safety
+ * - Copy over the new user L3 PTEs into the kernel page table
+ * - Copy over the new user L3 PTEs into the user page table
+ */
+ ASSERT3S(kpti_enable, ==, 1);
+ bzero(cpu->cpu_hat_info->hci_pcp_l2ptes, sizeof (x86pte_t) * 4);
+ bcopy(hat->hat_copied_ptes, cpu->cpu_hat_info->hci_pcp_l3ptes,
+ sizeof (x86pte_t) * mmu.top_level_uslots);
+ bcopy(hat->hat_copied_ptes, cpu->cpu_hat_info->hci_user_l3ptes,
+ sizeof (x86pte_t) * mmu.top_level_uslots);
+ }
+ }
+
+ static void
+ reset_kpti(struct kpti_frame *fr, uint64_t kcr3, uint64_t ucr3)
+ {
+ ASSERT3U(fr->kf_tr_flag, ==, 0);
+ #if DEBUG
+ if (fr->kf_kernel_cr3 != 0) {
+ ASSERT3U(fr->kf_lower_redzone, ==, 0xdeadbeefdeadbeef);
+ ASSERT3U(fr->kf_middle_redzone, ==, 0xdeadbeefdeadbeef);
+ ASSERT3U(fr->kf_upper_redzone, ==, 0xdeadbeefdeadbeef);
+ }
+ #endif
+
+ bzero(fr, offsetof(struct kpti_frame, kf_kernel_cr3));
+ bzero(&fr->kf_unused, sizeof (struct kpti_frame) -
+ offsetof(struct kpti_frame, kf_unused));
+
+ fr->kf_kernel_cr3 = kcr3;
+ fr->kf_user_cr3 = ucr3;
+ fr->kf_tr_ret_rsp = (uintptr_t)&fr->kf_tr_rsp;
+
+ fr->kf_lower_redzone = 0xdeadbeefdeadbeef;
+ fr->kf_middle_redzone = 0xdeadbeefdeadbeef;
+ fr->kf_upper_redzone = 0xdeadbeefdeadbeef;
+ }
+
+ #ifdef __xpv
+ static void
+ hat_switch_xen(hat_t *hat)
+ {
+ struct mmuext_op t[2];
+ uint_t retcnt;
+ uint_t opcnt = 1;
+ uint64_t newcr3;
+
+ ASSERT(!(hat->hat_flags & HAT_COPIED));
+ ASSERT(!(getcr4() & CR4_PCIDE));
+
+ newcr3 = MAKECR3((uint64_t)hat->hat_htable->ht_pfn, PCID_NONE);
+
+ t[0].cmd = MMUEXT_NEW_BASEPTR;
+ t[0].arg1.mfn = mmu_btop(pa_to_ma(newcr3));
+
+ /*
+ * There's an interesting problem here, as to what to actually specify
+ * when switching to the kernel hat. For now we'll reuse the kernel hat
+ * again.
+ */
+ t[1].cmd = MMUEXT_NEW_USER_BASEPTR;
+ if (hat == kas.a_hat)
+ t[1].arg1.mfn = mmu_btop(pa_to_ma(newcr3));
+ else
+ t[1].arg1.mfn = pfn_to_mfn(hat->hat_user_ptable);
+ ++opcnt;
+
+ if (HYPERVISOR_mmuext_op(t, opcnt, &retcnt, DOMID_SELF) < 0)
+ panic("HYPERVISOR_mmu_update() failed");
+ ASSERT(retcnt == opcnt);
+ }
+ #endif /* __xpv */
+
+ /*
* Switch to a new active hat, maintaining bit masks to track active CPUs.
*
! * With KPTI, all our HATs except kas should be using PCP. Thus, to switch
! * HATs, we need to copy over the new user PTEs, then set our trampoline context
! * as appropriate.
! *
! * If lacking PCID, we then load our new cr3, which will flush the TLB: we may
! * have established userspace TLB entries via kernel accesses, and these are no
! * longer valid. We have to do this eagerly, as we just deleted this CPU from
! * ->hat_cpus, so would no longer see any TLB shootdowns.
! *
! * With PCID enabled, things get a little more complicated. We would like to
! * keep TLB context around when entering and exiting the kernel, and to do this,
! * we partition the TLB into two different spaces:
! *
! * PCID_KERNEL is defined as zero, and used both by kas and all other address
! * spaces while in the kernel (post-trampoline).
! *
! * PCID_USER is used while in userspace. Therefore, userspace cannot use any
! * lingering PCID_KERNEL entries to kernel addresses it should not be able to
! * read.
! *
! * The trampoline cr3s are set not to invalidate on a mov to %cr3. This means if
! * we take a journey through the kernel without switching HATs, we have some
! * hope of keeping our TLB state around.
! *
! * On a hat switch, rather than deal with any necessary flushes on the way out
! * of the trampolines, we do them upfront here. If we're switching from kas, we
! * shouldn't need any invalidation.
! *
! * Otherwise, we can have stale userspace entries for both PCID_USER (what
! * happened before we move onto the kcr3) and PCID_KERNEL (any subsequent
! * userspace accesses such as ddi_copyin()). Since setcr3() won't do these
! * flushes on its own in PCIDE, we'll do a non-flushing load and then
! * invalidate everything.
*/
void
hat_switch(hat_t *hat)
{
cpu_t *cpu = CPU;
hat_t *old = cpu->cpu_current_hat;
/*
* set up this information first, so we don't miss any cross calls
*** 1004,1059 ****
if (hat != kas.a_hat) {
CPUSET_ATOMIC_ADD(hat->hat_cpus, cpu->cpu_id);
}
cpu->cpu_current_hat = hat;
! /*
! * now go ahead and load cr3
! */
! if (hat->hat_flags & HAT_VLP) {
! #if defined(__amd64)
! x86pte_t *vlpptep = cpu->cpu_hat_info->hci_vlp_l2ptes;
! VLP_COPY(hat->hat_vlp_ptes, vlpptep);
! newcr3 = MAKECR3(cpu->cpu_hat_info->hci_vlp_pfn);
! #elif defined(__i386)
! reload_pae32(hat, cpu);
! newcr3 = MAKECR3(kas.a_hat->hat_htable->ht_pfn) +
! (cpu->cpu_id + 1) * VLP_SIZE;
! #endif
} else {
! newcr3 = MAKECR3((uint64_t)hat->hat_htable->ht_pfn);
}
- #ifdef __xpv
- {
- struct mmuext_op t[2];
- uint_t retcnt;
- uint_t opcnt = 1;
! t[0].cmd = MMUEXT_NEW_BASEPTR;
! t[0].arg1.mfn = mmu_btop(pa_to_ma(newcr3));
! #if defined(__amd64)
/*
! * There's an interesting problem here, as to what to
! * actually specify when switching to the kernel hat.
! * For now we'll reuse the kernel hat again.
*/
! t[1].cmd = MMUEXT_NEW_USER_BASEPTR;
! if (hat == kas.a_hat)
! t[1].arg1.mfn = mmu_btop(pa_to_ma(newcr3));
! else
! t[1].arg1.mfn = pfn_to_mfn(hat->hat_user_ptable);
! ++opcnt;
! #endif /* __amd64 */
! if (HYPERVISOR_mmuext_op(t, opcnt, &retcnt, DOMID_SELF) < 0)
! panic("HYPERVISOR_mmu_update() failed");
! ASSERT(retcnt == opcnt);
! }
! #else
! setcr3(newcr3);
! #endif
ASSERT(cpu == CPU);
}
/*
* Utility to return a valid x86pte_t from protections, pfn, and level number
--- 1605,1671 ----
if (hat != kas.a_hat) {
CPUSET_ATOMIC_ADD(hat->hat_cpus, cpu->cpu_id);
}
cpu->cpu_current_hat = hat;
! #if defined(__xpv)
! hat_switch_xen(hat);
! #else
! struct hat_cpu_info *info = cpu->cpu_m.mcpu_hat_info;
! uint64_t pcide = getcr4() & CR4_PCIDE;
! uint64_t kcr3, ucr3;
! pfn_t tl_kpfn;
! ulong_t flag;
! EQUIV(kpti_enable, !mmu.pt_global);
!
! if (hat->hat_flags & HAT_COPIED) {
! hat_pcp_update(cpu, hat);
! tl_kpfn = info->hci_pcp_l3pfn;
} else {
! IMPLY(kpti_enable, hat == kas.a_hat);
! tl_kpfn = hat->hat_htable->ht_pfn;
}
! if (pcide) {
! ASSERT(kpti_enable);
!
! kcr3 = MAKECR3(tl_kpfn, PCID_KERNEL) | CR3_NOINVL_BIT;
! ucr3 = MAKECR3(info->hci_user_l3pfn, PCID_USER) |
! CR3_NOINVL_BIT;
!
! setcr3(kcr3);
! if (old != kas.a_hat)
! mmu_flush_tlb(FLUSH_TLB_ALL, NULL);
! } else {
! kcr3 = MAKECR3(tl_kpfn, PCID_NONE);
! ucr3 = kpti_enable ?
! MAKECR3(info->hci_user_l3pfn, PCID_NONE) :
! 0;
!
! setcr3(kcr3);
! }
!
/*
! * We will already be taking shootdowns for our new HAT, and as KPTI
! * invpcid emulation needs to use kf_user_cr3, make sure we don't get
! * any cross calls while we're inconsistent. Note that it's harmless to
! * have a *stale* kf_user_cr3 (we just did a FLUSH_TLB_ALL), but a
! * *zero* kf_user_cr3 is not going to go very well.
*/
! if (pcide)
! flag = intr_clear();
! reset_kpti(&cpu->cpu_m.mcpu_kpti, kcr3, ucr3);
! reset_kpti(&cpu->cpu_m.mcpu_kpti_flt, kcr3, ucr3);
! reset_kpti(&cpu->cpu_m.mcpu_kpti_dbg, kcr3, ucr3);
!
! if (pcide)
! intr_restore(flag);
!
! #endif /* !__xpv */
!
ASSERT(cpu == CPU);
}
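For reference on the %cr3 values composed above: architecturally, with CR4.PCIDE set, bits 11:0 of %cr3 hold the PCID, bits 51:12 hold the page frame of the top-level table, and bit 63 set on a write tells the CPU not to invalidate that PCID's TLB entries. A minimal sketch of what MAKECR3() plus CR3_NOINVL_BIT amount to (the authoritative definitions live in the mmu headers; the names here are illustrative):

#include <stdint.h>

#define	CR3_NOFLUSH_SKETCH	(1ULL << 63)	/* "don't invalidate this PCID" */

static inline uint64_t
cr3_compose(uint64_t pfn, uint64_t pcid, int noflush)
{
	uint64_t cr3 = (pfn << 12) | (pcid & 0xfff);

	return (noflush ? (cr3 | CR3_NOFLUSH_SKETCH) : cr3);
}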
/*
* Utility to return a valid x86pte_t from protections, pfn, and level number
*** 1361,1374 ****
x86_hm_exit(pp);
} else {
ASSERT(flags & HAT_LOAD_NOCONSIST);
}
#if defined(__amd64)
! if (ht->ht_flags & HTABLE_VLP) {
cpu_t *cpu = CPU;
! x86pte_t *vlpptep = cpu->cpu_hat_info->hci_vlp_l2ptes;
! VLP_COPY(hat->hat_vlp_ptes, vlpptep);
}
#endif
HTABLE_INC(ht->ht_valid_cnt);
PGCNT_INC(hat, l);
return (rv);
--- 1973,1985 ----
x86_hm_exit(pp);
} else {
ASSERT(flags & HAT_LOAD_NOCONSIST);
}
#if defined(__amd64)
! if (ht->ht_flags & HTABLE_COPIED) {
cpu_t *cpu = CPU;
! hat_pcp_update(cpu, hat);
}
#endif
HTABLE_INC(ht->ht_valid_cnt);
PGCNT_INC(hat, l);
return (rv);
*** 1436,1446 ****
* early before we blow out the kernel stack.
*/
++curthread->t_hatdepth;
ASSERT(curthread->t_hatdepth < 16);
! ASSERT(hat == kas.a_hat || AS_LOCK_HELD(hat->hat_as));
if (flags & HAT_LOAD_SHARE)
hat->hat_flags |= HAT_SHARED;
/*
--- 2047,2058 ----
* early before we blow out the kernel stack.
*/
++curthread->t_hatdepth;
ASSERT(curthread->t_hatdepth < 16);
! ASSERT(hat == kas.a_hat || (hat->hat_flags & HAT_PCP) != 0 ||
! AS_LOCK_HELD(hat->hat_as));
if (flags & HAT_LOAD_SHARE)
hat->hat_flags |= HAT_SHARED;
/*
*** 1456,1474 ****
if (ht == NULL) {
ht = htable_create(hat, va, level, NULL);
ASSERT(ht != NULL);
}
entry = htable_va2entry(va, ht);
/*
* a bunch of paranoid error checking
*/
ASSERT(ht->ht_busy > 0);
- if (ht->ht_vaddr > va || va > HTABLE_LAST_PAGE(ht))
- panic("hati_load_common: bad htable %p, va %p",
- (void *)ht, (void *)va);
ASSERT(ht->ht_level == level);
/*
* construct the new PTE
*/
--- 2068,2094 ----
if (ht == NULL) {
ht = htable_create(hat, va, level, NULL);
ASSERT(ht != NULL);
}
+ /*
+ * htable_va2entry checks this condition as well, but it won't include
+ * much useful info in the panic. So we do it in advance here to include
+ * all the context.
+ */
+ if (ht->ht_vaddr > va || va > HTABLE_LAST_PAGE(ht)) {
+ panic("hati_load_common: bad htable: va=%p, last page=%p, "
+ "ht->ht_vaddr=%p, ht->ht_level=%d", (void *)va,
+ (void *)HTABLE_LAST_PAGE(ht), (void *)ht->ht_vaddr,
+ (int)ht->ht_level);
+ }
entry = htable_va2entry(va, ht);
/*
* a bunch of paranoid error checking
*/
ASSERT(ht->ht_busy > 0);
ASSERT(ht->ht_level == level);
/*
* construct the new PTE
*/
*** 1914,2003 ****
panic("No shared region support on x86");
}
#if !defined(__xpv)
/*
! * Cross call service routine to demap a virtual page on
! * the current CPU or flush all mappings in TLB.
*/
- /*ARGSUSED*/
static int
hati_demap_func(xc_arg_t a1, xc_arg_t a2, xc_arg_t a3)
{
hat_t *hat = (hat_t *)a1;
! caddr_t addr = (caddr_t)a2;
! size_t len = (size_t)a3;
/*
* If the target hat isn't the kernel and this CPU isn't operating
* in the target hat, we can ignore the cross call.
*/
if (hat != kas.a_hat && hat != CPU->cpu_current_hat)
return (0);
! /*
! * For a normal address, we flush a range of contiguous mappings
! */
! if ((uintptr_t)addr != DEMAP_ALL_ADDR) {
! for (size_t i = 0; i < len; i += MMU_PAGESIZE)
! mmu_tlbflush_entry(addr + i);
return (0);
}
/*
! * Otherwise we reload cr3 to effect a complete TLB flush.
*
! * A reload of cr3 on a VLP process also means we must also recopy in
! * the pte values from the struct hat
*/
! if (hat->hat_flags & HAT_VLP) {
#if defined(__amd64)
! x86pte_t *vlpptep = CPU->cpu_hat_info->hci_vlp_l2ptes;
!
! VLP_COPY(hat->hat_vlp_ptes, vlpptep);
#elif defined(__i386)
reload_pae32(hat, CPU);
#endif
}
! reload_cr3();
return (0);
}
! /*
! * Flush all TLB entries, including global (ie. kernel) ones.
! */
! static void
! flush_all_tlb_entries(void)
! {
! ulong_t cr4 = getcr4();
!
! if (cr4 & CR4_PGE) {
! setcr4(cr4 & ~(ulong_t)CR4_PGE);
! setcr4(cr4);
!
! /*
! * 32 bit PAE also needs to always reload_cr3()
! */
! if (mmu.max_level == 2)
! reload_cr3();
! } else {
! reload_cr3();
! }
! }
!
! #define TLB_CPU_HALTED (01ul)
! #define TLB_INVAL_ALL (02ul)
#define CAS_TLB_INFO(cpu, old, new) \
atomic_cas_ulong((ulong_t *)&(cpu)->cpu_m.mcpu_tlb_info, (old), (new))
/*
* Record that a CPU is going idle
*/
void
tlb_going_idle(void)
{
! atomic_or_ulong((ulong_t *)&CPU->cpu_m.mcpu_tlb_info, TLB_CPU_HALTED);
}
/*
* Service a delayed TLB flush if coming out of being idle.
* It will be called from cpu idle notification with interrupt disabled.
--- 2534,2596 ----
panic("No shared region support on x86");
}
#if !defined(__xpv)
/*
! * Cross call service routine to demap a range of virtual
! * pages on the current CPU or flush all mappings in TLB.
*/
static int
hati_demap_func(xc_arg_t a1, xc_arg_t a2, xc_arg_t a3)
{
+ _NOTE(ARGUNUSED(a3));
hat_t *hat = (hat_t *)a1;
! tlb_range_t *range = (tlb_range_t *)a2;
/*
* If the target hat isn't the kernel and this CPU isn't operating
* in the target hat, we can ignore the cross call.
*/
if (hat != kas.a_hat && hat != CPU->cpu_current_hat)
return (0);
! if (range->tr_va != DEMAP_ALL_ADDR) {
! mmu_flush_tlb(FLUSH_TLB_RANGE, range);
return (0);
}
/*
! * We are flushing all of userspace.
*
! * When using PCP, we first need to update this CPU's idea of the PCP
! * PTEs.
*/
! if (hat->hat_flags & HAT_COPIED) {
#if defined(__amd64)
! hat_pcp_update(CPU, hat);
#elif defined(__i386)
reload_pae32(hat, CPU);
#endif
}
!
! mmu_flush_tlb(FLUSH_TLB_NONGLOBAL, NULL);
return (0);
}
! #define TLBIDLE_CPU_HALTED (0x1UL)
! #define TLBIDLE_INVAL_ALL (0x2UL)
#define CAS_TLB_INFO(cpu, old, new) \
atomic_cas_ulong((ulong_t *)&(cpu)->cpu_m.mcpu_tlb_info, (old), (new))
/*
* Record that a CPU is going idle
*/
void
tlb_going_idle(void)
{
! atomic_or_ulong((ulong_t *)&CPU->cpu_m.mcpu_tlb_info,
! TLBIDLE_CPU_HALTED);
}
/*
* Service a delayed TLB flush if coming out of being idle.
* It will be called from cpu idle notification with interrupt disabled.
*** 2010,2046 ****
/*
* We only have to do something if coming out of being idle.
*/
tlb_info = CPU->cpu_m.mcpu_tlb_info;
! if (tlb_info & TLB_CPU_HALTED) {
ASSERT(CPU->cpu_current_hat == kas.a_hat);
/*
* Atomic clear and fetch of old state.
*/
while ((found = CAS_TLB_INFO(CPU, tlb_info, 0)) != tlb_info) {
! ASSERT(found & TLB_CPU_HALTED);
tlb_info = found;
SMT_PAUSE();
}
! if (tlb_info & TLB_INVAL_ALL)
! flush_all_tlb_entries();
}
}
#endif /* !__xpv */
/*
* Internal routine to do cross calls to invalidate a range of pages on
* all CPUs using a given hat.
*/
void
! hat_tlb_inval_range(hat_t *hat, uintptr_t va, size_t len)
{
extern int flushes_require_xcalls; /* from mp_startup.c */
cpuset_t justme;
cpuset_t cpus_to_shootdown;
#ifndef __xpv
cpuset_t check_cpus;
cpu_t *cpup;
int c;
#endif
--- 2603,2640 ----
/*
* We only have to do something if coming out of being idle.
*/
tlb_info = CPU->cpu_m.mcpu_tlb_info;
! if (tlb_info & TLBIDLE_CPU_HALTED) {
ASSERT(CPU->cpu_current_hat == kas.a_hat);
/*
* Atomic clear and fetch of old state.
*/
while ((found = CAS_TLB_INFO(CPU, tlb_info, 0)) != tlb_info) {
! ASSERT(found & TLBIDLE_CPU_HALTED);
tlb_info = found;
SMT_PAUSE();
}
! if (tlb_info & TLBIDLE_INVAL_ALL)
! mmu_flush_tlb(FLUSH_TLB_ALL, NULL);
}
}
#endif /* !__xpv */
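Putting the two idle-CPU functions above together with the CAS that hat_tlb_inval_range() performs further down, the handshake is: an idling CPU marks itself TLBIDLE_CPU_HALTED; a CPU doing a shootdown that sees only that bit set CASes in TLBIDLE_INVAL_ALL and skips the cross call; and tlb_service() performs the deferred full flush on wakeup. A minimal userland sketch of that protocol using C11 atomics and illustrative names:

#include <stdatomic.h>
#include <stdio.h>

#define	HALTED		0x1UL
#define	INVAL_ALL	0x2UL

static _Atomic unsigned long tlb_info;

static void
going_idle(void)			/* cf. tlb_going_idle() */
{
	atomic_fetch_or(&tlb_info, HALTED);
}

static int
try_defer_flush(void)			/* cf. the CAS in hat_tlb_inval_range() */
{
	unsigned long old = HALTED;

	return (atomic_compare_exchange_strong(&tlb_info, &old,
	    HALTED | INVAL_ALL));
}

static void
service(void)				/* cf. tlb_service() */
{
	unsigned long old = atomic_exchange(&tlb_info, 0);

	if (old & INVAL_ALL)
		printf("perform the deferred full TLB flush\n");
}

int
main(void)
{
	going_idle();
	if (try_defer_flush())
		printf("cross call skipped; flush deferred\n");
	service();
	return (0);
}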
/*
* Internal routine to do cross calls to invalidate a range of pages on
* all CPUs using a given hat.
*/
void
! hat_tlb_inval_range(hat_t *hat, tlb_range_t *in_range)
{
extern int flushes_require_xcalls; /* from mp_startup.c */
cpuset_t justme;
cpuset_t cpus_to_shootdown;
+ tlb_range_t range = *in_range;
#ifndef __xpv
cpuset_t check_cpus;
cpu_t *cpup;
int c;
#endif
*** 2057,2083 ****
* entire set of user TLBs, since we don't know what addresses
* these were shared at.
*/
if (hat->hat_flags & HAT_SHARED) {
hat = kas.a_hat;
! va = DEMAP_ALL_ADDR;
}
/*
* if not running with multiple CPUs, don't use cross calls
*/
if (panicstr || !flushes_require_xcalls) {
#ifdef __xpv
! if (va == DEMAP_ALL_ADDR) {
xen_flush_tlb();
} else {
! for (size_t i = 0; i < len; i += MMU_PAGESIZE)
! xen_flush_va((caddr_t)(va + i));
}
#else
! (void) hati_demap_func((xc_arg_t)hat,
! (xc_arg_t)va, (xc_arg_t)len);
#endif
return;
}
--- 2651,2678 ----
* entire set of user TLBs, since we don't know what addresses
* these were shared at.
*/
if (hat->hat_flags & HAT_SHARED) {
hat = kas.a_hat;
! range.tr_va = DEMAP_ALL_ADDR;
}
/*
* if not running with multiple CPUs, don't use cross calls
*/
if (panicstr || !flushes_require_xcalls) {
#ifdef __xpv
! if (range.tr_va == DEMAP_ALL_ADDR) {
xen_flush_tlb();
} else {
! for (size_t i = 0; i < TLB_RANGE_LEN(&range);
! i += MMU_PAGESIZE) {
! xen_flush_va((caddr_t)(range.tr_va + i));
}
+ }
#else
! (void) hati_demap_func((xc_arg_t)hat, (xc_arg_t)&range, 0);
#endif
return;
}
*** 2107,2123 ****
cpup = cpu[c];
if (cpup == NULL)
continue;
tlb_info = cpup->cpu_m.mcpu_tlb_info;
! while (tlb_info == TLB_CPU_HALTED) {
! (void) CAS_TLB_INFO(cpup, TLB_CPU_HALTED,
! TLB_CPU_HALTED | TLB_INVAL_ALL);
SMT_PAUSE();
tlb_info = cpup->cpu_m.mcpu_tlb_info;
}
! if (tlb_info == (TLB_CPU_HALTED | TLB_INVAL_ALL)) {
HATSTAT_INC(hs_tlb_inval_delayed);
CPUSET_DEL(cpus_to_shootdown, c);
}
}
#endif
--- 2702,2718 ----
cpup = cpu[c];
if (cpup == NULL)
continue;
tlb_info = cpup->cpu_m.mcpu_tlb_info;
! while (tlb_info == TLBIDLE_CPU_HALTED) {
! (void) CAS_TLB_INFO(cpup, TLBIDLE_CPU_HALTED,
! TLBIDLE_CPU_HALTED | TLBIDLE_INVAL_ALL);
SMT_PAUSE();
tlb_info = cpup->cpu_m.mcpu_tlb_info;
}
! if (tlb_info == (TLBIDLE_CPU_HALTED | TLBIDLE_INVAL_ALL)) {
HATSTAT_INC(hs_tlb_inval_delayed);
CPUSET_DEL(cpus_to_shootdown, c);
}
}
#endif
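CAS_TLB_INFO() is a pre-existing helper in this file rather than something introduced by this change; for context, it is roughly a compare-and-swap on the per-CPU TLB state word, along these lines (a sketch, not a verbatim copy):

	/*
	 * Sketch: atomically change mcpu_tlb_info from 'old' to 'new',
	 * returning the value actually observed.  This is what lets the
	 * loop above tag a halted CPU with TLBIDLE_INVAL_ALL instead of
	 * sending it a cross call.
	 */
	#define	CAS_TLB_INFO(cpu, old, new)	\
		atomic_cas_ulong((ulong_t *)&(cpu)->cpu_m.mcpu_tlb_info, \
		    (old), (new))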
*** 2124,2158 ****
if (CPUSET_ISNULL(cpus_to_shootdown) ||
CPUSET_ISEQUAL(cpus_to_shootdown, justme)) {
#ifdef __xpv
! if (va == DEMAP_ALL_ADDR) {
xen_flush_tlb();
} else {
! for (size_t i = 0; i < len; i += MMU_PAGESIZE)
! xen_flush_va((caddr_t)(va + i));
}
#else
! (void) hati_demap_func((xc_arg_t)hat,
! (xc_arg_t)va, (xc_arg_t)len);
#endif
} else {
CPUSET_ADD(cpus_to_shootdown, CPU->cpu_id);
#ifdef __xpv
! if (va == DEMAP_ALL_ADDR) {
xen_gflush_tlb(cpus_to_shootdown);
} else {
! for (size_t i = 0; i < len; i += MMU_PAGESIZE) {
! xen_gflush_va((caddr_t)(va + i),
cpus_to_shootdown);
}
}
#else
! xc_call((xc_arg_t)hat, (xc_arg_t)va, (xc_arg_t)len,
CPUSET2BV(cpus_to_shootdown), hati_demap_func);
#endif
}
kpreempt_enable();
--- 2719,2755 ----
if (CPUSET_ISNULL(cpus_to_shootdown) ||
CPUSET_ISEQUAL(cpus_to_shootdown, justme)) {
#ifdef __xpv
! if (range.tr_va == DEMAP_ALL_ADDR) {
xen_flush_tlb();
} else {
! for (size_t i = 0; i < TLB_RANGE_LEN(&range);
! i += MMU_PAGESIZE) {
! xen_flush_va((caddr_t)(range.tr_va + i));
}
+ }
#else
! (void) hati_demap_func((xc_arg_t)hat, (xc_arg_t)&range, 0);
#endif
} else {
CPUSET_ADD(cpus_to_shootdown, CPU->cpu_id);
#ifdef __xpv
! if (range.tr_va == DEMAP_ALL_ADDR) {
xen_gflush_tlb(cpus_to_shootdown);
} else {
! for (size_t i = 0; i < TLB_RANGE_LEN(&range);
! i += MMU_PAGESIZE) {
! xen_gflush_va((caddr_t)(range.tr_va + i),
cpus_to_shootdown);
}
}
#else
! xc_call((xc_arg_t)hat, (xc_arg_t)&range, 0,
CPUSET2BV(cpus_to_shootdown), hati_demap_func);
#endif
}
kpreempt_enable();
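With the argument change above, the cross-call handler now receives a pointer to the range rather than a (va, len) pair; its shape is presumably along the following lines (a sketch inferred from the call sites, not the handler's actual body):

	/*ARGSUSED*/
	static int
	hati_demap_func(xc_arg_t a1, xc_arg_t a2, xc_arg_t a3)
	{
		hat_t *hat = (hat_t *)a1;
		tlb_range_t *range = (tlb_range_t *)a2;	/* a3 is unused now */

		/*
		 * ... flush the local TLB for 'range' on behalf of 'hat',
		 * or flush everything when tr_va is DEMAP_ALL_ADDR ...
		 */
		return (0);
	}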
*** 2159,2169 ****
}
void
hat_tlb_inval(hat_t *hat, uintptr_t va)
{
! hat_tlb_inval_range(hat, va, MMU_PAGESIZE);
}
/*
* Interior routine for HAT_UNLOADs from hat_unload_callback(),
* hat_kmap_unload() OR from hat_steal() code. This routine doesn't
--- 2756,2774 ----
}
void
hat_tlb_inval(hat_t *hat, uintptr_t va)
{
! /*
! * Create range for a single page.
! */
! tlb_range_t range;
! range.tr_va = va;
! range.tr_cnt = 1; /* one page */
! range.tr_level = MIN_PAGE_LEVEL; /* pages are MMU_PAGESIZE */
!
! hat_tlb_inval_range(hat, &range);
}
/*
* Interior routine for HAT_UNLOADs from hat_unload_callback(),
* hat_kmap_unload() OR from hat_steal() code. This routine doesn't
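For comparison with the single-page case above, a caller flushing one large-page mapping would presumably fill in the same structure with a higher level, for example (values illustrative only; assumes 'hat' and a 2MB-aligned 'va' are in scope):

	/*
	 * Illustrative sketch: invalidate a single 2MB (level 1) mapping.
	 * The level, not a byte length, tells the flush code how much
	 * address space each of the tr_cnt entries covers.
	 */
	tlb_range_t range;

	range.tr_va = va;		/* assumed 2MB-aligned */
	range.tr_cnt = 1;		/* one level-1 entry */
	range.tr_level = 1;		/* 2MB pages on amd64 */

	hat_tlb_inval_range(hat, &range);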
*** 2326,2361 ****
}
XPV_ALLOW_MIGRATE();
}
/*
- * Do the callbacks for ranges being unloaded.
- */
- typedef struct range_info {
- uintptr_t rng_va;
- ulong_t rng_cnt;
- level_t rng_level;
- } range_info_t;
-
- /*
* Invalidate the TLB, and perform the callback to the upper level VM system,
* for the specified ranges of contiguous pages.
*/
static void
! handle_ranges(hat_t *hat, hat_callback_t *cb, uint_t cnt, range_info_t *range)
{
while (cnt > 0) {
- size_t len;
-
--cnt;
! len = range[cnt].rng_cnt << LEVEL_SHIFT(range[cnt].rng_level);
! hat_tlb_inval_range(hat, (uintptr_t)range[cnt].rng_va, len);
if (cb != NULL) {
! cb->hcb_start_addr = (caddr_t)range[cnt].rng_va;
cb->hcb_end_addr = cb->hcb_start_addr;
! cb->hcb_end_addr += len;
cb->hcb_function(cb);
}
}
}
--- 2931,2955 ----
}
XPV_ALLOW_MIGRATE();
}
/*
* Invalidate the TLB, and perform the callback to the upper level VM system,
* for the specified ranges of contiguous pages.
*/
static void
! handle_ranges(hat_t *hat, hat_callback_t *cb, uint_t cnt, tlb_range_t *range)
{
while (cnt > 0) {
--cnt;
! hat_tlb_inval_range(hat, &range[cnt]);
if (cb != NULL) {
! cb->hcb_start_addr = (caddr_t)range[cnt].tr_va;
cb->hcb_end_addr = cb->hcb_start_addr;
! cb->hcb_end_addr += range[cnt].tr_cnt <<
! LEVEL_SHIFT(range[cnt].tr_level);
cb->hcb_function(cb);
}
}
}
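As a concrete check of the end-address arithmetic in handle_ranges(), consider a small example (numbers chosen for illustration only):

	/*
	 * Example: a range of 3 contiguous 4KB pages (tr_level == 0, so
	 * LEVEL_SHIFT(0) == MMU_PAGESHIFT == 12) starting at 0x1000 gives
	 *
	 *	hcb_start_addr = 0x1000
	 *	hcb_end_addr   = 0x1000 + (3 << 12) = 0x4000
	 *
	 * i.e. hcb_end_addr is exclusive, just past the last byte unloaded.
	 */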
*** 2381,2391 ****
uintptr_t vaddr = (uintptr_t)addr;
uintptr_t eaddr = vaddr + len;
htable_t *ht = NULL;
uint_t entry;
uintptr_t contig_va = (uintptr_t)-1L;
! range_info_t r[MAX_UNLOAD_CNT];
uint_t r_cnt = 0;
x86pte_t old_pte;
XPV_DISALLOW_MIGRATE();
ASSERT(hat == kas.a_hat || eaddr <= _userlimit);
--- 2975,2985 ----
uintptr_t vaddr = (uintptr_t)addr;
uintptr_t eaddr = vaddr + len;
htable_t *ht = NULL;
uint_t entry;
uintptr_t contig_va = (uintptr_t)-1L;
! tlb_range_t r[MAX_UNLOAD_CNT];
uint_t r_cnt = 0;
x86pte_t old_pte;
XPV_DISALLOW_MIGRATE();
ASSERT(hat == kas.a_hat || eaddr <= _userlimit);
*** 2421,2438 ****
/*
* We'll do the callbacks for contiguous ranges
*/
if (vaddr != contig_va ||
! (r_cnt > 0 && r[r_cnt - 1].rng_level != ht->ht_level)) {
if (r_cnt == MAX_UNLOAD_CNT) {
handle_ranges(hat, cb, r_cnt, r);
r_cnt = 0;
}
! r[r_cnt].rng_va = vaddr;
! r[r_cnt].rng_cnt = 0;
! r[r_cnt].rng_level = ht->ht_level;
++r_cnt;
}
/*
* Unload one mapping (for a single page) from the page tables.
--- 3015,3032 ----
/*
* We'll do the callbacks for contiguous ranges
*/
if (vaddr != contig_va ||
! (r_cnt > 0 && r[r_cnt - 1].tr_level != ht->ht_level)) {
if (r_cnt == MAX_UNLOAD_CNT) {
handle_ranges(hat, cb, r_cnt, r);
r_cnt = 0;
}
! r[r_cnt].tr_va = vaddr;
! r[r_cnt].tr_cnt = 0;
! r[r_cnt].tr_level = ht->ht_level;
++r_cnt;
}
/*
* Unload one mapping (for a single page) from the page tables.
*** 2446,2456 ****
entry = htable_va2entry(vaddr, ht);
hat_pte_unmap(ht, entry, flags, old_pte, NULL, B_FALSE);
ASSERT(ht->ht_level <= mmu.max_page_level);
vaddr += LEVEL_SIZE(ht->ht_level);
contig_va = vaddr;
! ++r[r_cnt - 1].rng_cnt;
}
if (ht)
htable_release(ht);
/*
--- 3040,3050 ----
entry = htable_va2entry(vaddr, ht);
hat_pte_unmap(ht, entry, flags, old_pte, NULL, B_FALSE);
ASSERT(ht->ht_level <= mmu.max_page_level);
vaddr += LEVEL_SIZE(ht->ht_level);
contig_va = vaddr;
! ++r[r_cnt - 1].tr_cnt;
}
if (ht)
htable_release(ht);
/*
*** 2475,2492 ****
sz = hat_getpagesize(hat, va);
if (sz < 0) {
#ifdef __xpv
xen_flush_tlb();
#else
! flush_all_tlb_entries();
#endif
break;
}
#ifdef __xpv
xen_flush_va(va);
#else
! mmu_tlbflush_entry(va);
#endif
va += sz;
}
}
--- 3069,3086 ----
sz = hat_getpagesize(hat, va);
if (sz < 0) {
#ifdef __xpv
xen_flush_tlb();
#else
! mmu_flush_tlb(FLUSH_TLB_ALL, NULL);
#endif
break;
}
#ifdef __xpv
xen_flush_va(va);
#else
! mmu_flush_tlb_kpage((uintptr_t)va);
#endif
va += sz;
}
}
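The mmu_flush_tlb() and mmu_flush_tlb_kpage() calls above replace flush_all_tlb_entries() and mmu_tlbflush_entry(); their declarations live in the HAT headers elsewhere in this changeset and are presumably along these lines (the enum values other than FLUSH_TLB_ALL are assumptions based on the new call sites):

	typedef enum flush_tlb_type {
		FLUSH_TLB_ALL,		/* flush everything, including global pages */
		FLUSH_TLB_NONGLOBAL,	/* flush everything except global pages */
		FLUSH_TLB_RANGE		/* flush the pages named by a tlb_range_t */
	} flush_tlb_type_t;

	extern void mmu_flush_tlb(flush_tlb_type_t, tlb_range_t *);
	extern void mmu_flush_tlb_kpage(uintptr_t);	/* one kernel page, by VA */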
*** 3148,3158 ****
}
}
/*
* flush the TLBs - since we're probably dealing with MANY mappings
! * we do just one CR3 reload.
*/
if (!(hat->hat_flags & HAT_FREEING) && need_demaps)
hat_tlb_inval(hat, DEMAP_ALL_ADDR);
/*
--- 3742,3752 ----
}
}
/*
* flush the TLBs - since we're probably dealing with MANY mappings
! * we just do a full invalidation.
*/
if (!(hat->hat_flags & HAT_FREEING) && need_demaps)
hat_tlb_inval(hat, DEMAP_ALL_ADDR);
/*
*** 3931,3941 ****
(pte_pa & MMU_PAGEOFFSET) >> mmu.pte_size_shift, NULL);
if (mmu.pae_hat)
*pteptr = 0;
else
*(x86pte32_t *)pteptr = 0;
! mmu_tlbflush_entry(addr);
x86pte_mapout();
}
#endif
ht = htable_getpte(kas.a_hat, ALIGN2PAGE(addr), NULL, NULL, 0);
--- 4525,4535 ----
(pte_pa & MMU_PAGEOFFSET) >> mmu.pte_size_shift, NULL);
if (mmu.pae_hat)
*pteptr = 0;
else
*(x86pte32_t *)pteptr = 0;
! mmu_flush_tlb_kpage((uintptr_t)addr);
x86pte_mapout();
}
#endif
ht = htable_getpte(kas.a_hat, ALIGN2PAGE(addr), NULL, NULL, 0);
*** 3992,4002 ****
(pte_pa & MMU_PAGEOFFSET) >> mmu.pte_size_shift, NULL);
if (mmu.pae_hat)
*(x86pte_t *)pteptr = pte;
else
*(x86pte32_t *)pteptr = (x86pte32_t)pte;
! mmu_tlbflush_entry(addr);
x86pte_mapout();
}
#endif
XPV_ALLOW_MIGRATE();
}
--- 4586,4596 ----
(pte_pa & MMU_PAGEOFFSET) >> mmu.pte_size_shift, NULL);
if (mmu.pae_hat)
*(x86pte_t *)pteptr = pte;
else
*(x86pte32_t *)pteptr = (x86pte32_t)pte;
! mmu_flush_tlb_kpage((uintptr_t)addr);
x86pte_mapout();
}
#endif
XPV_ALLOW_MIGRATE();
}
*** 4026,4036 ****
void
hat_cpu_online(struct cpu *cpup)
{
if (cpup != CPU) {
x86pte_cpu_init(cpup);
! hat_vlp_setup(cpup);
}
CPUSET_ATOMIC_ADD(khat_cpuset, cpup->cpu_id);
}
/*
--- 4620,4630 ----
void
hat_cpu_online(struct cpu *cpup)
{
if (cpup != CPU) {
x86pte_cpu_init(cpup);
! hat_pcp_setup(cpup);
}
CPUSET_ATOMIC_ADD(khat_cpuset, cpup->cpu_id);
}
/*
*** 4041,4051 ****
hat_cpu_offline(struct cpu *cpup)
{
ASSERT(cpup != CPU);
CPUSET_ATOMIC_DEL(khat_cpuset, cpup->cpu_id);
! hat_vlp_teardown(cpup);
x86pte_cpu_fini(cpup);
}
/*
* Function called after all CPUs are brought online.
--- 4635,4645 ----
hat_cpu_offline(struct cpu *cpup)
{
ASSERT(cpup != CPU);
CPUSET_ATOMIC_DEL(khat_cpuset, cpup->cpu_id);
! hat_pcp_teardown(cpup);
x86pte_cpu_fini(cpup);
}
/*
* Function called after all CPUs are brought online.
*** 4488,4492 ****
--- 5082,5115 ----
htable_release(ht);
htable_release(ht);
XPV_ALLOW_MIGRATE();
}
#endif /* __xpv */
+
+ /*
+ * Helper function to punch in a mapping that we need with the specified
+ * attributes.
+ */
+ void
+ hati_cpu_punchin(cpu_t *cpu, uintptr_t va, uint_t attrs)
+ {
+ int ret;
+ pfn_t pfn;
+ hat_t *cpu_hat = cpu->cpu_hat_info->hci_user_hat;
+
+ ASSERT3S(kpti_enable, ==, 1);
+ ASSERT3P(cpu_hat, !=, NULL);
+ ASSERT3U(cpu_hat->hat_flags & HAT_PCP, ==, HAT_PCP);
+ ASSERT3U(va & MMU_PAGEOFFSET, ==, 0);
+
+ pfn = hat_getpfnum(kas.a_hat, (caddr_t)va);
+ VERIFY3U(pfn, !=, PFN_INVALID);
+
+ /*
+ * We purposefully don't try to find the page_t. This means that this
+ * will be marked PT_NOCONSIST; however, given that this is pretty much
+ * a static mapping that we're using, we should be relatively OK.
+ */
+ attrs |= HAT_STORECACHING_OK;
+ ret = hati_load_common(cpu_hat, va, NULL, attrs, 0, 0, pfn);
+ VERIFY3S(ret, ==, 0);
+ }
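A hypothetical call site, to show the intended use (the symbol name below is made up for illustration; the real KPTI setup code decides which trampoline text and per-CPU data pages get punched in, and 'cpu' is assumed to be in scope):

	/*
	 * Hypothetical example: make one page-aligned kernel text page
	 * visible in this CPU's user HAT so it stays mapped while running
	 * on the user page tables.  "tr_sample_page" is an illustrative
	 * symbol, not one from this changeset.
	 */
	extern char tr_sample_page[];

	hati_cpu_punchin(cpu, (uintptr_t)tr_sample_page, PROT_READ | PROT_EXEC);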