8956 Implement KPTI
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
9208 hati_demap_func should take pagesize into account
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Tim Kordas <tim.kordas@joyent.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>

*** 25,34 **** --- 25,35 ---- * Copyright (c) 2010, Intel Corporation. * All rights reserved. */ /* * Copyright 2011 Nexenta Systems, Inc. All rights reserved. + * Copyright 2018 Joyent, Inc. All rights reserved. * Copyright (c) 2014, 2015 by Delphix. All rights reserved. */ /* * VM - Hardware Address Translation management for i386 and amd64
*** 40,49 ****
--- 41,235 ----
  * that work in conjunction with this code.
  *
  * Routines used only inside of i86pc/vm start with hati_ for HAT Internal.
  */
+ /*
+ * amd64 HAT Design
+ *
+ * ----------
+ * Background
+ * ----------
+ *
+ * On x86, the address space is shared between a user process and the kernel.
+ * This is different from SPARC. Conventionally, the kernel lives at the top of
+ * the address space and the user process gets to enjoy the rest of it. If you
+ * look at the image of the address map in uts/i86pc/os/startup.c, you'll get a
+ * rough sense of how the address space is laid out and used.
+ *
+ * Every unique address space is represented by an instance of a HAT structure
+ * called a 'hat_t'. In addition to a hat_t structure for each process, there is
+ * also one that is used for the kernel (kas.a_hat), and each CPU ultimately
+ * also has a HAT.
+ *
+ * Each HAT contains a pointer to its root page table. This root page table is
+ * what we call an L3 page table in illumos and Intel calls the PML4. It is the
+ * physical address of the L3 table that we place in the %cr3 register which the
+ * processor uses.
+ *
+ * Each of the many layers of the page table is represented by a structure
+ * called an htable_t. The htable_t manages a set of 512 8-byte entries. The
+ * number of entries in a given page table is constant across all different
+ * level page tables. Note that this is only true on amd64; this has not always
+ * been the case on x86.
+ *
+ * Each entry in a page table, generally referred to as a PTE, may refer to
+ * another page table or a memory location, depending on the level of the page
+ * table and the use of large pages. Importantly, the top-level L3 page table
+ * (PML4) only supports linking to further page tables. This is also true on
+ * systems which support a 5th level page table (which we do not currently
+ * support).
+ *
+ * Historically, on x86, when a process was running on a CPU, the root of its
+ * page table was inserted into %cr3 on each CPU on which it was currently
+ * running. When processes would switch (by calling hat_switch()), then the
+ * value in %cr3 on that CPU would change to that of the new HAT. While this
+ * behavior is still maintained in the xpv kernel, this is not what is done
+ * today.
+ *
+ * -------------------
+ * Per-CPU Page Tables
+ * -------------------
+ *
+ * Throughout the system, the 64-bit kernel has a notion of what it calls a
+ * per-CPU page table or PCP. The notion of a per-CPU page table was introduced
+ * as part of the original work to support x86 PAE. On the 64-bit kernel, it
+ * was originally used for 32-bit processes. The rationale behind this was that
+ * each 32-bit process could have all of its memory represented in a single L2
+ * page table, as each L2 page table entry represents 1 GB of memory.
+ *
+ * Following on from this, the idea was that, given that all of the L3 page
+ * table entries for 32-bit processes are basically going to be identical
+ * except for the first entry in the page table, why not share those page
+ * table entries? This gave rise to the idea of a per-CPU page table.
+ *
+ * The way this works is that we have a member in the machcpu_t called the
+ * mcpu_hat_info. That structure contains two different 4k pages: one that
+ * represents the L3 page table and one that represents an L2 page table. When
+ * the CPU starts up, the L3 page table entries are copied in from the kernel's
+ * page table.
+ * The L3 kernel entries do not change throughout the lifetime of the kernel.
+ * The kernel portion of these L3 pages for each CPU has the same entries,
+ * meaning that all CPUs point to the same L2 page tables and thus see a
+ * consistent view of the world.
+ *
+ * When a 32-bit process is loaded into this world, we copy the 32-bit process's
+ * four top-level page table entries into the CPU's L2 page table and then set
+ * the CPU's first L3 page table entry to point to the CPU's L2 page.
+ * Specifically, in hat_pcp_update(), we're copying from the process's
+ * HAT_COPIED_32 HAT into the page tables specific to this CPU.
+ *
+ * As part of the implementation of kernel page table isolation, this was also
+ * extended to 64-bit processes. When a 64-bit process runs, we'll copy its L3
+ * PTEs across into the current CPU's L3 page table. (As we can't do the
+ * first-L3-entry trick for 64-bit processes, ->hci_pcp_l2ptes is unused in
+ * this case.)
+ *
+ * The use of per-CPU page tables has a lot of implementation ramifications. A
+ * HAT that runs a user process will be flagged with the HAT_COPIED flag to
+ * indicate that it is using the per-CPU page table functionality. In tandem
+ * with the HAT, the top-level htable_t will be flagged with the HTABLE_COPIED
+ * flag. If the HAT represents a 32-bit process, then we will also set the
+ * HAT_COPIED_32 flag on that hat_t.
+ *
+ * These two flags work together. The top-level htable_t when using per-CPU
+ * page tables is 'virtual'. We never allocate a ptable for this htable_t (i.e.
+ * ht->ht_pfn is PFN_INVALID). Instead, when we need to modify a PTE in an
+ * HTABLE_COPIED ptable, x86pte_access_pagetable() will redirect any accesses
+ * to ht_hat->hat_copied_ptes.
+ *
+ * Of course, such a modification won't actually modify the HAT_PCP page tables
+ * that were copied from the HAT_COPIED htable. When we change the top level
+ * page table entries (L2 PTEs for a 32-bit process and L3 PTEs for a 64-bit
+ * process), we need to make sure to trigger hat_pcp_update() on all CPUs that
+ * are currently tied to this HAT (including the current CPU).
+ *
+ * To do this, PCP piggy-backs on TLB invalidation, specifically via the
+ * hat_tlb_inval() path from link_ptp() and unlink_ptp().
+ *
+ * (Importantly, when this is in operation, the top-level entry should not be
+ * able to refer to an actual page table entry that can be changed and
+ * consolidated into a large page. If large page consolidation is required
+ * here, then there will be much that needs to be reconsidered.)
+ *
+ * -----------------------------------------------
+ * Kernel Page Table Isolation and the Per-CPU HAT
+ * -----------------------------------------------
+ *
+ * All Intel CPUs that support speculative execution and paging are subject to
+ * a series of bugs that have been termed 'Meltdown'. These exploits allow a
+ * user process to read kernel memory through cache side channels and
+ * speculative execution. To mitigate this on vulnerable CPUs, we need to use a
+ * technique called kernel page table isolation. What this requires is that we
+ * have two different page table roots. When executing in kernel mode, we will
+ * use a %cr3 value that has both the user and kernel pages. However, when
+ * executing in user mode, we will need a %cr3 that has all of the user pages
+ * but only the subset of the kernel pages required to operate.
+ *
+ * The kernel pages that we need mapped are:
+ *
+ *   o Kernel Text that allows us to switch between the cr3 values.
+ *   o The current global descriptor table (GDT)
+ *   o The current interrupt descriptor table (IDT)
+ *   o The current task switching state (TSS)
+ *   o The current local descriptor table (LDT)
+ *   o Stacks and scratch space used by the interrupt handlers
+ *
+ * For more information on the stack switching techniques, construction of the
+ * trampolines, and more, please see i86pc/ml/kpti_trampolines.s. The most
+ * important parts of these mappings are the following two constraints:
+ *
+ *   o The mappings are all per-CPU (except for read-only text)
+ *   o The mappings are static. They are all established before the CPU is
+ *     started (with the exception of the boot CPU).
+ *
+ * To facilitate kernel page table isolation, we employ the per-CPU page
+ * tables discussed in the previous section and add the notion of a per-CPU
+ * HAT. Fundamentally, we have a second page table root. There is both a kernel
+ * page table (hci_pcp_l3ptes) and a user L3 page table (hci_user_l3ptes).
+ * Both will have the user page table entries copied into them, the same way
+ * that we discussed in the section 'Per-CPU Page Tables'.
+ *
+ * The complex part of this is how we construct the set of kernel mappings
+ * that should be present when running with the user page table. To answer
+ * that, we add the notion of a per-CPU HAT. This HAT functions like a normal
+ * HAT, except that it's not really associated with an address space the same
+ * way that other HATs are.
+ *
+ * This HAT lives in the 'struct hat_cpu_info', a member of the machcpu, as the
+ * member hci_user_hat. We use this per-CPU HAT to create the set of kernel
+ * mappings that should be present on this CPU. The kernel mappings are added
+ * to the per-CPU HAT through the function hati_cpu_punchin(). Once a mapping
+ * has been punched in, it may not be punched out. The reason that we opt to
+ * leverage a HAT structure is that it knows how to allocate and manage all of
+ * the lower level page tables as required.
+ *
+ * Because all of the mappings are present at the beginning of time for this
+ * CPU and none of the mappings are in the kernel pageable segment, we don't
+ * have to worry about faulting on these HAT structures, and thus the notion of
+ * the current HAT that we're using is always the appropriate HAT for the
+ * process (usually a user HAT or the kernel's HAT).
+ *
+ * A further constraint we place on the system with these per-CPU HATs is that
+ * they are not subject to htable_steal(). Because each CPU has a rather fixed
+ * number of page tables, and in the same way that we don't steal from the
+ * kernel's HAT, it was determined that we should not steal from this HAT due
+ * to the complications involved and the somewhat criminal nature of
+ * htable_steal().
+ *
+ * The per-CPU HAT is initialized in hat_pcp_setup() which is called as part of
+ * onlining the CPU, but before the CPU is actually started. The per-CPU HAT is
+ * removed in hat_pcp_teardown() which is called when a CPU is being offlined
+ * to be removed from the system (which is different from what psradm usually
+ * does).
+ *
+ * Finally, once the CPU has been onlined, the set of mappings in the per-CPU
+ * HAT must not change. The HAT-related functions that we call are not meant to
+ * be called when we're switching between processes.
+ * For example, it is quite possible that if they were, they would try to grab
+ * an htable mutex that another thread might hold. One needs to treat
+ * hat_switch() as though it were running above LOCK_LEVEL; it therefore
+ * _must not_ block under any circumstance.
+ */
+
  #include <sys/machparam.h>
  #include <sys/machsystm.h>
  #include <sys/mman.h>
  #include <sys/types.h>
  #include <sys/systm.h>
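To make the Background section concrete: a virtual address selects one 9-bit index per page table level, using the level shifts that mmu_init() establishes below (12, 21, 30, and 39 on amd64) and 512-entry tables. A minimal standalone sketch of that decomposition (illustrative only; the constants and sample address here are assumptions, not code from this change):

	#include <stdint.h>
	#include <stdio.h>

	/* amd64 level shifts, as set up in mmu_init(): L0..L3. */
	static const int level_shift[4] = { 12, 21, 30, 39 };

	static unsigned int
	va_level_index(uint64_t va, int level)
	{
		/* 512 entries per table, so each index is 9 bits. */
		return ((unsigned int)((va >> level_shift[level]) & 0x1ff));
	}

	int
	main(void)
	{
		uint64_t va = 0x7fffbeef1000ULL;	/* hypothetical address */
		int l;

		for (l = 3; l >= 0; l--)
			printf("L%d index: %u\n", l, va_level_index(va, l));
		return (0);
	}

For the hypothetical address above, the L3 (PML4) index is 255; each lower level contributes its own 9-bit index in the same fashion.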
*** 93,123 **** /* * The page that is the kernel's top level pagetable. * * For 32 bit PAE support on i86pc, the kernel hat will use the 1st 4 entries * on this 4K page for its top level page table. The remaining groups of ! * 4 entries are used for per processor copies of user VLP pagetables for * running threads. See hat_switch() and reload_pae32() for details. * ! * vlp_page[0..3] - level==2 PTEs for kernel HAT ! * vlp_page[4..7] - level==2 PTEs for user thread on cpu 0 ! * vlp_page[8..11] - level==2 PTE for user thread on cpu 1 * etc... */ ! static x86pte_t *vlp_page; /* * forward declaration of internal utility routines */ static x86pte_t hati_update_pte(htable_t *ht, uint_t entry, x86pte_t expected, x86pte_t new); /* ! * The kernel address space exists in all HATs. To implement this the ! * kernel reserves a fixed number of entries in the topmost level(s) of page ! * tables. The values are setup during startup and then copied to every user ! * hat created by hat_alloc(). This means that kernelbase must be: * * 4Meg aligned for 32 bit kernels * 512Gig aligned for x86_64 64 bit kernel * * The hat_kernel_range_ts describe what needs to be copied from kernel hat --- 279,312 ---- /* * The page that is the kernel's top level pagetable. * * For 32 bit PAE support on i86pc, the kernel hat will use the 1st 4 entries * on this 4K page for its top level page table. The remaining groups of ! * 4 entries are used for per processor copies of user PCP pagetables for * running threads. See hat_switch() and reload_pae32() for details. * ! * pcp_page[0..3] - level==2 PTEs for kernel HAT ! * pcp_page[4..7] - level==2 PTEs for user thread on cpu 0 ! * pcp_page[8..11] - level==2 PTE for user thread on cpu 1 * etc... + * + * On the 64-bit kernel, this is the normal root of the page table and there is + * nothing special about it when used for other CPUs. */ ! static x86pte_t *pcp_page; /* * forward declaration of internal utility routines */ static x86pte_t hati_update_pte(htable_t *ht, uint_t entry, x86pte_t expected, x86pte_t new); /* ! * The kernel address space exists in all non-HAT_COPIED HATs. To implement this ! * the kernel reserves a fixed number of entries in the topmost level(s) of page ! * tables. The values are setup during startup and then copied to every user hat ! * created by hat_alloc(). This means that kernelbase must be: * * 4Meg aligned for 32 bit kernels * 512Gig aligned for x86_64 64 bit kernel * * The hat_kernel_range_ts describe what needs to be copied from kernel hat
*** 168,178 **** */ kmutex_t hat_list_lock; kcondvar_t hat_list_cv; kmem_cache_t *hat_cache; kmem_cache_t *hat_hash_cache; ! kmem_cache_t *vlp_hash_cache; /* * Simple statistics */ struct hatstats hatstat; --- 357,367 ---- */ kmutex_t hat_list_lock; kcondvar_t hat_list_cv; kmem_cache_t *hat_cache; kmem_cache_t *hat_hash_cache; ! kmem_cache_t *hat32_hash_cache; /* * Simple statistics */ struct hatstats hatstat;
*** 186,201 **** * HAT uses cmpxchg() and the other paths (hypercall etc.) were never * incorrect. */ int pt_kern; - /* - * useful stuff for atomic access/clearing/setting REF/MOD/RO bits in page_t's. - */ - extern void atomic_orb(uchar_t *addr, uchar_t val); - extern void atomic_andb(uchar_t *addr, uchar_t val); - #ifndef __xpv extern pfn_t memseg_get_start(struct memseg *); #endif #define PP_GETRM(pp, rmmask) (pp->p_nrm & rmmask) --- 375,384 ----
*** 234,259 **** hat->hat_ht_hash = NULL; return (0); } /* * Allocate a hat structure for as. We also create the top level * htable and initialize it to contain the kernel hat entries. */ hat_t * hat_alloc(struct as *as) { hat_t *hat; htable_t *ht; /* top level htable */ ! uint_t use_vlp; uint_t r; hat_kernel_range_t *rp; uintptr_t va; uintptr_t eva; uint_t start; uint_t cnt; htable_t *src; /* * Once we start creating user process HATs we can enable * the htable_steal() code. */ --- 417,469 ---- hat->hat_ht_hash = NULL; return (0); } /* + * Put it at the start of the global list of all hats (used by stealing) + * + * kas.a_hat is not in the list but is instead used to find the + * first and last items in the list. + * + * - kas.a_hat->hat_next points to the start of the user hats. + * The list ends where hat->hat_next == NULL + * + * - kas.a_hat->hat_prev points to the last of the user hats. + * The list begins where hat->hat_prev == NULL + */ + static void + hat_list_append(hat_t *hat) + { + mutex_enter(&hat_list_lock); + hat->hat_prev = NULL; + hat->hat_next = kas.a_hat->hat_next; + if (hat->hat_next) + hat->hat_next->hat_prev = hat; + else + kas.a_hat->hat_prev = hat; + kas.a_hat->hat_next = hat; + mutex_exit(&hat_list_lock); + } + + /* * Allocate a hat structure for as. We also create the top level * htable and initialize it to contain the kernel hat entries. */ hat_t * hat_alloc(struct as *as) { hat_t *hat; htable_t *ht; /* top level htable */ ! uint_t use_copied; uint_t r; hat_kernel_range_t *rp; uintptr_t va; uintptr_t eva; uint_t start; uint_t cnt; htable_t *src; + boolean_t use_hat32_cache; /* * Once we start creating user process HATs we can enable * the htable_steal() code. */
*** 266,299 **** mutex_init(&hat->hat_mutex, NULL, MUTEX_DEFAULT, NULL); ASSERT(hat->hat_flags == 0); #if defined(__xpv) /* ! * No VLP stuff on the hypervisor due to the 64-bit split top level * page tables. On 32-bit it's not needed as the hypervisor takes * care of copying the top level PTEs to a below 4Gig page. */ ! use_vlp = 0; #else /* __xpv */ ! /* 32 bit processes uses a VLP style hat when running with PAE */ ! #if defined(__amd64) ! use_vlp = (ttoproc(curthread)->p_model == DATAMODEL_ILP32); ! #elif defined(__i386) ! use_vlp = mmu.pae_hat; ! #endif #endif /* __xpv */ ! if (use_vlp) { ! hat->hat_flags = HAT_VLP; ! bzero(hat->hat_vlp_ptes, VLP_SIZE); } /* ! * Allocate the htable hash */ ! if ((hat->hat_flags & HAT_VLP)) { ! hat->hat_num_hash = mmu.vlp_hash_cnt; ! hat->hat_ht_hash = kmem_cache_alloc(vlp_hash_cache, KM_SLEEP); } else { hat->hat_num_hash = mmu.hash_cnt; hat->hat_ht_hash = kmem_cache_alloc(hat_hash_cache, KM_SLEEP); } bzero(hat->hat_ht_hash, hat->hat_num_hash * sizeof (htable_t *)); --- 476,538 ---- mutex_init(&hat->hat_mutex, NULL, MUTEX_DEFAULT, NULL); ASSERT(hat->hat_flags == 0); #if defined(__xpv) /* ! * No PCP stuff on the hypervisor due to the 64-bit split top level * page tables. On 32-bit it's not needed as the hypervisor takes * care of copying the top level PTEs to a below 4Gig page. */ ! use_copied = 0; ! use_hat32_cache = B_FALSE; ! hat->hat_max_level = mmu.max_level; ! hat->hat_num_copied = 0; ! hat->hat_flags = 0; #else /* __xpv */ ! ! /* ! * All processes use HAT_COPIED on the 64-bit kernel if KPTI is ! * turned on. ! */ ! if (ttoproc(curthread)->p_model == DATAMODEL_ILP32) { ! use_copied = 1; ! hat->hat_max_level = mmu.max_level32; ! hat->hat_num_copied = mmu.num_copied_ents32; ! use_hat32_cache = B_TRUE; ! hat->hat_flags |= HAT_COPIED_32; ! HATSTAT_INC(hs_hat_copied32); ! } else if (kpti_enable == 1) { ! use_copied = 1; ! hat->hat_max_level = mmu.max_level; ! hat->hat_num_copied = mmu.num_copied_ents; ! use_hat32_cache = B_FALSE; ! HATSTAT_INC(hs_hat_copied64); ! } else { ! use_copied = 0; ! use_hat32_cache = B_FALSE; ! hat->hat_max_level = mmu.max_level; ! hat->hat_num_copied = 0; ! hat->hat_flags = 0; ! HATSTAT_INC(hs_hat_normal64); ! } #endif /* __xpv */ ! if (use_copied) { ! hat->hat_flags |= HAT_COPIED; ! bzero(hat->hat_copied_ptes, sizeof (hat->hat_copied_ptes)); } /* ! * Allocate the htable hash. For 32-bit PCP processes we use the ! * hat32_hash_cache. However, for 64-bit PCP processes we do not as the ! * number of entries that they have to handle is closer to ! * hat_hash_cache in count (though there will be more wastage when we ! * have more DRAM in the system and thus push down the user address ! * range). */ ! if (use_hat32_cache) { ! hat->hat_num_hash = mmu.hat32_hash_cnt; ! hat->hat_ht_hash = kmem_cache_alloc(hat32_hash_cache, KM_SLEEP); } else { hat->hat_num_hash = mmu.hash_cnt; hat->hat_ht_hash = kmem_cache_alloc(hat_hash_cache, KM_SLEEP); } bzero(hat->hat_ht_hash, hat->hat_num_hash * sizeof (htable_t *));
*** 307,317 **** XPV_DISALLOW_MIGRATE(); ht = htable_create(hat, (uintptr_t)0, TOP_LEVEL(hat), NULL); hat->hat_htable = ht; #if defined(__amd64) ! if (hat->hat_flags & HAT_VLP) goto init_done; #endif for (r = 0; r < num_kernel_ranges; ++r) { rp = &kernel_ranges[r]; --- 546,556 ---- XPV_DISALLOW_MIGRATE(); ht = htable_create(hat, (uintptr_t)0, TOP_LEVEL(hat), NULL); hat->hat_htable = ht; #if defined(__amd64) ! if (hat->hat_flags & HAT_COPIED) goto init_done; #endif for (r = 0; r < num_kernel_ranges; ++r) { rp = &kernel_ranges[r];
*** 332,344 **** (eva > rp->hkr_end_va || eva == 0)) cnt = htable_va2entry(rp->hkr_end_va, ht) - start; #if defined(__i386) && !defined(__xpv) ! if (ht->ht_flags & HTABLE_VLP) { ! bcopy(&vlp_page[start], ! &hat->hat_vlp_ptes[start], cnt * sizeof (x86pte_t)); continue; } #endif src = htable_lookup(kas.a_hat, va, rp->hkr_level); --- 571,583 ---- (eva > rp->hkr_end_va || eva == 0)) cnt = htable_va2entry(rp->hkr_end_va, ht) - start; #if defined(__i386) && !defined(__xpv) ! if (ht->ht_flags & HTABLE_COPIED) { ! bcopy(&pcp_page[start], ! &hat->hat_copied_ptes[start], cnt * sizeof (x86pte_t)); continue; } #endif src = htable_lookup(kas.a_hat, va, rp->hkr_level);
*** 359,392 **** xen_pin(hat->hat_user_ptable, mmu.max_level); #endif #endif XPV_ALLOW_MIGRATE(); /* ! * Put it at the start of the global list of all hats (used by stealing) ! * ! * kas.a_hat is not in the list but is instead used to find the ! * first and last items in the list. ! * ! * - kas.a_hat->hat_next points to the start of the user hats. ! * The list ends where hat->hat_next == NULL ! * ! * - kas.a_hat->hat_prev points to the last of the user hats. ! * The list begins where hat->hat_prev == NULL */ ! mutex_enter(&hat_list_lock); ! hat->hat_prev = NULL; ! hat->hat_next = kas.a_hat->hat_next; ! if (hat->hat_next) ! hat->hat_next->hat_prev = hat; ! else ! kas.a_hat->hat_prev = hat; ! kas.a_hat->hat_next = hat; ! mutex_exit(&hat_list_lock); return (hat); } /* * process has finished executing but as has not been cleaned up yet. */ /*ARGSUSED*/ --- 598,655 ---- xen_pin(hat->hat_user_ptable, mmu.max_level); #endif #endif XPV_ALLOW_MIGRATE(); + hat_list_append(hat); + + return (hat); + } + + #if !defined(__xpv) + /* + * Cons up a HAT for a CPU. This represents the user mappings. This will have + * various kernel pages punched into it manually. Importantly, this hat is + * ineligible for stealing. We really don't want to deal with this ever + * faulting and figuring out that this is happening, much like we don't with + * kas. + */ + static hat_t * + hat_cpu_alloc(cpu_t *cpu) + { + hat_t *hat; + htable_t *ht; + + hat = kmem_cache_alloc(hat_cache, KM_SLEEP); + hat->hat_as = NULL; + mutex_init(&hat->hat_mutex, NULL, MUTEX_DEFAULT, NULL); + hat->hat_max_level = mmu.max_level; + hat->hat_num_copied = 0; + hat->hat_flags = HAT_PCP; + + hat->hat_num_hash = mmu.hash_cnt; + hat->hat_ht_hash = kmem_cache_alloc(hat_hash_cache, KM_SLEEP); + bzero(hat->hat_ht_hash, hat->hat_num_hash * sizeof (htable_t *)); + + hat->hat_next = hat->hat_prev = NULL; + /* ! * Because this HAT will only ever be used by the current CPU, we'll go ! * ahead and set the CPUSET up to only point to the CPU in question. */ ! CPUSET_ADD(hat->hat_cpus, cpu->cpu_id); + hat->hat_htable = NULL; + hat->hat_ht_cached = NULL; + ht = htable_create(hat, (uintptr_t)0, TOP_LEVEL(hat), NULL); + hat->hat_htable = ht; + + hat_list_append(hat); + return (hat); } + #endif /* !__xpv */ /* * process has finished executing but as has not been cleaned up yet. */ /*ARGSUSED*/
*** 439,448 **** --- 702,712 ---- #if defined(__xpv) /* * On the hypervisor, unpin top level page table(s) */ + VERIFY3U(hat->hat_flags & HAT_PCP, ==, 0); xen_unpin(hat->hat_htable->ht_pfn); #if defined(__amd64) xen_unpin(hat->hat_user_ptable); #endif #endif
*** 453,470 **** htable_purge_hat(hat); /* * Decide which kmem cache the hash table came from, then free it. */ ! if (hat->hat_flags & HAT_VLP) ! cache = vlp_hash_cache; ! else cache = hat_hash_cache; kmem_cache_free(cache, hat->hat_ht_hash); hat->hat_ht_hash = NULL; hat->hat_flags = 0; kmem_cache_free(hat_cache, hat); } /* * round kernelbase down to a supported value to use for _userlimit --- 717,745 ---- htable_purge_hat(hat); /* * Decide which kmem cache the hash table came from, then free it. */ ! if (hat->hat_flags & HAT_COPIED) { ! #if defined(__amd64) ! if (hat->hat_flags & HAT_COPIED_32) { ! cache = hat32_hash_cache; ! } else { cache = hat_hash_cache; + } + #else + cache = hat32_hash_cache; + #endif + } else { + cache = hat_hash_cache; + } kmem_cache_free(cache, hat->hat_ht_hash); hat->hat_ht_hash = NULL; hat->hat_flags = 0; + hat->hat_max_level = 0; + hat->hat_num_copied = 0; kmem_cache_free(hat_cache, hat); } /* * round kernelbase down to a supported value to use for _userlimit
*** 515,524 **** --- 790,835 ---- else mmu.umax_page_level = lvl; } /* + * Determine the number of slots that are in use in the top-most level page + * table for user memory. This is based on _userlimit. In effect this is similar + * to htable_va2entry, but without the convenience of having an htable. + */ + void + mmu_calc_user_slots(void) + { + uint_t ent, nptes; + uintptr_t shift; + + nptes = mmu.top_level_count; + shift = _userlimit >> mmu.level_shift[mmu.max_level]; + ent = shift & (nptes - 1); + + /* + * Ent tells us the slot that the page for _userlimit would fit in. We + * need to add one to this to cover the total number of entries. + */ + mmu.top_level_uslots = ent + 1; + + /* + * When running 32-bit compatibility processes on a 64-bit kernel, we + * will only need to use one slot. + */ + mmu.top_level_uslots32 = 1; + + /* + * Record the number of PCP page table entries that we'll need to copy + * around. For 64-bit processes this is the number of user slots. For + * 32-bit processes, this is four 1 GiB pages. + */ + mmu.num_copied_ents = mmu.top_level_uslots; + mmu.num_copied_ents32 = 4; + } + + /* * Initialize hat data structures based on processor MMU information. */ void mmu_init(void) {
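A worked example of the mmu_calc_user_slots() computation above, using a hypothetical _userlimit (the real value depends on where kernelbase lands, which is why the slot count is computed rather than hard-coded):

	#include <stdint.h>
	#include <stdio.h>

	int
	main(void)
	{
		uint64_t userlimit = 0x7fffffffffffULL;	/* hypothetical */
		uint32_t nptes = 512;		/* mmu.top_level_count */
		/* mmu.level_shift[3] == 39 on amd64 */
		uint32_t ent = (uint32_t)(userlimit >> 39) & (nptes - 1);

		/* ent is the slot _userlimit falls in; add one for the count. */
		printf("top_level_uslots = %u\n", ent + 1);	/* prints 256 */
		return (0);
	}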
*** 533,543 **** --- 844,865 ---- */ if (is_x86_feature(x86_featureset, X86FSET_PGE) && (getcr4() & CR4_PGE) != 0) mmu.pt_global = PT_GLOBAL; + #if !defined(__xpv) /* + * The 64-bit x86 kernel has split user/kernel page tables. As such we + * cannot have the global bit set. The simplest way for us to deal with + * this is to just say that pt_global is zero, so the global bit isn't + * present. + */ + if (kpti_enable == 1) + mmu.pt_global = 0; + #endif + + /* * Detect NX and PAE usage. */ mmu.pae_hat = kbm_pae_support; if (kbm_nx_support) mmu.pt_nx = PT_NX;
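The reason kpti_enable must force pt_global to zero: with CR4.PGE set, TLB entries created from global PTEs survive a mov to %cr3, so kernel mappings marked global would remain usable even while the user page table is active, defeating the isolation. A minimal sketch using the architectural PTE bits (an illustration, not illumos code):

	#include <stdint.h>
	#include <stdio.h>

	#define PT_VALID	0x001ULL
	#define PT_WRITABLE	0x002ULL
	#define PT_GLOBAL	0x100ULL	/* bit 8: survives %cr3 loads */

	static uint64_t
	make_kernel_pte(uint64_t pfn, int kpti_enabled)
	{
		uint64_t pte = (pfn << 12) | PT_VALID | PT_WRITABLE;

		/*
		 * Without KPTI, global kernel mappings are a TLB win; with
		 * KPTI, a global translation would stay live after switching
		 * to the user %cr3, so the bit must be left clear.
		 */
		if (!kpti_enabled)
			pte |= PT_GLOBAL;
		return (pte);
	}

	int
	main(void)
	{
		printf("kpti pte:   0x%llx\n",
		    (unsigned long long)make_kernel_pte(0x1234, 1));
		printf("global pte: 0x%llx\n",
		    (unsigned long long)make_kernel_pte(0x1234, 0));
		return (0);
	}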
*** 591,600 **** --- 913,927 ---- mmu.num_level = 4; mmu.max_level = 3; mmu.ptes_per_table = 512; mmu.top_level_count = 512; + /* + * 32-bit processes only use 1 GB ptes. + */ + mmu.max_level32 = 2; + mmu.level_shift[0] = 12; mmu.level_shift[1] = 21; mmu.level_shift[2] = 30; mmu.level_shift[3] = 39;
*** 627,636 **** --- 954,964 ---- mmu.level_offset[i] = mmu.level_size[i] - 1; mmu.level_mask[i] = ~mmu.level_offset[i]; } set_max_page_level(); + mmu_calc_user_slots(); mmu_page_sizes = mmu.max_page_level + 1; mmu_exported_page_sizes = mmu.umax_page_level + 1; /* restrict legacy applications from using pagesizes 1g and above */
*** 662,672 **** */ max_htables = physmax / mmu.ptes_per_table; mmu.hash_cnt = MMU_PAGESIZE / sizeof (htable_t *); while (mmu.hash_cnt > 16 && mmu.hash_cnt >= max_htables) mmu.hash_cnt >>= 1; ! mmu.vlp_hash_cnt = mmu.hash_cnt; #if defined(__amd64) /* * If running in 64 bits and physical memory is large, * increase the size of the cache to cover all of memory for --- 990,1000 ---- */ max_htables = physmax / mmu.ptes_per_table; mmu.hash_cnt = MMU_PAGESIZE / sizeof (htable_t *); while (mmu.hash_cnt > 16 && mmu.hash_cnt >= max_htables) mmu.hash_cnt >>= 1; ! mmu.hat32_hash_cnt = mmu.hash_cnt; #if defined(__amd64) /* * If running in 64 bits and physical memory is large, * increase the size of the cache to cover all of memory for
*** 711,728 **** hat_hash_cache = kmem_cache_create("HatHash", mmu.hash_cnt * sizeof (htable_t *), 0, NULL, NULL, NULL, NULL, 0, 0); /* ! * VLP hats can use a smaller hash table size on large memroy machines */ ! if (mmu.hash_cnt == mmu.vlp_hash_cnt) { ! vlp_hash_cache = hat_hash_cache; } else { ! vlp_hash_cache = kmem_cache_create("HatVlpHash", ! mmu.vlp_hash_cnt * sizeof (htable_t *), 0, NULL, NULL, NULL, ! NULL, 0, 0); } /* * Set up the kernel's hat */ --- 1039,1057 ---- hat_hash_cache = kmem_cache_create("HatHash", mmu.hash_cnt * sizeof (htable_t *), 0, NULL, NULL, NULL, NULL, 0, 0); /* ! * 32-bit PCP hats can use a smaller hash table size on large memory ! * machines */ ! if (mmu.hash_cnt == mmu.hat32_hash_cnt) { ! hat32_hash_cache = hat_hash_cache; } else { ! hat32_hash_cache = kmem_cache_create("Hat32Hash", ! mmu.hat32_hash_cnt * sizeof (htable_t *), 0, NULL, NULL, ! NULL, NULL, 0, 0); } /* * Set up the kernel's hat */
*** 735,744 **** --- 1064,1080 ---- CPUSET_ZERO(khat_cpuset); CPUSET_ADD(khat_cpuset, CPU->cpu_id); /* + * The kernel HAT doesn't use PCP regardless of architecture. + */ + ASSERT3U(mmu.max_level, >, 0); + kas.a_hat->hat_max_level = mmu.max_level; + kas.a_hat->hat_num_copied = 0; + + /* * The kernel hat's next pointer serves as the head of the hat list. * The kernel hat's prev pointer tracks the last hat on the list for * htable_steal() to use. */ kas.a_hat->hat_next = NULL;
*** 766,826 **** */ hrm_hashtab = kmem_zalloc(HRM_HASHSIZE * sizeof (struct hrmstat *), KM_SLEEP); } /* ! * Prepare CPU specific pagetables for VLP processes on 64 bit kernels. * * Each CPU has a set of 2 pagetables that are reused for any 32 bit ! * process it runs. They are the top level pagetable, hci_vlp_l3ptes, and ! * the next to top level table for the bottom 512 Gig, hci_vlp_l2ptes. */ /*ARGSUSED*/ static void ! hat_vlp_setup(struct cpu *cpu) { ! #if defined(__amd64) && !defined(__xpv) struct hat_cpu_info *hci = cpu->cpu_hat_info; ! pfn_t pfn; /* * allocate the level==2 page table for the bottom most * 512Gig of address space (this is where 32 bit apps live) */ ASSERT(hci != NULL); ! hci->hci_vlp_l2ptes = kmem_zalloc(MMU_PAGESIZE, KM_SLEEP); /* * Allocate a top level pagetable and copy the kernel's ! * entries into it. Then link in hci_vlp_l2ptes in the 1st entry. */ ! hci->hci_vlp_l3ptes = kmem_zalloc(MMU_PAGESIZE, KM_SLEEP); ! hci->hci_vlp_pfn = ! hat_getpfnum(kas.a_hat, (caddr_t)hci->hci_vlp_l3ptes); ! ASSERT(hci->hci_vlp_pfn != PFN_INVALID); ! bcopy(vlp_page, hci->hci_vlp_l3ptes, MMU_PAGESIZE); ! pfn = hat_getpfnum(kas.a_hat, (caddr_t)hci->hci_vlp_l2ptes); ! ASSERT(pfn != PFN_INVALID); ! hci->hci_vlp_l3ptes[0] = MAKEPTP(pfn, 2); ! #endif /* __amd64 && !__xpv */ } /*ARGSUSED*/ static void ! hat_vlp_teardown(cpu_t *cpu) { ! #if defined(__amd64) && !defined(__xpv) struct hat_cpu_info *hci; if ((hci = cpu->cpu_hat_info) == NULL) return; ! if (hci->hci_vlp_l2ptes) ! kmem_free(hci->hci_vlp_l2ptes, MMU_PAGESIZE); ! if (hci->hci_vlp_l3ptes) ! kmem_free(hci->hci_vlp_l3ptes, MMU_PAGESIZE); #endif } #define NEXT_HKR(r, l, s, e) { \ kernel_ranges[r].hkr_level = l; \ --- 1102,1270 ---- */ hrm_hashtab = kmem_zalloc(HRM_HASHSIZE * sizeof (struct hrmstat *), KM_SLEEP); } + + extern void kpti_tramp_start(); + extern void kpti_tramp_end(); + + extern void kdi_isr_start(); + extern void kdi_isr_end(); + + extern gate_desc_t kdi_idt[NIDT]; + /* ! * Prepare per-CPU pagetables for all processes on the 64 bit kernel. * * Each CPU has a set of 2 pagetables that are reused for any 32 bit ! * process it runs. They are the top level pagetable, hci_pcp_l3ptes, and ! * the next to top level table for the bottom 512 Gig, hci_pcp_l2ptes. */ /*ARGSUSED*/ static void ! hat_pcp_setup(struct cpu *cpu) { ! #if !defined(__xpv) struct hat_cpu_info *hci = cpu->cpu_hat_info; ! uintptr_t va; ! size_t len; /* * allocate the level==2 page table for the bottom most * 512Gig of address space (this is where 32 bit apps live) */ ASSERT(hci != NULL); ! hci->hci_pcp_l2ptes = kmem_zalloc(MMU_PAGESIZE, KM_SLEEP); /* * Allocate a top level pagetable and copy the kernel's ! * entries into it. Then link in hci_pcp_l2ptes in the 1st entry. */ ! hci->hci_pcp_l3ptes = kmem_zalloc(MMU_PAGESIZE, KM_SLEEP); ! hci->hci_pcp_l3pfn = ! hat_getpfnum(kas.a_hat, (caddr_t)hci->hci_pcp_l3ptes); ! ASSERT3U(hci->hci_pcp_l3pfn, !=, PFN_INVALID); ! bcopy(pcp_page, hci->hci_pcp_l3ptes, MMU_PAGESIZE); ! hci->hci_pcp_l2pfn = ! hat_getpfnum(kas.a_hat, (caddr_t)hci->hci_pcp_l2ptes); ! ASSERT3U(hci->hci_pcp_l2pfn, !=, PFN_INVALID); ! ! /* ! * Now go through and allocate the user version of these structures. ! * Unlike with the kernel version, we allocate a hat to represent the ! * top-level page table as that will make it much simpler when we need ! * to patch through user entries. ! */ ! hci->hci_user_hat = hat_cpu_alloc(cpu); ! hci->hci_user_l3pfn = hci->hci_user_hat->hat_htable->ht_pfn; ! ASSERT3U(hci->hci_user_l3pfn, !=, PFN_INVALID); ! 
hci->hci_user_l3ptes = ! (x86pte_t *)hat_kpm_mapin_pfn(hci->hci_user_l3pfn); ! ! /* Skip the rest of this if KPTI is switched off at boot. */ ! if (kpti_enable != 1) ! return; ! ! /* ! * OK, now that we have this we need to go through and punch the normal ! * holes in the CPU's hat for this. At this point we'll punch in the ! * following: ! * ! * o GDT ! * o IDT ! * o LDT ! * o Trampoline Code ! * o machcpu KPTI page ! * o kmdb ISR code page (just trampolines) ! * ! * If this is cpu0, then we also can initialize the following because ! * they'll have already been allocated. ! * ! * o TSS for CPU 0 ! * o Double Fault for CPU 0 ! * ! * The following items have yet to be allocated and have not been ! * punched in yet. They will be punched in later: ! * ! * o TSS (mach_cpucontext_alloc_tables()) ! * o Double Fault Stack (mach_cpucontext_alloc_tables()) ! */ ! hati_cpu_punchin(cpu, (uintptr_t)cpu->cpu_gdt, PROT_READ); ! hati_cpu_punchin(cpu, (uintptr_t)cpu->cpu_idt, PROT_READ); ! ! /* ! * As the KDI IDT is only active during kmdb sessions (including single ! * stepping), typically we don't actually need this punched in (we ! * consider the routines that switch to the user cr3 to be toxic). But ! * if we ever accidentally end up on the user cr3 while on this IDT, ! * we'd prefer not to triple fault. ! */ ! hati_cpu_punchin(cpu, (uintptr_t)&kdi_idt, PROT_READ); ! ! CTASSERT(((uintptr_t)&kpti_tramp_start % MMU_PAGESIZE) == 0); ! CTASSERT(((uintptr_t)&kpti_tramp_end % MMU_PAGESIZE) == 0); ! for (va = (uintptr_t)&kpti_tramp_start; ! va < (uintptr_t)&kpti_tramp_end; va += MMU_PAGESIZE) { ! hati_cpu_punchin(cpu, va, PROT_READ | PROT_EXEC); ! } ! ! VERIFY3U(((uintptr_t)cpu->cpu_m.mcpu_ldt) % MMU_PAGESIZE, ==, 0); ! for (va = (uintptr_t)cpu->cpu_m.mcpu_ldt, len = LDT_CPU_SIZE; ! len >= MMU_PAGESIZE; va += MMU_PAGESIZE, len -= MMU_PAGESIZE) { ! hati_cpu_punchin(cpu, va, PROT_READ); ! } ! ! /* mcpu_pad2 is the start of the page containing the kpti_frames. */ ! hati_cpu_punchin(cpu, (uintptr_t)&cpu->cpu_m.mcpu_pad2[0], ! PROT_READ | PROT_WRITE); ! ! if (cpu == &cpus[0]) { ! /* ! * CPU0 uses a global for its double fault stack to deal with ! * the chicken and egg problem. We need to punch it into its ! * user HAT. ! */ ! extern char dblfault_stack0[]; ! ! hati_cpu_punchin(cpu, (uintptr_t)cpu->cpu_m.mcpu_tss, ! PROT_READ); ! ! for (va = (uintptr_t)dblfault_stack0, ! len = DEFAULTSTKSZ; len >= MMU_PAGESIZE; ! va += MMU_PAGESIZE, len -= MMU_PAGESIZE) { ! hati_cpu_punchin(cpu, va, PROT_READ | PROT_WRITE); ! } ! } ! ! CTASSERT(((uintptr_t)&kdi_isr_start % MMU_PAGESIZE) == 0); ! CTASSERT(((uintptr_t)&kdi_isr_end % MMU_PAGESIZE) == 0); ! for (va = (uintptr_t)&kdi_isr_start; ! va < (uintptr_t)&kdi_isr_end; va += MMU_PAGESIZE) { ! hati_cpu_punchin(cpu, va, PROT_READ | PROT_EXEC); ! } ! #endif /* !__xpv */ } /*ARGSUSED*/ static void ! hat_pcp_teardown(cpu_t *cpu) { ! #if !defined(__xpv) struct hat_cpu_info *hci; if ((hci = cpu->cpu_hat_info) == NULL) return; ! if (hci->hci_pcp_l2ptes != NULL) ! kmem_free(hci->hci_pcp_l2ptes, MMU_PAGESIZE); ! if (hci->hci_pcp_l3ptes != NULL) ! kmem_free(hci->hci_pcp_l3ptes, MMU_PAGESIZE); ! if (hci->hci_user_hat != NULL) { ! hat_free_start(hci->hci_user_hat); ! hat_free_end(hci->hci_user_hat); ! } #endif } #define NEXT_HKR(r, l, s, e) { \ kernel_ranges[r].hkr_level = l; \
*** 912,936 **** } /* * 32 bit PAE metal kernels use only 4 of the 512 entries in the * page holding the top level pagetable. We use the remainder for ! * the "per CPU" page tables for VLP processes. * Map the top level kernel pagetable into the kernel to make * it easy to use bcopy to access these tables. */ if (mmu.pae_hat) { ! vlp_page = vmem_alloc(heap_arena, MMU_PAGESIZE, VM_SLEEP); ! hat_devload(kas.a_hat, (caddr_t)vlp_page, MMU_PAGESIZE, kas.a_hat->hat_htable->ht_pfn, #if !defined(__xpv) PROT_WRITE | #endif PROT_READ | HAT_NOSYNC | HAT_UNORDERED_OK, HAT_LOAD | HAT_LOAD_NOCONSIST); } ! hat_vlp_setup(CPU); /* * Create kmap (cached mappings of kernel PTEs) * for 32 bit we map from segmap_start .. ekernelheap * for 64 bit we map from segmap_start .. segmap_start + segmapsize; --- 1356,1383 ---- } /* * 32 bit PAE metal kernels use only 4 of the 512 entries in the * page holding the top level pagetable. We use the remainder for ! * the "per CPU" page tables for PCP processes. * Map the top level kernel pagetable into the kernel to make * it easy to use bcopy to access these tables. + * + * PAE is required for the 64-bit kernel, which also uses this page to + * implement the per-CPU pagetables. See the big theory statement. */ if (mmu.pae_hat) { ! pcp_page = vmem_alloc(heap_arena, MMU_PAGESIZE, VM_SLEEP); ! hat_devload(kas.a_hat, (caddr_t)pcp_page, MMU_PAGESIZE, kas.a_hat->hat_htable->ht_pfn, #if !defined(__xpv) PROT_WRITE | #endif PROT_READ | HAT_NOSYNC | HAT_UNORDERED_OK, HAT_LOAD | HAT_LOAD_NOCONSIST); } ! hat_pcp_setup(CPU); /* * Create kmap (cached mappings of kernel PTEs) * for 32 bit we map from segmap_start .. ekernelheap * for 64 bit we map from segmap_start .. segmap_start + segmapsize;
*** 939,948 **** --- 1386,1401 ---- size = (uintptr_t)ekernelheap - segmap_start; #elif defined(__amd64) size = segmapsize; #endif hat_kmap_init((uintptr_t)segmap_start, size); + + #if !defined(__xpv) + ASSERT3U(kas.a_hat->hat_htable->ht_pfn, !=, PFN_INVALID); + ASSERT3U(kpti_safe_cr3, ==, + MAKECR3(kas.a_hat->hat_htable->ht_pfn, PCID_KERNEL)); + #endif } /* * On 32 bit PAE mode, PTE's are 64 bits, but ordinary atomic memory references * are 32 bit, so for safety we must use atomic_cas_64() to install these.
*** 956,971 **** x86pte_t pte; int i; /* * Load the 4 entries of the level 2 page table into this ! * cpu's range of the vlp_page and point cr3 at them. */ ASSERT(mmu.pae_hat); ! src = hat->hat_vlp_ptes; ! dest = vlp_page + (cpu->cpu_id + 1) * VLP_NUM_PTES; ! for (i = 0; i < VLP_NUM_PTES; ++i) { for (;;) { pte = dest[i]; if (pte == src[i]) break; if (atomic_cas_64(dest + i, pte, src[i]) != src[i]) --- 1409,1424 ---- x86pte_t pte; int i; /* * Load the 4 entries of the level 2 page table into this ! * cpu's range of the pcp_page and point cr3 at them. */ ASSERT(mmu.pae_hat); ! src = hat->hat_copied_ptes; ! dest = pcp_page + (cpu->cpu_id + 1) * MAX_COPIED_PTES; ! for (i = 0; i < MAX_COPIED_PTES; ++i) { for (;;) { pte = dest[i]; if (pte == src[i]) break; if (atomic_cas_64(dest + i, pte, src[i]) != src[i])
} } #endif /* + * Update the PCP data on CPU 'cpu' to match the given hat. If this is a 32-bit + * process, then we must update the L2 pages and then the L3. If this is a + * 64-bit process, then we must update the L3 entries. + */ + static void + hat_pcp_update(cpu_t *cpu, const hat_t *hat) + { + ASSERT3U(hat->hat_flags & HAT_COPIED, !=, 0); + + if ((hat->hat_flags & HAT_COPIED_32) != 0) { + const x86pte_t *l2src; + x86pte_t *l2dst, *l3ptes, *l3uptes; + /* + * This is a 32-bit process. To set this up, we need to do the + * following: + * + * - Copy the 4 L2 PTEs into the dedicated L2 table + * - Zero the user L3 PTEs in the user and kernel page table + * - Set the first L3 PTE to point to the CPU L2 table + */ + l2src = hat->hat_copied_ptes; + l2dst = cpu->cpu_hat_info->hci_pcp_l2ptes; + l3ptes = cpu->cpu_hat_info->hci_pcp_l3ptes; + l3uptes = cpu->cpu_hat_info->hci_user_l3ptes; + + l2dst[0] = l2src[0]; + l2dst[1] = l2src[1]; + l2dst[2] = l2src[2]; + l2dst[3] = l2src[3]; + + /* + * Make sure to use the mmu to get the number of slots. The + * number of PCP entries that this has will always be less, as + * it's a 32-bit process. + */ + bzero(l3ptes, sizeof (x86pte_t) * mmu.top_level_uslots); + l3ptes[0] = MAKEPTP(cpu->cpu_hat_info->hci_pcp_l2pfn, 2); + bzero(l3uptes, sizeof (x86pte_t) * mmu.top_level_uslots); + l3uptes[0] = MAKEPTP(cpu->cpu_hat_info->hci_pcp_l2pfn, 2); + } else { + /* + * This is a 64-bit process.
To set this up, we need to do the + * following: + * + * - Zero the 4 L2 PTEs in the CPU structure for safety + * - Copy over the new user L3 PTEs into the kernel page table + * - Copy over the new user L3 PTEs into the user page table + */ + ASSERT3S(kpti_enable, ==, 1); + bzero(cpu->cpu_hat_info->hci_pcp_l2ptes, sizeof (x86pte_t) * 4); + bcopy(hat->hat_copied_ptes, cpu->cpu_hat_info->hci_pcp_l3ptes, + sizeof (x86pte_t) * mmu.top_level_uslots); + bcopy(hat->hat_copied_ptes, cpu->cpu_hat_info->hci_user_l3ptes, + sizeof (x86pte_t) * mmu.top_level_uslots); + } + } + + static void + reset_kpti(struct kpti_frame *fr, uint64_t kcr3, uint64_t ucr3) + { + ASSERT3U(fr->kf_tr_flag, ==, 0); + #if DEBUG + if (fr->kf_kernel_cr3 != 0) { + ASSERT3U(fr->kf_lower_redzone, ==, 0xdeadbeefdeadbeef); + ASSERT3U(fr->kf_middle_redzone, ==, 0xdeadbeefdeadbeef); + ASSERT3U(fr->kf_upper_redzone, ==, 0xdeadbeefdeadbeef); + } + #endif + + bzero(fr, offsetof(struct kpti_frame, kf_kernel_cr3)); + bzero(&fr->kf_unused, sizeof (struct kpti_frame) - + offsetof(struct kpti_frame, kf_unused)); + + fr->kf_kernel_cr3 = kcr3; + fr->kf_user_cr3 = ucr3; + fr->kf_tr_ret_rsp = (uintptr_t)&fr->kf_tr_rsp; + + fr->kf_lower_redzone = 0xdeadbeefdeadbeef; + fr->kf_middle_redzone = 0xdeadbeefdeadbeef; + fr->kf_upper_redzone = 0xdeadbeefdeadbeef; + } + + #ifdef __xpv + static void + hat_switch_xen(hat_t *hat) + { + struct mmuext_op t[2]; + uint_t retcnt; + uint_t opcnt = 1; + uint64_t newcr3; + + ASSERT(!(hat->hat_flags & HAT_COPIED)); + ASSERT(!(getcr4() & CR4_PCIDE)); + + newcr3 = MAKECR3((uint64_t)hat->hat_htable->ht_pfn, PCID_NONE); + + t[0].cmd = MMUEXT_NEW_BASEPTR; + t[0].arg1.mfn = mmu_btop(pa_to_ma(newcr3)); + + /* + * There's an interesting problem here, as to what to actually specify + * when switching to the kernel hat. For now we'll reuse the kernel hat + * again. + */ + t[1].cmd = MMUEXT_NEW_USER_BASEPTR; + if (hat == kas.a_hat) + t[1].arg1.mfn = mmu_btop(pa_to_ma(newcr3)); + else + t[1].arg1.mfn = pfn_to_mfn(hat->hat_user_ptable); + ++opcnt; + + if (HYPERVISOR_mmuext_op(t, opcnt, &retcnt, DOMID_SELF) < 0) + panic("HYPERVISOR_mmu_update() failed"); + ASSERT(retcnt == opcnt); + } + #endif /* __xpv */ + + /* * Switch to a new active hat, maintaining bit masks to track active CPUs. * ! * With KPTI, all our HATs except kas should be using PCP. Thus, to switch ! * HATs, we need to copy over the new user PTEs, then set our trampoline context ! * as appropriate. ! * ! * If lacking PCID, we then load our new cr3, which will flush the TLB: we may ! * have established userspace TLB entries via kernel accesses, and these are no ! * longer valid. We have to do this eagerly, as we just deleted this CPU from ! * ->hat_cpus, so would no longer see any TLB shootdowns. ! * ! * With PCID enabled, things get a little more complicated. We would like to ! * keep TLB context around when entering and exiting the kernel, and to do this, ! * we partition the TLB into two different spaces: ! * ! * PCID_KERNEL is defined as zero, and used both by kas and all other address ! * spaces while in the kernel (post-trampoline). ! * ! * PCID_USER is used while in userspace. Therefore, userspace cannot use any ! * lingering PCID_KERNEL entries to kernel addresses it should not be able to ! * read. ! * ! * The trampoline cr3s are set not to invalidate on a mov to %cr3. This means if ! * we take a journey through the kernel without switching HATs, we have some ! * hope of keeping our TLB state around. ! * ! 
* On a hat switch, rather than deal with any necessary flushes on the way out ! * of the trampolines, we do them upfront here. If we're switching from kas, we ! * shouldn't need any invalidation. ! * ! * Otherwise, we can have stale userspace entries for both PCID_USER (what ! * happened before we move onto the kcr3) and PCID_KERNEL (any subsequent ! * userspace accesses such as ddi_copyin()). Since setcr3() won't do these ! * flushes on its own in PCIDE, we'll do a non-flushing load and then ! * invalidate everything. */ void hat_switch(hat_t *hat) { cpu_t *cpu = CPU; hat_t *old = cpu->cpu_current_hat; /* * set up this information first, so we don't miss any cross calls
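The kcr3 and ucr3 values built in hat_switch() below combine a root pagetable PFN, a PCID, and the architectural no-invalidate bit (bit 63 of %cr3, which suppresses the implicit TLB flush of a %cr3 load when CR4.PCIDE is set; the PCID occupies the low 12 bits). A standalone sketch of that composition, assuming as in the discussion above that the kernel PCID is 0 and the user PCID is a small nonzero id:

	#include <stdint.h>
	#include <stdio.h>

	#define CR3_NOINVL	(1ULL << 63)	/* keep this PCID's TLB entries */
	#define CR3_PCID_MASK	0xfffULL	/* PCID: low 12 bits */

	#define EXAMPLE_PCID_KERNEL	0ULL	/* hypothetical ids */
	#define EXAMPLE_PCID_USER	1ULL

	static uint64_t
	compose_cr3(uint64_t root_pfn, uint64_t pcid, int noinval)
	{
		uint64_t cr3 = (root_pfn << 12) | (pcid & CR3_PCID_MASK);

		if (noinval)
			cr3 |= CR3_NOINVL;
		return (cr3);
	}

	int
	main(void)
	{
		/* Hypothetical root PFNs for the kernel and user L3 tables. */
		printf("kcr3 = 0x%llx\n", (unsigned long long)
		    compose_cr3(0x1000, EXAMPLE_PCID_KERNEL, 1));
		printf("ucr3 = 0x%llx\n", (unsigned long long)
		    compose_cr3(0x2000, EXAMPLE_PCID_USER, 1));
		return (0);
	}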
*** 1004,1059 **** if (hat != kas.a_hat) { CPUSET_ATOMIC_ADD(hat->hat_cpus, cpu->cpu_id); } cpu->cpu_current_hat = hat; ! /* ! * now go ahead and load cr3 ! */ ! if (hat->hat_flags & HAT_VLP) { ! #if defined(__amd64) ! x86pte_t *vlpptep = cpu->cpu_hat_info->hci_vlp_l2ptes; ! VLP_COPY(hat->hat_vlp_ptes, vlpptep); ! newcr3 = MAKECR3(cpu->cpu_hat_info->hci_vlp_pfn); ! #elif defined(__i386) ! reload_pae32(hat, cpu); ! newcr3 = MAKECR3(kas.a_hat->hat_htable->ht_pfn) + ! (cpu->cpu_id + 1) * VLP_SIZE; ! #endif } else { ! newcr3 = MAKECR3((uint64_t)hat->hat_htable->ht_pfn); } - #ifdef __xpv - { - struct mmuext_op t[2]; - uint_t retcnt; - uint_t opcnt = 1; ! t[0].cmd = MMUEXT_NEW_BASEPTR; ! t[0].arg1.mfn = mmu_btop(pa_to_ma(newcr3)); ! #if defined(__amd64) /* ! * There's an interesting problem here, as to what to ! * actually specify when switching to the kernel hat. ! * For now we'll reuse the kernel hat again. */ ! t[1].cmd = MMUEXT_NEW_USER_BASEPTR; ! if (hat == kas.a_hat) ! t[1].arg1.mfn = mmu_btop(pa_to_ma(newcr3)); ! else ! t[1].arg1.mfn = pfn_to_mfn(hat->hat_user_ptable); ! ++opcnt; ! #endif /* __amd64 */ ! if (HYPERVISOR_mmuext_op(t, opcnt, &retcnt, DOMID_SELF) < 0) ! panic("HYPERVISOR_mmu_update() failed"); ! ASSERT(retcnt == opcnt); ! } ! #else ! setcr3(newcr3); ! #endif ASSERT(cpu == CPU); } /* * Utility to return a valid x86pte_t from protections, pfn, and level number --- 1605,1671 ---- if (hat != kas.a_hat) { CPUSET_ATOMIC_ADD(hat->hat_cpus, cpu->cpu_id); } cpu->cpu_current_hat = hat; ! #if defined(__xpv) ! hat_switch_xen(hat); ! #else ! struct hat_cpu_info *info = cpu->cpu_m.mcpu_hat_info; ! uint64_t pcide = getcr4() & CR4_PCIDE; ! uint64_t kcr3, ucr3; ! pfn_t tl_kpfn; ! ulong_t flag; ! EQUIV(kpti_enable, !mmu.pt_global); ! ! if (hat->hat_flags & HAT_COPIED) { ! hat_pcp_update(cpu, hat); ! tl_kpfn = info->hci_pcp_l3pfn; } else { ! IMPLY(kpti_enable, hat == kas.a_hat); ! tl_kpfn = hat->hat_htable->ht_pfn; } ! if (pcide) { ! ASSERT(kpti_enable); ! ! kcr3 = MAKECR3(tl_kpfn, PCID_KERNEL) | CR3_NOINVL_BIT; ! ucr3 = MAKECR3(info->hci_user_l3pfn, PCID_USER) | ! CR3_NOINVL_BIT; ! ! setcr3(kcr3); ! if (old != kas.a_hat) ! mmu_flush_tlb(FLUSH_TLB_ALL, NULL); ! } else { ! kcr3 = MAKECR3(tl_kpfn, PCID_NONE); ! ucr3 = kpti_enable ? ! MAKECR3(info->hci_user_l3pfn, PCID_NONE) : ! 0; ! ! setcr3(kcr3); ! } ! /* ! * We will already be taking shootdowns for our new HAT, and as KPTI ! * invpcid emulation needs to use kf_user_cr3, make sure we don't get ! * any cross calls while we're inconsistent. Note that it's harmless to ! * have a *stale* kf_user_cr3 (we just did a FLUSH_TLB_ALL), but a ! * *zero* kf_user_cr3 is not going to go very well. */ ! if (pcide) ! flag = intr_clear(); ! reset_kpti(&cpu->cpu_m.mcpu_kpti, kcr3, ucr3); ! reset_kpti(&cpu->cpu_m.mcpu_kpti_flt, kcr3, ucr3); ! reset_kpti(&cpu->cpu_m.mcpu_kpti_dbg, kcr3, ucr3); ! ! if (pcide) ! intr_restore(flag); ! ! #endif /* !__xpv */ ! ASSERT(cpu == CPU); } /* * Utility to return a valid x86pte_t from protections, pfn, and level number
*** 1361,1374 **** x86_hm_exit(pp); } else { ASSERT(flags & HAT_LOAD_NOCONSIST); } #if defined(__amd64) ! if (ht->ht_flags & HTABLE_VLP) { cpu_t *cpu = CPU; ! x86pte_t *vlpptep = cpu->cpu_hat_info->hci_vlp_l2ptes; ! VLP_COPY(hat->hat_vlp_ptes, vlpptep); } #endif HTABLE_INC(ht->ht_valid_cnt); PGCNT_INC(hat, l); return (rv); --- 1973,1985 ---- x86_hm_exit(pp); } else { ASSERT(flags & HAT_LOAD_NOCONSIST); } #if defined(__amd64) ! if (ht->ht_flags & HTABLE_COPIED) { cpu_t *cpu = CPU; ! hat_pcp_update(cpu, hat); } #endif HTABLE_INC(ht->ht_valid_cnt); PGCNT_INC(hat, l); return (rv);
*** 1436,1446 **** * early before we blow out the kernel stack. */ ++curthread->t_hatdepth; ASSERT(curthread->t_hatdepth < 16); ! ASSERT(hat == kas.a_hat || AS_LOCK_HELD(hat->hat_as)); if (flags & HAT_LOAD_SHARE) hat->hat_flags |= HAT_SHARED; /* --- 2047,2058 ---- * early before we blow out the kernel stack. */ ++curthread->t_hatdepth; ASSERT(curthread->t_hatdepth < 16); ! ASSERT(hat == kas.a_hat || (hat->hat_flags & HAT_PCP) != 0 || ! AS_LOCK_HELD(hat->hat_as)); if (flags & HAT_LOAD_SHARE) hat->hat_flags |= HAT_SHARED; /*
*** 1456,1474 **** if (ht == NULL) { ht = htable_create(hat, va, level, NULL); ASSERT(ht != NULL); } entry = htable_va2entry(va, ht); /* * a bunch of paranoid error checking */ ASSERT(ht->ht_busy > 0); - if (ht->ht_vaddr > va || va > HTABLE_LAST_PAGE(ht)) - panic("hati_load_common: bad htable %p, va %p", - (void *)ht, (void *)va); ASSERT(ht->ht_level == level); /* * construct the new PTE */ --- 2068,2094 ---- if (ht == NULL) { ht = htable_create(hat, va, level, NULL); ASSERT(ht != NULL); } + /* + * htable_va2entry checks this condition as well, but it won't include + * much useful info in the panic. So we do it in advance here to include + * all the context. + */ + if (ht->ht_vaddr > va || va > HTABLE_LAST_PAGE(ht)) { + panic("hati_load_common: bad htable: va=%p, last page=%p, " + "ht->ht_vaddr=%p, ht->ht_level=%d", (void *)va, + (void *)HTABLE_LAST_PAGE(ht), (void *)ht->ht_vaddr, + (int)ht->ht_level); + } entry = htable_va2entry(va, ht); /* * a bunch of paranoid error checking */ ASSERT(ht->ht_busy > 0); ASSERT(ht->ht_level == level); /* * construct the new PTE */
*** 1914,2003 **** panic("No shared region support on x86"); } #if !defined(__xpv) /* ! * Cross call service routine to demap a virtual page on ! * the current CPU or flush all mappings in TLB. */ - /*ARGSUSED*/ static int hati_demap_func(xc_arg_t a1, xc_arg_t a2, xc_arg_t a3) { hat_t *hat = (hat_t *)a1; ! caddr_t addr = (caddr_t)a2; ! size_t len = (size_t)a3; /* * If the target hat isn't the kernel and this CPU isn't operating * in the target hat, we can ignore the cross call. */ if (hat != kas.a_hat && hat != CPU->cpu_current_hat) return (0); ! /* ! * For a normal address, we flush a range of contiguous mappings ! */ ! if ((uintptr_t)addr != DEMAP_ALL_ADDR) { ! for (size_t i = 0; i < len; i += MMU_PAGESIZE) ! mmu_tlbflush_entry(addr + i); return (0); } /* ! * Otherwise we reload cr3 to effect a complete TLB flush. * ! * A reload of cr3 on a VLP process also means we must also recopy in ! * the pte values from the struct hat */ ! if (hat->hat_flags & HAT_VLP) { #if defined(__amd64) ! x86pte_t *vlpptep = CPU->cpu_hat_info->hci_vlp_l2ptes; ! ! VLP_COPY(hat->hat_vlp_ptes, vlpptep); #elif defined(__i386) reload_pae32(hat, CPU); #endif } ! reload_cr3(); return (0); } ! /* ! * Flush all TLB entries, including global (ie. kernel) ones. ! */ ! static void ! flush_all_tlb_entries(void) ! { ! ulong_t cr4 = getcr4(); ! ! if (cr4 & CR4_PGE) { ! setcr4(cr4 & ~(ulong_t)CR4_PGE); ! setcr4(cr4); ! ! /* ! * 32 bit PAE also needs to always reload_cr3() ! */ ! if (mmu.max_level == 2) ! reload_cr3(); ! } else { ! reload_cr3(); ! } ! } ! ! #define TLB_CPU_HALTED (01ul) ! #define TLB_INVAL_ALL (02ul) #define CAS_TLB_INFO(cpu, old, new) \ atomic_cas_ulong((ulong_t *)&(cpu)->cpu_m.mcpu_tlb_info, (old), (new)) /* * Record that a CPU is going idle */ void tlb_going_idle(void) { ! atomic_or_ulong((ulong_t *)&CPU->cpu_m.mcpu_tlb_info, TLB_CPU_HALTED); } /* * Service a delayed TLB flush if coming out of being idle. * It will be called from cpu idle notification with interrupt disabled. --- 2534,2596 ---- panic("No shared region support on x86"); } #if !defined(__xpv) /* ! * Cross call service routine to demap a range of virtual ! * pages on the current CPU or flush all mappings in TLB. */ static int hati_demap_func(xc_arg_t a1, xc_arg_t a2, xc_arg_t a3) { + _NOTE(ARGUNUSED(a3)); hat_t *hat = (hat_t *)a1; ! tlb_range_t *range = (tlb_range_t *)a2; /* * If the target hat isn't the kernel and this CPU isn't operating * in the target hat, we can ignore the cross call. */ if (hat != kas.a_hat && hat != CPU->cpu_current_hat) return (0); ! if (range->tr_va != DEMAP_ALL_ADDR) { ! mmu_flush_tlb(FLUSH_TLB_RANGE, range); return (0); } /* ! * We are flushing all of userspace. * ! * When using PCP, we first need to update this CPU's idea of the PCP ! * PTEs. */ ! if (hat->hat_flags & HAT_COPIED) { #if defined(__amd64) ! hat_pcp_update(CPU, hat); #elif defined(__i386) reload_pae32(hat, CPU); #endif } ! ! mmu_flush_tlb(FLUSH_TLB_NONGLOBAL, NULL); return (0); } ! #define TLBIDLE_CPU_HALTED (0x1UL) ! #define TLBIDLE_INVAL_ALL (0x2UL) #define CAS_TLB_INFO(cpu, old, new) \ atomic_cas_ulong((ulong_t *)&(cpu)->cpu_m.mcpu_tlb_info, (old), (new)) /* * Record that a CPU is going idle */ void tlb_going_idle(void) { ! atomic_or_ulong((ulong_t *)&CPU->cpu_m.mcpu_tlb_info, ! TLBIDLE_CPU_HALTED); } /* * Service a delayed TLB flush if coming out of being idle. * It will be called from cpu idle notification with interrupt disabled.
*** 2010,2046 **** /* * We only have to do something if coming out of being idle. */ tlb_info = CPU->cpu_m.mcpu_tlb_info; ! if (tlb_info & TLB_CPU_HALTED) { ASSERT(CPU->cpu_current_hat == kas.a_hat); /* * Atomic clear and fetch of old state. */ while ((found = CAS_TLB_INFO(CPU, tlb_info, 0)) != tlb_info) { ! ASSERT(found & TLB_CPU_HALTED); tlb_info = found; SMT_PAUSE(); } ! if (tlb_info & TLB_INVAL_ALL) ! flush_all_tlb_entries(); } } #endif /* !__xpv */ /* * Internal routine to do cross calls to invalidate a range of pages on * all CPUs using a given hat. */ void ! hat_tlb_inval_range(hat_t *hat, uintptr_t va, size_t len) { extern int flushes_require_xcalls; /* from mp_startup.c */ cpuset_t justme; cpuset_t cpus_to_shootdown; #ifndef __xpv cpuset_t check_cpus; cpu_t *cpup; int c; #endif --- 2603,2640 ---- /* * We only have to do something if coming out of being idle. */ tlb_info = CPU->cpu_m.mcpu_tlb_info; ! if (tlb_info & TLBIDLE_CPU_HALTED) { ASSERT(CPU->cpu_current_hat == kas.a_hat); /* * Atomic clear and fetch of old state. */ while ((found = CAS_TLB_INFO(CPU, tlb_info, 0)) != tlb_info) { ! ASSERT(found & TLBIDLE_CPU_HALTED); tlb_info = found; SMT_PAUSE(); } ! if (tlb_info & TLBIDLE_INVAL_ALL) ! mmu_flush_tlb(FLUSH_TLB_ALL, NULL); } } #endif /* !__xpv */ /* * Internal routine to do cross calls to invalidate a range of pages on * all CPUs using a given hat. */ void ! hat_tlb_inval_range(hat_t *hat, tlb_range_t *in_range) { extern int flushes_require_xcalls; /* from mp_startup.c */ cpuset_t justme; cpuset_t cpus_to_shootdown; + tlb_range_t range = *in_range; #ifndef __xpv cpuset_t check_cpus; cpu_t *cpup; int c; #endif
*** 2057,2083 **** * entire set of user TLBs, since we don't know what addresses * these were shared at. */ if (hat->hat_flags & HAT_SHARED) { hat = kas.a_hat; ! va = DEMAP_ALL_ADDR; } /* * if not running with multiple CPUs, don't use cross calls */ if (panicstr || !flushes_require_xcalls) { #ifdef __xpv ! if (va == DEMAP_ALL_ADDR) { xen_flush_tlb(); } else { ! for (size_t i = 0; i < len; i += MMU_PAGESIZE) ! xen_flush_va((caddr_t)(va + i)); } #else ! (void) hati_demap_func((xc_arg_t)hat, ! (xc_arg_t)va, (xc_arg_t)len); #endif return; } --- 2651,2678 ---- * entire set of user TLBs, since we don't know what addresses * these were shared at. */ if (hat->hat_flags & HAT_SHARED) { hat = kas.a_hat; ! range.tr_va = DEMAP_ALL_ADDR; } /* * if not running with multiple CPUs, don't use cross calls */ if (panicstr || !flushes_require_xcalls) { #ifdef __xpv ! if (range.tr_va == DEMAP_ALL_ADDR) { xen_flush_tlb(); } else { ! for (size_t i = 0; i < TLB_RANGE_LEN(&range); ! i += MMU_PAGESIZE) { ! xen_flush_va((caddr_t)(range.tr_va + i)); } + } #else ! (void) hati_demap_func((xc_arg_t)hat, (xc_arg_t)&range, 0); #endif return; }
*** 2107,2123 **** cpup = cpu[c]; if (cpup == NULL) continue; tlb_info = cpup->cpu_m.mcpu_tlb_info; ! while (tlb_info == TLB_CPU_HALTED) { ! (void) CAS_TLB_INFO(cpup, TLB_CPU_HALTED, ! TLB_CPU_HALTED | TLB_INVAL_ALL); SMT_PAUSE(); tlb_info = cpup->cpu_m.mcpu_tlb_info; } ! if (tlb_info == (TLB_CPU_HALTED | TLB_INVAL_ALL)) { HATSTAT_INC(hs_tlb_inval_delayed); CPUSET_DEL(cpus_to_shootdown, c); } } #endif --- 2702,2718 ---- cpup = cpu[c]; if (cpup == NULL) continue; tlb_info = cpup->cpu_m.mcpu_tlb_info; ! while (tlb_info == TLBIDLE_CPU_HALTED) { ! (void) CAS_TLB_INFO(cpup, TLBIDLE_CPU_HALTED, ! TLBIDLE_CPU_HALTED | TLBIDLE_INVAL_ALL); SMT_PAUSE(); tlb_info = cpup->cpu_m.mcpu_tlb_info; } ! if (tlb_info == (TLBIDLE_CPU_HALTED | TLBIDLE_INVAL_ALL)) { HATSTAT_INC(hs_tlb_inval_delayed); CPUSET_DEL(cpus_to_shootdown, c); } } #endif
*** 2124,2158 ****
  	if (CPUSET_ISNULL(cpus_to_shootdown) ||
  	    CPUSET_ISEQUAL(cpus_to_shootdown, justme)) {
  #ifdef __xpv
! 		if (va == DEMAP_ALL_ADDR) {
  			xen_flush_tlb();
  		} else {
! 			for (size_t i = 0; i < len; i += MMU_PAGESIZE)
! 				xen_flush_va((caddr_t)(va + i));
  		}
  #else
! 		(void) hati_demap_func((xc_arg_t)hat,
! 		    (xc_arg_t)va, (xc_arg_t)len);
  #endif
  
  	} else {
  
  		CPUSET_ADD(cpus_to_shootdown, CPU->cpu_id);
  #ifdef __xpv
! 		if (va == DEMAP_ALL_ADDR) {
  			xen_gflush_tlb(cpus_to_shootdown);
  		} else {
! 			for (size_t i = 0; i < len; i += MMU_PAGESIZE) {
! 				xen_gflush_va((caddr_t)(va + i),
  				    cpus_to_shootdown);
  			}
  		}
  #else
! 		xc_call((xc_arg_t)hat, (xc_arg_t)va, (xc_arg_t)len,
  		    CPUSET2BV(cpus_to_shootdown), hati_demap_func);
  #endif
  
  	}
  	kpreempt_enable();
--- 2719,2755 ----
  	if (CPUSET_ISNULL(cpus_to_shootdown) ||
  	    CPUSET_ISEQUAL(cpus_to_shootdown, justme)) {
  #ifdef __xpv
! 		if (range.tr_va == DEMAP_ALL_ADDR) {
  			xen_flush_tlb();
  		} else {
! 			for (size_t i = 0; i < TLB_RANGE_LEN(&range);
! 			    i += MMU_PAGESIZE) {
! 				xen_flush_va((caddr_t)(range.tr_va + i));
  			}
+ 		}
  #else
! 		(void) hati_demap_func((xc_arg_t)hat, (xc_arg_t)&range, 0);
  #endif
  
  	} else {
  
  		CPUSET_ADD(cpus_to_shootdown, CPU->cpu_id);
  #ifdef __xpv
! 		if (range.tr_va == DEMAP_ALL_ADDR) {
  			xen_gflush_tlb(cpus_to_shootdown);
  		} else {
! 			for (size_t i = 0; i < TLB_RANGE_LEN(&range);
! 			    i += MMU_PAGESIZE) {
! 				xen_gflush_va((caddr_t)(range.tr_va + i),
  				    cpus_to_shootdown);
  			}
  		}
  #else
! 		xc_call((xc_arg_t)hat, (xc_arg_t)&range, 0,
  		    CPUSET2BV(cpus_to_shootdown), hati_demap_func);
  #endif
  
  	}
  	kpreempt_enable();
*** 2159,2169 ****
  }
  
  void
  hat_tlb_inval(hat_t *hat, uintptr_t va)
  {
! 	hat_tlb_inval_range(hat, va, MMU_PAGESIZE);
  }
  
  /*
   * Interior routine for HAT_UNLOADs from hat_unload_callback(),
   * hat_kmap_unload() OR from hat_steal() code. This routine doesn't
--- 2756,2774 ----
  }
  
  void
  hat_tlb_inval(hat_t *hat, uintptr_t va)
  {
! 	/*
! 	 * Create range for a single page.
! 	 */
! 	tlb_range_t range;
! 	range.tr_va = va;
! 	range.tr_cnt = 1;			/* one page */
! 	range.tr_level = MIN_PAGE_LEVEL;	/* pages are MMU_PAGESIZE */
! 
! 	hat_tlb_inval_range(hat, &range);
  }
  
  /*
   * Interior routine for HAT_UNLOADs from hat_unload_callback(),
   * hat_kmap_unload() OR from hat_steal() code. This routine doesn't
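A quick consistency check on this rewrite: assuming MIN_PAGE_LEVEL is level 0 and LEVEL_SHIFT(0) is the 4K page shift (12 on x86), the new single-page range covers exactly the span the old two-argument call expressed:

	/*
	 * With tr_cnt == 1 and tr_level == 0 (assumed values):
	 *
	 *	TLB_RANGE_LEN(&range) == 1 << 12 == 4096 == MMU_PAGESIZE
	 *
	 * so existing callers of hat_tlb_inval() see no behavioral change.
	 */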
*** 2326,2361 ****
  	}
  	XPV_ALLOW_MIGRATE();
  }
  
  /*
-  * Do the callbacks for ranges being unloaded.
-  */
- typedef struct range_info {
- 	uintptr_t	rng_va;
- 	ulong_t		rng_cnt;
- 	level_t		rng_level;
- } range_info_t;
- 
- /*
   * Invalidate the TLB, and perform the callback to the upper level VM system,
   * for the specified ranges of contiguous pages.
   */
  static void
! handle_ranges(hat_t *hat, hat_callback_t *cb, uint_t cnt, range_info_t *range)
  {
  	while (cnt > 0) {
- 		size_t len;
- 
  		--cnt;
! 		len = range[cnt].rng_cnt << LEVEL_SHIFT(range[cnt].rng_level);
! 		hat_tlb_inval_range(hat, (uintptr_t)range[cnt].rng_va, len);
  
  		if (cb != NULL) {
! 			cb->hcb_start_addr = (caddr_t)range[cnt].rng_va;
  			cb->hcb_end_addr = cb->hcb_start_addr;
! 			cb->hcb_end_addr += len;
  			cb->hcb_function(cb);
  		}
  	}
  }
--- 2931,2955 ----
  	}
  	XPV_ALLOW_MIGRATE();
  }
  
  /*
   * Invalidate the TLB, and perform the callback to the upper level VM system,
   * for the specified ranges of contiguous pages.
   */
  static void
! handle_ranges(hat_t *hat, hat_callback_t *cb, uint_t cnt, tlb_range_t *range)
  {
  	while (cnt > 0) {
  		--cnt;
! 		hat_tlb_inval_range(hat, &range[cnt]);
  
  		if (cb != NULL) {
! 			cb->hcb_start_addr = (caddr_t)range[cnt].tr_va;
  			cb->hcb_end_addr = cb->hcb_start_addr;
! 			cb->hcb_end_addr += range[cnt].tr_cnt <<
! 			    LEVEL_SHIFT(range[cnt].tr_level);
  			cb->hcb_function(cb);
  		}
  	}
  }
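The callback length is now computed from the range itself, which generalizes cleanly to large pages. A small stand-alone example of the arithmetic; the shift table is the standard amd64 one (4K/2M/1G) and everything here is illustrative rather than kernel code:

	#include <stdio.h>

	/* amd64 page sizes by page table level: 4K, 2M, 1G. */
	static const int level_shift[] = { 12, 21, 30 };

	int
	main(void)
	{
		unsigned long tr_cnt = 3;	/* three contiguous mappings */
		int tr_level = 1;		/* level 1 => 2MB pages */

		/* 3 << 21 == 6291456 bytes (6MB) reported to the callback */
		printf("%lu\n", tr_cnt << level_shift[tr_level]);
		return (0);
	}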
*** 2381,2391 ****
  	uintptr_t	vaddr = (uintptr_t)addr;
  	uintptr_t	eaddr = vaddr + len;
  	htable_t	*ht = NULL;
  	uint_t		entry;
  	uintptr_t	contig_va = (uintptr_t)-1L;
! 	range_info_t	r[MAX_UNLOAD_CNT];
  	uint_t		r_cnt = 0;
  	x86pte_t	old_pte;
  
  	XPV_DISALLOW_MIGRATE();
  	ASSERT(hat == kas.a_hat || eaddr <= _userlimit);
--- 2975,2985 ----
  	uintptr_t	vaddr = (uintptr_t)addr;
  	uintptr_t	eaddr = vaddr + len;
  	htable_t	*ht = NULL;
  	uint_t		entry;
  	uintptr_t	contig_va = (uintptr_t)-1L;
! 	tlb_range_t	r[MAX_UNLOAD_CNT];
  	uint_t		r_cnt = 0;
  	x86pte_t	old_pte;
  
  	XPV_DISALLOW_MIGRATE();
  	ASSERT(hat == kas.a_hat || eaddr <= _userlimit);
*** 2421,2438 ****
  		/*
  		 * We'll do the call backs for contiguous ranges
  		 */
  		if (vaddr != contig_va ||
! 		    (r_cnt > 0 && r[r_cnt - 1].rng_level != ht->ht_level)) {
  			if (r_cnt == MAX_UNLOAD_CNT) {
  				handle_ranges(hat, cb, r_cnt, r);
  				r_cnt = 0;
  			}
! 			r[r_cnt].rng_va = vaddr;
! 			r[r_cnt].rng_cnt = 0;
! 			r[r_cnt].rng_level = ht->ht_level;
  			++r_cnt;
  		}
  
  		/*
  		 * Unload one mapping (for a single page) from the page tables.
--- 3015,3032 ----
  		/*
  		 * We'll do the call backs for contiguous ranges
  		 */
  		if (vaddr != contig_va ||
! 		    (r_cnt > 0 && r[r_cnt - 1].tr_level != ht->ht_level)) {
  			if (r_cnt == MAX_UNLOAD_CNT) {
  				handle_ranges(hat, cb, r_cnt, r);
  				r_cnt = 0;
  			}
! 			r[r_cnt].tr_va = vaddr;
! 			r[r_cnt].tr_cnt = 0;
! 			r[r_cnt].tr_level = ht->ht_level;
  			++r_cnt;
  		}
  
  		/*
  		 * Unload one mapping (for a single page) from the page tables.
*** 2446,2456 ****
  		entry = htable_va2entry(vaddr, ht);
  		hat_pte_unmap(ht, entry, flags, old_pte, NULL, B_FALSE);
  		ASSERT(ht->ht_level <= mmu.max_page_level);
  		vaddr += LEVEL_SIZE(ht->ht_level);
  		contig_va = vaddr;
! 		++r[r_cnt - 1].rng_cnt;
  	}
  	if (ht)
  		htable_release(ht);
  
  	/*
--- 3040,3050 ----
  		entry = htable_va2entry(vaddr, ht);
  		hat_pte_unmap(ht, entry, flags, old_pte, NULL, B_FALSE);
  		ASSERT(ht->ht_level <= mmu.max_page_level);
  		vaddr += LEVEL_SIZE(ht->ht_level);
  		contig_va = vaddr;
! 		++r[r_cnt - 1].tr_cnt;
  	}
  	if (ht)
  		htable_release(ht);
  
  	/*
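Taken together, the three hunks above coalesce contiguous, same-level unmaps into at most MAX_UNLOAD_CNT ranges before flushing. A reduced, hypothetical rendering of that coalescing logic (the overflow call to handle_ranges() is omitted for brevity, and the caller must size 'out' appropriately):

	#include <stddef.h>
	#include <stdint.h>

	typedef struct {
		uintptr_t	va;
		unsigned long	cnt;
		int		level;
	} ex_range_t;

	/* On amd64, LEVEL_SHIFT(l) is 12 + 9 * l. */
	#define	EX_LEVEL_SIZE(l)	(1UL << (12 + 9 * (l)))

	static unsigned
	coalesce(const uintptr_t *vas, const int *levels, size_t n,
	    ex_range_t *out)
	{
		unsigned r_cnt = 0;
		uintptr_t contig = (uintptr_t)-1L;

		for (size_t i = 0; i < n; i++) {
			/* a gap or a page-size change starts a new range */
			if (vas[i] != contig || (r_cnt > 0 &&
			    out[r_cnt - 1].level != levels[i])) {
				out[r_cnt].va = vas[i];
				out[r_cnt].cnt = 0;
				out[r_cnt].level = levels[i];
				r_cnt++;
			}
			contig = vas[i] + EX_LEVEL_SIZE(levels[i]);
			out[r_cnt - 1].cnt++;
		}
		return (r_cnt);
	}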
*** 2475,2492 ****
  		sz = hat_getpagesize(hat, va);
  		if (sz < 0) {
  #ifdef __xpv
  			xen_flush_tlb();
  #else
! 			flush_all_tlb_entries();
  #endif
  			break;
  		}
  #ifdef __xpv
  		xen_flush_va(va);
  #else
! 		mmu_tlbflush_entry(va);
  #endif
  		va += sz;
  	}
  }
--- 3069,3086 ----
  		sz = hat_getpagesize(hat, va);
  		if (sz < 0) {
  #ifdef __xpv
  			xen_flush_tlb();
  #else
! 			mmu_flush_tlb(FLUSH_TLB_ALL, NULL);
  #endif
  			break;
  		}
  #ifdef __xpv
  		xen_flush_va(va);
  #else
! 		mmu_flush_tlb_kpage((uintptr_t)va);
  #endif
  		va += sz;
  	}
  }
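The shape of this loop is worth calling out: it walks the VA range by whatever page size is actually mapped at each address, and falls back to a single full invalidation the moment it hits an unmapped address (hat_getpagesize() returning a negative value). A hypothetical reduction, with callbacks standing in for hat_getpagesize() and the MMU flush routines:

	#include <stddef.h>
	#include <stdint.h>

	static void
	flush_range(uintptr_t va, size_t len,
	    long (*getpagesize)(uintptr_t),
	    void (*flush_page)(uintptr_t), void (*flush_all)(void))
	{
		uintptr_t end = va + len;

		while (va < end) {
			long sz = getpagesize(va);

			if (sz < 0) {
				/* hole in the mappings: flush everything */
				flush_all();
				return;
			}
			flush_page(va);
			va += (uintptr_t)sz;
		}
	}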
*** 3148,3158 ****
  		}
  	}
  
  	/*
  	 * flush the TLBs - since we're probably dealing with MANY mappings
! 	 * we do just one CR3 reload.
  	 */
  	if (!(hat->hat_flags & HAT_FREEING) && need_demaps)
  		hat_tlb_inval(hat, DEMAP_ALL_ADDR);
  
  	/*
--- 3742,3752 ----
  		}
  	}
  
  	/*
  	 * flush the TLBs - since we're probably dealing with MANY mappings
! 	 * we just do a full invalidation.
  	 */
  	if (!(hat->hat_flags & HAT_FREEING) && need_demaps)
  		hat_tlb_inval(hat, DEMAP_ALL_ADDR);
  
  	/*
*** 3931,3941 ****
  		    (pte_pa & MMU_PAGEOFFSET) >> mmu.pte_size_shift, NULL);
  		if (mmu.pae_hat)
  			*pteptr = 0;
  		else
  			*(x86pte32_t *)pteptr = 0;
! 		mmu_tlbflush_entry(addr);
  		x86pte_mapout();
  	}
  #endif
  
  	ht = htable_getpte(kas.a_hat, ALIGN2PAGE(addr), NULL, NULL, 0);
--- 4525,4535 ----
  		    (pte_pa & MMU_PAGEOFFSET) >> mmu.pte_size_shift, NULL);
  		if (mmu.pae_hat)
  			*pteptr = 0;
  		else
  			*(x86pte32_t *)pteptr = 0;
! 		mmu_flush_tlb_kpage((uintptr_t)addr);
  		x86pte_mapout();
  	}
  #endif
  
  	ht = htable_getpte(kas.a_hat, ALIGN2PAGE(addr), NULL, NULL, 0);
*** 3992,4002 ****
  		    (pte_pa & MMU_PAGEOFFSET) >> mmu.pte_size_shift, NULL);
  		if (mmu.pae_hat)
  			*(x86pte_t *)pteptr = pte;
  		else
  			*(x86pte32_t *)pteptr = (x86pte32_t)pte;
! 		mmu_tlbflush_entry(addr);
  		x86pte_mapout();
  	}
  #endif
  	XPV_ALLOW_MIGRATE();
  }
--- 4586,4596 ----
  		    (pte_pa & MMU_PAGEOFFSET) >> mmu.pte_size_shift, NULL);
  		if (mmu.pae_hat)
  			*(x86pte_t *)pteptr = pte;
  		else
  			*(x86pte32_t *)pteptr = (x86pte32_t)pte;
! 		mmu_flush_tlb_kpage((uintptr_t)addr);
  		x86pte_mapout();
  	}
  #endif
  	XPV_ALLOW_MIGRATE();
  }
*** 4026,4036 ****
  void
  hat_cpu_online(struct cpu *cpup)
  {
  	if (cpup != CPU) {
  		x86pte_cpu_init(cpup);
! 		hat_vlp_setup(cpup);
  	}
  	CPUSET_ATOMIC_ADD(khat_cpuset, cpup->cpu_id);
  }
  
  /*
--- 4620,4630 ----
  void
  hat_cpu_online(struct cpu *cpup)
  {
  	if (cpup != CPU) {
  		x86pte_cpu_init(cpup);
! 		hat_pcp_setup(cpup);
  	}
  	CPUSET_ATOMIC_ADD(khat_cpuset, cpup->cpu_id);
  }
  
  /*
*** 4041,4051 ****
  hat_cpu_offline(struct cpu *cpup)
  {
  	ASSERT(cpup != CPU);
  	CPUSET_ATOMIC_DEL(khat_cpuset, cpup->cpu_id);
! 	hat_vlp_teardown(cpup);
  	x86pte_cpu_fini(cpup);
  }
  
  /*
   * Function called after all CPUs are brought online.
--- 4635,4645 ----
  hat_cpu_offline(struct cpu *cpup)
  {
  	ASSERT(cpup != CPU);
  	CPUSET_ATOMIC_DEL(khat_cpuset, cpup->cpu_id);
! 	hat_pcp_teardown(cpup);
  	x86pte_cpu_fini(cpup);
  }
  
  /*
   * Function called after all CPUs are brought online.
*** 4488,4492 ****
--- 5082,5115 ----
  	htable_release(ht);
  	htable_release(ht);
  	XPV_ALLOW_MIGRATE();
  }
  #endif	/* __xpv */
+ 
+ /*
+  * Helper function to punch in a mapping that we need with the specified
+  * attributes.
+  */
+ void
+ hati_cpu_punchin(cpu_t *cpu, uintptr_t va, uint_t attrs)
+ {
+ 	int ret;
+ 	pfn_t pfn;
+ 	hat_t *cpu_hat = cpu->cpu_hat_info->hci_user_hat;
+ 
+ 	ASSERT3S(kpti_enable, ==, 1);
+ 	ASSERT3P(cpu_hat, !=, NULL);
+ 	ASSERT3U(cpu_hat->hat_flags & HAT_PCP, ==, HAT_PCP);
+ 	ASSERT3U(va & MMU_PAGEOFFSET, ==, 0);
+ 
+ 	pfn = hat_getpfnum(kas.a_hat, (caddr_t)va);
+ 	VERIFY3U(pfn, !=, PFN_INVALID);
+ 
+ 	/*
+ 	 * We purposefully don't try to find the page_t. This means that this
+ 	 * will be marked PT_NOCONSIST; however, given that this is pretty much
+ 	 * a static mapping that we're using we should be relatively OK.
+ 	 */
+ 	attrs |= HAT_STORECACHING_OK;
+ 	ret = hati_load_common(cpu_hat, va, NULL, attrs, 0, 0, pfn);
+ 	VERIFY3S(ret, ==, 0);
+ }
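For context, a hedged illustration of how a caller might use the new helper during KPTI setup: punching a single page of trampoline text into a CPU's user-mode HAT. The symbol name below is invented for the example (the real callers live in the per-CPU page table setup code), and the PROT_* attributes are the usual HAT protection bits:

	extern char tramp_text_page[];	/* hypothetical, page-aligned text */

	void
	example_punchin(cpu_t *cpu)
	{
		/*
		 * Make the page visible read-only and executable in the
		 * per-CPU user hat; hati_cpu_punchin() resolves the PFN
		 * from the kernel hat and VERIFYs that the load succeeds.
		 */
		hati_cpu_punchin(cpu, (uintptr_t)tramp_text_page,
		    PROT_READ | PROT_EXEC);
	}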