8956 Implement KPTI
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
9208 hati_demap_func should take pagesize into account
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Tim Kordas <tim.kordas@joyent.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>

*** 25,34 **** --- 25,35 ---- * Copyright (c) 2010, Intel Corporation. * All rights reserved. */ /* * Copyright 2011 Nexenta Systems, Inc. All rights reserved. + * Copyright 2018 Joyent, Inc. All rights reserved. * Copyright (c) 2014, 2015 by Delphix. All rights reserved. */ /* * VM - Hardware Address Translation management for i386 and amd64
*** 40,49 ****
--- 41,235 ----
  * that work in conjunction with this code.
  *
  * Routines used only inside of i86pc/vm start with hati_ for HAT Internal.
  */
+ /*
+ * amd64 HAT Design
+ *
+ * ----------
+ * Background
+ * ----------
+ *
+ * On x86, the address space is shared between a user process and the kernel.
+ * This is different from SPARC. Conventionally, the kernel lives at the top of
+ * the address space and the user process gets to enjoy the rest of it. If you
+ * look at the image of the address map in uts/i86pc/os/startup.c, you'll get a
+ * rough sense of how the address space is laid out and used.
+ *
+ * Every unique address space is represented by an instance of a HAT structure
+ * called a 'hat_t'. In addition to a hat_t structure for each process, there is
+ * also one that is used for the kernel (kas.a_hat), and each CPU ultimately
+ * also has a HAT.
+ *
+ * Each HAT contains a pointer to its root page table. This root page table is
+ * what we call an L3 page table in illumos and Intel calls the PML4. It is the
+ * physical address of the L3 table that we place in the %cr3 register which the
+ * processor uses.
+ *
+ * Each of the many layers of the page table is represented by a structure
+ * called an htable_t. The htable_t manages a set of 512 8-byte entries. The
+ * number of entries in a given page table is constant across all different
+ * level page tables. Note that this is only true on amd64; this has not always
+ * been the case on x86.
+ *
+ * Each entry in a page table, generally referred to as a PTE, may refer to
+ * another page table or a memory location, depending on the level of the page
+ * table and the use of large pages. Importantly, the top-level L3 page table
+ * (PML4) only supports linking to further page tables. This is also true on
+ * systems which support a 5th level page table (which we do not currently
+ * support).
+ *
+ * Historically, on x86, when a process was running on a CPU, the root of its
+ * page table was inserted into %cr3 on each CPU on which it was currently
+ * running. When processes would switch (by calling hat_switch()), then the
+ * value in %cr3 on that CPU would change to that of the new HAT. While this
+ * behavior is still maintained in the xpv kernel, this is not what is done
+ * today.
+ *
+ * -------------------
+ * Per-CPU Page Tables
+ * -------------------
+ *
+ * Throughout the system, the 64-bit kernel has a notion of what it calls a
+ * per-CPU page table or PCP. The notion of a per-CPU page table was introduced
+ * as part of the original work to support x86 PAE. On the 64-bit kernel, it
+ * was originally used for 32-bit processes. The rationale behind this was that
+ * each 32-bit process could have all of its memory represented in a single L2
+ * page table, as each L2 page table entry represents 1 GB of memory.
+ *
+ * Following on from this, the idea was that, given that all of the L3 page
+ * table entries for 32-bit processes are basically going to be identical
+ * except for the first entry in the page table, why not share those page
+ * table entries? This gave rise to the idea of a per-CPU page table.
+ *
+ * The way this works is that we have a member in the machcpu_t called the
+ * mcpu_hat_info. That structure contains two different 4k pages: one that
+ * represents the L3 page table and one that represents an L2 page table. When
+ * the CPU starts up, the L3 page table entries are copied in from the kernel's
+ * page table.
+ * The L3 kernel entries do not change throughout the lifetime of the kernel.
+ * The kernel portion of these L3 pages for each CPU has the same entries,
+ * meaning that all CPUs point to the same L2 page tables and thus see a
+ * consistent view of the world.
+ *
+ * When a 32-bit process is loaded into this world, we copy the 32-bit process's
+ * four top-level page table entries into the CPU's L2 page table and then set
+ * the CPU's first L3 page table entry to point to the CPU's L2 page.
+ * Specifically, in hat_pcp_update(), we're copying from the process's
+ * HAT_COPIED_32 HAT into the page tables specific to this CPU.
+ *
+ * As part of the implementation of kernel page table isolation, this was also
+ * extended to 64-bit processes. When a 64-bit process runs, we'll copy its L3
+ * PTEs across into the current CPU's L3 page table. (As we can't do the
+ * first-L3-entry trick for 64-bit processes, ->hci_pcp_l2ptes is unused in
+ * this case.)
+ *
+ * The use of per-CPU page tables has a lot of implementation ramifications. A
+ * HAT that runs a user process will be flagged with the HAT_COPIED flag to
+ * indicate that it is using the per-CPU page table functionality. In tandem
+ * with the HAT, the top-level htable_t will be flagged with the HTABLE_COPIED
+ * flag. If the HAT represents a 32-bit process, then we will also set the
+ * HAT_COPIED_32 flag on that hat_t.
+ *
+ * These two flags work together. The top-level htable_t when using per-CPU
+ * page tables is 'virtual'. We never allocate a ptable for this htable_t (i.e.
+ * ht->ht_pfn is PFN_INVALID). Instead, when we need to modify a PTE in an
+ * HTABLE_COPIED ptable, x86pte_access_pagetable() will redirect any accesses
+ * to ht_hat->hat_copied_ptes.
+ *
+ * Of course, such a modification won't actually modify the HAT_PCP page tables
+ * that were copied from the HAT_COPIED htable. When we change the top level
+ * page table entries (L2 PTEs for a 32-bit process and L3 PTEs for a 64-bit
+ * process), we need to make sure to trigger hat_pcp_update() on all CPUs that
+ * are currently tied to this HAT (including the current CPU).
+ *
+ * To do this, PCP piggy-backs on TLB invalidation, specifically via the
+ * hat_tlb_inval() path from link_ptp() and unlink_ptp().
+ *
+ * (Importantly, when this is in operation, the top-level entry should not be
+ * able to refer to an actual page table entry that can be changed and
+ * consolidated into a large page. If large page consolidation is required
+ * here, then there will be much that needs to be reconsidered.)
+ *
+ * -----------------------------------------------
+ * Kernel Page Table Isolation and the Per-CPU HAT
+ * -----------------------------------------------
+ *
+ * All Intel CPUs that support speculative execution and paging are subject to
+ * a series of bugs that have been termed 'Meltdown'. These exploits allow a
+ * user process to read kernel memory through cache side channels and
+ * speculative execution. To mitigate this on vulnerable CPUs, we need to use a
+ * technique called kernel page table isolation. What this requires is that we
+ * have two different page table roots. When executing in kernel mode, we will
+ * use a %cr3 value that has both the user and kernel pages. However, when
+ * executing in user mode, we will need a %cr3 that has all of the user pages
+ * but only the subset of the kernel pages required to operate.
+ *
+ * The kernel pages that we need mapped are:
+ *
+ *   o Kernel Text that allows us to switch between the cr3 values.
+ *   o The current global descriptor table (GDT)
+ *   o The current interrupt descriptor table (IDT)
+ *   o The current task switching state (TSS)
+ *   o The current local descriptor table (LDT)
+ *   o Stacks and scratch space used by the interrupt handlers
+ *
+ * For more information on the stack switching techniques, construction of the
+ * trampolines, and more, please see i86pc/ml/kpti_trampolines.s. The most
+ * important parts of these mappings are the following two constraints:
+ *
+ *   o The mappings are all per-CPU (except for read-only text)
+ *   o The mappings are static. They are all established before the CPU is
+ *     started (with the exception of the boot CPU).
+ *
+ * To facilitate kernel page table isolation, we employ the per-CPU page
+ * tables discussed in the previous section and add the notion of a per-CPU
+ * HAT. Fundamentally, we have a second page table root. There is both a kernel
+ * page table (hci_pcp_l3ptes) and a user L3 page table (hci_user_l3ptes).
+ * Both will have the user page table entries copied into them, the same way
+ * that we discussed in the section 'Per-CPU Page Tables'.
+ *
+ * The complex part of this is how we construct the set of kernel mappings
+ * that should be present when running with the user page table. To answer
+ * that, we add the notion of a per-CPU HAT. This HAT functions like a normal
+ * HAT, except that it's not really associated with an address space the same
+ * way that other HATs are.
+ *
+ * This HAT lives in the 'struct hat_cpu_info', a member of the machcpu, as the
+ * member hci_user_hat. We use this per-CPU HAT to create the set of kernel
+ * mappings that should be present on this CPU. The kernel mappings are added
+ * to the per-CPU HAT through the function hati_cpu_punchin(). Once a mapping
+ * has been punched in, it may not be punched out. The reason that we opt to
+ * leverage a HAT structure is that it knows how to allocate and manage all of
+ * the lower level page tables as required.
+ *
+ * Because all of the mappings are present at the beginning of time for this
+ * CPU and none of the mappings are in the kernel pageable segment, we don't
+ * have to worry about faulting on these HAT structures, and thus the notion of
+ * the current HAT that we're using is always the appropriate HAT for the
+ * process (usually a user HAT or the kernel's HAT).
+ *
+ * A further constraint we place on the system with these per-CPU HATs is that
+ * they are not subject to htable_steal(). Because each CPU has a rather fixed
+ * number of page tables, and in the same way that we don't steal from the
+ * kernel's HAT, it was determined that we should not steal from this HAT due
+ * to the complications involved and the somewhat criminal nature of
+ * htable_steal().
+ *
+ * The per-CPU HAT is initialized in hat_pcp_setup() which is called as part of
+ * onlining the CPU, but before the CPU is actually started. The per-CPU HAT is
+ * removed in hat_pcp_teardown() which is called when a CPU is being offlined
+ * to be removed from the system (which is different from what psradm usually
+ * does).
+ *
+ * Finally, once the CPU has been onlined, the set of mappings in the per-CPU
+ * HAT must not change. The HAT-related functions that we call are not meant to
+ * be called when we're switching between processes.
+ * For example, it is quite possible that if they were, they would try to grab
+ * an htable mutex that another thread might hold. One needs to treat
+ * hat_switch() as though it were running above LOCK_LEVEL; it therefore
+ * _must not_ block under any circumstance.
+ */
+
  #include <sys/machparam.h>
  #include <sys/machsystm.h>
  #include <sys/mman.h>
  #include <sys/types.h>
  #include <sys/systm.h>
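To make the Background section concrete: a virtual address selects one 9-bit index per page table level, using the level shifts that mmu_init() establishes below (12, 21, 30, and 39 on amd64) and 512-entry tables. A minimal standalone sketch of that decomposition (illustrative only; the constants and sample address here are assumptions, not code from this change):

	#include <stdint.h>
	#include <stdio.h>

	/* amd64 level shifts, as set up in mmu_init(): L0..L3. */
	static const int level_shift[4] = { 12, 21, 30, 39 };

	static unsigned int
	va_level_index(uint64_t va, int level)
	{
		/* 512 entries per table, so each index is 9 bits. */
		return ((unsigned int)((va >> level_shift[level]) & 0x1ff));
	}

	int
	main(void)
	{
		uint64_t va = 0x7fffbeef1000ULL;	/* hypothetical address */
		int l;

		for (l = 3; l >= 0; l--)
			printf("L%d index: %u\n", l, va_level_index(va, l));
		return (0);
	}

For the hypothetical address above, the L3 (PML4) index is 255; each lower level contributes its own 9-bit index in the same fashion.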
*** 93,123 **** /* * The page that is the kernel's top level pagetable. * * For 32 bit PAE support on i86pc, the kernel hat will use the 1st 4 entries * on this 4K page for its top level page table. The remaining groups of ! * 4 entries are used for per processor copies of user VLP pagetables for * running threads. See hat_switch() and reload_pae32() for details. * ! * vlp_page[0..3] - level==2 PTEs for kernel HAT ! * vlp_page[4..7] - level==2 PTEs for user thread on cpu 0 ! * vlp_page[8..11] - level==2 PTE for user thread on cpu 1 * etc... */ ! static x86pte_t *vlp_page; /* * forward declaration of internal utility routines */ static x86pte_t hati_update_pte(htable_t *ht, uint_t entry, x86pte_t expected, x86pte_t new); /* ! * The kernel address space exists in all HATs. To implement this the ! * kernel reserves a fixed number of entries in the topmost level(s) of page ! * tables. The values are setup during startup and then copied to every user ! * hat created by hat_alloc(). This means that kernelbase must be: * * 4Meg aligned for 32 bit kernels * 512Gig aligned for x86_64 64 bit kernel * * The hat_kernel_range_ts describe what needs to be copied from kernel hat --- 279,312 ---- /* * The page that is the kernel's top level pagetable. * * For 32 bit PAE support on i86pc, the kernel hat will use the 1st 4 entries * on this 4K page for its top level page table. The remaining groups of ! * 4 entries are used for per processor copies of user PCP pagetables for * running threads. See hat_switch() and reload_pae32() for details. * ! * pcp_page[0..3] - level==2 PTEs for kernel HAT ! * pcp_page[4..7] - level==2 PTEs for user thread on cpu 0 ! * pcp_page[8..11] - level==2 PTE for user thread on cpu 1 * etc... + * + * On the 64-bit kernel, this is the normal root of the page table and there is + * nothing special about it when used for other CPUs. */ ! static x86pte_t *pcp_page; /* * forward declaration of internal utility routines */ static x86pte_t hati_update_pte(htable_t *ht, uint_t entry, x86pte_t expected, x86pte_t new); /* ! * The kernel address space exists in all non-HAT_COPIED HATs. To implement this ! * the kernel reserves a fixed number of entries in the topmost level(s) of page ! * tables. The values are setup during startup and then copied to every user hat ! * created by hat_alloc(). This means that kernelbase must be: * * 4Meg aligned for 32 bit kernels * 512Gig aligned for x86_64 64 bit kernel * * The hat_kernel_range_ts describe what needs to be copied from kernel hat
*** 168,178 **** */ kmutex_t hat_list_lock; kcondvar_t hat_list_cv; kmem_cache_t *hat_cache; kmem_cache_t *hat_hash_cache; ! kmem_cache_t *vlp_hash_cache; /* * Simple statistics */ struct hatstats hatstat; --- 357,367 ---- */ kmutex_t hat_list_lock; kcondvar_t hat_list_cv; kmem_cache_t *hat_cache; kmem_cache_t *hat_hash_cache; ! kmem_cache_t *hat32_hash_cache; /* * Simple statistics */ struct hatstats hatstat;
*** 186,201 **** * HAT uses cmpxchg() and the other paths (hypercall etc.) were never * incorrect. */ int pt_kern; - /* - * useful stuff for atomic access/clearing/setting REF/MOD/RO bits in page_t's. - */ - extern void atomic_orb(uchar_t *addr, uchar_t val); - extern void atomic_andb(uchar_t *addr, uchar_t val); - #ifndef __xpv extern pfn_t memseg_get_start(struct memseg *); #endif #define PP_GETRM(pp, rmmask) (pp->p_nrm & rmmask) --- 375,384 ----
*** 234,259 **** hat->hat_ht_hash = NULL; return (0); } /* * Allocate a hat structure for as. We also create the top level * htable and initialize it to contain the kernel hat entries. */ hat_t * hat_alloc(struct as *as) { hat_t *hat; htable_t *ht; /* top level htable */ ! uint_t use_vlp; uint_t r; hat_kernel_range_t *rp; uintptr_t va; uintptr_t eva; uint_t start; uint_t cnt; htable_t *src; /* * Once we start creating user process HATs we can enable * the htable_steal() code. */ --- 417,469 ---- hat->hat_ht_hash = NULL; return (0); } /* + * Put it at the start of the global list of all hats (used by stealing) + * + * kas.a_hat is not in the list but is instead used to find the + * first and last items in the list. + * + * - kas.a_hat->hat_next points to the start of the user hats. + * The list ends where hat->hat_next == NULL + * + * - kas.a_hat->hat_prev points to the last of the user hats. + * The list begins where hat->hat_prev == NULL + */ + static void + hat_list_append(hat_t *hat) + { + mutex_enter(&hat_list_lock); + hat->hat_prev = NULL; + hat->hat_next = kas.a_hat->hat_next; + if (hat->hat_next) + hat->hat_next->hat_prev = hat; + else + kas.a_hat->hat_prev = hat; + kas.a_hat->hat_next = hat; + mutex_exit(&hat_list_lock); + } + + /* * Allocate a hat structure for as. We also create the top level * htable and initialize it to contain the kernel hat entries. */ hat_t * hat_alloc(struct as *as) { hat_t *hat; htable_t *ht; /* top level htable */ ! uint_t use_copied; uint_t r; hat_kernel_range_t *rp; uintptr_t va; uintptr_t eva; uint_t start; uint_t cnt; htable_t *src; + boolean_t use_hat32_cache; /* * Once we start creating user process HATs we can enable * the htable_steal() code. */
*** 266,299 **** mutex_init(&hat->hat_mutex, NULL, MUTEX_DEFAULT, NULL); ASSERT(hat->hat_flags == 0); #if defined(__xpv) /* ! * No VLP stuff on the hypervisor due to the 64-bit split top level * page tables. On 32-bit it's not needed as the hypervisor takes * care of copying the top level PTEs to a below 4Gig page. */ ! use_vlp = 0; #else /* __xpv */ ! /* 32 bit processes uses a VLP style hat when running with PAE */ ! #if defined(__amd64) ! use_vlp = (ttoproc(curthread)->p_model == DATAMODEL_ILP32); ! #elif defined(__i386) ! use_vlp = mmu.pae_hat; ! #endif #endif /* __xpv */ ! if (use_vlp) { ! hat->hat_flags = HAT_VLP; ! bzero(hat->hat_vlp_ptes, VLP_SIZE); } /* ! * Allocate the htable hash */ ! if ((hat->hat_flags & HAT_VLP)) { ! hat->hat_num_hash = mmu.vlp_hash_cnt; ! hat->hat_ht_hash = kmem_cache_alloc(vlp_hash_cache, KM_SLEEP); } else { hat->hat_num_hash = mmu.hash_cnt; hat->hat_ht_hash = kmem_cache_alloc(hat_hash_cache, KM_SLEEP); } bzero(hat->hat_ht_hash, hat->hat_num_hash * sizeof (htable_t *)); --- 476,538 ---- mutex_init(&hat->hat_mutex, NULL, MUTEX_DEFAULT, NULL); ASSERT(hat->hat_flags == 0); #if defined(__xpv) /* ! * No PCP stuff on the hypervisor due to the 64-bit split top level * page tables. On 32-bit it's not needed as the hypervisor takes * care of copying the top level PTEs to a below 4Gig page. */ ! use_copied = 0; ! use_hat32_cache = B_FALSE; ! hat->hat_max_level = mmu.max_level; ! hat->hat_num_copied = 0; ! hat->hat_flags = 0; #else /* __xpv */ ! ! /* ! * All processes use HAT_COPIED on the 64-bit kernel if KPTI is ! * turned on. ! */ ! if (ttoproc(curthread)->p_model == DATAMODEL_ILP32) { ! use_copied = 1; ! hat->hat_max_level = mmu.max_level32; ! hat->hat_num_copied = mmu.num_copied_ents32; ! use_hat32_cache = B_TRUE; ! hat->hat_flags |= HAT_COPIED_32; ! HATSTAT_INC(hs_hat_copied32); ! } else if (kpti_enable == 1) { ! use_copied = 1; ! hat->hat_max_level = mmu.max_level; ! hat->hat_num_copied = mmu.num_copied_ents; ! use_hat32_cache = B_FALSE; ! HATSTAT_INC(hs_hat_copied64); ! } else { ! use_copied = 0; ! use_hat32_cache = B_FALSE; ! hat->hat_max_level = mmu.max_level; ! hat->hat_num_copied = 0; ! hat->hat_flags = 0; ! HATSTAT_INC(hs_hat_normal64); ! } #endif /* __xpv */ ! if (use_copied) { ! hat->hat_flags |= HAT_COPIED; ! bzero(hat->hat_copied_ptes, sizeof (hat->hat_copied_ptes)); } /* ! * Allocate the htable hash. For 32-bit PCP processes we use the ! * hat32_hash_cache. However, for 64-bit PCP processes we do not as the ! * number of entries that they have to handle is closer to ! * hat_hash_cache in count (though there will be more wastage when we ! * have more DRAM in the system and thus push down the user address ! * range). */ ! if (use_hat32_cache) { ! hat->hat_num_hash = mmu.hat32_hash_cnt; ! hat->hat_ht_hash = kmem_cache_alloc(hat32_hash_cache, KM_SLEEP); } else { hat->hat_num_hash = mmu.hash_cnt; hat->hat_ht_hash = kmem_cache_alloc(hat_hash_cache, KM_SLEEP); } bzero(hat->hat_ht_hash, hat->hat_num_hash * sizeof (htable_t *));
*** 307,317 **** XPV_DISALLOW_MIGRATE(); ht = htable_create(hat, (uintptr_t)0, TOP_LEVEL(hat), NULL); hat->hat_htable = ht; #if defined(__amd64) ! if (hat->hat_flags & HAT_VLP) goto init_done; #endif for (r = 0; r < num_kernel_ranges; ++r) { rp = &kernel_ranges[r]; --- 546,556 ---- XPV_DISALLOW_MIGRATE(); ht = htable_create(hat, (uintptr_t)0, TOP_LEVEL(hat), NULL); hat->hat_htable = ht; #if defined(__amd64) ! if (hat->hat_flags & HAT_COPIED) goto init_done; #endif for (r = 0; r < num_kernel_ranges; ++r) { rp = &kernel_ranges[r];
*** 332,344 **** (eva > rp->hkr_end_va || eva == 0)) cnt = htable_va2entry(rp->hkr_end_va, ht) - start; #if defined(__i386) && !defined(__xpv) ! if (ht->ht_flags & HTABLE_VLP) { ! bcopy(&vlp_page[start], ! &hat->hat_vlp_ptes[start], cnt * sizeof (x86pte_t)); continue; } #endif src = htable_lookup(kas.a_hat, va, rp->hkr_level); --- 571,583 ---- (eva > rp->hkr_end_va || eva == 0)) cnt = htable_va2entry(rp->hkr_end_va, ht) - start; #if defined(__i386) && !defined(__xpv) ! if (ht->ht_flags & HTABLE_COPIED) { ! bcopy(&pcp_page[start], ! &hat->hat_copied_ptes[start], cnt * sizeof (x86pte_t)); continue; } #endif src = htable_lookup(kas.a_hat, va, rp->hkr_level);
*** 359,392 **** xen_pin(hat->hat_user_ptable, mmu.max_level); #endif #endif XPV_ALLOW_MIGRATE(); /* ! * Put it at the start of the global list of all hats (used by stealing) ! * ! * kas.a_hat is not in the list but is instead used to find the ! * first and last items in the list. ! * ! * - kas.a_hat->hat_next points to the start of the user hats. ! * The list ends where hat->hat_next == NULL ! * ! * - kas.a_hat->hat_prev points to the last of the user hats. ! * The list begins where hat->hat_prev == NULL */ ! mutex_enter(&hat_list_lock); ! hat->hat_prev = NULL; ! hat->hat_next = kas.a_hat->hat_next; ! if (hat->hat_next) ! hat->hat_next->hat_prev = hat; ! else ! kas.a_hat->hat_prev = hat; ! kas.a_hat->hat_next = hat; ! mutex_exit(&hat_list_lock); return (hat); } /* * process has finished executing but as has not been cleaned up yet. */ /*ARGSUSED*/ --- 598,655 ---- xen_pin(hat->hat_user_ptable, mmu.max_level); #endif #endif XPV_ALLOW_MIGRATE(); + hat_list_append(hat); + + return (hat); + } + + #if !defined(__xpv) + /* + * Cons up a HAT for a CPU. This represents the user mappings. This will have + * various kernel pages punched into it manually. Importantly, this hat is + * ineligible for stealing. We really don't want to deal with this ever + * faulting and figuring out that this is happening, much like we don't with + * kas. + */ + static hat_t * + hat_cpu_alloc(cpu_t *cpu) + { + hat_t *hat; + htable_t *ht; + + hat = kmem_cache_alloc(hat_cache, KM_SLEEP); + hat->hat_as = NULL; + mutex_init(&hat->hat_mutex, NULL, MUTEX_DEFAULT, NULL); + hat->hat_max_level = mmu.max_level; + hat->hat_num_copied = 0; + hat->hat_flags = HAT_PCP; + + hat->hat_num_hash = mmu.hash_cnt; + hat->hat_ht_hash = kmem_cache_alloc(hat_hash_cache, KM_SLEEP); + bzero(hat->hat_ht_hash, hat->hat_num_hash * sizeof (htable_t *)); + + hat->hat_next = hat->hat_prev = NULL; + /* ! * Because this HAT will only ever be used by the current CPU, we'll go ! * ahead and set the CPUSET up to only point to the CPU in question. */ ! CPUSET_ADD(hat->hat_cpus, cpu->cpu_id); + hat->hat_htable = NULL; + hat->hat_ht_cached = NULL; + ht = htable_create(hat, (uintptr_t)0, TOP_LEVEL(hat), NULL); + hat->hat_htable = ht; + + hat_list_append(hat); + return (hat); } + #endif /* !__xpv */ /* * process has finished executing but as has not been cleaned up yet. */ /*ARGSUSED*/
*** 439,448 **** --- 702,712 ---- #if defined(__xpv) /* * On the hypervisor, unpin top level page table(s) */ + VERIFY3U(hat->hat_flags & HAT_PCP, ==, 0); xen_unpin(hat->hat_htable->ht_pfn); #if defined(__amd64) xen_unpin(hat->hat_user_ptable); #endif #endif
*** 453,470 **** htable_purge_hat(hat); /* * Decide which kmem cache the hash table came from, then free it. */ ! if (hat->hat_flags & HAT_VLP) ! cache = vlp_hash_cache; ! else cache = hat_hash_cache; kmem_cache_free(cache, hat->hat_ht_hash); hat->hat_ht_hash = NULL; hat->hat_flags = 0; kmem_cache_free(hat_cache, hat); } /* * round kernelbase down to a supported value to use for _userlimit --- 717,745 ---- htable_purge_hat(hat); /* * Decide which kmem cache the hash table came from, then free it. */ ! if (hat->hat_flags & HAT_COPIED) { ! #if defined(__amd64) ! if (hat->hat_flags & HAT_COPIED_32) { ! cache = hat32_hash_cache; ! } else { cache = hat_hash_cache; + } + #else + cache = hat32_hash_cache; + #endif + } else { + cache = hat_hash_cache; + } kmem_cache_free(cache, hat->hat_ht_hash); hat->hat_ht_hash = NULL; hat->hat_flags = 0; + hat->hat_max_level = 0; + hat->hat_num_copied = 0; kmem_cache_free(hat_cache, hat); } /* * round kernelbase down to a supported value to use for _userlimit
*** 515,524 **** --- 790,835 ---- else mmu.umax_page_level = lvl; } /* + * Determine the number of slots that are in use in the top-most level page + * table for user memory. This is based on _userlimit. In effect this is similar + * to htable_va2entry, but without the convenience of having an htable. + */ + void + mmu_calc_user_slots(void) + { + uint_t ent, nptes; + uintptr_t shift; + + nptes = mmu.top_level_count; + shift = _userlimit >> mmu.level_shift[mmu.max_level]; + ent = shift & (nptes - 1); + + /* + * Ent tells us the slot that the page for _userlimit would fit in. We + * need to add one to this to cover the total number of entries. + */ + mmu.top_level_uslots = ent + 1; + + /* + * When running 32-bit compatibility processes on a 64-bit kernel, we + * will only need to use one slot. + */ + mmu.top_level_uslots32 = 1; + + /* + * Record the number of PCP page table entries that we'll need to copy + * around. For 64-bit processes this is the number of user slots. For + * 32-bit processes, this is four 1 GiB pages. + */ + mmu.num_copied_ents = mmu.top_level_uslots; + mmu.num_copied_ents32 = 4; + } + + /* * Initialize hat data structures based on processor MMU information. */ void mmu_init(void) {
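A worked example of the mmu_calc_user_slots() computation above, using a hypothetical _userlimit (the real value depends on where kernelbase lands, which is why the slot count is computed rather than hard-coded):

	#include <stdint.h>
	#include <stdio.h>

	int
	main(void)
	{
		uint64_t userlimit = 0x7fffffffffffULL;	/* hypothetical */
		uint32_t nptes = 512;		/* mmu.top_level_count */
		/* mmu.level_shift[3] == 39 on amd64 */
		uint32_t ent = (uint32_t)(userlimit >> 39) & (nptes - 1);

		/* ent is the slot _userlimit falls in; add one for the count. */
		printf("top_level_uslots = %u\n", ent + 1);	/* prints 256 */
		return (0);
	}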
*** 533,543 **** --- 844,865 ---- */ if (is_x86_feature(x86_featureset, X86FSET_PGE) && (getcr4() & CR4_PGE) != 0) mmu.pt_global = PT_GLOBAL; + #if !defined(__xpv) /* + * The 64-bit x86 kernel has split user/kernel page tables. As such we + * cannot have the global bit set. The simplest way for us to deal with + * this is to just say that pt_global is zero, so the global bit isn't + * present. + */ + if (kpti_enable == 1) + mmu.pt_global = 0; + #endif + + /* * Detect NX and PAE usage. */ mmu.pae_hat = kbm_pae_support; if (kbm_nx_support) mmu.pt_nx = PT_NX;
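The reason kpti_enable must force pt_global to zero: with CR4.PGE set, TLB entries created from global PTEs survive a mov to %cr3, so kernel mappings marked global would remain usable even while the user page table is active, defeating the isolation. A minimal sketch using the architectural PTE bits (an illustration, not illumos code):

	#include <stdint.h>
	#include <stdio.h>

	#define PT_VALID	0x001ULL
	#define PT_WRITABLE	0x002ULL
	#define PT_GLOBAL	0x100ULL	/* bit 8: survives %cr3 loads */

	static uint64_t
	make_kernel_pte(uint64_t pfn, int kpti_enabled)
	{
		uint64_t pte = (pfn << 12) | PT_VALID | PT_WRITABLE;

		/*
		 * Without KPTI, global kernel mappings are a TLB win; with
		 * KPTI, a global translation would stay live after switching
		 * to the user %cr3, so the bit must be left clear.
		 */
		if (!kpti_enabled)
			pte |= PT_GLOBAL;
		return (pte);
	}

	int
	main(void)
	{
		printf("kpti pte:   0x%llx\n",
		    (unsigned long long)make_kernel_pte(0x1234, 1));
		printf("global pte: 0x%llx\n",
		    (unsigned long long)make_kernel_pte(0x1234, 0));
		return (0);
	}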
*** 591,600 **** --- 913,927 ---- mmu.num_level = 4; mmu.max_level = 3; mmu.ptes_per_table = 512; mmu.top_level_count = 512; + /* + * 32-bit processes only use 1 GB ptes. + */ + mmu.max_level32 = 2; + mmu.level_shift[0] = 12; mmu.level_shift[1] = 21; mmu.level_shift[2] = 30; mmu.level_shift[3] = 39;
*** 627,636 **** --- 954,964 ---- mmu.level_offset[i] = mmu.level_size[i] - 1; mmu.level_mask[i] = ~mmu.level_offset[i]; } set_max_page_level(); + mmu_calc_user_slots(); mmu_page_sizes = mmu.max_page_level + 1; mmu_exported_page_sizes = mmu.umax_page_level + 1; /* restrict legacy applications from using pagesizes 1g and above */
*** 662,672 **** */ max_htables = physmax / mmu.ptes_per_table; mmu.hash_cnt = MMU_PAGESIZE / sizeof (htable_t *); while (mmu.hash_cnt > 16 && mmu.hash_cnt >= max_htables) mmu.hash_cnt >>= 1; ! mmu.vlp_hash_cnt = mmu.hash_cnt; #if defined(__amd64) /* * If running in 64 bits and physical memory is large, * increase the size of the cache to cover all of memory for --- 990,1000 ---- */ max_htables = physmax / mmu.ptes_per_table; mmu.hash_cnt = MMU_PAGESIZE / sizeof (htable_t *); while (mmu.hash_cnt > 16 && mmu.hash_cnt >= max_htables) mmu.hash_cnt >>= 1; ! mmu.hat32_hash_cnt = mmu.hash_cnt; #if defined(__amd64) /* * If running in 64 bits and physical memory is large, * increase the size of the cache to cover all of memory for
*** 711,728 **** hat_hash_cache = kmem_cache_create("HatHash", mmu.hash_cnt * sizeof (htable_t *), 0, NULL, NULL, NULL, NULL, 0, 0); /* ! * VLP hats can use a smaller hash table size on large memroy machines */ ! if (mmu.hash_cnt == mmu.vlp_hash_cnt) { ! vlp_hash_cache = hat_hash_cache; } else { ! vlp_hash_cache = kmem_cache_create("HatVlpHash", ! mmu.vlp_hash_cnt * sizeof (htable_t *), 0, NULL, NULL, NULL, ! NULL, 0, 0); } /* * Set up the kernel's hat */ --- 1039,1057 ---- hat_hash_cache = kmem_cache_create("HatHash", mmu.hash_cnt * sizeof (htable_t *), 0, NULL, NULL, NULL, NULL, 0, 0); /* ! * 32-bit PCP hats can use a smaller hash table size on large memory ! * machines */ ! if (mmu.hash_cnt == mmu.hat32_hash_cnt) { ! hat32_hash_cache = hat_hash_cache; } else { ! hat32_hash_cache = kmem_cache_create("Hat32Hash", ! mmu.hat32_hash_cnt * sizeof (htable_t *), 0, NULL, NULL, ! NULL, NULL, 0, 0); } /* * Set up the kernel's hat */
*** 735,744 **** --- 1064,1080 ---- CPUSET_ZERO(khat_cpuset); CPUSET_ADD(khat_cpuset, CPU->cpu_id); /* + * The kernel HAT doesn't use PCP regardless of architecture. + */ + ASSERT3U(mmu.max_level, >, 0); + kas.a_hat->hat_max_level = mmu.max_level; + kas.a_hat->hat_num_copied = 0; + + /* * The kernel hat's next pointer serves as the head of the hat list. * The kernel hat's prev pointer tracks the last hat on the list for * htable_steal() to use. */ kas.a_hat->hat_next = NULL;
*** 766,826 **** */ hrm_hashtab = kmem_zalloc(HRM_HASHSIZE * sizeof (struct hrmstat *), KM_SLEEP); } /* ! * Prepare CPU specific pagetables for VLP processes on 64 bit kernels. * * Each CPU has a set of 2 pagetables that are reused for any 32 bit ! * process it runs. They are the top level pagetable, hci_vlp_l3ptes, and ! * the next to top level table for the bottom 512 Gig, hci_vlp_l2ptes. */ /*ARGSUSED*/ static void ! hat_vlp_setup(struct cpu *cpu) { ! #if defined(__amd64) && !defined(__xpv) struct hat_cpu_info *hci = cpu->cpu_hat_info; ! pfn_t pfn; /* * allocate the level==2 page table for the bottom most * 512Gig of address space (this is where 32 bit apps live) */ ASSERT(hci != NULL); ! hci->hci_vlp_l2ptes = kmem_zalloc(MMU_PAGESIZE, KM_SLEEP); /* * Allocate a top level pagetable and copy the kernel's ! * entries into it. Then link in hci_vlp_l2ptes in the 1st entry. */ ! hci->hci_vlp_l3ptes = kmem_zalloc(MMU_PAGESIZE, KM_SLEEP); ! hci->hci_vlp_pfn = ! hat_getpfnum(kas.a_hat, (caddr_t)hci->hci_vlp_l3ptes); ! ASSERT(hci->hci_vlp_pfn != PFN_INVALID); ! bcopy(vlp_page, hci->hci_vlp_l3ptes, MMU_PAGESIZE); ! pfn = hat_getpfnum(kas.a_hat, (caddr_t)hci->hci_vlp_l2ptes); ! ASSERT(pfn != PFN_INVALID); ! hci->hci_vlp_l3ptes[0] = MAKEPTP(pfn, 2); ! #endif /* __amd64 && !__xpv */ } /*ARGSUSED*/ static void ! hat_vlp_teardown(cpu_t *cpu) { ! #if defined(__amd64) && !defined(__xpv) struct hat_cpu_info *hci; if ((hci = cpu->cpu_hat_info) == NULL) return; ! if (hci->hci_vlp_l2ptes) ! kmem_free(hci->hci_vlp_l2ptes, MMU_PAGESIZE); ! if (hci->hci_vlp_l3ptes) ! kmem_free(hci->hci_vlp_l3ptes, MMU_PAGESIZE); #endif } #define NEXT_HKR(r, l, s, e) { \ kernel_ranges[r].hkr_level = l; \ --- 1102,1270 ---- */ hrm_hashtab = kmem_zalloc(HRM_HASHSIZE * sizeof (struct hrmstat *), KM_SLEEP); } + + extern void kpti_tramp_start(); + extern void kpti_tramp_end(); + + extern void kdi_isr_start(); + extern void kdi_isr_end(); + + extern gate_desc_t kdi_idt[NIDT]; + /* ! * Prepare per-CPU pagetables for all processes on the 64 bit kernel. * * Each CPU has a set of 2 pagetables that are reused for any 32 bit ! * process it runs. They are the top level pagetable, hci_pcp_l3ptes, and ! * the next to top level table for the bottom 512 Gig, hci_pcp_l2ptes. */ /*ARGSUSED*/ static void ! hat_pcp_setup(struct cpu *cpu) { ! #if !defined(__xpv) struct hat_cpu_info *hci = cpu->cpu_hat_info; ! uintptr_t va; ! size_t len; /* * allocate the level==2 page table for the bottom most * 512Gig of address space (this is where 32 bit apps live) */ ASSERT(hci != NULL); ! hci->hci_pcp_l2ptes = kmem_zalloc(MMU_PAGESIZE, KM_SLEEP); /* * Allocate a top level pagetable and copy the kernel's ! * entries into it. Then link in hci_pcp_l2ptes in the 1st entry. */ ! hci->hci_pcp_l3ptes = kmem_zalloc(MMU_PAGESIZE, KM_SLEEP); ! hci->hci_pcp_l3pfn = ! hat_getpfnum(kas.a_hat, (caddr_t)hci->hci_pcp_l3ptes); ! ASSERT3U(hci->hci_pcp_l3pfn, !=, PFN_INVALID); ! bcopy(pcp_page, hci->hci_pcp_l3ptes, MMU_PAGESIZE); ! hci->hci_pcp_l2pfn = ! hat_getpfnum(kas.a_hat, (caddr_t)hci->hci_pcp_l2ptes); ! ASSERT3U(hci->hci_pcp_l2pfn, !=, PFN_INVALID); ! ! /* ! * Now go through and allocate the user version of these structures. ! * Unlike with the kernel version, we allocate a hat to represent the ! * top-level page table as that will make it much simpler when we need ! * to patch through user entries. ! */ ! hci->hci_user_hat = hat_cpu_alloc(cpu); ! hci->hci_user_l3pfn = hci->hci_user_hat->hat_htable->ht_pfn; ! ASSERT3U(hci->hci_user_l3pfn, !=, PFN_INVALID); ! 
hci->hci_user_l3ptes = ! (x86pte_t *)hat_kpm_mapin_pfn(hci->hci_user_l3pfn); ! ! /* Skip the rest of this if KPTI is switched off at boot. */ ! if (kpti_enable != 1) ! return; ! ! /* ! * OK, now that we have this we need to go through and punch the normal ! * holes in the CPU's hat for this. At this point we'll punch in the ! * following: ! * ! * o GDT ! * o IDT ! * o LDT ! * o Trampoline Code ! * o machcpu KPTI page ! * o kmdb ISR code page (just trampolines) ! * ! * If this is cpu0, then we also can initialize the following because ! * they'll have already been allocated. ! * ! * o TSS for CPU 0 ! * o Double Fault for CPU 0 ! * ! * The following items have yet to be allocated and have not been ! * punched in yet. They will be punched in later: ! * ! * o TSS (mach_cpucontext_alloc_tables()) ! * o Double Fault Stack (mach_cpucontext_alloc_tables()) ! */ ! hati_cpu_punchin(cpu, (uintptr_t)cpu->cpu_gdt, PROT_READ); ! hati_cpu_punchin(cpu, (uintptr_t)cpu->cpu_idt, PROT_READ); ! ! /* ! * As the KDI IDT is only active during kmdb sessions (including single ! * stepping), typically we don't actually need this punched in (we ! * consider the routines that switch to the user cr3 to be toxic). But ! * if we ever accidentally end up on the user cr3 while on this IDT, ! * we'd prefer not to triple fault. ! */ ! hati_cpu_punchin(cpu, (uintptr_t)&kdi_idt, PROT_READ); ! ! CTASSERT(((uintptr_t)&kpti_tramp_start % MMU_PAGESIZE) == 0); ! CTASSERT(((uintptr_t)&kpti_tramp_end % MMU_PAGESIZE) == 0); ! for (va = (uintptr_t)&kpti_tramp_start; ! va < (uintptr_t)&kpti_tramp_end; va += MMU_PAGESIZE) { ! hati_cpu_punchin(cpu, va, PROT_READ | PROT_EXEC); ! } ! ! VERIFY3U(((uintptr_t)cpu->cpu_m.mcpu_ldt) % MMU_PAGESIZE, ==, 0); ! for (va = (uintptr_t)cpu->cpu_m.mcpu_ldt, len = LDT_CPU_SIZE; ! len >= MMU_PAGESIZE; va += MMU_PAGESIZE, len -= MMU_PAGESIZE) { ! hati_cpu_punchin(cpu, va, PROT_READ); ! } ! ! /* mcpu_pad2 is the start of the page containing the kpti_frames. */ ! hati_cpu_punchin(cpu, (uintptr_t)&cpu->cpu_m.mcpu_pad2[0], ! PROT_READ | PROT_WRITE); ! ! if (cpu == &cpus[0]) { ! /* ! * CPU0 uses a global for its double fault stack to deal with ! * the chicken and egg problem. We need to punch it into its ! * user HAT. ! */ ! extern char dblfault_stack0[]; ! ! hati_cpu_punchin(cpu, (uintptr_t)cpu->cpu_m.mcpu_tss, ! PROT_READ); ! ! for (va = (uintptr_t)dblfault_stack0, ! len = DEFAULTSTKSZ; len >= MMU_PAGESIZE; ! va += MMU_PAGESIZE, len -= MMU_PAGESIZE) { ! hati_cpu_punchin(cpu, va, PROT_READ | PROT_WRITE); ! } ! } ! ! CTASSERT(((uintptr_t)&kdi_isr_start % MMU_PAGESIZE) == 0); ! CTASSERT(((uintptr_t)&kdi_isr_end % MMU_PAGESIZE) == 0); ! for (va = (uintptr_t)&kdi_isr_start; ! va < (uintptr_t)&kdi_isr_end; va += MMU_PAGESIZE) { ! hati_cpu_punchin(cpu, va, PROT_READ | PROT_EXEC); ! } ! #endif /* !__xpv */ } /*ARGSUSED*/ static void ! hat_pcp_teardown(cpu_t *cpu) { ! #if !defined(__xpv) struct hat_cpu_info *hci; if ((hci = cpu->cpu_hat_info) == NULL) return; ! if (hci->hci_pcp_l2ptes != NULL) ! kmem_free(hci->hci_pcp_l2ptes, MMU_PAGESIZE); ! if (hci->hci_pcp_l3ptes != NULL) ! kmem_free(hci->hci_pcp_l3ptes, MMU_PAGESIZE); ! if (hci->hci_user_hat != NULL) { ! hat_free_start(hci->hci_user_hat); ! hat_free_end(hci->hci_user_hat); ! } #endif } #define NEXT_HKR(r, l, s, e) { \ kernel_ranges[r].hkr_level = l; \
*** 912,936 **** } /* * 32 bit PAE metal kernels use only 4 of the 512 entries in the * page holding the top level pagetable. We use the remainder for ! * the "per CPU" page tables for VLP processes. * Map the top level kernel pagetable into the kernel to make * it easy to use bcopy to access these tables. */ if (mmu.pae_hat) { ! vlp_page = vmem_alloc(heap_arena, MMU_PAGESIZE, VM_SLEEP); ! hat_devload(kas.a_hat, (caddr_t)vlp_page, MMU_PAGESIZE, kas.a_hat->hat_htable->ht_pfn, #if !defined(__xpv) PROT_WRITE | #endif PROT_READ | HAT_NOSYNC | HAT_UNORDERED_OK, HAT_LOAD | HAT_LOAD_NOCONSIST); } ! hat_vlp_setup(CPU); /* * Create kmap (cached mappings of kernel PTEs) * for 32 bit we map from segmap_start .. ekernelheap * for 64 bit we map from segmap_start .. segmap_start + segmapsize; --- 1356,1383 ---- } /* * 32 bit PAE metal kernels use only 4 of the 512 entries in the * page holding the top level pagetable. We use the remainder for ! * the "per CPU" page tables for PCP processes. * Map the top level kernel pagetable into the kernel to make * it easy to use bcopy to access these tables. + * + * PAE is required for the 64-bit kernel, which also uses this page to + * implement the per-CPU pagetables. See the big theory statement. */ if (mmu.pae_hat) { ! pcp_page = vmem_alloc(heap_arena, MMU_PAGESIZE, VM_SLEEP); ! hat_devload(kas.a_hat, (caddr_t)pcp_page, MMU_PAGESIZE, kas.a_hat->hat_htable->ht_pfn, #if !defined(__xpv) PROT_WRITE | #endif PROT_READ | HAT_NOSYNC | HAT_UNORDERED_OK, HAT_LOAD | HAT_LOAD_NOCONSIST); } ! hat_pcp_setup(CPU); /* * Create kmap (cached mappings of kernel PTEs) * for 32 bit we map from segmap_start .. ekernelheap * for 64 bit we map from segmap_start .. segmap_start + segmapsize;
*** 939,948 **** --- 1386,1401 ---- size = (uintptr_t)ekernelheap - segmap_start; #elif defined(__amd64) size = segmapsize; #endif hat_kmap_init((uintptr_t)segmap_start, size); + + #if !defined(__xpv) + ASSERT3U(kas.a_hat->hat_htable->ht_pfn, !=, PFN_INVALID); + ASSERT3U(kpti_safe_cr3, ==, + MAKECR3(kas.a_hat->hat_htable->ht_pfn, PCID_KERNEL)); + #endif } /* * On 32 bit PAE mode, PTE's are 64 bits, but ordinary atomic memory references * are 32 bit, so for safety we must use atomic_cas_64() to install these.
*** 956,971 **** x86pte_t pte; int i; /* * Load the 4 entries of the level 2 page table into this ! * cpu's range of the vlp_page and point cr3 at them. */ ASSERT(mmu.pae_hat); ! src = hat->hat_vlp_ptes; ! dest = vlp_page + (cpu->cpu_id + 1) * VLP_NUM_PTES; ! for (i = 0; i < VLP_NUM_PTES; ++i) { for (;;) { pte = dest[i]; if (pte == src[i]) break; if (atomic_cas_64(dest + i, pte, src[i]) != src[i]) --- 1409,1424 ---- x86pte_t pte; int i; /* * Load the 4 entries of the level 2 page table into this ! * cpu's range of the pcp_page and point cr3 at them. */ ASSERT(mmu.pae_hat); ! src = hat->hat_copied_ptes; ! dest = pcp_page + (cpu->cpu_id + 1) * MAX_COPIED_PTES; ! for (i = 0; i < MAX_COPIED_PTES; ++i) { for (;;) { pte = dest[i]; if (pte == src[i]) break; if (atomic_cas_64(dest + i, pte, src[i]) != src[i])
} } #endif /* + * Update the PCP data on CPU 'cpu' to match the given hat. If this is a 32-bit + * process, then we must update the L2 pages and then the L3. If this is a + * 64-bit process, then we must update the L3 entries. + */ + static void + hat_pcp_update(cpu_t *cpu, const hat_t *hat) + { + ASSERT3U(hat->hat_flags & HAT_COPIED, !=, 0); + + if ((hat->hat_flags & HAT_COPIED_32) != 0) { + const x86pte_t *l2src; + x86pte_t *l2dst, *l3ptes, *l3uptes; + /* + * This is a 32-bit process. To set this up, we need to do the + * following: + * + * - Copy the 4 L2 PTEs into the dedicated L2 table + * - Zero the user L3 PTEs in the user and kernel page table + * - Set the first L3 PTE to point to the CPU L2 table + */ + l2src = hat->hat_copied_ptes; + l2dst = cpu->cpu_hat_info->hci_pcp_l2ptes; + l3ptes = cpu->cpu_hat_info->hci_pcp_l3ptes; + l3uptes = cpu->cpu_hat_info->hci_user_l3ptes; + + l2dst[0] = l2src[0]; + l2dst[1] = l2src[1]; + l2dst[2] = l2src[2]; + l2dst[3] = l2src[3]; + + /* + * Make sure to use the mmu to get the number of slots. The + * number of PCP entries that this has will always be less, as + * it's a 32-bit process. + */ + bzero(l3ptes, sizeof (x86pte_t) * mmu.top_level_uslots); + l3ptes[0] = MAKEPTP(cpu->cpu_hat_info->hci_pcp_l2pfn, 2); + bzero(l3uptes, sizeof (x86pte_t) * mmu.top_level_uslots); + l3uptes[0] = MAKEPTP(cpu->cpu_hat_info->hci_pcp_l2pfn, 2); + } else { + /* + * This is a 64-bit process.
To set this up, we need to do the + * following: + * + * - Zero the 4 L2 PTEs in the CPU structure for safety + * - Copy over the new user L3 PTEs into the kernel page table + * - Copy over the new user L3 PTEs into the user page table + */ + ASSERT3S(kpti_enable, ==, 1); + bzero(cpu->cpu_hat_info->hci_pcp_l2ptes, sizeof (x86pte_t) * 4); + bcopy(hat->hat_copied_ptes, cpu->cpu_hat_info->hci_pcp_l3ptes, + sizeof (x86pte_t) * mmu.top_level_uslots); + bcopy(hat->hat_copied_ptes, cpu->cpu_hat_info->hci_user_l3ptes, + sizeof (x86pte_t) * mmu.top_level_uslots); + } + } + + static void + reset_kpti(struct kpti_frame *fr, uint64_t kcr3, uint64_t ucr3) + { + ASSERT3U(fr->kf_tr_flag, ==, 0); + #if DEBUG + if (fr->kf_kernel_cr3 != 0) { + ASSERT3U(fr->kf_lower_redzone, ==, 0xdeadbeefdeadbeef); + ASSERT3U(fr->kf_middle_redzone, ==, 0xdeadbeefdeadbeef); + ASSERT3U(fr->kf_upper_redzone, ==, 0xdeadbeefdeadbeef); + } + #endif + + bzero(fr, offsetof(struct kpti_frame, kf_kernel_cr3)); + bzero(&fr->kf_unused, sizeof (struct kpti_frame) - + offsetof(struct kpti_frame, kf_unused)); + + fr->kf_kernel_cr3 = kcr3; + fr->kf_user_cr3 = ucr3; + fr->kf_tr_ret_rsp = (uintptr_t)&fr->kf_tr_rsp; + + fr->kf_lower_redzone = 0xdeadbeefdeadbeef; + fr->kf_middle_redzone = 0xdeadbeefdeadbeef; + fr->kf_upper_redzone = 0xdeadbeefdeadbeef; + } + + #ifdef __xpv + static void + hat_switch_xen(hat_t *hat) + { + struct mmuext_op t[2]; + uint_t retcnt; + uint_t opcnt = 1; + uint64_t newcr3; + + ASSERT(!(hat->hat_flags & HAT_COPIED)); + ASSERT(!(getcr4() & CR4_PCIDE)); + + newcr3 = MAKECR3((uint64_t)hat->hat_htable->ht_pfn, PCID_NONE); + + t[0].cmd = MMUEXT_NEW_BASEPTR; + t[0].arg1.mfn = mmu_btop(pa_to_ma(newcr3)); + + /* + * There's an interesting problem here, as to what to actually specify + * when switching to the kernel hat. For now we'll reuse the kernel hat + * again. + */ + t[1].cmd = MMUEXT_NEW_USER_BASEPTR; + if (hat == kas.a_hat) + t[1].arg1.mfn = mmu_btop(pa_to_ma(newcr3)); + else + t[1].arg1.mfn = pfn_to_mfn(hat->hat_user_ptable); + ++opcnt; + + if (HYPERVISOR_mmuext_op(t, opcnt, &retcnt, DOMID_SELF) < 0) + panic("HYPERVISOR_mmu_update() failed"); + ASSERT(retcnt == opcnt); + } + #endif /* __xpv */ + + /* * Switch to a new active hat, maintaining bit masks to track active CPUs. * ! * With KPTI, all our HATs except kas should be using PCP. Thus, to switch ! * HATs, we need to copy over the new user PTEs, then set our trampoline context ! * as appropriate. ! * ! * If lacking PCID, we then load our new cr3, which will flush the TLB: we may ! * have established userspace TLB entries via kernel accesses, and these are no ! * longer valid. We have to do this eagerly, as we just deleted this CPU from ! * ->hat_cpus, so would no longer see any TLB shootdowns. ! * ! * With PCID enabled, things get a little more complicated. We would like to ! * keep TLB context around when entering and exiting the kernel, and to do this, ! * we partition the TLB into two different spaces: ! * ! * PCID_KERNEL is defined as zero, and used both by kas and all other address ! * spaces while in the kernel (post-trampoline). ! * ! * PCID_USER is used while in userspace. Therefore, userspace cannot use any ! * lingering PCID_KERNEL entries to kernel addresses it should not be able to ! * read. ! * ! * The trampoline cr3s are set not to invalidate on a mov to %cr3. This means if ! * we take a journey through the kernel without switching HATs, we have some ! * hope of keeping our TLB state around. ! * ! 
* On a hat switch, rather than deal with any necessary flushes on the way out ! * of the trampolines, we do them upfront here. If we're switching from kas, we ! * shouldn't need any invalidation. ! * ! * Otherwise, we can have stale userspace entries for both PCID_USER (what ! * happened before we move onto the kcr3) and PCID_KERNEL (any subsequent ! * userspace accesses such as ddi_copyin()). Since setcr3() won't do these ! * flushes on its own in PCIDE, we'll do a non-flushing load and then ! * invalidate everything. */ void hat_switch(hat_t *hat) { cpu_t *cpu = CPU; hat_t *old = cpu->cpu_current_hat; /* * set up this information first, so we don't miss any cross calls
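The kcr3 and ucr3 values built in hat_switch() below combine a root pagetable PFN, a PCID, and the architectural no-invalidate bit (bit 63 of %cr3, which suppresses the implicit TLB flush of a %cr3 load when CR4.PCIDE is set; the PCID occupies the low 12 bits). A standalone sketch of that composition, assuming as in the discussion above that the kernel PCID is 0 and the user PCID is a small nonzero id:

	#include <stdint.h>
	#include <stdio.h>

	#define CR3_NOINVL	(1ULL << 63)	/* keep this PCID's TLB entries */
	#define CR3_PCID_MASK	0xfffULL	/* PCID: low 12 bits */

	#define EXAMPLE_PCID_KERNEL	0ULL	/* hypothetical ids */
	#define EXAMPLE_PCID_USER	1ULL

	static uint64_t
	compose_cr3(uint64_t root_pfn, uint64_t pcid, int noinval)
	{
		uint64_t cr3 = (root_pfn << 12) | (pcid & CR3_PCID_MASK);

		if (noinval)
			cr3 |= CR3_NOINVL;
		return (cr3);
	}

	int
	main(void)
	{
		/* Hypothetical root PFNs for the kernel and user L3 tables. */
		printf("kcr3 = 0x%llx\n", (unsigned long long)
		    compose_cr3(0x1000, EXAMPLE_PCID_KERNEL, 1));
		printf("ucr3 = 0x%llx\n", (unsigned long long)
		    compose_cr3(0x2000, EXAMPLE_PCID_USER, 1));
		return (0);
	}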
*** 1004,1059 **** if (hat != kas.a_hat) { CPUSET_ATOMIC_ADD(hat->hat_cpus, cpu->cpu_id); } cpu->cpu_current_hat = hat; ! /* ! * now go ahead and load cr3 ! */ ! if (hat->hat_flags & HAT_VLP) { ! #if defined(__amd64) ! x86pte_t *vlpptep = cpu->cpu_hat_info->hci_vlp_l2ptes; ! VLP_COPY(hat->hat_vlp_ptes, vlpptep); ! newcr3 = MAKECR3(cpu->cpu_hat_info->hci_vlp_pfn); ! #elif defined(__i386) ! reload_pae32(hat, cpu); ! newcr3 = MAKECR3(kas.a_hat->hat_htable->ht_pfn) + ! (cpu->cpu_id + 1) * VLP_SIZE; ! #endif } else { ! newcr3 = MAKECR3((uint64_t)hat->hat_htable->ht_pfn); } - #ifdef __xpv - { - struct mmuext_op t[2]; - uint_t retcnt; - uint_t opcnt = 1; ! t[0].cmd = MMUEXT_NEW_BASEPTR; ! t[0].arg1.mfn = mmu_btop(pa_to_ma(newcr3)); ! #if defined(__amd64) /* ! * There's an interesting problem here, as to what to ! * actually specify when switching to the kernel hat. ! * For now we'll reuse the kernel hat again. */ ! t[1].cmd = MMUEXT_NEW_USER_BASEPTR; ! if (hat == kas.a_hat) ! t[1].arg1.mfn = mmu_btop(pa_to_ma(newcr3)); ! else ! t[1].arg1.mfn = pfn_to_mfn(hat->hat_user_ptable); ! ++opcnt; ! #endif /* __amd64 */ ! if (HYPERVISOR_mmuext_op(t, opcnt, &retcnt, DOMID_SELF) < 0) ! panic("HYPERVISOR_mmu_update() failed"); ! ASSERT(retcnt == opcnt); ! } ! #else ! setcr3(newcr3); ! #endif ASSERT(cpu == CPU); } /* * Utility to return a valid x86pte_t from protections, pfn, and level number --- 1605,1671 ---- if (hat != kas.a_hat) { CPUSET_ATOMIC_ADD(hat->hat_cpus, cpu->cpu_id); } cpu->cpu_current_hat = hat; ! #if defined(__xpv) ! hat_switch_xen(hat); ! #else ! struct hat_cpu_info *info = cpu->cpu_m.mcpu_hat_info; ! uint64_t pcide = getcr4() & CR4_PCIDE; ! uint64_t kcr3, ucr3; ! pfn_t tl_kpfn; ! ulong_t flag; ! EQUIV(kpti_enable, !mmu.pt_global); ! ! if (hat->hat_flags & HAT_COPIED) { ! hat_pcp_update(cpu, hat); ! tl_kpfn = info->hci_pcp_l3pfn; } else { ! IMPLY(kpti_enable, hat == kas.a_hat); ! tl_kpfn = hat->hat_htable->ht_pfn; } ! if (pcide) { ! ASSERT(kpti_enable); ! ! kcr3 = MAKECR3(tl_kpfn, PCID_KERNEL) | CR3_NOINVL_BIT; ! ucr3 = MAKECR3(info->hci_user_l3pfn, PCID_USER) | ! CR3_NOINVL_BIT; ! ! setcr3(kcr3); ! if (old != kas.a_hat) ! mmu_flush_tlb(FLUSH_TLB_ALL, NULL); ! } else { ! kcr3 = MAKECR3(tl_kpfn, PCID_NONE); ! ucr3 = kpti_enable ? ! MAKECR3(info->hci_user_l3pfn, PCID_NONE) : ! 0; ! ! setcr3(kcr3); ! } ! /* ! * We will already be taking shootdowns for our new HAT, and as KPTI ! * invpcid emulation needs to use kf_user_cr3, make sure we don't get ! * any cross calls while we're inconsistent. Note that it's harmless to ! * have a *stale* kf_user_cr3 (we just did a FLUSH_TLB_ALL), but a ! * *zero* kf_user_cr3 is not going to go very well. */ ! if (pcide) ! flag = intr_clear(); ! reset_kpti(&cpu->cpu_m.mcpu_kpti, kcr3, ucr3); ! reset_kpti(&cpu->cpu_m.mcpu_kpti_flt, kcr3, ucr3); ! reset_kpti(&cpu->cpu_m.mcpu_kpti_dbg, kcr3, ucr3); ! ! if (pcide) ! intr_restore(flag); ! ! #endif /* !__xpv */ ! ASSERT(cpu == CPU); } /* * Utility to return a valid x86pte_t from protections, pfn, and level number
*** 1361,1374 **** x86_hm_exit(pp); } else { ASSERT(flags & HAT_LOAD_NOCONSIST); } #if defined(__amd64) ! if (ht->ht_flags & HTABLE_VLP) { cpu_t *cpu = CPU; ! x86pte_t *vlpptep = cpu->cpu_hat_info->hci_vlp_l2ptes; ! VLP_COPY(hat->hat_vlp_ptes, vlpptep); } #endif HTABLE_INC(ht->ht_valid_cnt); PGCNT_INC(hat, l); return (rv); --- 1973,1985 ---- x86_hm_exit(pp); } else { ASSERT(flags & HAT_LOAD_NOCONSIST); } #if defined(__amd64) ! if (ht->ht_flags & HTABLE_COPIED) { cpu_t *cpu = CPU; ! hat_pcp_update(cpu, hat); } #endif HTABLE_INC(ht->ht_valid_cnt); PGCNT_INC(hat, l); return (rv);
*** 1436,1446 **** * early before we blow out the kernel stack. */ ++curthread->t_hatdepth; ASSERT(curthread->t_hatdepth < 16); ! ASSERT(hat == kas.a_hat || AS_LOCK_HELD(hat->hat_as)); if (flags & HAT_LOAD_SHARE) hat->hat_flags |= HAT_SHARED; /* --- 2047,2058 ---- * early before we blow out the kernel stack. */ ++curthread->t_hatdepth; ASSERT(curthread->t_hatdepth < 16); ! ASSERT(hat == kas.a_hat || (hat->hat_flags & HAT_PCP) != 0 || ! AS_LOCK_HELD(hat->hat_as)); if (flags & HAT_LOAD_SHARE) hat->hat_flags |= HAT_SHARED; /*
*** 1456,1474 **** if (ht == NULL) { ht = htable_create(hat, va, level, NULL); ASSERT(ht != NULL); } entry = htable_va2entry(va, ht); /* * a bunch of paranoid error checking */ ASSERT(ht->ht_busy > 0); - if (ht->ht_vaddr > va || va > HTABLE_LAST_PAGE(ht)) - panic("hati_load_common: bad htable %p, va %p", - (void *)ht, (void *)va); ASSERT(ht->ht_level == level); /* * construct the new PTE */ --- 2068,2094 ---- if (ht == NULL) { ht = htable_create(hat, va, level, NULL); ASSERT(ht != NULL); } + /* + * htable_va2entry checks this condition as well, but it won't include + * much useful info in the panic. So we do it in advance here to include + * all the context. + */ + if (ht->ht_vaddr > va || va > HTABLE_LAST_PAGE(ht)) { + panic("hati_load_common: bad htable: va=%p, last page=%p, " + "ht->ht_vaddr=%p, ht->ht_level=%d", (void *)va, + (void *)HTABLE_LAST_PAGE(ht), (void *)ht->ht_vaddr, + (int)ht->ht_level); + } entry = htable_va2entry(va, ht); /* * a bunch of paranoid error checking */ ASSERT(ht->ht_busy > 0); ASSERT(ht->ht_level == level); /* * construct the new PTE */
*** 1914,2003 **** panic("No shared region support on x86"); } #if !defined(__xpv) /* ! * Cross call service routine to demap a virtual page on ! * the current CPU or flush all mappings in TLB. */ - /*ARGSUSED*/ static int hati_demap_func(xc_arg_t a1, xc_arg_t a2, xc_arg_t a3) { hat_t *hat = (hat_t *)a1; ! caddr_t addr = (caddr_t)a2; ! size_t len = (size_t)a3; /* * If the target hat isn't the kernel and this CPU isn't operating * in the target hat, we can ignore the cross call. */ if (hat != kas.a_hat && hat != CPU->cpu_current_hat) return (0); ! /* ! * For a normal address, we flush a range of contiguous mappings ! */ ! if ((uintptr_t)addr != DEMAP_ALL_ADDR) { ! for (size_t i = 0; i < len; i += MMU_PAGESIZE) ! mmu_tlbflush_entry(addr + i); return (0); } /* ! * Otherwise we reload cr3 to effect a complete TLB flush. * ! * A reload of cr3 on a VLP process also means we must also recopy in ! * the pte values from the struct hat */ ! if (hat->hat_flags & HAT_VLP) { #if defined(__amd64) ! x86pte_t *vlpptep = CPU->cpu_hat_info->hci_vlp_l2ptes; ! ! VLP_COPY(hat->hat_vlp_ptes, vlpptep); #elif defined(__i386) reload_pae32(hat, CPU); #endif } ! reload_cr3(); return (0); } ! /* ! * Flush all TLB entries, including global (ie. kernel) ones. ! */ ! static void ! flush_all_tlb_entries(void) ! { ! ulong_t cr4 = getcr4(); ! ! if (cr4 & CR4_PGE) { ! setcr4(cr4 & ~(ulong_t)CR4_PGE); ! setcr4(cr4); ! ! /* ! * 32 bit PAE also needs to always reload_cr3() ! */ ! if (mmu.max_level == 2) ! reload_cr3(); ! } else { ! reload_cr3(); ! } ! } ! ! #define TLB_CPU_HALTED (01ul) ! #define TLB_INVAL_ALL (02ul) #define CAS_TLB_INFO(cpu, old, new) \ atomic_cas_ulong((ulong_t *)&(cpu)->cpu_m.mcpu_tlb_info, (old), (new)) /* * Record that a CPU is going idle */ void tlb_going_idle(void) { ! atomic_or_ulong((ulong_t *)&CPU->cpu_m.mcpu_tlb_info, TLB_CPU_HALTED); } /* * Service a delayed TLB flush if coming out of being idle. * It will be called from cpu idle notification with interrupt disabled. --- 2534,2596 ---- panic("No shared region support on x86"); } #if !defined(__xpv) /* ! * Cross call service routine to demap a range of virtual ! * pages on the current CPU or flush all mappings in TLB. */ static int hati_demap_func(xc_arg_t a1, xc_arg_t a2, xc_arg_t a3) { + _NOTE(ARGUNUSED(a3)); hat_t *hat = (hat_t *)a1; ! tlb_range_t *range = (tlb_range_t *)a2; /* * If the target hat isn't the kernel and this CPU isn't operating * in the target hat, we can ignore the cross call. */ if (hat != kas.a_hat && hat != CPU->cpu_current_hat) return (0); ! if (range->tr_va != DEMAP_ALL_ADDR) { ! mmu_flush_tlb(FLUSH_TLB_RANGE, range); return (0); } /* ! * We are flushing all of userspace. * ! * When using PCP, we first need to update this CPU's idea of the PCP ! * PTEs. */ ! if (hat->hat_flags & HAT_COPIED) { #if defined(__amd64) ! hat_pcp_update(CPU, hat); #elif defined(__i386) reload_pae32(hat, CPU); #endif } ! ! mmu_flush_tlb(FLUSH_TLB_NONGLOBAL, NULL); return (0); } ! #define TLBIDLE_CPU_HALTED (0x1UL) ! #define TLBIDLE_INVAL_ALL (0x2UL) #define CAS_TLB_INFO(cpu, old, new) \ atomic_cas_ulong((ulong_t *)&(cpu)->cpu_m.mcpu_tlb_info, (old), (new)) /* * Record that a CPU is going idle */ void tlb_going_idle(void) { ! atomic_or_ulong((ulong_t *)&CPU->cpu_m.mcpu_tlb_info, ! TLBIDLE_CPU_HALTED); } /* * Service a delayed TLB flush if coming out of being idle. * It will be called from cpu idle notification with interrupt disabled.
*** 2010,2046 **** /* * We only have to do something if coming out of being idle. */ tlb_info = CPU->cpu_m.mcpu_tlb_info; ! if (tlb_info & TLB_CPU_HALTED) { ASSERT(CPU->cpu_current_hat == kas.a_hat); /* * Atomic clear and fetch of old state. */ while ((found = CAS_TLB_INFO(CPU, tlb_info, 0)) != tlb_info) { ! ASSERT(found & TLB_CPU_HALTED); tlb_info = found; SMT_PAUSE(); } ! if (tlb_info & TLB_INVAL_ALL) ! flush_all_tlb_entries(); } } #endif /* !__xpv */ /* * Internal routine to do cross calls to invalidate a range of pages on * all CPUs using a given hat. */ void ! hat_tlb_inval_range(hat_t *hat, uintptr_t va, size_t len) { extern int flushes_require_xcalls; /* from mp_startup.c */ cpuset_t justme; cpuset_t cpus_to_shootdown; #ifndef __xpv cpuset_t check_cpus; cpu_t *cpup; int c; #endif --- 2603,2640 ---- /* * We only have to do something if coming out of being idle. */ tlb_info = CPU->cpu_m.mcpu_tlb_info; ! if (tlb_info & TLBIDLE_CPU_HALTED) { ASSERT(CPU->cpu_current_hat == kas.a_hat); /* * Atomic clear and fetch of old state. */ while ((found = CAS_TLB_INFO(CPU, tlb_info, 0)) != tlb_info) { ! ASSERT(found & TLBIDLE_CPU_HALTED); tlb_info = found; SMT_PAUSE(); } ! if (tlb_info & TLBIDLE_INVAL_ALL) ! mmu_flush_tlb(FLUSH_TLB_ALL, NULL); } } #endif /* !__xpv */ /* * Internal routine to do cross calls to invalidate a range of pages on * all CPUs using a given hat. */ void ! hat_tlb_inval_range(hat_t *hat, tlb_range_t *in_range) { extern int flushes_require_xcalls; /* from mp_startup.c */ cpuset_t justme; cpuset_t cpus_to_shootdown; + tlb_range_t range = *in_range; #ifndef __xpv cpuset_t check_cpus; cpu_t *cpup; int c; #endif
*** 2057,2083 **** * entire set of user TLBs, since we don't know what addresses * these were shared at. */ if (hat->hat_flags & HAT_SHARED) { hat = kas.a_hat; ! va = DEMAP_ALL_ADDR; } /* * if not running with multiple CPUs, don't use cross calls */ if (panicstr || !flushes_require_xcalls) { #ifdef __xpv ! if (va == DEMAP_ALL_ADDR) { xen_flush_tlb(); } else { ! for (size_t i = 0; i < len; i += MMU_PAGESIZE) ! xen_flush_va((caddr_t)(va + i)); } #else ! (void) hati_demap_func((xc_arg_t)hat, ! (xc_arg_t)va, (xc_arg_t)len); #endif return; } --- 2651,2678 ---- * entire set of user TLBs, since we don't know what addresses * these were shared at. */ if (hat->hat_flags & HAT_SHARED) { hat = kas.a_hat; ! range.tr_va = DEMAP_ALL_ADDR; } /* * if not running with multiple CPUs, don't use cross calls */ if (panicstr || !flushes_require_xcalls) { #ifdef __xpv ! if (range.tr_va == DEMAP_ALL_ADDR) { xen_flush_tlb(); } else { ! for (size_t i = 0; i < TLB_RANGE_LEN(&range); ! i += MMU_PAGESIZE) { ! xen_flush_va((caddr_t)(range.tr_va + i)); } + } #else ! (void) hati_demap_func((xc_arg_t)hat, (xc_arg_t)&range, 0); #endif return; }
*** 2107,2123 **** cpup = cpu[c]; if (cpup == NULL) continue; tlb_info = cpup->cpu_m.mcpu_tlb_info; ! while (tlb_info == TLB_CPU_HALTED) { ! (void) CAS_TLB_INFO(cpup, TLB_CPU_HALTED, ! TLB_CPU_HALTED | TLB_INVAL_ALL); SMT_PAUSE(); tlb_info = cpup->cpu_m.mcpu_tlb_info; } ! if (tlb_info == (TLB_CPU_HALTED | TLB_INVAL_ALL)) { HATSTAT_INC(hs_tlb_inval_delayed); CPUSET_DEL(cpus_to_shootdown, c); } } #endif --- 2702,2718 ---- cpup = cpu[c]; if (cpup == NULL) continue; tlb_info = cpup->cpu_m.mcpu_tlb_info; ! while (tlb_info == TLBIDLE_CPU_HALTED) { ! (void) CAS_TLB_INFO(cpup, TLBIDLE_CPU_HALTED, ! TLBIDLE_CPU_HALTED | TLBIDLE_INVAL_ALL); SMT_PAUSE(); tlb_info = cpup->cpu_m.mcpu_tlb_info; } ! if (tlb_info == (TLBIDLE_CPU_HALTED | TLBIDLE_INVAL_ALL)) { HATSTAT_INC(hs_tlb_inval_delayed); CPUSET_DEL(cpus_to_shootdown, c); } } #endif
*** 2124,2158 ****
  	if (CPUSET_ISNULL(cpus_to_shootdown) ||
  	    CPUSET_ISEQUAL(cpus_to_shootdown, justme)) {
  #ifdef __xpv
! 		if (va == DEMAP_ALL_ADDR) {
  			xen_flush_tlb();
  		} else {
! 			for (size_t i = 0; i < len; i += MMU_PAGESIZE)
! 				xen_flush_va((caddr_t)(va + i));
  		}
  #else
! 		(void) hati_demap_func((xc_arg_t)hat,
! 		    (xc_arg_t)va, (xc_arg_t)len);
  #endif
  
  	} else {
  
  		CPUSET_ADD(cpus_to_shootdown, CPU->cpu_id);
  #ifdef __xpv
! 		if (va == DEMAP_ALL_ADDR) {
  			xen_gflush_tlb(cpus_to_shootdown);
  		} else {
! 			for (size_t i = 0; i < len; i += MMU_PAGESIZE) {
! 				xen_gflush_va((caddr_t)(va + i),
  				    cpus_to_shootdown);
  			}
  		}
  #else
! 		xc_call((xc_arg_t)hat, (xc_arg_t)va, (xc_arg_t)len,
  		    CPUSET2BV(cpus_to_shootdown), hati_demap_func);
  #endif
  
  	}
  	kpreempt_enable();
--- 2719,2755 ----
  	if (CPUSET_ISNULL(cpus_to_shootdown) ||
  	    CPUSET_ISEQUAL(cpus_to_shootdown, justme)) {
  #ifdef __xpv
! 		if (range.tr_va == DEMAP_ALL_ADDR) {
  			xen_flush_tlb();
  		} else {
! 			for (size_t i = 0; i < TLB_RANGE_LEN(&range);
! 			    i += MMU_PAGESIZE) {
! 				xen_flush_va((caddr_t)(range.tr_va + i));
  			}
+ 		}
  #else
! 		(void) hati_demap_func((xc_arg_t)hat, (xc_arg_t)&range, 0);
  #endif
  
  	} else {
  
  		CPUSET_ADD(cpus_to_shootdown, CPU->cpu_id);
  #ifdef __xpv
! 		if (range.tr_va == DEMAP_ALL_ADDR) {
  			xen_gflush_tlb(cpus_to_shootdown);
  		} else {
! 			for (size_t i = 0; i < TLB_RANGE_LEN(&range);
! 			    i += MMU_PAGESIZE) {
! 				xen_gflush_va((caddr_t)(range.tr_va + i),
  				    cpus_to_shootdown);
  			}
  		}
  #else
! 		xc_call((xc_arg_t)hat, (xc_arg_t)&range, 0,
  		    CPUSET2BV(cpus_to_shootdown), hati_demap_func);
  #endif
  
  	}
  	kpreempt_enable();
*** 2159,2169 ****
  }
  
  void
  hat_tlb_inval(hat_t *hat, uintptr_t va)
  {
! 	hat_tlb_inval_range(hat, va, MMU_PAGESIZE);
  }
  
  /*
   * Interior routine for HAT_UNLOADs from hat_unload_callback(),
   * hat_kmap_unload() OR from hat_steal() code. This routine doesn't
--- 2756,2774 ----
  }
  
  void
  hat_tlb_inval(hat_t *hat, uintptr_t va)
  {
! 	/*
! 	 * Create range for a single page.
! 	 */
! 	tlb_range_t range;
! 	range.tr_va = va;
! 	range.tr_cnt = 1;			/* one page */
! 	range.tr_level = MIN_PAGE_LEVEL;	/* pages are MMU_PAGESIZE */
! 
! 	hat_tlb_inval_range(hat, &range);
  }
  
  /*
   * Interior routine for HAT_UNLOADs from hat_unload_callback(),
   * hat_kmap_unload() OR from hat_steal() code. This routine doesn't
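A quick consistency check on this rewrite: assuming MIN_PAGE_LEVEL is level 0 and LEVEL_SHIFT(0) is the 4K page shift (12 on x86), the new single-page range covers exactly the span the old two-argument call expressed:

	/*
	 * With tr_cnt == 1 and tr_level == 0 (assumed values):
	 *
	 *	TLB_RANGE_LEN(&range) == 1 << 12 == 4096 == MMU_PAGESIZE
	 *
	 * so existing callers of hat_tlb_inval() see no behavioral change.
	 */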
*** 2326,2361 ****
  	}
  	XPV_ALLOW_MIGRATE();
  }
  
  /*
-  * Do the callbacks for ranges being unloaded.
-  */
- typedef struct range_info {
- 	uintptr_t	rng_va;
- 	ulong_t		rng_cnt;
- 	level_t		rng_level;
- } range_info_t;
- 
- /*
   * Invalidate the TLB, and perform the callback to the upper level VM system,
   * for the specified ranges of contiguous pages.
   */
  static void
! handle_ranges(hat_t *hat, hat_callback_t *cb, uint_t cnt, range_info_t *range)
  {
  	while (cnt > 0) {
- 		size_t len;
- 
  		--cnt;
! 		len = range[cnt].rng_cnt << LEVEL_SHIFT(range[cnt].rng_level);
! 		hat_tlb_inval_range(hat, (uintptr_t)range[cnt].rng_va, len);
  
  		if (cb != NULL) {
! 			cb->hcb_start_addr = (caddr_t)range[cnt].rng_va;
  			cb->hcb_end_addr = cb->hcb_start_addr;
! 			cb->hcb_end_addr += len;
  			cb->hcb_function(cb);
  		}
  	}
  }
--- 2931,2955 ----
  	}
  	XPV_ALLOW_MIGRATE();
  }
  
  /*
   * Invalidate the TLB, and perform the callback to the upper level VM system,
   * for the specified ranges of contiguous pages.
   */
  static void
! handle_ranges(hat_t *hat, hat_callback_t *cb, uint_t cnt, tlb_range_t *range)
  {
  	while (cnt > 0) {
  		--cnt;
! 		hat_tlb_inval_range(hat, &range[cnt]);
  
  		if (cb != NULL) {
! 			cb->hcb_start_addr = (caddr_t)range[cnt].tr_va;
  			cb->hcb_end_addr = cb->hcb_start_addr;
! 			cb->hcb_end_addr += range[cnt].tr_cnt <<
! 			    LEVEL_SHIFT(range[cnt].tr_level);
  			cb->hcb_function(cb);
  		}
  	}
  }
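The callback length is now computed from the range itself, which generalizes cleanly to large pages. A small stand-alone example of the arithmetic; the shift table is the standard amd64 one (4K/2M/1G) and everything here is illustrative rather than kernel code:

	#include <stdio.h>

	/* amd64 page sizes by page table level: 4K, 2M, 1G. */
	static const int level_shift[] = { 12, 21, 30 };

	int
	main(void)
	{
		unsigned long tr_cnt = 3;	/* three contiguous mappings */
		int tr_level = 1;		/* level 1 => 2MB pages */

		/* 3 << 21 == 6291456 bytes (6MB) reported to the callback */
		printf("%lu\n", tr_cnt << level_shift[tr_level]);
		return (0);
	}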
*** 2381,2391 ****
  	uintptr_t	vaddr = (uintptr_t)addr;
  	uintptr_t	eaddr = vaddr + len;
  	htable_t	*ht = NULL;
  	uint_t		entry;
  	uintptr_t	contig_va = (uintptr_t)-1L;
! 	range_info_t	r[MAX_UNLOAD_CNT];
  	uint_t		r_cnt = 0;
  	x86pte_t	old_pte;
  
  	XPV_DISALLOW_MIGRATE();
  	ASSERT(hat == kas.a_hat || eaddr <= _userlimit);
--- 2975,2985 ----
  	uintptr_t	vaddr = (uintptr_t)addr;
  	uintptr_t	eaddr = vaddr + len;
  	htable_t	*ht = NULL;
  	uint_t		entry;
  	uintptr_t	contig_va = (uintptr_t)-1L;
! 	tlb_range_t	r[MAX_UNLOAD_CNT];
  	uint_t		r_cnt = 0;
  	x86pte_t	old_pte;
  
  	XPV_DISALLOW_MIGRATE();
  	ASSERT(hat == kas.a_hat || eaddr <= _userlimit);
*** 2421,2438 ****
  		/*
  		 * We'll do the call backs for contiguous ranges
  		 */
  		if (vaddr != contig_va ||
! 		    (r_cnt > 0 && r[r_cnt - 1].rng_level != ht->ht_level)) {
  			if (r_cnt == MAX_UNLOAD_CNT) {
  				handle_ranges(hat, cb, r_cnt, r);
  				r_cnt = 0;
  			}
! 			r[r_cnt].rng_va = vaddr;
! 			r[r_cnt].rng_cnt = 0;
! 			r[r_cnt].rng_level = ht->ht_level;
  			++r_cnt;
  		}
  
  		/*
  		 * Unload one mapping (for a single page) from the page tables.
--- 3015,3032 ----
  		/*
  		 * We'll do the call backs for contiguous ranges
  		 */
  		if (vaddr != contig_va ||
! 		    (r_cnt > 0 && r[r_cnt - 1].tr_level != ht->ht_level)) {
  			if (r_cnt == MAX_UNLOAD_CNT) {
  				handle_ranges(hat, cb, r_cnt, r);
  				r_cnt = 0;
  			}
! 			r[r_cnt].tr_va = vaddr;
! 			r[r_cnt].tr_cnt = 0;
! 			r[r_cnt].tr_level = ht->ht_level;
  			++r_cnt;
  		}
  
  		/*
  		 * Unload one mapping (for a single page) from the page tables.
*** 2446,2456 ****
  		entry = htable_va2entry(vaddr, ht);
  		hat_pte_unmap(ht, entry, flags, old_pte, NULL, B_FALSE);
  		ASSERT(ht->ht_level <= mmu.max_page_level);
  		vaddr += LEVEL_SIZE(ht->ht_level);
  		contig_va = vaddr;
! 		++r[r_cnt - 1].rng_cnt;
  	}
  	if (ht)
  		htable_release(ht);
  
  	/*
--- 3040,3050 ----
  		entry = htable_va2entry(vaddr, ht);
  		hat_pte_unmap(ht, entry, flags, old_pte, NULL, B_FALSE);
  		ASSERT(ht->ht_level <= mmu.max_page_level);
  		vaddr += LEVEL_SIZE(ht->ht_level);
  		contig_va = vaddr;
! 		++r[r_cnt - 1].tr_cnt;
  	}
  	if (ht)
  		htable_release(ht);
  
  	/*
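Taken together, the three hunks above coalesce contiguous, same-level unmaps into at most MAX_UNLOAD_CNT ranges before flushing. A reduced, hypothetical rendering of that coalescing logic (the overflow call to handle_ranges() is omitted for brevity, and the caller must size 'out' appropriately):

	#include <stddef.h>
	#include <stdint.h>

	typedef struct {
		uintptr_t	va;
		unsigned long	cnt;
		int		level;
	} ex_range_t;

	/* On amd64, LEVEL_SHIFT(l) is 12 + 9 * l. */
	#define	EX_LEVEL_SIZE(l)	(1UL << (12 + 9 * (l)))

	static unsigned
	coalesce(const uintptr_t *vas, const int *levels, size_t n,
	    ex_range_t *out)
	{
		unsigned r_cnt = 0;
		uintptr_t contig = (uintptr_t)-1L;

		for (size_t i = 0; i < n; i++) {
			/* a gap or a page-size change starts a new range */
			if (vas[i] != contig || (r_cnt > 0 &&
			    out[r_cnt - 1].level != levels[i])) {
				out[r_cnt].va = vas[i];
				out[r_cnt].cnt = 0;
				out[r_cnt].level = levels[i];
				r_cnt++;
			}
			contig = vas[i] + EX_LEVEL_SIZE(levels[i]);
			out[r_cnt - 1].cnt++;
		}
		return (r_cnt);
	}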
*** 2475,2492 ****
  		sz = hat_getpagesize(hat, va);
  		if (sz < 0) {
  #ifdef __xpv
  			xen_flush_tlb();
  #else
! 			flush_all_tlb_entries();
  #endif
  			break;
  		}
  #ifdef __xpv
  		xen_flush_va(va);
  #else
! 		mmu_tlbflush_entry(va);
  #endif
  		va += sz;
  	}
  }
--- 3069,3086 ----
  		sz = hat_getpagesize(hat, va);
  		if (sz < 0) {
  #ifdef __xpv
  			xen_flush_tlb();
  #else
! 			mmu_flush_tlb(FLUSH_TLB_ALL, NULL);
  #endif
  			break;
  		}
  #ifdef __xpv
  		xen_flush_va(va);
  #else
! 		mmu_flush_tlb_kpage((uintptr_t)va);
  #endif
  		va += sz;
  	}
  }
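The shape of this loop is worth calling out: it walks the VA range by whatever page size is actually mapped at each address, and falls back to a single full invalidation the moment it hits an unmapped address (hat_getpagesize() returning a negative value). A hypothetical reduction, with callbacks standing in for hat_getpagesize() and the MMU flush routines:

	#include <stddef.h>
	#include <stdint.h>

	static void
	flush_range(uintptr_t va, size_t len,
	    long (*getpagesize)(uintptr_t),
	    void (*flush_page)(uintptr_t), void (*flush_all)(void))
	{
		uintptr_t end = va + len;

		while (va < end) {
			long sz = getpagesize(va);

			if (sz < 0) {
				/* hole in the mappings: flush everything */
				flush_all();
				return;
			}
			flush_page(va);
			va += (uintptr_t)sz;
		}
	}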
*** 3148,3158 ****
  		}
  	}
  
  	/*
  	 * flush the TLBs - since we're probably dealing with MANY mappings
! 	 * we do just one CR3 reload.
  	 */
  	if (!(hat->hat_flags & HAT_FREEING) && need_demaps)
  		hat_tlb_inval(hat, DEMAP_ALL_ADDR);
  
  	/*
--- 3742,3752 ----
  		}
  	}
  
  	/*
  	 * flush the TLBs - since we're probably dealing with MANY mappings
! 	 * we just do a full invalidation.
  	 */
  	if (!(hat->hat_flags & HAT_FREEING) && need_demaps)
  		hat_tlb_inval(hat, DEMAP_ALL_ADDR);
  
  	/*
*** 3931,3941 ****
  		    (pte_pa & MMU_PAGEOFFSET) >> mmu.pte_size_shift, NULL);
  		if (mmu.pae_hat)
  			*pteptr = 0;
  		else
  			*(x86pte32_t *)pteptr = 0;
! 		mmu_tlbflush_entry(addr);
  		x86pte_mapout();
  	}
  #endif
  
  	ht = htable_getpte(kas.a_hat, ALIGN2PAGE(addr), NULL, NULL, 0);
--- 4525,4535 ----
  		    (pte_pa & MMU_PAGEOFFSET) >> mmu.pte_size_shift, NULL);
  		if (mmu.pae_hat)
  			*pteptr = 0;
  		else
  			*(x86pte32_t *)pteptr = 0;
! 		mmu_flush_tlb_kpage((uintptr_t)addr);
  		x86pte_mapout();
  	}
  #endif
  
  	ht = htable_getpte(kas.a_hat, ALIGN2PAGE(addr), NULL, NULL, 0);
*** 3992,4002 ****
  		    (pte_pa & MMU_PAGEOFFSET) >> mmu.pte_size_shift, NULL);
  		if (mmu.pae_hat)
  			*(x86pte_t *)pteptr = pte;
  		else
  			*(x86pte32_t *)pteptr = (x86pte32_t)pte;
! 		mmu_tlbflush_entry(addr);
  		x86pte_mapout();
  	}
  #endif
  	XPV_ALLOW_MIGRATE();
  }
--- 4586,4596 ----
  		    (pte_pa & MMU_PAGEOFFSET) >> mmu.pte_size_shift, NULL);
  		if (mmu.pae_hat)
  			*(x86pte_t *)pteptr = pte;
  		else
  			*(x86pte32_t *)pteptr = (x86pte32_t)pte;
! 		mmu_flush_tlb_kpage((uintptr_t)addr);
  		x86pte_mapout();
  	}
  #endif
  	XPV_ALLOW_MIGRATE();
  }
*** 4026,4036 ****
  void
  hat_cpu_online(struct cpu *cpup)
  {
  	if (cpup != CPU) {
  		x86pte_cpu_init(cpup);
! 		hat_vlp_setup(cpup);
  	}
  	CPUSET_ATOMIC_ADD(khat_cpuset, cpup->cpu_id);
  }
  
  /*
--- 4620,4630 ----
  void
  hat_cpu_online(struct cpu *cpup)
  {
  	if (cpup != CPU) {
  		x86pte_cpu_init(cpup);
! 		hat_pcp_setup(cpup);
  	}
  	CPUSET_ATOMIC_ADD(khat_cpuset, cpup->cpu_id);
  }
  
  /*
*** 4041,4051 ****
  hat_cpu_offline(struct cpu *cpup)
  {
  	ASSERT(cpup != CPU);
  	CPUSET_ATOMIC_DEL(khat_cpuset, cpup->cpu_id);
! 	hat_vlp_teardown(cpup);
  	x86pte_cpu_fini(cpup);
  }
  
  /*
   * Function called after all CPUs are brought online.
--- 4635,4645 ----
  hat_cpu_offline(struct cpu *cpup)
  {
  	ASSERT(cpup != CPU);
  	CPUSET_ATOMIC_DEL(khat_cpuset, cpup->cpu_id);
! 	hat_pcp_teardown(cpup);
  	x86pte_cpu_fini(cpup);
  }
  
  /*
   * Function called after all CPUs are brought online.
*** 4488,4492 ****
--- 5082,5115 ----
  	htable_release(ht);
  	htable_release(ht);
  	XPV_ALLOW_MIGRATE();
  }
  #endif	/* __xpv */
+ 
+ /*
+  * Helper function to punch in a mapping that we need with the specified
+  * attributes.
+  */
+ void
+ hati_cpu_punchin(cpu_t *cpu, uintptr_t va, uint_t attrs)
+ {
+ 	int ret;
+ 	pfn_t pfn;
+ 	hat_t *cpu_hat = cpu->cpu_hat_info->hci_user_hat;
+ 
+ 	ASSERT3S(kpti_enable, ==, 1);
+ 	ASSERT3P(cpu_hat, !=, NULL);
+ 	ASSERT3U(cpu_hat->hat_flags & HAT_PCP, ==, HAT_PCP);
+ 	ASSERT3U(va & MMU_PAGEOFFSET, ==, 0);
+ 
+ 	pfn = hat_getpfnum(kas.a_hat, (caddr_t)va);
+ 	VERIFY3U(pfn, !=, PFN_INVALID);
+ 
+ 	/*
+ 	 * We purposefully don't try to find the page_t. This means that this
+ 	 * will be marked PT_NOCONSIST; however, given that this is pretty much
+ 	 * a static mapping that we're using we should be relatively OK.
+ 	 */
+ 	attrs |= HAT_STORECACHING_OK;
+ 	ret = hati_load_common(cpu_hat, va, NULL, attrs, 0, 0, pfn);
+ 	VERIFY3S(ret, ==, 0);
+ }
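For context, a hedged illustration of how a caller might use the new helper during KPTI setup: punching a single page of trampoline text into a CPU's user-mode HAT. The symbol name below is invented for the example (the real callers live in the per-CPU page table setup code), and the PROT_* attributes are the usual HAT protection bits:

	extern char tramp_text_page[];	/* hypothetical, page-aligned text */

	void
	example_punchin(cpu_t *cpu)
	{
		/*
		 * Make the page visible read-only and executable in the
		 * per-CPU user hat; hati_cpu_punchin() resolves the PFN
		 * from the kernel hat and VERIFYs that the load succeeds.
		 */
		hati_cpu_punchin(cpu, (uintptr_t)tramp_text_page,
		    PROT_READ | PROT_EXEC);
	}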