11859 need swapgs mitigation
Reviewed by: Robert Mustacchi <rm@fingolfin.org>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>


 893  * form cpuid_get*. This is used by a number of different subsystems in the
 894  * kernel to determine more detailed information about what we're running on,
 895  * topology information, etc. Some of these subsystems include processor groups
 896  * (uts/common/os/pg.c.), CPU Module Interface (uts/i86pc/os/cmi.c), ACPI,
 897  * microcode, and performance monitoring. These functions all ASSERT that the
 898  * CPU they're being called on has reached a certain cpuid pass. If the passes
 899  * are rearranged, then this needs to be adjusted.
 900  *
 901  * -----------------------------------------------
 902  * Speculative Execution CPU Side Channel Security
 903  * -----------------------------------------------
 904  *
 905  * With the advent of the Spectre and Meltdown attacks which exploit speculative
 906  * execution in the CPU to create side channels there have been a number of
 907  * different attacks and corresponding issues that the operating system needs to
 908  * mitigate against. The following list is some of the common, but not
 909  * exhaustive, set of issues that we know about and have done some or need to do
 910  * more work in the system to mitigate against:
 911  *
 912  *   - Spectre v1
 913  *   - Spectre v2
 914  *   - Meltdown (Spectre v3)
 915  *   - Rogue Register Read (Spectre v3a)
 916  *   - Speculative Store Bypass (Spectre v4)
 917  *   - ret2spec, SpectreRSB
 918  *   - L1 Terminal Fault (L1TF)
 919  *   - Microarchitectural Data Sampling (MDS)
 920  *
 921  * Each of these requires different sets of mitigations and has different attack
 922  * surfaces. For the most part, this discussion is about protecting the kernel
 923  * from non-kernel executing environments such as user processes and hardware
 924  * virtual machines. Unfortunately, there are a number of user vs. user
 925  * scenarios that exist with these. The rest of this section will describe the
 926  * overall approach that the system has taken to address these as well as their
 927  * shortcomings. Unfortunately, not all of the above have been handled today.
 928  *
 929  * SPECTRE FAMILY (Spectre v2, ret2spec, SpectreRSB)
 930  *
 931  * The second variant of the spectre attack focuses on performing branch target
 932  * injection. This generally impacts indirect call instructions in the system.
 933  * There are three different ways to mitigate this issue that are commonly
 934  * described today:
 935  *
 936  *  1. Using Indirect Branch Restricted Speculation (IBRS).
 937  *  2. Using Retpolines and RSB Stuffing
 938  *  3. Using Enhanced Indirect Branch Restricted Speculation (EIBRS)
 939  *
 940  * IBRS uses a feature added to microcode to restrict speculation, among other
 941  * things. This form of mitigation has not been used as it has been generally
 942  * seen as too expensive and requires reactivation upon various transitions in
 943  * the system.
 944  *
 945  * As a less impactful alternative to IBRS, retpolines were developed by
 946  * Google. These basically require one to replace indirect calls with a specific
 947  * trampoline that will cause speculation to fail and break the attack.
 948  * Retpolines require compiler support. We always build with retpolines in the
 949  * external thunk mode. This means that a traditional indirect call is replaced


1018  * now.
1019  *
1020  * To fully protect user to user and vmx to vmx attacks from these classes of
1021  * issues, we would also need to allow them to opt into performing an Indirect
1022  * Branch Prediction Barrier (IBPB) on switch. This is not currently wired up.
1023  *
1024  * By default, the system will enable RSB stuffing and the required variant of
1025  * retpolines and store that information in the x86_spectrev2_mitigation value.
1026  * This will be evaluated after a microcode update as well, though it is
1027  * expected that microcode updates will not take away features. This may mean
1028  * that a late loaded microcode may not end up in the optimal configuration
1029  * (though this should be rare).
1030  *
1031  * Currently we do not build kmdb with retpolines or perform any additional side
1032  * channel security mitigations for it. One complication with kmdb is that it
1033  * requires its own retpoline thunks and it would need to adjust itself based on
1034  * what the kernel does. The threat model of kmdb is more limited and therefore
1035  * it may make more sense to investigate using prediction barriers as the whole
1036  * system is only executing a single instruction at a time while in kmdb.
1037  *
1038  * SPECTRE FAMILY (v1, v4)
1039  *
1040  * The v1 and v4 variants of spectre are not currently mitigated in the
1041  * system and require other classes of changes to occur in the code.
1042  *
1043  * MELTDOWN
1044  *
1045  * Meltdown, or spectre v3, allowed a user process to read any data in their
1046  * address space regardless of whether or not the page tables in question
1047  * allowed the user to have the ability to read them. The solution to meltdown
1048  * is kernel page table isolation. In this world, there are two page tables that
1049  * are used for a process, one in user land and one in the kernel. To implement
1050  * this we use per-CPU page tables and switch between the user and kernel
1051  * variants when entering and exiting the kernel.  For more information about
1052  * this process and how the trampolines work, please see the big theory
1053  * statements and additional comments in:
1054  *
1055  *  - uts/i86pc/ml/kpti_trampolines.s
1056  *  - uts/i86pc/vm/hat_i86.c
1057  *
1058  * While Meltdown only impacted Intel systems and there are also Intel systems
1059  * that have Meltdown fixed (called Rogue Data Cache Load), we always have
1060  * kernel page table isolation enabled. While this may at first seem weird, an
1061  * important thing to remember is that you can't speculatively read an address
1062  * if it's never in your page table at all. Having user processes without kernel


1142  * MDS_NO has been set, then this is fully mitigated and x86_md_clear() becomes
1143  * a no-op.
1144  *
1145  * Unfortunately, with this issue hyperthreading rears its ugly head. In
1146  * particular, everything we've discussed above is only valid for a single
1147  * thread executing on a core. In the case where you have hyper-threading
1148  * present, this attack can be performed between threads. The theoretical fix
1149  * for this is to ensure that both threads are always in the same security
1150  * domain. This means that they are executing in the same ring and mutually
1151  * trust each other. Practically speaking, this would mean that a system call
1152  * would have to issue an inter-processor interrupt (IPI) to the other thread.
1153  * Rather than implement this, we recommend that one disables hyper-threading
1154  * through the use of psradm -aS.
1155  *
1156  * SUMMARY
1157  *
1158  * The following table attempts to summarize the mitigations for various issues
1159  * and what's done in various places:
1160  *
1161  *  - Spectre v1: Not currently mitigated
1162  *  - Spectre v2: Retpolines/RSB Stuffing or EIBRS if HW support
1163  *  - Meltdown: Kernel Page Table Isolation
1164  *  - Spectre v3a: Updated CPU microcode
1165  *  - Spectre v4: Not currently mitigated
1166  *  - SpectreRSB: SMEP and RSB Stuffing
1167  *  - L1TF: spec_uarch_flush, smt exclusion, requires microcode
1168  *  - MDS: x86_md_clear, requires microcode, disabling hyper threading
1169  *
1170  * The following table indicates the x86 feature set bits that indicate that a
1171  * given problem has been solved or a notable feature is present:
1172  *
1173  *  - RDCL_NO: Meltdown, L1TF, MSBDS subset of MDS
1174  *  - MDS_NO: All forms of MDS
1175  */
1176 
1177 #include <sys/types.h>
1178 #include <sys/archsystm.h>
1179 #include <sys/x86_archext.h>
1180 #include <sys/kmem.h>
1181 #include <sys/systm.h>
1182 #include <sys/cmn_err.h>
1183 #include <sys/sunddi.h>
1184 #include <sys/sunndi.h>
1185 #include <sys/cpuvar.h>
1186 #include <sys/processor.h>
1187 #include <sys/sysmacros.h>




 893  * form cpuid_get*. This is used by a number of different subsystems in the
 894  * kernel to determine more detailed information about what we're running on,
 895  * topology information, etc. Some of these subsystems include processor groups
 896  * (uts/common/os/pg.c.), CPU Module Interface (uts/i86pc/os/cmi.c), ACPI,
 897  * microcode, and performance monitoring. These functions all ASSERT that the
 898  * CPU they're being called on has reached a certain cpuid pass. If the passes
 899  * are rearranged, then this needs to be adjusted.
 900  *
 901  * -----------------------------------------------
 902  * Speculative Execution CPU Side Channel Security
 903  * -----------------------------------------------
 904  *
 905  * With the advent of the Spectre and Meltdown attacks which exploit speculative
 906  * execution in the CPU to create side channels there have been a number of
 907  * different attacks and corresponding issues that the operating system needs to
 908  * mitigate against. The following list is some of the common, but not
 909  * exhaustive, set of issues that we know about and have done some or need to do
 910  * more work in the system to mitigate against:
 911  *
 912  *   - Spectre v1
 913  *   - swapgs (Spectre v1 variant)
 914  *   - Spectre v2
 915  *   - Meltdown (Spectre v3)
 916  *   - Rogue Register Read (Spectre v3a)
 917  *   - Speculative Store Bypass (Spectre v4)
 918  *   - ret2spec, SpectreRSB
 919  *   - L1 Terminal Fault (L1TF)
 920  *   - Microarchitectural Data Sampling (MDS)
 921  *
 922  * Each of these requires different sets of mitigations and has different attack
 923  * surfaces. For the most part, this discussion is about protecting the kernel
 924  * from non-kernel executing environments such as user processes and hardware
 925  * virtual machines. Unfortunately, there are a number of user vs. user
 926  * scenarios that exist with these. The rest of this section will describe the
 927  * overall approach that the system has taken to address these as well as their
 928  * shortcomings. Unfortunately, not all of the above have been handled today.
 929  *
 930  * SPECTRE v2, ret2spec, SpectreRSB
 931  *
 932  * The second variant of the spectre attack focuses on performing branch target
 933  * injection. This generally impacts indirect call instructions in the system.
 934  * There are three different ways to mitigate this issue that are commonly
 935  * described today:
 936  *
 937  *  1. Using Indirect Branch Restricted Speculation (IBRS).
 938  *  2. Using Retpolines and RSB Stuffing
 939  *  3. Using Enhanced Indirect Branch Restricted Speculation (EIBRS)
 940  *
 941  * IBRS uses a feature added to microcode to restrict speculation, among other
 942  * things. This form of mitigation has not been used as it has been generally
 943  * seen as too expensive and requires reactivation upon various transitions in
 944  * the system.
 945  *
 946  * As a less impactful alternative to IBRS, retpolines were developed by
 947  * Google. These basically require one to replace indirect calls with a specific
 948  * trampoline that will cause speculation to fail and break the attack.
 949  * Retpolines require compiler support. We always build with retpolines in the
 950  * external thunk mode. This means that a traditional indirect call is replaced


1019  * now.
1020  *
1021  * To fully protect user to user and vmx to vmx attacks from these classes of
1022  * issues, we would also need to allow them to opt into performing an Indirect
1023  * Branch Prediction Barrier (IBPB) on switch. This is not currently wired up.
1024  *
1025  * By default, the system will enable RSB stuffing and the required variant of
1026  * retpolines and store that information in the x86_spectrev2_mitigation value.
1027  * This will be evaluated after a microcode update as well, though it is
1028  * expected that microcode updates will not take away features. This may mean
1029  * that a late loaded microcode may not end up in the optimal configuration
1030  * (though this should be rare).
1031  *
1032  * Currently we do not build kmdb with retpolines or perform any additional side
1033  * channel security mitigations for it. One complication with kmdb is that it
1034  * requires its own retpoline thunks and it would need to adjust itself based on
1035  * what the kernel does. The threat model of kmdb is more limited and therefore
1036  * it may make more sense to investigate using prediction barriers as the whole
1037  * system is only executing a single instruction at a time while in kmdb.
1038  *
1039  * SPECTRE v1, v4
1040  *
1041  * The v1 and v4 variants of spectre are not currently mitigated in the
1042  * system and require other classes of changes to occur in the code.
1043  *
1044  * SPECTRE v1 (SWAPGS VARIANT)
1045  *
1046  * Spectre v1 vulnerabilities aren't all about bounds checks; they can
1047  * generally affect any branch-dependent code. The swapgs issue is one
1048  * variant of this. If we are coming in from userspace, we can have code like
1049  * this:
1050  *
1051  *      cmpw    $KCS_SEL, REGOFF_CS(%rsp)
1052  *      je      1f
1053  *      movq    $0, REGOFF_SAVFP(%rsp)
1054  *      swapgs
1055  *      1:
1056  *      movq    %gs:CPU_THREAD, %rax
1057  *
1058  * If an attacker can cause a mis-speculation of the branch here, we could skip
1059  * the needed swapgs, and use the /user/ %gsbase as the base of the %gs-based
1060  * load. If subsequent code can act as the usual Spectre cache gadget, this
1061  * would potentially allow KPTI bypass. To fix this, we need an lfence prior to
1062  * any use of the %gs override.
1063  *
1064  * The other case is also an issue: if we're coming into a trap from kernel
1065  * space, we could mis-speculate and swapgs the user %gsbase back in prior to
1066  * using it. AMD systems are not vulnerable to this version, as a swapgs is
1067  * serializing with respect to subsequent uses. But as AMD /does/ need the fix
1068  * for the first case, and the fix is the same in both cases (an lfence at the
1069  * branch target 1: in this example), we'll just do it unconditionally.
1070  *
1071  * Note that we don't enable user-space "wrgsbase" via CR4_FSGSBASE, which
1072  * makes it harder for user-space to actually set a useful %gsbase value. It's
1073  * not clear whether this is still feasible via lwp_setprivate(), so we
1074  * mitigate anyway.
1075  *
1076  * MELTDOWN
1077  *
1078  * Meltdown, or spectre v3, allowed a user process to read any data in their
1079  * address space regardless of whether or not the page tables in question
1080  * allowed the user to have the ability to read them. The solution to meltdown
1081  * is kernel page table isolation. In this world, there are two page tables that
1082  * are used for a process, one in user land and one in the kernel. To implement
1083  * this we use per-CPU page tables and switch between the user and kernel
1084  * variants when entering and exiting the kernel.  For more information about
1085  * this process and how the trampolines work, please see the big theory
1086  * statements and additional comments in:
1087  *
1088  *  - uts/i86pc/ml/kpti_trampolines.s
1089  *  - uts/i86pc/vm/hat_i86.c
1090  *
1091  * While Meltdown only impacted Intel systems and there are also Intel systems
1092  * that have Meltdown fixed (called Rogue Data Cache Load), we always have
1093  * kernel page table isolation enabled. While this may at first seem weird, an
1094  * important thing to remember is that you can't speculatively read an address
1095  * if it's never in your page table at all. Having user processes without kernel


1175  * MDS_NO has been set, then this is fully mitigated and x86_md_clear() becomes
1176  * a no-op.
1177  *
1178  * Unfortunately, with this issue hyperthreading rears its ugly head. In
1179  * particular, everything we've discussed above is only valid for a single
1180  * thread executing on a core. In the case where you have hyper-threading
1181  * present, this attack can be performed between threads. The theoretical fix
1182  * for this is to ensure that both threads are always in the same security
1183  * domain. This means that they are executing in the same ring and mutually
1184  * trust each other. Practically speaking, this would mean that a system call
1185  * would have to issue an inter-processor interrupt (IPI) to the other thread.
1186  * Rather than implement this, we recommend that one disables hyper-threading
1187  * through the use of psradm -aS.
1188  *
1189  * SUMMARY
1190  *
1191  * The following table attempts to summarize the mitigations for various issues
1192  * and what's done in various places:
1193  *
1194  *  - Spectre v1: Not currently mitigated
1195  *  - swapgs: lfences after swapgs paths
1196  *  - Spectre v2: Retpolines/RSB Stuffing or EIBRS if HW support
1197  *  - Meltdown: Kernel Page Table Isolation
1198  *  - Spectre v3a: Updated CPU microcode
1199  *  - Spectre v4: Not currently mitigated
1200  *  - SpectreRSB: SMEP and RSB Stuffing
1201  *  - L1TF: spec_uarch_flush, SMT exclusion, requires microcode
1202  *  - MDS: x86_md_clear, requires microcode, disabling hyper threading
1203  *
1204  * The following table indicates the x86 feature set bits that indicate that a
1205  * given problem has been solved or a notable feature is present:
1206  *
1207  *  - RDCL_NO: Meltdown, L1TF, MSBDS subset of MDS
1208  *  - MDS_NO: All forms of MDS
1209  */
1210 
1211 #include <sys/types.h>
1212 #include <sys/archsystm.h>
1213 #include <sys/x86_archext.h>
1214 #include <sys/kmem.h>
1215 #include <sys/systm.h>
1216 #include <sys/cmn_err.h>
1217 #include <sys/sunddi.h>
1218 #include <sys/sunndi.h>
1219 #include <sys/cpuvar.h>
1220 #include <sys/processor.h>
1221 #include <sys/sysmacros.h>
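
The SPECTRE v1 (SWAPGS VARIANT) text in the new comment describes placing an lfence at the branch target that both the swapgs and non-swapgs paths reach. The following is a minimal user-space sketch of that control-flow shape in C, not the kernel's actual trap entry code. Every name in it (cpu_model, spec_barrier, trap_entry_thread, swapgs_model_demo) is hypothetical and introduced only for illustration. It models the architectural behaviour only and cannot demonstrate the speculative window itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU structure that %gs points at. */
struct cpu_model {
	int cm_thread_id;
};

/*
 * Speculation barrier. On x86-64 with a GCC-compatible compiler this emits
 * lfence; elsewhere it degrades to a compiler barrier or to nothing, so the
 * sketch still builds but provides no speculation guarantee.
 */
static inline void
spec_barrier(void)
{
#if defined(__GNUC__) && defined(__x86_64__)
	__asm__ __volatile__("lfence" ::: "memory");
#elif defined(__GNUC__)
	__asm__ __volatile__("" ::: "memory");
#endif
}

/*
 * Model of the entry sequence quoted in the comment. 'gsbase' plays the role
 * of %gsbase and 'other' the value that swapgs would exchange it with.
 */
static int
trap_entry_thread(bool from_user, const struct cpu_model *gsbase,
    const struct cpu_model *other)
{
	if (from_user) {
		/* swapgs: exchange the user base for the kernel base. */
		const struct cpu_model *tmp = gsbase;
		gsbase = other;
		other = tmp;
	}
	(void) other;

	/*
	 * Branch target "1:". The barrier is placed here, on the path common
	 * to both cases, before the first use of the base.
	 */
	spec_barrier();

	/* Corresponds to: movq %gs:CPU_THREAD, %rax */
	return (gsbase->cm_thread_id);
}

/* Both entry paths should end up reading through the kernel base. */
static void
swapgs_model_demo(void)
{
	struct cpu_model kernel = { 1 }, user = { 2 };

	/* From userland: the base holds the user value until swapped. */
	assert(trap_entry_thread(true, &user, &kernel) == 1);
	/* From the kernel: the base is already the kernel value. */
	assert(trap_entry_thread(false, &kernel, &user) == 1);
}
```

Placing the barrier at the join point rather than inside the branch is what lets a single fence cover both mis-speculation directions the comment describes: skipping a needed swapgs, and performing an unneeded one.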