893 * form cpuid_get*. This is used by a number of different subsystems in the
894 * kernel to determine more detailed information about what we're running on,
895 * topology information, etc. Some of these subsystems include processor groups
896 * (uts/common/os/pg.c), CPU Module Interface (uts/i86pc/os/cmi.c), ACPI,
897 * microcode, and performance monitoring. These functions all ASSERT that the
898 * CPU they're being called on has reached a certain cpuid pass. If the passes
899 * are rearranged, then this needs to be adjusted.
900 *
901 * -----------------------------------------------
902 * Speculative Execution CPU Side Channel Security
903 * -----------------------------------------------
904 *
905 * With the advent of the Spectre and Meltdown attacks which exploit speculative
906 * execution in the CPU to create side channels there have been a number of
907 * different attacks and corresponding issues that the operating system needs to
908 * mitigate against. The following is a non-exhaustive list of the common
909 * issues that we know about and have either done some work on or still need
910 * to do more work on in the system to mitigate:
911 *
912 * - Spectre v1
913 * - Spectre v2
914 * - Meltdown (Spectre v3)
915 * - Rogue Register Read (Spectre v3a)
916 * - Speculative Store Bypass (Spectre v4)
917 * - ret2spec, SpectreRSB
918 * - L1 Terminal Fault (L1TF)
919 * - Microarchitectural Data Sampling (MDS)
920 *
921 * Each of these requires different sets of mitigations and has different attack
922 * surfaces. For the most part, this discussion is about protecting the kernel
923 * from non-kernel executing environments such as user processes and hardware
924 * virtual machines. Unfortunately, there are a number of user vs. user
925 * scenarios that exist with these. The rest of this section will describe the
926 * overall approach that the system has taken to address these as well as their
927 * shortcomings. Unfortunately, not all of the above have been handled today.
928 *
929 * SPECTRE FAMILY (Spectre v2, ret2spec, SpectreRSB)
930 *
931 * The second variant of the spectre attack focuses on performing branch target
932 * injection. This generally impacts indirect call instructions in the system.
933 * There are three different ways to mitigate this issue that are commonly
934 * described today:
935 *
936 * 1. Using Indirect Branch Restricted Speculation (IBRS).
937 * 2. Using Retpolines and RSB Stuffing
938 * 3. Using Enhanced Indirect Branch Restricted Speculation (EIBRS)
939 *
940 * IBRS uses a feature added to microcode to restrict speculation, among other
941 * things. This form of mitigation has not been used as it has been generally
942 * seen as too expensive and requires reactivation upon various transitions in
943 * the system.
944 *
945 * As a less impactful alternative to IBRS, retpolines were developed by
946 * Google. These basically require one to replace indirect calls with a specific
947 * trampoline that will cause speculation to fail and break the attack.
948 * Retpolines require compiler support. We always build with retpolines in the
949 * external thunk mode. This means that a traditional indirect call is replaced
1018 * now.
1019 *
1020 * To fully protect user to user and vmx to vmx attacks from these classes of
1021 * issues, we would also need to allow them to opt into performing an Indirect
1022 * Branch Prediction Barrier (IBPB) on switch. This is not currently wired up.
1023 *
1024 * By default, the system will enable RSB stuffing and the required variant of
1025 * retpolines and store that information in the x86_spectrev2_mitigation value.
1026 * This will be evaluated after a microcode update as well, though it is
1027 * expected that microcode updates will not take away features. This may mean
1028 * that a late loaded microcode may not end up in the optimal configuration
1029 * (though this should be rare).
1030 *
1031 * Currently we do not build kmdb with retpolines or perform any additional side
1032 * channel security mitigations for it. One complication with kmdb is that it
1033 * requires its own retpoline thunks and it would need to adjust itself based on
1034 * what the kernel does. The threat model of kmdb is more limited and therefore
1035 * it may make more sense to investigate using prediction barriers as the whole
1036 * system is only executing a single instruction at a time while in kmdb.
1037 *
1038 * SPECTRE FAMILY (v1, v4)
1039 *
1040 * The v1 and v4 variants of spectre are not currently mitigated in the
1041 * system and require other classes of changes to occur in the code.
1042 *
1043 * MELTDOWN
1044 *
1045 * Meltdown, or spectre v3, allowed a user process to read any data in their
1046 * address space regardless of whether or not the page tables in question
1047 * allowed the user to have the ability to read them. The solution to meltdown
1048 * is kernel page table isolation. In this world, there are two page tables that
1049 * are used for a process, one in user land and one in the kernel. To implement
1050 * this we use per-CPU page tables and switch between the user and kernel
1051 * variants when entering and exiting the kernel. For more information about
1052 * this process and how the trampolines work, please see the big theory
1053 * statements and additional comments in:
1054 *
1055 * - uts/i86pc/ml/kpti_trampolines.s
1056 * - uts/i86pc/vm/hat_i86.c
1057 *
1058 * While Meltdown only impacted Intel systems and there are also Intel systems
1059 * that have Meltdown fixed (called Rogue Data Cache Load), we always have
1060 * kernel page table isolation enabled. While this may at first seem weird, an
1061 * important thing to remember is that you can't speculatively read an address
1062 * if it's never in your page table at all. Having user processes without kernel
1142 * MDS_NO has been set, then this is fully mitigated and x86_md_clear() becomes
1143 * a no-op.
1144 *
1145 * Unfortunately, with this issue hyper-threading rears its ugly head. In
1146 * particular, everything we've discussed above is only valid for a single
1147 * thread executing on a core. In the case where you have hyper-threading
1148 * present, this attack can be performed between threads. The theoretical fix
1149 * for this is to ensure that both threads are always in the same security
1150 * domain. This means that they are executing in the same ring and mutually
1151 * trust each other. Practically speaking, this would mean that a system call
1152 * would have to issue an inter-processor interrupt (IPI) to the other thread.
1153 * Rather than implement this, we recommend that one disables hyper-threading
1154 * through the use of psradm -aS.
1155 *
1156 * SUMMARY
1157 *
1158 * The following table attempts to summarize the mitigations for various issues
1159 * and what's done in various places:
1160 *
1161 * - Spectre v1: Not currently mitigated
1162 * - Spectre v2: Retpolines/RSB Stuffing or EIBRS if HW support
1163 * - Meltdown: Kernel Page Table Isolation
1164 * - Spectre v3a: Updated CPU microcode
1165 * - Spectre v4: Not currently mitigated
1166 * - SpectreRSB: SMEP and RSB Stuffing
1167 * - L1TF: spec_uarch_flush, smt exclusion, requires microcode
1168 * - MDS: x86_md_clear, requires microcode, disabling hyper-threading
1169 *
1170 * The following table indicates the x86 feature set bits that indicate that a
1171 * given problem has been solved or a notable feature is present:
1172 *
1173 * - RDCL_NO: Meltdown, L1TF, MSBDS subset of MDS
1174 * - MDS_NO: All forms of MDS
1175 */
1176
1177 #include <sys/types.h>
1178 #include <sys/archsystm.h>
1179 #include <sys/x86_archext.h>
1180 #include <sys/kmem.h>
1181 #include <sys/systm.h>
1182 #include <sys/cmn_err.h>
1183 #include <sys/sunddi.h>
1184 #include <sys/sunndi.h>
1185 #include <sys/cpuvar.h>
1186 #include <sys/processor.h>
1187 #include <sys/sysmacros.h>
|
893 * form cpuid_get*. This is used by a number of different subsystems in the
894 * kernel to determine more detailed information about what we're running on,
895 * topology information, etc. Some of these subsystems include processor groups
896 * (uts/common/os/pg.c), CPU Module Interface (uts/i86pc/os/cmi.c), ACPI,
897 * microcode, and performance monitoring. These functions all ASSERT that the
898 * CPU they're being called on has reached a certain cpuid pass. If the passes
899 * are rearranged, then this needs to be adjusted.
900 *
901 * -----------------------------------------------
902 * Speculative Execution CPU Side Channel Security
903 * -----------------------------------------------
904 *
905 * With the advent of the Spectre and Meltdown attacks which exploit speculative
906 * execution in the CPU to create side channels there have been a number of
907 * different attacks and corresponding issues that the operating system needs to
908 * mitigate against. The following is a non-exhaustive list of the common
909 * issues that we know about and have either done some work on or still need
910 * to do more work on in the system to mitigate:
911 *
912 * - Spectre v1
913 * - swapgs (Spectre v1 variant)
914 * - Spectre v2
915 * - Meltdown (Spectre v3)
916 * - Rogue Register Read (Spectre v3a)
917 * - Speculative Store Bypass (Spectre v4)
918 * - ret2spec, SpectreRSB
919 * - L1 Terminal Fault (L1TF)
920 * - Microarchitectural Data Sampling (MDS)
921 *
922 * Each of these requires different sets of mitigations and has different attack
923 * surfaces. For the most part, this discussion is about protecting the kernel
924 * from non-kernel executing environments such as user processes and hardware
925 * virtual machines. Unfortunately, there are a number of user vs. user
926 * scenarios that exist with these. The rest of this section will describe the
927 * overall approach that the system has taken to address these as well as their
928 * shortcomings. Unfortunately, not all of the above have been handled today.
929 *
930 * SPECTRE v2, ret2spec, SpectreRSB
931 *
932 * The second variant of the spectre attack focuses on performing branch target
933 * injection. This generally impacts indirect call instructions in the system.
934 * There are three different ways to mitigate this issue that are commonly
935 * described today:
936 *
937 * 1. Using Indirect Branch Restricted Speculation (IBRS).
938 * 2. Using Retpolines and RSB Stuffing
939 * 3. Using Enhanced Indirect Branch Restricted Speculation (EIBRS)
940 *
941 * IBRS uses a feature added to microcode to restrict speculation, among other
942 * things. This form of mitigation has not been used as it has been generally
943 * seen as too expensive and requires reactivation upon various transitions in
944 * the system.
945 *
946 * As a less impactful alternative to IBRS, retpolines were developed by
947 * Google. These basically require one to replace indirect calls with a specific
948 * trampoline that will cause speculation to fail and break the attack.
949 * Retpolines require compiler support. We always build with retpolines in the
950 * external thunk mode. This means that a traditional indirect call is replaced
1019 * now.
1020 *
1021 * To fully protect user to user and vmx to vmx attacks from these classes of
1022 * issues, we would also need to allow them to opt into performing an Indirect
1023 * Branch Prediction Barrier (IBPB) on switch. This is not currently wired up.
1024 *
1025 * By default, the system will enable RSB stuffing and the required variant of
1026 * retpolines and store that information in the x86_spectrev2_mitigation value.
1027 * This will be evaluated after a microcode update as well, though it is
1028 * expected that microcode updates will not take away features. This may mean
1029 * that a late loaded microcode may not end up in the optimal configuration
1030 * (though this should be rare).
1031 *
1032 * Currently we do not build kmdb with retpolines or perform any additional side
1033 * channel security mitigations for it. One complication with kmdb is that it
1034 * requires its own retpoline thunks and it would need to adjust itself based on
1035 * what the kernel does. The threat model of kmdb is more limited and therefore
1036 * it may make more sense to investigate using prediction barriers as the whole
1037 * system is only executing a single instruction at a time while in kmdb.
1038 *
1039 * SPECTRE v1, v4
1040 *
1041 * The v1 and v4 variants of spectre are not currently mitigated in the
1042 * system and require other classes of changes to occur in the code.
1043 *
1044 * SPECTRE v1 (SWAPGS VARIANT)
1045 *
1046 * The class of Spectre v1 vulnerabilities isn't all about bounds checks, but
1047 * can generally affect any branch-dependent code. The swapgs issue is one
1048 * variant of this. If we are coming in from userspace, we can have code like
1049 * this:
1050 *
1051 * cmpw $KCS_SEL, REGOFF_CS(%rsp)
1052 * je 1f
1053 * movq $0, REGOFF_SAVFP(%rsp)
1054 * swapgs
1055 * 1:
1056 * movq %gs:CPU_THREAD, %rax
1057 *
1058 * If an attacker can cause a mis-speculation of the branch here, we could skip
1059 * the needed swapgs, and use the /user/ %gsbase as the base of the %gs-based
1060 * load. If subsequent code can act as the usual Spectre cache gadget, this
1061 * would potentially allow KPTI bypass. To fix this, we need an lfence prior to
1062 * any use of the %gs override.
1063 *
1064 * The other case is also an issue: if we're coming into a trap from kernel
1065 * space, we could mis-speculate and swapgs the user %gsbase back in prior to
1066 * using it. AMD systems are not vulnerable to this version, as a swapgs is
1067 * serializing with respect to subsequent uses. But as AMD /does/ need the other
1068 * case, and the fix is the same in both cases (an lfence at the branch target
1069 * 1: in this example), we'll just do it unconditionally.
1070 *
1071 * Note that we don't enable user-space "wrgsbase" via CR4_FSGSBASE, making it
1072 * harder for user-space to actually set a useful %gsbase value. It's not clear
1073 * whether this is still feasible via lwp_setprivate(), so we mitigate
1074 * anyway.
1075 *
1076 * MELTDOWN
1077 *
1078 * Meltdown, or spectre v3, allowed a user process to read any data in their
1079 * address space regardless of whether or not the page tables in question
1080 * allowed the user to have the ability to read them. The solution to meltdown
1081 * is kernel page table isolation. In this world, there are two page tables that
1082 * are used for a process, one in user land and one in the kernel. To implement
1083 * this we use per-CPU page tables and switch between the user and kernel
1084 * variants when entering and exiting the kernel. For more information about
1085 * this process and how the trampolines work, please see the big theory
1086 * statements and additional comments in:
1087 *
1088 * - uts/i86pc/ml/kpti_trampolines.s
1089 * - uts/i86pc/vm/hat_i86.c
1090 *
1091 * While Meltdown only impacted Intel systems and there are also Intel systems
1092 * that have Meltdown fixed (called Rogue Data Cache Load), we always have
1093 * kernel page table isolation enabled. While this may at first seem weird, an
1094 * important thing to remember is that you can't speculatively read an address
1095 * if it's never in your page table at all. Having user processes without kernel
1175 * MDS_NO has been set, then this is fully mitigated and x86_md_clear() becomes
1176 * a no-op.
1177 *
1178 * Unfortunately, with this issue hyper-threading rears its ugly head. In
1179 * particular, everything we've discussed above is only valid for a single
1180 * thread executing on a core. In the case where you have hyper-threading
1181 * present, this attack can be performed between threads. The theoretical fix
1182 * for this is to ensure that both threads are always in the same security
1183 * domain. This means that they are executing in the same ring and mutually
1184 * trust each other. Practically speaking, this would mean that a system call
1185 * would have to issue an inter-processor interrupt (IPI) to the other thread.
1186 * Rather than implement this, we recommend that one disables hyper-threading
1187 * through the use of psradm -aS.
1188 *
1189 * SUMMARY
1190 *
1191 * The following table attempts to summarize the mitigations for various issues
1192 * and what's done in various places:
1193 *
1194 * - Spectre v1: Not currently mitigated
1195 * - swapgs: lfences after swapgs paths
1196 * - Spectre v2: Retpolines/RSB Stuffing or EIBRS if HW support
1197 * - Meltdown: Kernel Page Table Isolation
1198 * - Spectre v3a: Updated CPU microcode
1199 * - Spectre v4: Not currently mitigated
1200 * - SpectreRSB: SMEP and RSB Stuffing
1201 * - L1TF: spec_uarch_flush, SMT exclusion, requires microcode
1202 * - MDS: x86_md_clear, requires microcode, disabling hyper-threading
1203 *
1204 * The following table indicates the x86 feature set bits that indicate that a
1205 * given problem has been solved or a notable feature is present:
1206 *
1207 * - RDCL_NO: Meltdown, L1TF, MSBDS subset of MDS
1208 * - MDS_NO: All forms of MDS
1209 */
1210
1211 #include <sys/types.h>
1212 #include <sys/archsystm.h>
1213 #include <sys/x86_archext.h>
1214 #include <sys/kmem.h>
1215 #include <sys/systm.h>
1216 #include <sys/cmn_err.h>
1217 #include <sys/sunddi.h>
1218 #include <sys/sunndi.h>
1219 #include <sys/cpuvar.h>
1220 #include <sys/processor.h>
1221 #include <sys/sysmacros.h>
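The SUMMARY portion of the comment above maps hardware feature bits to mitigation decisions. A minimal sketch of that mapping follows, assuming the bits are read from the IA32_ARCH_CAPABILITIES MSR (RDCL_NO is bit 0, IBRS_ALL is bit 1, MDS_NO is bit 5, per Intel's documentation). Every name prefixed `sketch_` or `SKETCH_` is hypothetical and invented for illustration. The kernel's real selection logic is not shown in this excerpt and may differ, for example by choosing among several retpoline variants.

```c
#include <stdint.h>

/*
 * Bit positions in the IA32_ARCH_CAPABILITIES MSR. The macro names are
 * illustrative only and are not the kernel's own definitions.
 */
#define	SKETCH_ARCH_CAP_RDCL_NO		(1ULL << 0)
#define	SKETCH_ARCH_CAP_IBRS_ALL	(1ULL << 1)
#define	SKETCH_ARCH_CAP_MDS_NO		(1ULL << 5)

/* Hypothetical stand-in for the value kept in x86_spectrev2_mitigation. */
typedef enum {
	SKETCH_V2_RETPOLINE,	/* retpolines plus RSB stuffing */
	SKETCH_V2_EIBRS		/* enhanced IBRS, if hardware supports it */
} sketch_v2_mitigation_t;

/* Spectre v2: use EIBRS when advertised, otherwise fall back to retpolines. */
static sketch_v2_mitigation_t
sketch_pick_v2(uint64_t caps)
{
	return ((caps & SKETCH_ARCH_CAP_IBRS_ALL) != 0 ?
	    SKETCH_V2_EIBRS : SKETCH_V2_RETPOLINE);
}

/* MDS: the buffer clear is only required when MDS_NO is not set. */
static int
sketch_need_md_clear(uint64_t caps)
{
	return ((caps & SKETCH_ARCH_CAP_MDS_NO) == 0);
}

/*
 * RDCL_NO reports hardware that is not affected by Meltdown or L1TF.
 * Per the comment above, kernel page table isolation stays enabled
 * regardless, so this is informational rather than a KPTI switch.
 */
static int
sketch_hw_has_rdcl_fix(uint64_t caps)
{
	return ((caps & SKETCH_ARCH_CAP_RDCL_NO) != 0);
}
```

With a capabilities value of 0x20 (only MDS_NO set), `sketch_pick_v2()` selects retpolines and `sketch_need_md_clear()` returns 0, matching the statement that x86_md_clear() becomes a no-op once MDS_NO is present.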
|