7003695: Troubleshooting Machine Check Exception (MCE) on SUSE

This document (7003695) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise Server 12

SUSE Linux Enterprise Server 11

SUSE Linux Enterprise Server 10

Situation

The kernel is tainted with TAINT: (M) Machine check exception.

On the x86_64 architecture you see a /var/log/mcelog file with contents similar to the following:

MCE 0

HARDWARE ERROR. This is *NOT* a software problem!

Please contact your hardware vendor

CPU 8 BANK 4 TSC cc6b33cd1589

MISC c0090fff01000000 ADDR c27c001a0

STATUS dc3bc000ba080a13 MCGSTATUS 0

MCE 1

HARDWARE ERROR. This is *NOT* a software problem!

Please contact your hardware vendor

CPU 8 BANK 4 TSC cd195ce00597

MISC c0090fff01000000 ADDR c0fd96000

STATUS dc77400065080a13 MCGSTATUS 0

MCE 2

HARDWARE ERROR. This is *NOT* a software problem!

Please contact your hardware vendor

CPU 8 BANK 4 TSC cdc785f347a5

MISC c0090fff01000000 ADDR c1037d380

STATUS dc4cc000df080a13 MCGSTATUS 0
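The records above can also be read programmatically. The following is a minimal sketch, not part of the original article, that splits such a log into records and decodes the upper flag bits of each STATUS value. It assumes the architectural x86 machine-check STATUS layout (VAL in bit 63 down to PCC in bit 57); the function names `parse_records` and `decode_status` are hypothetical, and vendor-specific fields are not interpreted.

```python
# Upper flag bits of an x86 machine-check bank STATUS register
# (assumed architectural layout; model-specific fields are ignored).
STATUS_FLAGS = {
    63: "VAL",    # the register holds a valid error record
    62: "OVER",   # an earlier error was overwritten (overflow)
    61: "UC",     # the error was not corrected by hardware
    60: "EN",     # reporting for this error was enabled
    59: "MISCV",  # the MISC register holds valid data
    58: "ADDRV",  # the ADDR register holds a valid address
    57: "PCC",    # processor context may be corrupted
}


def decode_status(status_hex):
    """Decode the flag bits and the low 16-bit error code of a STATUS value."""
    value = int(status_hex, 16)
    flags = [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (value >> bit) & 1]
    return {
        "flags": flags,
        "corrected": "UC" not in flags,
        "mca_error_code": value & 0xFFFF,
    }


def parse_records(text):
    """Split mcelog-style text into one dict per 'MCE n' record."""
    records = []
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) == 2 and tokens[0] == "MCE" and tokens[1].isdigit():
            records.append({"MCE": int(tokens[1])})
        elif records and tokens and tokens[0] in ("CPU", "MISC", "STATUS"):
            # These lines are alternating KEY VALUE pairs.
            records[-1].update(zip(tokens[0::2], tokens[1::2]))
    return records
```

Under these assumptions, `decode_status("dc3bc000ba080a13")` reports the flags VAL, OVER, EN, MISCV and ADDRV with UC clear, which would suggest a corrected error whose record overwrote an earlier one. Treat this as a reading aid only; the hardware vendor's diagnostics remain authoritative.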

Resolution

Generally speaking, a machine check exception indicates a hardware failure; it is rarely, if ever, an operating system error. It is highly recommended that you follow your vendor’s hardware diagnostic procedure to find and replace any faulty hardware. In addition to that procedure, you can run the memory and firmware tests described below.

To run a Memory Test do the following:

1. Insert the SUSE LINUX Enterprise Server installation CD1

2. Reboot from CD

3. Select Memory Test to run memtest86

To run a Firmware Test do the following:

1. Insert the SUSE LINUX Enterprise Server installation CD1

2. Reboot from CD

3. Select Firmware Test

Disclaimer

This Support Knowledgebase provides a valuable tool for NetIQ/Novell/SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented “AS IS” WITHOUT WARRANTY OF ANY KIND.

Related:

3582750: Tainted kernel

The Linux kernel maintains a “tainted state” which is included in kernel error messages. The tainted state provides an indication whether something has happened to the running kernel that affects whether a kernel error or hang can be troubleshot effectively by analysing the kernel source code. Some of the information in the taint relates to whether the information provided by the kernel in an error message can be considered trustworthy.

As an example, the taint state is set when a machine check exception (MCE) has been raised, indicating a hardware related problem has occurred.

Once the tainted state of a running kernel has been set, it cannot be unset other than by reloading the kernel, that is by shutting down and then restarting the system.

Taint flags

The tainted status of the kernel not only indicates whether or not the kernel has been tainted but also indicates what type(s) of event caused the kernel to be marked as tainted. This information is encoded through single-character flags in the string following “Tainted:” in a kernel error message.

P: A module with a Proprietary license has been loaded, i.e. a module that is not licensed under the GNU General Public License (GPL) or a compatible license. This may indicate that source code for this module is not available to the Linux kernel developers or to SUSE developers.

G: The opposite of ‘P’: the kernel has been tainted (for a reason indicated by a different flag), but all modules loaded into it were licensed under the GPL or a license compatible with the GPL.

F: A module was loaded using the Force option “-f” of insmod or modprobe, which caused a sanity check of the versioning information from the module (if present) to be skipped.

R: A module which was in use or was not designed to be removed has been forcefully Removed from the running kernel using the force option “-f” of rmmod.

S: The Linux kernel is running with Symmetric MultiProcessor support (SMP), but the CPUs in the system are not designed or certified for SMP use.

M: A Machine Check Exception (MCE) has been raised while the kernel was running. MCEs are triggered by the hardware to indicate a hardware related problem, for example the CPU’s temperature exceeding a threshold or a memory bank signaling an uncorrectable error.

B: A process has been found in a Bad page state, indicating a corruption of the virtual memory subsystem, possibly caused by malfunctioning RAM or cache memory.

U: A User or user application specifically requested that the Tainted flag be set; the position is blank (‘ ’) otherwise.

D: The kernel has Died recently, i.e. there was an OOPS or BUG.

W: A Warning has previously been issued by the kernel.

C: A staging driver has been loaded.

A: An ACPI table has been overridden [From SLES11 SP1 onwards]

I: The kernel is working around a severe bug in the platform firmware [From SLES11 SP2 onwards]

O: An Out-of-tree module has been loaded [From SLES12 SP0 onwards]

L: A Soft Lockup has previously occurred on the system [From SLES12 SP2 onwards]

K: The Kernel has been live patched [From SLES12 SP2 onwards]

The taint flags above are implemented in the standard Linux kernel and indicate that the information provided in kernel error messages is not necessarily to be trusted.

In SUSE kernels, additional taint flags are implemented.

N: An uNsupported module has been loaded, i.e. a module which is not supported by SUSE and which is not known to be supported by a third party. For example, the module is a driver that is not yet mature enough to be supportable or is a driver for an obsolete type of hardware which can no longer be tested adequately.

X: A module that is supported by SUSE in cooperation with a third party has been loaded into the kernel.

E: An unsigned module has been loaded in a kernel supporting module signatures [From SLES11 SP3 onwards].

H: The system was restored from an unsafe Hibernate snapshot image [From SLES12 SP1 onwards]

Determining the taint status of a running kernel

The taint status of a running kernel can be determined by running

cat /proc/sys/kernel/tainted

When the output is 0, the kernel is not tainted; when the output is non-zero, the kernel is tainted.

The value is the combination of all applicable kernel taint flags added (ORed) together. You can find a list of currently used kernel flags under:

/usr/src/linux/Documentation/sysctl/kernel.txt
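As an illustration of how the ORed value breaks down into individual flags, here is a minimal sketch. The bit positions in `TAINT_BITS` are an assumption taken from the mainline kernel documentation and may not match every SLES release listed in this article, so verify them against the kernel.txt file for the running kernel. The SUSE-specific flags (N, X, H) are omitted because their bit positions are not given here. The names `TAINT_BITS`, `decode_taint` and `read_taint` are hypothetical.

```python
# Bit position -> flag letter. ASSUMPTION: mainline kernel numbering;
# check Documentation/sysctl/kernel.txt for the kernel actually in use.
TAINT_BITS = [
    (0, "P"), (1, "F"), (2, "S"), (3, "R"),
    (4, "M"), (5, "B"), (6, "U"), (7, "D"),
    (8, "A"), (9, "W"), (10, "C"), (11, "I"),
    (12, "O"), (13, "E"), (14, "L"), (15, "K"),
]


def decode_taint(value):
    """Return the flag letters whose bits are set in the taint value."""
    return [flag for bit, flag in TAINT_BITS if value & (1 << bit)]


def read_taint(path="/proc/sys/kernel/tainted"):
    """Read the numeric taint value of the running kernel."""
    with open(path) as handle:
        return int(handle.read().strip())
```

On a Linux system, `decode_taint(read_taint())` would list the flags currently set; with the assumed numbering, a value of 16 corresponds to “M” alone and 17 to “P” plus “M”.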


When the kernel produces an error, a string detailing the taint status will be included.
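To pull the flag characters out of such a message, something along the following lines could be used. This is a sketch under assumptions: it expects either the text “Not tainted” or “Tainted:” followed by letters and spaces, with the next field (typically the kernel version) starting with a non-letter. The exact layout of kernel error messages varies between kernel versions, and the function name `taint_flags_from_line` is hypothetical.

```python
import re

# Matches either "Not tainted" or "Tainted: <letters and spaces>".
_TAINT_RE = re.compile(r"\b(Not tainted|Tainted: ([A-Za-z ]+))")


def taint_flags_from_line(line):
    """Return taint flag letters from a kernel error line.

    Returns None when the line carries no taint information and an
    empty list when the kernel reports "Not tainted".
    """
    match = _TAINT_RE.search(line)
    if match is None:
        return None
    if match.group(2) is None:
        return []
    return [ch for ch in match.group(2) if ch != " "]
```

For a hypothetical line such as `CPU: 8 PID: 0 Comm: swapper/8 Tainted: G   M       3.12.28-4-default #1`, the function returns the letters G and M.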

Tainted kernels and support from SUSE Customer Care

As the information provided by a tainted kernel is not necessarily trustworthy and may relate to third-party code for which source code is not available to SUSE, it can be of limited value for troubleshooting. As such, the support which SUSE Customer Care is able to provide for issues involving tainted kernels is limited.

When an issue occurs with a tainted kernel, it is important that every effort be made to reproduce the issue with an untainted kernel (or a kernel only tainted with X) so that SUSE Customer Care can provide appropriate support.

Avoiding kernel tainting

The following steps can be taken to avoid tainting the kernel:

  • To avoid “P” tainting, switch to using SUSE supplied drivers and modules instead of ones supplied by other vendors. For instance, use the Linux device-mapper multipath I/O rather than a vendor’s proprietary multipathing implementation. Note that this is not always possible; some hardware components are only supported by proprietary drivers, for instance.
  • To avoid “F” or “R” tainting, do not use force options when (un)loading kernel modules. If necessary, obtain drivers built specifically for the running kernel version.

  • To avoid “S” tainting, do not use SMP-enabled kernels on systems with CPUs that have not been designed or certified for SMP use.

  • To avoid “M” or “B” tainting, ensure that the hardware is operated within specified parameters for power supply, temperature, humidity and air flow. Additionally, check the hardware using hardware diagnostics tools from the hardware vendor and consult the Sig11 information page for more information on hardware and hardware configuration issues.

  • To avoid “N” tainting, use a YES certified configuration.

  • “X” tainting is not necessarily problematic. As for support, SUSE Customer Care can involve the third party that provides support for the driver involved. The external flag can be avoided by using hardware components that are directly supported by SUSE.

  • To avoid “I” tainting, determine where the firmware problem lies (BIOS or some other component).
  • To avoid “E” tainting, make sure not to load any module which does not have a valid signature. You can enforce this behaviour by enabling the CONFIG_MODULE_SIG_FORCE config option, or by setting module.sig_enforce=1 on the kernel command line.
  • “H” is set when the system tries to boot from a snapshot image without a valid signature. It is controlled through the CONFIG_SNAPSHOT_VERIFICATION config option. If you get this flag, make sure your snapshot has a valid signature.
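As a small illustration of the “E” item above, the following sketch checks whether module.sig_enforce=1 appears on a kernel command line. It is an assumption-laden example: it only looks for that exact token, does not detect CONFIG_MODULE_SIG_FORCE being built into the kernel, and the function name `sig_enforce_requested` is hypothetical.

```python
def sig_enforce_requested(cmdline):
    """Return True if module.sig_enforce=1 is a token on the command line.

    Only the exact spelling used in this article is recognised; a kernel
    built with CONFIG_MODULE_SIG_FORCE is not detected by this check.
    """
    return "module.sig_enforce=1" in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read()
```

On a running Linux system, `sig_enforce_requested(read_cmdline())` indicates whether the option was passed at boot.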

Related:

Machine Check Event reported is a CPU thermal throttling event reported from CPU %1. The CPU has dropped below the temperature limit and throttling has been removed. %2 additional error(s) are contained within the record.

Details
Product: Windows Operating System
Event ID: 112
Source: WMIxWDM
Version: 5.2
Symbolic Name: MCA_INFO_CPU_THERMAL_THROTTLING_REMOVED
Message: Machine Check Event reported is a CPU thermal throttling event reported from CPU %1. The CPU has dropped below the temperature limit and throttling has been removed. %2 additional error(s) are contained within the record.
   
Explanation

The temperature of the CPU specified in the event has decreased and the CPU is now running at normal speed. For detailed information about the hardware that caused the problem, refer to the event log.

   
User Action

No user action is required.

Related:

Machine Check Event reported is a fatal error.

Details
Product: Windows Operating System
Event ID: 107
Source: WMIxWDM
Version: 5.2
Symbolic Name: MCA_ERROR_UNKNOWN_NO_CPU
Message: Machine Check Event reported is a fatal error.
   
Explanation

A nonrecoverable hardware error caused the hardware to fail. The type of error cannot be determined because the error record returned by the firmware is not in the required format. For detailed information about the hardware that caused the problem, refer to the event log.

   
User Action

Contact your hardware support provider.

Related:

Machine Check Event reported is a corrected error.

Details
Product: Windows Operating System
Event ID: 106
Source: WMIxWDM
Version: 5.2
Symbolic Name: MCA_WARNING_UNKNOWN_NO_CPU
Message: Machine Check Event reported is a corrected error.
   
Explanation

A hardware error was resolved by your system. The type of error cannot be determined because the error record returned by the firmware is not in the required format. For detailed information about the hardware that caused the problem, refer to the event log.

   
User Action

If the problem continues, contact your hardware support provider.

Related:

Machine Check Event reported is a fatal error reported to CPU %1.

Details
Product: Windows Operating System
Event ID: 105
Source: WMIxWDM
Version: 5.2
Symbolic Name: MCA_ERROR_UNKNOWN
Message: Machine Check Event reported is a fatal error reported to CPU %1.
   
Explanation

A hardware error caused the hardware to fail. The type of error cannot be determined because the error record returned by the firmware is not in the required format. For detailed information about the hardware that caused the problem, refer to the event log.

   
User Action

Contact your hardware support provider.

Related:

Machine Check Event reported is a corrected error reported to CPU %1.

Details
Product: Windows Operating System
Event ID: 104
Source: WMIxWDM
Version: 5.2
Symbolic Name: MCA_WARNING_UNKNOWN
Message: Machine Check Event reported is a corrected error reported to CPU %1.
   
Explanation

A hardware error was resolved by your system. The type of error cannot be determined because the error record returned by the firmware is not in the required format. For detailed information about the hardware that caused the problem, refer to the event log.

   
User Action

If the problem continues, contact your hardware support provider.
