Connectrix Brocade C3-1010 error due to single bit parity error in the ECC memory of the ASIC

Article Number: 522005 Article Version: 2 Article Type: Break Fix



Connectrix DS-6505B

C3-1010 error due to single bit parity error in the ECC memory of the ASIC.

errdump:

2018/05/25-05:00:23, [C3-1010], 79248, CHASSIS, CRITICAL, switchname, S0,C0: Above normal hardware errors were observed: fault1:0x2, fault2:0x64 thresh2:0xb0000064.

The engineering logs (RAS) show the following that triggered the message in the errdump:

2018/05/25-05:00:23:328378, [C3-5663], 1255769/0, CHASSIS, ERROR, switchname, S0,C0: FDS global block 1st = 0x4000, 2nd = 0x40, payload parity err on FID = 0xf57, addr = 0x8, line = 0x3., OID:0x43010080, c3_intr.c, line: 5148, comp:insmod, ltime:2018/05/25-05:00:23:318504

2018/05/25-05:00:23:328554, [C3-1010], 1255770/79248, CHASSIS, CRITICAL, switchname, S0,C0: Above normal hardware errors were observed: fault1:0x2, fault2:0x64 thresh2:0xb0000064., OID:0x43010080, c3_debug.c, line: 1146, comp:insmod, ltime:2018/05/25-05:00:23:319025

2018/05/25-05:00:23:328612, [BL-5215], 1255771/0, CHASSIS, WARNING, switchname, ASIC error/fault message received for chip = 0 ,reason = 52, OID:0x43010080, pulsar_chip.c, line: 936, comp:emd0, ltime:2018/05/25-05:00:23:319568

2018/05/25-05:00:23:328684, [BL-5262], 1255772/0, CHASSIS, WARNING, switchname, ASIC errors on switch, OID:0x43000000, pulsar_blade.c, line: 1264, comp:emd0, ltime:2018/05/25-05:00:23:320097

These are NOT hardware errors in this case.

The C3-5663 message refers to a payload parity error; almost all parity errors are transient and can be corrected. I have decoded the messages, and in this case they can be ignored.

The BL messages likewise do not require any action in this case.
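
When reviewing many such entries, it can help to split each line into its fields before deciding which ones matter. The following is a minimal sketch, assuming the comma-separated layout seen in the sample lines of this article; the field names (timestamp, msg_id, sequence, scope, severity, switch, text) are illustrative labels, not official Fabric OS terminology, and other FOS releases may format entries differently. The sketch only extracts the values; it does not interpret what fault1, fault2 or thresh2 mean.

```python
import re

# Layout inferred from the sample lines in this article only (assumption).
ENTRY_RE = re.compile(
    r"^(?P<timestamp>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}(?::\d+)?), "
    r"\[(?P<msg_id>[A-Z0-9]+-\d+)\], "
    r"(?P<sequence>[\d/]+), "
    r"(?P<scope>\w+), "
    r"(?P<severity>\w+), "
    r"(?P<switch>[^,]+), "
    r"(?P<text>.*)$"
)

# Matches "name:0x..." as well as "name = 0x..." pairs inside the message text.
HEX_FIELD_RE = re.compile(r"(\w+)\s*[:=]\s*(0x[0-9a-fA-F]+)")


def parse_entry(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["hex_fields"] = dict(HEX_FIELD_RE.findall(entry["text"]))
    return entry


SAMPLE = (
    "2018/05/25-05:00:23, [C3-1010], 79248, CHASSIS, CRITICAL, switchname, "
    "S0,C0: Above normal hardware errors were observed: "
    "fault1:0x2, fault2:0x64 thresh2:0xb0000064."
)

if __name__ == "__main__":
    parsed = parse_entry(SAMPLE)
    print(parsed["msg_id"], parsed["severity"], parsed["hex_fields"])
```

The same function accepts the longer RAS lines, because the optional sub-second part of the timestamp and the "n/m" sequence form are both allowed by the pattern.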

none

Reboot the switch to stop the messages. The messages will be filtered out in a later FOS release.

Related:

7022118: Considerations for dealing with correctable memory error messages

Whether or not correctable memory errors are logged is a company or IT department policy and there is no general rule which will fit every IT department’s goals.

The operating system (in this case the kernel) is as verbose as possible and logs those events by default, which may lead to false-positive alerts if no errors are reported by the hardware management board.

The kernel-source.rpm contains the file

/usr/src/linux/Documentation/x86/x86_64/boot-options.txt

which provides a number of kernel options to influence the logging behavior of the kernel. The question mainly is, should the administrator worry about corrected ECC errors at all?

From a technical point of view, a corrected memory message should be considered an informational message only, because the error has been corrected by the built-in hardware error correction mechanisms and has not had any effect on system execution. However, today's hardware management boards may provide defined thresholds for how many errors may occur before a warning or action is triggered.
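
The threshold idea can be illustrated with a small sliding-window counter. This is only a sketch under stated assumptions: the class name, the limit of 3 and the 60-second window are hypothetical values chosen for the example and do not describe any particular management board.

```python
from collections import deque


class CorrectedErrorMonitor:
    """Counts corrected-error events inside a sliding time window (illustrative)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit              # hypothetical: events tolerated per window
        self.window = window_seconds    # hypothetical: window length in seconds
        self.events = deque()

    def record(self, timestamp):
        """Record one corrected error; return True if the limit is exceeded."""
        self.events.append(timestamp)
        # Drop events that are older than the window.
        while self.events and timestamp - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) > self.limit


if __name__ == "__main__":
    monitor = CorrectedErrorMonitor(limit=3, window_seconds=60)
    for t in (0, 10, 20, 30, 200):
        print(t, monitor.record(t))
```

A single corrected error stays informational here; only a burst within the window raises the flag, which mirrors the distinction drawn in the paragraph above.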

Uncorrected errors, on the other hand, are the ones to worry about. In case of such an event, the kernel panics automatically to prevent data corruption (see option mce=tolerancelevel# in /usr/src/linux/Documentation/x86/x86_64/boot-options.txt).

A kernel option that may influence the behaviour of ECC RAM error logging is (taken from /usr/src/linux/Documentation/x86/x86_64/boot-options.txt):

mce=ignore_ce

Disable features for corrected errors, e.g. polling timer and CMCI. All events reported as corrected are not cleared by OS and remained in its error banks.

[…]

This option instructs the kernel to ignore correctable errors in the presence of a hardware management board which takes care of monitoring such events instead.
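
To confirm whether a running system was booted with this option, the kernel command line can be inspected. The sketch below assumes a Linux host where /proc/cmdline is readable; the function name mce_options is an illustrative helper, not part of any existing tool.

```python
def mce_options(cmdline):
    """Return the values of all mce= parameters found on a kernel command line."""
    prefix = "mce="
    return [
        token[len(prefix):]
        for token in cmdline.split()
        if token.startswith(prefix)
    ]


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    with open("/proc/cmdline") as handle:
        options = mce_options(handle.read())
    if "ignore_ce" in options:
        print("Corrected-error handling is disabled in the kernel (mce=ignore_ce).")
    else:
        print("mce=ignore_ce is not set; mce options found:", options)
```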

Related:

SHC110 Blade uncorrectable ECC memory error detected

L2 Rule : SHC110
Issue Detected : Blade uncorrectable ECC memory error detected
Severity : High
Components : In SPA 1 for Blade_09 SN#Y011UF13V114 Recovery Memory device 4, (DIMM 4) correctable ECC memory error logging limit reached 02/06/16 12:40:50
In SPA 1 for Blade_09 SN#Y011UF13V114 Memory device 4, (DIMM 4) correctable ECC memory error logging limit reached 12/31/15 15:07:56

Related:

Machine Check Event reported is a corrected error reported to CPU %1.

Details
Product: Windows Operating System
Event ID: 104
Source: WMIxWDM
Version: 5.2
Symbolic Name: MCA_WARNING_UNKNOWN
Message: Machine Check Event reported is a corrected error reported to CPU %1.
   
Explanation

A hardware error was resolved by your system. The type of error cannot be determined because the error record returned by the firmware is not in the required format. For detailed information about the hardware that caused the problem, refer to the event log.

   
User Action

If the problem continues, contact your hardware support provider.

Related:

Machine Check Event reported is a fatal ECC memory error at physical address %3 on memory module %4 on memory card %5 reported to CPU %1. %2 additional error(s) are contained within the record.

Details
Product: Windows Operating System
Event ID: 77
Source: WMIxWDM
Version: 5.2
Symbolic Name: MCA_ERROR_MEM_1_2_5_4
Message: Machine Check Event reported is a fatal ECC memory error at physical address %3 on memory module %4 on memory card %5 reported to CPU %1. %2 additional error(s) are contained within the record.
   
Explanation

A nonrecoverable hardware error within the system memory subsystem caused the hardware to fail. For detailed information about the hardware that caused the problem, refer to the event log.

   
User Action

Contact your hardware support provider.

Related:

Machine Check Event reported is a corrected ECC memory error at physical address %3 on memory module %4 on memory card %5 reported to CPU %1. %2 additional error(s) are contained within the record.

Details
Product: Windows Operating System
Event ID: 76
Source: WMIxWDM
Version: 5.2
Symbolic Name: MCA_WARNING_MEM_1_2_5_4
Message: Machine Check Event reported is a corrected ECC memory error at physical address %3 on memory module %4 on memory card %5 reported to CPU %1. %2 additional error(s) are contained within the record.
   
Explanation

A hardware error within the system memory subsystem was resolved by your system. For detailed information about the hardware that caused the problem, refer to the event log.

   
User Action

If the problem continues, contact your hardware support provider.
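
The three WMIxWDM events quoted above differ mainly in whether the error was corrected or fatal, which in turn determines the user action. A minimal sketch of that triage as a lookup table follows; the event IDs, symbolic names and actions are copied from the entries above, while the function name and the "corrected"/"fatal" labels are illustrative choices for this example.

```python
# Event ID -> (symbolic name, kind, user action), taken from the entries above.
MCA_EVENTS = {
    104: ("MCA_WARNING_UNKNOWN", "corrected",
          "If the problem continues, contact your hardware support provider."),
    77: ("MCA_ERROR_MEM_1_2_5_4", "fatal",
         "Contact your hardware support provider."),
    76: ("MCA_WARNING_MEM_1_2_5_4", "corrected",
         "If the problem continues, contact your hardware support provider."),
}


def triage(event_id):
    """Return (kind, action) for a known event ID, or None for any other ID."""
    record = MCA_EVENTS.get(event_id)
    if record is None:
        return None
    _name, kind, action = record
    return kind, action


if __name__ == "__main__":
    for event_id in (76, 77, 104):
        print(event_id, triage(event_id))
```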
