Troubleshooting Information in Cisco UCS Manager GUI
Cisco UCS Manager GUI provides several tabs and other areas that you can use to find troubleshooting information for a Cisco UCS domain. For example, you can view faults and events for specific objects or for all objects in the system.
The Admin tab in the Navigation pane provides access to faults, events, core files, and other information that can help you troubleshoot issues.
If you select Faults, Events and Audit Log in the Filter field on the Admin tab, Cisco UCS Manager GUI limits the tree browser so that you can access only the following:
The faults for all components in the system
The events for all components in the system
The audit log for the system
Any core files created by the fabric interconnects in the system
The fault collection and core file export settings
Note
Fault thresholds might need to be modified. See the “Statistics Threshold Policy” section in the Cisco UCS Manager GUI Configuration Guide for the release of Cisco UCS Manager that you are using.
Troubleshooting Information in Cisco UCS Manager CLI
The Cisco UCS Manager CLI includes several show commands that you can execute to find troubleshooting information for a Cisco UCS domain. These show commands are scope-aware, which means that if you enter the show fault command from the top scope, it displays all faults in the system. However, if you scope to a specific object, the show fault command displays faults that are related to that object only.
Note
Fault thresholds might need to be modified. See the “Statistics Threshold Policy” section in the Cisco UCS Manager CLI Configuration Guide for the release of Cisco UCS Manager that you are using.
Additional Troubleshooting Documentation
Additional troubleshooting information is available in the following documents:
Faults
In Cisco UCS, a fault is a mutable object that is managed by Cisco UCS Manager. Each fault represents a failure in the Cisco UCS domain or an alarm threshold that has been raised. During its lifecycle, a fault can change from one state or severity to another.
Each fault includes information about the operational state of the affected object at the time the fault was raised. If the fault is transitional and the failure is resolved, the object transitions to a functional state.
A fault remains in Cisco UCS Manager until the fault is cleared and deleted according to the settings in the fault collection policy.
You can view all faults in a Cisco UCS domain from either the Cisco UCS Manager CLI or the Cisco UCS Manager GUI. You can also configure the fault collection policy to determine how a Cisco UCS domain collects and retains faults.
Note
All Cisco UCS faults are included in MIBs and can be trapped by SNMP.
Fault Severities
A fault raised in a Cisco UCS domain can transition through more than one severity during its lifecycle. The following table describes the fault severities that you may encounter.
Severity
Description
Critical
Service-affecting condition that requires immediate corrective action. For example, this severity could indicate that the managed object is out of service and its capability must be restored.
Major
Service-affecting condition that requires urgent corrective action. For example, this severity could indicate a severe degradation in the capability of the managed object and that its full capability must be restored.
Minor
Nonservice-affecting fault condition that requires corrective action to prevent a more serious fault from occurring. For example, this severity could indicate that the detected alarm condition is not degrading the capacity of the managed object.
Warning
Potential or impending service-affecting fault that has no significant effects in the system. You should take action to further diagnose, if necessary, and correct the problem to prevent it from becoming a more serious service-affecting fault.
Condition
Informational message about a condition, possibly independently insignificant.
Info
Basic notification or informational message, possibly independently insignificant.
Fault States
A fault raised in a Cisco UCS domain transitions through more than one state during its lifecycle. The following table describes the possible fault states in alphabetical order.
State
Description
Cleared
Condition that has been resolved and cleared.
Flapping
Fault that was raised, cleared, and raised again within a short time interval, known as the flap interval.
Soaking
Fault that was raised and cleared within a short time interval, known as the flap interval. Because this may be a flapping condition, the fault severity remains at its original active value, but this state indicates that the condition that raised the fault has cleared.
Fault Types
A fault raised in a Cisco UCS domain can be one of the types described in the following table.
Type
Description
fsm
FSM task has failed to complete successfully, or Cisco UCS Manager is retrying one of the stages of the FSM.
equipment
Cisco UCS Manager has detected that a physical component is inoperable or has another functional issue.
server
Cisco UCS Manager cannot complete a server task, such as associating a service profile with a server.
configuration
Cisco UCS Manager cannot successfully configure a component.
environment
Cisco UCS Manager has detected a power problem, thermal problem, voltage problem, or loss of CMOS settings.
management
Cisco UCS Manager has detected a serious management issue, such as one of the following:
Critical services could not be started
The primary fabric interconnect could not be identified
Components in the Cisco UCS domain include incompatible firmware versions
connectivity
Cisco UCS Manager has detected a connectivity problem, such as an unreachable adapter.
network
Cisco UCS Manager has detected a network issue, such as a link down.
operational
Cisco UCS Manager has detected an operational problem, such as a log capacity issue or a failed server discovery.
Fault Properties
Cisco UCS Manager provides detailed information about each fault raised in a Cisco UCS domain. The following table describes the fault properties that you can view in Cisco UCS Manager CLI or Cisco UCS Manager GUI.
Property Name
Description
Severity
Current severity level of the fault, which can be any of the severities described in Fault Severities.
Last Transition
Day and time on which the severity for the fault last changed. If the severity has not changed since the fault was raised, this property displays the original creation date.
Affected Object
Component that is affected by the condition that raised the fault.
Description
Description of the fault.
ID
Unique identifier assigned to the fault.
Type
Type of fault that has been raised, which can be any of the types described in Fault Types.
Cause
Unique identifier associated with the condition that caused the fault.
Created at
Day and time when the fault occurred.
Code
Unique identifier assigned to the fault.
Number of Occurrences
Number of times the event that raised the fault occurred.
Original Severity
Severity assigned to the fault the first time it occurred.
Previous Severity
Previous severity level. This property is only used if the severity of a fault changes during its lifecycle.
Highest Severity
Highest severity encountered for this issue.
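The relationship between the Severity, Original Severity, Previous Severity, and Highest Severity properties can be sketched as follows. This is an illustration only, not how Cisco UCS Manager stores faults: the `FaultRecord` class, the numeric severity ranks, and the sample object name are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    # Numeric ranks are an assumption of this sketch; only the ordering
    # (Info lowest, Critical highest) follows the severity table.
    INFO = 0
    CONDITION = 1
    WARNING = 2
    MINOR = 3
    MAJOR = 4
    CRITICAL = 5


@dataclass
class FaultRecord:
    """Hypothetical container for a subset of the fault properties."""
    affected_object: str
    severity: Severity
    occurrences: int = 1
    original_severity: Severity = field(init=False)
    previous_severity: Optional[Severity] = field(init=False, default=None)
    highest_severity: Severity = field(init=False)

    def __post_init__(self) -> None:
        # When the fault is first raised, every severity view agrees.
        self.original_severity = self.severity
        self.highest_severity = self.severity

    def change_severity(self, new: Severity) -> None:
        """Record a severity transition and update the derived properties."""
        if new == self.severity:
            return
        self.previous_severity = self.severity
        self.severity = new
        self.highest_severity = max(self.highest_severity, new)
```

For example, a fault raised as Minor that escalates to Critical and then drops to Warning would report Warning as its current severity, Critical as both its previous and highest severity, and Minor as its original severity.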
Lifecycle of Faults
Faults in Cisco UCS are stateful. Only one instance of a given fault can exist on each object. If the same fault occurs a second time, Cisco UCS increases the number of occurrences by one.
A fault has the following lifecycle:
A condition occurs in the system and Cisco UCS Manager raises a fault. This is the active state.
When the fault is alleviated, it enters a flapping or soaking interval that is designed to prevent flapping. Flapping occurs when a fault is raised and cleared several times in rapid succession. During the flapping interval, the fault retains its severity for the length of time specified in the fault collection policy.
If the condition reoccurs during the flapping interval, the fault returns to the active state. If the condition does not reoccur during the flapping interval, the fault is cleared.
The cleared fault enters the retention interval. This interval ensures that the fault reaches the attention of an administrator, even if the condition that caused the fault has been alleviated, and that the fault is not deleted prematurely. The retention interval retains the cleared fault for the length of time specified in the fault collection policy.
If the condition reoccurs during the retention interval, the fault returns to the active state. If the condition does not reoccur, the fault is deleted.
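The lifecycle steps above can be read as a small state machine. The following sketch is a simplified model under stated assumptions: the class and method names are hypothetical, the intermediate state is labeled "soaking" after the Fault States table, and both intervals are plain numbers of seconds supplied by the caller rather than values read from a real fault collection policy.

```python
from dataclasses import dataclass


@dataclass
class FaultLifecycle:
    """Hypothetical model of one fault instance on one object."""
    flap_interval: float        # seconds (assumed unit)
    retention_interval: float   # seconds (assumed unit)
    state: str = "active"       # a fault starts in the active state
    occurrences: int = 1
    since: float = 0.0          # time at which the current state began

    def condition_raised(self, now: float) -> None:
        """The condition occurs (again): the fault returns to active."""
        if self.state == "deleted":
            raise RuntimeError("a deleted fault cannot be raised again")
        self.occurrences += 1
        self.state = "active"
        self.since = now

    def condition_cleared(self, now: float) -> None:
        """The condition is alleviated: start the flapping interval."""
        if self.state == "active":
            self.state = "soaking"
            self.since = now

    def tick(self, now: float) -> str:
        """Advance the interval timers and return the resulting state."""
        if self.state == "soaking" and now - self.since >= self.flap_interval:
            # No recurrence during the flapping interval: fault is cleared
            # and the retention interval starts where the flap interval ended.
            self.state = "cleared"
            self.since += self.flap_interval
        if self.state == "cleared" and now - self.since >= self.retention_interval:
            # No recurrence during the retention interval: fault is deleted.
            self.state = "deleted"
        return self.state
```

In this model a recurrence increments the occurrence count instead of creating a second fault, which mirrors the rule that only one instance of a given fault can exist on each object.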
Faults in Cisco UCS Manager GUI
If you want to view faults for a single object in the system, navigate to that object in the Cisco UCS Manager GUI and click the Faults tab in the Work pane. If you want to view faults for all objects in the system, navigate to the Faults node on the Admin tab under Faults, Events and Audit Log.
You can also view a summary of all faults that have occurred in a Cisco UCS domain in the Fault Summary area in the upper left of the Cisco UCS Manager GUI.
Each fault severity is represented by a different icon. The number below each icon indicates how many faults of that severity have occurred in the system. If you click an icon, the Cisco UCS Manager GUI opens the Faults tab in the Work pane and displays the details of all faults with that severity.
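The per-severity counts shown in the Fault Summary area amount to a tally over all faults. A minimal sketch of that tally follows; the `fault_summary` function and the sample fault list are hypothetical and are not taken from a real system.

```python
from collections import Counter


def fault_summary(faults):
    """Count faults per severity.

    `faults` is an iterable of (affected_object, severity) pairs.
    """
    return Counter(severity for _obj, severity in faults)


# Made-up sample data for illustration only.
faults = [
    ("sys/chassis-1", "major"),
    ("sys/chassis-1/blade-2", "critical"),
    ("sys/chassis-1/blade-3", "major"),
]
summary = fault_summary(faults)
```

A `Counter` returns zero for severities that have no faults, which matches a summary area that displays a count for every severity.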
Faults in Cisco UCS Manager CLI
If you want to view the faults for all objects in the system, enter the show fault command from the top-level scope. If you want to view the faults for a specific object, scope to that object and then enter the show fault command.
If you want to view all available details about a fault, enter the show fault detail command.
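The scope-aware behavior of show fault can be pictured as a filter over the affected objects. The sketch below only illustrates that idea and does not reflect how the CLI is implemented: the `faults_in_scope` function, the path-like object names, and the sample data are assumptions made for this example.

```python
def faults_in_scope(faults, scope=None):
    """Return all faults, or only those at or below the given scope."""
    if scope is None:
        # Top-level scope: every fault in the system is visible.
        return list(faults)
    root = scope.rstrip("/")
    return [
        f for f in faults
        if f["object"] == root or f["object"].startswith(root + "/")
    ]


# Made-up sample data for illustration only.
FAULTS = [
    {"object": "sys/chassis-1/blade-2", "description": "example fault A"},
    {"object": "sys/chassis-1/blade-3", "description": "example fault B"},
    {"object": "sys/chassis-2/blade-1", "description": "example fault C"},
]
```

Calling `faults_in_scope(FAULTS)` returns all three sample faults, while `faults_in_scope(FAULTS, "sys/chassis-1")` returns only the two faults under that chassis.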
Fault Collection Policy
The fault collection policy controls the lifecycle of a fault in the Cisco UCS domain, including the length of time that each fault remains in the flapping and retention intervals.
Tip
For information on how to configure the fault collection policy, see the Cisco UCS Manager configuration guides, which are accessible through the Cisco UCS B-Series Servers Documentation Roadmap.
Events
In Cisco UCS, an event is an immutable object that is managed by Cisco UCS Manager. Each event represents a nonpersistent condition in the Cisco UCS domain. After Cisco UCS Manager creates and logs an event, the event does not change. For example, if you power on a server, Cisco UCS Manager creates and logs an event for the beginning and the end of that request.
You can view events for a single object, or you can view all events in a Cisco UCS domain, from either the Cisco UCS Manager CLI or the Cisco UCS Manager GUI. Events remain in Cisco UCS until the event log fills up. When the log is full, Cisco UCS Manager purges the log and all events in it.
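The purge-when-full behavior differs from a rolling log, where only the oldest entries are dropped. The following sketch shows the difference in miniature. The `EventLog` class and its capacity are hypothetical, and the exact moment of the purge (here, when the next event arrives at a full log) is an assumption.

```python
class EventLog:
    """Hypothetical log that discards everything once it is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = []

    def log(self, event):
        if len(self.events) >= self.capacity:
            # The whole log is purged, not just the oldest entry.
            self.events.clear()
        self.events.append(event)
```

With a capacity of three, logging a fourth event leaves only that fourth event in the log, so events that must be kept should be collected before the log fills.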
Cisco UCS Manager provides detailed information about each event created and logged in a Cisco UCS domain. The following table describes the event properties that you can view in Cisco UCS Manager CLI or Cisco UCS Manager GUI.
Table 1 Event Properties
Property Name
Description
Affected Object
Component that created the event.
Description
Description of the event.
Cause
Unique identifier associated with the event.
Created at
Date and time when the event was created.
User
Type of user that created the event, such as one of the following:
admin
internal
blank
Code
Unique identifier assigned to the event.
Events in the Cisco UCS Manager GUI
If you want to view events for a single object in the system, navigate to that object in the Cisco UCS Manager GUI and click the Events tab in the Work pane. If you want to view events for all objects in the system, navigate to the Events node on the Admin tab under the Faults, Events and Audit Log node.
Events in the Cisco UCS Manager CLI
If you want to view events for all objects in the system, enter the show event command from the top-level scope. If you want to view events for a specific object, scope to that object and then enter the show event command.
If you want to view all available details about an event, enter the show event detail command.
Core Files
Critical failures in Cisco UCS Manager and some of the Cisco UCS components, such as a fabric interconnect or an I/O module, can cause the system to create a core file. Each core file contains a large amount of data about the system and the component at the time of the failure.
Cisco UCS Manager manages the core files from all of the components. You can configure Cisco UCS Manager to export a copy of a core file to a location on an external TFTP server as soon as that core file is created.
Core Files in the Cisco UCS Manager GUI
You can find out if a component in the Cisco UCS domain generated a core file by navigating to the Core Files node on the Admin tab under the Faults, Events and Audit Log node.
Core Files in the Cisco UCS Manager CLI
You can find out if a component in the Cisco UCS domain generated a core file by entering the following commands:
scope monitoring
scope sysdebug
show cores
Core File Exporter
If you enable the Core File Exporter, you can configure Cisco UCS Manager to export the core files as soon as they occur to a specified location on the network through TFTP. This functionality allows you to export the tar file with the contents of the core file to the location specified.
Audit Log
The audit log records actions performed by users in Cisco UCS Manager, including direct and indirect actions. Each entry in the audit log represents a single, non-persistent action. For example, if a user logs in, logs out, or creates, modifies, or deletes an object such as a service profile, Cisco UCS Manager adds an entry to the audit log for that action.
You can view the audit log entries in the Cisco UCS Manager CLI, Cisco UCS Manager GUI, or in a technical support file that you output from Cisco UCS Manager.
Cisco UCS Manager provides detailed information about each entry in the audit log. The following table describes the audit log entry properties that you can view in the Cisco UCS Manager GUI or the Cisco UCS Manager CLI.
Table 2 Audit Log Entry Properties
Property Name
Description
ID
Unique identifier associated with the audit log message.
Affected Object
Component affected by the user action.
Severity
Current severity level of the user action associated with the audit log message. These severities are also used for the faults, as described in Fault Severities.
Trigger
User role associated with the user that raised the message.
User
Type of user that created the event, as follows:
admin
internal
blank
Indication
Action indicated by the audit log message, which can be one of the following:
creation—A component was added to the system.
modification—An existing component was changed.
Description
Description of the user action.
Audit Log in the Cisco UCS Manager GUI
In the Cisco UCS Manager GUI, you can view the audit log on the Audit Log node on the Admin tab under the Faults, Events and Audit Log node.
Audit Log in the Cisco UCS Manager CLI
In the Cisco UCS Manager CLI, you can view the audit log through the following commands:
scope security
show audit-logs
System Event Log
The system event log (SEL) resides on the CIMC in NVRAM. It records most server-related events, such as overvoltage and undervoltage events, temperature events, fan events, and BIOS events. The SEL is mainly used for troubleshooting purposes.
Tip
For more information about the SEL, including how to view the SEL for each server and configure the SEL policy, see the Cisco UCS Manager configuration guides, which are accessible through the Cisco UCS B-Series Servers Documentation Roadmap.
SEL File
The SEL file is approximately 40 KB. No further events can be recorded when the SEL file is full. It must be cleared before additional events can be recorded.
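Unlike a log that overwrites old entries, a full SEL simply stops accepting events until it is cleared. The sketch below models that behavior under stated assumptions: the `SelBuffer` class is hypothetical, and the exact 40 x 1024 byte capacity stands in for the approximate 40 KB figure.

```python
class SelBuffer:
    """Hypothetical fixed-size log that rejects entries once it is full."""

    def __init__(self, capacity_bytes=40 * 1024):
        self.capacity_bytes = capacity_bytes
        self.entries = []
        self.used = 0

    def record(self, entry: bytes) -> bool:
        """Store the entry and return True, or return False if it does not fit."""
        if self.used + len(entry) > self.capacity_bytes:
            return False  # the event is lost until the log is cleared
        self.entries.append(entry)
        self.used += len(entry)
        return True

    def clear(self) -> None:
        """Empty the log so that new events can be recorded again."""
        self.entries.clear()
        self.used = 0
```

Because rejected events are lost, backing up and clearing the SEL before it fills is what keeps later events from being dropped.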
SEL Policy
You can use the SEL policy to back up the SEL to a remote server and optionally clear the SEL after a backup operation occurs. Backup operations can be triggered by specific actions, or they can occur at regular intervals. You can also manually back up or clear the SEL.
Cisco UCS Manager automatically generates the SEL backup file according to the settings in the SEL policy. The filename format is sel-SystemName-ChassisID-ServerID-ServerSerialNumber-Timestamp.
For example, a filename could be sel-UCS-A-ch01-serv01-QCI12522939-20091121160736.
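When sorting or indexing SEL backups on the remote server, the filename can be split back into its fields. The following sketch makes several assumptions that the format description does not state: the timestamp is read as YYYYMMDDhhmmss, the system name may contain hyphens (as in UCS-A), and the other fields do not. The function and type names are hypothetical.

```python
from datetime import datetime
from typing import NamedTuple


class SelBackupName(NamedTuple):
    system_name: str
    chassis_id: str
    server_id: str
    serial_number: str
    timestamp: datetime


def parse_sel_backup_name(filename: str) -> SelBackupName:
    """Split a SEL backup filename into its five fields."""
    prefix = "sel-"
    if not filename.startswith(prefix):
        raise ValueError("not a SEL backup filename: %r" % filename)
    # Split from the right so that hyphens in the system name are preserved.
    parts = filename[len(prefix):].rsplit("-", 4)
    if len(parts) != 5:
        raise ValueError("unexpected number of fields in %r" % filename)
    system, chassis, server, serial, stamp = parts
    # Assumed timestamp layout: YYYYMMDDhhmmss.
    when = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    return SelBackupName(system, chassis, server, serial, when)


parsed = parse_sel_backup_name(
    "sel-UCS-A-ch01-serv01-QCI12522939-20091121160736"
)
```

Under these assumptions, the example filename yields the system name UCS-A, chassis ch01, server serv01, serial number QCI12522939, and a timestamp of 21 November 2009 at 16:07:36.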
Syslog
The syslog provides a central point for collecting and processing system logs that you can use to troubleshoot and audit a Cisco UCS domain. Cisco UCS Manager relies on the Cisco NX-OS syslog mechanism and API, and on the syslog feature of the primary fabric interconnect to collect and process the syslog entries.
Cisco UCS Manager manages and configures the syslog collectors for a Cisco UCS domain and deploys the configuration to the fabric interconnect or fabric interconnects. This configuration affects all syslog entries generated in a Cisco UCS domain by Cisco NX-OS or by Cisco UCS Manager.
You can configure Cisco UCS Manager to do one or more of the following with the syslog and syslog entries:
Display the syslog entries in the console or on the monitor
Store the syslog entries in a file
Forward the syslog entries to up to three external log collectors where the syslog for the Cisco UCS domain is stored
Each syslog entry generated by a Cisco UCS component is formatted as follows:
Year month date hh:mm:ss hostname %facility-severity-MNEMONIC description
For example: 2007 Nov 1 14:07:58 excal-113 %MODULE-5-MOD_OK: Module 1 is online
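An external log collector might break each entry into the fields of this format. The regular expression below is a sketch derived from the format line and the single example above, so it may not cover every entry a real system emits. The pattern, the function name, and the choice to treat the colon after the mnemonic as optional are assumptions.

```python
import re

# Field names follow the format line; the pattern itself is an assumption.
SYSLOG_ENTRY = re.compile(
    r"^(?P<year>\d{4})\s+(?P<month>[A-Za-z]{3})\s+(?P<day>\d{1,2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+(?P<hostname>\S+)\s+"
    r"%(?P<facility>[A-Z0-9_]+)-(?P<severity>\d)-(?P<mnemonic>[A-Z0-9_]+):?\s+"
    r"(?P<description>.*)$"
)


def parse_syslog_entry(line: str) -> dict:
    """Return the fields of one syslog entry, or raise ValueError."""
    match = SYSLOG_ENTRY.match(line.strip())
    if match is None:
        raise ValueError("unrecognized syslog entry: %r" % line)
    return match.groupdict()


entry = parse_syslog_entry(
    "2007 Nov 1 14:07:58 excal-113 %MODULE-5-MOD_OK: Module 1 is online"
)
```

For the example entry, the parsed fields include the hostname excal-113, the facility MODULE, the severity 5, and the mnemonic MOD_OK.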
Syslog Entry Severities
A syslog entry is assigned a Cisco UCS severity by Cisco UCS Manager. The following table shows how the Cisco UCS severities map to the syslog severities.
Table 3 Syslog Entry Severities in Cisco UCS
Cisco UCS Severity    Syslog Severity
CRIT                  CRIT
MAJOR                 ERR
MINOR                 WARNING
WARNING               NOTICE
INFO                  INFO
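A script that correlates Cisco UCS faults with collected syslog entries could encode this mapping as a lookup table. The sketch below copies the values from the table; the dictionary and function names are hypothetical.

```python
# Mapping taken from the Syslog Entry Severities table.
UCS_TO_SYSLOG_SEVERITY = {
    "CRIT": "CRIT",
    "MAJOR": "ERR",
    "MINOR": "WARNING",
    "WARNING": "NOTICE",
    "INFO": "INFO",
}


def to_syslog_severity(ucs_severity: str) -> str:
    """Translate a Cisco UCS severity name to its syslog severity."""
    try:
        return UCS_TO_SYSLOG_SEVERITY[ucs_severity.upper()]
    except KeyError:
        raise ValueError("unknown Cisco UCS severity: %r" % ucs_severity)
```

Note that the two scales are offset: a Cisco UCS MINOR fault appears as a syslog WARNING, and a Cisco UCS WARNING appears as a syslog NOTICE.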
Syslog Entry Parameters
The following table describes the information contained in each syslog entry.
Table 4 Syslog Message Content
Name
Description
Facility
Logging facility that generated and sent the syslog entry. The facilities are broad categories that are represented by integers. These sources can be one of the following standard Linux facilities:
local0
local1
local2
local3
local4
local5
local6
local7
Severity
Severity of the event, alert, or issue that caused the syslog entry to be generated. The severity can be one of the following:
emergencies
critical
alerts
errors
warnings
information
notifications
debugging
Hostname
Hostname included in the syslog entry, which depends on the component where the entry originated, as follows:
The fabric interconnect, Cisco UCS Manager, or the hostname of the Cisco UCS domain
For all other components, the hostname associated with the virtual interface (VIF)
Timestamp
Date and time when the syslog entry was generated.
Message
Description of the event, alert, or issue that caused the syslog entry to be generated.
Syslog Services
The following Cisco UCS components use the Cisco NX-OS syslog services to generate syslog entries for system information and alerts:
I/O module—All syslog entries are sent by syslogd to the fabric interconnect to which it is connected.
CIMC—All syslog entries are sent to the primary fabric interconnect in a cluster configuration.
Adapter—All syslog entries are sent by NIC-Tools/Syslog to both fabric interconnects.
Cisco UCS Manager—Self-generated syslog entries are logged according to the syslog configuration.
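The routing rules in the list above can be summarized in a small lookup. The sketch below is illustrative only: the function name, the component keys, and the use of "A" and "B" as fabric interconnect labels are assumptions. Entries generated by Cisco UCS Manager itself follow the syslog configuration and are not modeled here.

```python
def syslog_targets(component, connected="A", primary="A"):
    """Return the set of fabric interconnects that receive the entries.

    component: "iom", "cimc", or "adapter" (hypothetical keys).
    connected: fabric interconnect to which the I/O module is connected.
    primary:   primary fabric interconnect in a cluster configuration.
    """
    if component == "iom":
        return {connected}
    if component == "cimc":
        return {primary}
    if component == "adapter":
        return {"A", "B"}
    raise ValueError("unsupported component: %r" % component)
```

For instance, an I/O module connected to fabric interconnect B sends its entries only to B, while an adapter sends its entries to both fabric interconnects.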