IT Event Correlation and Analysis Explained
posted by Anna Mar, December 23, 2011
Enterprises monitor their IT infrastructure (usually with network management software).
A large enterprise will process thousands of infrastructure events every day: information, warnings, errors, critical errors. When an infrastructure incident occurs there's often a chain reaction — the network management software reports hundreds of error events at the same time.
Incident teams often spend significant time identifying root cause. This can involve manual inspection of a sea of error events. It's not unusual for root cause identification to account for 60% or more of mean time to repair (MTTR).
IT Event Correlation And Analysis (ECA) automates (or partially automates) the process of identifying the root cause of an incident. ECA tools also improve operational efficiency by automating event management processes. This means filtering, aggregating and correlating events to generate actions (tickets) and notifications.
Incident Management
Let's say that a router goes down. This can impact thousands of servers, devices and services. They'll all send out error events within a short period of time. An incident response team will look at the events to determine what's wrong and fix it.
Ideally, ECA tools can algorithmically identify the root cause of an incident using event and network topology information. This has been the holy grail of infrastructure incident management since telecom companies first tried to automate root cause identification back in the 1970s.
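As a rough illustration of how event and topology information can be combined, the sketch below assumes a hypothetical dependency map in which each device points to the single upstream device it relies on. A device is treated as a likely root cause when it is reporting errors but nothing upstream of it is. The names UPSTREAM and find_root_causes are invented for this example, and real network topologies are rarely this simple.

```python
# Hypothetical topology (an assumption for this sketch): each device maps to
# the one upstream device it depends on, forming a tree with no cycles.
UPSTREAM = {
    "router-1": None,
    "switch-a": "router-1",
    "switch-b": "router-1",
    "server-1": "switch-a",
    "server-2": "switch-a",
    "server-3": "switch-b",
}

def find_root_causes(failing, upstream=UPSTREAM):
    """Return failing devices that have no failing device upstream of them."""
    roots = set()
    for device in failing:
        parent = upstream.get(device)
        # Walk up the dependency chain until a failing ancestor is found
        # or the top of the topology is reached.
        while parent is not None and parent not in failing:
            parent = upstream.get(parent)
        if parent is None:
            roots.add(device)
    return roots

# The router and several devices behind it all report errors at once.
failing = {"router-1", "switch-a", "server-1", "server-3"}
print(find_root_causes(failing))  # {'router-1'}
```

Events from switch-a, server-1 and server-3 are explained by the failing router above them, so only router-1 is reported.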
Today, ECA tools can (in theory) identify root cause automatically. However, in practice many tools continue to lack the sophistication to analyze events occurring in complex network topologies to identify a root cause.
ECA tools that are unable to determine a root cause can (at least) filter, correlate and aggregate events to provide the incident management team with a short list of enriched events.
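A minimal sketch of that filter-and-aggregate step might look like the following. The event fields (source, severity, message), the severity ranking and the shortlist function are all assumptions made for this example, not the format of any particular tool. Low-severity events are dropped, and repeated events are collapsed into a single entry enriched with a count.

```python
from collections import Counter

# Assumed severity ordering, based on the event types mentioned earlier.
SEVERITY_RANK = {"information": 0, "warning": 1, "error": 2, "critical": 3}

def shortlist(events, min_severity="error"):
    """Drop low-severity events, then collapse repeats into one entry each."""
    threshold = SEVERITY_RANK[min_severity]
    counts = Counter(
        (e["source"], e["message"], e["severity"])
        for e in events
        if SEVERITY_RANK[e["severity"]] >= threshold
    )
    # Most frequently repeated events come first.
    return [
        {"source": src, "message": msg, "severity": sev, "count": n}
        for (src, msg, sev), n in counts.most_common()
    ]

events = [
    {"source": "server-1", "severity": "error", "message": "gateway unreachable"},
    {"source": "server-1", "severity": "error", "message": "gateway unreachable"},
    {"source": "server-2", "severity": "information", "message": "backup complete"},
    {"source": "router-1", "severity": "critical", "message": "interface down"},
]

for entry in shortlist(events):
    print(entry)
```

Here four raw events shrink to two entries: the repeated server-1 error (count 2) and the router-1 critical error.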
Infrastructure Event Management
Let's say a server sends a disk capacity warning. The server's disk is 97% full. Your infrastructure operations team should generate a maintenance action to clean up the disk. They should also notify the application owners.
In order to do this, your infrastructure operations team may need to inspect thousands of events each day. Only a tiny fraction of infrastructure events are worthy of an action or notification. Many infrastructure operations teams spend much of their time manually inspecting and filtering events.
ECA tools can automatically filter, correlate and aggregate events. This can greatly reduce the number of events that operations teams need to process. In many cases, ECA tools can completely automate event management processes — generating actions and notifications from significant events.
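To make the idea concrete, here is a small sketch of a rule that turns a significant event into an action and a notification, using the disk capacity example above. The Event class, the metric name disk_used_percent, the handle function and the 95% threshold are all hypothetical choices for illustration.

```python
from dataclasses import dataclass

@dataclass
class Event:
    host: str
    metric: str
    value: float  # for disk metrics, percent of capacity used

def handle(event, disk_threshold=95.0):
    """Return (tickets, notifications) generated for a single event."""
    tickets, notifications = [], []
    # Assumed rule: a disk at or above the threshold needs cleanup
    # and the application owners should be told.
    if event.metric == "disk_used_percent" and event.value >= disk_threshold:
        tickets.append(
            f"Clean up disk on {event.host} ({event.value:.0f}% full)"
        )
        notifications.append(
            f"Notify application owners of {event.host}: disk at {event.value:.0f}%"
        )
    return tickets, notifications

tickets, notifications = handle(Event("server-1", "disk_used_percent", 97.0))
print(tickets)
print(notifications)
```

Events that match no rule produce empty lists, which is how the bulk of routine events would be filtered out without anyone looking at them.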
How does this relate to ITIL v3?
ITIL v3 refers to ECA tools as correlation engines. Event correlation is part of the ITIL Event Management process.