Alerts, Events, and Data Collection

Over at IT Skeptic there is some question over the ITIL use of the terms “Event” and “Alert”. I thought this was fairly amusing because I’ve been having this discussion with my customers (and my team) for years.

I have designed, implemented, and managed HP OM (OVO or Openview), NNM, OVIS, Netview, T/EC, Concord SystemEdge, Topaz/BAC/BSM (including BPM/RUM), Freshwater SiteScope/Mercury SiteScope/HP SiteScope, and AlarmPoint/xMatters over the past 10-plus years.

When trying to set this ‘stuff’ up, we have to know what is important to capture, what is of some importance to know (informational, perhaps), and what we need to know right now in order to take immediate action.

Over the years I’ve come to think of it in these terms:

Data Collection = anything that we (either the customer, or us based on our ‘expertise’) have determined is important enough to capture and record. It is the primary source for reporting.

Examples:

  • Disk space utilization every 5 minutes (we don’t care what it is, only that we capture/record it)
  • CPU utilization
  • Security Log entries
  • End user emulation transaction times
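
To make the "we don’t care what it is, only that we capture/record it" idea concrete, here is a minimal Python sketch. The `Sample` and `Collector` names, the host name, and the metric names are all made up for illustration; a real monitoring tool would store this in its own database.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Sample:
    host: str
    metric: str
    value: float
    timestamp: float


@dataclass
class Collector:
    """Hypothetical in-memory store: records every sample, judges none."""
    samples: list = field(default_factory=list)

    def record(self, host, metric, value, timestamp=None):
        # No thresholds here: data collection only captures and stores.
        ts = time.time() if timestamp is None else timestamp
        sample = Sample(host, metric, value, ts)
        self.samples.append(sample)
        return sample

    def history(self, host, metric):
        # Reporting reads back whatever was recorded for one host/metric.
        return [s for s in self.samples
                if s.host == host and s.metric == metric]


collector = Collector()
collector.record("web01", "disk_used_pct", 42.0, timestamp=0)
collector.record("web01", "disk_used_pct", 43.5, timestamp=300)  # 5 minutes later
collector.record("web01", "cpu_pct", 12.0, timestamp=300)

disk_history = collector.history("web01", "disk_used_pct")
```

Nothing in this sketch decides whether a value is good or bad. That decision belongs to the Event and Alert layers.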

Events = Any data collected that either has value for immediate action (either automated or manual) or contains information of a ‘proactive’ nature. Events provide additional insight into Incident/Problem Management, and Problem Management can use them to trend “events” over time.

Examples:

  • Disk space has exceeded a certain threshold (either sustained over a period of time, which is my preference, or occurring once) – Perhaps 50%, or 65%, or 95%
  • A security log entry of a particular type
  • End user emulation transactions have failed – from one location or maybe all locations (either sustained over a period of time, which is my preference, or occurring once)
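
The difference between "occurred once" and "over a period of time" can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and "a period of time" is approximated here as a number of consecutive samples above the threshold.

```python
def exceeds_once(values, threshold):
    """True if any single sample is above the threshold."""
    return any(v > threshold for v in values)


def exceeds_sustained(values, threshold, consecutive):
    """True if `consecutive` samples in a row are above the threshold."""
    run = 0
    for v in values:
        # Extend the current run, or reset it when a sample drops back down.
        run = run + 1 if v > threshold else 0
        if run >= consecutive:
            return True
    return False


# Made-up disk utilization samples (percent), one every 5 minutes.
disk_pct = [48, 52, 49, 51, 53, 55]

once = exceeds_once(disk_pct, 50)                # one sample is enough
sustained_3 = exceeds_sustained(disk_pct, 50, 3)  # last three samples qualify
sustained_4 = exceeds_sustained(disk_pct, 50, 4)  # never four in a row
```

The sustained check is what filters out a single blip, at the cost of raising the Event a few samples later.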

Alerts = Any event that meets or exceeds defined thresholds that require immediate attention/action by ‘service providers’ (sys admins, DBAs, network engineers, product managers, service managers, service desk). Indicators of Incidents and/or Problems.

Examples:

  • Disk space has exceeded a certain threshold – usually something high like 95% and almost always over a period of time (to avoid the “false positive”)
  • A security log entry of a particular type
  • End user emulation transactions have failed – usually from more than one location and for a period of time.
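
As a rough sketch of the last example, the rule "failed from more than one location and for a period of time" might look like this in Python. The `should_alert` function, its defaults, and the location names are assumptions for illustration, not any product’s API.

```python
def should_alert(results_by_location, min_locations=2, min_consecutive=3):
    """Decide whether failed transactions deserve an Alert.

    results_by_location maps a location name to a list of booleans,
    oldest first, where True means the emulated transaction failed.
    """
    failing = [
        location
        for location, results in results_by_location.items()
        # A location counts only if its most recent runs all failed.
        if len(results) >= min_consecutive and all(results[-min_consecutive:])
    ]
    return len(failing) >= min_locations


# Made-up probe results from three locations.
probes = {
    "nyc": [False, True, True, True],
    "lon": [True, True, True],
    "sfo": [False, False, True],
}

widespread = should_alert(probes)                       # nyc and lon both qualify
single_site = should_alert({"nyc": [True, True, True]})  # only one location
```

A failure at one location would still be worth recording as an Event. It just would not page anyone under these assumed settings.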

So, I think of it like this: an Alert must first be an Event, which must first be Data that is collected.

  • Data collection > Events > Alerts
  • Lots of things > Some things > Few things
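
The funnel can be illustrated with a small Python sketch. The thresholds (65% and 95%) are borrowed from the disk space examples. The `Level` enum, the `classify` function, and the readings are made up, and for brevity each reading is judged on its own rather than over a period of time.

```python
from enum import IntEnum


class Level(IntEnum):
    DATA = 1   # collected and recorded, nothing more
    EVENT = 2  # worth noting or trending
    ALERT = 3  # needs immediate attention


def classify(disk_pct, event_threshold=65, alert_threshold=95):
    """Return the highest level a single disk reading reaches."""
    if disk_pct >= alert_threshold:
        return Level.ALERT
    if disk_pct >= event_threshold:
        return Level.EVENT
    return Level.DATA


# Made-up disk utilization readings (percent).
readings = [40, 55, 70, 66, 97, 30, 80, 20]
levels = [classify(r) for r in readings]

# Every Alert is also an Event, and every Event is also collected Data.
collected = len(levels)
events = sum(level >= Level.EVENT for level in levels)
alerts = sum(level >= Level.ALERT for level in levels)

print(collected, events, alerts)  # 8 4 1
```

With these readings the counts come out as 8 collected, 4 Events, and 1 Alert: lots of things, some things, few things.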

Not all Data collected is worthy to be an Event – I just want to log CPU over time so I can graph it later.

Not all Events are worthy to be an Alert – CPU spiked once on one web server (although if it happens every day, perhaps Problem Management, using reports on Events, can see this and investigate).

All Alerts should create Incident and/or Problem tickets. Something is really messed up (or soon will be), requiring immediate (or near-immediate) action/work.
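
The idea of Problem Management spotting an Event that happens every day can be sketched as a simple report over an event log. This is a hypothetical Python example: the `recurring_events` function, the log format, and the "three distinct days" cutoff are all assumptions.

```python
from collections import defaultdict


def recurring_events(events, min_days=3):
    """Find Events that recur on at least `min_days` distinct days.

    `events` is an iterable of (day, host, kind) tuples. Returns a dict
    mapping (host, kind) to the number of distinct days it was seen.
    """
    days_seen = defaultdict(set)
    for day, host, kind in events:
        days_seen[(host, kind)].add(day)
    return {key: len(days) for key, days in days_seen.items()
            if len(days) >= min_days}


# Made-up event log: none of these was important enough to be an Alert.
event_log = [
    ("2024-01-01", "web01", "cpu_spike"),
    ("2024-01-02", "web01", "cpu_spike"),
    ("2024-01-02", "web02", "cpu_spike"),
    ("2024-01-03", "web01", "cpu_spike"),
]

report = recurring_events(event_log)
```

Here `web01` shows up because its CPU spiked on three different days, while the one-off spike on `web02` stays out of the report.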

Works for me anyway.
