Ideas on Categorizing Changes and Problems…

Categorization

Everybody categorizes changes and problems. But what I find is that the categories are typically too narrow and don’t allow for a broader understanding of what is driving your IT shop.

Usually data will show that you do a lot of work in a certain application or service. Or that your changes are only planned 8 days in advance or your emergency changes are 15% of all changes. Or maybe you had 13 problems in Email last year.

But this doesn’t help explain what is driving these numbers.

Change Categories

I am a big fan of capturing not just whether your Change effort was an Emergency, but what the real driver for the change was.

In particular, I believe you can “categorize” your change efforts into just a few buckets:

Break/fix – something is broken and you have to fix it.

Planned Maintenance – Maybe this is a planned patch or reboot or even an upgrade of your application. This is just to keep things going, maybe to maintain support with a vendor or close a security hole. There is no “additional benefit” to the client that is driving this effort. There might be some, but it isn’t the driver. How do you know if it is the driver? Well, how are you selling the effort? Are you selling the change effort as something that will change the way the client works (additional features), or will they even notice? Is it more for your IT support staff or for the client? If it is for your IT staff – it is PM (unless it has already broken, then it is B/F).

Enhancement – This one is tricky. Most people in IT will want their work to go here but in reality, it is probably PM. Enhancement is for change efforts that will ‘enhance’ the way the client does work today. It will cause some Business change as a result. If your change doesn’t do this, or doesn’t intend to do this, it isn’t an Enhancement. Some examples:

I have a plain old telephone system. I am switching it all out for IP telephony at some considerable expense. This requires a lot of IT engineering and transition planning. When it is done, the business will be able to make phone calls just like they do today.

So, what kind of change is that?

If you answered Enhancement – you would be wrong. Nothing about how the business functions will be changed because of this effort. They will still make phone calls like they used to. The underlying technology has changed, but that did not “enhance” the business, did it? No, it did not.

Another example: I am using Exchange 2003 and I’m upgrading to Exchange 2010. There are a lot of new features that we may take advantage of in the future. This change though is just getting us to the new infrastructure. There is a lot riding on this because our company lives and dies with email.

What kind of change is this?

Planned Maintenance. Again, it does not change the way the business works. It is a background change for IT. That doesn’t take away from the hard work, the long hours, or the importance of the work. It just categorizes it for what it is.

One more example: When a new person is hired, the hiring manager (or office manager) can now click a single link from the company’s home page and fill out a small form. This will kick off all the tickets that need to be created for laptop provisioning, phone provisioning, and account provisioning. The master ticket will then be emailed to the manager, who no longer has to call the Service Desk and track multiple tickets.

Finally, an Enhancement! This takes a service that is already provided and makes it better. Crucially, this is going to change the way your client works – not a drastic change, but a change nevertheless.

Transformative – This change is fairly rare. This is really transforming the way the client does business. This isn’t a fairly small step (Enhancement); this is a large step requiring extensive training and probably new/changed business processes.

Example: We are integrating Email, Voice Mail, and Instant Messaging, and creating “soft phones” on your laptops. We are also introducing cameras for all laptops/desktops so we can do video conferencing. We are calling this Unified Communications.

This can be a fairly dramatic change to an organization, requiring a lot of training for your customer base. There are also a lot of new features (voice mail accessible in email, ability to “see” each other) that need to be discussed and explained so we can make the best use of them. One of the biggest might be the “soft phone” on the laptop. Now your work phone travels…so we expect you to pick it up. :-)

Perhaps, though, this isn’t a big enough change for you to qualify as “transformative”, so you may only call it an “enhancement” – that is fine. The point here is that you have to think about it, and you can’t call everything you do “transformative” or an “enhancement” because of the impact it has on IT – the impact has to be on the customer!

Finally, the last type of change I would categorize is:

Legal or M&A – Any type of change that is dictated by legal requirements or because of merger and acquisition activity. Typically there is little lead time for this work – so it is actually “interrupt” work. This is probably coming across as an “Emergency” change – but the reason it is an emergency change is very different than, say, B/F. You probably need to know that so you can explain why Emergency changes have gone up or why you can’t get them below a certain point. If your business is buying/merging, or is heavily regulated with lots of rule changes – there is only so much an IT department can “predict” – the rest is reactive.
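If it helps to see the five buckets as a decision procedure, here is a minimal Python sketch of how I might encode them. The function name categorize_change, its yes/no parameters, and the order in which the questions are asked are my own illustrative assumptions, not something taken from any ITSM tool. In particular, checking the Legal/M&A driver first is just one reasonable choice.

```python
from enum import Enum


class ChangeCategory(Enum):
    BREAK_FIX = "Break/fix"
    PLANNED_MAINTENANCE = "Planned Maintenance"
    ENHANCEMENT = "Enhancement"
    TRANSFORMATIVE = "Transformative"
    LEGAL_MA = "Legal or M&A"


def categorize_change(
    *,
    already_broken: bool = False,
    legal_or_ma_driven: bool = False,
    changes_how_client_works: bool = False,
    needs_extensive_training: bool = False,
) -> ChangeCategory:
    """Hypothetical helper: pick a bucket from the driver of the change."""
    # Assumption: a legal or M&A driver wins over everything else.
    if legal_or_ma_driven:
        return ChangeCategory.LEGAL_MA
    # Something is broken and has to be fixed.
    if already_broken:
        return ChangeCategory.BREAK_FIX
    # The impact has to be on the customer, not on IT.
    if changes_how_client_works:
        if needs_extensive_training:
            return ChangeCategory.TRANSFORMATIVE
        return ChangeCategory.ENHANCEMENT
    # Everything else just keeps things going.
    return ChangeCategory.PLANNED_MAINTENANCE


# The IP telephony swap: the business makes phone calls just like before.
ip_telephony = categorize_change(changes_how_client_works=False)

# The single-link new-hire form: a small change in how the client works.
new_hire_form = categorize_change(changes_how_client_works=True)
```

With these answers the telephony swap lands in Planned Maintenance and the new-hire form in Enhancement, which matches the examples earlier in this post. The hard part is still answering the questions honestly, not the code.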

Problem Categories

Now with all that in mind, I was thinking of Problems. What are the main categories of Problems?

The list I came up with is:

Hardware – Rather simple one here. A hardware error. I suspect you won’t see many of these but still, they happen from time to time.

Engineering/Configuration – This is an error that was introduced in the engineering or configuration of the system, application or service. This could be vendor caused or internal engineering caused. This type of error is typically troublesome, hard to find, and takes a while to fix. If you are lucky it is something relatively straightforward, like a setting that was wrong on your Load Balancer so it wasn’t actually distributing the load properly.

Administrative/Operational – This is an error caused by the inattention of the operational staff due to a) ineptitude, b) ignorance or c) understaffing (other priorities). A disk filling up is an Administrative/Operational issue. Patches not being applied in a timely manner, causing a security hole to be exploited, is an Administrative/Operational error. A redundant NIC failure that isn’t noticed, so that when the primary fails the server is offline, is a mixture of Hardware and Administrative/Operational error.

That is it. All errors I can think of will fall into one of those 3 categories.
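Since a single problem can be a mixture (the unnoticed redundant NIC failure is both Hardware and Administrative/Operational), a combinable flag is one way to record it. This is only a sketch: the names ProblemCategory and tally_problems are hypothetical, and a real ticketing system will have its own way of storing this.

```python
from collections import Counter
from enum import Flag, auto


class ProblemCategory(Flag):
    HARDWARE = auto()
    ENGINEERING_CONFIG = auto()
    ADMIN_OPERATIONAL = auto()


def tally_problems(problems):
    """Hypothetical helper: count how often each category shows up.

    A mixed problem counts once for every category it carries.
    """
    counts = Counter()
    for flags in problems:
        for category in ProblemCategory:
            if category in flags:
                counts[category.name] += 1
    return counts


# Illustrative data only, based on the examples in this post.
problems = [
    ProblemCategory.ADMIN_OPERATIONAL,                             # disk filled up
    ProblemCategory.ENGINEERING_CONFIG,                            # load balancer setting
    ProblemCategory.HARDWARE | ProblemCategory.ADMIN_OPERATIONAL,  # unnoticed NIC failure
]
counts = tally_problems(problems)
```

Counting a mixed problem under both categories is a design choice. It means the totals can add up to more than the number of problems, but it keeps the operational share honest.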

So, what do these categories tell you?

Well, for changes it tells you what type of work you are doing – Break/Fix and Planned Maintenance are KTLO (keep the lights on) type work. They are “operational” in nature. They don’t excite the business at all (they do, however, keep it running).

Enhancements and Transformative changes are all about BITA (Business IT Alignment) – what are you doing to help the business grow? Do you think you are aligned with the business? Why then are only 10% of all your changes either Enhancements or Transformative?

Legal, M&A – is a mixture of KTLO and BITA. You have to do it to keep the lights on, but it isn’t IT driven, it is business driven. This data can show the business that you do jump when needed.
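To put a number on the KTLO versus BITA question, here is a rough sketch that turns a list of categorized changes into percentages. The function name work_mix and the category label strings are assumptions for illustration, and the sample data is made up. Legal/M&A is reported on its own because it is a mixture of the two.

```python
from collections import Counter

# Assumed labels; use whatever your change tool actually stores.
KTLO = {"Break/fix", "Planned Maintenance"}
BITA = {"Enhancement", "Transformative"}
LEGAL_MA = "Legal or M&A"


def work_mix(change_categories):
    """Hypothetical helper: percentage of changes per type of work."""
    counts = Counter(change_categories)
    unknown = set(counts) - KTLO - BITA - {LEGAL_MA}
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        return {"KTLO": 0.0, "BITA": 0.0, "Legal/M&A": 0.0}
    groups = {
        "KTLO": sum(counts[c] for c in KTLO),
        "BITA": sum(counts[c] for c in BITA),
        "Legal/M&A": counts[LEGAL_MA],
    }
    return {name: round(100 * n / total, 1) for name, n in groups.items()}


# Made-up sample: 10 changes in a period.
sample = (
    ["Break/fix"] * 5
    + ["Planned Maintenance"] * 3
    + ["Enhancement"]
    + ["Legal or M&A"]
)
mix = work_mix(sample)
```

For this sample, KTLO is 80% and BITA is 10%, which is exactly the kind of number that should start a conversation about alignment.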

You can even use these categories to help set urgency (or priority if you are so inclined). What is more important – KTLO or BITA? Where do you put more of your attention, more of your star players? Probably BITA, but that is for you to decide.

On the Problem data, the Hardware category might be interesting but it probably won’t be. What will be interesting to know is if most of your Problems are Engineering related or Operational related. Is it because you are not doing a good enough job in Service Design/Transition or because you are not doing enough in Operations?

They require different organizational responses so it is probably quite helpful to know the answer. Should you invest in a new Testing team, testing manager, testing software or should you spend more on hiring (or outsourcing) Operational efforts? Maybe you have poor processes, or poor operational tools (like an Event tool or alerting tool).

Categorizing your changes and problems in this manner can help you make these decisions in a way that the traditional category methods cannot.

Alerts, Events, and Data Collection

Over at IT Skeptic there is some question over the ITIL use of the terms “Event” and “Alert”. I thought this was fairly amusing because I’ve been having this discussion with my customers (and my team) for years.

I have designed, implemented and managed HP OM (OVO or Openview), NNM, OVIS, Netview, T/EC, Concord SystemEdge, Topaz/BAC/BSM (including BPM/RUM), Freshwater SiteScope/Mercury SiteScope/HP SiteScope, and AlarmPoint/xMatters over the past 10 plus years.

When trying to set this ‘stuff’ up we have to know what is important to capture, what is of some importance to know (informational perhaps), and what you need to know right now to take immediate action on.

Over the years I’ve come to think of it in these terms:

Data Collection = anything we’ve (either the customer, or us based on our ‘expertise’) determined important enough to capture and record. The primary source for reporting.

Examples:

  • Disk space utilization every 5 minutes (we don’t care what it is, only that we capture/record it)
  • CPU
  • Security Log entries
  • End user emulation transaction times

Events = Any data collected that either has value for immediate action (either automated or manual) or contains information of a ‘proactive’ nature. Provides additional insight into Incident/Problem Management. Can be used by Problem to trend “events” over time.

Examples:

  • Disk space has exceeded a certain threshold (over a period of time (my preference), or occurred once) – Perhaps 50%, or 65%, or 95%
  • A security log entry of a particular type
  • End user emulation transactions have failed – from one location or maybe all locations (over a period of time (my preference), or occurred once)

Alerts = Any event that meets or exceeds defined thresholds that require immediate attention/action by ‘service providers’ (sys admins, DBAs, network engineers, product managers, service managers, service desk). Indicators of Incidents and/or Problems.

Examples:

  • Disk space has exceeded a certain threshold – usually something high like 95% and almost always over a period of time (to avoid the “false positive”)
  • A security log entry of a particular type
  • End user emulation transactions have failed – usually from more than 1 location and for a period of time.
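The “over a period of time” part is what keeps a single spike from paging somebody. A minimal sketch of that check, assuming the readings arrive at a fixed interval (say every 5 minutes) and that “a period of time” means a number of consecutive readings. The function name sustained_breach and the choice of three readings are hypothetical.

```python
def sustained_breach(samples, threshold, consecutive):
    """Hypothetical helper: True if the most recent `consecutive`
    readings are all above `threshold`."""
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    if len(samples) < consecutive:
        # Not enough history yet to say the breach is sustained.
        return False
    return all(value > threshold for value in samples[-consecutive:])


# Disk utilization in percent, oldest reading first (made-up numbers).
one_spike = [70, 96, 71, 72]
stuck_high = [70, 96, 97, 98]

spike_alerts = sustained_breach(one_spike, threshold=95, consecutive=3)
stuck_alerts = sustained_breach(stuck_high, threshold=95, consecutive=3)
```

Here the single spike does not raise an Alert while the disk that stays above 95% does. Setting consecutive to 1 gives you the “occurred once” behavior instead.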

So, I think of it like this: an Alert must first be an Event, which must first be Data that is collected.

  • Data collection > Events > Alerts
  • Lots of things > Some things > Few things

Not all Data collected is worthy to be an Event – I just want to log CPU over time so I can graph it later.

Not all Events are worthy to be an Alert – CPU spiked once on one web server (although if it happens every day, perhaps Problem Management, using reports on Events, can see this and investigate).

All Alerts should create Incident and/or Problem tickets. Something is really messed up (or is soon to be), requiring immediate (or near immediate) action/work.
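The whole funnel can be sketched in a few lines of Python. Everything here is illustrative: the class name MonitoringFunnel is made up, the 65% and 95% thresholds are borrowed from the disk space examples, and raising an Alert after three high readings in a row is my assumption for “over a period of time”.

```python
from dataclasses import dataclass, field


@dataclass
class MonitoringFunnel:
    event_threshold: float = 65.0   # at or above this, a reading is an Event
    alert_threshold: float = 95.0   # readings at or above this build toward an Alert
    alert_after: int = 3            # assumed: consecutive high readings needed
    collected: list = field(default_factory=list)
    events: list = field(default_factory=list)
    alerts: list = field(default_factory=list)
    _streak: int = 0

    def record(self, reading: float) -> str:
        """Store one reading and return the highest tier it reached."""
        # Data collection: everything is captured, whatever its value.
        self.collected.append(reading)
        if reading < self.event_threshold:
            self._streak = 0
            return "data"
        # Event: worth knowing about and trending over time.
        self.events.append(reading)
        if reading >= self.alert_threshold:
            self._streak += 1
        else:
            self._streak = 0
        # Alert: sustained, needs immediate action (open a ticket here).
        if self._streak >= self.alert_after:
            self.alerts.append(reading)
            return "alert"
        return "event"


funnel = MonitoringFunnel()
tiers = [funnel.record(r) for r in [40, 70, 96, 97, 98, 50]]
```

For these six made-up readings, all six are collected, four become Events and one becomes an Alert: lots of things, some things, few things.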

Works for me anyway.