We’re thrilled to have completed another big PIE CMS project. This implementation, for a major retailer with more than 5200 retail locations, took less than 500 hours from start to finish (about 60% of which was spent developing the innovations outlined below). In our continuing mission to make HPE UCMDB the hub of the most comprehensive Configuration Management System possible, we implemented our full PIE offering (ITSM, UCMDB, Universal Discovery, UCMDB Browser, BSM). But we didn’t stop there. We were able to expand scope for this customer to include Closed Loop Incident Processing (CLIP) of their monitoring software as well. We added Network Node Manager i (NNMi), HP’s new Operations Manager i (OMi) tool, and improved integration and functionality of the HP Business Service Management (BSM) tool; all driven by our automation approach and best practices.
The key to a truly integrated and automated IT Operations Management solution is maintaining good data and unique Global IDs for CIs across disparate toolsets; and therein lay our challenge. In light of this recent success with monitoring integration, we’d like to share with you some of the obstacles we encountered, how we dealt with them, and how you can benefit from our innovation.
BSM (which essentially runs on a version of UCMDB called RTSM) integrates with HP SiteScope (and other HP data collectors: Business Process Monitor (BPM), Real User Monitor (RUM), and Diagnostics) in order to provide performance metrics and/or Events for infrastructure health. But another important function of this integration is supplementing topology in the event of missing CIs and relationships in BSM/RTSM.
Why does SiteScope supplement topology? An Event has to be tied to a CI in order for that Event to have a chance to create an Incident in Service Manager (Incident is a type of IT Process Record). The infrastructure CI to which the Event is attached (think Computer, Running Software, or Database) must belong to a Business Application, Business Service, or Infrastructure Service or else Service Manager will reject that Event. This is the out-of-the-box enforcement behavior for Service Manager because each App or Service in SM is tied to a Functional Group (the responsible party that handles that service’s health), therefore making an Incident actionable. Therefore, in order for an Event/Incident to be addressed and tracked, it is essential that it be attached to one or more CIs, and that those CIs match in all the various systems.
Sounds simple enough, right? It should be, but because of the way each tool in the HP solution stack collects, stores, and shares data, leveraging automation is quite a challenge. While each tool is meant to address the same logical thing, differences in the way those things are stored usually lead to sporadic success at best. Organizations without Effectual’s PIE need to rely on costly and tedious manual management of the data in each tool (and will often have to build entire teams just to do so). Others create an elaborate hodge-podge of manual and automated processes, or simply work with what the out-of-the-box integration provides.
The CLIP success rate for HP customers varies greatly. But with the PIE approach of synchronizing data, every Incident, along with every CI and related Service item, can be routed automatically, in real time and without fail, to the correct assignee. The best case without Effectual: it works most of the time for some things. With Effectual, CLIP works every time for everything.
Here’s one example of where these tools run into integration trouble because of differences in their respective data sets. The dynamic disk monitor from SiteScope reports the mount point of a hard drive (known as a FileSystem CI-type in UCMDB/RTSM) on a Windows machine as “C:”, “D:”, etc. However, Universal Discovery in UCMDB discovers the mount point of that same drive on that same Windows box as “C”, “D”, etc.
Keeping in mind that the reconciliation rules for a disk drive are Container and the attribute Mount Point, you start to see the picture. “Container” is the parent CIT, or the Windows CI. Since it remains the same in both cases, we determined that was not the root cause of the conflict. The “Mount Point” attribute however, is the drive letter (C, D, E, etc.). So, if a Windows computer is discovered by UD, and monitored by SiteScope, conflicts in the data used to reconcile its FileSystem child CIs will result in duplicates. Although the containers are the same, the names of the mount points of the drives are different simply because SiteScope includes the colon and UD does not!
What We Did
The resolution to the FileSystem reconciliation naming conflict was relatively straightforward. We went back to the UD Discovery script, analyzed it and found that it deliberately removes the “:” in the Mount Point attribute when creating the CI. By preventing the script from removing the “:” we were able to create FileSystem CIs that matched those being reported by SiteScope and the reconciliation engine did the rest.
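The mismatch and the fix can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual UD discovery script or the UCMDB reconciliation engine: the `FileSystemCI` type, the `recon_key` and `keep_colon` helpers, and the host name are all hypothetical.

```python
from typing import NamedTuple

class FileSystemCI(NamedTuple):
    container: str    # parent Windows CI (hypothetical identifier)
    mount_point: str  # drive letter exactly as the source reports it
    source: str       # which tool reported the CI

def recon_key(ci):
    # Simplified identification rule: Container plus Mount Point, compared literally.
    return (ci.container, ci.mount_point)

def reconcile(cis):
    # Group reported CIs by key; each distinct key becomes one CI.
    merged = {}
    for ci in cis:
        merged.setdefault(recon_key(ci), []).append(ci.source)
    return merged

def keep_colon(ci):
    # Mirrors the fix described above: leave the ":" on the mount point.
    mp = ci.mount_point if ci.mount_point.endswith(":") else ci.mount_point + ":"
    return ci._replace(mount_point=mp)

reported = [
    FileSystemCI("win-host-01", "C:", "SiteScope"),
    FileSystemCI("win-host-01", "C", "Universal Discovery"),
]

duplicates = reconcile(reported)                      # "C:" and "C" stay separate
aligned = reconcile(keep_colon(ci) for ci in reported)  # both reports share one key
```

With the colon preserved, both reports share the key `("win-host-01", "C:")` and collapse into a single FileSystem CI.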
But we wanted to accomplish two more goals: to manage the data bi-directionally, and to assign a unique global ID so events in BSM line up to the service and infrastructure catalog in Service Manager. BSM 9.2x, however, uses an instance of UCMDB 9.05 for its RTSM core, meaning that its version of the CI type model is some 5 or more years old. Since the CMS at the center of this build is a UCMDB 10.2x, there have been some changes to the CIT model since the CIT model of 2010. So to get the functionality we wanted, we needed to refresh the BSM CI Type Model and bring it up to current UCMDB standards.
The screenshots below illustrate the CITs and relationships we needed to add to BSM’s CIT model in order to modernize it to receive CITs from UCMDB 10.2x. They show as Errors because the downstream RTSM under BSM was unable to sync with the specifically-typed CIs in the CMS.
Why It Matters
Remember what we said about needing the right Incident to automatically relate to the right CI? Same applies here; the Databases, Running Software, etc. are all objects that are monitored by tools reporting to BSM (like SiteScope). Remember, if SiteScope doesn’t have a CI to attach an event to, it creates a new one. The one it creates has no global ID, and no relationships to other CIs (like a Business Application) and is therefore an island (not to mention a duplicate CI in other systems, such as UCMDB and, more critically, ITSM).
Adding these new CITs to BSM and using the same set of data for these CITs from the CMS allows the events to relate to the proper CIs which are a part of the same Applications and Services in Service Manager. BSM, being the oldest of the tools in use, was the only tool (out of SM, UCMDB and OMi) which did not already include these CITs. So adding them to its CIT model (and aligning the CIT models of all the tools) was an obvious move which allowed us to seamlessly integrate the valuable topology information from BSM to the rest of the CMS. Yes, there are critics who will think that manipulating the BSM CIT model will create additional challenges, but these are already understood and easily solvable. Allowing these simple challenges to deter an organization from achieving real CLIP automation between HP tools is both costly and irresponsible.
HP Operations Manager i (OMi) functions very similarly to BSM v10. OMi was just released this past March, so it is extremely new. As with most new products, this first version was not without its bugs. So on our first encounter with the tool, we were impacted by two major issues.
First, we immediately saw that the out-of-the-box OMi integration with UCMDB was completely broken. When we tried to push CIs to OMi, only one CI would update at a time. For example, if we were pushing ten CIs, one would update and the other nine wouldn’t. Then, on the next iteration, another single CI would make it, and there’d be eight left. Now imagine trying to push a UCMDB with a million CIs instead of ten.
Secondly, the event stream for OMi requires information from other monitoring tools. However, the other tools can generate duplicate CIs. The duplicate FileSystem CIs made in SiteScope that we mentioned above are just one example; Diagnostics also makes its own duplicate CIs. Without intervention, this poor data quality can severely limit the value of the event stream.
What We Did
The integration between UCMDB and OMi normally goes through a web server called a “Gateway”, which passes work to a second server called a “Data Processing Server” or DPS. That integration is broken in OMi v 10.00 and 10.01. So in order to deal with the push problem, our only option was to integrate directly to the DPS.
Unfortunately, this workaround was problematic in this highly secure, high availability environment. The HP High Availability or “HA” mechanism that OMi employs uses multiple Gateways and DPSes to detect and fail-over when one of the servers goes down. It’s designed this way to allow users to keep working even during an outage. However, integrations to an HA OMi only work with connections going to the Gateway and with the limitation above, this wasn’t an option.
So connecting directly to a DPS server means that when the server fails, so does the integration. When the DPS fails to its secondary machine, someone has to manually alter the integration target server and begin a full synchronization as a result, therefore completely defeating the purpose of using HA in the first place. (Given this information, the direct integration to the DPS is only a temporary workaround. We have been told that this is a known bug in OMi v10.0x, which will hopefully be resolved in forthcoming versions of the software. Until then, you might want to wait to upgrade your software.)
Then we had to address the OMi event stream. The event stream was developed as follows:
- Monitoring tool (SiteScope, BPM, RUM, Diagnostics) takes a measurement, determines that a threshold has been breached, and forwards an Event to BSM.
- BSM correlates that Event to a monitor CI and an infrastructure CI in RTSM, and displays the offender in red on a dashboard. It then forwards that Event to OMi.
- Once OMi receives the Event from BSM, it also relates that Event to the corresponding CIs in its own RTSM, and displays them on another set of dashboards (a somewhat redundant step, since this has already been done in BSM).
- OMi forwards the Event to create an Incident in SM, which includes the infrastructure CI as well as the Business Application/Service and/or Infrastructure Service to which that CI belongs.
- An Incident ticket is generated and the responsible party or parties are notified.
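The last two steps of this stream depend on the rule described earlier: an Event only becomes an actionable Incident if its CI exists and belongs to an Application or Service with a responsible group. Here is a minimal Python sketch of that gate. The `CI` and `Event` classes, the `create_incident` function, and all IDs and group names are hypothetical, and this is not Service Manager's actual API.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CI:
    global_id: str
    name: str
    services: list = field(default_factory=list)  # Applications/Services the CI belongs to

@dataclass
class Event:
    title: str
    ci_global_id: Optional[str]  # global ID of the CI the Event is attached to

def create_incident(event, cmdb, assignment_groups):
    # Reject Events whose CI is unknown or is an "island" with no service.
    ci = cmdb.get(event.ci_global_id)
    if ci is None or not ci.services:
        return None
    service = ci.services[0]
    return {
        "title": event.title,
        "ci": ci.name,
        "service": service,
        "assignee": assignment_groups[service],
    }

cmdb = {
    "gid-001": CI("gid-001", "win-host-01", ["Online Store"]),
    "gid-999": CI("gid-999", "orphan-duplicate"),  # duplicate with no relationships
}
assignment_groups = {"Online Store": "E-Commerce Ops"}

routed = create_incident(Event("Disk C: 95% full", "gid-001"), cmdb, assignment_groups)
rejected = create_incident(Event("Disk C: 95% full", "gid-999"), cmdb, assignment_groups)
```

The duplicate CI has no service relationship, so the second Event never becomes an Incident. That is the failure mode the rest of this work is meant to prevent.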
We had to do a couple of things in order to get this stream functioning properly. First, we had to ensure that the monitoring tools were relating to good CIs, not duplicates. Then we had to inherit the Monitor CITs and their relationships into the CMS/UCMDB, and push that info into OMi. This took a little bit of work. Out-of-the-box, UCMDB does not contain Monitor CITs, so we had to update its CIT model to incorporate them. This was relatively straightforward; OMi uses UCMDB v 10.11 at its core (for RTSM), and has Monitor CITs. So we exported copies of its CITs, and imported them into the CMS/UCMDB.
Why It Matters
All of this effort is about making sure Events relate to CIs with uniform global IDs in each tool. The UCMDB and RTSM instances behind the monitoring and configuration management tools all needed to have uniformly defined CITs, even if their reconciliation priorities and validation rules were different. Without merging the information model, you cannot ensure data quality over the lifecycle of the CIs being moved. An unmerged information model would create pseudo-versions of data which represent the same thing in the real world, but not the same CI in the data sets of each tool. This challenge is a fundamental reason why most integrations between two systems create a third set of data over time, and why basic or out-of-the-box integrations work in the short term but fail over time.
At this point, all versions of UCMDB across all the tools were functioning with the same version of the CI Type Model released with 10.2x. So once the data was synchronized and up to date between the tools, the CMS could effectively maintain uniqueness for both discovered and non-discovered data (people, places and things), as well as the events and monitors themselves.
An added benefit to harmonizing this additional information into the CMS is that it is also now reportable, consumable, and searchable in the customer’s PIE-enhanced UCMDB Browser. Additional value can be derived from leveraging accurate federation on the status of KPIs (and related KPIs) from a single Incident or Node in the CMS. New opportunities for reporting and merging people, places, and things to current status will continue to add value over time. Ultimately, when coupled with PIE’s SACM integrations from Asset and Service Manager, total cost and efficiency of costs will be easier to observe and manage.
NNMi is a network discovery & monitoring tool. HP recommends its use over UD network discovery for tracking network devices and topology. It provides CIs and relationships (e.g. Layer 2) to UCMDB or RTSM via its topology sync, while also forwarding Events regarding health of those CIs.
In order to align Events to topology and CIs, NNMi uses an ID pushback mechanism to receive the local IDs of whichever UCMDB/RTSM it connects to. It then adds that ID to the Event details so that the Event and the CI will line up. This part mostly works, but only if you know how to tightly control IDs from one UCMDB to another (like we do).
However, the topology sync NNMi uses to create CIs in UCMDB/RTSM acts like a bulldozer to any existing data in the receiving system. Good attributes get overwritten by bad or null values. For example, a Discovered Model attribute on a Computer CI from UD might be “hp_proliant_bl460c_gen8”. But if that attribute isn’t discovered by NNMi, it will just overwrite the value with “<No SNMP>” (as shown below). SNMP is a protocol normally used with networking equipment, so a computer that doesn’t have SNMP enabled can’t be interrogated by NNMi for a value. While SNMP is not a common protocol for servers, it is a common protocol for switches, routers and other network appliances. So NNMi can get quite a bit of good info as far as networking equipment goes, but not as much from servers without some extra configuration.
For Push and Population integration jobs, UCMDB/RTSM requires that data run through its Data Flow Probe (DFP). However, the UCMDB/RTSM integration from NNMi is a light switch: it’s either on or off. It has no options regarding which attributes to include or exclude. The only option is which CITs you allow it to send (which are limited to Node and the subtypes Switch, Router, and Chassis). Since the “sync” is essentially just a single-direction data push using UCMDB’s API, the NNMi integration does not flow through the DFP. The result is that all of the controls in place in UCMDB/RTSM for things like attribute and CIT priority are ignored and normalization rules get bypassed entirely.
This is why HP has always recommended using one or the other for network discovery, but not both. In theory, both Universal Discovery and NNMi have useful contributions to make. Our analysis is that when properly merged, these two tools do indeed provide a complete picture which would not be available from either single source. This is another example of a best practice sacrificing value by deferring to a surface-level functional limitation. When the limitation itself is addressed, a superior method becomes available that derives more value from the best of both tools.
What We Did
The main reason we use a UCMDB integration tier in our multi-instance UCMDB architecture is to add data quality measures to control misaligned integrations in order to prevent garbage data, like that provided by the NNMi push. The integration UCMDB is where we introduce the “prod” data set to the NNMi data, and then analyze what merges, what creates duplicates, which attributes get overwritten, etc. Just like in any other tier of our multi-instance UCMDB architecture, any Nodes that are loosely qualified, that are not typed as a specific CI subtype like Windows, Switch, etc., never leave the integration UCMDB tier.
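The "never leave the integration tier" rule amounts to a type filter applied before promotion. A rough Python sketch follows, assuming each CI is a simple record with a `ci_type` field. The `LOOSELY_TYPED` set, the `promotable` function, and the sample CIs are all illustrative, not part of UCMDB.

```python
# CI types considered too generic to promote (assumed list for illustration).
LOOSELY_TYPED = {"node"}

def promotable(cis):
    """Return only CIs typed as a specific subtype (for example nt, switch, router)."""
    return [ci for ci in cis if ci["ci_type"] not in LOOSELY_TYPED]

integration_tier = [
    {"name": "srv-app-01", "ci_type": "nt"},
    {"name": "core-sw-01", "ci_type": "switch"},
    {"name": "10.1.2.3", "ci_type": "node"},  # loosely qualified, stays behind
]

to_production = promotable(integration_tier)
```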
Despite our best efforts we could not prevent NNMi from overwriting attributes in the integration UCMDB. However, we have used attribute and CI priorities in the production tier to prevent the bad CI attributes from blowing out good attributes there. We use smart scheduling of enrichments in the integration tier to merge CIs that should not be duplicates, and timed data migrations to control which data moves and in what order.
By using numeric values, we can set which integrations take priority by CI type and attribute. The numeric values range from negative one million to one million. The higher the number the higher the priority. So we gave the highest Node CIT priority to the Discovery Master-to-Prod Master integration. We then ranked the NNMi Integration Job second. Next, we listed the individual attributes we knew would be changed by the NNMi job.
We were then able to compare the data, and decide which integration was the “owner” of each attribute. All attributes owned by NNMi were given a numeric value of 200 and the other jobs were given a -10. These values were then reversed for all attributes that needed to be changed from null data in NNMi to the accurate data from Discovery. We ended up with a recon priority that looks like this:
Once these rules were in place, the last step was to organize the schedule for the integration jobs. The NNMi integration job was scheduled to run first, because it has the fewest attributes to override and the lower CIT override value. When this job runs, all net new data is merged, and the attributes are replaced with the data from NNMi. Immediately afterward, the Discovery Master Job runs, replacing all values not overwritten in the attribute override. Once completed, even if you were to run the jobs again, all attributes remain locked in place based on our priorities.
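The "locked in place" behavior can be modeled with a small Python sketch. It is a deliberate simplification of attribute-level priority, not UCMDB's reconciliation engine. The attribute names, the sample values, and the `apply_job` helper are hypothetical; the 200 and -10 values and the allowed range come from the description above.

```python
PRIORITY_MIN, PRIORITY_MAX = -1_000_000, 1_000_000

# (job, attribute) -> priority. The owner of an attribute gets 200, the other job -10.
ATTR_PRIORITY = {
    ("NNMi", "sys_name"): 200,
    ("Discovery", "sys_name"): -10,
    ("NNMi", "discovered_model"): -10,       # reversed: Discovery owns this one
    ("Discovery", "discovered_model"): 200,
}
assert all(PRIORITY_MIN <= p <= PRIORITY_MAX for p in ATTR_PRIORITY.values())

def apply_job(ci, job, values):
    # An incoming value wins only if its priority is at least that of the stored value.
    for attr, value in values.items():
        prio = ATTR_PRIORITY.get((job, attr), 0)
        current = ci.get(attr)
        if current is None or prio >= current[1]:
            ci[attr] = (value, prio)
    return ci

nnmi_data = {"sys_name": "srv-app-01.example", "discovered_model": "<No SNMP>"}
ud_data = {"sys_name": "srv-app-01", "discovered_model": "hp_proliant_bl460c_gen8"}

ci = {}
apply_job(ci, "NNMi", nnmi_data)       # scheduled first
apply_job(ci, "Discovery", ud_data)    # runs immediately afterward
first_pass = {attr: value for attr, (value, _) in ci.items()}

apply_job(ci, "NNMi", nnmi_data)       # re-running changes nothing
second_pass = {attr: value for attr, (value, _) in ci.items()}
```

Because each stored value carries the priority of the job that wrote it, a later run by the lower-priority job cannot replace it.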
The end result is that CIs, attributes, and relationships from UD and NNMi merge, enriched with the best, most complete information from both data sets, and a single global ID that enables Events to connect to topology in all of the HP tools. When coupled with the existing PIE CMS solutions and PIE for ITSM, a full and complete automated loop is established that requires no sorting, sifting, or human intervention. People, places, infrastructure, and service “things” are provided to monitoring in real time. Unique monitoring objects are linked to true and meaningful topology. Events are triggered accurately and discretely. Each can be automatically assigned and traced to its actual group and service catalog item in ITSM in real time. There are no fuzzy incidents, no intermediate human steps or cleanup, and no manual assignment required. Coverage of monitoring is tied directly to coverage of the catalog.
Merely calling this a successful “integration” project would be a gross understatement. Make no mistake: without meticulous intervention, these tools just won’t work together correctly, and you will be left with constant overhead and confusing noise. Your data will not be actionable, your users will have low confidence, and your automation will require more effort to maintain than it is worth. As frustration increases, value decreases. But careful work can yield incredibly valuable returns.
Want better HP monitoring integrations with real automation in service or incident management? Effectual can provide this outcome in your environment in a fraction of the time, cost and complexity of any other company or approach.