Performance, Scalability and Architecture

Andreas Grabner





Performance Monitor All Your Apps By @Dynatrace | @DevOpsSummit [#DevOps]

A solution for monitoring large IT infrastructures that contain several hundred components

How to Performance Monitor All Your Applications on a Single Dashboard

Thanks to advances in application performance management tools, it has become easy to monitor applications that are deployed on hundreds of servers. But the more data you collect, the harder it is to visualize health in a way that lets a single dashboard show both the overall status and the problematic component.

Eugene Turetsky (Dynatrace) and Stephan Levesque (SSQ Financial Group) shared their solution for monitoring large IT infrastructures that contain several hundred components. These components support SSQ's most critical applications, which run on a variety of technology stacks including WebLogic, Oracle Databases, Ingres Databases, and WebSphere MQs. When Stephan showed me his SSQ dashboards, I knew I had to write a blog post about this.

Stephan agreed to share these details with a larger audience - eventually uploading the plugins that Eugene Turetsky designed, developed and built for this to our Dynatrace GitHub Organization.

Now check this out. All Dynatrace dashboards are designed for a wide audience - from senior management to the support engineering teams responsible for maintaining the health of specific components. For example, the following screenshot shows one of SSQ's dashboards: applications are arranged vertically, with cluster, server and component health arranged horizontally. The names of the apps and servers are sanitized for privacy reasons:

Each dot represents the health status of a component, aggregated to a cluster or an individual server and aggregated on an application level. If an app goes red or yellow, it's easy to spot which component is causing it.

Stephan and his colleagues read this dashboard from top left to bottom right: the big red dot in the top left means that at least one of the applications is unhealthy. Spotting which apps are unhealthy is easy - just look for red. On those application rows it's easy to find the red dots that tell them which component (Web Server, App Server, Message Queue, etc.) to focus their root cause analysis on.

Let's look a little deeper into how he calculates the health status of each individual component and how he aggregates the data so that you can rebuild this for your own environment in case you find this useful:

Health Status of Components
A component can be an application server, a database, a message queue or a device such as a load balancer. Stephan uses Dynatrace to monitor each component and has one or more metrics for each component that tell him whether it's healthy or not. Here are some examples:

  • Application: Application status is red if one or more clusters or individual unclustered components are red. Application status is yellow (degraded) if some of an application's clustered components (i.e., nodes) are down but surviving nodes in the cluster can manage the application load. Otherwise the application status is green.
  • WebLogic: If all clustered WebLogic components are down (i.e., cluster is down) then the status of WebLogic is red. If some nodes in the cluster are down but surviving nodes can manage the application load, the status of WebLogic is yellow. Otherwise the status of WebLogic is green.
  • Database: If all clustered database components are down (i.e., cluster is down) then the status of the database is red. If some nodes in the cluster are down but surviving nodes can manage the application load, the status of the database is yellow. Otherwise the status of the database is green.
  • MQ: If all clustered MQ components are down (i.e., cluster is down) then the status of MQ is red. If some nodes in the cluster are down but surviving nodes can manage the application load, the status of MQ is yellow. Otherwise the status of MQ is green.
  • Dynatrace agents: The state, or availability, of the Dynatrace agents is also monitored. If a critical agent is unavailable, an alert will be triggered and a red dot will be shown.
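The cluster rules above for WebLogic, databases and MQ all follow the same pattern, which can be sketched in a few lines of Python. This is only an illustration, not the code of the Dynatrace plugins: the function name `cluster_status` and its parameters are made up for this example, and the case where surviving nodes cannot carry the load is not described above, so treating it as red is an assumption.

```python
def cluster_status(nodes_up, survivors_can_carry_load=True):
    """Return "green", "yellow" or "red" for one cluster.

    nodes_up: list of booleans, one per node (True = node is up).
    survivors_can_carry_load: whether the remaining nodes can handle
    the application load (hypothetical input; how this is measured
    is not covered here).
    """
    if not nodes_up:
        raise ValueError("cluster has no nodes")
    if not any(nodes_up):
        return "red"      # all nodes down: the cluster is down
    if all(nodes_up):
        return "green"    # every node is up
    # Some nodes are down. Yellow if the survivors cope with the load;
    # red otherwise (assumption, this case is not spelled out above).
    return "yellow" if survivors_can_carry_load else "red"


if __name__ == "__main__":
    print(cluster_status([True, True, True]))    # green
    print(cluster_status([True, False, True]))   # yellow
    print(cluster_status([False, False]))        # red
```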

Whether you use Dynatrace or other APM tools - make sure you capture both system metrics, such as Availability, CPU, and Memory, and performance-relevant metrics such as Response Time, and combine these metrics into your health states.
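One way to combine system and performance metrics into a single health state is a simple threshold check, sketched below in Python. The class `ComponentSample`, the function `health_state` and all threshold values are hypothetical and chosen only for illustration; your own limits will depend on the component and on what your APM tool reports.

```python
from dataclasses import dataclass


@dataclass
class ComponentSample:
    """One measurement of a component (field names are illustrative)."""
    available: bool
    cpu_pct: float
    memory_pct: float
    response_time_ms: float


# Hypothetical thresholds, not taken from SSQ's setup.
WARN = {"cpu_pct": 80.0, "memory_pct": 85.0, "response_time_ms": 1000.0}
CRIT = {"cpu_pct": 95.0, "memory_pct": 95.0, "response_time_ms": 3000.0}


def health_state(sample):
    """Map a sample to "green", "yellow" or "red"."""
    if not sample.available:
        return "red"      # an unavailable component is always red
    if any(getattr(sample, name) >= limit for name, limit in CRIT.items()):
        return "red"      # at least one metric is past its critical limit
    if any(getattr(sample, name) >= limit for name, limit in WARN.items()):
        return "yellow"   # at least one metric is past its warning limit
    return "green"


if __name__ == "__main__":
    print(health_state(ComponentSample(True, 85.0, 40.0, 250.0)))  # yellow
```

Checking availability first means a component that is down is never reported as merely degraded, even if its last CPU or memory reading looked fine.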

Aggregating Performance Data from Component to Server to Application
Besides monitoring the health of each component individually, the dashboard also aggregates data "upwards." Stephan calculates an overall health state per component type, e.g., overall WebLogic health in the cluster is calculated based on the states of each individual WebLogic instance. The overall Application Health is then calculated from the application's availability as well as the aggregated state of all supporting components. The final overall system health shows whether any application is currently suffering an issue. The following screenshot shows how this works in a simple example.

Health States get aggregated to Health Groups which eventually end up being aggregated to the Application and the Overall System Status
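If you want to rebuild this roll-up yourself, the aggregation from instance to health group to application to overall status might look like the following Python sketch. It is an assumption-laden illustration rather than the logic of the actual plugins: the function names, the dictionary layout and the sample app names are invented, and the sketch takes the "worst state wins" rule for the application and overall levels.

```python
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def worst(states):
    """Return the most severe state; an empty input counts as green."""
    return max(states, key=SEVERITY.__getitem__, default="green")


def group_health(instance_states):
    """Health of one component type (e.g. all WebLogic instances)."""
    if instance_states and all(s == "red" for s in instance_states):
        return "red"      # every instance is down: the group is down
    if any(s != "green" for s in instance_states):
        return "yellow"   # degraded, but some instances survive
    return "green"


def application_health(available, groups):
    """Combine application availability with its health groups."""
    states = [group_health(instances) for instances in groups.values()]
    states.append("green" if available else "red")
    return worst(states)


def overall_health(apps):
    """Overall system status: red if any application is red, and so on."""
    return worst(
        application_health(app["available"], app["groups"])
        for app in apps.values()
    )


if __name__ == "__main__":
    # Hypothetical sample data.
    apps = {
        "app-a": {"available": True,
                  "groups": {"weblogic": ["green", "red"],
                             "database": ["green"]}},
        "app-b": {"available": True,
                  "groups": {"mq": ["red", "red"]}},
    }
    for name, app in apps.items():
        print(name, application_health(app["available"], app["groups"]))
    print("overall", overall_health(apps))
```

With this sample data, app-a comes out yellow (one of two WebLogic instances is down), app-b comes out red (the whole MQ group is down), and the overall status is red.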


More Stories By Andreas Grabner

Andreas Grabner has been helping companies improve their application performance for 15+ years. He is a regular contributor within Web Performance and DevOps communities and a prolific speaker at user groups and conferences around the world. Reach him at @grabnerandi
