Reducing troubleshooting time with Azure Resource health
Today we are pleased to announce the preview of Azure Resource health, a new service that exposes the health of individual Azure resources and provides actionable guidance to troubleshoot problems. The goal of Resource health is to reduce the time customers spend on troubleshooting, in particular the time spent determining whether the root of a problem lies inside the application or is caused by an event inside the Azure platform.
Talk to any IT administrator about what keeps them up at night and there is a good chance that maintaining high availability and being able to troubleshoot and fix problems as soon as they arise will make the top five of their list. These requirements become even more critical in a cloud environment, where administrators cannot directly access the servers or the other elements of their infrastructure.
How is Resource health different from Service Health Dashboard?
We define a resource as an instance of a service created by a customer, for example a virtual machine, a web app or a SQL database.
Following this definition, we can quickly see the information provided by Resource health is more granular than what is provided by the Service Health Dashboard (SHD). While SHD communicates events that impact the availability of a service in a region, Resource health exposes platform events that impact a small number of customers, like a node unexpectedly rebooting.
For this release, we onboarded three of the most used Azure services to Resource health: IaaS Virtual machines (classic only), Web Apps and SQL databases.
Getting the health of all the resources in a subscription
The easiest way is to click on this link, but you can also browse to the Help + Support blade where you will notice a new tile called Resource health which will list the total number of resources in your subscription as well as the number of unhealthy resources. Clicking on the tile will land you in the Resource health blade.
This blade will show the health of all the resources in the subscriptions where the user is allowed to list resources. The health status of each resource will have one of the following values:
- Healthy: The service has not detected any problems in the platform that could be making the resource unavailable.
- Unhealthy: There is an ongoing problem in the platform that is impacting the availability of this resource, for example, the node where the VM was running rebooted unexpectedly.
- Stopped: The resource was stopped by the user.
- Unknown: The service has not received heartbeats for this resource for more than five minutes. Later in this post we will revisit the concept of unknown.
Clicking on each resource will open a new blade that provides additional information about the health of the resource as well as links to recommended actions based on the current health.
- If the resource is healthy, the blade will provide access to commonly used troubleshooting tools.
- If the resource is unhealthy, the recommendation will vary depending on how long the resource has been unavailable. For example, after a VM has been unavailable for 20 minutes the service may recommend recreating the VM, or, for a production VM, opening a support ticket.
As mentioned before, our goal is to help reduce the time customers spend on troubleshooting, first by helping identify the source of the problem and second by providing quick access to the appropriate troubleshooting tools.
Signal latency and other important things to keep in mind
From the start, we have worked really hard to notify customers as soon as possible if they are impacted by an event in the platform. As you visit the Resource health blade, you will notice a box at the top indicating the signals can be up to 15 minutes delayed. It is very important to keep this in mind, as it will help eliminate unnecessary time spent investigating possible issues.
Needless to say, we are very committed to reducing this latency as much as possible, and expect to see improvements as we continue to optimize the service.
Another important thing to keep in mind is how the availability of a resource is determined. In this preview release, the health status takes into consideration only the compute portion of the infrastructure and does not include the network. We are working to onboard networking into the service as soon as possible.
I would like to elaborate a little more on how the health of a SQL database is determined. During the design we had the option of exposing the availability of the SQL Server or the SQL database, and we decided to go with the database as it better reflects what is used by our customers.
While going this route provides a better experience for customers, it created very interesting challenges for the team as multiple components and services needed to be taken into consideration to determine the health of a database, which is very different from a virtual machine or website, where by looking at the host, we can determine if the resource is available or not. For this first release we are using a signal that relies on logins to the database, which means that for databases that receive regular logins (which includes, among other things, receiving query execution requests) the health status will be regularly displayed. If the database has not been accessed for a period of 10 minutes or more, it will be moved into the unknown state.
As you will see below, this does not mean that the database has become unavailable, just that no signal has been emitted because no logins have been performed. Connecting to the database and running a query will emit the signals we need to determine the health of the database.
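The login-based rule described above can be pictured with a small sketch. This is only an illustrative model, not the service's implementation: the function name `displayed_health`, its parameters and the state strings are assumptions made for the example, and only the 10-minute window comes from this post.

```python
from datetime import datetime, timedelta
from typing import Optional

# From the post: a database with no logins for 10 minutes or more shows as unknown.
LOGIN_SIGNAL_WINDOW = timedelta(minutes=10)


def displayed_health(last_login: Optional[datetime],
                     last_reported: str,
                     now: datetime) -> str:
    """Hypothetical model: keep the last reported state while logins are recent."""
    if last_login is None or now - last_login >= LOGIN_SIGNAL_WINDOW:
        # No recent login means no signal was emitted, not that the database is down.
        return "Unknown"
    return last_reported


now = datetime(2016, 1, 1, 12, 0)
print(displayed_health(now - timedelta(minutes=3), "Healthy", now))   # Healthy
print(displayed_health(now - timedelta(minutes=25), "Healthy", now))  # Unknown
```

In this model, a new login (for example, connecting and running a query) refreshes `last_login`, which is what moves the database out of the unknown state.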
When the health of a resource is set to unknown
If the service does not receive signals from a resource for a period of time, the health of the resource will be set to unknown. It is important to note that this is not a definitive indication that there is something wrong with the resource, and customers should follow these recommendations:
- If the resource is running as expected but its health is set to unknown in Resource health, there are no problems and you can expect the status of the resource to update to healthy in a short period of time.
- If you are experiencing problems accessing the resource and its health is set to unknown in Resource health, this could be an early indication that there is a problem and a deeper investigation should be done while the health is updated to either healthy or unhealthy.
As mentioned before, the health of a SQL database will be set to unknown if the database has not received any login requests during the last 10 minutes.
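The two recommendations for the unknown state can be summarized as a simple decision. The helper below is a hypothetical sketch (the name `triage_unknown` and the returned messages are assumptions for illustration); it only restates the guidance given in this section.

```python
def triage_unknown(reported_health: str, resource_works: bool) -> str:
    """Map the two 'unknown' cases from this post to a next step (illustrative)."""
    if reported_health.lower() != "unknown":
        # Other states are covered by the recommended actions in the portal.
        return "Follow the recommended actions shown for the resource."
    if resource_works:
        return "No action needed; expect the status to update to healthy shortly."
    return "Investigate further; this may be an early sign of a problem."


print(triage_unknown("Unknown", resource_works=True))
print(triage_unknown("Unknown", resource_works=False))
```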
Additional on-demand checks for virtual machines
One of the recommended actions available for virtual machines is to execute a real time check on the VM. This operation interrogates the Azure compute fabric and other internal Azure services in order to determine if the virtual machine is available and running as expected. After the checks are completed you will be notified of the result and will be provided with a list of recommended actions.
On-demand checks can be executed once every 15 minutes. If a request for a check is submitted before the 15 minutes have passed, the result of the last check will be returned.
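The 15-minute rule behaves like a time-based cache in front of the check. The sketch below shows that behavior under stated assumptions: the class `OnDemandCheck` and the `run_check` callable are hypothetical stand-ins, not part of any Azure SDK, and the real service applies this rule on its side.

```python
from datetime import datetime, timedelta
from typing import Callable, Optional

# From the post: an on-demand check can run once every 15 minutes.
CHECK_INTERVAL = timedelta(minutes=15)


class OnDemandCheck:
    """Runs a check at most once per interval; otherwise returns the last result."""

    def __init__(self, run_check: Callable[[], str]) -> None:
        self._run_check = run_check              # hypothetical stand-in for the real check
        self._last_run: Optional[datetime] = None
        self._last_result: str = ""

    def request(self, now: datetime) -> str:
        due = self._last_run is None or now - self._last_run >= CHECK_INTERVAL
        if due:
            self._last_result = self._run_check()
            self._last_run = now
        # A request made too early gets the result of the previous check.
        return self._last_result


check = OnDemandCheck(lambda: "VM is available")
start = datetime(2016, 1, 1, 12, 0)
print(check.request(start))                         # runs the check
print(check.request(start + timedelta(minutes=5)))  # returns the cached result
```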
This is a great tool to use when you are facing issues with a VM and Resource health shows the health state of the VM as healthy or unknown.
Accessing the service through an API
As part of the release today, we are also releasing an API that can be used to connect to Resource health. This API includes calls to query all resources in a subscription, all resources in a resource group or a specific resource. Here are some sample calls:
- Get health of all resources in a subscription: https://management.azure.com/subscriptions/<SubID>/providers/Microsoft.ResourceHealth/availabilityStatuses?api-version=2015-01-01
- Get health of all resources in a resource group: https://management.azure.com/subscriptions/<SubID>/resourceGroups/<ResourceGroupName>/providers/Microsoft.ResourceHealth/availabilityStatuses?api-version=2015-01-01
- Get the health of a single resource: https://management.azure.com/subscriptions/<SubID>/resourceGroups/<ResourceGroupName>/providers/<ResourceProvider>/<ResourceType>/<ResourceName>/providers/Microsoft.ResourceHealth/availabilityStatuses/current?api-version=2015-01-01
You can also easily test the API using tools like armclient.
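If you prefer to script these calls, the three URLs can be assembled from their parts. The following is a minimal sketch that only builds the request URLs shown above; the function names and the example values are assumptions for illustration. Sending the request and authenticating against Azure Resource Manager are deliberately left out.

```python
from urllib.parse import quote

BASE = "https://management.azure.com"
API_VERSION = "2015-01-01"  # version used in the sample calls in this post
HEALTH = "providers/Microsoft.ResourceHealth/availabilityStatuses"


def _seg(value: str) -> str:
    """Percent-encode a single path segment."""
    return quote(value, safe="")


def subscription_health_url(sub_id: str) -> str:
    return f"{BASE}/subscriptions/{_seg(sub_id)}/{HEALTH}?api-version={API_VERSION}"


def resource_group_health_url(sub_id: str, group: str) -> str:
    return (f"{BASE}/subscriptions/{_seg(sub_id)}/resourceGroups/{_seg(group)}"
            f"/{HEALTH}?api-version={API_VERSION}")


def resource_health_url(sub_id: str, group: str, provider: str,
                        resource_type: str, name: str) -> str:
    return (f"{BASE}/subscriptions/{_seg(sub_id)}/resourceGroups/{_seg(group)}"
            f"/providers/{_seg(provider)}/{_seg(resource_type)}/{_seg(name)}"
            f"/{HEALTH}/current?api-version={API_VERSION}")


# Example values below are placeholders, not real identifiers.
print(resource_health_url("<SubID>", "my-group", "Microsoft.ClassicCompute",
                          "virtualMachines", "my-vm"))
```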
We are very excited to reach this milestone, but we know it’s just the first step in a long journey. We are looking forward to your feedback, so feel free to leave your comments below.
Source: Microsoft Azure News