Windows Azure VM (IaaS) unexpected reboots

I have several Windows Azure virtual machines (IaaS) that host a website. A number of load-balanced front-end VMs all connect to the same VM running SQL Express. It works well.

But!

I get random reboots on all the virtual machines. For the front-end VMs (with IIS), since they are load balanced, the site stays up and the load balancer adjusts accordingly. But when the VM hosting the database reboots, the site goes down until the database is back. It takes <3 min to come back, but that is still unacceptable if it happens often enough. Although reboots are relatively rare (about 2 a month per VM), sometimes we get a week with 4 restarts on one VM, which is frustrating. Not all VMs restart that often, and I cannot work out the pattern. The restarts are also unexpected (a reboot, not a shutdown). The datacenter is West Europe.

Microsoft emphasizes that the SLA only covers 2 VMs in an availability set, which I cannot have for the database VM (and the Enterprise edition of SQL Server costs an arm and three legs). In addition, SQL Azure is not an option because the application is very chatty and the SQL Azure database chokes under peak load (although it runs smoothly with SQL Express on a Medium VM!).

My question(s): Are this many restarts normal? Do other people have the same problem? What is your experience with such an environment on Azure? What can I do to minimize this downtime?

Thanks everyone!

+6
2 answers

Are this many restarts normal?

Yes, it can happen in a given month. You need to run SQL Server in high availability mode to really get around it.

Yes, it costs an arm and a leg. ;(

What is your experience with such an environment on Azure?

Some months are really good, some months are bad, depending on your cluster and the data center you are in. MS has a mixed range of hardware across its data centers. That does not mean some data centers run on old laptops, but in my experience the newer data centers have better kit in them and therefore restart less. I use US East.

What can I do to minimize this downtime?

High availability with a witness is the only way to get real availability on a virtual machine, and yes, it costs an arm and a leg.

Other serious options:

Cache, cache, cache. You should use the machine cache and Azure cache, and try to minimize your calls to the database. This can reduce how chatty your application is and might let you go back to SQL Azure, or at least give you enough room to ride out an outage.
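To make the caching idea concrete, here is a minimal sketch of a cache-aside wrapper that keeps serving the last known value while the database VM is rebooting. It is only an illustration under assumptions: `StaleTolerantCache` and its `loader` argument are hypothetical names, `loader` stands in for whatever function actually queries the database, and it is assumed to raise `ConnectionError` when the database is unreachable.

```python
import time


class StaleTolerantCache:
    """Cache-aside wrapper that can serve stale entries when the loader fails."""

    def __init__(self, loader, ttl_seconds=30.0, clock=time.monotonic):
        self._loader = loader      # hypothetical function that queries the database
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}         # key -> (value, stored_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]        # fresh hit: no database call at all
        try:
            value = self._loader(key)
        except ConnectionError:
            if entry is not None:
                return entry[0]    # database unreachable: serve the stale copy
            raise                  # nothing cached, so the failure surfaces
        self._entries[key] = (value, now)
        return value
```

A fresh hit avoids the database call entirely, which addresses the chattiness, and the stale fallback covers reads during a short outage. Writes are not covered by this sketch.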

Queues. Queues will make your application more resilient and let you show the user a "we are working on it" message.
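As a rough sketch of the queue idea: writes that fail because the database is down are held in order and replayed once it is back. The names here (`BufferedWriter`, `write`) are hypothetical, `write` is assumed to raise `ConnectionError` during an outage, and the in-process `deque` is only a stand-in. A durable queue service would be needed for the pending items to survive a reboot of the front-end VM itself.

```python
from collections import deque


class BufferedWriter:
    """Holds writes in order while the database is down and replays them later."""

    def __init__(self, write):
        self._write = write        # hypothetical function that performs one database write
        self._pending = deque()

    def submit(self, item):
        """Queue the item, then try to drain the queue. Returns True if drained."""
        self._pending.append(item)
        return self.flush()

    def flush(self):
        """Replay queued items oldest first; stop at the first connection failure."""
        while self._pending:
            try:
                self._write(self._pending[0])
            except ConnectionError:
                return False       # still down: show a "we are working on it" message
            self._pending.popleft()
        return True

    def pending_count(self):
        return len(self._pending)
```

When `submit` returns `False`, the request has been accepted but not yet stored, which is the point where the application can tell the user it is working on it.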

Use SQL Azure as a failover. Sync the data to SQL Azure using SQL Azure Data Sync from on-premises (not sure if this works with Express), and add application code that catches the connection error and switches to the other resource.
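The "catch the connection error and switch" part could look roughly like this. It is a sketch only: `run_with_failover`, `primary`, and `secondary` are hypothetical names for callables that run a query against the SQL Express VM and the synced SQL Azure copy, and both are assumed to raise `ConnectionError` when their database cannot be reached.

```python
def run_with_failover(query, primary, secondary):
    """Run the query on the primary; on a connection error, use the secondary.

    Returns a (source, result) pair so the caller knows which database answered.
    """
    try:
        return ("primary", primary(query))
    except ConnectionError:
        # Primary unreachable (for example, the VM is rebooting): fall back.
        return ("secondary", secondary(query))
```

This suits reads best. Anything written to the secondary during the outage would still have to be synced back to the primary, which the sketch does not attempt.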

Look at using other parts of Azure for parts of your application to reduce the number of calls going into SQL, e.g. can you move stuff to table storage?

HTH, and that this gives you some ideas.

+3

Windows Azure Infrastructure Services (IaaS) has only been in General Availability (GA, i.e. in production) for about 3 weeks, since April 16 (see announcement here). There was no SLA before GA, and you would have seen more frequent reboots as various fixes were still being applied to the host OS. Are you saying this pattern has continued at the same rate since April 16?

Now that IaaS is GA, I would not expect 4 restarts per week. However, there are several reasons why you might see a restart:

  • Host hardware failure (this takes down all guest OSes running on that host).
  • Host software update (and only if it requires restarting the host OS). A host OS reboot should not happen at the frequency you are seeing.
  • Problems with the guest OS. This is where things diverge from PaaS (Cloud Services web/worker roles). IaaS has no Azure Guest OS servicing; it is all in your hands. If you install Windows updates automatically, a reboot is possible. You might also hit an application-level problem that makes the box stop responding for a long time, so that the Azure fabric controller reboots it because it considers it unhealthy. And ... your application may somehow be crashing the box.
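One way to start telling these causes apart is to look at what the guest's Windows System event log recorded around each reboot. The sketch below is a rough triage only: `classify_reboot` is a hypothetical helper, it assumes you have already collected the event IDs logged around one reboot, and the mapping from IDs to causes is a heuristic, not a diagnosis. The three event IDs themselves are standard Windows ones.

```python
# Event IDs from the Windows System event log.
PLANNED_RESTART = 1074      # User32: a process or user initiated the restart
UNEXPECTED_SHUTDOWN = 6008  # EventLog: the previous system shutdown was unexpected
KERNEL_POWER = 41           # Kernel-Power: rebooted without cleanly shutting down


def classify_reboot(event_ids):
    """Heuristic triage of one reboot from the event IDs logged around it."""
    ids = set(event_ids)
    if PLANNED_RESTART in ids:
        # The 1074 message names the process that asked for the restart.
        return "initiated inside the guest (update, application, or user)"
    if ids & {UNEXPECTED_SHUTDOWN, KERNEL_POWER}:
        return "unclean stop (host failure, forced restart, or guest crash)"
    return "unknown"
```

If most reboots fall in the first group, the cause is likely in your own hands (for example automatic updates); if they fall in the second, that is useful evidence to attach to a support ticket.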

If you have ruled out an application error and are sure the virtual machines are healthy at the time they reboot, you may need to open a support ticket with Microsoft to help diagnose the problem.

+1

Source: https://habr.com/ru/post/944612/

