Troubleshooting a memory-bound VM
I had fun at work this week. I spent a lot of time chasing my tail, trying to figure out why one of our SQL servers was frequently memory-bound. Note that I have removed server names and other identifying material from screenshots.

One of the greatest frustrations in this new job is finding things my former colleagues either a) completely fucked up, or b) never noticed were a problem. A good example (which I wish I'd documented as well as my story below): a SQL server (this very same one, as it happens) averaging 99% CPU consumption, 24/7. One of my former colleagues (the one with DBA aspirations) decided that the problem was a corrupt SQL installation and intended to rebuild the machine from scratch. I thought this sounded like bollocks, and with a bit of googling, learning, reading BOL and testing, I discovered that one of the databases had a table that needed an index applied. CPU usage fell to approximately 3% on average. A rebuild of the machine would never have fixed that.
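For what it's worth, the check that turned up that missing index can be approximated with SQL Server's missing-index DMVs. A rough sketch, run from PowerShell with the SqlServer module (the instance name is a placeholder, and the DMV suggestions still need a DBA's judgement before anything gets created):

# Hypothetical sketch: list SQL Server's missing-index suggestions, highest estimated impact first.
# 'SQLSERVER01' is a placeholder; Invoke-Sqlcmd needs the SqlServer (or older SQLPS) module.
$query = @"
SELECT TOP 10
       mid.statement AS [table],
       mid.equality_columns,
       mid.inequality_columns,
       mid.included_columns,
       migs.user_seeks,
       migs.avg_user_impact
FROM sys.dm_db_missing_index_details AS mid
JOIN sys.dm_db_missing_index_groups AS mig ON mig.index_handle = mid.index_handle
JOIN sys.dm_db_missing_index_group_stats AS migs ON migs.group_handle = mig.index_group_handle
ORDER BY migs.avg_user_impact * migs.user_seeks DESC;
"@
Invoke-Sqlcmd -ServerInstance 'SQLSERVER01' -Query $query | Format-Table -AutoSize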
Anyway… on to today's tale.
Background: I have been asked to prepare one of our servers for a SQL version upgrade (from 2008 to 2008 R2). As part of this prep work, performance baselines were taken. The performance log files were parsed by an automated tool called PAL (http://pal.codeplex.com/). PAL found that this VM was severely memory-bound.
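For context, the sort of baseline PAL chews through can be captured with the Windows PowerShell counter cmdlets, something like the sketch below (the counter list is only an illustrative subset; a perfmon data collector set or logman does the same job):

# Sketch: capture a performance baseline to a .blg file that PAL can analyse.
# The counter list is an illustrative subset, not PAL's full threshold set.
$counters = '\Memory\Available MBytes',
            '\Processor(_Total)\% Processor Time',
            '\PhysicalDisk(_Total)\Avg. Disk sec/Read',
            '\SQLServer:Memory Manager\Total Server Memory (KB)'
# Roughly 24 hours of samples at a 15-second interval
Get-Counter -Counter $counters -SampleInterval 15 -MaxSamples 5760 |
    Export-Counter -Path 'C:\PerfLogs\baseline.blg' -FileFormat BLG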
This conclusion was supported by casual observation in Task Manager:
Manual analysis of the performance log files did not reveal the source of the RAM constraint. All counters suggested that the total memory used by processes was in the order of 2GB.
The VM has been configured with 8GB of RAM, so neither the automated nor the manual analysis made sense. How can a system using only 2GB of 8GB be memory-bound? The flat-line nature of the graphs bothered me. In my experience, flat lines are usually the result of an artificial constraint (e.g. bandwidth consumption limits on WAN traffic, imposed by network shaping).
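As a rough cross-check of that ~2GB figure, the working sets of every process can be summed directly; the sketch below deliberately counts process memory only, which is exactly the blind spot that matters later:

# Quick sanity check: sum the working sets of all running processes, in GB.
# This only counts process memory; kernel, pool and driver-locked memory are not included.
$sumBytes = (Get-Process | Measure-Object -Property WorkingSet64 -Sum).Sum
'{0:N2} GB in process working sets' -f ($sumBytes / 1GB)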
Since this server is a SQL box, I investigated SQL's memory usage (counter: SQLServer:Memory Manager\Target Server Memory (KB)):
SQL appeared to be consuming up to approximately 1.8GB of RAM.
By default, SQL Server's "max server memory" setting allows an instance to consume up to 2PB of RAM. I checked to see if SQL had been "held back" by a non-standard configuration:
Someone had imposed a limit of 6.7GB of RAM usage on this SQL Server instance. Why that particular number was chosen is unknown to me, but in the context of the current problem it did not appear to be a contributor: if the cap were the bottleneck, we would have seen SQL consuming far more than 2GB of RAM. In short, neither automated nor manual analysis of the performance logs showed what was consuming this system's memory.
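For reference, both the configured cap and SQL's actual usage can be pulled without touching the GUI. A sketch, using a placeholder instance name and assuming the SqlServer module is installed:

# Sketch: show the configured 'max server memory' cap alongside what SQL is actually using.
# 'SQLSERVER01' is a placeholder instance name.
Invoke-Sqlcmd -ServerInstance 'SQLSERVER01' -Query @"
SELECT name, value_in_use
FROM sys.configurations
WHERE name = 'max server memory (MB)';
"@

# Target vs. Total Server Memory (default instance; named instances use MSSQL$<name> counter paths)
Get-Counter '\SQLServer:Memory Manager\Target Server Memory (KB)',
            '\SQLServer:Memory Manager\Total Server Memory (KB)'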
I ran a Sysinternals tool called RAMMap:
RAMMap showed that drivers were consuming 5.5GB of RAM! On a VM, this is very unusual. VMware Tools loads some drivers that are designed to help the hypervisor shuffle resources between VMs. One of these is the balloon driver, which the host uses to create an artificial RAM constraint inside the guest when it needs to reclaim memory. The balloon driver allocates and pins memory in the guest, forcing the OS and applications to release whatever they can spare; the OS treats the pinned pages as in use, so other applications cannot claim them. The underlying physical memory is then released to the host to allocate to VMs that need it more.
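If you suspect ballooning from inside the guest, VMware Tools also publishes it as a Windows performance counter. A quick sketch (the exact object and counter names may differ between Tools versions, so treat this as a starting point):

# VMware Tools exposes balloon activity inside the guest as a perf counter.
# Object/counter names can vary a little by Tools version; this is the name I'd expect to see.
Get-Counter '\VM Memory\Memory Ballooned in MB'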
At this point, I felt that the issue was at the VMware layer.
The Resource Allocation view provided a clue:
VMware’s performance overview showed another flat-line graph:
Using the advanced graph to show balloon usage for this VM confirmed my suspicion:
Another flat line. The balloon driver was the culprit. But it didn't make sense: the host was not memory-constrained, and other VMs were not hitting their memory limits.
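The same figure can be pulled from vCenter with PowerCLI instead of clicking through the advanced charts. A sketch with a placeholder VM name:

# Sketch: pull recent balloon-driver usage for one VM straight from vCenter.
# 'SQLVM01' is a placeholder name; mem.vmmemctl.average is the ballooned-memory stat (KB).
Get-VM -Name 'SQLVM01' |
    Get-Stat -Stat 'mem.vmmemctl.average' -Realtime -MaxSamples 12 |
    Select-Object Timestamp, Value, Unit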
I investigated the VM’s configuration and found this:
Despite being configured with 8GB of memory, the VM had been artificially constrained to use only 2GB. To enforce this, the balloon driver kicked in and consumed the difference. There is really no good reason for this; my suspicion is that the VM was created from a template that had this constraint applied to it, and that whoever provisioned it did not remove the limit. It had been running on effectively 2GB of RAM ever since.
I checked the "Unlimited" checkbox (no outage required). This removes the hard limit and tells the balloon driver to release RAM back to the OS. The driver immediately released some RAM, but it wasn't the 5.5GB I was hoping for:
The process of releasing RAM from the balloon is a slow one; googling suggests it can take days to return all of the memory to the guest. A restart of the VM might accelerate this, so we planned an outage to restart it, but as it turned out, at 0130 the next morning the balloon was released entirely:
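For the record, the same fix can be scripted with PowerCLI rather than done through the GUI; a sketch with a placeholder VM name:

# Sketch: remove the memory limit on one VM (the scripted equivalent of ticking "Unlimited").
# 'SQLVM01' is a placeholder; as I understand it, setting MemLimitMB to $null clears the hard limit.
Get-VM -Name 'SQLVM01' |
    Get-VMResourceConfiguration |
    Set-VMResourceConfiguration -MemLimitMB $null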
Next steps
If one VM has been misconfigured, it stands to reason there might be others. I used a PowerCLI command to identify VMs that have memory limitations imposed on them:
Get-VM | Get-VMResourceConfiguration | where {$_.MemLimitMB -ne '-1'} | foreach {$_.VM.Name + " " + $_.VM.MemoryMB + " " + $_.MemLimitMB}
Server1 1024 1024
Server2 4096 4096
Server3 4000 4000
Server4 1024 1024
Server5 2048 2048
Server6 4096 2000
Server7 2048 2000
Server8 2048 2048
Server9 2048 2048
Server10 2048 2048
Server11 1024 1024
Server12 2048 2048
Server13 2048 2048
Server14 2048 2048
Server15 4096 4096
Server16 2000 2000
Server17 2048 2048
Why anyone would configure a VM with a given amount of RAM and then set a hard limit at that same amount is beyond me. I also do not understand the logic of configuring RAM or RAM limits using values that do not fall on standard boundaries (e.g. 2000 instead of 2048).
We can see here that Server6 has been configured with 4GB of RAM, yet is only allowed to consume 2GB of that 4GB. I would not be surprised if this VM is also experiencing low-memory conditions.
Information on memory limits can also be exposed via the vSphere GUI (but this does not show the VM’s configured memory):
Performance counters for Server6 should be recorded and analysed. If this VM is memory-bound, then the limit should be removed. If it is not memory-bound, then it does not make sense to present it with 4GB of RAM when it only needs 2GB.
Broader next steps would be to review VM configurations across the infrastructure. We should also consider taking baseline performance counters to assess whether current workload requirements are being met.
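Building on the one-liner above, a slightly fuller report pairing each VM's configured memory with its limit and its current balloon usage would make the worst offenders stand out. A sketch (BalloonedMemory comes from the VM's QuickStats; worth verifying the property name against your PowerCLI version):

# Sketch: report configured memory, memory limit and current balloon usage for every VM,
# so the worst offenders stand out. A limit of -1 means "unlimited".
Get-VM | ForEach-Object {
    [pscustomobject]@{
        Name         = $_.Name
        ConfiguredMB = $_.MemoryMB
        LimitMB      = ($_ | Get-VMResourceConfiguration).MemLimitMB
        BalloonedMB  = $_.ExtensionData.Summary.QuickStats.BalloonedMemory
    }
} | Where-Object { $_.LimitMB -ne -1 } |
    Sort-Object BalloonedMB -Descending |
    Format-Table -AutoSize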
All in all, this was a great learning exercise for me, and I feel that I a) accomplished something useful, and b) identified misconfigurations on other VMs before they became problematic. But I'm also very disappointed that the people trusted with managing this infrastructure in the past really didn't do as good a job as they could've.