VMware Storage I/O Control (SIOC) – A Blessing and a Curse

I am taking this content straight from an email I just sent a customer, so the content isn’t well polished. But the email took me long enough to write that I decided to post it here for others.

Storage I/O Control (SIOC) is a mechanism that prevents one VM from hogging all the I/O resources and making the other VMs wait for their I/O requests to complete. By default, it gives every VM on a datastore equal I/O shares. It gauges fairness based on latency. So if you have two VMs (VM1 and VM2) and VM1’s latency hits a specified threshold (30ms is the default), it will actually SLOW VM2’s I/O access and give the scheduler resources back to VM1 until sharing is fair again. This is different from QoS, but I’m sure you see some similarities.
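
To make the threshold idea concrete, here is a minimal Python sketch. It is a toy model, not VMware's actual algorithm: the function name `divide_queue_slots`, the notion of a fixed pool of queue slots, and the integer rounding are all my own assumptions for illustration. The only number taken from above is the 30ms default threshold.

```python
CONGESTION_THRESHOLD_MS = 30.0  # the default threshold mentioned above


def divide_queue_slots(shares, total_slots, latency_ms,
                       threshold_ms=CONGESTION_THRESHOLD_MS):
    """Toy model of latency-triggered fair sharing (hypothetical helper).

    shares:      {vm_name: share_value}, e.g. equal shares by default
    total_slots: size of the I/O queue the VMs compete for (assumed)
    latency_ms:  latency currently observed on the datastore
    """
    # Below the threshold nothing is throttled: any VM may use the whole queue.
    if latency_ms < threshold_ms:
        return {vm: total_slots for vm in shares}

    # At or above the threshold, the queue is split in proportion to shares.
    total_shares = sum(shares.values())
    return {vm: max(1, total_slots * share // total_shares)
            for vm, share in shares.items()}


equal = {"VM1": 1000, "VM2": 1000}
calm = divide_queue_slots(equal, total_slots=32, latency_ms=10.0)
congested = divide_queue_slots(equal, total_slots=32, latency_ms=35.0)
```

With equal shares, both VMs can use all 32 slots while latency is low, and each is held to 16 once latency crosses the threshold. Giving one VM more shares only changes the split, not the trigger.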

So that sounds great, right? (Really, it is great.) But it is not always effective and can be detrimental in certain circumstances. I’ll try to explain.

First, let me preface this by explaining two concepts, which you may already be aware of.

  1. Hypervisors work via scheduled processes. Every VM waits its turn in the scheduler to receive the CPU cycle or memory page it requested.
  2. Every Volume you create and map to a host is given a LUN ID (this volume is the LUN) and each LUN has access to schedulers. All the VMs in this Volume/LUN take their turn for I/O requests. This is why best practice dictates you put a maximum of 10-15 VMs per Volume, or far fewer if those are resource-intensive VMs. The more VMs in the LUN, the longer each VM has to wait for its I/O requests. (Note: setting resource shares doesn’t solve this; it just guarantees one VM will have priority over another.)
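
A rough way to see why VM count per LUN matters is the back-of-the-envelope sketch below. It assumes strict round-robin, one outstanding request per VM, and a fixed service time per request. None of that is how a real array or ESXi scheduler behaves exactly, and the helper name `worst_case_wait_ms` and the 5ms service time are made up for illustration.

```python
def worst_case_wait_ms(vm_count, service_ms):
    """Longest time one VM waits for its turn under strict round-robin.

    Assumes every other VM on the LUN has exactly one request ahead of it,
    and each request takes service_ms to complete (both are simplifications).
    """
    return (vm_count - 1) * service_ms


# Wait grows linearly with the number of VMs sharing the LUN.
for count in (5, 10, 15, 30):
    print(f"{count:>2} VMs -> up to {worst_case_wait_ms(count, 5.0):.0f} ms waiting")
```

Shares would reorder who goes first in this picture, but the total amount of waiting on the LUN stays the same.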

There are certain scenarios where SIOC can possibly make things worse. The scenario you might be running into is the following:
You have a SAN capable of tiered storage, which is really amazing when you think about how that all works. What’s even more incredible is that you are able to have different RAID types striped across the same physical disks. (Hot data lives on 15k drives in a RAID 10 stripe, and as it becomes warm, it moves into a RAID 5 stripe across those same physical 15k drives.)

Let’s take our VM1 and VM2, which both reside on the same LUN. We have enabled SIOC on that LUN. VM1 is a high-resource VM that is crucial to your business, and VM2 is just a Test/Dev server. Most of VM1’s blocks reside on your 15k disks in RAID 10, and a few of its less hot blocks have moved to RAID 5, still on those 15k drives. Again, data on VM1 is almost always hot.
VM2, on the other hand, has some of its blocks on the 15k drives, and some reside on the slower 7k drives since that data is hardly ever accessed.

One day you log into VM2 and fire up an application whose data is on those 7k drives. That data takes longer to retrieve, naturally, since it’s sitting on the slowest media, and the time it takes to queue up and process that I/O request (latency) is much greater than the time it’s taking VM1 to process its requests.

What happens is that SIOC’s mechanism kicks in, because the latency on retrieving data for VM2 is impeding its “fair access” functionality. So it throttles down the I/O of VM1 (your production server) to try to decrease the latency VM2 is seeing. You have essentially killed the performance of the VM that needs it the most. Now imagine this happening across all your VMs, VMDKs, bits, blocks, whatever you want to include: it becomes a traffic nightmare. SIOC can throttle a VM down so much, waiting for the latency to decrease on the other VMs, that everything times out, whereas if you weren’t using SIOC, things would be humming along as usual and VM2 would just take its sweet time processing data from the slow drives.
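
The scenario above can be sketched with a few lines of Python. Again, this is a toy model under stated assumptions, not SIOC's real behavior: I assume throughput follows Little's law (IOPS = outstanding requests / latency), that congestion is judged on the average latency across the VMs on the LUN, and that each VM's latency is fixed by the tier its data sits on. The queue depths (32 and 4) and the latencies (5ms and 60ms) are invented numbers.

```python
from statistics import mean

THRESHOLD_MS = 30.0   # default congestion threshold mentioned earlier
FULL_DEPTH = 32       # hypothetical unthrottled queue depth per VM
THROTTLED_DEPTH = 4   # hypothetical per-VM depth once throttling kicks in


def iops(queue_depth, latency_ms):
    """Little's law: throughput = outstanding requests / time per request."""
    return queue_depth / (latency_ms / 1000.0)


def simulate(latencies_ms, sioc_enabled):
    """Return {vm: IOPS}; throttle every VM if average latency is too high."""
    congested = sioc_enabled and mean(latencies_ms.values()) >= THRESHOLD_MS
    depth = THROTTLED_DEPTH if congested else FULL_DEPTH
    return {vm: iops(depth, lat) for vm, lat in latencies_ms.items()}


# VM1 reads from 15k drives, VM2 is pulling cold data from the slow tier.
latencies = {"VM1": 5.0, "VM2": 60.0}
without_sioc = simulate(latencies, sioc_enabled=False)
with_sioc = simulate(latencies, sioc_enabled=True)

for vm in latencies:
    print(f"{vm}: {without_sioc[vm]:.0f} IOPS without SIOC, "
          f"{with_sioc[vm]:.0f} IOPS with SIOC")
```

In this model VM1 drops from about 6400 to about 800 IOPS even though its own latency never changed. The slow tier behind VM2 pushed the average over the threshold, and the throttle landed on everyone.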

I am sure you were aware of most of these concepts, and what I have described is somewhat oversimplified, but hopefully it makes sense. Sharing workloads across the same physical drives can make SIOC a nightmare. If you are careful about which workloads you place in which LUN, then SIOC can be great, even on tiered storage. If you take an old EMC or NetApp where you used to carve out specific disks for specific volumes, SIOC would also be great.

Dell Compellent’s best practice is to use this feature with caution, as others have also stated.