Pooling is probably the oldest concept applied to virtualization, and the logic behind it is familiar: everyone knows that buying in bulk yields better prices. In the retail space we’ve seen companies like Costco and Amazon emerge, and their prices are great because they pool their purchasing to get a lower cost. You, in turn, buy from that pooled resource and get a better price. It seems like a very simple concept, but it is part of the fundamental magic of software definition.
With VMware we could pool the resources of multiple servers without resorting to complex hardware clustering configurations. Compute became one bucket instead of lots of small buckets. It became easier to buy, scale, modify, and manage. Sure, a VMware host was more expensive due to core counts and memory, but the 95% of a server that had historically been underutilized could be repurposed from ‘boat anchor’ to a beast of pooled performance. Pooling these underutilized resources didn’t just make virtualization cost-comparable; it made it far cheaper. If a VM needed even more performance, you could add virtual RAM, storage, or CPU. Easy. This was because the resources were pooled.
Also, infrastructure planning became far easier, since we were now sizing for a pool rather than every individual workload. Previously a request for 10 additional hosts was an exercise in sizing, procurement, waiting, provisioning, etc. With server virtualization there was no more measuring trends in capacity and performance per workload. We had just one thing to look at – pool size and usage. A group that needed 10 new hosts could have them in a few minutes, not a few weeks. That’s the power of the pool!
Pooling also gave us mind-blowing performance and consistency. When VMware released Distributed Resource Scheduler, VMs on busy hosts could automatically be rebalanced onto less busy nodes in the cluster. Performance and consistency were separated from physical hardware and became a function of pooling.
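To make the balancing idea concrete, here is a toy sketch of a greedy rebalancer. It makes no claim to match DRS's real placement logic; the host names, VM names, load numbers, and the `tolerance` parameter are all made up for illustration. It simply keeps moving one VM from the busiest host to the least busy host until the gap between them is within the tolerance.

```python
from typing import Dict, List, Tuple

def rebalance(hosts: Dict[str, Dict[str, float]],
              tolerance: float = 10.0) -> List[Tuple[str, str, str]]:
    """Greedy toy balancer (hypothetical, not VMware's algorithm).

    hosts maps host name -> {vm name: load}. The dict is modified in place.
    Returns the list of moves as (vm, from_host, to_host).
    """
    moves = []
    while True:
        load = {h: sum(vms.values()) for h, vms in hosts.items()}
        busiest = max(load, key=load.get)
        idlest = min(load, key=load.get)
        gap = load[busiest] - load[idlest]
        if gap <= tolerance:
            break
        # Only VMs smaller than the gap can narrow it; larger ones would
        # just flip which host is overloaded.
        candidates = {vm: l for vm, l in hosts[busiest].items() if 0 < l < gap}
        if not candidates:
            break
        # Prefer the VM whose load is closest to half the gap.
        vm = min(candidates, key=lambda v: abs(candidates[v] - gap / 2))
        hosts[idlest][vm] = hosts[busiest].pop(vm)
        moves.append((vm, busiest, idlest))
    return moves

# Made-up cluster: one hot host, one lightly used host, one idle host.
hosts = {
    "esx1": {"a": 40, "b": 30, "c": 20},
    "esx2": {"d": 10},
    "esx3": {},
}
moves = rebalance(hosts)
for vm, src, dst in moves:
    print(f"move {vm}: {src} -> {dst}")
print({h: sum(vms.values()) for h, vms in hosts.items()})
```

The point is not the algorithm itself but what it operates on: the scheduler looks at the pool as a whole, so no single workload has to be sized against a single box.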
Software Defined Storage like Dell EMC’s ScaleIO delivers the same value.
Think of all of the storage distributed across your server infrastructure. What is it doing? Probably hosting a boot volume and not much else. Let’s say for example that you purchase a Dell R730 with 8 SSDs at 480GB each. That’s 3.8 TB of raw and largely untapped capacity. What if you have 10 of them? Now we have 38 TB of raw pooled capacity. And 100 hosts? At this point we’re sitting on top of a latent storage workhorse of 380 TB. You probably see where I’m going with this. Compare what the same 380 TB of raw capacity would cost in an all flash array. More than the server attached SSD list price of $399K for 100 fully populated Dell R730s? Probably. Keep in mind that the software comes at a price, but will it be less than a 380 TB all flash SAN cluster? Most likely – and with fewer management points and 100 storage controllers instead of 2-16.
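The arithmetic above is easy to rerun for your own fleet. This is a back-of-the-envelope sketch using the drive count, drive size, and $399K list price quoted above; the function names are illustrative, decimal units are assumed (1 TB = 1000 GB), and the price is treated as covering all 384 TB of raw capacity.

```python
def raw_pool_tb(hosts: int, drives_per_host: int = 8, drive_gb: int = 480) -> float:
    """Raw capacity in TB across all hosts, before any data protection overhead."""
    return hosts * drives_per_host * drive_gb / 1000

def price_per_raw_tb(total_price: float, hosts: int) -> float:
    """Price per raw TB, given a total price for the whole fleet."""
    return total_price / raw_pool_tb(hosts)

for hosts in (1, 10, 100):
    print(f"{hosts:>3} hosts: {raw_pool_tb(hosts):.1f} TB raw")

# $399K is the list price quoted in the text for 100 fully populated hosts.
print(f"~${price_per_raw_tb(399_000, 100):,.0f} per raw TB")
```

Keep in mind this is raw capacity. Usable capacity will be lower once data protection and spare space are accounted for, and the same is true of the array you compare it against.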
Pooling is part of the magic of managing costs. You don’t need enormous quantities of locally attached storage. Because those drives are distributed, pooling them gives you not only a much larger shared resource, but also the ability to manage performance, availability, and consistency across a larger portion of your infrastructure. A component outage in a traditional SAN creates a performance or availability impact. A single node outage in a software defined storage pool is a much smaller issue, and the bulk of the infrastructure is insulated from the impact. More servers are working together to rebalance and recover, automagically.
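A rough way to see why a node outage hurts less in a large pool is to look at how much of the pool a single failure takes offline. The sketch below is a simplification that assumes identical nodes with data and load spread evenly across them; the function name is illustrative, and the node counts echo the 2-16 controllers versus 100 hosts comparison above.

```python
def share_offline(nodes: int, failed: int = 1) -> float:
    """Fraction of pooled capacity and controllers lost when `failed` nodes go down.

    Assumes identical nodes and an even distribution of data and load.
    """
    if not 0 <= failed <= nodes:
        raise ValueError("failed must be between 0 and nodes")
    return failed / nodes

for nodes in (2, 16, 100):
    print(f"{nodes:>3} nodes: one failure removes {share_offline(nodes):.2%} of the pool")
```

With two controllers, one failure is half of your storage horsepower. With 100 nodes it is a rounding error, and the other 99 share the rebuild work.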
Aggregating performance, capacity, and resilience creates an antifragile infrastructure. For more on that, check out Nassim Taleb’s fantastic book, Antifragile. The net of it is that antifragility is not about tolerating faults, which is where much of physical SAN development, aside from media, is focused. Antifragility is about creating an infrastructure that performs well in normal times and outperforms traditional methods when the bad stuff happens. It loves and thrives on chaos. Bad stuff may be volume-based saturation, hardware outages, budgetary restrictions, hardware upgrades, etc. We all know that the clock is ticking on our next “bad stuff happened” event. Software defined storage distributed across compute nodes is a way of making that an alert message instead of an outage.
What about everybody’s favorite feature – Agility? That is the subject of the next article, and the focus is on practical examples instead of buzzwords.