Virtualization Blog

Discussions and observations on virtualization.

When Virtualised Storage is Faster than Bare Metal

An analysis of block size, inflight requests and outstanding data

INTRODUCTION

Back in August 2014 I went to the Xen Project Developer Summit in Chicago (IL) and presented a graph that caused a few faces to go "ahn?". The graph was meant to show how well XenServer 6.5 storage throughput could scale over several guests. For that, I compared 10 fio threads running in dom0 (mimicking 10 virtual disks) with 10 guests running 1 fio thread each. The result: the aggregate throughput of the virtual machines was actually higher.

In XenServer 6.5 (used for those measurements), the storage traffic of 10 VMs corresponds to 10 tapdisk3 processes doing I/O via libaio in dom0. My measurements used the same disk areas (raw block-based virtual disks) for each fio thread or tapdisk3. So how can 10 tapdisk3 processes possibly be faster than 10 fio threads also using libaio and also running in dom0?

At the time, I hypothesised that the lack of indirect I/O support in tapdisk3 was causing requests larger than 44 KiB (the maximum request size supported by Xen's traditional blkif protocol) to be split into smaller requests, and that the storage infrastructure (a Micron P320h) was responding better to a higher number of smaller requests. In case you are wondering: yes, I think people thought I was crazy.
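
To put rough numbers on that hypothesis, here is a back-of-the-envelope sketch (assuming requests are simply chopped into chunks of at most 44 KiB; the actual splitting logic in blkfront/tapdisk may differ):

# Hypothetical illustration: how many blkif requests a single guest request
# would become if split into chunks of at most 44 KiB
for req_kib in 64 256 1024; do
    chunks=$(( (req_kib + 43) / 44 ))
    echo "${req_kib} KiB request -> ${chunks} requests of <= 44 KiB"
done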

You can check out my one-year-old hypothesis between 5:10 and 5:30 in the XPDS'14 recording of my talk: https://youtu.be/bbdWFB1mBxA?t=5m10s

[Figure: 20150525-01-slide.jpg — XPDS'14 slide showing throughput as a function of request size for 10 fio threads in dom0 vs. 10 guests running 1 fio thread each]

TRADITIONAL STORAGE AND MERGES

For several years operating systems have been optimising storage I/O patterns (in software) before issuing them to the corresponding disk drivers. In Linux, this has been achieved via elevator schedulers and the block layer. Requests can be reordered, delayed, prioritised and even merged into a smaller number of larger requests.
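
For reference, the elevator in use can be inspected and changed at runtime through sysfs; a quick sketch (the device name here is just an example):

# The scheduler currently in use is shown in brackets
cat /sys/block/sda/queue/scheduler
# noop deadline [cfq]

# Switch to noop to minimise the reordering done in software
echo noop > /sys/block/sda/queue/scheduler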

Merging requests has been around for as long as I can remember. Everyone understands that fewer requests mean less overhead and that storage infrastructures respond better to larger requests. As a matter of fact, the graph above, which shows throughput as a function of request size, is proof of that: bigger requests mean higher throughput.

It wasn't until 2010 that a proper means of fully disabling request merging made its way into the Linux kernel. Alan Brunelle showed a 0.56% throughput improvement (and lower CPU utilisation) by not trying to merge requests at all. I wonder if he considered whether splitting requests could actually be even more beneficial.
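
That knob is the nomerges queue attribute in sysfs; this is roughly how it is used (again, the device name is just an example):

# 0 = merging enabled (default), 1 = only simple one-hit merges, 2 = no merging at all
echo 2 > /sys/block/sda/queue/nomerges
cat /sys/block/sda/queue/nomerges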

SPLITTING I/O REQUESTS

Given the results I saw in my 2014 measurements, I would like to take this concept a step further: on top of not merging requests, let's forcibly split them.

The rationale behind this idea is that some drives today will respond better to a higher number of outstanding requests. The Micron P320h performance testing guide says that it "has been designed to operate at peak performance at a queue depth of 256" (page 11). Similar documentation from Intel uses a queue depth of 128 to indicate peak performance of its NVMe family of products.
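
On Linux, the queue limits that shape how requests are actually issued to a device can be checked through sysfs; for example (using nvme0n1 as in the script further down):

# Number of requests the block layer will allow in flight for this device
cat /sys/block/nvme0n1/queue/nr_requests
# Largest request size (in KiB) the kernel will submit to the device
cat /sys/block/nvme0n1/queue/max_sectors_kb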

But it is one thing to say that a drive requires a large number of outstanding requests to perform at its peak. It is a different thing to say that a batch of 8 requests of 4 KiB each will complete quicker than one 32 KiB request.

MEASUREMENTS AND RESULTS

So let's put that to the test. I wrote a little script to measure the random read throughput of two modern NVMe drives when facing workloads of varying block sizes and I/O depths. For block sizes from 512 B to 4 MiB, I am particularly interested in analysing how these disks respond to larger "single" requests in comparison to smaller "multiple" requests. In other words, which is faster: 1 outstanding request of X bytes, or Y outstanding requests of X/Y bytes each?

My test environment consists of a Dell PowerEdge R720 (Intel E5-2643 v2 @ 3.5 GHz, 2 sockets, 6 cores/socket, HT enabled) with 64 GB of RAM, running 64-bit Debian Jessie with the Linux 4.0.4 kernel. My two disks are an Intel P3700 (400 GB) and a Micron P320h (175 GB). Fans were set to full speed and the power profile was configured for OS Control, with the performance governor in place.

#!/bin/bash
# Total outstanding data per iteration, in bytes (512 B to 4 MiB)
sizes="512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 \
       1048576 2097152 4194304"
drives="nvme0n1 rssda"

for drive in ${drives}; do
    for size in ${sizes}; do
        # Split each total size into ever smaller blocks (down to 512 B),
        # doubling the queue depth at every step
        for ((qd=1; size/qd >= 512; qd*=2)); do
            bs=$(( size / qd ))
            # Field 7 of fio's terse (v3) output is the read bandwidth in KB/s
            tp=$(fio --terse-version=3 --minimal --rw=randread --numjobs=1  \
                     --direct=1 --ioengine=libaio --runtime=30 --time_based \
                     --name=job --filename=/dev/${drive} --bs=${bs}         \
                     --iodepth=${qd} | awk -F';' '{print $7}')
            echo "${size} ${bs} ${qd} ${tp}" | tee -a ${drive}.dat
        done
    done
done

There are several ways of looking at the results. I believe it is always worth starting with a broad overview that includes everything that makes sense. The graphs below contain all the data points for each drive. Keep in mind that the x axis represents Block Size (in KiB) over Queue Depth.

[Figure: 20150525-02-nvme.jpg — Intel P3700 random read throughput for all block size / queue depth combinations]

[Figure: 20150525-03-rssda.jpg — Micron P320h random read throughput for all block size / queue depth combinations]

While the Intel P3700 is faster overall, both drives share a common trait: for a certain amount of outstanding data, throughput can be significantly higher if that data is split over several inflight requests instead of a single large request. Because this workload consists of random reads, this is a characteristic not evident in spinning disks (where the seek time would negatively affect the total throughput of the workload).

To make this point clearer, I have isolated the workloads involving 512 KiB of outstanding data on the P3700 drive. The graph below shows that if a workload randomly reads 512 KiB of data one request at a time (queue depth = 1), the throughput will be just under 1 GB/s. If, instead, the workload reads 8 KiB of data with 64 outstanding requests at a time, the throughput will be about double (just under 2 GB/s).

[Figure: 20150525-04-nvme512k.jpg — Intel P3700 throughput for workloads with 512 KiB of outstanding data at varying queue depths]
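
For those who want to reproduce just these two data points, the corresponding fio invocations (using the same parameters as the script above) would be along the lines of:

# 512 KiB of outstanding data as a single large request at a time
fio --name=single --filename=/dev/nvme0n1 --rw=randread --direct=1 \
    --ioengine=libaio --runtime=30 --time_based --bs=512k --iodepth=1

# The same 512 KiB of outstanding data split over 64 requests of 8 KiB
fio --name=split --filename=/dev/nvme0n1 --rw=randread --direct=1 \
    --ioengine=libaio --runtime=30 --time_based --bs=8k --iodepth=64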

CONCLUSIONS

Storage technologies are constantly evolving. At this point in time, it appears that hardware is evolving much faster than software. In this post I have discussed a paradigm of workload optimisation (request merging) that perhaps no longer applies to modern solid state drives. As a matter of fact, I am proposing that the exact opposite (request splitting) should be done in certain cases.

Traditional spinning disks have always responded better to large requests. Such requests reduce the overhead of seek times, where the head of a disk must roam around to fetch random bits of data. Solid state drives, in contrast, respond better to parallel requests and suffer virtually no overhead for random access patterns.

Virtualisation platforms and software-defined storage solutions are perfectly placed to take advantage of such paradigm shifts. By understanding the hardware infrastructure they sit on top of, as well as the workload patterns of their users (e.g. virtual desktops), they can easily manipulate requests to make better use of system resources.

Comments (4)

Tobias Kreidl on Monday, 25 May 2015 18:51

Fascinating as always, Felipe, and thank you for sharing these results. Indeed, SSD behavior will be different from that of traditional spinning disks. In addition, even software defined storage (SDS) can significantly affect the storage queue operations. For example, some SDS implementations will reduce enormously the contentions between reads and writes to the same storage by making heavy use of read caching, which of course frees up most of the I/O to be focused on writes. In addition, in some cases the write operations are queued and then processed all at once to flush the write-back cache and consequently, it is then not possible to avoid a large stream of write requests. Hence, it would seem the write-back mode itself is going to need to be aware of what mechanism is in place: small, constant increments vs. a periodic "flood" of data.

It would be most interesting if the operating system were able to become aware of the storage mechanisms and self-adjust the queue processing to provide such optimizations.

Felipe Franciosi on Tuesday, 26 May 2015 15:02

Thanks for the comment. I would welcome more people experimenting with these findings and discussing this topic. If this turns out to be portable (or generic) enough, I can imagine an adaptive cooperation between drivers and the block layer, where hints are exchanged to achieve near-optimal workload patterns.

Guest - Sam McLeod on Tuesday, 26 May 2015 02:50

Thanks for the post, however I think IOP/s performance is far more interesting than throughput (MB/s).

We find that tapdisk3 generally has very poor performance on busy servers; it will max out a CPU core at just 35-40K IOP/s!

Given that our storage is capable of providing 700K-1M random 4k write IOP/s, and that XenServer does not support TRIM / UNMAP on the VMs, we find XenServer / tapdisk to be the biggest bottleneck in our datacenter.

See:
- https://bugs.xenserver.org/browse/XSO-263
- https://smcleod.net/building-a-high-performance-ssd-san/

Felipe Franciosi on Tuesday, 26 May 2015 15:17

Thanks for the comment.

IOPS measures how many requests (normally of the same size and issued with a constant queue depth) can be served in one second. Throughput measures how much data (or requests multiplied by their size) can be served over a fixed time period (normally a second). The two metrics are therefore different ways of looking at the same measurements.

Having said that, the two metrics are normally used in different ways. IOPS focuses on the bottlenecks involving small requests (normally pushing a large number of outstanding requests), while throughput focuses on larger data transfers, pushing the system in a different way.

As you can read in my previous blog post (http://xenserver.org/blog/entry/tapdisk3), Tapdisk3 focused on improving aggregate performance over many guests. I believe the bottleneck you are referring to regards the performance of a single virtual disk. Without making any promises, rest assured that these are also being actively investigated.

With regards to Trim support, you did the right thing by raising your request in an XSO ticket. These are taken into account by our Product Management when prioritising feature development.

