Virtualization Blog

Discussions and observations on virtualization.

Follow me on twitter: @franciozzy I have been a Software Performance Engineer working for Citrix since October 2011, more specifically on XenServer Storage Virtualisation. Previously, I finished a PhD at Imperial College London on the same subject. Regarding computing and besides performance evaluation of virtualised storage, my interests also include computer networks, distributed systems and high performance computing to name but a few. In my spare time I enjoy playing my bass guitars, practicing Kyokushin Karate and doing close-up magic. Oh, and playing Poker and Chess.

When Virtualised Storage is Faster than Bare Metal

An analysis of block size, inflight requests and outstanding data


Back in August 2014 I went to the Xen Project Developer Summit in Chicago (IL) and presented a graph that caused a few faces to go "ahn?". The graph was meant to show how well XenServer 6.5 storage throughput could scale over several guests. For that, I compared 10 fio threads running in dom0 (mimicking 10 virtual disks) with 10 guests running 1 fio thread each. The result: the aggregate throughput of the virtual machines was actually higher.

In XenServer 6.5 (used for those measurements), the storage traffic of 10 VMs corresponds to 10 tapdisk3 processes doing I/O via libaio in dom0. My measurements used the same disk areas (raw block-based virtual disks) for each fio thread or tapdisk3. So how can 10 tapdisk3 processes possibly be faster than 10 fio threads also using libaio and also running in dom0?

At the time, I hypothesised that the lack of indirect I/O support in tapdisk3 was causing requests larger than 44 KiB (the maximum supported request size in Xen's traditional blkif protocol) to be split into smaller requests, and that the storage infrastructure (a Micron P320h) was responding better to a higher number of smaller requests. In case you are wondering, I also think that people thought I was crazy.
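To give a feel for how much splitting this implies, here is a minimal sketch of the arithmetic. It assumes the traditional blkif limit of 11 segments of 4 KiB per request (hence 44 KiB) and page-aligned buffers; the helper name blkif_split_count is hypothetical and only illustrates the calculation.

```shell
#!/bin/bash
# Hypothetical helper: estimate how many traditional blkif requests are
# needed to carry one guest request of a given size (in bytes).
# Assumption: at most 11 segments of 4 KiB per request, page-aligned buffers.
BLKIF_MAX_BYTES=$((11 * 4096))   # 45056 bytes = 44 KiB

blkif_split_count() {
    local bytes=$1
    # Integer ceiling division.
    echo $(( (bytes + BLKIF_MAX_BYTES - 1) / BLKIF_MAX_BYTES ))
}

blkif_split_count 1048576   # a single 1 MiB request
```

Under these assumptions, a single 1 MiB request from the guest would reach the backend as 24 smaller requests.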

You can check out my one year old hypothesis between 5:10 and 5:30 on the XPDS'14 recording of my talk:



For several years operating systems have been optimising storage I/O patterns (in software) before issuing them to the corresponding disk drivers. In Linux, this has been achieved via elevator schedulers and the block layer. Requests can be reordered, delayed, prioritised and even merged into a smaller number of larger requests.

Merging requests has been around for as long as I can remember. Everyone understands that fewer requests mean less overhead and that storage infrastructures respond better to larger requests. As a matter of fact, the graph above, which shows throughput as a function of request size, is proof of that: bigger requests mean higher throughput.

It wasn't until 2010 that a proper means to fully disable request merging came into play in the Linux kernel. Alan Brunelle showed a 0.56% throughput improvement (and less CPU utilisation) by not trying to merge requests at all. I wonder whether he considered that splitting requests could actually be even more beneficial.


Given the results I have seen in my 2014 measurements, I would like to take this concept a step further. On top of not merging requests, let's forcibly split them.

The rationale behind this idea is that some drives today will respond better to a higher number of outstanding requests. The Micron P320h performance testing guide says that it "has been designed to operate at peak performance at a queue depth of 256" (page 11). Similar documentation from Intel uses a queue depth of 128 to indicate peak performance of its NVMe family of products.

But it is one thing to say that a drive requires a large number of outstanding requests to perform at its peak. It is a different thing to say that a batch of 8 requests of 4 KiB each will complete quicker than one 32 KiB request.


So let's put that to the test. I wrote a little script to measure the random read throughput of two modern NVMe drives when facing workloads with varying block sizes and I/O depth. For block sizes from 512 B to 4 MiB, I am particularly interested in analysing how these disks respond to larger "single" requests in comparison to smaller "multiple" requests. In other words, what is faster: 1 outstanding request of X bytes or Y outstanding requests of X/Y bytes?

My test environment consists of a Dell PowerEdge R720 (Intel E5-2643v2 @ 3.5GHz, 2 Sockets, 6 Cores/socket, HT Enabled), with 64 GB of RAM running Debian Jessie 64-bit and the Linux 4.0.4 kernel. My two disks are an Intel P3700 (400GB) and a Micron P320h (175GB). Fans were set to full speed and the power profiles are configured for OS Control, with a performance governor in place.

#!/bin/bash
# Total outstanding data per data point, in bytes (512 B to 4 MiB).
sizes="512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 \
       1048576 2097152 4194304"
drives="nvme0n1 rssda"

for drive in ${drives}; do
    for size in ${sizes}; do
        # Split ${size} bytes over qd requests of bs bytes each (bs >= 512).
        for ((qd=1; size/qd >= 512; qd*=2)); do
            bs=$(( size / qd ))
            # Field 7 of fio's terse (v3) output is the read bandwidth.
            tp=$(fio --terse-version=3 --minimal --rw=randread --numjobs=1  \
                     --direct=1 --ioengine=libaio --runtime=30 --time_based \
                     --name=job --filename=/dev/${drive} --bs=${bs}         \
                     --iodepth=${qd} | awk -F';' '{print $7}')
            echo "${size} ${bs} ${qd} ${tp}" | tee -a "${drive}.dat"
        done
    done
done

There are several ways of looking at the results. I believe it is always worth starting with a broad overview including everything that makes sense. The graphs below contain all the data points for each drive. Keep in mind that the "x" axis represents Block Size (in KiB) over the Queue Depth.



While the Intel P3700 is faster overall, both drives share a common trait: for a certain amount of outstanding data, throughput can be significantly higher if such data is split over several inflight requests (instead of a single large request). Because this workload consists of random reads, this is a characteristic that is not evident in spinning disks (where the seek time would negatively affect the total throughput of the workload).

To make this point clearer, I have isolated the workloads involving 512 KiB of outstanding data on the P3700 drive. The graph below shows that if a workload randomly reads 512 KiB of data one request at a time (queue depth=1), the throughput will be just under 1 GB/s. If, instead, the workload reads 8 KiB at a time with 64 outstanding requests, the throughput will be about double (just under 2 GB/s).
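Isolating such a slice from the results is straightforward. The sketch below assumes the four-column "size bs qd tp" layout written by the measurement script; the helper name select_outstanding and the example file name are only illustrative.

```shell
#!/bin/bash
# Hypothetical helper: list every (block size, queue depth, throughput)
# combination measured for a given amount of outstanding data (in bytes).
# Assumes the four-column "size bs qd tp" layout of the .dat files.
select_outstanding() {
    local datfile=$1 outstanding=$2
    awk -v total="${outstanding}" '$1 == total { print $2, $3, $4 }' "${datfile}"
}

# Example (file name is an assumption): 512 KiB of outstanding data.
#   select_outstanding nvme0n1.dat 524288
```

Each output line is one way of splitting the same amount of outstanding data, which is exactly what the isolated graph compares.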



Storage technologies are constantly evolving. At this point in time, it appears that hardware is evolving much faster than software. In this post I have discussed a paradigm of workload optimisation (request merging) that perhaps no longer applies to modern solid state drives. As a matter of fact, I am proposing that the exact opposite (request splitting) should be done in certain cases.

Traditional spinning disks have always responded better to large requests. Such workloads reduced the overhead of seek times where the head of a disk must roam around to fetch random bits of data. In contrast, solid state drives respond better to parallel requests, with virtually no overhead for random access patterns.

Virtualisation platforms and software-defined storage solutions are perfectly placed to take advantage of such paradigm shifts. By understanding the hardware infrastructure they are placed on top of, as well as the workload patterns of their users (e.g. Virtual Desktops), requests can be easily manipulated to better exploit system resources.


Average Queue Size and Storage IO Metrics


There seems to be a bit of confusion around the metric "average queue size". This is a metric reported by iostat as "avgqu-sz". The confusion seems to arise when iostat reports a different avgqu-sz in dom0 and in domU for a single Virtual Block Device (VBD), while other metrics such as Input/Output Operations Per Second (IOPS) and Throughput (often expressed in MB/s) are the same. This page will describe what all of this actually means and how it should be interpreted.


On any modern Operating System (OS), it is possible to concurrently submit several requests to a single storage device. This practice normally helps several layers of the data path to perform better, allowing systems to achieve higher numbers in metrics such as IOPS and throughput. However, measuring the average of outstanding (or "inflight") requests for a given block device over a period of time can be a bit tricky. This is because the number of outstanding requests is an "instant metric". That is, when you look, there might be zero requests pending for that device. When you look again, there might be 28. Without a lot of accounting and some intrusiveness, it is not really possible to tell what happened in-between.

Most users, however, are not interested in everything that happened in-between. People are much more interested in the average of outstanding requests. This average gives a good understanding of the workload that is taking place (i.e. how applications are using storage) and helps with tuning the environment for better performance.

Calculating the Average Queue Size

To understand how the average queue size is calculated, consider the following diagram which presents a Linux system running 'fio' as a benchmarking user application issuing requests to a SCSI disk.


Figure 1. Benchmark issuing requests to a disk

The application issues requests to the kernel through libraries such as libc or libaio. In the simple case where the benchmark is configured with an IO Depth of 1, 'fio' will attempt to keep one request "flying" at all times. As soon as one request completes, 'fio' will send another. This can be achieved with the following configuration file (which runs for 10 seconds and considers /dev/xvdb as the benchmarking disk):



Table 1. fio configuration file for a test workload

NOTE: In this experiment, /dev/xvdb was configured as a RAW VDI. Be sure to fully populate VHD VDIs before running experiments (especially if they are read-based).
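A job file matching this description might look like the sketch below. It is reconstructed from the text (sequential 4 KiB reads via libaio, an IO Depth of 1, 10 seconds, /dev/xvdb) and is not a copy of Table 1; the use of direct=1 and the helper name write_fio_job are assumptions.

```shell
#!/bin/bash
# Hypothetical reconstruction of a fio job matching the description in the
# text; this is not the original Table 1. direct=1 is an assumption.
write_fio_job() {
    cat <<'EOF'
[global]
ioengine=libaio
direct=1
time_based
runtime=10

[job]
filename=/dev/xvdb
rw=read
bs=4k
iodepth=1
EOF
}

# Usage: write_fio_job > seqread.fio && fio seqread.fio
```

Raising iodepth in the job section is all that is needed to repeat the experiment with more requests in flight.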

One of the metrics made available by the block layer for a device is the number of read and write "ticks" (see stat.txt in the Linux Kernel documentation). This exposes the amount of time per request that the device has been occupied. The block layer starts this accounting immediately before shipping the request to the driver and stops it immediately after the request completes. The figure below represents this time in the RED and BLUE horizontal bars.


Figure 2. Diagram representing request accounting

It is important to understand that this metric can grow quicker than time. This will happen if more than one request has been submitted concurrently. In the example below, a new (green) request has been submitted before the first (red) request has been completed. It completed after the red request finished and after the blue request was issued. During the moments where requests overlapped, the ticks metric increased at a rate greater than time.


Figure 3. Diagram representing concurrent request accounting

Looking at this last figure, it is clear that there were moments where no request was present in the device driver. There were also moments where one or two requests were present in the driver. To calculate the average of inflight requests (or average queue size) between two moments in time, tools like iostat will sample "ticks" at moment one, sample "ticks" again at moment two, and divide the difference between these ticks by the time interval between these moments.


Figure 4. Formula to calculate the average queue size
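In other words, the average queue size is (ticks at moment two - ticks at moment one) / (time between the two moments), with everything expressed in milliseconds. A minimal sketch of that calculation follows. The helper names are hypothetical, and reading fields 4 and 8 of /sys/block/<dev>/stat as the read and write ticks follows the kernel's block statistics documentation; this illustrates the idea and is not how iostat itself is implemented.

```shell
#!/bin/bash
# Average queue size between two samples, following the description above:
# (ticks2 - ticks1) / elapsed time, with all values in milliseconds.
avg_queue_size() {
    local ticks1=$1 ticks2=$2 interval_ms=$3
    awk -v a="${ticks1}" -v b="${ticks2}" -v t="${interval_ms}" \
        'BEGIN { printf "%.2f\n", (b - a) / t }'
}

# Sum of read and write ticks for a device: fields 4 and 8 of
# /sys/block/<dev>/stat.
read_ticks() {
    awk '{ print $4 + $8 }' "/sys/block/$1/stat"
}

# Usage sketch (device name is an assumption):
#   t1=$(read_ticks xvdb); sleep 1; t2=$(read_ticks xvdb)
#   avg_queue_size "${t1}" "${t2}" 1000
```

For instance, if the ticks counter advances by 960 ms over a 1000 ms interval, the average queue size is 0.96. If it advances by 2000 ms over the same interval, two requests were in flight on average.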

The Average Queue Size in a Virtualised Environment

In a virtualised environment, the datapath between the benchmarking application (fio) running inside a virtual machine and the actual storage is different. Considering XenServer 6.5 as an example, the figure below shows a simplification of this datapath. As in the examples of the previous section, requests start in a virtual machine's user space application. When moving through the kernel, however, they are directed to paravirtualised (PV) storage drivers (e.g. blkfront) instead of an actual SCSI driver. These requests are picked up by the storage backend (tapdisk3) in dom0's user space. They are submitted to dom0's kernel via libaio, pass through the block layer and reach the disk drivers for the corresponding storage infrastructure (in this example, a SCSI disk).


Figure 5. Benchmark issuing requests on a virtualised environment

The technique described above to calculate the average queue size will produce different values depending on where it is applied. Considering the diagram above, it could be used in the virtual machine's block layer, in tapdisk3 or in the dom0's block layer. Each of these would show a different queue size and actually mean something different. The diagram below extends the examples used in this article to include these layers.


Figure 6. Diagram representing request accounting in a virtualised environment

The figure above contains (almost) vertical arrows between the layers representing requests departing from and arriving at different system components. These arrows are slightly angled, suggesting that time passes as a request moves from one layer to another. There is also some elapsed time between an arrow arriving at a layer and a new arrow leaving from that layer.

Another detail of the figure is the horizontal (red and blue) bars. They indicate where requests are accounted at a particular layer. Note that this accounting starts some time after a request arrives at a layer (and some time before the request passes to another layer). These offsets, however, are merely illustrative. A thorough look at the output of specific performance tools is necessary to understand what the "Average Queue Size" is for certain workloads.

Investigating a Real Deployment

In order to place real numbers in this article, the following environment was configured:

  • Hardware: Dell PowerEdge R310
    • Intel Xeon X3450 2.67GHz (1 Socket, 4 Cores/socket, HT Enabled)
    • BIOS Power Management set to OS DBPM
    • Xen P-State Governor set to "Performance", Max Idle State set to "1"
    • 8 GB RAM
    • 2 x Western Digital WD2502ABYS
      • /dev/sda: XenServer Installation + guest's root disk
      • /dev/sdb: LVM SR with one 10 GiB RAW VDI attached to the guest
  • dom0: XenServer Creedence (Build Number 88873)
    • 4 vCPUs
    • 752 MB RAM
  • domU: Debian Wheezy x86_64
    • 2 vCPUs
    • 512 MB RAM

When issuing the fio workload as indicated in Table 1 (sequentially reading 4 KiB requests using libaio, with the IO Depth set to 1, for 10 seconds), an iostat within the guest reports the following:

root@wheezy64:~# iostat -xm | grep Device ; iostat -xm 1 | grep xvdb
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdb              0.00     0.00  251.05    0.00     0.98     0.00     8.00     0.04    0.18    0.18    0.00   0.18   4.47
xvdb              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
xvdb              0.00     0.00 4095.00    0.00    16.00     0.00     8.00     0.72    0.18    0.18    0.00   0.18  72.00
xvdb              0.00     0.00 5461.00    0.00    21.33     0.00     8.00     0.94    0.17    0.17    0.00   0.17  94.40
xvdb              0.00     0.00 5479.00    0.00    21.40     0.00     8.00     0.96    0.18    0.18    0.00   0.18  96.40
xvdb              0.00     0.00 5472.00    0.00    21.38     0.00     8.00     0.95    0.17    0.17    0.00   0.17  95.20
xvdb              0.00     0.00 5472.00    0.00    21.38     0.00     8.00     0.97    0.18    0.18    0.00   0.18  97.20
xvdb              0.00     0.00 5443.00    0.00    21.27     0.00     8.00     0.96    0.18    0.18    0.00   0.18  95.60
xvdb              0.00     0.00 5465.00    0.00    21.34     0.00     8.00     0.96    0.17    0.17    0.00   0.17  95.60
xvdb              0.00     0.00 5467.00    0.00    21.36     0.00     8.00     0.96    0.18    0.18    0.00   0.18  96.00
xvdb              0.00     0.00 5475.00    0.00    21.39     0.00     8.00     0.96    0.18    0.18    0.00   0.18  96.40
xvdb              0.00     0.00 5479.00    0.00    21.40     0.00     8.00     0.97    0.18    0.18    0.00   0.18  96.80
xvdb              0.00     0.00 1155.00    0.00     4.51     0.00     8.00     0.20    0.17    0.17    0.00   0.17  20.00
xvdb              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
xvdb              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
xvdb              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

The value of interest is reported in the column "avgqu-sz". It is about 0.96 on average while the benchmark was running. This means that the guest's block layer (referring to Figure 6) is handling requests almost the entire time.

The next layer of the storage subsystem that accounts for utilisation is tapdisk3. This value can be obtained running /opt/xensource/debug/xsiostat in dom0. For the same experiment, it reports the following:

[root@dom0 ~]# /opt/xensource/debug/xsiostat | head -2 ; /opt/xensource/debug/xsiostat | grep 51728
  DOM   VBD         r/s        w/s    rMB/s    wMB/s rAvgQs wAvgQs
    1,51728:       0.00       0.00     0.00     0.00   0.00   0.00
    1,51728:    1213.04       0.00     4.97     0.00   0.22   0.00
    1,51728:    5189.03       0.00    21.25     0.00   0.71   0.00
    1,51728:    5196.95       0.00    21.29     0.00   0.71   0.00
    1,51728:    5208.94       0.00    21.34     0.00   0.71   0.00
    1,51728:    5208.10       0.00    21.33     0.00   0.71   0.00
    1,51728:    5194.92       0.00    21.28     0.00   0.71   0.00
    1,51728:    5203.08       0.00    21.31     0.00   0.71   0.00
    1,51728:    5245.00       0.00    21.48     0.00   0.72   0.00
    1,51728:    5482.02       0.00    22.45     0.00   0.74   0.00
    1,51728:    5474.02       0.00    22.42     0.00   0.74   0.00
    1,51728:    3936.92       0.00    16.13     0.00   0.53   0.00
    1,51728:       0.00       0.00     0.00     0.00   0.00   0.00
    1,51728:       0.00       0.00     0.00     0.00   0.00   0.00

Analogously to what was observed within the guest, xsiostat reports on the amount of time that it had outstanding requests. At this layer, this figure is reported at about 0.71 while the benchmark was running. This gives an idea of the time that passed between a request being accounted in the guest's block layer and at the dom0's backend system. Going further, it is possible to run iostat in dom0 and find out what the perceived utilisation is at the last layer before the request is issued to the device driver.

[root@dom0 ~]# iostat -xm | grep Device ; iostat -xm 1 | grep dm-3
Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-3              0.00     0.00 102.10  0.00     0.40     0.00     8.00     0.01    0.11   0.11   1.16
dm-3              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-3              0.00     0.00 281.00  0.00     1.10     0.00     8.00     0.06    0.20   0.20   5.60
dm-3              0.00     0.00 5399.00  0.00    21.09     0.00     8.00     0.58    0.11   0.11  58.40
dm-3              0.00     0.00 5479.00  0.00    21.40     0.00     8.00     0.58    0.11   0.11  57.60
dm-3              0.00     0.00 5261.00  0.00    20.55     0.00     8.00     0.61    0.12   0.12  61.20
dm-3              0.00     0.00 5258.00  0.00    20.54     0.00     8.00     0.61    0.12   0.12  61.20
dm-3              0.00     0.00 5206.00  0.00    20.34     0.00     8.00     0.57    0.11   0.11  56.80
dm-3              0.00     0.00 5293.00  0.00    20.68     0.00     8.00     0.60    0.11   0.11  60.00
dm-3              0.00     0.00 5476.00  0.00    21.39     0.00     8.00     0.64    0.12   0.12  64.40
dm-3              0.00     0.00 5480.00  0.00    21.41     0.00     8.00     0.61    0.11   0.11  60.80
dm-3              0.00     0.00 5479.00  0.00    21.40     0.00     8.00     0.66    0.12   0.12  66.40
dm-3              0.00     0.00 5047.00  0.00    19.71     0.00     8.00     0.56    0.11   0.11  56.40
dm-3              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-3              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

At this layer, the block layer reports about 0.61 for the average queue size.
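As a sanity check on these numbers (not part of the original measurements), the average queue size at a layer should roughly equal the request rate multiplied by the average time a request spends there, which is Little's law. A minimal sketch, using the r/s and await columns of one dm-3 sample above; the helper name expected_queue_size is hypothetical.

```shell
#!/bin/bash
# Little's law: average requests in flight = requests per second x average
# time each request spends in the layer (await, converted from ms to s).
expected_queue_size() {
    local iops=$1 await_ms=$2
    awk -v r="${iops}" -v w="${await_ms}" \
        'BEGIN { printf "%.2f\n", r * w / 1000 }'
}

expected_queue_size 5479.00 0.12   # a dm-3 sample from the dom0 iostat output
```

For that sample the result is 0.66, matching the avgqu-sz that iostat printed on the same line. Other lines agree less exactly because await is rounded to two decimal places.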

Varying the IO Depth

The sections above clarified why users might see a lower queue utilisation in dom0 when comparing the output of performance tools in different layers of the storage subsystem. The examples shown so far, however, covered mostly the case where IO Depth is set to "1". This means that the benchmark tool run within the guest (e.g. fio) will attempt to keep one request inflight at all times. This tool's perception, however, might be incorrect given that it takes time for the request to actually reach the storage infrastructure.

Using the same environment described in the previous section and gradually increasing the IO Depth in the benchmark configuration, the following data can be gathered:


Figure 7. Average queue size vs. io depth as configured in fio


This article explained what the average queue size is and how it is calculated. As examples, it included real data from specific server and disk types. This should clarify why certain workloads cause different queue utilisations to be perceived from the guest and from dom0.


XenServer Storage Performance Improvements and Tapdisk3


The latest development builds of XenServer (check out the Creedence Alpha Releases) enjoy significantly superior storage performance compared to XenServer 6.2 as already mentioned by Marcus Granado in his blog post about Performance Improvements in Creedence. This improvement is primarily due to the integration of tapdisk3. This blog post will introduce and discuss this new storage virtualisation technology, presenting results for experiments reaching about 10 GB/s of aggregated storage throughput in a single host and explaining how this was achieved.


A few months ago I wrote a blog post on Project Karcygwins which covered a series of experiments and investigations we conducted around storage IO. These focused on workloads originating from a single VM and applied to a single virtual disk. We were particularly interested in understanding the virtualisation overhead added to these workloads, especially on low latency storage devices such as modern SSDs. Comparing different storage data paths (e.g. blkback, blktap2) available for use with the Xen Project Hypervisor, we explained why and when any overhead would exist as well as how noticeable it could get. The full post can be read here:

Since then, we expanded the focus of our investigations to encompass more complex workloads. More specifically, we started to focus on aggregate throughput and what circumstances were required for several VMs to make full use of a storage array’s potential. This investigation was conducted around the new tapdisk3, developed in XenServer by Thanos Makatos. Tapdisk3 was written to have a simpler architecture and is implemented entirely in user space, leading to substantial performance improvements.

What is new in Tapdisk3?

There are two major differences between tapdisk2 and tapdisk3. The first one is in the way this component is hooked up to the storage subsystem: while the former relied on blkback and blktap2, the latter connects directly to blkfront. The second major difference lies in the way data is transferred to and from guests: while the former used grant mapping and “memcpy”, the latter uses grant copy. For further details, refer to the section “Technical Details” at the end of this post.

Naturally, other changes were required to make all of this work. Most of them, however, are related to the control plane. For these, there were toolstack (xapi) changes and the appearance of a “tapback” component to connect everything up. Because of these changes (and some others regarding how tapdisk3 handles in-flight data), the dom0 memory footprint of a connected virtual disk also changed. This is currently under evaluation and may see further modifications before tapdisk3 is officially released.

Performance Evaluation

In order to measure the performance improvements achieved with tapdisk3, we selected the fastest host and the fastest disks we had available. This is the box we configured for these measurements:

  • Dell PowerEdge R720
    • 64 GB of RAM
    • Intel Xeon E5-2643 v2 @3.5 GHz
      • 2 Sockets, 6 cores per socket, hyper threaded = 24 pCPUs
    • Turbo up to 3.8 GHz
    • Xen Project Hypervisor governor set to Performance
      • Default is set to "On Demand" for power saving reasons
      • Refer to Rachel Berry's blog post for more information on governors
    • BIOS set to Performance per Watt (OS)
    • Maximum C-State set to 1
  • 4 x Micron P320h PCIe SSD (175 GB each)
  • 2 x Intel 910 PCIe SSD (400 GB each)
    • Each presented as 2 SCSI devices of 200 GB (for a total of 4 devices and 800 GB)
  • 1 x Fusion-io ioDrive2 (785 GB)

After installing XenServer Creedence Build #86278 (about 5 builds newer than Alpha 2) and the Fusion-io drivers (compiled separately), we created a Storage Repository (SR) on each available device. This produced a total of 9 SRs and about 2.3 TB of local storage. On each SR, we created 10 RAW Virtual Disk Images (VDI) of 10 GB each. One VDI from each SR was assigned to each VM in a round-robin fashion as in the diagram below. The guest of choice was Ubuntu 14.04 (x86_64, 2 vCPUs unpinned, 1024 MB RAM). We also assigned 24 vCPUs to dom0 and decided not to use pinning (see XenServer 6.2.0 CTX139714 for more information on pinning strategies).


We first measured what aggregate throughput the host would deliver when the VDIs were plugged to the VMs via the traditional tapdisk2-blktap2-blkback data path. For that, we got one VM to sequentially write for 10 seconds on all VDIs (at the same time). We observed the total amount of data transferred. This was done with requests varying from 512 bytes up to 4 MiB. Once completed, we repeated the experiment with an increasing number of VMs (up to ten). And then we did it all again for reads instead of writes. The results are plotted below:



In terms of aggregate throughput, the measurements suggest that the VMs cannot achieve more than 4 GB/s when reading or writing. Next, we repeated the experiment with the VDIs plugged with tapdisk3. The results were far more impressive:



This time, the workload produced numbers on a different scale. For writing, the aggregate throughput from the VMs approached the 8.0 GB/s mark. For reading, it approached the 10.0 GB/s mark. For some data points in this particular experiment, the tapdisk3 data path proves to be faster than tapdisk2 by ~100% when writing and ~150% when reading. This is an impressive speed up on a metric that users really care about. 
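The percentages follow from simple arithmetic on the approximate ceilings quoted above (about 4 GB/s for tapdisk2, 8 GB/s and 10 GB/s for tapdisk3). The sketch below uses those rounded figures, not individual data points, and the helper name speedup_pct is hypothetical.

```shell
#!/bin/bash
# Relative improvement of a new throughput over an old one, as a percentage.
speedup_pct() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", (b - a) / a * 100 }'
}

speedup_pct 4 8    # writes: ~4 GB/s with tapdisk2 vs ~8 GB/s with tapdisk3
speedup_pct 4 10   # reads:  ~4 GB/s with tapdisk2 vs ~10 GB/s with tapdisk3
```

With these rounded inputs the two calls print 100 and 150 respectively.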

Technical Details

To understand why tapdisk3 is so much faster than tapdisk2 from a technical perspective, it is important to first review the relevant terminology and architectural aspects of the virtual storage subsystem used with paravirtualised guests and Xen Project Hypervisors. We will focus on the components used with XenServer and generic Linux VMs. Note, however, that the information below is very similar for Windows guests when they have PV drivers installed.


Traditionally, Linux guests (under Xen Project Hypervisors) load a driver named blkfront. As far as the guest is concerned, this is a driver for a normal block device. The difference is that, instead of talking to an actual device (hardware), blkfront talks to blkback (in dom0) through shared memory regions and event channels (Xen Project’s mechanism to deliver interrupts between domains). The protocol between these components is referred to as the blkif protocol.

Applications in the guest will issue read or write operations (via libc, libaio, etc) to files in a filesystem or directly to (virtual) block devices. These are eventually translated into block requests and delivered to blkfront, being normally associated with random pages within the guest’s memory space. Blkfront, in turn, will grant dom0 access to those pages so that blkback can read from or write to them. This type of access is known as grant mapping.

While the Xen Project developer community has made efforts to improve the scalability and performance of grant mapping mechanisms, there is still work to be done. This is a set of complex operations and some of its limitations are still showing up, especially when dealing with concurrent access from multiple guests. Some notable recent efforts were Matt Wilson's patches to improve locking for better scalability.


In order to avoid the overhead of grant mapping and unmapping memory regions for each request, Roger Pau Monne implemented a feature called “persistent grants” in the blkback/blkfront protocol. This can be negotiated between domains where supported. When used, blkfront will grant access to a set of pages to blkback and both components will use these pages for as long as they can.

The downside of this approach is that blkfront cannot control which pages are going to be associated with requests that come from the guest’s block layer. It therefore needs to copy data from/to these requests to this set of persistently granted pages before passing blkif requests to blkback. Even with the added copy, persistent grants is a proven method for increased scalability in concurrent IO.

Both approaches presented above are entirely implemented in kernel-space within dom0. They also have something else in common: requests issued to dom0’s block layer refer to pages that actually reside in the guest’s memory space. This can trigger a potential race condition when using network-based storage (e.g. NFS and possibly iSCSI): if there is a network packet (which is associated with a page grant) queued for retransmission and an ACK arrives for the original transmission of that same packet, dom0 might retransmit invalid data or even crash (because that grant could either contain invalid data or have already been unmapped).

To get around this problem, XenServer started copying the pages to dom0 instead of using grants directly. This was done by the blktap2 component, which was introduced with tapdisk2 to deliver other features such as thin-provisioning (using the VHD format) and Storage Motion. In this design, blktap2 copies the pages before passing them to tapdisk2, ensuring safety for network-based back ends. The reasoning behind blktap2 was to provide a block device in dom0 that represented the VDI as a fully provisioned device regardless of its origins (e.g. a thin-provisioned file in an NFS mount).


As we saw in the measurements above, this approach has its limitations. While it works well for a variety of storage types, it fails to scale in terms of performance with modern technologies such as several locally-attached PCIe SSDs. To respond to these changes in storage technologies, XenServer Creedence will include tapdisk3 which makes use of another approach: grant copy.


With the introduction of the 3.x kernel series to dom0 and consequently the grant device (gntdev), we were able to access pages from other domains directly from dom0’s user space (domains are still required to explicitly grant proper access through the Xen Project Hypervisor). This technology allowed us to implement tapdisk3, which uses the gntdev and the event channel device (evtchn) to communicate directly with blkfront. However, instead of accessing pages as before, we coded tapdisk3 to use a Xen Project feature called “grant copy”.

Grant copying data is much faster than grant mapping and then copying. With grant copy, pretty much everything happens within the Xen Project Hypervisor itself. This approach also ensures that data is present in dom0, making it safe to use with network-attached backends. Finally, because all the logic is implemented in a user-space application, it is trivial to support thin-provisioned formats (e.g. VHD) and all the other features we already provided such as Storage Motion, snapshotting, fast clones, etc. To ensure a block device representing the VDI is still available in dom0 (for VDI copy and other operations), we continued to connect tapdisk3 to blktap2.

Last but not least, the avid reader might wonder why XenServer is not following in the footsteps of qemu-qdisk, which implements persistent grants in user space. With persistent grants, requests would be associated with grants for pages that actually lie in guests’ memory space (just like in Approach 2 above). In order to remain safe for network-based backends, qemu-qdisk therefore disables the O_DIRECT flag when issuing requests to a VDI. This causes data to be copied to/from dom0’s buffer cache (hence guaranteeing safety, as requests will be associated with pages local to dom0). However, persistent grants imply that a copy has already happened in the guest, and the extra copy in dom0 simply adds to the latency of serving a request and to the CPU overhead. We believe grant copy to be a better alternative.


In this post I compared tapdisk2 to tapdisk3 by showing performance results for aggregated workloads from sets of up to ten VMs. This covered a variety of block sizes over read and write sequential operations. The experiment took place on a modern and fast Intel-based server using state-of-the-art PCIe SSDs. It showed tapdisk3’s superiority in terms of design and consequently performance. For those interested in what happens under the hood, I went further and compared the different virtual data paths used in Xen Project Hypervisors with focus on XenServer and Linux guests.

This is also a good opportunity to thank and acknowledge XenServer Storage Engineer Thanos Makatos’s brilliant work and effort on tapdisk3 as well as everyone else involved in the project: Keith Petley, Simon Beaumont, Jonathan Davies, Ross Lagerwall, Malcolm Crossley, David Vrabel, Simon Rowe and Paul Durrant.


Project Karcygwins and Virtualised Storage Performance


Over the last few years we have witnessed a revolution in terms of storage solutions. Devices capable of achieving millions of Input/Output Operations per Second (IOPS) are now available off-the-shelf. At the same time, Central Processing Unit (CPU) speeds remain largely constant. This means that the overhead of processing storage requests is actually affecting the delivered throughput. In a world of virtualisation, where extra processing is required in order to securely pass requests from virtual machines (VM) to storage domains, this overhead becomes more evident.

This is the first time that such an overhead has become a concern. Until recently, the time spent within I/O devices was much longer than the time spent processing a request within CPUs. Kernel and driver developers were mainly worried about: (1) not blocking while waiting for devices to complete; and (2) sending requests optimised for specific device types. While the former was addressed by techniques such as Direct Memory Access (DMA), the latter was solved by elevator algorithms such as Completely Fair Queueing (CFQ).

Today, with the wide adoption of Solid-State Drives (SSD) and the further development of low-latency storage solutions such as those built on top of PCI Express (PCIe) and Non-Volatile Memory (NVM) technologies, the main concern lies in not wasting time when processing requests. Within the Xen Project community, some development has already started in order to allow scalable storage traffic from several VMs. Linux kernel maintainers and storage manufacturers are also working on similar issues. In the meantime, XenServer Engineering delivered Project Karcygwins, which allowed a better understanding of current bottlenecks, when they are evident and what can be done to overcome them.

Project Karcygwins

Karcygwins was originally intended as three separate projects (Karthes, Cygni and Twins). Due to their topics being closely related, they were merged. Those three projects were proposed based on subjects believed to be affecting virtualised storage throughput.

Project Karthes aimed at assessing and mitigating the cost in mapping (and unmapping) memory between domains. When a VM issues an I/O request, the storage driver domain (dom0 in XenServer) requires access to certain memory areas in the guest domain. After the request is served, these areas need to be released (or unmapped). This is also an expensive operation due to flushes required in different cache tables. Karthes was proposed to investigate the cost related to these operations, how they impacted the delivered throughput and what could be done to mitigate them.

Project Cygni aimed at allowing requests larger than 44 KiB to be passed between a guest and a storage driver domain. Until recently, Xen's blkif protocol defined a fixed array of data segments per request. This array had room for 11 segments, each corresponding to a 4 KiB memory page (hence the 44 KiB). The protocol has since been updated to support indirect I/O operations, where the segments of a request actually refer to pages containing further segments. This change allows for much larger requests at a small expense.
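The 44 KiB figure is plain segment arithmetic, and so is the number of blkif requests a larger guest request is split into. Here is a minimal sketch of that arithmetic. The indirect segment count used at the end is a hypothetical value picked for illustration and is not a figure from this post.

```python
import math

PAGE_SIZE_KIB = 4        # each segment refers to one 4 KiB memory page
DIRECT_SEGMENTS = 11     # room in the fixed array of a traditional blkif request


def max_request_kib(segments, page_size_kib=PAGE_SIZE_KIB):
    """Largest request (in KiB) that the given number of segments can describe."""
    return segments * page_size_kib


def requests_needed(io_kib, segments=DIRECT_SEGMENTS):
    """Number of blkif requests needed to carry one I/O of io_kib KiB."""
    return math.ceil(io_kib / max_request_kib(segments))


# Traditional protocol: 11 segments of 4 KiB each.
print(max_request_kib(DIRECT_SEGMENTS))   # 44

# A 1 MiB guest request has to be split into 24 traditional requests.
print(requests_needed(1024))              # 24

# Indirect I/O with an assumed (hypothetical) limit of 32 segments per request.
ASSUMED_INDIRECT_SEGMENTS = 32
print(max_request_kib(ASSUMED_INDIRECT_SEGMENTS))           # 128
print(requests_needed(1024, ASSUMED_INDIRECT_SEGMENTS))     # 8
```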

Project Twins aimed at evaluating the benefits of using two communication rings between dom0 and a VM. Currently, only one ring exists and it is used both for requests from the guests and responses from the back end. With two rings, requests and responses can be stored in their own ring. This new strategy allows for larger inflight data and better use of caching.

Due to initial findings, the main focus of Karcygwins stayed on Project Karthes. The code allowing for requests larger than 44 KiB, however, was constantly included in the measurements to address the goals proposed for Project Cygni. The idea of using split rings (Project Twins) was postponed and will be investigated at a later stage.

Visualising the Overhead

When a user installs a virtualisation platform, one of the first questions to be raised is: "what is the performance overhead?". When it comes to storage performance, a straightforward way to quantify this overhead is to measure I/O throughput on a bare metal Linux installation and repeat the measurement (on the same hardware) from a Linux VM. This can promptly be done with a generic tool like dd for a variety of block sizes. It is a simple test that does not cover concurrent workloads or greater IO depths.


Looking at the plot above we can see that, on a 7.2k RPM SATA Western Digital Blue WD5000AAKX, read requests as large as 16 KiB can reach the maximum disk throughput at just over 120 MB/s (red line). When repeating the same test from a VM (green and blue lines), however, we see that the throughput for small requests is much lower. They eventually reach the same 120 MB/s mark, but only with larger requests.

The green line represents the data path where blkback is directly plugged to the back end storage. This is the kernel module that receives requests from the VM. While this is the fastest virtualisation path in the Xen world, it lacks certain software-level features such as thin-provisioning, cloning, snapshotting and the capability of migrating guests without centralised storage.

The blue line represents the data path where requests go through tapdisk2. This is a user space application that runs in dom0 and can implement the VHD format. It also has an NBD plugin for migration of guests without centralised storage. It allows for thin-provisioning, cloning and snapshotting of Virtual Disk Images (VDI). Because requests traverse more components before reaching the disk, it is understandably slower.

Using Solid-State Drives and Fast RAID Arrays

The shape of the plot above is not the same for all types of disks, though. Modern disk setups can achieve considerably higher data rates before their throughput flattens out.


Looking at the plot above, we can see a similar test executed from dom0 on two different back end types. The red line represents the throughput obtained from a RAID0 formed by two SSDs (Intel DC S3700). The blue line represents the throughput obtained from a RAID0 formed by two SAS disks (Seagate ST). Both arrays were measured independently and are connected to the host through a PERC H700 controller. While the Seagate SAS array achieves its maximum throughput at around 370 MB/s when using 48 KiB requests, the Intel SSD array continues to speed up even with requests as large as 4 MiB. Focusing on each array separately, it is possible to compare these dom0 measurements with measurements obtained from a VM. The plot below isolates the Seagate SAS array.


Similar to what is observed on the measurements taken on a single Western Digital, the throughput measured from a VM is smaller than that of dom0 when requests are not big enough. In this case, the blkback data path (the pink line) allows the VM to reach the same throughput offered by the array (370 MB/s) with requests larger than 116 KiB. The other data paths (orange, cyan and brown lines) represent user space alternatives that reach different bottlenecks and even with large requests cannot match the throughput measured from dom0.

It is interesting to observe that some user space implementations vary considerably in terms of performance. When using qdisk as the back end along with the blkfront driver from the Linux Kernel 3.11.0 (the orange line), the throughput is higher for requests of sizes such as 256 KiB (when compared to other user space alternatives -- the blkback data path remains faster). The main difference in this particular setup is the support for persistent grants. This technique, implemented in 3.11.0, reuses memory grants and drastically reduces the number of map and unmap operations. It requires, however, an additional copy operation within the guest. The trade-off may have different implications when varying factors such as hardware architecture and workload types. More on that in the next section.


When repeating these measurements on the Intel SSD array, a new issue came to light. Because the array delivers higher throughput with no signs of abating as larger requests are issued, none of the virtualisation technologies is capable of matching the throughput measured from dom0. While this behaviour will probably differ with other workloads, this is what has been observed when using a single I/O thread with queue depth set to one. In a nutshell, 2 MiB read requests from dom0 achieve 900 MB/s while a similar measurement from one VM will only reach 300 MB/s when using user space back ends. This is a pathological example chosen for this particular hardware architecture to show how bad things can get.

Understanding the Overhead

In order to understand why the overhead is so evident in some cases, it is necessary to take a step back. The measurements taken on slower disks show that all virtualisation technologies are somewhat slower than what is observed in dom0. On such disks, this difference disappears as requests grow in size. What happens at that point is that the actual disk becomes "maxed out" and cannot respond faster no matter the request size. At the same time, much of the work done at the virtualisation layers does not get slower in proportion to the amount of data associated with requests. For example, interrupts between domains are unlikely to take longer simply because requests are bigger. This is exactly why there is no visible overhead with large enough requests on certain disks.
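The argument above can be illustrated with a toy model in which every request pays a fixed cost plus a transfer time proportional to its size. This is only a sketch: the per-request costs below are hypothetical numbers chosen for illustration, not measurements, and the model deliberately ignores any work that does grow with request size (such as mapping more pages).

```python
def throughput_mb_s(block_kib, fixed_us, device_mb_s):
    """Throughput (MB/s) with a single request in flight at a time.

    Each request costs a fixed time plus a transfer time that grows
    linearly with the request size.
    """
    size_bytes = block_kib * 1024
    transfer_s = size_bytes / (device_mb_s * 1e6)
    total_s = fixed_us * 1e-6 + transfer_s
    return size_bytes / total_s / 1e6


# Hypothetical values, for illustration only.
BARE_FIXED_US = 50     # assumed fixed cost per request on bare metal
VIRT_EXTRA_US = 100    # assumed extra fixed cost added by the virtualisation layers
DEVICE_MB_S = 120      # roughly the limit of the SATA disk discussed earlier


def vm_to_bare_ratio(block_kib):
    """Fraction of the bare metal throughput that the modelled VM achieves."""
    bare = throughput_mb_s(block_kib, BARE_FIXED_US, DEVICE_MB_S)
    vm = throughput_mb_s(block_kib, BARE_FIXED_US + VIRT_EXTRA_US, DEVICE_MB_S)
    return vm / bare


for kib in (4, 64, 1024):
    print(f"{kib:5d} KiB: VM reaches {vm_to_bare_ratio(kib):.0%} of bare metal")
```

In this model the gap is large for small requests and all but disappears for large ones, because the fixed cost is amortised over a longer transfer.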

However, the question remains: what is consuming CPU time and causing such a visible overhead on the example previously presented? There are mainly two techniques that can be used to answer that question: profiling and tracing. Profiling allows instruction pointer samples to be collected at every so many events. The analysis of millions of such samples reveals code in hot paths where time is being spent. Tracing, on the other hand, measures the exact time passed between two events.

For this particular analysis, the tracing technique and the blkback data path have been chosen. To measure the amount of time spent between events, the code was modified and several RDTSC instructions were inserted. These instructions read the Time Stamp Counters (TSC) and are relatively cheap while providing very accurate data. On modern hardware, TSCs are constant and consistent across the cores of a host. This means that measurements from different domains (i.e. dom0 and guests) can be matched to obtain, for example, the time it takes for a kick from blkfront to reach blkback. The diagram below shows where trace points have been inserted.


In order to gather meaningful results, 100 requests were issued in succession. Domains were pinned to the same NUMA node in the host and turbo capabilities were disabled. The TSC readings were collected for each request and analysed both individually and as an average. The individual analysis revealed interesting findings such as a "warm up" period where the first requests are always slower. This was attributed to caching and scheduling effects. It also showed that some requests were randomly faster than others in certain parts of the path. This was attributed to CPU affinity. For the average analysis, the 20 fastest and slowest requests were initially discarded. This produced more stable and reproducible results. The plots below show these results.
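The averaging step described above can be sketched as a trimmed mean over per-request TSC deltas. This is a minimal sketch and not the actual analysis code: it assumes that 20 requests are dropped from each end (the text could also be read as 20 in total), and the function names and the TSC frequency in the usage example are hypothetical.

```python
def trimmed_mean(samples, discard=20):
    """Mean of the samples after dropping the `discard` smallest and
    the `discard` largest values (assumption: 20 from each end)."""
    if len(samples) <= 2 * discard:
        raise ValueError("not enough samples to discard from both ends")
    kept = sorted(samples)[discard:len(samples) - discard]
    return sum(kept) / len(kept)


def cycles_to_us(cycles, tsc_hz):
    """Convert a TSC delta (in cycles) to microseconds for a constant-rate TSC."""
    return cycles / tsc_hz * 1e6


# Synthetic example: 100 TSC deltas with a slow "warm up" at the start.
deltas = [50_000] * 5 + [10_000 + i for i in range(95)]
mean_cycles = trimmed_mean(deltas)
print(mean_cycles)
print(cycles_to_us(mean_cycles, tsc_hz=3e9))   # assuming a 3 GHz TSC
```

Sorting before trimming means the slow warm-up requests are discarded no matter where they occur in the sequence.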



Without persistent grants, the cost of mapping and unmapping memory across domains is clearly a significant factor as requests grow in size. With persistent grants, the extra copy on the front end adds up and results in a slower overall path. Roger Pau Monne, however, showed that persistent grants can improve aggregate throughput from multiple VMs as they reduce contention on the grant tables. Matt Wilson, following on from discussions at the Xen Developer Summit 2013, produced patches that should also help reduce grant table contention.

Conclusions and Next Steps

In summary, Project Karcygwins led to a better understanding of several key elements in storage performance for both Xen and XenServer:

  • The time spent in processing requests (in CPU) definitely matters as disks get faster
  • Throughput is visibly affected for single-threaded I/O on low-latency storage
  • Kernel-only data paths can be significantly faster
  • The cost of mapping (and unmapping) grants is the most significant bottleneck at this time

It also drew the attention of the Linux and Xen Project communities to these issues, as the results were shared over a series of presentations and discussions.

Next, new research projects are scheduled (or already underway) to:

  • Look into new ideas for low-latency virtualised storage
  • Investigate bottlenecks and alternatives for aggregate workloads
  • Reduce the overall CPU utilisation of processing requests in user space

Have a happy 2014 and thanks for reading!


VM Density and Project Pulsar

While overcoming the architectural obstacles in order to allow XenServer 6.2.0 to run up to 500 HVM guests per host, we came across an interesting observation: even for apparently idle Windows 7 VMs, there is some amount of work done in dom0. With careful analysis and optimisations in the way we emulate virtual hardware for such VMs, we managed to eliminate most of this work for guests that are idle. This allows us to effectively run 500 Windows VMs and keep dom0 load average practically at zero (after the VMs have finished booting).

The busy bees

We started by observing the CPU utilisation of the QEMU process that is responsible for the hardware emulation of a Windows 7 guest. Even when the guest was just waiting for login and not running disk indexing services, Windows Update or other known I/O operations, the QEMU process would consistently consume between 1% and 3% of CPU time in dom0 (depending mostly on the host's characteristics). In this scenario, there is one QEMU process for each HVM guest running on a host.

When we succeeded at overcoming the hard limits that prevented us from starting 500 guests (see this post), we soon realised that the small amount of work done by each QEMU would quickly add up. With 3% of dom0 vCPU consumption per running Windows 7 VM, it takes just over 130 VMs to completely max out four dom0 vCPUs. At that point, dom0's load average quickly increases and the host's performance becomes compromised.
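The "just over 130 VMs" figure follows from simple arithmetic, sketched below. The function name is made up for this example. The 3% and 1% values are the two ends of the per-QEMU CPU range quoted above.

```python
import math


def vms_to_saturate(dom0_vcpus, cpu_percent_per_vm):
    """Number of VMs whose QEMU processes together consume all dom0 vCPUs.

    cpu_percent_per_vm is the share of one vCPU used by each QEMU process.
    """
    return math.ceil(dom0_vcpus * 100 / cpu_percent_per_vm)


print(vms_to_saturate(4, 3))   # 134
print(vms_to_saturate(4, 1))   # 400
```

Even at the low end of the range, 500 idle guests would need more CPU time than four dom0 vCPUs can provide.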

Without available CPU time, dom0 is unable to promptly serve network and storage I/O requests. Other control plane operations are also affected since the toolstack (e.g. xapi, xenopsd, xenstore) will struggle to run and respond to requests from XenCenter or other clients.

Project Pulsar

In order to investigate why those QEMUs were "spinning" so much, we created Project Pulsar (named after neutron stars that spin considerably). The project conducted a detailed analysis of every event that disturbed QEMU from its otherwise sleeping state.

With careful debugging, we learned that there are two categories of events that cause QEMU to wake up. Firstly, there are internal timers that occur several times a second to poll for certain events (e.g. buffered I/O). Secondly, there are actual interrupts generated by guests to check on certain virtual hardware (e.g. USB and the parallel port).

The following table shows the number of events disturbing QEMU during the lifetime of a Windows 7 VM running for 5 minutes. The events are grouped in 30-second intervals for simplicity. This makes it easy to see the boot period (within the first 30-second interval), the time that the VM is supposedly idle and the shutdown phase at the end. We separated the columns into Timer, Read and Write Events. Timer events are internal to QEMU and Read or Write events come from the VM as interrupts. After the PV drivers are loaded (which happens during the boot), storage and network I/O are not handled by QEMU and are therefore no longer counted.

30 Secs Interval    Timer Events    Read Events    Write Events
000                 2,130           160,191        59,769
001                 3,762           17,831         1,731
002                 3,603           10,420         1,662
003                 3,458           2,964          1,587
004                 3,451           2,970          1,591
005                 3,454           3,038          1,650
006                 3,461           3,078          1,683
007                 3,457           2,964          1,587
008                 3,429           2,952          1,581
009                 3,446           3,034          1,657
010                 594             347            280

While certain events only happen during the guest initialisation or shortly after the boot completed (e.g. parallel port scans), others remain constant throughout the life of the VM (e.g. USB protocols).
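To put the table in perspective, the idle-period counts can be turned into a wake-up rate. This is a minimal sketch using the numbers from the table above. Treating intervals 003 to 009 as the idle period is an assumption made here, based on the counts having settled by then.

```python
INTERVAL_SECONDS = 30

# (timer, read, write) events for intervals 003 to 009 of the table above.
IDLE_INTERVALS = [
    (3458, 2964, 1587),
    (3451, 2970, 1591),
    (3454, 3038, 1650),
    (3461, 3078, 1683),
    (3457, 2964, 1587),
    (3429, 2952, 1581),
    (3446, 3034, 1657),
]


def wakeups_per_second(intervals, interval_seconds=INTERVAL_SECONDS):
    """Average number of events per second across the given intervals."""
    total_events = sum(sum(row) for row in intervals)
    return total_events / (len(intervals) * interval_seconds)


print(round(wakeups_per_second(IDLE_INTERVALS)))   # 269
```

That is roughly 269 events per second disturbing the QEMU process of a single guest that is doing nothing.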

Idling down dom0

To address the first category of events that disturb QEMU (e.g. internal timers), we first studied why they were ever required. As it turned out, newer versions of QEMU were already patched to disable some of these. A good example is the case of buffered I/O which required polling. With the creation of a dedicated event channel for buffered I/O interrupts (see this patch) and a corresponding change to the device model (see this patch), QEMU no longer needs to poll.

The other timers we identified are necessary only when certain features are enabled. These are the QEMU monitor (that allows debugging tools to be hooked up to QEMU processes), serial port emulation and VNC connections. Considering XenServer does not support QEMU debugging via its monitor feature nor direct serial connections to HVM guests, these could be safely disabled. The last timer would only be active when a VNC connection is established to a guest (e.g. via XenCenter).

The second category of events happens due to the very nature of the hardware that is being emulated. If we present a Windows VM with an IDE DVD-ROM drive, the guest OS handles this (virtual) drive as a real drive. For example, it will poll the drive every so often to check whether the media has changed. Similar types of interrupts will be initiated by the guest to communicate with other emulated hardware.

In order to address these, we modified Xapi to allow the emulation of USB, parallel and serial ports to be turned off. Parallel port emulation fits in the same category as the serial port (i.e. there is no supported way to plug virtual devices into these ports in a guest), so the emulation of both can be turned off safely and without visible side effects.

Disabling USB, however, may have side effects. When using VNC to connect to a guest's console, a USB tablet device driver is used to allow for absolute coordinates of the mouse on the screen. When this USB driver is not used, VNC falls back to a PS/2 emulation which can only provide relative mouse positioning. The side effect is that, without USB, the mouse pointer of the VNC client will very likely be misaligned with the mouse pointer in the guest. This makes the console very hard to use.

The good news is that the Windows Remote Desktop Protocol (RDP) does not rely on the USB tablet driver. If the guest is configured to allow RDP connections, the USB emulation can be disabled without this side effect. When available, XenCenter already prefers RDP over VNC connections to Windows VMs by default.

Configuring the toolstack

The recommendations of Project Pulsar were adopted as defaults wherever possible and have been incorporated in XenServer 6.2.0. These included changes not visible to the VM (such as QEMU internal timers). However, we decided not to change the virtual hardware presented to VMs unless this is explicitly configured.

In order to configure Xapi to disable the Serial Port emulation, use the following command:

xe vm-param-set uuid=<vm-uuid> platform:hvm_serial=none

Similarly, the Parallel Port emulation can be disabled as follows:

xe vm-param-set uuid=<vm-uuid> platform:parallel=none

Finally, the USB emulation can be disabled as follows:

xe vm-param-set uuid=<vm-uuid> platform:usb=false
xe vm-param-set uuid=<vm-uuid> platform:usb_tablet=false

Note that two commands are necessary to completely disable the USB emulation. These disable both the virtual USB hub and the virtual tablet device used for positioning the mouse.
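For hosts with many VMs, the commands above can be generated per VM. The following is a sketch only. It assumes it runs in dom0 with `xe` on the PATH. The function names are made up for this example, and the `vm-list` filters are an assumption about which VMs should be touched. The platform keys and values are exactly the ones shown above.

```python
import subprocess

# Platform keys and values taken from the commands shown above.
PLATFORM_SETTINGS = [
    ("hvm_serial", "none"),
    ("parallel", "none"),
    ("usb", "false"),
    ("usb_tablet", "false"),
]


def build_commands(vm_uuid, settings=PLATFORM_SETTINGS):
    """Return one `xe vm-param-set` argument list per platform setting."""
    return [
        ["xe", "vm-param-set", f"uuid={vm_uuid}", f"platform:{key}={value}"]
        for key, value in settings
    ]


def list_vm_uuids():
    """Return the UUIDs of guest VMs (assumes `xe` is available, i.e. dom0)."""
    result = subprocess.run(
        ["xe", "vm-list", "is-control-domain=false", "is-a-template=false",
         "is-a-snapshot=false", "params=uuid", "--minimal"],
        check=True, capture_output=True, text=True,
    )
    return [uuid for uuid in result.stdout.strip().split(",") if uuid]


def apply_to_all(dry_run=True):
    """Print (or, with dry_run=False, execute) the commands for every guest VM."""
    for vm_uuid in list_vm_uuids():
        for command in build_commands(vm_uuid):
            if dry_run:
                print(" ".join(command))
            else:
                subprocess.run(command, check=True)


# Show what would be run for a single VM, using a placeholder UUID.
for command in build_commands("<vm-uuid>"):
    print(" ".join(command))
```

Calling `apply_to_all()` with its default `dry_run=True` only prints the commands, which makes it easy to review them first. Keep in mind the USB side effect described earlier before disabling it for VMs that are accessed over VNC.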

The best way to disable the emulation of the DVD-ROM drive is to delete the associated VBD. For information on how to do that, refer to Section A.4.23.3 of the XenServer 6.2.0 Administrator's Guide.


The following table shows the number of events disturbing QEMU during the lifetime of a Windows 7 VM running for 5 minutes. This is the same VM used for the numbers in the table above, except that this time we incorporated the timer patches in QEMU and disabled the USB, DVD-ROM, monitor, parallel port and serial port emulation.

30 Secs Interval    Timer Events    Read Events    Write Events
000                 3               146,114        57,254
001                 0               0              1
002                 0               0              0
003                 0               0              0
004                 0               0              1
005                 0               0              0
006                 0               0              0
007                 0               0              0
008                 0               0              0
009                 0               8              13
010                 0               58             108

With these figures, we are able to start 500 Windows 7 VMs on one host and keep dom0 load average practically at zero (after the VMs have booted). 

