Virtualization Blog

Discussions and observations on virtualization.

XenServer 7.0 performance improvements part 3: Parallelised plug and unplug VBD operations in xenopsd

The XenServer team has made a number of significant performance and scalability improvements in the XenServer 7.0 release. This is the third in a series of articles that will describe the principal improvements. For the first two, see here:

  1. http://xenserver.org/blog/entry/dundee-tapdisk3-polling.html
  2. http://xenserver.org/blog/entry/dundee-networking-multi-queue.html

The topic of this post is control plane performance. XenServer 7.0 achieves significant performance improvements through support for parallel VBD operations in xenopsd. With the improvements, xenopsd is able to plug and unplug many VBDs (virtual block devices) at the same time, substantially reducing the duration of VM lifecycle operations (start, migrate, shutdown) for VMs with many VBDs, and making it practical to operate VMs with up to 255 VBDs.

Background of the VM lifecycle operations

In XenServer, xenopsd is the dom0 component responsible for VM lifecycle operations:

  • during a VM start, xenopsd creates the VM container and then plugs the VBDs before starting the VCPUs;
  • during a VM shutdown, xenopsd stops the VCPUs and then unplugs the VBDs before destroying the VM container;
  • during a VM migrate, xenopsd creates a new VM container, unplugs the VBDs of the old VM container, and plugs the VBDs for the new VM before starting its VCPUs; while the VBDs are being unplugged and plugged on the other VM container, the user experiences VM downtime, during which the VM is unresponsive, because both the old and new VM containers are paused.

Measurements have shown that a large part, and usually most, of the duration of these VM lifecycle operations is spent plugging and unplugging the VBDs, especially on slow or contended storage backends.

[Figure: VBD plug operations executed sequentially (vbd-plugs-sequential.png)]

Why does xenopsd take some time to plug and unplug the VBDs?

The completion of a xenopsd VBD plug operation involves the execution of two storage layer operations, VDI attach and VDI activate (where VDI stands for virtual disk image). These VDI operations include control plane manipulation of daemons, block devices and disk metadata in dom0, which will take different amounts of time to execute depending on the type of the underlying Storage Repository (SR, such as LVM, NFS or iSCSI) used to hold the VDIs, and on the current load on the storage backend disks and their type (SSDs or HDDs). Similarly, the completion of a xenopsd VBD unplug operation involves the execution of two storage layer operations, VDI deactivate and VDI detach, with the corresponding overhead of manipulating the control plane of the storage layer.

If the underlying physical disks are under high load, there may be contention preventing progress of the storage layer operations, and therefore xenopsd may need to wait many seconds before the requests to plug and unplug the VBDs can be served.

Originally, xenopsd would execute these VBD operations sequentially, and the total time to finish all of them for a single VM would depend on the number of VBDs in the VM. Essentially, it would be the sum of the times to operate each of the VBDs of this VM, which could result in a wait of several minutes for a lifecycle operation on a VM that had, for instance, 255 VBDs.

What are the advantages of parallel VBD operations?

Plugging and unplugging the VBDs in parallel in xenopsd:

  • provides a total duration for the VM lifecycle operations that is independent of the number of VBDs in the VM. This duration will typically be the duration of the longest individual VBD operation amongst the parallel VBD operations for that VM;
  • provides a significant, immediately noticeable improvement for the user in any operation involving more than one VBD per VM. The more devices involved, the larger the improvement, up to the saturation of the underlying storage layer;
  • applies immediately across the VM start, VM shutdown and VM migrate lifecycle operations.

[Figure: VBD plug operations executed in parallel (vbd-plugs-parallel.png)]

Are there any disadvantages or limitations?

Plugging and unplugging VBDs uses dom0 memory. The main disadvantage of doing these in parallel is that dom0 needs more memory to handle all the parallel operations. To prevent situations where a large number of such operations would cause dom0 to run out of memory, we have added two limits:

  • the maximum number of global parallel operations that xenopsd can request is the same as the number of xenopsd worker-pool threads, as defined by worker-pool-size in /etc/xenopsd.conf. This prevents a regression in maximum dom0 memory usage compared to when xenopsd executed the VBD operations of each VM sequentially. Increasing this value increases the number of parallel VBD operations, at the expense of roughly 15MB of extra dom0 memory for each extra parallel VBD operation.
  • the maximum number of per-VM parallel operations that xenopsd can request is currently fixed at 10, which covers a wide range of VMs and still provides a 10x improvement in lifecycle operation times for VMs that have more than 10 VBDs (a shell sketch of this kind of bounded parallelism follows the list).
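To make the bounded parallelism concrete, here is a minimal sketch using the xe CLI from dom0. It is purely illustrative: xenopsd implements this internally and does not shell out to xe; the 10-way cap simply mirrors the per-VM limit described above.

#!/bin/bash
# Illustrative only: plug a VM's detached VBDs in parallel, at most 10 at a time,
# mimicking xenopsd's per-VM cap. This is NOT how xenopsd itself is implemented.
VM_UUID="$1"        # UUID of the VM whose VBDs should be plugged
MAX_PARALLEL=10     # per-VM limit discussed above

# List the VBDs that are not currently attached and plug them with bounded parallelism.
xe vbd-list vm-uuid="${VM_UUID}" currently-attached=false params=uuid --minimal |
    tr ',' '\n' | grep -v '^$' |
    xargs -r -P "${MAX_PARALLEL}" -I {} xe vbd-plug uuid={}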

Where do I find the changes?

The changes that implemented this feature are available on GitHub at https://github.com/xapi-project/xenopsd/pull/250

What sort of theoretical improvements should I expect in XenServer 7.0?

The exact numbers depend on the SR type, the storage backend load characteristics and the limits specified in the previous section. Given those limits, the duration of the VBD plugs for a single VM will follow the pattern in the following table:

Number n of VBDs/VM    Improvement of VBD operations
n <= 10                n times faster
n > 10                 10 times faster

The table above assumes that the maximum number of global parallel operations discussed in the previous section is not reached. If you want to guarantee the improvement in the table above for x > 1 simultaneous VM lifecycle operations, at the expense of using more dom0 memory in the worst case, you will probably want to set worker-pool-size = (n * x) in /etc/xenopsd.conf, where n is a number reflecting the average number of VBDs/VM amongst all VMs, up to a maximum of n = 10. A worked example follows.
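For instance, assuming a pool whose VMs average 8 VBDs each and where up to 4 VM lifecycle operations may run at the same time (both figures are hypothetical), the setting would be:

# /etc/xenopsd.conf -- illustrative values only
# n = min(average VBDs per VM, 10) = 8, x = 4 simultaneous lifecycle operations
# worker-pool-size = n * x = 8 * 4
worker-pool-size = 32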

What sort of practical improvements should I expect in XenServer 7.0?

The VBD plug and unplug operations are only part of the overall work needed to execute a VM lifecycle operation. The remaining parts, such as creation of the VM container and the VIF plugs, dilute the VBD improvements of the previous section, though the gains are still significant. Some examples of improvements, using an EXT SR on a local SSD storage backend:

VM lifecycle operation                 Improvement with 8 VBDs/VM
Toolstack time to start a single VM    [Figure: vmstart-8vbds-1vm.png]
Toolstack time to bootstorm 125 VMs    [Figure: bootstorm-8vbds-125vms.png]

The approximately 2s improvement in single VM start time was caused by plugging the 8 VBDs in parallel. As we see in the second row of the table, this can be a significant advantage in a bootstorm.

In XenServer 7.0, not only does xenopsd execute VBD operations in parallel, but the storage layer operations on VDIs have also become faster, so you may observe further VM lifecycle time improvements in your XenServer 7.0 environment, beyond those expected from parallel VBD operations alone, compared to XenServer 6.5 SP1.

 


XenServer 7.0 performance improvements part 1: Lower latency storage datapath

The XenServer team made a number of significant performance and scalability improvements in the XenServer 7.0 release. This is the first in a series of articles that will describe the principal improvements.

Our first topic is storage I/O performance. A performance improvement has been achieved through the adoption of a polling technique in tapdisk3, the component of XenServer responsible for handling I/O on virtual storage devices. Measurements of one particular use-case demonstrate a 50% increase in performance from 15,000 IOPS to 22,500 IOPS.

What is polling?

Normally, tapdisk3 operates in an event-driven manner. Here is a summary of the first few steps required when a VM wants to do some storage I/O:

  1. The VM's paravirtualised storage driver (called blkfront in Linux or xenvbd in Windows) puts a request in the ring it shares with dom0.
  2. It sends tapdisk3 a notification via an event-channel.
  3. This notification is delivered to domain 0 by Xen as an interrupt. If Domain 0 is not running, it will need to be scheduled in order to receive the interrupt.
  4. When it receives the interrupt, the domain 0 kernel schedules the corresponding backend process to run, tapdisk3.
  5. When tapdisk3 runs, it looks at the contents of the shared-memory ring.
  6. Finally, tapdisk3 finds the request which can then be transformed into a physical I/O request.

Polling is an alternative to this approach in which tapdisk3 repeatedly looks in the ring, speculatively checking for new requests. This means that steps 2–4 can be skipped: there's no need to wait for an event-channel interrupt, nor to wait for the tapdisk3 process to be scheduled: it's already running. This enables tapdisk3 to pick up the request much more promptly as it avoids these delays inherent to the event-driven approach.

The following diagram contrasts the timelines of these alternative approaches, showing how polling reduces the time until the request is picked up by the backend.

[Figure: timelines of the event-driven and polling approaches (polling-explained.png)]

How does polling help improve storage I/O performance?

Polling is an established technique for reducing latency in event-driven systems. (One example of where it is used elsewhere to mitigate interrupt latency is in Linux networking drivers that use NAPI.)

Servicing I/O requests promptly is an essential part of optimising I/O performance. As I discussed in my talk at the 2015 Xen Project Developer Summit, reducing latency is the key to maintaining a low virtualisation overhead. As physical I/O devices get faster and faster, any latency incurred in the virtualisation layer becomes increasingly noticeable and translates into lower throughputs.

An I/O request from a VM has a long journey to physical storage and back again. Polling in tapdisk3 optimises one section of that journey.

Isn't polling really CPU intensive, and thus harmful?

Yes it is, so we need to handle it carefully. If left unchecked, polling could easily eat huge quantities of domain 0 CPU time, starving other processes and causing overall system performance to drop.

We have chosen to do two things to avoid consuming too much CPU time:

  1. Poll the ring only when there's a good chance of a request appearing. Of course, guest behaviour is totally unpredictable in general, but there are some principles that can increase our chances of polling at the right time. For example, one assumption we adopt is that it's worth polling for a short time after the guest issues an I/O request. It has issued one request, so there's a good chance that it will issue another soon after. And if this guess turns out to be correct, keep on polling for a bit longer in case any more turn up. If there are none for a while, stop polling and temporarily fall back to the event-based approach.
  2. Don't poll if domain 0 is already very busy. Since polling is expensive in terms of CPU cycles, we only enter the polling loop if we are sure that it won't starve other processes of CPU time they may need.

How much faster does it go?

The benefit you will get from polling depends primarily on the latency of your physical storage device. If you are using an old mechanical hard-drive or an NFS share on a server on the other side of the planet, shaving a few microseconds off the journey through the virtualisation layer isn't going to make much of a difference. But on modern devices and low-latency network-based storage, polling can make a sizeable difference. This is especially true for smaller request sizes since these are most latency-sensitive.

For example, the following graph shows an improvement of 50% in single-threaded sequential read I/O for small request sizes – from 15,000 IOPS to 22,500 IOPS. These measurements were made with iometer in a 32-bit Windows 7 SP1 VM on a Dell PowerEdge R730xd with an Intel P3700 NVMe drive.
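The published figures were measured with iometer inside a Windows guest, so exact replication needs that setup; as a rough starting point, a similar single-threaded, small-block sequential read pattern can be generated inside a Linux guest with fio (the device path below is a placeholder for the VDI under test):

# Approximate analogue only: 4 KiB sequential reads, O_DIRECT, one job, queue depth 1.
fio --name=seqread-4k --filename=/dev/xvdb --rw=read --bs=4k \
    --direct=1 --ioengine=libaio --iodepth=1 --numjobs=1 \
    --runtime=60 --time_based --readonly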

[Figure: single-threaded sequential read IOPS for small request sizes, with and without polling (5071.png)]

How was polling implemented?

The code to add polling to tapdisk3 can be found in the following set of commits: https://github.com/xapi-project/blktap/pull/179/commits.


Implementing VDI-per-LUN storage

With storage providers adding better functionality to provide features like QoS, fast snapshot & clone, and with the advent of storage-as-a-service, we are interested in the ability to utilize these features from XenServer. VMware’s VVols offering already allows integration of vendor provided storage features into their hypervisor. Since most storage allows operations at the granularity of a LUN, the idea is to have a one-to-one mapping between a LUN on the backend and a virtual disk (VDI) on the hypervisor. In this post we are going to talk about the supplemental pack that we have developed in order to enable VDI-per-LUN.

XenServer Storage

To understand the supplemental pack, it is useful to first review how XenServer storage works. In XenServer, a storage repository (SR) is a top-level entity which acts as a pool for storing VDIs which appear to the VMs as virtual disks. XenServer provides different types of SRs (File, NFS, Local, iSCSI). In this post we will be looking at iSCSI based SRs as iSCSI is the most popular protocol for remote storage and the supplemental pack we developed is targeted towards iSCSI based SRs. An iSCSI SR uses LVM to store VDIs over logical volumes (hence the type is lvmoiscsi). For instance:

[root@coe-hq-xen08 ~]# xe sr-list type=lvmoiscsi
uuid ( RO)                : c67132ec-0b1f-3a69-0305-6450bfccd790
          name-label ( RW): syed-sr
    name-description ( RW): iSCSI SR [172.31.255.200 (iqn.2001-05.com.equallogic:0-8a0906-c24f8b402-b600000036456e84-syed-iscsi-opt-test; LUN 0: 6090A028408B4FC2846E4536000000B6: 10 GB (EQLOGIC))]
                host ( RO): coe-hq-xen08
                type ( RO): lvmoiscsi
        content-type ( RO):

The above SR is created from a LUN on a Dell EqualLogic. The VDIs belonging to this SR can be listed by:

[root@coe-hq-xen08 ~]# xe vdi-list sr-uuid=c67132ec-0b1f-3a69-0305-6450bfccd790 params=uuid
uuid ( RO)    : ef5633d2-2ad0-4996-8635-2fc10e05de9a

uuid ( RO)    : b7d0973f-3983-486f-8bc0-7e0b6317bfc4

uuid ( RO)    : bee039ed-c7d1-4971-8165-913946130d11

uuid ( RO)    : efd5285a-3788-4226-9c6a-0192ff2c1c5e

uuid ( RO)    : 568634f9-5784-4e6c-85d9-f747ceeada23

[root@coe-hq-xen08 ~]#

This SR has 5 VDIs. From LVM’s perspective, an SR is a volume group (VG) and each VDI is a logical volume (LV) inside that volume group. This can be seen via the following commands:

[root@coe-hq-xen08 ~]# vgs | grep c67132ec-0b1f-3a69-0305-6450bfccd790
  VG_XenStorage-c67132ec-0b1f-3a69-0305-6450bfccd790   1   6   0 wz--n-   9.99G 5.03G
[root@coe-hq-xen08 ~]# lvs VG_XenStorage-c67132ec-0b1f-3a69-0305-6450bfccd790
  LV                                       VG                                                 Attr   LSize 
  MGT                                      VG_XenStorage-c67132ec-0b1f-3a69-0305-6450bfccd790 -wi-a-   4.00M                                 
  VHD-568634f9-5784-4e6c-85d9-f747ceeada23 VG_XenStorage-c67132ec-0b1f-3a69-0305-6450bfccd790 -wi-ao   8.00M                               
  VHD-b7d0973f-3983-486f-8bc0-7e0b6317bfc4 VG_XenStorage-c67132ec-0b1f-3a69-0305-6450bfccd790 -wi-ao   2.45G                               
  VHD-bee039ed-c7d1-4971-8165-913946130d11 VG_XenStorage-c67132ec-0b1f-3a69-0305-6450bfccd790 -wi---   8.00M                                
  VHD-ef5633d2-2ad0-4996-8635-2fc10e05de9a VG_XenStorage-c67132ec-0b1f-3a69-0305-6450bfccd790 -ri-ao   2.45G
  VHD-efd5285a-3788-4226-9c6a-0192ff2c1c5e VG_XenStorage-c67132ec-0b1f-3a69-0305-6450bfccd790 -ri-ao  36.00M

Here c67132ec-0b1f-3a69-0305-6450bfccd790 is the UUID of the SR. Each VDI is represented by a corresponding LV whose name has the form VHD-<VDI UUID>. Some of the LVs have a small size of 8MB; these are snapshots taken on XenServer. There is also an LV named MGT which holds metadata about the SR and the VDIs present in it. Note that all of this is present in an SR, which is a LUN on the backend storage.
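Given that naming convention, the LV backing a particular VDI can be located directly; the following commands simply reuse the SR and one of the VDI UUIDs listed above as an example:

# Look up the logical volume that backs a given VDI (UUIDs taken from the listings above).
SR_UUID=c67132ec-0b1f-3a69-0305-6450bfccd790
VDI_UUID=b7d0973f-3983-486f-8bc0-7e0b6317bfc4
lvs "VG_XenStorage-${SR_UUID}/VHD-${VDI_UUID}"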

Now XenServer can attach a LUN at the level of an SR but we want to map a LUN to a single VDI. In order to do that, we restrict an SR to contain a single VDI. Our new SR has the following LVs:

[root@coe-hq-xen09 ~]# lvs VG_XenStorage-1fe527a4-7e96-cdd9-f347-a15c240f26e9
LV                                       VG                                                 Attr   LSize
MGT                                      VG_XenStorage-1fe527a4-7e96-cdd9-f347-a15c240f26e9 -wi-a- 4.00M
VHD-09b14a1b-9c0a-489e-979c-fd61606375de VG_XenStorage-1fe527a4-7e96-cdd9-f347-a15c240f26e9 -wi--- 8.02G
[root@coe-hq-xen09 ~]#

[Figure: one-to-one mapping between a backend LUN and a VDI (vdi-lun.png)]

If a snapshot or clone of the LUN is taken on the backend, all the unique identifiers associated with the different entities in the LUN also get cloned and any attempt to attach the LUN back to XenServer will result in an error because of conflicts of unique IDs.

Resignature and supplemental pack

In order for the cloned LUN to be re-attached, we need to resignature the unique IDs present in the LUN. The following IDs need to be resignatured:

  • LVM UUIDs (PV, VG, LV)
  • VDI UUID
  • SR metadata in the MGT Logical volume

We at CloudOps have developed an open-source supplemental pack which solves the resignature problem. You can find it here. The supplemental pack adds a new type of SR (relvmoiscsi) and you can use it to resignature your lvmoiscsi SRs. After installing the supplemental pack, you can resignature a clone using the following command:

[root@coe-hq-xen08 ~]# xe sr-create name-label=syed-single-clone type=relvmoiscsi \
  device-config:target=172.31.255.200 \
  device-config:targetIQN=$IQN \
  device-config:SCSIid=$SCSIid \
  device-config:resign=true \
  shared=true
Error code: SR_BACKEND_FAILURE_1
Error parameters: , Error reporting error, unknown key The SR has been successfully resigned. Use the lvmoiscsi type to attach it,
[root@coe-hq-xen08 ~]#

Here, instead of creating a new SR, the supplemental pack re-signatures the provided LUN and detaches it (the error is expected as we don’t actually create an SR). You can see from the error message that the SR has been re-signed successfully. Now the cloned SR can be introduced back to XenServer without any conflicts using the following commands:

[root@coe-hq-xen09 ~]# xe sr-probe type=lvmoiscsi device-config:target=172.31.255.200 device-config:targetIQN=$IQN device-config:SCSIid=$SCSIid

   		 5f616adb-6a53-7fa2-8181-429f95bff0e7
   		 /dev/disk/by-id/scsi-36090a028408b3feba66af52e0000a0e6
   		 5364514816

[root@coe-hq-xen09 ~]# xe sr-introduce name-label=vdi-test-resign type=lvmoiscsi \
  uuid=5f616adb-6a53-7fa2-8181-429f95bff0e7
5f616adb-6a53-7fa2-8181-429f95bff0e7
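The post stops at sr-introduce; note that an introduced SR has no PBDs yet, so before it can be used each host still needs a PBD created and plugged. The standard procedure is sketched below, reusing the same device-config values as the probe above; treat it as a hedged outline rather than an exact recipe:

# Assumes a single-host pool; repeat per host otherwise.
HOST_UUID=$(xe host-list --minimal)
PBD_UUID=$(xe pbd-create sr-uuid=5f616adb-6a53-7fa2-8181-429f95bff0e7 \
           host-uuid=${HOST_UUID} device-config:target=172.31.255.200 \
           device-config:targetIQN=$IQN device-config:SCSIid=$SCSIid)
xe pbd-plug uuid=${PBD_UUID}
xe sr-scan uuid=5f616adb-6a53-7fa2-8181-429f95bff0e7   # rediscover the VDIs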

This supplemental pack can be used in conjunction with an external orchestrator like CloudStack or OpenStack which can manage both the storage and compute. Working with SolidFire we have implemented this functionality, available in the next release of Apache CloudStack. You can check out a preview of this feature in a screencast here.


XenServer Dundee Beta 3 Available

I am pleased to announce that Dundee beta 3 has been released, and for those of you who monitor Citrix Tech Previews, beta 3 corresponds to the core XenServer platform used for Dundee TP3. This third beta marks a major development milestone representing the proverbial "feature complete" stage. Normally when announcing pre-release builds, I highlight major functional advances but this time I need to start with a feature which was removed.

Thin Provisioned Block Storage Removed

While it's never great to start with a negative, I felt anything related to the removal of a storage option takes priority over new and shiny. I'm going to keep this section short, and also highlight that only the new thin provisioned block storage feature was removed; existing thin provisioned NFS and file based storage repositories will function as they always have.

What should I do before upgrading to beta 3?

While we don't actively encourage upgrades to pre-release software, we do recognize you're likely to do it at least once. If you have built out infrastructure using thin provisioned iSCSI or HBA storage using a previous pre-release of Dundee, please ensure you migrate any critical VMs to either local storage, NFS or thick provisioned block storage prior to performing an upgrade to beta 3.

So what happened?

As is occasionally the case with pre-release software, not all features which are previewed will make it to the final release, for any of a variety of reasons. That is of course one reason we provide pre-release access. In the case of the thin provisioned block storage implementation present in earlier Dundee betas, we ultimately found that it had issues under performance stress. As a result, we've made the difficult decision to remove it from Dundee at this time. Investigations into alternative implementations are underway, and the team is preparing a more detailed blog on future directions.

Beta 3 Overview

Much of the difference between beta 2 and beta 3 can be found in the details. dom0 has been updated to a CentOS 7.2 userspace, the Xen Project hypervisor is now 4.6.1 and the kernel is 3.10.96. Support for the xsave and xrstor floating point instructions has been added, enabling guest VMs to utilize AVX instructions available on newer Intel processors. We've also added experimental support for the Microsoft Windows Server 2016 Tech Preview and the Ubuntu 16.04 beta.

Beta 3 Bug Fixes

Earlier pre-releases of Dundee had an issue wherein performing a storage migration of a VM with snapshots and in particular orphaned snapshots would result in migration errors. Work has been done to resolve this, and it would be beneficial for anyone taking beta 3 to exercise storage motion to validate if the fix is complete.

One of the focus areas for Dundee is to improve scalability, and as part of that we've uncovered some situations where overall reliability wasn't what we wanted. An example of such a situation, which we've resolved, occurs when a VM with a very large number of VBDs is running on a host, and a XenServer admin requests the host to shutdown. Prior to the fix, such a host would become unresponsive.

The default logrotate configuration for xensource.log has been changed to rotate at 100MB in addition to daily. This change was made because, on very active systems, the prior configuration could result in excessive disk consumption. An illustrative stanza is shown below.
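For readers who want to apply a similar policy themselves, a logrotate stanza along these lines expresses "daily, and also whenever the file exceeds 100MB". The file path and surrounding options are assumptions for illustration, not the exact configuration shipped in XenServer 7.0:

# Hypothetical /etc/logrotate.d/xensource snippet -- illustrative only.
# 'daily' rotates once a day; 'maxsize 100M' additionally rotates as soon as
# the file grows beyond 100MB, matching the behaviour described above.
/var/log/xensource.log {
    daily
    maxsize 100M
    rotate 5
    compress
    missingok
}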

Download Information

You can download Dundee beta 3 from the Preview Download page (http://xenserver.org/preview), and any issues found can be reported in our defect database (https://bugs.xenserver.org).


Storage XenMotion in Dundee

Before we dive into the enhancements brought to the Storage XenMotion (SXM) feature in the XenServer Dundee Alpha 2 release, here is a refresher on the various VM migrations supported in XenServer and how users can leverage them for different use cases.

XenMotion refers to the live migration of a VM (with the VM's disks residing on shared storage) from one host to another within a pool with minimal downtime. Hence, with XenMotion we are moving the VM without touching its disks. For example, XenMotion is very helpful during host and pool maintenance, and is used with XenServer features such as Workload Balancing (WLB) and Rolling Pool Upgrade (RPU), where VMs residing on shared storage can be moved to another host within the pool.

Storage XenMotion is the marketing name given to two distinct XenServer features, live storage migration and VDI move. Both features refer to the movement of a VM's disk (vdi) from one storage repository to another. Live storage migration also includes the movement of the running VM from one host to another host. In the initial state, the VM’s disk can reside either on local storage of the host or shared storage. It can then be motioned to either local storage of another host or shared storage of a pool (or standalone host). The following classifications exist for SXM:

  • When the source and destination hosts are part of the same pool, we refer to it as IntraPool SXM. You can choose to migrate the VM's VDIs to local storage of the destination host or to another shared storage of the pool. For example, leverage it when you need to live migrate VMs residing on a slave host to the pool master.
  • When the source and destination hosts are part of different pools, we refer to it as CrossPool SXM. VM migration between two standalone hosts can also be regarded as CrossPool SXM. You can choose to migrate the VM's VDIs to local storage of the destination host or to shared storage of the destination pool. For example, I often leverage CrossPool SXM when I need to migrate VMs from a pool running an old version of XenServer to a pool running the latest XenServer (a hedged xe example follows this list).
  • When the source and destination hosts are the same, we refer to it as LiveVDI Move. For example, leverage LiveVDI Move when you want to upgrade your storage arrays but don't want to move the VMs to another host. In such cases, you can LiveVDI move a VM's VDI from, say, the shared storage that is planned to be upgraded to another shared storage in the pool, without taking down your VMs.
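For reference, a CrossPool SXM can be driven from the xe CLI roughly as follows. All values are placeholders, and the exact parameter names should be checked against the documentation for your XenServer version; this is a hedged sketch rather than an exact recipe:

# Cross-pool storage migration sketch; <...> values are placeholders.
xe vm-migrate uuid=<vm-uuid> live=true \
   remote-master=<destination-pool-master-ip> \
   remote-username=root remote-password=<password> \
   host-uuid=<destination-host-uuid> \
   vdi:<vdi-uuid>=<destination-sr-uuid>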

When SXM was first rolled out in XenServer 6.1, there were some restrictions on VMs before they could be motioned, such as the maximum number of snapshots a VM could have while undergoing SXM, the fact that a VM with checkpoints could not be motioned, the requirement that the VM be running (otherwise it's not live migration), and so on. For the XenServer Dundee Alpha 2 release the XenServer team has removed some of those constraints. Below is the list of enhancements brought to SXM.

1. A VM can be motioned regardless of its power status. Therefore I can successfully migrate a suspended VM, or move a halted VM, within a pool (intra-pool) or across pools (cross-pool).

[Screenshot: a1sx2_Thumbnail1_CroosPoolCopyEdited_20150811-093802_1.jpg]

[Screenshot: a1sx2_Thumbnail1_CrossPoolSuspendEdited.jpg]

2. A VM can have more than one snapshot and checkpoint during an SXM operation. Thus a VM can have a mix of snapshots and checkpoints and still be successfully motioned.

[Screenshot: a1sx2_Thumbnail1_CrossPoolCheckpointEdited.jpg]

3. A halted VM can be copied from, say, pool A to pool B (cross-pool copy). Note that VMs that are not in the halted state cannot be cross-pool copied.

[Screenshot: a1sx2_Thumbnail1_CroosPoolCopyEdited_20150811-094555_1.jpg]

4. User-created templates can also be copied from, say, pool A to pool B (cross-pool copy). Note that system templates cannot be cross-pool copied.

[Screenshot: a1sx2_Thumbnail1_CrossPoolCopyTemplateEdited.jpg]

Well, this is not the end of the SXM improvements! Stay tuned: in the upcoming release we aim to reduce VM downtime further during migration operations. And do download the Dundee preview and try it out yourself.


Dundee Alpha 2 Released

I am pleased to announce that today we have made available the second alpha build for XenServer Dundee. For those of you who missed the first alpha, it was focused entirely on the move to CentOS 7 for dom0. This important operational change is one that long-time XenServer users, and those who have written management tooling for XenServer, should be aware of throughout the Dundee development cycle. At the time of Alpha 1, no mention was made of feature changes, but with Alpha 2 we're going to talk about some features. So here are some of the important items to be aware of.

Thin Provisioning on block storage

For those who aren't aware, when a XenServer SR is using iSCSI or an HBA, the virtual disks have always consumed their entire allocated space regardless of how utilized the actual virtual disk was. With Dundee we now have full thin provisioning for all block storage, independent of storage vendor. In order to take advantage of this, you will need to indicate during SR creation that thin provisioning is required. You will also be given the opportunity to specify the default VDI allocation, which allows users to optimize VDI utilization against storage performance. We do know about a number of areas still needing attention, but are providing early access so that the community can further identify issues our testing hasn't yet encountered.

NFS version 4

While a simple enhancement, this was identified as a priority item during the Creedence previews last year. We didn't really have the time then to fully implement it, but as of Dundee Alpha 2 you can specify NFS 4 for SR creation in XenCenter.

Intel GVT-d

XenServer 6.5 SP1 introduced support for Intel GVT-d graphics in Haswell and Broadwell chips. This support has been ported to Dundee and is now present in Alpha 2. At this point GPU operations in Dundee should have feature parity to XenServer 6.5 SP1.

CIFS for virtual disk storage

For some time we've had CIFS as an option for ISO storage, but lacked it for virtual disk storage. That has been remedied and if you are running CIFS you can now use it for all your XenServer storage needs.

Changed dom0 disk size

During installation of XenServer 6.5 and prior, a 4GB partition is created for dom0 with an additional 4GB partition created as a backup. For some users, the 4GB partition was too limiting, particularly if remote SYSLOG wasn't used or when third party monitoring tools were installed in dom0. With Dundee we've completely changed the local storage layout for dom0, and this has significant implications for all users wishing to upgrade to Dundee.

New layout

The new partition layout will consume 46GB from local storage. If there is less than 46 GB available, then a fresh install will fail. The new partition layout will be as follows:

  • 512 MB UEFI boot partition
  • 18 GB dom0 partition
  • 18 GB backup partition
  • 4 GB logs partition
  • 1 GB SWAP partition

As you can see from this new partition layout, we've separated logs and SWAP out of the main operating partition, and we now support UEFI boot.

Upgrades

During upgrade, if there is at least 46 GB available, we will recreate the partition layout to match that of a fresh install. In the event 46GB isn't available, we will shrink the existing dom0 partition from 4 GB to 3.5 GB and create the 512 MB UEFI boot partition.

Downloading Alpha 2

 

Dundee Alpha 2 is available for download from xenserver.org/prerelease


XenServer's LUN scalability

"How many VMs can coexist within a single LUN?"

An important consideration when planning a deployment of VMs on XenServer is around the sizing of your storage repositories (SRs). The question above is one I often hear. Is the performance acceptable if you have more than a handful of VMs in a single SR? And will some VMs perform well while others suffer?

In the past, XenServer's SRs didn't always scale too well, so it was not always advisable to cram too many VMs into a single LUN. But all that changed in XenServer 6.2, allowing excellent scalability up to very large numbers of VMs. And the subsequent 6.5 release made things even better.

The following graph shows the total throughput enjoyed by varying numbers of VMs doing I/O to their VDIs in parallel, where all VDIs are in a single SR.

[Figure: aggregate throughput vs. number of VMs per SR, XenServer 6.1 and 6.5 (3541.png)]

In XenServer 6.1 (blue line), a single VM would experience a modest 240 MB/s. But, counter-intuitively, adding more VMs to the same SR would cause the total to fall, reaching a low point around 20 VMs achieving a total of only 30 MB/s, an average of only 1.5 MB/s each!

On the other hand, in XenServer 6.5 (red line), a single VM achieves 600 MB/s, and it only requires three or four VMs to max out the LUN's capabilities at 820 MB/s. Crucially, adding further VMs no longer causes the total throughput to fall, but remains constant at the maximum rate.

And how well distributed was the available throughput? Even with 100 VMs, the available throughput was spread very evenly -- on XenServer 6.5 with 100 VMs in a LUN, the highest average throughput achieved by a single VM was only 2% greater than the lowest. The following graph shows how consistently the available throughput is distributed amongst the VMs in each case:

[Figure: distribution of throughput amongst the VMs sharing a single SR (4016.png)]

Specifics

  • Host: Dell R720 (2 x Xeon E5-2620 v2 @ 2.1 GHz, 64 GB RAM)
  • SR: Hardware HBA using FibreChannel to a single LUN on a Pure Storage 420 SAN
  • VMs: Debian 6.0 32-bit
  • I/O pattern in each VM: 4 MB sequential reads (O_DIRECT, queue-depth 1, single thread). The graph above has a similar shape for smaller block sizes and for writes. (A rough fio equivalent of this per-VM pattern is sketched below.)
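For readers who want to reproduce a comparable per-VM workload, the pattern above maps naturally onto fio inside each guest. The device path is a placeholder for the VDI under test; this is only an approximation of the original measurement harness:

# 4 MiB sequential reads, O_DIRECT, single job, queue depth 1 -- per-guest workload sketch.
fio --name=seqread-4m --filename=/dev/xvdb --rw=read --bs=4M \
    --direct=1 --ioengine=libaio --iodepth=1 --numjobs=1 \
    --runtime=60 --time_based --readonly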

When Virtualised Storage is Faster than Bare Metal

An analysis of block size, inflight requests and outstanding data

INTRODUCTION

Back in August 2014 I went to the Xen Project Developer Summit in Chicago (IL) and presented a graph that caused a few faces to go "ahn?". The graph was meant to show how well XenServer 6.5 storage throughput could scale over several guests. For that, I compared 10 fio threads running in dom0 (mimicking 10 virtual disks) with 10 guests running 1 fio thread each. The result: the aggregate throughput of the virtual machines was actually higher.

In XenServer 6.5 (used for those measurements), the storage traffic of 10 VMs corresponds to 10 tapdisk3 processes doing I/O via libaio in dom0. My measurements used the same disk areas (raw block-based virtual disks) for each fio thread or tapdisk3. So how can 10 tapdisk3 processes possibly be faster than 10 fio threads also using libaio and also running in dom0?

At the time, I hypothesised that the lack of indirect I/O support in tapdisk3 was causing requests larger than 44 KiB (the maximum supported request size in Xen's traditional blkif protocol) to be split into smaller requests. And that the storage infrastructure (a Micron P320h) was responding better to a higher number of smaller requests. In case you are wondering, I also think that people thought I was crazy.

You can check out my one year old hypothesis between 5:10 and 5:30 on the XPDS'14 recording of my talk: https://youtu.be/bbdWFB1mBxA?t=5m10s

[Figure: slide from the XPDS'14 talk showing throughput as a function of request size (20150525-01-slide.jpg)]

TRADITIONAL STORAGE AND MERGES

For several years operating systems have been optimising storage I/O patterns (in software) before issuing them to the corresponding disk drivers. In Linux, this has been achieved via elevator schedulers and the block layer. Requests can be reordered, delayed, prioritised and even merged into a smaller number of larger requests.

Merging requests has been around for as long as I can remember. Everyone understands that fewer requests mean less overhead, and that storage infrastructures respond better to larger requests. As a matter of fact, the graph above, which shows throughput as a function of request size, is proof of that: bigger requests mean higher throughput.

It wasn't until 2010 that a proper means to fully disable request merging came into play in the Linux kernel. Alan Brunelle showed a 0.56% throughput improvement (and lower CPU utilisation) by not trying to merge requests at all. I wonder if he considered whether splitting requests could actually be even more beneficial.

SPLITTING I/O REQUESTS

Given the results I have seen on my 2014 measurements, I would like to take this concept a step further. On top of not merging requests, let's forcibly split them.

The rationale behind this idea is that some drives today will respond better to a higher number of outstanding requests. The Micron P320h performance testing guide says that it "has been designed to operate at peak performance at a queue depth of 256" (page 11). Similar documentation from Intel uses a queue depth of 128 to indicate peak performance of its NVMe family of products.

But it is one thing to say that a drive requires a large number of outstanding requests to perform at its peak. It is a different thing to say that a batch of 8 requests of 4 KiB each will complete quicker than one 32 KiB request.

MEASUREMENTS AND RESULTS

So let's put that to the test. I wrote a little script to measure the random read throughput of two modern NVMe drives when facing workloads with varying block sizes and I/O depth. For block sizes from 512 B to 4 MiB, I am particularly interested in analysing how these disks respond to larger "single" requests in comparison to smaller "multiple" requests. In other words, what is faster: 1 outstanding request of X bytes or Y outstanding requests of X/Y bytes?

My test environment consists of a Dell PowerEdge R720 (Intel E5-2643 v2 @ 3.5GHz, 2 sockets, 6 cores/socket, HT enabled), with 64 GB of RAM running Debian Jessie 64-bit and the Linux 4.0.4 kernel. My two disks are an Intel P3700 (400GB) and a Micron P320h (175GB). Fans were set to full speed and the power profiles are configured for OS Control, with a performance governor in place.

#!/bin/bash
# Total amount of outstanding data per test, from 512 B to 4 MiB.
sizes="512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 \
       1048576 2097152 4194304"
# nvme0n1 is the Intel P3700; rssda is the Micron P320h.
drives="nvme0n1 rssda"

for drive in ${drives}; do
    for size in ${sizes}; do
        # Split each amount of outstanding data into ever more (and smaller) requests:
        # double the queue depth until the per-request block size would drop below 512 B.
        for ((qd=1; ${size}/${qd} >= 512; qd*=2)); do
            bs=$[ ${size} / ${qd} ]
            # 30 s of random reads via libaio with O_DIRECT; field 7 of fio's terse
            # output is the read bandwidth.
            tp=$(fio --terse-version=3 --minimal --rw=randread --numjobs=1  \
                     --direct=1 --ioengine=libaio --runtime=30 --time_based \
                     --name=job --filename=/dev/${drive} --bs=${bs}         \
                     --iodepth=${qd} | awk -F';' '{print $7}')
            echo "${size} ${bs} ${qd} ${tp}" | tee -a ${drive}.dat
        done
    done
done

There are several ways of looking at the results. I believe it is always worth starting with a broad overview including everything that makes sense. The graphs below contain all the data points for each drive. Keep in mind that the "x" axis represents Block Size (in KiB) over the Queue Depth.

[Figure: Intel P3700 (nvme0n1) random read throughput across block size / queue depth combinations (20150525-02-nvme.jpg)]

[Figure: Micron P320h (rssda) random read throughput across block size / queue depth combinations (20150525-03-rssda.jpg)]

While the Intel P3700 is faster overall, both drives share a common trait: for a certain amount of outstanding data, throughput can be significantly higher if that data is split over several inflight requests instead of a single large request. Because this workload consists of random reads, this is a characteristic that is not evident on spinning disks (where the seek time would negatively affect the total throughput of the workload).

To make this point clearer, I have isolated the workloads involving 512 KiB of outstanding data on the P3700 drive. The graph below shows that if a workload randomly reads 512 KiB of data one request at a time (queue depth=1), the throughput will be just under 1 GB/s. If, instead, the workload would read 8 KiB of data with 64 outstanding requests at a time, the throughput would be about double (just under 2 GB/s).

[Figure: Intel P3700 throughput for 512 KiB of outstanding data at varying queue depths (20150525-04-nvme512k.jpg)]

CONCLUSIONS

Storage technologies are constantly evolving. At this point in time, it appears that hardware is evolving much faster than software. In this post I have discussed a paradigm of workload optimisation (request merging) that perhaps no longer applies to modern solid state drives. As a matter of fact, I am proposing that the exact opposite (request splitting) should be done in certain cases.

Traditional spinning disks have always responded better to large requests. Such workloads reduced the overhead of seek times where the head of a disk must roam around to fetch random bits of data. In contrast, solid state drives respond better to parallel requests, with virtually no overhead for random access patterns.

Virtualisation platforms and software-defined storage solutions are perfectly placed to take advantage of such paradigm shifts. By understanding the hardware infrastructure they are placed on top of, as well as the workload patterns of their users (e.g. Virtual Desktops), requests can be easily manipulated to better explore system resources.


About XenServer

XenServer is the leading open source virtualization platform, powered by the Xen Project hypervisor and the XAPI toolstack. It is used in the world's largest clouds and enterprises.
 
Commercial support for XenServer is available from Citrix.