Virtualization Blog

Discussions and observations on virtualization.

XenServer High-Availability Alternative HA-Lizard

WHY HA AND WHAT IT DOES

XenServer (XS) includes a native high-availability (HA) option that offers considerable flexibility in determining the state of a pool of hosts and the circumstances under which Virtual Machines (VMs) are restarted on alternative hosts when a host can no longer serve its VMs. HA is a very useful feature that prevents VMs from staying down after a server crash or another incident that makes them inaccessible. Allowing a XS pool to maintain the availability of its own VMs plays a large role in sustaining uptime, and letting the servers handle fail-overs automatically simplifies administration and shortens reaction times to incidents, increasing uptime for the servers and the applications they run.

XS allows each VM to be given one of three treatments: (1) always restart, (2) restart if possible, and (3) do not restart. VMs with the highest restart priority are attempted first, and all of them will be handled provided adequate resources (primarily host memory) are available. A specific start order can also be defined, so that some VMs are confirmed to be running before others are started. VMs are automatically distributed among whichever XS hosts remain active. Note that, where necessary, VMs configured with dynamic (expandable) memory may be shrunk down to make room for additional VMs, and restarted VMs may likewise run with reduced memory. If capacity remains after the mandatory restarts, VMs designated "restart if possible" are brought online. VMs that are not considered essential are typically marked "do not restart" and are therefore left off if they had been running; any of these that should come back must be restarted manually, resources permitting.

XS also allows you to specify how many host failures the pool should be able to tolerate; larger pools that are not heavily loaded with VMs can readily accommodate two or more host failures.
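For reference, here is a rough sketch of how these native HA settings map onto the xe CLI; the UUIDs and values are placeholders, so treat it as an illustration rather than a prescribed procedure:

xe pool-ha-enable heartbeat-sr-uuids=<pooled SR uuid>                 # turn HA on against a heartbeat SR
xe pool-param-set uuid=<pool uuid> ha-host-failures-to-tolerate=1     # how many host failures to plan for
xe vm-param-set uuid=<vm uuid> ha-restart-priority=restart order=1    # "always restart", started first
xe vm-param-set uuid=<vm uuid> ha-restart-priority=best-effort        # "restart if possible"
xe vm-param-set uuid=<vm uuid> ha-restart-priority=""                 # "do not restart"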

The determination of which hosts are "live" and should be considered active members of the pool follows a rather involved process combining network accessibility with access to an independent, designated pooled Storage Repository (SR) that serves as an additional heartbeat. The heartbeat SR can also be a fibre channel device and is therefore independent of Ethernet connections. A quorum-based algorithm establishes which servers are up and active members of the pool and, in the event of a pool master failure, which one should be elected the new pool master.

 

WHEN HA WORKS, IT WORKS GREAT

Without going into more detail, suffice it to say that this methodology works very well, though it has a few prerequisites that need to be taken into consideration. First, the requirement for a pooled storage device clearly excludes any pool whose hosts use only local storage. Second, a quorum requires a minimum of three hosts in the pool; with fewer, HA results become unpredictable because the election of a pool master can be ambiguous. This is the well-known "split brain" issue (http://linux-ha.org/wiki/Split_Brain), endemic to many operating system environments that rely on a quorum for such decisions. Fencing (isolating the host; see for example http://linux-ha.org/wiki/Fencing) is the typical recourse, but without intercommunication the wrong decision can be made, resulting in loss of access to VMs. Having experimented with two-host pools and native XenServer HA, I would estimate it works about half the time, which is roughly what you would expect from a statistical viewpoint.

These limitations are of immediate concern to anyone with either no pooled storage or only two hosts in a pool. The first is relatively simple and inexpensive to address: with a little extra network connectivity, a very small NFS-based SR can be made available to serve as the external heartbeat SR. The second condition, however, cannot be remedied without the expense of at least one additional host and all the connectivity associated with it, which in some cases simply may not be affordable.

 

ENTER HA-LIZARD

For a number of years now, an alternative HA mechanism has been available in the package provided by HA-Lizard (http://www.halizard.com/), a community project offering a free alternative that neither depends on an external SR nor requires a minimum of three hosts in a pool. This blog focuses on the standard HA-Lizard version and, because it is the harder situation to handle, on the two-node pool in particular.

I had been experimenting for some time with HA-Lizard and found in particular that I was able to create failure scenarios that needed some improvement. HA-Lizard’s Salvatore Costantino was more than willing to lend an ear to the cases I had found and this led further to a very productive collaboration on investigating and implementing means to deal with a number of specific cases involving two-host pools. The result of these several months of efforts is a new HA-Lizard release that manages to address a number of additional scenarios above and beyond its earlier capabilities.

It is worthwhile mentioning that there are two ways of deploying HA-Lizard:

1) Most use cases combine HA-Lizard with iSCSI-HA, which creates a two-node pool using local storage while maintaining full VM agility, with VMs able to run on either host. DRBD (http://www.drbd.org/) provides real-time storage replication in this type of deployment and works very well.

2) HA-Lizard alone is used with an external Storage Repository (as in this particular case).

Before going into the details of the investigation, a few words on how this works. Note that the only external reference is network connectivity to a heuristic node (typically the gateway) and there is no external SR, so how is a split-brain situation avoided?

This is how I'd describe the course of action in this two-node situation:

  • If a node can see the gateway, it assumes it is alive. If it cannot, it assumes it is a candidate for fencing.
  • If the node that cannot see the gateway is the master, it should kill any VMs it is running, surrender its ability to be the master, and fence itself.
  • The slave node should then promote itself to master and attempt to restart any missing VMs. Restarts of VMs still registered on the previous master will probably fail at first, because there is no communication with the old master; eventually the new master will be able to restart them regardless, after a toolstack restart.
  • If instead the slave loses network connectivity while the master can still see the network (but not the slave), the master can assume the slave will fence itself and kill off its VMs, and that those VMs will be restarted on the current master. The slave, realizing it cannot communicate out, should kill off any of its VMs and fence itself.

Naturally, the trickier part comes with the timing of the various actions, since each node has to blindly assume the other is going to conduct a sequence of events. The key here is that these are all agreed on ahead of time and as long as each follows its own specific instructions, it should not matter that each of the two nodes cannot see the other node. In essence, the lack of communication in this case allows for creating a very specific course of action! If both nodes fail, obviously the case is hopeless, but that would be true of any HA configuration in which no node is left standing.
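To make the sequence concrete, here is a simplified bash-style sketch of the decision logic described above. It is not HA-Lizard's actual code; the gateway address, role detection, and fencing steps are placeholders:

#!/bin/bash
# Simplified sketch of the two-node course of action described above.
GATEWAY_IP="192.168.1.1"      # the heuristic node both hosts agree on
ROLE="master"                 # or "slave", as known to this node

if ping -c 3 -W 2 "$GATEWAY_IP" >/dev/null 2>&1; then
    echo "Gateway reachable: assume this node is healthy and carry on."
elif [ "$ROLE" = "master" ]; then
    # Isolated master: stop local VMs, give up the master role, self-fence.
    echo "Master cannot reach gateway: killing local VMs and self-fencing."
    # e.g. xe vm-shutdown --force ... ; then fence (reboot/power off) this host
else
    # Isolated slave: assume the master will restart our VMs; self-fence.
    echo "Slave cannot reach gateway: killing local VMs and self-fencing."
fi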

Test plans were worked out for the various cases, and the table below lays out the different scenarios, what was expected, and what was actually observed. It is very encouraging that the vast majority of these cases can now be handled properly.

 

Particularly tricky was the case of rebooting the master server from the shell without first disabling HA-Lizard (something one could readily forget to do). Since the fail-over process takes a while, a large number of VMs cannot all be handled before the communication breakdown takes place, so one is left with a bit of a mess to clean up afterwards. Nevertheless, it's still good to know what happens when something takes place that rightfully shouldn't!

The other cases, whether intentional or not, are handled predictably and reliably, which is of course the intent. Typically, a two-node pool isn’t going to have a lot of complex VM dependencies, so the lack of a start order of VMs should not be perceived as a big shortcoming. Support for this feature may even be added in a future release.

 

CONCLUSIONS

HA-Lizard is a viable alternative to the native Citrix HA configuration. It's straightforward to set up and can handle standard failover cases, with a "restart/do not restart" setting that can be made per VM or configured globally. There are quite a number of configuration parameters, which the reader is encouraged to research in the extensive HA-Lizard documentation. There is also an online forum that serves as a source of information and prompt assistance with issues. The most recent release, 2.1.3, is supported on both XenServer 6.5 and 7.0.

Above all, HA-Lizard shines when it comes to handling a non-pooled storage environment and, in particular, all configurations of the dreaded two-node pool. From my direct experience, HA-Lizard now handles the vast majority of issues that arise in a two-node pool, and does so more reliably than the unsupported two-node pool running Citrix's own HA. It has been possible to conduct a large number of tests across various cases and, importantly, to repeat them multiple times to ensure the actions are predictable and repeatable.

I would encourage taking a look at HA-Lizard and giving it a good test run. The software is free (contributions are accepted), is in extensive use, and has a proven track record. For a two-host pool, I frankly cannot think of a better alternative, especially with these latest improvements and enhancements.

I would also like to thank Salvatore Costantino for the opportunity to participate in this investigation and am very pleased to see the fruits of this collaboration. It has been one way of contributing to the Citrix XenServer user community that many can immediately benefit from.


PCI Pass-Through on XenServer 7.0

Plenty of people have asked me over the years how to pass through generic PCI devices to virtual machines running on XenServer. Whilst it isn't officially supported by Citrix, it's nonetheless perfectly possible to do; just note that your mileage may vary, because clearly it's not rigorously tested with all the different types of device people might want to pass through (from TV cards, to storage controllers, to USB hubs...!).

The process on XenServer 7.0 differs somewhat from previous releases, in that the Dom0 control domain is now CentOS 7.0-based, and UEFI boot (in addition to BIOS boot) is supported. Hence, I thought it would be worth writing up the latest instructions, for those who are feeling adventurous.

Of course, XenServer officially supports pass-through of GPUs to both Windows and Linux VMs, hence this territory isn't as uncharted as might first appear: pass-through in itself is fine. The wrinkles will be to do with a particular given piece of hardware.

A Short Introduction to PCI Pass-Through

Firstly, a little primer on what we're trying to do.

Your host will have a PCI bus, with multiple devices hosted on it, each with its own unique ID on the bus (more on that later; just remember this as "B:D.f"). In addition, each device has a globally unique vendor ID and device ID, which allows the operating system to look up what its human-readable name is in the PCI IDs database text file on the system. For example, vendor ID 10de corresponds to the NVIDIA Corporation, and device ID 11b4 corresponds to the Quadro K4200. Each device can then (optionally) have multiple sub-vendor and sub-device IDs, e.g. if an OEM has its own branded version of a supplier's component.
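If you want to see these numeric IDs on your own system, lspci can print them alongside the names (the flags below are standard pciutils options; the B:D.f value is just an example):

lspci -nn                 # device names plus [vendor:device] IDs, e.g. [10de:11b4]
lspci -nnk -s 01:00.0     # a single device, including the kernel driver currently bound to it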

Normally, XenServer's control domain, Dom0, is given all PCI devices by the Xen hypervisor. Drivers in the Linux kernel running in Dom0 each bind to particular PCI device IDs, and thus make the hardware actually do something. XenServer then provides synthetic devices (emulated or para-virtualised) such as SCSI controllers and network cards to the virtual machines, passing the I/O through Dom0 and then out to the real hardware devices.

This is great, because it means the VMs never see the real hardware, and thus we can live migrate VMs around, or start them up on different physical machines, and the virtualised operating systems will be none the wiser.

If, however, we want to give a VM direct access to a piece of hardware, we need to do something different. The main reason one might want to is because the hardware in question isn't easy to virtualise, i.e. the hypervisor can't provide a synthetic device to a VM, and somehow then "share out" the real hardware between those synthetic devices. This is the case for everything from an SSL offload card to a GPU.

Aside: Virtual Functions

There are three ways of sharing out a PCI device between VMs. The first is what XenServer does for network cards and storage controllers, where a synthetic device is given to the VM, but then the I/O streams can effectively be mixed together on the real device (e.g. it doesn't matter that traffic from multiple VMs is streamed out of the same physical network card: that's what will end up happening at a physical switch anyway). That's fine if it's I/O you're dealing with.

The second is to use software to share out the device. Effectively you have some kind of "manager" of the hardware device that is responsible for sharing it between multiple virtual machines, as is done with NVIDIA GRID GPU virtualisation, where each VM still ends up with a real slice of GPU hardware, but controlled by a process in Dom0.

The third is to virtualise at the hardware device level, and have a PCI device expose multiple virtual functions (VFs). Each VF provides some subset of the functionality of the device, isolated from other VFs at the hardware level. Several VMs can then each be given their own VF (using exactly the same mechanism as passing through an entire PCI device). A couple of examples are certain Intel network cards, and AMD's MxGPU technology.
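As a quick illustration, on a Linux host with an SR-IOV capable NIC the virtual functions are typically exposed through sysfs; the interface name below is an assumption, and the exact controls vary by driver:

cat /sys/class/net/eth0/device/sriov_totalvfs    # how many VFs the device supports
echo 4 > /sys/class/net/eth0/device/sriov_numvfs # create four VFs; they then appear as separate PCI devices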

OK, So How Do I Pass-Through a Device?

Step 1

Firstly, we have to stop any driver in Dom0 claiming the device. In order to do that, we'll need to ascertain what the ID of the device we're interested in passing through is. We'll use B:D.f (Bus, Device, function) numbering to specify it.

Running lspci will tell you what's in your system:

davidcot@helical:~$ lspci
00:00.0 Host bridge: Intel Corporation 82X38/X48 Express DRAM Controller
00:01.0 PCI bridge: Intel Corporation 82X38/X48 Express Host-Primary PCI Express Bridge
00:06.0 PCI bridge: Intel Corporation 82X38/X48 Express Host-Secondary PCI Express Bridge
00:1a.0 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 02)
00:1a.1 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 02)
00:1a.2 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 02)
00:1a.7 USB controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 02)
00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 02)
00:1c.0 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 1 (rev 02)
00:1c.5 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 6 (rev 02)
00:1d.0 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 02)
00:1d.1 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 02)
00:1d.2 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 02)
00:1d.7 USB controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92)
00:1f.0 ISA bridge: Intel Corporation 82801IR (ICH9R) LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
01:00.0 VGA compatible controller: NVIDIA Corporation G86 [Quadro NVS 290] (rev a1)
04:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5754 Gigabit Ethernet PCI Express (rev 02)

Once you've found the device you're interested in, say 04:00.0 for my network card, we tell Dom0 to exclude it from being bound to by normal drivers. You can add to the Dom0 boot line as follows:

/opt/xensource/libexec/xen-cmdline --set-dom0 "xen-pciback.hide=(04:00.0)"

(What this does is edit /boot/grub/grub.cfg for you, or if you're booting using UEFI, /boot/efi/EFI/xenserver/grub.cfg instead!)
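A quick sanity check after running the command is to confirm the option landed on the Dom0 boot line (substitute the UEFI path mentioned above if applicable):

grep -o "xen-pciback.hide=([^)]*)" /boot/grub/grub.cfg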

Step 2

Reboot! At the moment, a driver in Dom0 probably still has hold of your device, hence you need to reboot the host to get it relinquished.

Step 3

The easy bit: tell the toolstack to assign the PCI device to the VM. Run:

xe vm-list

And note the UUID of the VM you're interested in, then:

xe vm-param-set other-config:pci=0/0000:<B:D.f> uuid=<vm uuid>

Where, of course, <B.D.f> is the ID of the device you found in step 1 (like 04:00.0), and <vm uuid> corresponds to the VM you care about.
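Should you later want to undo the assignment, removing the other-config key should be sufficient (the UUID is again a placeholder):

xe vm-param-remove uuid=<vm uuid> param-name=other-config param-key=pci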

Step 4

Start your VM. At this point if you run lspci (or equivalent) within the VM, you should now see the device. However, that doesn't mean it will spring into life, because...

Step 5

Install a device driver for the piece of hardware you passed through. The operating system within the VM may already ship with a suitable device driver; if not, you'll need to get the appropriate one from the device manufacturer. This will normally be the standard Linux/Windows/other driver that you would use on a physical system; the only difference arises when you're using a virtual function, where the VF driver is likely to be a special one.

Health Warnings

As indicated above, pass-through has advantages and disadvantages. You'll get direct access to the hardware (and hence, for some functions, higher performance), but you'll forgo luxuries such as the ability to live migrate the virtual machine around (there's state now sitting on real hardware, versus virtual devices), and the ability to use high availability for that VM (because HA doesn't take into account how many free PCI devices of the right sort you have in your resource pool).

In addition, not all PCI devices take well to being passed through, and not all servers like doing so (e.g. if you're extending the PCI bus in a blade system to an expansion module, this can sometimes cause problems). Your mileage may therefore vary.

If you do get stuck, head over to the XenServer discussion forums and people will try to help out, but just note that Citrix doesn't officially support generic PCI pass-through, hence you're in the hands of the (very knowledgeable) community.

Conclusion

Hopefully this has helped clear up how pass-through is done on XenServer 7.0; do comment and let us know how you're using pass-through in your environment, so that we can learn what people want to do, and think about what to officially support on XenServer in the future!


Better Together: PVS and XenServer!

XenServer adds new functionality to further simplify and enhance the secure and on-demand delivery of applications and desktops to enterprise environments.

If you haven't visited the Citrix blogs recently, we encourage you to visit https://www.citrix.com/blogs/2016/10/31/pvs-and-xenserver-better-together/ to read about the latest integration efforts between PVS and XS.

If you're a Citrix customer, this article is a must read!

Andy Melmed, Senior Solutions Architect, XenServer PM

 


Set Windows guest VM Static IP address in XenServer

A Bit of Why

For a XenServer Virtual Machine (VM) administrator, the traditional way to set a static IP address on a VM may not be very direct. That is because XenServer has historically not provided an API for setting a VM's IP address from any management tool. To change the IP settings of a VM, you would have to email the VM's user and ask them to make the change manually, or install a third-party tool to set the IP address inside the VM for you. When creating new VMs for users by cloning, setting IP addresses could mean multiple reboots.

To provide a better user experience, XenServer now offers an easier way to set a static IP address on a guest VM.

 

Set static IP for XenServer 7.0 Windows guest VM

XenServer 7.0 can now set a Windows guest VM's IP address through the interfaces below:

  • IPv4
    • Set the VM IPv4 address via the command line interface (CLI):
      xe vif-configure-ipv4 address=<IP address/prefix> gateway=<gateway address> mode=static|none uuid=<VIF UUID>
    • Set the VM IPv4 address via XAPI:
      VIF.configure_ipv4(vifObject, "static"|"none", "<IP address>", "<gateway address>")
  • IPv6
    • Set the VM IPv6 address via the command line interface (CLI):
      xe vif-configure-ipv6 address=<IPv6 address/prefix> gateway=<gateway address> mode=static|none uuid=<VIF UUID>
    • Set the VM IPv6 address via XAPI:
      VIF.configure_ipv6(vifObject, "static"|"none", "<IPv6 address>", "<gateway address>")

Note:

The mode "none" removes the current static IP setting and returns the VIF to DHCP. If no static IP has been set through the new interface, setting the mode to "none" simply does nothing.
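For example, to revert a VIF that was given a static address back to DHCP, something like the following should suffice (the UUID is a placeholder, and mode=none should not need the address or gateway arguments):

xe vif-configure-ipv4 uuid=<VIF UUID> mode=none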

Dive into details

The diagram below shows how the configuration flows:

[Figure: workflow for applying a static IP configuration to a guest VM]

By using the interface:

1. XAPI will first store the IP configuration to XenStore as:

/local/domain/<domain id>/xenserver/device/vif/<device id> = ""
  static-ip-setting = ""
     mac = "some mac address"
     error-code = "some error code"
     error-msg = "some error message"
     address = "some IP address"
     gateway = "some gateway address"

2. XenStore notifies the XenServer guest agent of the configuration change.

3. XenServer guest agent receives the notification and sets IP address using netsh.

4. After setting the IP address, the XenServer guest agent writes the operation result back to the XenStore keys error-code and error-msg.

Example

1. Install the XenServer PV tools in the Windows guest VM.

2. From the command line interface (CLI), identify the Virtual Network Interface / Virtual Interface (VIF) whose IP address you want to set:

[root@dt65 ~]# xe vm-vif-list vm="Windows 7 (32-bit) (1)"
uuid ( RO)                 : 7dc56d5b-492c-bcf5-2549-b580dc928274
       vm-name-label ( RO): Windows 7 (32-bit) (1)
              device ( RO): 1
                 MAC ( RO): 3e:aa:c3:dd:a7:ba
        network-uuid ( RO): 98f9a3b6-ad3f-14b3-da59-e3abc888e58e
  network-name-label ( RO): Pool-wide network associated with eth1


uuid ( RO)                 : 0f59a97b-afcf-b6db-582d-2411d5bbc449
       vm-name-label ( RO): Windows 7 (32-bit) (1)
              device ( RO): 0
                 MAC ( RO): 62:a1:03:31:a3:ee
        network-uuid ( RO): 41dac7d6-4a11-c9e6-cc48-ded78ceaf446
  network-name-label ( RO): Pool-wide network associated with eth0

3. Call the new interface to set the IP address:

[root@dt65 ~]# xe vif-configure-ipv4 uuid=0f59a97b-afcf-b6db-582d-2411d5bbc449 mode=static address=10.64.5.6/24 gateway=10.64.5.1

4. Check the result via the XenStore keys "error-code" and "error-msg":

[root@XenServer ~]# xenstore-ls /local/domain/13/xenserver/device/vif
0 = ""
  static-ip-setting = ""
     mac = "62:a1:03:31:a3:ee"
     error-code = "0"
     error-msg = ""
     address = "10.64.5.6/24"
     gateway = "10.64.5.1"
1 = ""
  static-ip-setting = ""
     mac = "3e:aa:c3:dd:a7:ba"
     error-code = "0"
     error-msg = ""


Enable XSM on XenServer 6.5 SP1

1 Introduction

Certain virtualization environments require the extra security provided by XSM and FLASK (https://wiki.xenproject.org/wiki/Xen_Security_Modules_:_XSM-FLASK). XenServer 7 benefits from its upgrade of the control domain to CentOS 7, which includes support for enabling XSM and FLASK. But what about legacy XenServer 6.5 installations that also require the added security? XSM and FLASK may be enabled on XenServer 6.5 as well, but it requires a bit more work.

Note that XSM is not currently a user-visible feature in XenServer, or a supported technology.

This article describes how to enable XSM and FLASK in XenServer 6.5 SP1. It makes the assumption that the reader is familiar with accessing, building, and deploying XenServer's Xen RPMs from source. While this article pertains to resources from SP1 source RPMs (XS65ESP1-src-pkgs.tar.bz2 included with SP1, http://support.citrix.com/article/CTX142355), a similar approach can be followed for other XenServer 6.5 hotfixes.

2 Patching Xen and xen.spec

XenServer issues some hypercalls not handled by Xen's XSM hooks. The following patch shows one possible way to handle these operations and commands, which is to always permit them.

diff --git a/xs6.5sp1/xen/xen-4.4.1/xen/xsm/flask/hooks.c b/xs6.5sp1/xen/xen-4.4.1/xen/xsm/flask/hooks.c
index 0cf7daf..a41fcc4 100644
--- a/xs6.5sp1/xen/xen-4.4.1/xen/xsm/flask/hooks.c
+++ b/xs6.5sp1/xen/xen-4.4.1/xen/xsm/flask/hooks.c
@@ -727,6 +727,12 @@ static int flask_domctl(struct domain *d, int cmd)
     case XEN_DOMCTL_cacheflush:
         return current_has_perm(d, SECCLASS_DOMAIN2, DOMAIN2__CACHEFLUSH);

+    case XEN_DOMCTL_get_runstate_info:
+        return 0;
+
+    case XEN_DOMCTL_setcorespersocket:
+        return 0;
+
     default:
         printk("flask_domctl: Unknown op %d\n", cmd);
         return -EPERM;
@@ -782,6 +788,9 @@ static int flask_sysctl(int cmd)
     case XEN_SYSCTL_numainfo:
         return domain_has_xen(current->domain, XEN__PHYSINFO);

+    case XEN_SYSCTL_consoleringsize:
+        return 0;
+
     default:
         printk("flask_sysctl: Unknown op %d\n", cmd);
         return -EPERM;
@@ -1299,6 +1308,9 @@ static int flask_platform_op(uint32_t op)
     case XENPF_get_cpuinfo:
         return domain_has_xen(current->domain, XEN__GETCPUINFO);

+    case XENPF_get_cpu_features:
+        return 0;
+
     default:
         printk("flask_platform_op: Unknown op %d\n", op);
         return -EPERM;

The only other file that needs patching is Xen's RPM spec file, xen.spec. Modify HV_COMMON_OPTIONS as shown below.  Change this line:

%define HV_COMMON_OPTIONS max_phys_cpus=256

to:

%define HV_COMMON_OPTIONS max_phys_cpus=256 XSM_ENABLE=y FLASK_ENABLE=y

3 Compiling and Loading a Policy

To build a security policy, navigate to tools/flask/policy in Xen's source tree. Run make to compile the default security policy. It will have a name like xenpolicy.24, depending on your version of checkpolicy.

Copy xenpolicy.24 over to Dom0's /boot directory. Open /boot/extlinux.conf and modify the default section's append /boot/xen.gz ... line so it has --- /boot/xenpolicy.24 at the end. For example:

append /boot/xen.gz dom0_mem=752M,max:752M [.. snip ..] splash --- /boot/initrd-3.10-xen.img --- /boot/xenpolicy.24

After making this change, reboot.

While booting (or afterwards, via xl dmesg), you should see messages indicating XSM and FLASK initialized, read the security policy, and started in permissive mode. For example:

(XEN) XSM Framework v1.0.0 initialized
(XEN) Policy len  0x1320, start at ffff830117ffe000.
(XEN) Flask:  Initializing.
(XEN) AVC INITIALIZED
(XEN) Flask:  Starting in permissive mode.

4 Exercises for the Reader

  1. Create a more sophisticated implementation for handling XenServer hypercalls in xen/xsm/flask/hooks.c.
  2. Write (and load) a custom policy.
  3. Boot with flask_enforcing=1 set, and study any violations that occur (see xl dmesg output).
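As a hint for exercise 3, one way to add the option is via the xen-cmdline helper, assuming it accepts --set-xen for hypervisor options as it does on recent XenServer releases:

/opt/xensource/libexec/xen-cmdline --set-xen flask_enforcing=1
# reboot, then look for violations:
xl dmesg | grep -i avc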

Making a difference to the next release

With the first alpha build of the new, upcoming XenServer release (codenamed "Ely") announced in Andy Melmed's blog on 11th October, I thought it would be useful to provide a little retrospective on how you, the xenserver.org community, have helped get it off to a great start by providing great feedback on previous alphas, betas and releases, and how this has been used to strengthen the codebase for Ely as a whole based on your experiences.
 
As I am sure you are well aware, community users of xenserver.org can use the incident tracking database at bugs.xenserver.org to report issues or problems they've found on their hardware or configuration against recent alphas, betas and XenServer releases. These incidents are raised in the form of XSO tickets, which can then be commented upon by other members of the community and folks who work on the product.
 
We listened
Looking back on all of the XSO tickets raised against the 7.0 release, these total more than 200 individual incident reports. I want to take the time to thank everyone who contributed to these often detailed, specific, constructive reports, and for working iteratively with us to understand the underlying issues. Some of these investigations are ongoing and need further feedback, but many are sufficiently clear to move forward to the next step.
 
We understood
The incident reports were triaged and, by working with the user community, more than 80% of them have been processed. Frequently this involved questions and answers to get a better handle on the underlying problem, then trying a configuration change or even a private fix to confirm whether it related to the problem or resolved it. The enthusiasm and skill of the reporters has been amazing and continually useful. At this point we've separated the incidents into those which can be fixed as bugs and those which are feature requests; the latter have been passed to Citrix product management for consideration.
 
We did
Of those which can be fixed as bugs, we raised or updated 45 acknowledged defects in XenServer. More than 70% of these are already fixed, with another 20% being actively worked on. The small remainder are blocked, awaiting a change elsewhere in the product, upstream, or in our ability to test. The fixes have either become part of the hotfixes already released for 7.0, or are in the codebase and are being progressively released as part of the Ely alpha programme for the community to try.
 
So what's next? With work continuing apace on Ely, we have now opened "Ely alpha" as an affects-version in the incident database so issues can be raised against this latest build. At the same time, in the spirit of continuing to improve the actively developing codebase, we have removed the 6.5 SP1 affects-version so folks can focus on the new release.
 
Finally, on behalf of all xenserver.org users, my personal thanks to everyone who has helped improve both Dundee and Ely, whether by reporting incidents, triaging and fixing them, or by continuing to give feedback on the latest version. This really makes a difference to all members of the community.

XenServer Ely Alpha 1 Available

Hear ye, hear ye… we are pleased to announce that an alpha release of XenServer Project Ely is now available for download! After Dundee (7.0), we've come a little closer to Cambridge (the birthplace of Xen) for our codename, as the city of Ely is just up the road.

 
Since releasing version 7.0 in May, the XenServer engineering team has been working fervently to prepare the platform with the latest innovations in server virtualization technology. As a precursor, a pre-release containing the prerequisites for enabling a number of powerful (and really cool!) new features has been made available for download from the pre-release page.
 

What's In it?

 

The following is a brief description of some of the feature-prerequisites included in this pre-release:

 

Xen 4.7:  This release of Xen adds support for "live-patching" of the Xen hypervisor, allowing issues to be patched without requiring a host reboot. In this alpha release there is no functionality for you to test in this area, but we thought it was worth telling you about none the less. Xen 4.7 also includes various performance improvements, and updates to the virtual machine introspection code (surfaced in XenServer as Direct Inspect).

 

Kernel 4.4: Updated kernel to support future feature considerations. All device drivers will be at the upstream versions; we'll be updating these with drops direct from the hardware vendors as we go through the development cycle.

 

VM import/export performance: a longstanding request from our user community, we've worked to improve the import/export speeds of VMs, and Ely alpha 1 now averages 2x faster than the previous version.

 

What We'd Like Help With

 

The purpose of this alpha release is really to make sure that a variety of hardware works with project Ely. Because we've updated core platform components (Xen and the Dom0 kernel), it's important to check that all is well on hardware we don't have in our QA labs. So the more people who can download this build, install it, and run a couple of VMs to check that all is well, the better.

 

Additionally, we've been working with the community (over on XSO-445) on improving VM import/export performance: we'd like to see whether the improvements we've seen in our tests are what you see too. If they're not, we can figure out why and fix it :-).
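If you would like to contribute a data point, one simple approach is to time the export of a (shut down) test VM on both versions; the UUID and target path below are placeholders:

time xe vm-export uuid=<vm uuid> filename=/mnt/backup/test-export.xva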

 

Upgrading

 

This is pre-release software, not for production use. Upgrades from XenServer 7.0 should work fine, but it goes without saying that you should ensure you back up any critical data.

 

Reporting Bugs

 

We encourage visitors to download the pre-release and provide us with your feedback. If you do find a problem, please head over to the bug tracker and file a ticket. Please be sure to include a server status report!

 

Now that we've moved up to a new pre-release project, it's time to remove the XS 6.5 SP1 fix version from the bug tracker, in order that we keep it tidy. You'll see an "Ely alpha" affects version is now present instead.

 

What Next?

 

Stay tuned for another pre-release build in the near future: as you may have heard, we've been keeping busy!

 
As always, we look forward to working with the XenServer community to make the next major release of XenServer the best version ever!

 

Cheers!

 

Andy M.

Senior Solutions Architect - XenServer PM

 


XenServer Hotfix XS65ESP1035 Released

News Flash: XenServer Hotfix XS65ESP1035 Released

Indeed, I was alerted early this morning (06:00 EST) via email that Citrix has released hotfix XS65ESP1035 for XenServer 6.5 SP1.  The official release and content is filed under CTX216249, which can be found here: http://support.citrix.com/article/CTX216249

As of the writing of this article, this hotfix has not yet been added to CTX138115 (entitled "Recommended Updates for XenServer Hotfixes") or, as we like to call it "The Fastest Way to Patch A Vanilla XenServer With One or Two Reboots!"  I imagine that resource will be updated to reflect XS65ESP1035 soon.

Personally/Professionally, I will be installing this hotfix as, per CTX216249, I am excited to read what is addressed/fixed:

  • Duplicate entry for XS65ESP1021 was created when both XS65ESP1021 and XS65ESP1029 were applied.
  • When BATMAP (Block Allocation Map) in Xapi database contains erroneous data, the parent VHD (Virtual Hard Disk) does not get inflated causing coalesce failures and ENOSPC errors.
  • After deleting a snapshot on a pool member that is not the pool master, a coalesce operation may not succeed. In such cases, the coalesce process can constantly retry to complete the operation, resulting in the creation of multiple RefCounts that can consume a lot of space on the pool member.
In addition, this hotfix contains the following improvement:
  • This fix lets users set a custom retrans value for their NFS SRs thereby giving them more fine-grained control over how they want NFS mounts to behave in their environment.

(Source: http://support.citrix.com/article/CTX216249)

So....

This is a storage-based hotfix, and while we can create VMs all day, we rely on the storage substrate to hold our precious VHDs, so plan accordingly to deploy it!

Applying The Patch Manually

As a disclaimer of sorts, always plan your patching during a maintenance window to prevent any production outages.  For me, I am currently up-to-date and will be rebooting my XenServer host(s) in a few hours, so I manually applied this patch.

Why?  If you look in XenCenter for updates, you won't see this hotfix listed (yet).  If it were available in XenCenter, checks and balances would inform me that I need to suspend, migrate, or shut down VMs.  For a standalone host, I really can't do that.  In my pool, I can't reboot for a few hours, but I need this patch installed, so I simply do the following on my XenServer stand-alone server OR XenServer primary/master server:

Using the command line in XenCenter, I make a directory in /root/ called "ups" and then descend into that directory because I plan to use wget (Web Get) to download the patch via its link in http://support.citrix.com/article/CTX216249:

[root@colossus ~]# mkdir ups
[root@colossus ~]# cd ups

Now, using wget I specify what to download over port 80 and to save it as "hf35.zip":

[root@colossus ups]# wget http://support.citrix.com/supportkc/filedownload?uri=/filedownload/CTX216249/XS65ESP1035.zip -O hf35.zip

We then see the usual wget progress bar and once it is complete, I can unzip the file "hf35.zip":

HTTP request sent, awaiting response... 200 OK
Length: 110966324 (106M) [application/zip]
Saving to: `hf35.zip'

100%[======================================>] 110,966,324 1.89M/s   in 56s    
2016-08-25 11:06:32 (1.90 MB/s) - `hf35.zip' saved [110966324/110966324]
[root@colossus ups]# unzip hf35.zip 
Archive:  hf35.zip
  inflating: XS65ESP1035.xsupdate   
  inflating: XS65ESP1035-src-pkgs.tar.bz2

I'm a big fan of using shortcuts - especially where UUIDs are involved.  Now that I have the patch ready to expand onto my XenServer master/stand-alone server, I want to create some kind of variable so I don't have to remember my host's UUID or the patch's UUID. 

For the host, I can simply source in a file that contains the XenServer primary/master server's INSTALLATION_UUID (better known as the host's UUID):

[root@colossus ups]# source /etc/xensource-inventory 
[root@colossus ups]# echo $INSTALLATION_UUID
207cd7c1-da20-479b-98bc-e84cac64d0c0

With the variable $INSTALLATION_UUID set, I can now upload the patch and capture its own UUID:

[root@colossus ups]# patchUUID=`xe patch-upload file-name=XS65ESP1035.xsupdate`
[root@colossus ups]# echo $patchUUID
cdf9eb54-c3da-423d-88ca-841b864f926b

NOW, I apply the patch to the host (yes, it still needs to be rebooted, but within a few hours) using both variables in the following command:

[root@colossus ups]# xe patch-apply uuid=$patchUUID host-uuid=$INSTALLATION_UUID
   
Preparing...                ##################################################
kernel                      ##################################################
unable to stat /sys/class/block//var/swap/swap.001: No such file or directory
Preparing...                ##################################################
sm                          ##################################################
Preparing...                ##################################################
blktap                      ##################################################
Preparing...                ##################################################
kpartx                      ##################################################
Preparing...                ##################################################
device-mapper-multipath-libs##################################################
Preparing...                ##################################################
device-mapper-multipath     ##################################################

At this point, I can back out of the "ups" directory and remove it.  Likewise, I can also check to see if the patch UUID is listed in the XAPI database:

[root@colossus ups]# cd ..
[root@colossus ~]# rm -rf ups/
[root@colossus ~]# ls
support.tar.bz2
[root@colossus ~]# xe patch-list uuid=$patchUUID
uuid ( RO)                    : cdf9eb54-c3da-423d-88ca-841b864f926b
              name-label ( RO): XS65ESP1035
        name-description ( RO): Public Availability: fixes to Storage
                    size ( RO): 21958176
                   hosts (SRO): 207cd7c1-da20-479b-98bc-e84cac64d0c0
    after-apply-guidance (SRO): restartHost

So, nothing really special -- just a quick way to apply patches to a XenServer primary/master server.  In the same manner, you can substitute the $INSTALLATION_UUID with other host UUIDs in a pool configuration, etc.
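As an aside, in a pool the toolstack can push the uploaded patch to every member in one step rather than applying it host by host (reusing the same patch UUID variable as above):

xe patch-pool-apply uuid=$patchUUID
xe patch-list uuid=$patchUUID params=hosts     # confirm which hosts now have it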

Well, off to reboot and thanks for reading!

 

-jkbs | @xenfomation | My Citrix Blog

To receive updates about the latest XenServer Software Releases, login or sign-up to pick and choose the content you need from http://support.citrix.com/customerservice/

 


Sources

Citrix Support Knowledge Center: http://support.citrix.com/article/CTX216249

Citrix Support Knowledge Center: http://support.citrix.com/customerservice/

Citrix Profile/RSS Feeds: http://support.citrix.com/profile/watches/

Original Image Source: http://www.gimphoto.com/p/download-win-zip.html


XenServer 7.0 performance improvements part 4: Aggregate I/O throughput improvements

The XenServer team has made a number of significant performance and scalability improvements in the XenServer 7.0 release. This is the fourth in a series of articles that will describe the principal improvements. For the previous ones, see:

  1. http://xenserver.org/blog/entry/dundee-tapdisk3-polling.html
  2. http://xenserver.org/blog/entry/dundee-networking-multi-queue.html
  3. http://xenserver.org/blog/entry/dundee-parallel-vbd-operations.html

In this article we return to the theme of I/O throughput. Specifically, we focus on improvements to the total throughput achieved by a number of VMs performing I/O concurrently. Measurements show that XenServer 7.0 enjoys aggregate network throughput over three times faster than XenServer 6.5, and also has an improvement to aggregate storage throughput.

What limits aggregate I/O throughput?

When a number of VMs are performing I/O concurrently, the total throughput that can be achieved is often limited by dom0 becoming fully busy, meaning it cannot do any additional work per unit time. The I/O backends (netback for network I/O and tapdisk3 for storage I/O) together consume 100% of available dom0 CPU time.

How can this limit be overcome?

Whenever there is a CPU bottleneck like this, there are two possible approaches to improving the performance:

  1. Reduce the amount of CPU time required to perform I/O.
  2. Increase the processing capacity of dom0, by giving it more vCPUs.

Surely approach 2 is easy and will give a quick win...? Intuitively, we might expect the total throughput to increase proportionally with the number of dom0 vCPUs.

Unfortunately it's not as straightforward as that. The following graph shows what happened to the aggregate network throughput on XenServer 6.5 if the number of dom0 vCPUs is artificially increased. (In this case, we are measuring the total network throughput of 40 VMs communicating amongst themselves on a single Dell R730 host.)

[Figure: aggregate network throughput on XenServer 6.5 as the number of dom0 vCPUs is increased]

Counter-intuitively, the aggregate throughput decreases as we add more processing power to dom0! (This explains why the default was at most 8 vCPUs in XenServer 6.5.)

So is there no hope for giving dom0 more processing power...?

The explanation for the degradation in performance is that certain operations run more slowly when there are more vCPUs present. In order to make dom0 work better with more vCPUs, we needed to understand what those operations are, and whether they can be made to scale better.

Three such areas of poor scalability were discovered deep in the innards of Xen by Malcolm Crossley and David Vrabel, and improvements were made for each:

  1. Maptrack lock contention – improved by http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=dff515dfeac4c1c13422a128c558ac21ddc6c8db
  2. Grant-table lock contention – improved by http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=b4650e9a96d78b87ccf7deb4f74733ccfcc64db5
  3. TLB flush on grant-unmap – improved by https://github.com/xenserver/xen-4.6.pg/blob/master/master/avoid-gnt-unmap-tlb-flush-if-not-accessed.patch

The result of improving these areas is dramatic – see the green line in the following graph:

[Figure: aggregate network throughput versus number of dom0 vCPUs, before and after the scalability fixes]

Now, throughput scales very well as the number of vCPUs increases. This means that, for the first time, it is now beneficial to allocate many vCPUs to dom0 – so that when there is demand, dom0 can deliver. Hence we have given XenServer 7.0 a higher default number of dom0 vCPUs.

How many vCPUs are now allocated to dom0 by default?

Most hosts will now get 16 vCPUs by default, but the exact number depends on the number of CPU cores on the host. The following graph summarises how the default number of dom0 vCPUs is calculated from the number of CPU cores on various current and historic XenServer releases:

[Figure: default number of dom0 vCPUs as a function of host CPU cores, across XenServer releases]
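To see what your own host ended up with, you can check dom0's vCPU count directly; either command below is run in dom0:

xl list Domain-0                     # the VCPUs column shows dom0's allocation
grep -c ^processor /proc/cpuinfo     # dom0 only sees its own vCPUs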

Summary of improvements

I will conclude with some aggregate I/O measurements comparing XenServer 6.5 and 7.0 under default settings (no dom0 configuration changes) on a Dell R730xd.

  1. Aggregate network throughput – twenty pairs of 32-bit Debian 6.0 VMs sending and receiving traffic generated with iperf 2.0.5.
    [Figure: aggregate network throughput, XenServer 6.5 vs 7.0]
  2. Aggregate storage IOPS – twenty 32-bit Windows 7 SP1 VMs each doing single-threaded, serial, sequential 4KB reads with fio to a virtual disk on an Intel P3700 NVMe drive.
    [Figure: aggregate storage IOPS, XenServer 6.5 vs 7.0]

XenServer 7.0 performance improvements part 3: Parallelised plug and unplug VBD operations in xenopsd

The XenServer team has made a number of significant performance and scalability improvements in the XenServer 7.0 release. This is the third in a series of articles that will describe the principal improvements. For the first two, see here:

  1. http://xenserver.org/blog/entry/dundee-tapdisk3-polling.html
  2. http://xenserver.org/blog/entry/dundee-networking-multi-queue.html

The topic of this post is control plane performance. XenServer 7.0 achieves significant performance improvements through the support for parallel VBD operations in xenopsd. With the improvements, xenopsd is able to plug and unplug many VBDs (virtual block devices) at the same time, substantially improving the duration of VM lifecycle operations (start, migrate, shutdown) for VMs with many VBDs, and making it practical to operate VMs with up to 255 VBDs.

Background of the VM lifecycle operations

In XenServer, xenopsd is the dom0 component responsible for VM lifecycle operations:

  • during a VM start, xenopsd creates the VM container and then plugs the VBDs before starting the VCPUs;
  • during a VM shutdown, xenopsd stops the VCPUs and then unplugs the VBDs before destroying the VM container;
  • during a VM migrate, xenopsd creates a new VM container, unplugs the VBDs of the old VM container, and plugs the VBDs for the new VM before starting its VCPUs; while the VBDs are being unplugged and plugged on the other VM container, the user experiences a VM downtime when the VM is unresponsive because both old and new VM containers are paused.

Measurements have shown that a large part, usually most of the duration of these VM lifecycle operations is due to plugging and unplugging the VBDs, especially on slow or contended storage backends.

[Figure: VBD plug operations executed sequentially during a VM lifecycle operation]

 

Why does xenopsd take some time to plug and unplug the VBDs?

The completion of a xenopsd VBD plug operation involves the execution of two storage layer operations, VDI attach and VDI activate (where VDI stands for virtual disk image). These VDI operations include control plane manipulation of daemons, block devices and disk metadata in dom0, which will take different amounts of time to execute depending on the type of the underlying Storage Repository (SRs, such as LVM, NFS or iSCSI) used to hold the VDIs, and the current load on the storage backend disks and their types (SSDs or HDs). Similarly, the completion of a xenopsd VBD unplug operation involves the execution of two storage layer operations, VDI deactivate and VDI detach, with the corresponding overhead of manipulating the control plane of the storage layer.

If the underlying physical disks are under high load, there may be contention preventing progress of the storage layer operations, and therefore xenopsd may need to wait many seconds before the requests to plug and unplug the VBDs can be served.

Originally, xenopsd would execute these VBD operations sequentially, and the total time to finish all of them for a single VM would depend on the number of VBDs in the VM. Essentially, it would be the sum of the time to operate each of the VBDs of the VM, which could mean several minutes of waiting for a lifecycle operation on a VM that had, for instance, 255 VBDs.

What are the advantages of parallel VBD operations?

Plugging and unplugging the VBDs in parallel in xenopsd:

  • provides a total duration for the VM lifecycle operations that is independent of the number of VBDs in the VM. This duration will typically be the duration of the longest individual VBD operation amongst the parallel VBD operations for that VM;
  • provides a significant instantaneous improvement for the user, across all the VBD operations involving more than 1 VBD per VM. The more devices involved, the larger the noticeable improvement, up to the saturation of the underlying storage layer;
  • this single improvement is immediately applicable across all of VM start, VM shutdown and VM migrate lifecycle operations.

[Figure: VBD plug operations executed in parallel during a VM lifecycle operation]

 

Are there any disadvantages or limitations?

Plugging and unplugging VBDs uses dom0 memory. The main disadvantage of doing these in parallel is that dom0 needs more memory to handle all the parallel operations. To prevent situations where a large number of such operations would cause dom0 to run out of memory, we have added two limits:

  • the maximum number of global parallel operations that xenopsd can request is the same as the number of xenopsd worker-pool threads as defined by worker-pool-size in /etc/xenopsd.conf. This prevents regression in the maximum dom0 memory usage compared to when sequential VBD operations per VM was used in xenopsd. An increase in this value will increase the number of parallel VBD operations, at the expense of having to increase the dom0 memory for about 15MB for each extra parallel VBD operation.
  • the maximum number of per-VM parallel operations that xenopsd can request is currently fixed to 10, which covers a wide range of VMs and still provides a 10x improvement in lifecycle operation times for those VMs that have more than 10 VBDs.

Where do I find the changes?

The changes that implemented this feature are available in github at https://github.com/xapi-project/xenopsd/pull/250

What sort of theoretical improvements should I expect in XenServer 7.0?

The exact numbers depend on the SR type, the storage backend load characteristics, and the limits specified in the previous section. Given those limits, the duration of VBD plugs for a single VM will follow the pattern in the following table:

Number n of VBDs/VM      Improvement of VBD operations
<= 10 VBDs/VM            n times faster
> 10 VBDs/VM             10 times faster

The table above assumes that the maximum number of global parallel operations discussed in the previous section is not reached. If you want to guarantee the improvement in the table above for x > 1 simultaneous VM lifecycle operations, at the expense of using more dom0 memory in the worst case, you will probably want to set worker-pool-size = (n * x) in /etc/xenopsd.conf, where n is a number reflecting the average number of VBDs/VM amongst all VMs, up to a maximum of n=10.
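In practice that means checking and editing the value in dom0 and then restarting the toolstack so xenopsd re-reads its configuration; the value 64 below is only an example sized as n * x:

grep -n worker-pool-size /etc/xenopsd.conf     # inspect the current setting
# edit the value (e.g. worker-pool-size = 64), then:
xe-toolstack-restart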

What sort of practical improvements should I expect in XenServer 7.0?

The VBD plug and unplug operations are only part of the overall work needed to execute a VM lifecycle operation. The remaining parts, such as creation of the VM container and VIF plugs, dilute the VBD improvements of the previous section, though these are still significant. Some examples of improvements, using an EXT SR on a local SSD storage backend:

VM lifecycle operation                   Improvement with 8 VBDs/VM
Toolstack time to start a single VM      [Figure: toolstack time to start a single VM with 8 VBDs]
Toolstack time to bootstorm 125 VMs      [Figure: toolstack time to bootstorm 125 VMs with 8 VBDs each]

The approximately 2s improvement in single VM start time was caused by plugging the 8 VBDs in parallel. As we see in the second row of the table, this can be a significant advantage in a bootstorm.

In XenServer 7.0, not only does xenopsd execute VBD operations in parallel, but the storage layer operation times on VDIs have also improved, so in your XenServer 7.0 environment you may observe further VM lifecycle time improvements, beyond those expected from parallel VBD operations alone, compared to XenServer 6.5 SP1.

 


Resetting Lost Root Password in XenServer 7.0

XenServer 7.0, Grub2, and a Lost Root Password

In a previous article I detailed how one could reset a lost root password on XenServer 6.2.  While that article is not limited to 6.2 (it works just as well for 6.5, 6.1, and 6.0.2), this article is dedicated to XenServer 7.0, as GRUB2 has been brought in to replace extlinux.

As such, if the local root user's (LRU) password for a XenServer 7.0 host is forgotten, physical (or "lights out") access to the host and a reboot will be required.  What has changed with GRUB2 is the method used to boot the XenServer 7.0 host into single-user mode and reset the root password to a known value.

The Grub Boot Screen

Once physical or "lights out" access to the XenServer 7.0 host in question has been obtained, the following screen will appear on reboot:

It is important to note that once this screen appears, you only have four seconds to take action before the host proceeds to boot the kernel.

As should be the default, the XenServer kernel entry is highlighted.  One will want to immediately press the "e" key (for edit).

This will then refresh the grub interface - stopping any count-down-to-boot timers - and reveal the boot entry.  Within this window (using the up, down, left, and right arrow keys) one will want to navigate to around line four or five and locate "ro nolvm":

 

Next, one will want to remove (or backspace/delete) the "ro" characters and type in "rw init=/sysroot/bin/sh", or as illustrated:

 

Don't worry if the directive is not on one line!

 

With this change made, press Control and X at the same time; this will boot the XenServer kernel into single-user mode, better known as Emergency Mode:

How to Change Root's Password

From the Emergency Mode prompt, execute the following command:

chroot /sysroot

Now, one can execute the "passwd" command to change root's credentials:
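Put together, the whole exchange at the Emergency Mode prompt looks roughly like this (the prompt and the passwd output shown are illustrative):

sh-4.2# chroot /sysroot
sh-4.2# passwd
Changing password for user root.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.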

Finally....

Now that root's credentials have been changed, use Control+Alt+Delete to reboot the XenServer 7.0 host. One will then find - via SSH, XenCenter, or the console - that the root password has been changed: the host is ready to be managed again.

 

Recent Comments
Tobias Kreidl
Many thanks for this update, Jesse! It should be turned into a KB article, as well, if not already.
Friday, 24 June 2016 10:52
JK Benedict
Jordan -- Thanks for the compliments! However, it seems more apropos to say "Sorry to hear about your situation!" So, the steps... Read More
Monday, 27 June 2016 10:11

XenServer 7.0 performance improvements part 2: Parallelised networking datapath

The XenServer team has made a number of significant performance and scalability improvements in the XenServer 7.0 release. This is the second in a series of articles that will describe the principal improvements. For the first, see http://xenserver.org/blog/entry/dundee-tapdisk3-polling.html.

The topic of this post is network I/O performance. XenServer 7.0 achieves significant performance improvements through the support for multi-queue paravirtualised network interfaces. Measurements of one particular use-case show an improvement from 17 Gb/s to 41 Gb/s.

A bit of background about the PV network datapath

In order to perform network-based communications, a VM employs a paravirtualised network driver (netfront in Linux or xennet in Windows) in conjunction with netback in the control domain, dom0.

a1sx2_Original2_single-queue.png

To the guest OS, the netfront driver feels just like a physical network device. When a guest wants to transmit data:

  • Netfront puts references to the page(s) containing that data into a "Transmit" ring buffer it shares with dom0.
  • Netback in dom0 picks up these references and maps the actual data from the guest's memory so it appears in dom0's address space.
  • Netback then hands the packet to the dom0 kernel, which uses normal routing rules to determine that it should go to an Open vSwitch device and then on to either a physical interface or the netback device for another guest on the same host.

When dom0 has a network packet it needs to send to the guest, the reverse procedure applies, using a separate "Receive" ring.

Amongst the factors that can limit network throughput are:

  1. the ring becoming full, causing netfront to have to wait before more data can be sent, and
  2. the netback process fully consuming an entire dom0 vCPU, meaning it cannot go any faster.

Multi-queue alleviates both of these potential bottlenecks.

What is multi-queue?

Rather than having a single Transmit and Receive ring per virtual interface (VIF), multi-queue means having multiple Transmit and Receive rings per VIF, and one netback thread for each:

a1sx2_Original1_multi-queue.png

Now, each TCP stream has the opportunity to be driven through a different Transmit or Receive ring. The particular ring chosen for each stream is determined by a hash of the TCP header (MAC, IP and port number of both the source and destination).

Crucially, this means that separate netback threads can work on each TCP stream in parallel. So where we were previously limited by the capacity of a single dom0 vCPU to process packets, now we can exploit several dom0 vCPUs. And where the capacity of a single Transmit ring limited the total amount of data in-flight, the system can now support a larger amount.

Which use-cases can take advantage of multi-queue?

Anything involving multiple TCP streams. For example, any kind of server VM that handles connections from more than one client at the same time.

Which guests can use multi-queue?

Since frontend changes are needed, the version of the guest's netfront driver matters. Although dom0 is geared up to support multi-queue, guests with old versions of netfront that lack multi-queue support are limited to single Transmit and Receive rings.

  • For Windows, the XenServer 7.0 xennet PV driver supports multi-queue.
  • For Linux, multi-queue support was added in Linux 3.16. This means that Debian Jessie 8.0 and Ubuntu 14.10 (or later) support multi-queue with their stock kernels. Over time, more and more distributions will pick up the relevant netfront changes.

How does the throughput scale with an increasing number of rings?

The following graph shows some measurements I made using iperf 2.0.5 between a pair of Debian 8.0 VMs both on a Dell R730xd host. The VMs each had 8 vCPUs, and iperf employed 8 threads each generating a separate TCP stream. The graph reports the sum of the 8 threads' throughputs, varying the number of queues configured on the guests' VIFs.
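For reference, results of this shape can be reproduced with a plain iperf 2 client/server pair along the following lines (the address and duration are placeholders):

# on the receiving VM
iperf -s

# on the sending VM: 8 parallel TCP streams for 60 seconds
iperf -c <receiver IP> -P 8 -t 60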

5104.png

We can make several observations from this graph:

  • The throughput scales well up to four queues, with four queues achieving more than double the throughput possible with a single queue.
  • The blip at five queues probably arose when the hashing algorithm failed to spread the eight TCP streams evenly across the queues, and is thus a measurement artefact. With different TCP port numbers, this may not have happened.
  • While the throughput generally increases with an increasing number of queues, the throughput is not proportional to the number of rings. Ideally, the throughput would double when you double the number of rings. This doesn't happen in practice because the processing is not perfectly parallelisable: netfront needs to demultiplex the streams onto the rings, and there are some overheads due to locking and synchronisation between queues.

This graph also highlights the substantial improvement over XenServer 6.5, in which only one queue per VIF was supported. In this use-case of eight TCP streams, XenServer 7.0 achieves 41 Gb/s out-of-the-box where XenServer 6.5 could manage only 17 Gb/s – an improvement of 140%.

How many rings do I get by default?

By default the number of queues is limited by (a) the number of vCPUs the guest has and (b) the number of vCPUs dom0 has. A guest with four vCPUs will get four queues per VIF.

This is a sensible default, but if you want to manually override it, you can do so in the guest. In a Linux guest, add the parameter xen_netfront.max_queues=n, for some n, to the kernel command-line.
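For example, on a Debian-style guest using GRUB2, one way to make the setting persistent is roughly as follows (the value of 4 and the use of update-grub are assumptions about your guest, not XenServer requirements):

# append the parameter to GRUB_CMDLINE_LINUX in /etc/default/grub, e.g.
#   GRUB_CMDLINE_LINUX="... xen_netfront.max_queues=4"
# then regenerate the grub configuration and reboot
update-grub
reboot

# after the reboot, the value in effect can usually be read back with
cat /sys/module/xen_netfront/parameters/max_queues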

Recent Comments
Tobias Kreidl
Hi, Jonathan: Thanks for the insightful pair of articles. It's interesting how what appear to be nuances can make large performan... Read More
Tuesday, 21 June 2016 04:54
Jonathan Davies
Thanks for sharing your thoughts, Tobias. You ask about queue polling. In fact, netback already does this! It achieves this by us... Read More
Wednesday, 22 June 2016 08:40
Sam McLeod
Interesting post Jonathan, I've tried adjusting `xen_netfront.max_queues` amongst other similar values on both guests and hosts h... Read More
Tuesday, 21 June 2016 13:01

XenServer 7.0 performance improvements part 1: Lower latency storage datapath

The XenServer team made a number of significant performance and scalability improvements in the XenServer 7.0 release. This is the first in a series of articles that will describe the principal improvements.

Our first topic is storage I/O performance. A performance improvement has been achieved through the adoption of a polling technique in tapdisk3, the component of XenServer responsible for handling I/O on virtual storage devices. Measurements of one particular use-case demonstrate a 50% increase in performance from 15,000 IOPS to 22,500 IOPS.

What is polling?

Normally, tapdisk3 operates in an event-driven manner. Here is a summary of the first few steps required when a VM wants to do some storage I/O:

  1. The VM's paravirtualised storage driver (called blkfront in Linux or xenvbd in Windows) puts a request in the ring it shares with dom0.
  2. It sends tapdisk3 a notification via an event-channel.
  3. This notification is delivered to domain 0 by Xen as an interrupt. If Domain 0 is not running, it will need to be scheduled in order to receive the interrupt.
  4. When it receives the interrupt, the domain 0 kernel schedules the corresponding backend process to run, tapdisk3.
  5. When tapdisk3 runs, it looks at the contents of the shared-memory ring.
  6. Finally, tapdisk3 finds the request which can then be transformed into a physical I/O request.

Polling is an alternative to this approach in which tapdisk3 repeatedly looks in the ring, speculatively checking for new requests. This means that steps 2–4 can be skipped: there's no need to wait for an event-channel interrupt, nor to wait for the tapdisk3 process to be scheduled: it's already running. This enables tapdisk3 to pick up the request much more promptly as it avoids these delays inherent to the event-driven approach.

The following diagram contrasts the timelines of these alternative approaches, showing how polling reduces the time until the request is picked up by the backend.

b2ap3_thumbnail_polling-explained.png

How does polling help improve storage I/O performance?

Polling is an established technique for reducing latency in event-driven systems. (One example of where it is used elsewhere to mitigate interrupt latency is in Linux networking drivers that use NAPI.)

Servicing I/O requests promptly is an essential part of optimising I/O performance. As I discussed in my talk at the 2015 Xen Project Developer Summit, reducing latency is the key to maintaining a low virtualisation overhead. As physical I/O devices get faster and faster, any latency incurred in the virtualisation layer becomes increasingly noticeable and translates into lower throughputs.

An I/O request from a VM has a long journey to physical storage and back again. Polling in tapdisk3 optimises one section of that journey.

Isn't polling really CPU intensive, and thus harmful?

Yes it is, so we need to handle it carefully. If left unchecked, polling could easily eat huge quantities of domain 0 CPU time, starving other processes and causing overall system performance to drop.

We have chosen to do two things to avoid consuming too much CPU time:

  1. Poll the ring only when there's a good chance of a request appearing. Of course, guest behaviour is totally unpredictable in general, but there are some principles that can increase our chances of polling at the right time. For example, one assumption we adopt is that it's worth polling for a short time after the guest issues an I/O request. It has issued one request, so there's a good chance that it will issue another soon after. And if this guess turns out to be correct, keep on polling for a bit longer in case any more turn up. If there are none for a while, stop polling and temporarily fall back to the event-based approach.
  2. Don't poll if domain 0 is already very busy. Since polling is expensive in terms of CPU cycles, we only enter the polling loop if we are sure that it won't starve other processes of CPU time they may need.

How much faster does it go?

The benefit you will get from polling depends primarily on the latency of your physical storage device. If you are using an old mechanical hard-drive or an NFS share on a server on the other side of the planet, shaving a few microseconds off the journey through the virtualisation layer isn't going to make much of a difference. But on modern devices and low-latency network-based storage, polling can make a sizeable difference. This is especially true for smaller request sizes since these are most latency-sensitive.

For example, the following graph shows an improvement of 50% in single-threaded sequential read I/O for small request sizes – from 15,000 IOPS to 22,500 IOPS. These measurements were made with iometer in a 32-bit Windows 7 SP1 VM on a Dell PowerEdge R730xd with an Intel P3700 NVMe drive.

b2ap3_thumbnail_5071.png

How was polling implemented?

The code to add polling to tapdisk3 can be found in the following set of commits: https://github.com/xapi-project/blktap/pull/179/commits.


XenServer Dundee Released

It was a little over a year ago when I introduced a project code named Dundee to this community. In the intervening year, we've had a number of pre-release builds, all introducing ever greater capabilities into what I'm now happy to announce as XenServer 7. As you would expect from a major version number, XenServer 7 makes some rather significant strides forward, and defines a significant new capability.

Let's start first with the significant new capability. Some of you may have noted an interesting new security effort appear in upstream Xen a few years ago. Leading this effort was Bitdefender, and at the time it was known by the catchy title of "virtual machine introspection". This effort takes full advantage of the Intel EPT virtualization extensions to permit a true agentless anti-malware solution, where the anti-malware engine is placed in a service VM which is inaccessible from the guest VMs. XenServer 7 officially supports this technology with the Direct Inspect API set, and is platform ready for Bitdefender GravityZone HVI. For virtualization users, the combination of Direct Inspect and GravityZone HVI reduces the attack surface for malware by both removing in-guest agents, and by actively monitoring memory usage from the hypervisor to detect malicious memory accesses and flag questionable activity for remediation. When combined with support for Intel SMAP and PML, XenServer 7 offers significantly increased security compared to previous versions. Since secure operation extends to secure access to the host management APIs, XenServer 7 fully supports TLS 1.2, and can optionally mandate the use of TLS 1.2.

XenServer 7 extends the vGPU market initially defined in 2013 to include both increased scalability with NVIDIA GRID Maxwell M10 and the latest Intel Iris Pro virtual graphics. When combined, these vGPU extensions open the door to greater adoption of virtualized graphics by both increasing the number of GPU enabled VMs per host, as well as potentially removing the requirement for a dedicated GPU add-in card.

Operating virtual infrastructure at any level of scale requires an understanding of the overall health of the environment. While recent XenServer versions have included the ability to upload server status information to the free Citrix Insight Services, this operation was completely manual. With XenServer 7, we're introducing Health Check, a proactive service that works in concert with Insight Services to monitor the operational health of a XenServer environment and alert you to any issues. The best part of Health Check is that it's completely free and open to any user of XenServer 7.

No major release would be complete without a requisite bump in performance, and XenServer 7 is no exception. Host memory limits have been bumped to 5TB per host, with a corresponding bump to 1.5TB per VM; OS willing of course. Host CPU count has been increased to 288 cores, and guest virtual CPU count has increased to 32; again OS willing. Disk scalability has also increased with support for up to 255 virtual block devices per VM and 4096 VBDs per host, all while supporting up to 20,000 VDIs per SR. Since XenServer often is deployed in Microsoft Windows environments, Active Directory support for role based authentication is a key requirement, and with XenServer 7, we've improved overall AD performance to support very large AD forests with a resulting improvement in login times.

 

XenServer 7 is available for download today, and can be obtained for free from the XenServer download page.

Recent Comments
Willem Boterenbrood
Congrats on the new release! We were waiting for it to arrive, finally XenServer support for Xeon v3/4 CPU masking and much more i... Read More
Tuesday, 24 May 2016 18:22
David Cottingham
Fixed -- thanks for catching that :-). Upgrades: please see http://docs.citrix.com/content/dam/docs/en-us/xenserver/xenserver-7-0... Read More
Tuesday, 07 June 2016 14:59
David Cottingham
DVSC is supported on 7.0. We're working on getting the downloads on citrix.com accessible.
Tuesday, 07 June 2016 15:07

XenServer Administrators Handbook Published

Last year, I announced that we were working on a XenServer Administrators Handbook, and I'm very pleased to announce that it's been published. Not only have we been published, but based on the Amazon reviews to date we've done a pretty decent job. In part, I suspect that has a ton to do with the book being focused on what information you, XenServer administrators, need to be successful when running a XenServer environment regardless of scale or workload.

The handbook is formatted following a simple premise: first you need to plan your deployment, and second you need to run it. With that in mind, we start with exactly what a XenServer is, define how it works and what expectations it has of infrastructure. After all, it's critical to understand how a product like XenServer interfaces with the real world, and how its virtual objects relate to each other. We even cover some of the misunderstandings those new to XenServer might have.

While it might be tempting to go deep on some of this stuff, Jesse and I both recognized that virtualization SREs have a job to do and that's to run virtual infrastructure. As interesting as it might be to dig into how the product is implemented, that's not the role of an administrators handbook. That's why the second half of the book provides some real world scenarios, and how to go about solving them.

We had an almost limitless list of scenarios to choose from, and what you see in the book represents real world situations which most SREs will face at some point. The goal of this format being to have a handbook which can be actively used, not something which is read once and placed on some shelf (virtual or physical). During the technical review phase, we sent copies out to actual XenServer admins, all of whom stated that we'd presented some piece of information they hadn't previously known. I for one consider that to be a fantastic compliment.

Lastly, I want to finish off by saying that like all good works, this is very much a "we" effort. Jesse did a top-notch job as co-author and brings the experience of someone whose job it is to help solve customer problems. Our technical reviewers added tremendously to the polish you'll find in the book. The O'Reilly Media team was a pleasure to work with, pushing when we needed to be pushed but understanding that day jobs and family take precedence.

So whether you're looking at XenServer out of personal interest, have been tasked with designing a XenServer installation to support Citrix workloads, clouds, or for general purpose virtualization, or have a XenServer environment to call your own, there is something in here for you. On behalf of Jesse, we hope that everyone who gets a copy finds it valuable. The XenServer Administrator's handbook is available from book sellers everywhere including:

Amazon: http://www.amazon.com/XenServer-Administration-Handbook-Successful-Deployments/dp/149193543X/

Barnes and Noble: http://www.barnesandnoble.com/w/xenserver-administration-handbook-tim-mackey/1123640451

O'Reilly Media: http://shop.oreilly.com/product/0636920043737.do

If you need a copy of XenServer to work with, you can obtain that for free from: http://xenserver.org/download

Recent Comments
Tobias Kreidl
A timely publication, given all the major recent enhancements to XenServer. It's packed with a lot of hands-on, practical advice a... Read More
Tuesday, 03 May 2016 03:37
Eric Hosmer
Been looking forward to getting this book, just purchased it on Amazon. Now I just need to find that mythical free time to read ... Read More
Friday, 06 May 2016 22:41

Implementing VDI-per-LUN storage

With storage providers adding better functionality to provide features like QoS, fast snapshot & clone and with the advent of storage-as-a-service, we are interested in the ability to utilize these features from XenServer. VMware’s VVols offering already allows integration of vendor provided storage features into their hypervisor. Since most storage allows operations at the granularity of a LUN, the idea is to have a one-to-one mapping between a LUN on the backend and a virtual disk (VDI) on the hypervisor. In this post we are going to talk about the supplemental pack that we have developed in order to enable VDI-per-LUN.

XenServer Storage

To understand the supplemental pack, it is useful to first review how XenServer storage works. In XenServer, a storage repository (SR) is a top-level entity which acts as a pool for storing VDIs which appear to the VMs as virtual disks. XenServer provides different types of SRs (File, NFS, Local, iSCSI). In this post we will be looking at iSCSI based SRs as iSCSI is the most popular protocol for remote storage and the supplemental pack we developed is targeted towards iSCSI based SRs. An iSCSI SR uses LVM to store VDIs over logical volumes (hence the type is lvmoiscsi). For instance:

[root@coe-hq-xen08 ~]# xe sr-list type=lvmoiscsi
uuid ( RO)                : c67132ec-0b1f-3a69-0305-6450bfccd790
          name-label ( RW): syed-sr
    name-description ( RW): iSCSI SR [172.31.255.200 (iqn.2001-05.com.equallogic:0-8a0906-c24f8b402-b600000036456e84-syed-iscsi-opt-test; LUN 0: 6090A028408B4FC2846E4536000000B6: 10 GB (EQLOGIC))]
                host ( RO): coe-hq-xen08
                type ( RO): lvmoiscsi
        content-type ( RO):

The above SR is created from a LUN on a Dell EqualLogic. The VDIs belonging to this SR can be listed by:

[root@coe-hq-xen08 ~]# xe vdi-list sr-uuid=c67132ec-0b1f-3a69-0305-6450bfccd790 params=uuid
uuid ( RO)    : ef5633d2-2ad0-4996-8635-2fc10e05de9a

uuid ( RO)    : b7d0973f-3983-486f-8bc0-7e0b6317bfc4

uuid ( RO)    : bee039ed-c7d1-4971-8165-913946130d11

uuid ( RO)    : efd5285a-3788-4226-9c6a-0192ff2c1c5e

uuid ( RO)    : 568634f9-5784-4e6c-85d9-f747ceeada23

[root@coe-hq-xen08 ~]#

This SR has 5 VDIs. From LVM's perspective, an SR is a volume group (VG) and each VDI is a logical volume (LV) inside that volume group. This can be seen via the following commands:

[root@coe-hq-xen08 ~]# vgs | grep c67132ec-0b1f-3a69-0305-6450bfccd790
  VG_XenStorage-c67132ec-0b1f-3a69-0305-6450bfccd790   1   6   0 wz--n-   9.99G 5.03G
[root@coe-hq-xen08 ~]# lvs VG_XenStorage-c67132ec-0b1f-3a69-0305-6450bfccd790
  LV                                       VG                                                 Attr   LSize 
  MGT                                      VG_XenStorage-c67132ec-0b1f-3a69-0305-6450bfccd790 -wi-a-   4.00M                                 
  VHD-568634f9-5784-4e6c-85d9-f747ceeada23 VG_XenStorage-c67132ec-0b1f-3a69-0305-6450bfccd790 -wi-ao   8.00M                               
  VHD-b7d0973f-3983-486f-8bc0-7e0b6317bfc4 VG_XenStorage-c67132ec-0b1f-3a69-0305-6450bfccd790 -wi-ao   2.45G                               
  VHD-bee039ed-c7d1-4971-8165-913946130d11 VG_XenStorage-c67132ec-0b1f-3a69-0305-6450bfccd790 -wi---   8.00M                                
  VHD-ef5633d2-2ad0-4996-8635-2fc10e05de9a VG_XenStorage-c67132ec-0b1f-3a69-0305-6450bfccd790 -ri-ao   2.45G
VHD-efd5285a-3788-4226-9c6a-0192ff2c1c5e VG_XenStorage-c67132ec-0b1f-3a69-0305-6450bfccd790 -ri-ao  36.00M

Here c67132ec-0b1f-3a69-0305-6450bfccd790 is the UUID of the SR. Each VDI is represented by a corresponding LV whose name is of the format VHD-<VDI UUID>. Some of the LVs have a small size of 8MB; these are snapshots taken on XenServer. There is also an LV named MGT which holds metadata about the SR and the VDIs present in it. Note that all of this is contained in the SR, which is a LUN on the backend storage.

Now XenServer can attach a LUN at the level of an SR but we want to map a LUN to a single VDI. In order to do that, we restrict an SR to contain a single VDI. Our new SR has the following LVs:

[root@coe-hq-xen09 ~]# lvs VG_XenStorage-1fe527a4-7e96-cdd9-f347-a15c240f26e9
LV                                       VG                                                 Attr   LSize
MGT                                      VG_XenStorage-1fe527a4-7e96-cdd9-f347-a15c240f26e9 -wi-a- 4.00M
VHD-09b14a1b-9c0a-489e-979c-fd61606375de VG_XenStorage-1fe527a4-7e96-cdd9-f347-a15c240f26e9 -wi--- 8.02G
[root@coe-hq-xen09 ~]#

b2ap3_thumbnail_vdi-lun.png

If a snapshot or clone of the LUN is taken on the backend, all the unique identifiers associated with the different entities in the LUN also get cloned and any attempt to attach the LUN back to XenServer will result in an error because of conflicts of unique IDs.

Resignature and supplemental pack

In order for the cloned LUN to be re-attached, we need to resignature the unique IDs present in the LUN. The following IDs need to be resignatured:

  • LVM UUIDs (PV, VG, LV)
  • VDI UUID
  • SR metadata in the MGT Logical volume

We at CloudOps have developed an open-source supplemental pack which solves the resignature problem. You can find it here. The supplemental pack adds a new type of SR (relvmoiscsi) and you can use it to resignature your lvmoiscsi SRs. After installing the supplemental pack, you can resignature a clone using the following command:

[root@coe-hq-xen08 ~]# xe sr-create name-label=syed-single-clone type=relvmoiscsi \
device-config:target=172.31.255.200 \
device-config:targetIQN=$IQN \
device-config:SCSIid=$SCSIid \
device-config:resign=true \
shared=true
Error code: SR_BACKEND_FAILURE_1
Error parameters: , Error reporting error, unknown key The SR has been successfully resigned. Use the lvmoiscsi type to attach it,
[root@coe-hq-xen08 ~]#

Here, instead of creating a new SR, the supplemental pack re-signatures the provided LUN and detaches it (the error is expected as we don’t actually create an SR). You can see from the error message that the SR has been re-signed successfully. Now the cloned SR can be introduced back to XenServer without any conflicts using the following commands:

[root@coe-hq-xen09 ~]# xe sr-probe type=lvmoiscsi device-config:target=172.31.255.200 device-config:targetIQN=$IQN device-config:SCSIid=$SCSIid

   		 5f616adb-6a53-7fa2-8181-429f95bff0e7
   		 /dev/disk/by-id/scsi-36090a028408b3feba66af52e0000a0e6
   		 5364514816

[root@coe-hq-xen09 ~]# xe sr-introduce name-label=vdi-test-resign type=lvmoiscsi \
uuid=5f616adb-6a53-7fa2-8181-429f95bff0e7
5f616adb-6a53-7fa2-8181-429f95bff0e7
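Note that sr-introduce only registers the SR record. Before the SR can actually be used, you would typically still create and plug a PBD on each host, reusing the same device-config values as the probe (the host and PBD UUIDs below are placeholders):

[root@coe-hq-xen09 ~]# xe pbd-create sr-uuid=5f616adb-6a53-7fa2-8181-429f95bff0e7 host-uuid=<host uuid> \
device-config:target=172.31.255.200 device-config:targetIQN=$IQN device-config:SCSIid=$SCSIid
[root@coe-hq-xen09 ~]# xe pbd-plug uuid=<pbd uuid returned by the previous command>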

This supplemental pack can be used in conjunction with an external orchestrator like CloudStack or OpenStack which can manage both the storage and compute. Working with SolidFire we have implemented this functionality, available in the next release of Apache CloudStack. You can check out a preview of this feature in a screencast here.

Recent Comments
Nick
If I am reading this correctly, this is just basically setting up XS to use 1 SR per VM, this isn't scalable as the limits for LUN... Read More
Tuesday, 26 April 2016 14:57
Syed Ahmed
Hi Nick, The limit of 256 SRs is when using Multipating. If no multipath is used, the number of SRs that can be created are well... Read More
Tuesday, 26 April 2016 17:19
Syed Ahmed
There is an initial overhead when creating SRs. However, we did not find any performance degradation in our tests once the SR is s... Read More
Wednesday, 27 April 2016 09:21

Running XenServer... without a server

With the exciting release of the latest XenServer Dundee beta, the immediate reaction is to download it to give it a whirl to see all the shiny new features (and maybe to find out if your favourite bug has been fixed!). Unfortunately, it's not something that can just be installed, tested and uninstalled like a normal application - you'll need to find yourself a server somewhere you're willing to sacrifice in order to try it out. Unless, of course, you decide to use the power of virtualisation!

XenServer as a VM

Nested virtualisation - running a VM inside another VM - is not something that anyone recommends for production use, or even something that works at all in some cases. However, since Xen has its origins way back before hardware virtualisation became ubiquitous in Intel processors, running full PV guests (which don't require any hardware extensions) inside a XenServer VM actually works very well indeed. So for the purposes of evaluating a new release of XenServer it's actually a really good solution. It's also ideal for trying out many of the Unikernel implementations, such as Mirage or Rump kernels, as these are pure PV guests too.

XenServer works very nicely when run on another XenServer, and indeed this is what we use extensively to develop and test our own software. But once again, not everyone has spare capacity to do this. So let's look to some other virtualisation solutions that aren't quite so server focused and that you might well have installed on your own laptop. Enter Oracle's VirtualBox.

VirtualBox, while not as performant a virtualization solution as Xen, is a very capable platform that runs XenServer without any problems. It also has the advantage of being easily installable on your own desktop or laptop. Therefore it's an ideal way to try out these betas of XenServer in a quick and convenient way. It also has some very convenient tools that have been built around it, one of which is Vagrant.

Vagrant

Vagrant is a tool for provisioning and managing virtual machines. It targets several virtualization platforms including VirtualBox, which is what we'll use now to install our XenServer VM. The model is that it takes a pre-installed VM image - what Vagrant calls a 'box' - and some provisioning scripts (using scripts, Salt, Chef, Ansible or others), and sets up the VM in a reproducible way. One of its key benefits is it simplifies management of these boxes, and Hashicorp run a service called Atlas that will host your boxes and metadata associated with them. We have used this service to publish a Vagrant box for the Dundee Beta. 

Try the Dundee Beta

Once you have Vagrant installed, trying the Dundee beta is as simple as:

vagrant init xenserver/dundee-beta
vagrant up

This will download the box image (about 1 GB) and create a new VM from it. As it's booting, it will ask which network to bridge onto; if you want your nested VMs to be available on the network, this should be a wired network rather than a wireless one.

The XenServer image is tweaked a little bit to make it easier to access - for example, it will by default DHCP all of the interfaces, which is useful for testing XenServer, but wouldn't be advisable for a real deployment. To connect to your XenServer, we need to find the IP address, so the simplest way of doing this is to ssh in and ask:

Mac-mini:xenserver jon$ vagrant ssh -c "sudo xe pif-list params=IP,device"
device ( RO) : eth1
     IP ( RO): 192.168.1.102
device ( RO) : eth2
     IP ( RO): 172.28.128.5
device ( RO) : eth0
     IP ( RO): 10.0.2.15

So you should be able to connect using one of those IPs via XenCenter or via a browser to download XenCenter (or via any other interface to XenServer).

Going Deeper

Let's now go all Inception and install ourselves a VM within our XenServer VM. Let's assume, for the sake of argument, and because as I'm writing this it's quite true, that we're not running on a Windows machine, nor do we have one handy to run XenCenter on. We'll therefore restrict ourselves to using the CLI.

As mentioned before, HVM VMs are out so we're limited to pure PV guests. Debian Wheezy is a good example of one of these. First, we need to ssh in and become root:

Mac-mini:xenserver jon$ vagrant ssh
Last login: Thu Mar 31 00:10:29 2016 from 10.0.2.2
[vagrant@localhost ~]$ sudo bash
[root@localhost vagrant]#

Now we need to find the right template:

[root@localhost vagrant]# xe template-list name-label="Debian Wheezy 7.0 (64-bit)"
uuid ( RO)                : 429c75ea-a183-a0c0-fc70-810f28b05b5a
          name-label ( RW): Debian Wheezy 7.0 (64-bit)
    name-description ( RW): Template that allows VM installation from Xen-aware Debian-based distros. To use this template from the CLI, install your VM using vm-install, then set other-config-install-repository to the path to your network repository, e.g. http:///

Now, as the description says, we use 'vm-install' and set the mirror:

[root@localhost vagrant]# xe vm-install template-uuid=429c75ea-a183-a0c0-fc70-810f28b05b5a new-name-label=wheezy
479f228b-c502-a791-85f2-a89a9f58e17f
[root@localhost vagrant]# xe vm-param-set uuid=479f228b-c502-a791-85f2-a89a9f58e17f other-config:install-repository=http://ftp.uk.debian.org/debian

The VM doesn't have any network connection yet, so we'll need to add a VIF. We saw the IP addresses of the network interfaces above, and in my case eth1 corresponds to the bridged network I selected when starting the XenServer VM using Vagrant. I need the uuid of that network, so I'll list the networks:

[root@localhost vagrant]# xe network-list
uuid ( RO)                : c7ba748c-298b-20dc-6922-62e6a6645648
          name-label ( RW): Pool-wide network associated with eth2
    name-description ( RW):
              bridge ( RO): xenbr2

uuid ( RO)                : f260c169-20c3-2e20-d70c-40991d57e9fb 
          name-label ( RW): Pool-wide network associated with eth1  
    name-description ( RW): 
              bridge ( RO): xenbr1 

uuid ( RO)                : 8d57e2f3-08aa-408f-caf4-699b18a15532 
          name-label ( RW): Host internal management network 
    name-description ( RW): Network on which guests will be assigned a private link-local IP address which can be used to talk XenAPI 
              bridge ( RO): xenapi 

uuid ( RO)                : 681a1dc8-f726-258a-eb42-e1728c44df30 
          name-label ( RW): Pool-wide network associated with eth0 
    name-description ( RW):
              bridge ( RO): xenbr0

So I need a VIF on the network with uuid f260c...

[root@localhost vagrant]# xe vif-create vm-uuid=479f228b-c502-a791-85f2-a89a9f58e17f network-uuid=f260c169-20c3-2e20-d70c-40991d57e9fb device=0
e96b794e-fef3-5c2b-8803-2860d8c2c858

All set! Let's start the VM and connect to the console:

[root@localhost vagrant]# xe vm-start uuid=479f228b-c502-a791-85f2-a89a9f58e17f
[root@localhost vagrant]# xe console uuid=479f228b-c502-a791-85f2-a89a9f58e17f

This should drop us into the Debian installer:

b2ap3_thumbnail_Screen-Shot-2016-03-31-at-00.55.07.png

A few keystrokes later and we've got ourselves a nice new VM all set up and ready to go.

All of the usual operations will work; start, shutdown, reboot, suspend, checkpoint and even, if you want to set up two XenServer VMs, migration and storage migration. You can experiment with bonding, try multipathed ISCSI, check that alerts are generated, and almost anything else (with the exception of HVM and anything hardware specific such as VGPUs, of course!). It's also an ideal companion to the Docker build environment I blogged about previously, as any new things you might be experimenting with can be easily built using Docker and tested using Vagrant. If anything goes wrong, a 'vagrant destroy' followed by a 'vagrant up' and you've got a completely fresh XenServer install to try again in less than a minute.
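That reset cycle is simply:

vagrant destroy -f && vagrant up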

The Vagrant box is itself built using Packer, a tool commonly used for creating Vagrant box images.  The configuration for this is available on GitHub, and feedback on this box is very welcome!

In a future blog post, I'll be discussing how to use Vagrant to manage XenServer VMs.

Recent Comments
Ivan Grynenko
Very interesting post, thank you. We maintain XenServer VM for testing purposes of our GitHub project https://github.com/ivrh/xs D... Read More
Tuesday, 12 April 2016 00:44
Jon Ludlam
Hi Ivan, Thanks! Very interesting project you've got there too - looks really nice. To change the IP settings within the XenServ... Read More
Wednesday, 13 April 2016 00:19

NAU VMbackup 3.0 for XenServer

NAU VMbackup 3.0 for XenServer

By Tobias Kreidl and Duane Booher

Northern Arizona University, Information Technology Services

Over eight years ago, back in the days of XenServer 5, not a lot of backup and restore options were available, either as commercial products or as freeware, and we quickly came to the realization that data recovery was a vital component to a production environment and hence we needed an affordable and flexible solution. The conclusion at the time was that we might as well build our own, and though the availability of options has grown significantly over the last number of years, we’ve stuck with our own home-grown solution which leverages Citrix XenServer SDK and XenAPI (http://xenserver.org/partners/developing-products-for-xenserver.html). Early versions were created from the contributions of Douglas Pace, Tobias Kreidl and David McArthur. During the last several years, the lion’s share of development has been performed by Duane Booher. This article discusses the latest 3.0 release.

A Bit of History

With the many alternatives now available, one might ask why we have stuck with this rather un-flashy script and CLI-based mechanism. There are clearly numerous reasons. For one, in-house products allow total control over all aspects of their development and support. The financial outlay is all people’s time and since there are no contracts or support fees, it’s very controllable and predictable. We also found from time-to-time that various features were not readily available in other sources we looked at. We also felt early on as an educational institution that we could give back to the community by freely providing the product along with its source code; the most recent version is available via GitHub at https://github.com/NAUbackup/VmBackup for free under the terms of the GNU General Public License. There was a write-up at https://www.citrix.com/blogs/2014/06/03/another-successful-community-xenserver-sdk-project-free-backup-tools-and-scripts-naubackup-restore-v2-0-released/ when the first GitHub version was published. Earlier versions were made available via the Citrix community site (Citrix Developer Network), sometimes referred to as the Citrix Code Share, where community contributions were published for a number of products. When that site was discontinued in 2013, we relocated the distribution to GitHub.

Because we “eat our own dog food,” VMbackup gets extensive and constant testing because we rely on it ourselves as the means to create backups and provide for restores for cases of accidental deletion, unexpected data corruption, or in the event that disaster recovery might be needed. The mechanisms are carefully tested before going into production and we perform frequent tests to ensure the integrity of the backups and that restores really do work. A number of times, we have relied on resorting to recovering from our backups and it has been very reassuring that these have been successful.

What VMbackup Does

Very simply, VMbackup provides a framework for backing up virtual machines (VMs) hosted on XenServer to an external storage device, as well as the means to recover such VMs for whatever reason that might have resulted in loss, be it disaster recovery, restoring an accidentally deleted VM, recovering from data corruption, etc.

The VMbackup distribution consists of a script written in Python and a configuration file. Other than a README file, that's it, apart from the XenServer SDK components which one needs to download separately; see http://xenserver.org/partners/developing-products-for-xenserver.html for details. There is no fancy GUI to become familiar with; instead, just a few simple things need to be configured, plus a destination for the backups needs to be made accessible (generally an NFS share, though SMB/CIFS works as well). Using cron job entries, a single host or an entire pool can be set up to perform periodic backups. Configuration is needed on each host in a pool: the pool master performs the majority of the work, and the master role can readily move to a different XenServer, while host-based instances are also needed when local storage is used, since local SRs can only be accessed from their own XenServer. A cron entry and numerous configuration examples are given in the documentation.

To avoid interruptions of any running VMs, the process of backing up a VM follows these basic steps:

  1. A snapshot of the VM and its storage is made
  2. Using the xe utility vm-export, that snapshot is exported to the target external storage
  3. The snapshot is deleted, freeing up that space

In addition, some VM metadata are collected and saved, which can be very useful in the event a VM needs to be restored. The metadata include:

  • vm.cfg - includes name_label, name_description, memory_dynamic_max, VCPUs_max, VCPUs_at_startup, os_version, orig_uuid
  • DISK-xvda (for each attached disk)
    • vbd.cfg - includes userdevice, bootable, mode, type, unplugable, empty, orig_uuid
    • vdi.cfg - includes name_label, name_description, virtual_size, type, sharable, read_only, orig_uuid, orig_sr_uuid
  • VIFs (for each attached VIF)
    • vif-0.cfg - includes device, network_name_label, MTU, MAC, other_config, orig_uuid

An additional option is to create a backup of the entire XenServer pool metadata, which is essential in dealing with the aftermath of a major disaster that affects the entire pool. This is accomplished via the “xe pool-dump-database” command.
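For example, the pool metadata can be dumped alongside the VM backups with a single command (the file path here is purely illustrative):

xe pool-dump-database file-name=/snapshots/BACKUPS/pool-metadata.dump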

In the event of errors, there are automatic clean-up procedures in place that will remove any remnants plus make sure that earlier successful backups are not purged beyond the specified number of “good” copies to retain.

There are numerous configuration options that allow you to specify which VMs get backed up, how many backup versions are to be retained, and whether the backups should be compressed (1) as part of the process, as well as optional report generation and notification setups.

New Features in VMbackup 3.0

A number of additional features have been added to this latest 3.0 release, adding flexibility and functionality. Some of these came about because of the sheer number of VMs that needed to be dealt with, SR space issues as well as with changes coming to the next XenServer release. These additions include:

  • VM “preview” option: To be able to look for syntax errors and ensure parameters are being defined properly, a VM can have a syntax check performed on it and if necessary, adjustments can then be made based on the diagnosis to achieve the desired configuration.
  • Support for VM names containing spaces: By surrounding VM names in the configuration file with double quotes, VM names containing spaces can now be processed.
  • Wildcard suffixes: This very versatile option permits groups of VMs to be configured to be handled similarly, eliminating the need to create individual settings for every desired VM. Examples include “PRD-*”, “SQL*” and in fact, if all VMs in the pool should be backed up, even “*”. There are however, a number of restrictions on wildcard usage (2).
  • Exclude VMs: Along with the wildcard option to select which VMs to back up, a need clearly arises to provide the means to exclude certain VMs (in addition to the other alternative, which is simply to rename them such that they do not match a certain backup set). Currently, each excluded VM must be named separately and any such VMs should be defined at the end of the configuration file.
  • Export the OS disk VDI, only: In some cases, a VM may contain multiple storage devices (VDIs) that are so large that it is impractical or impossible to take a snapshot of the entire VM and its storage. Hence, we have introduced the means to back up and restore only the operating system device (assumed to be Disk 0). In addition to space limitations, some storage, such as DB data, may not be able to be reliably backed up using a full VM snapshot. Furthermore, the next XenServer release (Dundee) will likely support as many as 255 storage devices per VM, making a vm-export even more involved under such circumstances. Another big advantage is that this process is currently more efficient and faster than a full VM export by a factor of three or more!
  • Root password obfuscation: So that clear-text passwords associated with the XenServer pool are not embedded in the scripts themselves, the password can be basically encoded into a file.

The mechanism for a running VM from which only the primary disk is to be backed up is similar to the full VM backup. The process of backing up such a VM follows these basic steps:

  1. A snapshot of just the VM's Disk 0 storage is made
  2. Using the xe utility vdi-export, that snapshot is exported to the target external storage
  3. The snapshot is deleted, freeing up that space

As with the full VM export, some metadata for the VM are also collected and saved for this VDI export option.

These added features are of course subject to change in future releases, though typically later editions generally encompass the support of previous versions to preserve backwards compatibility.

Examples

Let’s look at the configuration file weekend.cfg:

# Weekend VMs
max_backups=4
backup_dir=/snapshots/BACKUPS
#
vdi-export=PROD-CentOS7-large-user-disks
vm-export=PROD*
vm-export=DEV-RH*:3
exclude=PROD-ubuntu12-benchmark
exclude=PRODtestVM

Comment lines start with a hash mark and may appear anywhere within the file. The hash mark must be the first character in the line. Note that the default number of retained backups is set here to four. The destination directory is set next, indicating where the backups will be written. We then see a case where only the OS disk is backed up for the specific VM "PROD-CentOS7-large-user-disks", and below that, all VMs beginning with "PROD" are backed up using the default settings. Just below that, a definition is created for all VMs starting with "DEV-RH", for which the number of backups is reduced from the global default of four down to three. Finally, we see two excludes for specific VMs that fall into the "PROD*" group and should not be backed up at all.

To launch the script manually, you would issue from the command line:

./VmBackup.py password weekend.cfg

To launch the script via a cron job, you would create a single-line entry like this:

10 0 * * 6 /usr/bin/python /snapshots/NAUbackup/VmBackup.py password
/snapshots/NAUbackup/weekend.cfg >> /snapshots/NAUbackup/logs/VmBackup.log 2>&1

This would run the task at ten minutes past midnight on Saturday and create a log entry called VmBackup.log. This cron entry would need to be installed on each host of a XenServer pool.

Additional Notes

It can be helpful to break up when backups are run so that they don’t all have to be done at once, which may be impractical, take so long as to possibly impact performance during the day, or need to be coordinated with when is best for specific VMs (such as before or after patches are applied). These situations are best dealt with by creating separate cron jobs for each subset.

There is a fair load on the server, comparable to any vm-export, and hence the queue is processed linearly with only one active snapshot and export sequence for a VM being run at a time. This is also why we suggest you perform the backups and then asynchronously perform any compression on the files on the external storage host itself to alleviate the CPU load on the XenServer host end.

For even more redundancy, you can readily duplicate or mirror the backup area to another storage location, perhaps in another building or even somewhere off-site. This can readily be accomplished using various copy or mirroring utilities, such as rcp, sftp, wget, nsync, rsync, etc.

This latest release has been tested on XenServer 6.5 (SP1) and various beta and technical preview versions of the Dundee release. In particular, note that the vdi-export utility, while dating back a while, is not well documented and we strongly recommend not trying to use it on any XenServer release before XS 6.5. Doing so is clearly at your own risk.

The NAU VMbackup distribution can be found at: https://github.com/NAUbackup/VmBackup

In Conclusion

This is a misleading heading, as there is not really a conclusion in the sense that this project continues to be active and as long as there is a perceived need for it, we plan to continue working on keeping it running on future XenServer releases and adding functionality as needs and resources dictate. Our hope is naturally that the community can make at least as good use of it as we have ourselves.

Footnotes:

  1. Alternatively, to save time and resources, the compression can potentially be handled asynchronously by the host onto which the backups are written, hence reducing overhead and resource utilization on the XenServer hosts, themselves.
  2. Certain limitations exist currently with how wildcards can be utilized. Leading wildcards are not allowed, nor are multiple wildcards within a string. This may be enhanced at a later date to provide even more flexibility.

This article was written by Tobias Kreidl and Duane Booher, both of Northern Arizona University, Information Technology Services. Tobias' biography is available at this site, and Duane's LinkedIn profile is at https://www.linkedin.com/in/duane-booher-a068a03 while both can also be found on http://discussions.citrix.com primarily in the XenServer forum.     

Recent Comments
Lorscheider Santiago
Tobias Kreidl and Duane Booher, Greart Article! you have thought of a plugin for XenCenter?
Saturday, 09 April 2016 13:28
Tobias Kreidl
Thank you, Lorscheider, for your comment. Our thoughts have long been that others could take this to another level by developing a... Read More
Thursday, 14 April 2016 01:34
Niklas Ahden
Hi, First of all I want to thank you for this great article and NAUBackup. I am wondering about the export-performance while usin... Read More
Sunday, 17 April 2016 19:14

XenServer + Docker + CEPH/RBDSR for shared local storage

No need to dance around it - the title says it all :)

So straight to it. What I've got here is an example of how to set up a XenServer pool to use local storage as CEPH OSDs, running on each host as Docker containers, and to present RBD objects from those CEPH Docker containers back to the pool as shared storage - wow, this is a mouthful :)

Please keep in mind that this post is full of unsupported, potentially dangerous, fairly unstable commands and technologies - just the way I like it :) But don’t go and set it up in production - this is just a demo, mostly for fun.

What's the point of it all? Well, you don't get more space, that's for sure. The idea is that you would run a copy of the same data on each host, which might make it easier to migrate things around and manage backups (since we have a copy of everything everywhere). Besides - local storage is cheap and fun :)

So let’s get cooking!

Ingredients:

Cookery:

Prepare XenServer Hosts

Install XenServer with defaults, except don't use any local storage (i.e. untick sda when selecting the disk for local storage; see steps 4-6 for details).

This should give you a XenServer hosts with Removable Storage and CDROM SRs only.

Once you've installed all three hosts, join them into a pool and patch to the latest hotfix level (it's always a good idea anyway ;) )

Install the supplemental pack for Docker and my RBDSR.py stuff on each one.

Prepare SRs and physical partitions

Now we are going to create partitions on the remaining space of the local disk and map those partitions as VDIs:

In dom0 of each host, create a folder; let's call it partitions:

# mkdir /partitions

Then create SR type=file:

# xe sr-create type=file device-config:location=/partitions name-label=partitions \
host-uuid=<host you created folder on>

Here you can place file VDIs (VHD or RAW) and link physical disks and partitions.

Let's do just that. Create the third and fourth partitions on the local disk:

# gdisk /dev/sda
  • n -> enter (partition 3, default) -> enter (starting sector, default first available) -> +20G -> enter
  • n -> enter (partition 4, default) -> enter (starting sector, default next after partition 3) -> enter (end sector, default remainder of the disk)
  • w -> Y

To recap, we've used the GPT disk utility (gdisk) to create two new partitions, numbered 3 and 4, starting at the end of sda2 and spanning the remaining space. The first partition we've created (sda3, 20GB) will be used as a VDI for the VM that will run Docker, and the second partition (sda4, around 448GB) will be used as the OSD disk for CEPH. Then we've written the changes to the disk and said "Y"es to the warning about this potentially dangerous operation.

Now that we have our raw disk partitioned, we will introduce the partitions as VDIs in the "partitions" SR. But at the moment those partitions are not available, since we've modified the partition table of the root disk and Dom0 is not too keen on letting it go to re-read it. We could do some remount magic, but since nothing is running yet, we'll just reboot the hosts before mapping the VDIs…

OK, so the hosts are now rebooted, let’s map the VDI:

# ln -s /dev/sda3 /partitions/$(uuidgen).raw

This will create a symlink with a random (but unique) UUID and the extension .raw inside the partitions folder, pointing to sda3. If you rescan that SR now, a new disk will pop up with size 0 and no name. Give it a name and don't worry about the size. The SM stuff doesn't get sg info from symlinks to partitions - which is a good thing I guess :) Besides, if you want to know how big it is, just run gdisk -l /dev/sda3 in dom0. Our guest that will use this disk will be fully aware of its geometry and properties, thanks to tapdisk magic.

Repeat the same with sda4:

# ln -s /dev/sda4 /partitions/$(uuidgen).raw

Rescan and add a sensible name.

*Note: you may want to link to a /dev/disk/by-id/ or /dev/disk/by-uuid/ path instead of /dev/sda<num>, to avoid problems in the future if you decide to add more disks to the hosts and the sda/sdb order gets mixed up. But for the sake of this example, I think /dev/sda<num> is good enough.
*Accidentally useful stuff: with the "partitions" SR added to a XS host, you can now check Dom0's free space in XenCenter. Because the SM stuff reads the size of that folder - and thus, inadvertently, the size and utilisation of the root filesystem (sda1) - the free and utilised space of that SR represents the Dom0 filesystem.

Prepare Docker Guest

At this stage you could have installed CoreOS, RancherOS or Boot2Docker to deploy the CEPH container. But, while that would work with the RBDSR plugin demo that I've made (i.e. through using an ssh public key in Dom0 to obtain information over ssh instead of a password), we would need to set up automated container start-up, and I'm not familiar enough with those systems to write about it.

Instead, here I'll use a Linux guest (openSUSE Leap) to run the CEPH Docker container inside of it. There are two additional reasons for doing that:

  1. I love SUSE - it's a Swiss Army knife with the precision of a samurai sword.
  2. It will help to demonstrate how the supplemental pack, and xscontainer monitoring in particular, works.

Anyway, let's create the guest using the latest x64 SLES template.

If you haven't already, download the openSUSE DVD or network install image here

Add ISO SR to the pool. 

Since we don't have any local storage, but need to build the guest on each host locally, we are going to use sda3 (20GB) as the system VDI for that guest.

However, this presents a problem: the default disk for the SLES guest is 8 GB, but we don't have any local storage to place it on (the only SR is /partitions, and that is just the 4GB of sda1). But do not worry - everything will fit just fine. To work around the SLES template's recommendation of an 8GB local disk, there are two possible solutions:

Solution 1

Change the template’s disk size recommendation like so:

  • xe template-list | grep SUSE -A1
  • xe template-list uuid=<uuid of SUSE 12 x64 from the command above> params
  • xe template-param-set uuid=<uuid from the command above> other-config:disks='<copy of the disks section from the original template with the size value changed to 1073741824>' (see the sketch after this list)
  • create the VM from the template, then delete the attached disk and replace it with the sda3 VDI from the "partitions" SR.
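
For reference, on the templates I've looked at, that disks value is a small provision XML snippet, something along these lines (copy the exact string from your own params output and change only the size value, which is in bytes):

# xe template-param-set uuid=<uuid of the SUSE template> \
      other-config:disks='<provision><disk device="0" size="1073741824" sr="" bootable="true" type="system"/></provision>'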

Solution 2

Or, you can use the command line to install from the template like so:

  • xe vm-install template="SUSE Linux Enterprise Server 12 (64-bit)" new-name-label=docker1 sr-uuid=<uuid of the partitions SR on the master>
  • Once it creates the VM, delete the 8GB disk and replace it with the sda3 VDI from the "partitions" SR. In the Storage tab, also click to create a CD-ROM drive and insert the openSUSE ISO.
  • set the install repository to cdrom: xe vm-param-set uuid=<docker1 uuid> other-config:install-repository=cdrom
  • add a VIF from XenCenter or: xe vif-create vm-uuid=<docker1 uuid> network-uuid=<uuid of the network you want to use> mac=random device=0
  • set the CD-ROM as the bootable device:
    • xe vbd-list vm-name-label=docker1 type=CD
    • xe vbd-param-set uuid=<uuid from the command above> bootable=true

Boot the VM and install it onto the sda3 VDI. A few changes have to be made to the default openSUSE installation:

  1. set the boot partition to ext4 instead of BTRFS, since pygrub doesn’t read btrfs (yet :))
    • Remove the proposed partitions.
    • create a first boot partition of around 100MB
    • then create swap
    • then create a btrfs partition for the remaining space
  2. disable snapshotting on BTRFS (because we only have 20GB to play with and CEPH involves a lot of btrfs operations)
    • This option is in the "subvolume handling" section of the partition.
  3. enable ssh (in the firewall section of the installation summary) and go.

Once the VM is up and running, log in and patch it:

# zypper -n up

Then install xs-tools for dynamic devices: 

  • eject the openSUSE DVD and insert xs-tools.iso (can be done in one click from the General tab of the VM)
  • mount the cdrom inside the VM: mount /dev/xvdd /mnt
  • install the sles-type tools: cd /mnt/Linux && ./install -d sles -m 12
  • reboot

We need to remove the DVD as a repository with:

# zypper rr openSUSE-42.1-0

then install docker and ncat:

# zypper -n in docker ncat

Start and enable docker service:

# systemctl enable docker.service && systemctl start docker.service

In Dom0, enable monitoring of docker for that VM with:

# xscontainer-prepare-vm -v <uuid of the docker1> -u root

Note: XenServer doesn't support openSUSE as a docker host. I'm not sure exactly what the problem is, but something about the docker socket in openSUSE, or ncat, makes the command take too long to return when doing a GET, or there are not enough \n characters in the script. So, as a workaround, you’d need to do the following on each host:

Edit /usr/lib/python2.4/site-packages/xscontainer/docker_monitor/__init__.py and comment out the docker.wipe_docker_other_config call, adding 'pass' above it:

@@ -101,7 +101,8 @@
                     try:
                         self.__monitor_vm_events()
                     finally:
-                        docker.wipe_docker_other_config(self)
+                        pass
+                    #    docker.wipe_docker_other_config(self)
                 except (XenAPI.Failure, util.XSContainerException):
                     log.exception("__monitor_vm_events threw an exception, "
                                   "will retry")

With this in place, the docker status will be shown correctly in XenCenter.

Prepare CEPH container

Pull the ceph/daemon image:

# sudo docker pull ceph/daemon

Now that we have a VM prepared with the CEPH docker image inside, we can replicate this guest across the two remaining hosts.

  • Shut down the VM
  • detach the disk and make two additional copies of the VM.
  • Reassign the home server of each copy to one of the remaining hosts and attach that host's sda3 and sda4 VDIs as its disks.

At the moment we have the docker guest installed only on the master, so let’s copy its disk (sda3) to the other hosts:

# dd if=/dev/sda3 bs=4M | ssh root@<ip of slave 1> dd of=/dev/sda3

Then repeat the same for the second slave. Once that has completed, you can boot all three VMs up. This is a good time to configure static IPs and add a second NIC if you wish to use a dedicated network for CEPH data syncing.

You’d also need to run xscontainer-prepare-vm for the two other VMs (alternatively, you can just set the other-config flags xscontainer-monitor=true and xscontainer-username=root, since we already have the XenServer pool rsa key added to the authorised keys inside the guest).
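
If you go the other-config route instead, it would look something like this for each of the copies, run from Dom0 with the real VM uuid substituted in:

# xe vm-param-set uuid=<uuid of docker2> other-config:xscontainer-monitor=true
# xe vm-param-set uuid=<uuid of docker2> other-config:xscontainer-username=root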

Once you've done all that and restarted xscontainer-monitor (/etc/init.d/xscontainer-monitor restart) to apply the changes we’ve made in __init__.py, you should see Docker information in the General tab of each docker guest. As you can see, the xscontainer supplemental pack uses ssh to log into the instance, then ncats the docker unix socket and, using GET requests, obtains information about the docker version and the containers it has.
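
If you are curious, you can reproduce roughly what the monitor does by hand from inside one of the guests, along these lines (assuming the default docker socket path):

# echo -e "GET /version HTTP/1.0\r\n" | ncat -U /var/run/docker.sock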

Prepare CEPH cluster

So we are ready to deploy CEPH. 

Note: it’s a good idea to configure NTP at this stage, because monitors are very sensitive to time skew between hosts (use yast or /etc/ntp.conf).
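
On openSUSE that boils down to something like the following inside each docker guest (the ntp package and pool.ntp.org server are just one option; use yast or whichever time source you prefer):

# zypper -n in ntp
# echo "server pool.ntp.org iburst" >> /etc/ntp.conf
# systemctl enable ntpd && systemctl start ntpd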

In addition, it's a good idea to make the guest's clock independent of the hypervisor wallclock, as otherwise it will never stay accurately in sync and CEPH will report problems on the cluster:

# sysctl xen.independent_wallclock=1
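
That setting does not survive a guest reboot on its own; if you want it to stick, drop it into a sysctl configuration file as well (the file name here is arbitrary):

# echo "xen.independent_wallclock = 1" > /etc/sysctl.d/99-xen-wallclock.conf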

All the information on how to deploy the monitor and OSD is available on the github page for the ceph/daemon docker image here: https://github.com/ceph/ceph-docker/tree/master/daemon

Another interesting page is the sample ceph.conf: https://github.com/ceph/ceph/blob/master/src/sample.ceph.conf

One part of that config of particular interest to me is the OSD option to create a btrfs partition instead of the default xfs.

So, to bring up a monitor container in the docker guest running on the master, I run:

# sudo docker run -d --net=host --name=mon -v /etc/ceph:/etc/ceph \
       -v /var/lib/ceph/:/var/lib/ceph/ -e MON_IP=192.168.0.20 \
       -e CEPH_PUBLIC_NETWORK=192.168.0.0/24 ceph/daemon mon

Note: this container requires the folders /etc/ceph and /var/lib/ceph to work. You can create them manually, or just install ceph on your base system (in this case the openSUSE host) so that it populates all the default paths.

If you have a dedicated network that you want to use for data syncing between OSDs, you will need to specify the CEPH_CLUSTER_NETWORK parameter as well.

Once the monitor is started, you need to copy both the /etc/ceph and /var/lib/ceph folders to the two other docker hosts. Keep in mind that you need to preserve permissions, as by default it configures things as the “ceph” user with uid 64045 and gid 64045.

# cd /etc
# tar -cvzf - ./ceph | ssh root@<second docker host> 'cd /etc && tar -xvzf -'
# cd /var/lib
# tar -cvzf - ./ceph | ssh root@<second docker host> 'cd /var/lib && tar -xvzf -'

Then repeat for the third docker host.

Once all hosts have those two folders, you can start the monitors on the other two hosts.

At this stage you should be able to use ceph -s to query the status. It should display the cluster status as a health error, which is fine until we add the OSDs.

For the OSD deployment, you should make sure that the disk you use has the proper permissions. One way to do that is via a udev rule. Thanks go to Adrian Gillard on the ceph mailing list; in this case the trick is to set the permissions for /dev/xvdb to uid 64045:

# cat > /etc/udev/rules.d/89-ceph-journal.rules << EOF
KERNEL=="xvdb?", SUBSYSTEM=="block", OWNER="64045", GROUP="disk", MODE="0660"
EOF
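
If you'd rather not reboot just for the udev rule, you can usually reload and re-trigger the rules by hand instead:

# udevadm control --reload-rules
# udevadm trigger --subsystem-match=block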

Either way, at this stage it’s probably a good idea to reboot the guest to make sure the new udev rule applies, then start the monitor and deploy the OSD like so:

# sudo docker start mon
# sudo docker run -d --net=host --privileged=true --pid=host -v /etc/ceph:/etc/ceph \
        -v /var/lib/ceph/:/var/lib/ceph/ -v /dev/:/dev/ -e OSD_DEVICE=/dev/xvdb \
        -e OSD_TYPE=disk ceph/daemon osd

If there was a filesystem on that disk before, it’s a good idea to add -e OSD_FORCE_ZAP=1 as well. And if you wish to use btrfs instead of xfs, change /etc/ceph/ceph.conf by adding this line:

osd mkfs type = btrfs

If things get hairy and creating the OSD didn’t work out, make sure to clear the xvdb1 partition with

# wipefs -o 0x10040 /dev/xvdb1

and then zap it with gdisk:

# gdisk /dev/xvdb -> x -> z -> yes

Create RBDoLVM SR

Once you have started all the OSDs and you have a pool of data, you can create a new pool and add an rbd object that will be used as the LVM SR. Since the rbd object is thin provisioned, and an LVM-based SR in XenServer is not, you can create it with a fairly large size - 2 or 4TB, for example.
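
As a rough sketch, creating the pool and a thin provisioned rbd image could look like this from one of the docker guests (the pool name, image name and placement group count below are just placeholders; rbd sizes default to megabytes, so 2097152 is 2TB):

# ceph osd pool create rbd_xen 128 128
# rbd create rbd_xen/xensr --size 2097152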

Then use XenCenter to create a new SR on that RBD object (an example is available here).

Automate startup

With the SR attached to the pool, it would be a good idea to make sure that the containers start automatically.

To configure starting containers on boot, you can use the instructions here: https://docs.docker.com/engine/articles/host_integration/

In the case of openSUSE, the systemd unit file would look something like this:

[Unit]
Description=CEPH MON container
After=network.target docker.service
Requires=docker.service
 
[Service]
ExecStart=/usr/bin/docker start -a mon
ExecStop=/usr/bin/docker stop -t 2 mon
Restart=always
RestartSec=5
 
[Install]
WantedBy=multi-user.target

There are two changes from the docker documentation in the systemd example above:

  1. The restart needs a timeout. The reason is that the docker service starts up but doesn’t accept connections on the unix socket right away, so the first attempt to start the container might fail, but the one after that should bring the container up.
  2. Depending on your linux distro, the target might be different from docker’s example (i.e. local.target); in my case I’ve set it to multi-user.target, since that’s the target my docker boxes aim for.

The same systemd script will be required for the OSDs as well.
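
For reference, a minimal OSD variant might look like the unit below. This assumes the OSD container was given a name when it was first run (e.g. --name=osd; adjust to whatever docker ps shows), and that both unit files are saved under /etc/systemd/system/ as docker-mon.service and docker-osd.service so the enable commands below can find them.

[Unit]
Description=CEPH OSD container
After=network.target docker.service
Requires=docker.service

[Service]
ExecStart=/usr/bin/docker start -a osd
ExecStop=/usr/bin/docker stop -t 2 osd
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target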

Once the scripts are prepared, you can put them to the test with:

# systemctl enable docker-osd
# systemctl enable docker-mon

And restart the docker host to check that both containers come up.

Now that we have the docker containers starting automatically, we need to make sure that the docker VMs also come up automatically when XenServer boots. You can use the article here: http://support.citrix.com/article/CTX133910
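
The short version of that article, if memory serves, is to flag both the pool and each docker VM for auto power-on (uuids below are placeholders):

# xe pool-param-set uuid=<pool uuid> other-config:auto_poweron=true
# xe vm-param-set uuid=<docker1 uuid> other-config:auto_poweron=true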

Master host PBD fixup

One last setting that you might need to adjust manually: due to a limitation in RBDSR, the only monitor provided in the PBD is the first one in the list (i.e. the first initial monitor in the ceph configuration). As a result, if your first initial monitor is, for example, the one running on the master, then when you reboot the master it will be waiting for the docker guest and the monitor container to come up, and will therefore most likely time out and fail to plug the PBD. To work around that, you can recreate the PBD on the master to point to a monitor on one of the slaves - a nasty solution, but it will work for as long as that slave is part of the pool and running an active monitor.

# xe pbd-list host-name-label=[name of the master] sr-name-label=[name of the CEPH SR] params
# xe pbd-unplug uuid=[uuid from the command above]
# xe pbd-destroy uuid=[the same]
# xe secret-create value=[ceph user password provided in the CHAP configuration of the SR]

Now you can create the PBD by copying the details from the “pbd-list params” output and replacing the target with the IP of the monitor on the slave:

# xe pbd-create host-uuid=[master uuid] sr-uuid=[CEPH sr uuid] \ 
       device-config:SCSIid=[name of the rbd object] \
       device-config:chapuser=[docker admin that can run ceph commands] \
       device-config:port=6789  \
       device-config:targetIQN=[rbd pool name] \
       device-config:target=[docker instance on slave] \
       device-config:chappassword_secret=[secret uuid from command above]

That should be it - you now have local storage presented as a single SR to the pool with one copy of data on each host. Have fun!


XenServer Dundee Beta 3 Available

I am pleased to announce that Dundee beta 3 has been released, and for those of you who monitor Citrix Tech Previews, beta 3 corresponds to the core XenServer platform used for Dundee TP3. This third beta marks a major development milestone representing the proverbial "feature complete" stage. Normally when announcing pre-release builds, I highlight major functional advances but this time I need to start with a feature which was removed.

Thin Provisioned Block Storage Removed

While it's never great to start with a negative, I felt anything related to the removal of a storage option takes priority over the new and shiny. I'm going to keep this section short, and also highlight that only the new thin provisioned block storage feature was removed; existing thin provisioned NFS and file-based storage repositories will function as they always have.

What should I do before upgrading to beta 3?

While we don't actively encourage upgrades to pre-release software, we do recognize you're likely to do it at least once. If you have built out infrastructure with thin provisioned iSCSI or HBA storage using a previous pre-release of Dundee, please ensure you migrate any critical VMs to local storage, NFS, or thick provisioned block storage prior to performing an upgrade to beta 3.

So what happened?

As is occasionally the case with pre-release software, not all features which are previewed make it to the final release, for any of a variety of reasons. That is of course one reason we provide pre-release access. In the case of the thin provisioned block storage implementation present in earlier Dundee betas, we ultimately found that it had issues under performance stress. As a result, we've made the difficult decision to remove it from Dundee at this time. Investigation into alternative implementations is underway, and the team is preparing a more detailed blog on future directions.

Beta 3 Overview

Much of the difference between beta 2 and beta 3 can be found in the details. dom0 has been updated to a CentOS 7.2 userspace, the Xen Project hypervisor is now 4.6.1 and the kernel is 3.10.96. Support for the xsave and xrstor floating point instructions has been added, enabling guest VMs to utilize AVX instructions available on newer Intel processors. We've also added experimental support for the Microsoft Windows Server 2016 Tech Preview and the Ubuntu 16.04 beta.

Beta 3 Bug Fixes

Earlier pre-releases of Dundee had an issue wherein performing a storage migration of a VM with snapshots, and in particular orphaned snapshots, would result in migration errors. Work has been done to resolve this, and it would be beneficial for anyone taking beta 3 to exercise storage motion to validate whether the fix is complete.

One of the focus areas for Dundee is to improve scalability, and as part of that we've uncovered some situations where overall reliability wasn't what we wanted. An example of such a situation, which we've resolved, occurs when a VM with a very large number of VBDs is running on a host, and a XenServer admin requests the host to shutdown. Prior to the fix, such a host would become unresponsive.

The default logrotate configuration for xensource.log has been changed to rotate at 100MB in addition to daily. This change was made because, on very active systems, the prior configuration could result in excessive disk consumption.

Download Information

You can download Dundee beta 3 from the Preview Download page (http://xenserver.org/preview), and any issues found can be reported in our defect database (https://bugs.xenserver.org).

