Virtualization Blog

Discussions and observations on virtualization.

XenServer Hotfix XS65ESP1035 Released


News Flash: XenServer Hotfix XS65ESP1035 Released

Indeed, I was alerted early this morning (06:00 EST) via email that Citrix has released hotfix XS65ESP1035 for XenServer 6.5 SP1.  The official release and its contents are filed under CTX216249, which can be found here: http://support.citrix.com/article/CTX216249

As of the writing of this article, this hotfix has not yet been added to CTX138115 (entitled "Recommended Updates for XenServer Hotfixes") or, as we like to call it, "The Fastest Way to Patch A Vanilla XenServer With One or Two Reboots!"  I imagine that resource will be updated to reflect XS65ESP1035 soon.

Personally and professionally, I will be installing this hotfix. Per CTX216249, here is what it addresses and fixes:

  • Duplicate entry for XS65ESP1021 was created when both XS65ESP1021 and XS65ESP1029 were applied.
  • When BATMAP (Block Allocation Map) in Xapi database contains erroneous data, the parent VHD (Virtual Hard Disk) does not get inflated causing coalesce failures and ENOSPC errors.
  • After deleting a snapshot on a pool member that is not the pool master, a coalesce operation may not succeed. In such cases, the coalesce process can constantly retry to complete the operation, resulting in the creation of multiple RefCounts that can consume a lot of space on the pool member.
In addition, this hotfix contains the following improvement:
  • This fix lets users set a custom retrans value for their NFS SRs thereby giving them more fine-grained control over how they want NFS mounts to behave in their environment.

(Source: http://support.citrix.com/article/CTX216249)

So....

This is a storage-based hotfix, and while we can create VMs all day, we rely on the storage substrate to hold our precious VHDs, so plan accordingly to deploy it!
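Of the items above, the NFS retrans improvement is the one most administrators are likely to use directly. CTX216249 is the authority on the exact syntax; purely as a hedged sketch, assuming the hotfix exposes this as an SR-level device-config key named retrans, creating an NFS SR with a custom retransmission count might look like:

xe sr-create name-label="NFS SR (custom retrans)" type=nfs shared=true \
   device-config:server=nfs01.example.com \
   device-config:serverpath=/exports/xenserver \
   device-config:retrans=4

The server name, export path, and retrans value above are placeholders; check the hotfix article for the exact device-config key names it introduces.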

Applying The Patch Manually

As a disclaimer of sorts, always plan your patching during a maintenance window to prevent any production outages.  For me, I am currently up-to-date and will be rebooting my XenServer host(s) in a few hours, so I manually applied this patch.

Why?  If you look in XenCenter for updates, you won't see this hotfix listed (yet).  If it were available in XenCenter, checks and balances would inform me that I need to suspend, migrate, or shut down VMs.  For a standalone host, I really can't do that.  In my pool, I can't reboot for a few hours, but I need this patch installed, so I simply do the following on my XenServer stand-alone server OR XenServer primary/master server:

Using the command line in XenCenter, I make a directory in /root/ called "ups" and then descend into that directory because I plan to use wget (Web Get) to download the patch via its link in http://support.citrix.com/article/CTX216249:

[root@colossus ~]# mkdir ups
[root@colossus ~]# cd ups

Now, using wget I specify what to download over port 80 and to save it as "hf35.zip":

[root@colossus ups]# wget "http://support.citrix.com/supportkc/filedownload?uri=/filedownload/CTX216249/XS65ESP1035.zip" -O hf35.zip

We then see the usual wget progress bar and once it is complete, I can unzip the file "hf35.zip":

HTTP request sent, awaiting response... 200 OK
Length: 110966324 (106M) [application/zip]
Saving to: `hf35.zip'

100%[======================================>] 110,966,324 1.89M/s   in 56s    
2016-08-25 11:06:32 (1.90 MB/s) - `hf35.zip' saved [110966324/110966324]
[root@colossus ups]# unzip hf35.zip 
Archive:  hf35.zip
  inflating: XS65ESP1035.xsupdate   
  inflating: XS65ESP1035-src-pkgs.tar.bz2

I'm a big fan of using shortcuts - especially where UUIDs are involved.  Now that I have the patch ready to apply to my XenServer master/stand-alone server, I want to create some kind of variable so I don't have to remember my host's UUID or the patch's UUID. 

For the host, I can simply source in a file that contains the XenServer primary/master server's INSTALLATION_UUID (better known as the host's UUID):

[root@colossus ups]# source /etc/xensource-inventory 
[root@colossus ups]# echo $INSTALLATION_UUID
207cd7c1-da20-479b-98bc-e84cac64d0c0

With the variable $INSTALLATION_UUID set, I can now upload the patch and capture its own UUID:

[root@colossus ups]# patchUUID=`xe patch-upload file-name=XS65ESP1035.xsupdate`
[root@colossus ups]# echo $patchUUID
cdf9eb54-c3da-423d-88ca-841b864f926b

NOW, I apply the patch to the host (yes, it still needs to be rebooted, but within a few hours) using both variables in the following command:

[root@colossus ups]# xe patch-apply uuid=$patchUUID host-uuid=$INSTALLATION_UUID
   
Preparing...                ##################################################
kernel                      ##################################################
unable to stat /sys/class/block//var/swap/swap.001: No such file or directory
Preparing...                ##################################################
sm                          ##################################################
Preparing...                ##################################################
blktap                      ##################################################
Preparing...                ##################################################
kpartx                      ##################################################
Preparing...                ##################################################
device-mapper-multipath-libs##################################################
Preparing...                ##################################################
device-mapper-multipath     ##################################################

At this point, I can back out of the "ups" directory and remove it.  Likewise, I can also check to see if the patch UUID is listed in the XAPI database:

[root@colossus ups]# cd ..
[root@colossus ~]# rm -rf ups/
[root@colossus ~]# ls
support.tar.bz2
[root@colossus ~]# xe patch-list uuid=$patchUUID
uuid ( RO)                    : cdf9eb54-c3da-423d-88ca-841b864f926b
              name-label ( RO): XS65ESP1035
        name-description ( RO): Public Availability: fixes to Storage
                    size ( RO): 21958176
                   hosts (SRO): 207cd7c1-da20-479b-98bc-e84cac64d0c0
    after-apply-guidance (SRO): restartHost

So, nothing really special -- just a quick way to apply patches to a XenServer primary/master server.  In the same manner, you can substitute the $INSTALLATION_UUID with other host UUIDs in a pool configuration, etc.
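As a hedged aside (not part of the steps above): if you want to push the same uploaded patch to every host in a pool rather than one host at a time, something like the following should work, either letting xe fan it out or looping over the host UUIDs yourself:

[root@colossus ~]# xe patch-pool-apply uuid=$patchUUID

[root@colossus ~]# for h in $(xe host-list --minimal | tr ',' ' '); do
>   xe patch-apply uuid=$patchUUID host-uuid=$h
> done

Either way, each host still needs its restartHost guidance honored, so stagger the reboots to suit your maintenance window.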

Well, off to reboot and thanks for reading!

 

-jkbs | @xenfomation | My Citrix Blog

To receive updates about the latest XenServer software releases, log in or sign up to pick and choose the content you need from http://support.citrix.com/customerservice/

 


Sources

Citrix Support Knowledge Center: http://support.citrix.com/article/CTX216249

Citrix Support Knowledge Center: http://support.citrix.com/customerservice/

Citrix Profile/RSS Feeds: http://support.citrix.com/profile/watches/

Original Image Source: http://www.gimphoto.com/p/download-win-zip.html


XenServer Administrators Handbook Published

Last year, I announced that we were working on a XenServer Administrators Handbook, and I'm very pleased to announce that it's been published. Not only have we been published, but based on the Amazon reviews to date we've done a pretty decent job. In part, I suspect that has a ton to do with the book being focused on what information you, XenServer administrators, need to be successful when running a XenServer environment regardless of scale or workload.

The handbook is formatted following a simple premise: first you need to plan your deployment, and second you need to run it. With that in mind, we start with exactly what a XenServer is, define how it works and what expectations it has on infrastructure. After all, it's critical to understand how a product like XenServer interfaces with the real world, and how its virtual objects relate to each other. We even cover some of the misunderstandings those new to XenServer might have.

While it might be tempting to go deep on some of this stuff, Jesse and I both recognized that virtualization SREs have a job to do and that's to run virtual infrastructure. As interesting as it might be to dig into how the product is implemented, that's not the role of an administrator's handbook. That's why the second half of the book provides some real-world scenarios, and how to go about solving them.

We had an almost limitless list of scenarios to choose from, and what you see in the book represents real-world situations which most SREs will face at some point. The goal of this format is to have a handbook which can be actively used, not something which is read once and placed on some shelf (virtual or physical). During the technical review phase, we sent copies out to actual XenServer admins, all of whom stated that we'd presented some piece of information they hadn't previously known. I for one consider that to be a fantastic compliment.

Lastly, I want to finish off by saying that like all good works, this is very much a "we" effort. Jesse did a top-notch job as co-author and brings the experience of someone whose job it is to help solve customer problems. Our technical reviewers added tremendously to the polish you'll find in the book. The O'Reilly Media team was a pleasure to work with, pushing when we needed to be pushed but understanding that day jobs and family take precedence.

So whether you're looking at XenServer out of personal interest, have been tasked with designing a XenServer installation to support Citrix workloads, clouds, or general purpose virtualization, or have a XenServer environment to call your own, there is something in here for you. On behalf of Jesse, we hope that everyone who gets a copy finds it valuable. The XenServer Administrator's Handbook is available from booksellers everywhere including:

Amazon: http://www.amazon.com/XenServer-Administration-Handbook-Successful-Deployments/dp/149193543X/

Barnes and Noble: http://www.barnesandnoble.com/w/xenserver-administration-handbook-tim-mackey/1123640451

O'Reilly Media: http://shop.oreilly.com/product/0636920043737.do

If you need a copy of XenServer to work with, you can obtain that for free from: http://xenserver.org/download


Implementing VDI-per-LUN storage

With storage providers adding better functionality to provide features like QoS and fast snapshot & clone, and with the advent of storage-as-a-service, we are interested in the ability to utilize these features from XenServer. VMware’s VVols offering already allows integration of vendor-provided storage features into their hypervisor. Since most storage allows operations at the granularity of a LUN, the idea is to have a one-to-one mapping between a LUN on the backend and a virtual disk (VDI) on the hypervisor. In this post we are going to talk about the supplemental pack that we have developed in order to enable VDI-per-LUN.

XenServer Storage

To understand the supplemental pack, it is useful to first review how XenServer storage works. In XenServer, a storage repository (SR) is a top-level entity which acts as a pool for storing VDIs which appear to the VMs as virtual disks. XenServer provides different types of SRs (File, NFS, Local, iSCSI). In this post we will be looking at iSCSI based SRs as iSCSI is the most popular protocol for remote storage and the supplemental pack we developed is targeted towards iSCSI based SRs. An iSCSI SR uses LVM to store VDIs over logical volumes (hence the type is lvmoiscsi). For instance:

[root@coe-hq-xen08 ~]# xe sr-list type=lvmoiscsi
uuid ( RO)                : c67132ec-0b1f-3a69-0305-6450bfccd790
          name-label ( RW): syed-sr
    name-description ( RW): iSCSI SR [172.31.255.200 (iqn.2001-05.com.equallogic:0-8a0906-c24f8b402-b600000036456e84-syed-iscsi-opt-test; LUN 0: 6090A028408B4FC2846E4536000000B6: 10 GB (EQLOGIC))]
                host ( RO): coe-hq-xen08
                type ( RO): lvmoiscsi
        content-type ( RO):

The above SR is created from a LUN on a Dell EqualLogic. The VDIs belonging to this SR can be listed by:

[root@coe-hq-xen08 ~]# xe vdi-list sr-uuid=c67132ec-0b1f-3a69-0305-6450bfccd790 params=uuid
uuid ( RO)    : ef5633d2-2ad0-4996-8635-2fc10e05de9a

uuid ( RO)    : b7d0973f-3983-486f-8bc0-7e0b6317bfc4

uuid ( RO)    : bee039ed-c7d1-4971-8165-913946130d11

uuid ( RO)    : efd5285a-3788-4226-9c6a-0192ff2c1c5e

uuid ( RO)    : 568634f9-5784-4e6c-85d9-f747ceeada23

[root@coe-hq-xen08 ~]#

This SR has 5 VDIs. From LVM’s perspective, an SR is a volume group (VG) and each VDI is a logical volume (LV) inside that volume group. This can be seen via the following commands:

[root@coe-hq-xen08 ~]# vgs | grep c67132ec-0b1f-3a69-0305-6450bfccd790
  VG_XenStorage-c67132ec-0b1f-3a69-0305-6450bfccd790   1   6   0 wz--n-   9.99G 5.03G
[root@coe-hq-xen08 ~]# lvs VG_XenStorage-c67132ec-0b1f-3a69-0305-6450bfccd790
  LV                                       VG                                                 Attr   LSize 
  MGT                                      VG_XenStorage-c67132ec-0b1f-3a69-0305-6450bfccd790 -wi-a-   4.00M                                 
  VHD-568634f9-5784-4e6c-85d9-f747ceeada23 VG_XenStorage-c67132ec-0b1f-3a69-0305-6450bfccd790 -wi-ao   8.00M                               
  VHD-b7d0973f-3983-486f-8bc0-7e0b6317bfc4 VG_XenStorage-c67132ec-0b1f-3a69-0305-6450bfccd790 -wi-ao   2.45G                               
  VHD-bee039ed-c7d1-4971-8165-913946130d11 VG_XenStorage-c67132ec-0b1f-3a69-0305-6450bfccd790 -wi---   8.00M                                
  VHD-ef5633d2-2ad0-4996-8635-2fc10e05de9a VG_XenStorage-c67132ec-0b1f-3a69-0305-6450bfccd790 -ri-ao   2.45G
  VHD-efd5285a-3788-4226-9c6a-0192ff2c1c5e VG_XenStorage-c67132ec-0b1f-3a69-0305-6450bfccd790 -ri-ao  36.00M

Here c67132ec-0b1f-3a69-0305-6450bfccd790 is the UUID of the SR. Each VDI is represented by a corresponding LV whose name is of the format VHD-<VDI UUID>. Some of the LVs have a small size of 8MB; these are snapshots taken on XenServer. There is also an LV named MGT which holds metadata about the SR and the VDIs present in it. Note that all of this is present in an SR which is a LUN on the backend storage.

Now XenServer can attach a LUN at the level of an SR but we want to map a LUN to a single VDI. In order to do that, we restrict an SR to contain a single VDI. Our new SR has the following LVs:

[root@coe-hq-xen09 ~]# lvs VG_XenStorage-1fe527a4-7e96-cdd9-f347-a15c240f26e9
LV                                       VG                                                 Attr   LSize
MGT                                      VG_XenStorage-1fe527a4-7e96-cdd9-f347-a15c240f26e9 -wi-a- 4.00M
VHD-09b14a1b-9c0a-489e-979c-fd61606375de VG_XenStorage-1fe527a4-7e96-cdd9-f347-a15c240f26e9 -wi--- 8.02G
[root@coe-hq-xen09 ~]#
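For illustration, creating such a single-VDI SR by hand would look roughly like the following (a sketch with placeholder values; in practice an orchestrator drives these steps, and the virtual-size would be chosen to match the LUN):

SR=$(xe sr-create name-label=vdi-per-lun-sr type=lvmoiscsi shared=true \
     device-config:target=172.31.255.200 \
     device-config:targetIQN=$IQN \
     device-config:SCSIid=$SCSIid)

xe vdi-create sr-uuid=$SR name-label=vdi-per-lun-disk type=user virtual-size=8GiB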

[Figure: one-to-one mapping between a backend LUN and a XenServer VDI]

If a snapshot or clone of the LUN is taken on the backend, all the unique identifiers associated with the different entities in the LUN also get cloned, and any attempt to attach the LUN back to XenServer will result in an error because of the conflicting unique IDs.

Resignature and supplemental pack

In order for the cloned LUN to be re-attached, we need to resignature the unique IDs present in the LUN. The following IDs need to be resignatured:

  • LVM UUIDs (PV, VG, LV)
  • VDI UUID
  • SR metadata in the MGT Logical volume

We at CloudOps have developed an open-source supplemental pack which solves the resignature problem. You can find it here. The supplemental pack adds a new type of SR (relvmoiscsi) and you can use it to resignature your lvmoiscsi SRs. After installing the supplemental pack, you can resignature a clone using the following command:

[root@coe-hq-xen08 ~]# xe sr-create name-label=syed-single-clone type=relvmoiscsi \
device-config:target=172.31.255.200 \
device-config:targetIQN=$IQN \
device-config:SCSIid=$SCSIid \
device-config:resign=true \
shared=true
Error code: SR_BACKEND_FAILURE_1
Error parameters: , Error reporting error, unknown key The SR has been successfully resigned. Use the lvmoiscsi type to attach it,
[root@coe-hq-xen08 ~]#

Here, instead of creating a new SR, the supplemental pack re-signatures the provided LUN and detaches it (the error is expected as we don’t actually create an SR). You can see from the error message that the SR has been re-signed successfully. Now the cloned SR can be introduced back to XenServer without any conflicts using the following commands:

[root@coe-hq-xen09 ~]# xe sr-probe type=lvmoiscsi device-config:target=172.31.255.200 device-config:targetIQN=$IQN device-config:SCSIid=$SCSIid

   		 5f616adb-6a53-7fa2-8181-429f95bff0e7
   		 /dev/disk/by-id/scsi-36090a028408b3feba66af52e0000a0e6
   		 5364514816

[root@coe-hq-xen09 ~]# xe sr-introduce name-label=vdi-test-resign type=lvmoiscsi \
uuid=5f616adb-6a53-7fa2-8181-429f95bff0e7
5f616adb-6a53-7fa2-8181-429f95bff0e7

This supplemental pack can be used in conjunction with an external orchestrator like CloudStack or OpenStack which can manage both the storage and compute. Working with SolidFire we have implemented this functionality, available in the next release of Apache CloudStack. You can check out a preview of this feature in a screencast here.


XenServer 6.5 Can Do True UEFI Boot

Overview

We were interested in getting XenServer 6.5 to boot via UEFI.  Leaving servers in Legacy/BIOS boot was not an option in our target environment.  We still have to do the initial install with the server in Legacy BIOS mode; however, I managed to compile Xen as an EFI bootable binary using the source and patches distributed by Citrix.  With that I am able to change the server's boot mode back to UEFI and boot XenServer.  Here are the steps I used to compile it.

Steps

  1. Prepare a DDK
  2. Prepare a build environment
  3. Build some prerequisites
  4. Unpack the SRPM
  5. Compile Xen

DDK Preparation

Development will be done inside a 6.5 DDK. This is a CentOS 5.4-based Linux that has the same kernel as Dom0 and some of the required development tools. 

Import the VM template per Citrix DDK developer documentation.

After importing, set the following VM options:

  • 2 vCPUs
  • Increase memory to 2048MB
  • Resize disk image to 10GB
  • Add a network interface for SSH

Start the VM, set a root password, and then finalize resizing the disk by running:

# fdisk /dev/xvda
cmd: d   (delete the existing partition)
cmd: n   (create a new partition)
cmd: p   (primary partition)
cmd: 1   (partition number 1)
         (accept the default size)
cmd: w   (write the changes and exit)
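In my understanding, recreating the partition with fdisk only grows the partition; depending on the DDK image you will likely also need to grow the filesystem itself after a reboot, along these lines (the ext3 root on /dev/xvda1 is an assumption; adjust to your DDK layout):

# reboot
# resize2fs /dev/xvda1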

Preparing the Build Environment

Install rpmdevtools:

# yum --disablerepo citrix install rpmdevtools
# rpmdev-setuptree

I also needed several packages, many of which were provided on the binpkg ISO from Citrix. I made them available by inserting XenServer-6.5-binpkg.iso and running the following:

# mount /dev/xvdb /mnt
# mkdir /opt/binpkg
# cp -a /mnt/domain0/RPMS/* /opt/binpkg
# cd /opt/binpkg
# createrepo /opt/binpkg
# cat >/etc/yum.repos.d/binpkg.repo
[binpkg]
name=binpkg
baseurl=file:///opt/binpkg
gpgcheck=0
^D

I also added the epel repository:

rpm -Uvh http://dl.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm

and I added the following packages (installed in one yum transaction, sketched after the list):

bzip2-devel
e4fsprogs-devel
gettext
glib-devel
glib2-devel
iasl
netpbm
netpbm-progs
psutils
tetex-dvips
tetex-latex
DejaGnu
tcl
expect
makeinfo
texinfo
pixman-devel
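A single yum transaction covering that list would look roughly like this (a sketch; some names, such as makeinfo, are resolved by yum through the package that provides them, and availability in binpkg versus EPEL may vary):

# yum --disablerepo citrix install bzip2-devel e4fsprogs-devel gettext \
    glib-devel glib2-devel iasl netpbm netpbm-progs psutils tetex-dvips \
    tetex-latex DejaGnu tcl expect makeinfo texinfo pixman-devel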

Building the Prerequisites

This page: http://xenbits.xen.org/docs/4.3-testing/misc/efi.html says that it is required to use gcc 4.5 or better and that binutils must be compiled with --enable-targets=x86_64-pep. I could not satisfy this with packages in the repositories, so I compiled and installed some requirements for gcc:

ftp://ftp.gnu.org/gnu/gmp/gmp-4.3.2.tar.gz
http://www.mpfr.org/mpfr-2.4.2/mpfr-2.4.2.tar.gz
http://www.multiprecision.org/mpc/download/mpc-0.8.1.tar.gz
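For reference, the usual configure/make sequence for these three libraries is sketched below (my exact prefixes and flags may have differed; mpfr needs gmp, and mpc needs both):

# tar xzf gmp-4.3.2.tar.gz && cd gmp-4.3.2
# ./configure --prefix=/usr/local && make && make install && cd ..
# tar xzf mpfr-2.4.2.tar.gz && cd mpfr-2.4.2
# ./configure --prefix=/usr/local --with-gmp=/usr/local && make && make install && cd ..
# tar xzf mpc-0.8.1.tar.gz && cd mpc-0.8.1
# ./configure --prefix=/usr/local --with-gmp=/usr/local --with-mpfr=/usr/local && make && make install && cd ..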

I then made sure the required files could be found:

# export LD_LIBRARY_PATH=:/usr/lib:/usr/local/lib:/usr/local/lib64:/usr/lib64
# ln -s /usr/lib64/libcrypto.so.0.9.8e libcrypto.so

Then I compiled binutils using https://ftp.gnu.org/gnu/binutils/binutils-2.22.tar.gz:

# tar xzf binutils-2.22.tar.gz
# mkdir binutils-build
# cd binutils-build
# ../binutils-2.22/configure --disable-werror --enable-targets=x86_64-pep
# make
# make install

Then I compiled gcc using http://mirrors-usa.go-parts.com/gcc/releases/gcc-4.6.2/gcc-4.6.2.tar.gz. I adapted the compilation instructions from http://www.linuxfromscratch.org/lfs/view/7.1/chapter06/gcc.html:

# cd gcc-4.6.2
# sed -i 's/install_to_$(INSTALL_DEST) //' libiberty/Makefile.in
# case `uname -m` in
i?86) sed -i 's/^T_CFLAGS =$/& -fomit-frame-pointer/' \
gcc/Makefile.in ;;
esac
# sed -i 's@./fixinc.sh@-c true@' gcc/Makefile.in
# mkdir -v ../gcc-build
# cd ../gcc-build
# ../gcc-4.6.2/configure --prefix=/usr \
--libexecdir=/usr/lib --enable-shared \
--enable-threads=posix --enable-__cxa_atexit \
--enable-clocale=gnu --enable-languages=c,c++ \
--disable-multilib --disable-bootstrap --with-system-zlib
# make -j3
# ulimit -s 16384
# make -k check
# ../gcc-4.6.2/contrib/test_summary
# make install

Unpack the SRPM

At this point everything is ready to compile Xen as an EFI bootable binary.  The source code for Xen with Citrix's patches is available at http://xenserver.org/open-source-virtualization-download.html. Download XenServer-6.5.0-source-main-1.iso and mount it at /mnt/cdrom:

# mount -r /dev/xvdb /mnt/cdrom

Now install the SRPM:

# rpm -ivh /mnt/cdrom/xen/xen-4.4.1-1.9.0.459.28798.src.rpm 

Compile Xen

The source code and required scripts are now all under /root/rpmbuild/, so just run:

# cd ~/rpmbuild
# QA_RPATHS=$[ 0x0020 ] rpmbuild -bc SPECS/xen.spec

The -bc flag causes the process to follow the spec file and patch the source, but then stop just before running the make commands. The make commands would fail to compile with warnings about uninitialized variables being treated as errors. Fix this by changing line 45 of ~/rpmbuild/BUILD/xen-4.4.1/xen/Rules.mk to read:

CFLAGS += -Werror -Wno-error=uninitialized -Wredundant-decls -Wno-pointer-arith

and line 39 of ~/rpmbuild/BUILD/xen-4.4.1/Config.mk to read:

HOSTCFLAGS = -Wall -Werror -Wno-error=uninitialized -Wstrict-prototypes -O2 -fomit-frame-pointer

After making those changes, run:

# cd ~/rpmbuild/BUILD/xen-4.4.1/
# make clean
# make max_phys_cpus=256 XEN_TARGET_ARCH=x86_64 -C xen \
XEN_VENDORVERSION=-xs90192 debug=n build

and the compile will finish successfully.  I probably could have just made a quick patch to add the -Wno-error flag and allowed rpmbuild to run the full spec file, but I didn't actually need to compile xen-tools etc.; those are already compiled and installed on the XenServer installation. The only file needed is ~/rpmbuild/BUILD/xen-4.4.1/xen/xen.efi. With that in hand I created a xen.cfg file like this:

[global]
default=xen

[xen]
options=console=vga,com1,com2 com1=115200,8n1,0x3F8,4 com2=115200,8n1,0x2F8,3 loglvl=all noreboot
kernel=vmlinuz-3.10-xen root=UUID=b4ee0ace-b587-41df-a66b-16f89731b2a8 rw ignore_loglevel acpi_rsdp=0x7B7FE014
ramdisk=initrd-3.10-xen.img

where the root UUID is the boot disk created during the XenServer install and the RSDP number came from running:

# dmesg | grep RSDP

I ran that in an EFI-booted live Linux environment. I found that some vendors' UEFI implementations were able to provide the RSDP during boot and some were not, so without specifying it in the xen.cfg I had trouble with things like USB peripherals.

Boot XenServer

With the xen.efi and xen.cfg I was able to boot XenServer in UEFI mode using rEFInd.  We have done extensive testing on several different servers and found no problems.  I was also able to repeat the process with the source code provided by the service packs up to and including Service Pack 1.  I haven't tried any further than that yet.
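For completeness, here is roughly how the rEFInd side can be wired up; the paths and menu label below are assumptions rather than something from my environment. In my understanding, xen.efi looks for a xen.cfg next to it, so copy xen.efi, xen.cfg, vmlinuz-3.10-xen and initrd-3.10-xen.img into one directory on the EFI System Partition and point a rEFInd stanza at the binary:

# Example stanza in refind.conf (assumed ESP directory /EFI/xenserver/)
menuentry "XenServer 6.5 (Xen EFI)" {
    loader /EFI/xenserver/xen.efi
}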

Editor's Note

For those of you wishing to retain Citrix commercial support status, the above procedure will convert the XenServer 6.5 host into an "unsupported configuration".

 


Debugging Neutron Networking in XenServer

One of the tasks I was assigned was to fix the code preventing XenServer with Neutron from working properly. This configuration used to work well, but the support was broken when more and more changes were made in Neutron, and the lack of a CI environment with XenServer hid the problem. I began getting XenServer with Neutron back to a working state by following the outline in the Quantum with Grizzly blog post from a few years ago. It's important to note that with the Havana release, Quantum was renamed to Neutron, and we'll use Neutron throughout this post. During my work, I needed to debug why OpenStack images were not obtaining IP addresses. This blog post covers the workflow I used, and I hope you'll find it helpful.

Environment

  • XenServer: 6.5
  • OpenStack: September 2015 master code
  • Network: ML2 plugin, OVS driver, VLAN type
  • Single Box installation

I had made some changes in the DevStack script to let XenServer with Neutron be installed and run properly. The following are some of the debugging steps I followed when newly launched VMs could not get an IP from the Neutron DHCP agent automatically.

Brief description of the DHCP process

When guest VMs are booting, they try to send a DHCP request broadcast message within the same network broadcast domain and then wait for a DHCP server's reply. In OpenStack Neutron, the DHCP server, or DHCP agent, is responsible for allocating IP addresses. If VMs cannot get IP addresses, our first priority is to check whether the packets from the VMs are being received by the DHCP server.

 

[Picture 1: traffic path from the instance through the Dom0 OVS bridges (xapiX, xapiY) to br-eth1/br-int in the DevStack DomU]

 

Dump traffic in Network Node

Since I used DevStack with a single-box installation, all OpenStack nodes reside in the same DomU (VM). Perform the following steps:

1. Check namespace that DHCP agent uses

In the DevStack VM, execute:

    sudo ip netns

The output will look something like this:

    qrouter-17bdbe51-93df-4bd8-93fd-bb399ed3d4c1
    qdhcp-49a623fd-c168-4f27-ad82-946bfb6df3d7

Note: qdhcp-xxx is the namespace for the DHCP agent

2. Check interface DHCP agent uses for L3 packets

In the DevStack VM, execute:

    sudo ip netns exec \
qdhcp-49a623fd-c168-4f27-ad82-946bfb6df3d7 ifconfig

The results will look something like the following, and the "tapYYYY" entry is the one we care about.

    lo             Link encap:Local Loopback
                   inet addr:127.0.0.1 Mask:255.0.0.0
                   inet6 addr: ::1/128 Scope:Host
                   UP LOOPBACK RUNNING MTU:65536 Metric:1
                   RX packets:0 errors:0 dropped:0 overruns:0 frame:0
                   TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
                   collisions:0 txqueuelen:0
                   RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

    tap7b39ecad-81 Link encap:Ethernet HWaddr fa:16:3e:e3:46:c1
                   inet addr:10.0.0.2 Bcast:10.0.0.255 Mask:255.255.255.0
                   inet6 addr: fe80::f816:3eff:fee3:46c1/64 Scope:Link
                   inet6 addr: fdff:631:9696:0:f816:3eff:fee3:46c1/64 Scope:Global
                   UP BROADCAST RUNNING MTU:1500 Metric:1
                   RX packets:42606 errors:0 dropped:0 overruns:0 frame:0
                   TX packets:38 errors:0 dropped:0 overruns:0 carrier:0
                   collisions:0 txqueuelen:0
                   RX bytes:4687150 (4.6 MB) TX bytes:4867 (4.8 KB)

3. Monitor traffic flow on the DHCP agent's interface tapYYYY

In the DevStack VM, monitor the traffic flowing to the tap interface by executing this command:

    sudo ip netns exec \
qdhcp-49a623fd-c168-4f27-ad82-946bfb6df3d7 \
tcpdump -i tap7b39ecad-81 -s0 -w dhcp.cap

Theoretically, when launching a new instance, you should see DHCP request and reply messages like this:

    16:29:40.710953 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from fa:16:3e:f9:f6:b0 (oui Unknown), length 302
16:29:40.713625 IP 172.20.0.1.bootps > 172.20.0.10.bootpc: BOOTP/DHCP, Reply, length 330

Dump traffic in Compute Node

Meanwhile, you will definitely want to dump traffic in the OpenStack compute node, and with XenServer this is Dom0.

When a new instance is launched, a new virtual interface is created named “vifX.Y”, where X is the domain ID of the new VM and Y is the ID of the VIF defined in XAPI. Domain IDs are sequential - if the latest interface is vif20.0, the next one will most likely be vif21.0. Then you can try tcpdump -i vif21.0. Note that it may fail at first if the virtual interface hasn't been created yet, but once the virtual interface is created, you can monitor the packets. Theoretically you should see DHCP request and reply packets in Dom0, just like you see on the DHCP agent side.
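As a concrete sketch (the domain ID and interface name are illustrative), in Dom0 you can look up the new domain's ID and then capture just the DHCP traffic on its vif:

# In Dom0: list domains to find the new VM's domain ID
list_domains
# Then capture DHCP (BOOTP) traffic on the corresponding vif
tcpdump -i vif21.0 -n -s0 'port 67 or port 68'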

Note: If you cannot capture the packets at the instance’s launch time, you can also trigger a new DHCP request by logging in to the instance via XenCenter and running ifup eth0.

Check DHCP request goes out of the compute node

In most cases, you will see the DHCP request packets sent out from Dom0, which means that the VM itself is OK: it has sent out a DHCP request message.

Note: Some images keep sending DHCP requests from time to time until they get a response. However, some images only try a few times (e.g. three), and if they do not get a DHCP response they stop trying, so the instance loses its chance of obtaining an address. That’s why some people on the Internet suggest changing images when a newly launched instance cannot get an IP address via DHCP.

Check DHCP request arrives at the DHCP server side

When I was first testing, I didn't see any DHCP requests on the DHCP agent side. Where did the request packets go? Were they dropped? If so, what dropped them, and why?

If we think about it a bit more, the packets must have been dropped at either L2 or L3. With this in mind, we can check each in turn. For L3/L4, I don’t have a firewall set up and the security group’s default rule is to let all packets through, so I didn’t spend much effort on this part. For L2, since we use OVS, I began by checking the OVS rules. If you are not familiar with OVS, this can take some time; I certainly spent a lot of time on it before I completely understood the mechanism and the rules.

The main aim is to check all existing rules in Dom0 and DomU, and then find out which rule causes the packets to be dropped.

Check OVS flow rules

OVS flow rules in Network Node

To get the port information on the network bridge "br-int", execute the following in the DevStack VM:

    sudo ovs-ofctl show br-int 
  
    stack@DevStackOSDomU:~$ sudo ovs-ofctl show br-int
OFPT_FEATURES_REPLY (xid=0x2): dpid:0000ba78580d604a
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
actions: OUTPUT SET_VLAN_VID SET_VLAN_PCP STRIP_VLAN SET_DL_SRC SET_DL_DST SET_NW_SRC SET_NW_DST SET_NW_TOS SET_TP_SRC SET_TP_DST ENQUEUE
1(int-br-eth1): addr:1a:2d:5f:48:64:47
config: 0
state: 0
speed: 0 Mbps now, 0 Mbps max
2(tap7b39ecad-81): addr:00:00:00:00:00:00
config: PORT_DOWN
state: LINK_DOWN
speed: 0 Mbps now, 0 Mbps max
3(qr-78592dd4-ec): addr:00:00:00:00:00:00
config: PORT_DOWN
state: LINK_DOWN
speed: 0 Mbps now, 0 Mbps max
4(qr-55af50c7-32): addr:00:00:00:00:00:00
config: PORT_DOWN
state: LINK_DOWN
speed: 0 Mbps now, 0 Mbps max
LOCAL(br-int): addr:9e:04:94:a4:95:bb
config: PORT_DOWN
state: LINK_DOWN
speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0

To get the flow rules, execute:

    sudo ovs-ofctl dump-flows br-int
  
    stack@DevStackOSDomU:~$ sudo ovs-ofctl dump-flows br-int
    NXST_FLOW reply (xid=0x4):
      cookie=0x9bf3d60450c2ae94, duration=277625.02s, table=0, n_packets=31, n_bytes=4076, idle_age=15793, hard_age=65534, priority=3,in_port=1,dl_vlan=1041 actions=mod_vlan_vid:1,NORMAL
      cookie=0x9bf3d60450c2ae94, duration=277631.928s, table=0, n_packets=2, n_bytes=180, idle_age=65534, hard_age=65534, priority=2,in_port=1 actions=drop
      cookie=0x9bf3d60450c2ae94, duration=277632.116s, table=0, n_packets=42782, n_bytes=4706099, idle_age=1, hard_age=65534, priority=0 actions=NORMAL
      cookie=0x9bf3d60450c2ae94, duration=277632.103s, table=23, n_packets=0, n_bytes=0, idle_age=65534, hard_age=65534, priority=0 actions=drop
      cookie=0x9bf3d60450c2ae94, duration=277632.09s, table=24, n_packets=0, n_bytes=0, idle_age=65534, hard_age=65534, priority=0 actions=drop

These rules in DomU look normal, so let's move on to Dom0 and try to find out more.

OVS flow rules in Compute Node

Looking at the traffic flow in picture 1, the path from the VM to the DHCP server is xapiX -> xapiY (Dom0), then br-eth1 -> br-int (DomU). So perhaps some OVS rule filtered the packets at layer 2. I suspected xapiY, although I could not give any specific reason why.

To determine the xapiY in your environment, execute:

    xe network-list

In the results, look for the "bridge" which matches the name-label for your network. In our case, it was xapi3, so to determine the port information, execute:

   ovs-ofctl show xapi3
  
   [root@rbobo ~]# ovs-ofctl show xapi3
   OFPT_FEATURES_REPLY (xid=0x2): dpid:00008ec00170b013
   n_tables:254, n_buffers:256
   capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
   actions: OUTPUT SET_VLAN_VID SET_VLAN_PCP STRIP_VLAN SET_DL_SRC SET_DL_DST SET_NW_SRC SET_NW_DST SET_NW_TOS SET_TP_SRC SET_TP_DST ENQUEUE
    1(vif15.1): addr:fe:ff:ff:ff:ff:ff
      config:     0
      state:      0
      speed: 0 Mbps now, 0 Mbps max
    2(phy-xapi3): addr:d6:37:17:1d:01:ee
      config:     0
      state:      0
      speed: 0 Mbps now, 0 Mbps max
    LOCAL(xapi3): addr:5a:46:65:a2:3b:4f
      config:     0
      state:      0
      speed: 0 Mbps now, 0 Mbps max
   OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0

Execute ovs-ofctl dump-flows xapi3 to get the flow rules:

  [root@rbobo ~]# ovs-ofctl dump-flows xapi3
  NXST_FLOW reply (xid=0x4):
    cookie=0x0, duration=278700.004s, table=0, n_packets=42917, n_bytes=4836933, idle_age=0, hard_age=65534, priority=0 actions=NORMAL
    cookie=0x0, duration=276117.558s, table=0, n_packets=31, n_bytes=3976, idle_age=16859, hard_age=65534, priority=4,in_port=2,dl_vlan=1 actions=mod_vlan_vid:1041,NORMAL
    cookie=0x0, duration=278694.945s, table=0, n_packets=7, n_bytes=799, idle_age=65534, hard_age=65534, priority=2,in_port=2 actions=drop

Please pay attention to port 2 (phy-xapi3); it has two specific rules:

  • The higher priority=4 rule is matched first. If dl_vlan=1, it rewrites the VLAN tag (mod_vlan_vid:1041) and then continues with NORMAL processing, which lets the flow through.
  • The lower priority=2 rule is matched otherwise, and it drops the flow. So which flows get dropped? Any flow arriving on in_port=2 that does not have dl_vlan=1 will definitely be dropped.

Note:

(1) dl_vlan=1 is the VLAN tag ID, which corresponds to the port's tag.
(2) Because of my limited OVS understanding, it took me a long time to realize the problem was a missing tag on the newly launched instance's port, so I didn't know to check the port's tag first. Next time we meet this problem, we can check that part first.

With this question in mind, I checked the newly launched instance’s port information by running ovs-vsctl show in Dom0, which gives results like these:

    Bridge "xapi5"
        fail_mode: secure
        Port "xapi5"
            Interface "xapi5"
                type: internal
        Port "vif16.0"
            Interface "vif16.0"
        Port "int-xapi3"
            Interface "int-xapi3"
                type: patch
                options: {peer="phy-xapi3"}

 

Port vif16.0 really does not have a tag with value 1, so its traffic will be unconditionally dropped.
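To confirm that kind of diagnosis, the port's tag can be queried (and, purely as an experiment, set) directly with standard ovs-vsctl commands in Dom0; forcing the tag by hand is only a test, since q-agt is supposed to manage it:

# "[]" in the output means no VLAN tag is set on the port
ovs-vsctl get Port vif16.0 tag
# Temporarily force the tag to verify that traffic then flows
ovs-vsctl set Port vif16.0 tag=1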

Note: When launching a new instance under XenServer, a virtual network interface named vifx.0 is created, and from OVS’s point of view a corresponding port is also created and bound to that interface.

Check why the tag is not set

The next step is to find out why the newly launched instance doesn't get a tag in OVS. There were no obvious findings for a newcomer like me: just reading the code over and over, making assumptions, testing, and so forth. But after trying various ideas, I found that each time I restarted neutron-openvswitch-agent (q-agt) on the Compute Node, the VM could get an IP if I executed the ifup eth0 command. So there must be something that is done when q-agt restarts but not when a new instance is launched. With this information I could focus my code inspection, and finally I found that, with XenServer, when a new instance is launched q-agt cannot detect the newly added port and so never adds a tag to the corresponding port.

That left the question of why q-agt cannot detect port changes. We have a session from DomU to Dom0 to monitor port changes, which did not seem to work as expected. With this in mind, I first ran the command ovsdb-client monitor Interface name,ofport in Dom0, which produces output like this:

   [root@rbobo ~]# ovsdb-client monitor Interface name,ofport
    row                                  action  name        ofport
    ------------------------------------ ------- ----------- ------
    54bcda61-de64-4d0e-a1c8-d339a2cabb50 initial "eth1"      1     
    987be636-b352-47a3-a570-8118b59c7bbc initial "xapi3"     65534 
    bb6a4f70-9f9c-4362-9397-010760f85a06 initial "xapi5"     65534 
    9ddff368-0be5-4f23-a03c-7940543d0ccc initial "vif15.2"   1     
    ba3af0f5-e8ed-4bdb-8c3d-67a638b81091 initial "phy-xapi3" 2     
    b57284cf-1dcd-4a10-bee1-42516afe2573 initial "eth0"      1     
    38a0dd37-173f-421c-9aba-3e03a5b8c900 initial "vif16.0"   2     
    58b83fe4-5f33-40f3-9dd9-d5d4b3f25981 initial "xenbr0"    65534 
    6c792964-3930-477c-bafa-5415259dea96 initial "int-xapi3" 1     
    caa52d63-59ed-4917-9ec3-1ea957470d5e initial "vif15.1"   1     
    d8805d05-bbd2-40cb-b219-eb9177c217dc initial "vif15.0"   6     
    8131dcd2-69ea-401a-a65e-4d4a17203e0c initial "xapi4"     65534 
    086e6e3a-1ab2-469f-9604-56bbd4c2fe86 initial "xenbr1"    65534 

Then I launched a new instance to see whether the OVS monitor would produce new output for it, and I did get output like:

    row                                  action name      ofport
    ------------------------------------ ------ --------- ------
    249c424a-4c9a-47b4-991a-bded9ec63ada insert "vif17.0" []

    row                                  action name      ofport
    ------------------------------------ ------ --------- ------
    249c424a-4c9a-47b4-991a-bded9ec63ada old              []
                                         new    "vif17.0" 3

So, this means the OVS monitor itself works well! There may be other errors in the code that consumes the monitoring output. It seems I'm getting closer to the root cause :)

Finally, I found that with XenServer, our current implementation cannot read the OVS monitor's output, and thus q-agt doesn't know a new port has been added. Luckily, the L2 agent provides another way of detecting port changes, so we can use that instead.

Setting minimize_polling=false in the L2 agent's configuration file ensures the agent does not rely on ovsdb-client monitor, which means the port is identified and the tag gets added properly!
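In practice that is a one-line change in the OVS agent's configuration; the exact file path depends on the deployment (DevStack may generate it elsewhere), but it typically looks like:

# e.g. /etc/neutron/plugins/ml2/openvswitch_agent.ini (path is an assumption)
[agent]
minimize_polling = False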

In this case, this is all that was needed to get an IP address and everything else worked normally. I hope the process I went through in debugging this problem will be beneficial to others.     


Review: XenServer 6.5 SP1 Training CXS-300

A few weeks ago, I received an invitation to participate in the first new XenServer class to be rolled out in over three years, namely CXS-300: Citrix XenServer 6.5 SP1 Administration. Those of you with good memories may recall that XenServer 6.0, on which the previous course was based, was officially released on September 30, 2011. Being an invited guest in what was to be only the third time the class had been ever held was something that just couldn’t be passed up, so I hastily agreed. After all, the evolution of the product since 6.0 has been enormous. Plus, I have been a huge fan of XenServer since first working with version 5.0 back in 2008.  Shortly before the open-sourcing of XenServer in 2013, I still recall the warnings of brash naysayers that XenServer was all but dead. However, things took a very different turn in the summer of 2013 with the open-source release and subsequent major efforts to improve and augment product features. While certain elements were pulled and restored and there was a bit of confusion about changes in the licensing models, things have stabilized and all told, the power and versatility of XenServer with the 6.5 SP1 release is at a level now some thought it would never reach.

FROM 6.0 TO 6.5 – AND BEYOND

XenServer (XS for short) 6.5 SP1 made its debut on May 12, 2015. The feature set and changes are – as always – incorporated within the release notes. There are a number of changes of note that include an improved hotfix application mechanism, a whole new XenCenter layout (since 6.5), increased VM density, more guest OS support, a 64-bit kernel, the return of workload balancing (WLB) and the distributed virtual switch controller (DVSC) appliance, in-memory read caching, and many others. Significant improvements have been made to storage and network I/O performance and overall efficiency. XS 6.5 was also a release that benefited significantly from community participation in the Creedence project and the SP1 update builds upon this.

One notable point is that XenServer has been found to now host more XenDesktop/XenApp (XD/XA) instances than any other hypervisor (see this reference). And, indeed, when XenServer 6.0 was released, a lot of the associated training and testing on it was in conjunction with Provisioning Services (PVS). Some users, however, discovered XenServer long before this as a perfectly viable hypervisor capable of hosting a variety of Linux and Windows virtual machines, without having even given thought to XenDesktop or XenApp hosting. For those who first became familiar with XS in that context, the added course material covering provisioning services had in reality relatively little to do with XenServer functionality as an entity. Some viewed PVS an overly emphasized component of the course and exam. In this new course, I am pleased to say that XS’s original roots as a versatile hypervisor is where the emphasis now lies. XD/XA is of course discussed, but the many features available that are fundamental to XS itself is what the course focuses on, and it does that well.

COURSE MATERIALS: WHAT’S INCLUDED

The new “mission” of the course from my perspective is to focus on the core product itself and not only understand its concepts, but to be able to walk away with practical working knowledge. Citrix puts it that the course should be “engaging and immersive”. To that effect, the instructor-led course CXS-300 can be taken in a physical classroom or via remote GoToMeeting (I did the latter) and incorporates a lecture presentation, a parallel eCourseware manual plus a student exercise workbook (lab guide) and access to a personal live lab during the entire course. The eCourseware manual serves multiple purposes, providing the means to follow along with the instructor and later enabling an independent review of the presented material. It adds a very nice feature of providing an in-line notepad for each individual topic (hence, there are often many of these on a page) and these can be used for note taking and can be saved and later edited. In fact, a great takeaway of this training is that you are given permanent access to your personalized eCourseware manual, including all your notes.

The course itself is well organized; there are so many components to XenServer that five days works out in my opinion to be about right – partly because often question and answer sessions with the instructor will take up more time than one might guess, and also, in some cases all participants may have already some familiarity with XS or other hypervisor that makes it possible to go into some added depth in some areas. There will always need to be some flexibility depending on the level of students in any particular class.

A very strong point of the course is the set of diagrams and illustrations that are incorporated, some of which are animated. These complement the written material very well and the visual reinforcement of the subject matter is very beneficial. Below is an example, illustrating a high availability (HA) scenario:

[Figure: course illustration of a high availability (HA) scenario]

 

The course itself is divided into a number of chapters that cover the whole range of features of XS, reinforced by some in-line Q&A examples in the eCourseware manual and with related lab exercises.  Included as part of the course are not only important standard components, such as HA and Xenmotion, but some that require plugins or advanced licenses, such as workload balancing (WLB), the distributed virtual switch controller (DVSC) appliance and in-memory read caching. The immediate hands-on lab exercises in each chapter with the just-discussed topics are a very strong point of the course and the majority of exercises are really well designed to allow putting the material directly to practical use. For those who have already some familiarity with XS and are able to complete the assignments quickly, the lab environment itself offers a great sandbox in which to experiment. Most components can readily be re-created if need be, so one can afford to be somewhat adventurous.

The lab, while relying heavily on the XenCenter GUI for most of the operations, does make a fair amount of use of the command line interface (CLI) for some operations. This is a very good thing for several reasons. First off, one may not always have access to XenCenter and knowing some essential commands is definitely a good thing in such an event. The CLI is also necessary in a few cases where there is no equivalent available in XenCenter. Some CLI commands offer some added parameters or advanced functionality that may again not be available in the management GUI. Furthermore, many operations can benefit from being scripted and this introduction to the CLI is a good starting point. For Windows aficionados, there are even some PowerShell exercises to whet their appetites, plus connecting to an Active Directory server to provide role-based access control (RBAC) is covered.

THE INSTRUCTOR

So far, the materials and content have been the primary points of discussion. However, what truly can make or break a class is the instructor. The class happened to be quite small, and primarily with individuals attending remotely. Attendees were in fact from four different countries in different time zones, making it a very early start for some and very late in the day for others. Roughly half of those participating in the class were not native English speakers, though all had admirable skills in both English and some form of hypervisor administration.  Being all able to keep up a common general pace allowed the class to flow exceptionally well. I was impressed with the overall abilities and astuteness of each and every participant.

The instructor, Jesse Wilson, was first class in many ways. First off, knowing the material and being able to present it well are primary prerequisites. But above and beyond that was his ability to field questions related to the topic at hand and even to go off onto relevant tangential material and be able to juggle all of that and still make sure the class stayed on schedule. Both keeping the flow going and also entertaining enough to hold students’ attention are key to holding a successful class. When elements of a topic became more of a debatable issue, he was quick to not only tackle the material in discussion, but to try this out right away in the lab environment to resolve it. The same pertained to demonstrating some themes that could benefit from a live demo as opposed to explaining them just verbally. Another strong point was his adding his own drawings to material to further clarify certain illustrations, where additional examples and explanations were helpful.

SUMMARY

All told, I found the course well structured, very relevant to the product and the working materials to be top notch. The course is attuned to the core product itself and all of its features, so all variations of the product editions are covered.

Positive points:

  • Good breadth of material
  • High-quality eCourseware materials
  • Well-presented illustrations and examples in the class material
  • Q&A incorporated into the eCourseware book
  • Ability to save course notes and permanent access to them
  • Relevant lab exercises matching the presented material
  • Real-life troubleshooting (nothing ever runs perfectly!)
  • Excellent instructor

Desiderata:

  • More “bonus” lab materials for those who want to dive deeper into topics
  • More time spent on networking and storage
  • A more responsive lab environment (which was slow at times)
  • More coverage of more complex storage Xenmotion cases in the lecture and lab

In short, this is a class that fulfills the needs of anyone from just learning about XenServer to even experienced administrators who want to dive more deeply into some of the additional features and differences that have been introduced in this latest XS 6.5 SP1 release. CXS-300: Citrix XenServer 6.5 SP1 Administration represents a makeover in every sense of the word, and I would say the end result is truly admirable.


Preview of XenServer Administrators Handbook

Administering any technology can be both fun and challenging at times. For many, the fun part is designing a new deployment, while for others the hardware selection process, system configuration and tuning, and the actual deployment can be a rewarding part of being an SRE. Then the challenging stuff hits where the design and deployment become a real part of the everyday inner workings of your company and with it come upgrades, failures, and fixes. For example, you might need to figure out how to scale beyond the original design, deal with failed hardware or find ways to update an entire data center without user downtime. No matter how long you've been working with a technology, the original paradigms often do change, and there is always an opportunity to learn how to do something more efficiently.

That's where a project JK Benedict and I have been working on with the good people of O'Reilly Media comes in. The idea is a simple one. We wanted a reference guide which would contain valuable information for anyone using XenServer - period. If you are just starting out, there would be information to help you make that first deployment a successful one. If you are looking at redesigning an existing deployment, there are valuable time-saving nuggets of info, too. If you are a longtime administrator, you would find some helpful recipes to solve real problems that you may not have tried yet. We didn't focus on long theoretical discussions, and we've made sure all content is relevant in a XenServer 6.2 or 6.5 environment. Oh, and we kept it concise because your time matters.

I am pleased to announce that attendees of OSCON will be able to get their hands on a preview edition of the upcoming XenServer Administrators Handbook. Not only will you be able to thumb through a copy of the preview book, but I'll have a signing at the O'Reilly booth on Wednesday July 22nd at 3:10 PM. I'm also told the first 25 people will get free copies, so be sure to camp out ;)

Now of course everyone always wants to know which animal gets featured on the book cover. As you can see below, we have a bird. Not just any bird mind you, but a xenops. Now I didn't do anything to steer O'Reilly towards this, but find it very cool that we have an animal which also represents a very core component in XenServer: the xenopsd. For me, that's a clear indication we've created the appropriate content, and I hope you'll agree.

[Image: preview book cover featuring a xenops]


XenServer's LUN scalability

"How many VMs can coexist within a single LUN?"

An important consideration when planning a deployment of VMs on XenServer is around the sizing of your storage repositories (SRs). The question above is one I often hear. Is the performance acceptable if you have more than a handful of VMs in a single SR? And will some VMs perform well while others suffer?

In the past, XenServer's SRs didn't always scale too well, so it was not always advisable to cram too many VMs into a single LUN. But all that changed in XenServer 6.2, allowing excellent scalability up to very large numbers of VMs. And the subsequent 6.5 release made things even better.

The following graph shows the total throughput enjoyed by varying numbers of VMs doing I/O to their VDIs in parallel, where all VDIs are in a single SR.

[Graph: aggregate throughput vs. number of VMs in a single SR, XenServer 6.1 vs. 6.5]

In XenServer 6.1 (blue line), a single VM would experience a modest 240 MB/s. But, counter-intuitively, adding more VMs to the same SR would cause the total to fall, reaching a low point around 20 VMs achieving a total of only 30 MB/s – an average of only 1.5 MB/s each!

On the other hand, in XenServer 6.5 (red line), a single VM achieves 600 MB/s, and it only requires three or four VMs to max out the LUN's capabilities at 820 MB/s. Crucially, adding further VMs no longer causes the total throughput to fall, but remains constant at the maximum rate.

And how well distributed was the available throughput? Even with 100 VMs, the available throughput was spread very evenly -- on XenServer 6.5 with 100 VMs in a LUN, the highest average throughput achieved by a single VM was only 2% greater than the lowest. The following graph shows how consistently the available throughput is distributed amongst the VMs in each case:

[Graph: per-VM throughput distribution with 100 VMs in a single LUN]

Specifics

  • Host: Dell R720 (2 x Xeon E5-2620 v2 @ 2.1 GHz, 64 GB RAM)
  • SR: Hardware HBA using FibreChannel to a single LUN on a Pure Storage 420 SAN
  • VMs: Debian 6.0 32-bit
  • I/O pattern in each VM: 4 MB sequential reads (O_DIRECT, queue-depth 1, single thread); an fio job roughly equivalent to this is sketched after the list. The graph above has a similar shape for smaller block sizes and for writes.
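For reference, an fio invocation roughly equivalent to that per-VM workload is shown below (the target device name is a placeholder; the exact job definition used for these tests isn't published here):

fio --name=seqread --filename=/dev/xvdb --rw=read --bs=4M \
    --direct=1 --ioengine=libaio --iodepth=1 --numjobs=1 \
    --time_based --runtime=60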

When Virtualised Storage is Faster than Bare Metal

An analysis of block size, inflight requests and outstanding data

INTRODUCTION

Back in August 2014 I went to the Xen Project Developer Summit in Chicago (IL) and presented a graph that caused a few faces to go "ahn?". The graph was meant to show how well XenServer 6.5 storage throughput could scale over several guests. For that, I compared 10 fio threads running in dom0 (mimicking 10 virtual disks) with 10 guests running 1 fio thread each. The result: the aggregate throughput of the virtual machines was actually higher.

In XenServer 6.5 (used for those measurements), the storage traffic of 10 VMs corresponds to 10 tapdisk3 processes doing I/O via libaio in dom0. My measurements used the same disk areas (raw block-based virtual disks) for each fio thread or tapdisk3. So how can 10 tapdisk3 processes possibly be faster than 10 fio threads also using libaio and also running in dom0?

At the time, I hypothesised that the lack of indirect I/O support in tapdisk3 was causing requests larger than 44 KiB (the maximum supported request size in Xen's traditional blkif protocol) to be split into smaller requests. And that the storage infrastructure (a Micron P320h) was responding better to a higher number of smaller requests. In case you are wondering, I also think that people thought I was crazy.
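To put a number on that hypothesis: with the traditional blkif protocol a request is limited to 11 segments of 4 KiB, i.e. 44 KiB, so larger guest requests have to be split. A quick back-of-envelope sketch (the 1 MiB example request is arbitrary):

# How many requests of at most 44 KiB does a large guest request become?
req_kib=1024                      # example: a 1 MiB guest request
cap_kib=44                        # traditional blkif limit (11 x 4 KiB segments)
splits=$(( (req_kib + cap_kib - 1) / cap_kib ))
echo "A ${req_kib} KiB request becomes ${splits} requests of at most ${cap_kib} KiB"
# -> A 1024 KiB request becomes 24 requests of at most 44 KiB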

You can check out my one-year-old hypothesis between 5:10 and 5:30 on the XPDS'14 recording of my talk: https://youtu.be/bbdWFB1mBxA?t=5m10s

[Slide (20150525-01-slide.jpg): throughput as a function of request size, 10 fio threads in dom0 vs. 10 guests with 1 fio thread each]

TRADITIONAL STORAGE AND MERGES

For several years operating systems have been optimising storage I/O patterns (in software) before issuing them to the corresponding disk drivers. In Linux, this has been achieved via elevator schedulers and the block layer. Requests can be reordered, delayed, prioritised and even merged into a smaller number of larger requests.

Merging requests has been around for as long as I can remember. Everyone understands that fewer requests mean less overhead and that storage infrastructures respond better to larger requests. As a matter of fact, the graph above, which shows throughput as a function of request size, is proof of that: bigger requests mean higher throughput.

It wasn't until 2010 that a proper means to fully disable request merging came into play in the Linux kernel. Alan Brunelle showed a 0.56% throughput improvement (and less CPU utilisation) by not trying to merge requests at all. I wonder if he considered whether splitting requests could actually be even more beneficial.

SPLITTING I/O REQUESTS

Given the results I have seen in my 2014 measurements, I would like to take this concept a step further: on top of not merging requests, let's forcibly split them.

The rationale behind this idea is that some drives today will respond better to a higher number of outstanding requests. The Micron P320h performance testing guide says that it "has been designed to operate at peak performance at a queue depth of 256" (page 11). Similar documentation from Intel uses a queue depth of 128 to indicate peak performance of its NVMe family of products.

But it is one thing to say that a drive requires a large number of outstanding requests to perform at its peak. It is a different thing to say that a batch of 8 requests of 4 KiB each will complete quicker than one 32 KiB request.

MEASUREMENTS AND RESULTS

So let's put that to the test. I wrote a little script to measure the random read throughput of two modern PCIe SSDs when facing workloads with varying block sizes and I/O depths. For block sizes from 512 B to 4 MiB, I am particularly interested in analysing how these disks respond to larger "single" requests in comparison to smaller "multiple" requests. In other words, what is faster: 1 outstanding request of X bytes or Y outstanding requests of X/Y bytes each?

My test environment consists of a Dell PowerEdge R720 (Intel E5-2643v2 @ 3.5GHz, 2 sockets, 6 cores/socket, HT enabled) with 64 GB of RAM, running Debian Jessie (64-bit) and the Linux 4.0.4 kernel. My two disks are an Intel P3700 (400GB) and a Micron P320h (175GB). Fans were set to full speed and the power profiles are configured for OS Control, with a performance governor in place.

#!/bin/bash
# For each drive and each total amount of outstanding data ("size"),
# sweep the queue depth in powers of two and issue random reads of
# size/qd bytes each, recording the throughput reported by fio.
sizes="512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 \
       1048576 2097152 4194304"
drives="nvme0n1 rssda"

for drive in ${drives}; do
    for size in ${sizes}; do
        for ((qd=1; ${size}/${qd} >= 512; qd*=2)); do
            bs=$[ ${size} / ${qd} ]
            tp=$(fio --terse-version=3 --minimal --rw=randread --numjobs=1  \
                     --direct=1 --ioengine=libaio --runtime=30 --time_based \
                     --name=job --filename=/dev/${drive} --bs=${bs}         \
                     --iodepth=${qd} | awk -F';' '{print $7}')
            # Record: outstanding data, block size, queue depth, throughput
            echo "${size} ${bs} ${qd} ${tp}" | tee -a ${drive}.dat
        done
    done
done

There are several ways of looking at the results. I believe it is always worth starting with a broad overview including everything that makes sense. The graphs below contain all the data points for each drive. Keep in mind that the "x" axis represents Block Size (in KiB) over the Queue Depth.

[Graph (20150525-02-nvme.jpg): Intel P3700 random read throughput for all block size / queue depth combinations]

[Graph (20150525-03-rssda.jpg): Micron P320h random read throughput for all block size / queue depth combinations]

While the Intel P3700 is faster overall, both drives share a common trait: for a certain amount of outstanding data, throughput can be significantly higher if such data is split over several inflight requests (instead of a single large request). Because this workload consists of random reads, this is a characteristic that is not evident in spinning disks (where the seek time would negatively affect the total throughput of the workload).

To make this point clearer, I have isolated the workloads involving 512 KiB of outstanding data on the P3700 drive. The graph below shows that if a workload randomly reads 512 KiB of data one request at a time (queue depth=1), the throughput will be just under 1 GB/s. If, instead, the workload reads the same 512 KiB as 8 KiB requests with 64 outstanding at a time, the throughput roughly doubles (to just under 2 GB/s).

[Graph (20150525-04-nvme512k.jpg): P3700 throughput for workloads with 512 KiB of outstanding data, by queue depth]
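The same comparison can be pulled straight out of the .dat files produced by the script above. Assuming nvme0n1 corresponds to the P3700, a one-liner like the following selects every row where the total outstanding data is 512 KiB (the columns are size, block size, queue depth and throughput):

# Block size, queue depth and throughput for 512 KiB of outstanding data.
awk '$1 == 524288 { print $2, $3, $4 }' nvme0n1.dat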

CONCLUSIONS

Storage technologies are constantly evolving. At this point in time, it appears that hardware is evolving much faster than software. In this post I have discussed a paradigm of workload optimisation (request merging) that perhaps no longer applies to modern solid state drives. As a matter of fact, I am proposing that the exact opposite (request splitting) should be done in certain cases.

Traditional spinning disks have always responded better to large requests. Such workloads reduced the overhead of seek times where the head of a disk must roam around to fetch random bits of data. In contrast, solid state drives respond better to parallel requests, with virtually no overhead for random access patterns.

Virtualisation platforms and software-defined storage solutions are perfectly placed to take advantage of such paradigm shifts. By understanding the hardware infrastructure they are placed on top of, as well as the workload patterns of their users (e.g. Virtual Desktops), requests can be easily manipulated to better exploit system resources.


Security bulletin covering VENOM

Last week a vulnerability in QEMU was reported under the marketing name "VENOM", more correctly known as CVE-2015-3456.  Citrix has released a security bulletin covering CVE-2015-3456, which has been updated to include hotfixes for XenServer 6.5, 6.5 SP1 and XenServer 6.2 SP1.

Learning about new XenServer hotfixes

When a hotfix is released for XenServer, it will be posted to the Citrix support web site. You can receive alerts from the support site by registering at http://support.citrix.com/profile/watches and following the instructions there. You will need to create an account if you don't have one, but the account is completely free. Whenever a security hotfix is released, there will be an accompanying security advisory in the form of a CTX knowledge base article for it, and those same KB articles will be linked on the xenserver.org download page.

Patching XenServer hosts

XenServer admins are encouraged to schedule patching of their XenServer installations at their earliest opportunity. Please note that this bulletin does impact XenServer 6.2 hosts; to apply the patch, all XenServer 6.2 hosts will first need to be patched to Service Pack 1, which can be found on the XenServer download page.
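For admins who prefer the command line, the hotfix workflow is roughly as follows; the file name below is illustrative, and the actual patch file comes from the relevant CTX article. This is a sketch only, so check the hotfix release notes for any pre- or post-apply steps.

# Upload the hotfix to the pool master and apply it pool-wide.
uuid=$(xe patch-upload file-name=XS62ESP1.xsupdate)
xe patch-pool-apply uuid=${uuid}
xe patch-list params=name-label,hosts     # confirm where it has been applied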


XenServer 6.5 SP1 Released

Wait, another XenServer release? Yes folks, there is no question we've been very busy improving upon XenServer over the past year, and the pace is quite fast. In case you missed it, we released XenServer 6.5 in January (formerly known as Creedence). Just a few weeks ago I announced and made available pre-release binaries for Dundee, and now we've just announced availability at Citrix Synergy of the first service pack for XenServer 6.5. Exciting times indeed.

What's in XenServer 6.5 SP1

I could bury the lead and talk about hotfixes and roll-ups (more on that later), but the real value of SP1 is in the increased capabilities. Here are the lead items for this service pack:

  1. The Docker work we previewed in January at FOSDEM and later on xenserver.org is now available. If you've been using xscontainer in preview form, it should upgrade fine, but you should back up any VMs first. Completion of the Docker work also means that CoreOS 633.1.0 is now an officially supported operating system with SP1. Containers deployed on Ubuntu 14.04, as well as on RHEL, CentOS, and Oracle Enterprise Linux 7 and higher, are supported.
  2. Adoption of LTS (long term support) guest support. XenServer guest support has historically required users of a guest operating system to wait for XenServer to adopt official support for point releases in order to remain in a supported configuration. Starting with SP1, all supported operating systems can be upgraded within their major version and still retain "supported" status from Citrix support. For example, if a CentOS 6.6 VM is deployed, and the CentOS project subsequently releases CentOS 6.7, then upgrading that VM to CentOS 6.7 requires no changes to XenServer in order to remain a supported configuration.
  3. Intel GVT-d support for GPU pass through for Windows guests. This allows users of Xeon E3 Haswell processors to use the embedded GPU in those processors within a Windows guest using standard Intel graphics drivers.
  4. NVIDIA GPU pass-through to Linux VMs, allowing OpenGL and CUDA support for these operating systems.
  5. Installation of supplemental packs can now be performed through XenCenter Update. Note that since driver disks are only a special case of a supplemental pack, driver updates or installation of drivers not required for host installation can now also be performed using this mechanism.
  6. Virtual machine density has been increased to 1000. What this means is that if you have a server which can reasonably be expected to run 1000 VMs of a given operating system, then using XenServer you can do so. No changes were made to the supported hardware configuration to accommodate this change.

Hotfix process

As with all XenServer service packs, XenServer 6.5 SP1 contains a rollup of all existing hotfixes for XenServer 6.5. This means that when provisioning a new host, your first post-installation step should be to apply SP1. It's also important to call out that when a service pack is released, hotfixes for the prior service pack level stop being created six months later. In this case, hotfixes for XenServer 6.5 will only be created through November 12th; after that point, hotfixes will only be created for XenServer 6.5 SP1. To help the development teams streamline that transition, any defects raised for XenServer 6.5 on bugs.xenserver.org should be raised against 6.5 SP1 and not base 6.5.

Where to get XenServer 6.5 SP1

 

Downloading XenServer 6.5 SP1 is very easy: simply go to http://xenserver.org/download and download it!


XenServer 6.5 and Asymmetric Logical Unit Access (ALUA) for iSCSI Devices

INTRODUCTION

There are a number of ways to connect storage devices to XenServer hosts and pools, including local storage, HBA SAS and Fibre Channel, NFS and iSCSI. With iSCSI, there are a number of implementation variations, including support for multipathing with both active/active and active/passive configurations, plus the ability to support so-called “jumbo frames” where the MTU is increased from 1500 to typically 9000 to optimize frame transmissions. One of the lesser-known and somewhat esoteric iSCSI options available on many modern iSCSI-based storage devices is Asymmetric Logical Unit Access (ALUA), a protocol that has been around for a decade and is all the more intriguing because it can be used not only with iSCSI, but also with Fibre Channel storage. The purpose of this article is to clarify and outline how ALUA can now be used more flexibly with iSCSI on XenServer 6.5.

HISTORY

ALUA support on XenServer goes back to XenServer 5.6, initially only for Fibre Channel devices. Support for iSCSI ALUA connectivity started with XenServer 6.0 and was initially limited to specific ALUA-capable devices, which included the EMC Clariion and NetApp FAS as well as the EMC VMAX and VNX series. Each device required specific multipath.conf file configurations to properly integrate with the server used to access them, XenServer being no exception; the upstream XenServer code also required customizations. The "How to Configure ALUA Multipathing on XenServer 6.x for Enterprise Arrays" article CTX132976 (March 2014, revised March 2015) currently only discusses ALUA support through XenServer 6.2 and only for specific devices, stating: “Most significant is the usability enhancement for ALUA; for EMC™ VNX™ and NetApp™ FAS™, XenServer will automatically configure for ALUA if an ALUA-capable LUN is attached”.

The XenServer 6.5 Release Notes announced that XenServer will automatically configure ALUA for the aforementioned documented devices, and that it now ships an updated device mapper multipath (DMMP) version, 0.4.9-72. This rekindled my interest in ALUA connectivity, and after some research and discussions with Citrix and Dell about support, it appeared this might now be possible specifically for the Dell MD3600i units we have used on XenServer pools for some time. What is not stated in the release notes is that XenServer 6.5 can now connect generically to a large number of ALUA-capable storage arrays; this is covered in detail later. It is also worth noting that MPP-RDAC support is no longer available in XenServer 6.5 and that DMMP is the only supported multipath mechanism, in part because of support and vendor-specific issues (see, for example, the XenServer 6.5 Release Notes or this document from Dell, Inc.).

But first, how are ALUA connections even established? And perhaps of greater interest, what are the benefits of ALUA in the first place?

ALUA DEFINITIONS AND SETTINGS

As the name suggests, ALUA is intended to optimize storage traffic by making use of optimized paths. With multipathing and multiple controllers, there are a number of paths a packet can take to reach its destination. With two controllers on a storage array and two NICs dedicated to iSCSI traffic on a host, there are four possible paths to a storage Logical Unit Number (LUN). On the XenServer side, LUNs are then associated with storage repositories (SRs). ALUA recognizes that once an initial path to a LUN is established, any multipathing activity destined for that same LUN is better served if routed through the same storage array controller, and it attempts to do so as much as possible, unless of course a failure forces the connection to take an alternative path. ALUA connections fall into five self-explanatory categories (listed along with their associated hex codes):

  • Active/Optimized : x0
  • Active/Non-Optimized : x1
  • Standby : x2
  • Unavailable : x3
  • Transitioning : xf

For ALUA to work, an active/active storage path is required, and furthermore an asymmetrical active/active mechanism is involved. The advantage of ALUA comes from less fragmentation of packet traffic: wherever possible, both paths of the multipath connection are routed via the same storage array controller, since the extra path through a different controller is less efficient. It is very difficult to locate specific metrics on the overall gains, but hints of up to 20% can be found in on-line articles (e.g., this openBench Labs report on Nexsan), hence this is not an insignificant amount and potentially more significant than the gains reached by implementing jumbo frames. It should be noted that the debate continues to this day regarding the benefits of jumbo frames and to what degree, if any, they are beneficial. Among numerous articles to be found are: The Great Jumbo Frames Debate from Michael Webster, Jumbo Frames or Not - Purdue University Research, Jumbo Frames Comparison Testing, and MTU Issues from ESNet. Each installation environment will have its idiosyncrasies, and it is best to conduct tests within one's unique configuration to evaluate such options.

The SCSI Primary Commands (SPC-3) standard, part of the SCSI Architecture Model, defines the commands used to determine paths. The mechanism by which this is accomplished is target port group support (TPGS). The characteristics of a path can be read via an RTPG command or set with an STPG command. With ALUA, non-preferred controller paths are used only for fail-over purposes. This is illustrated in Figure 1, where an optimized network connection is shown in red, taking advantage of routing all the storage network traffic via Node A (e.g., storage controller module 0) to LUN A (e.g., 2).

 


Figure 1.  ALUA connections, with the active/optimized paths to Node A shown as red lines and the active/non-optimized paths shown as dotted black lines.

 

Various SPC commands are provided as utilities within the sg3_utils (SCSI generic) Linux package.

There are other ways to make such queries; for example, VMware has an “esxcli nmp device list” command, and NetApp appliances support “igroup” commands that will provide direct information about ALUA-related connections.

Let us first examine a generic Linux server with ALUA support connected to an ALUA-capable device. In general, this will entail a specific configuration in the /etc/multipath.conf file; typical entries, especially for some older arrays or XenServer versions, will use one or more explicit configuration parameters such as:

  • hardware_handler "1 alua"
  • prio "alua"
  • path_checker "alua"

Consulting the Citrix knowledge base article CTX132976, we see, for example, that the EMC Corporation DGC Clariion device makes use of an entry configured as:

        device{
                vendor "DGC"
                product "*"
                path_grouping_policy group_by_prio
                getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
                prio_callout "/sbin/mpath_prio_emc /dev/%n"
                hardware_handler "1 alua"
                no_path_retry 300
                path_checker emc_clariion
                failback immediate
        }

To investigate the multipath configuration in more detail, we can make use of the TPGS setting. The TPGS setting can be read using the sg_rtpg command. By using multiple “v” flags to increase verbosity and “d” to specify the decoding of the status code descriptor returned for the asymmetric access state, we might see something like the following for one of the paths:

# sg_rtpg -vvd /dev/sde
open /dev/sdg with flags=0x802
    report target port groups cdb: a3 0a 00 00 00 00 00 00 04 00 00 00
    report target port group: requested 1024 bytes but got 116 bytes
Report list length = 116
Report target port groups:
  target port group id : 0x1 , Pref=0
    target port group asymmetric access state : 0x01 (active/non optimized)
    T_SUP : 0, O_SUP : 0, U_SUP : 1, S_SUP : 0, AN_SUP : 1, AO_SUP : 1
    status code : 0x01 (target port asym. state changed by SET TARGET PORT GROUPS command)
    vendor unique status : 0x00
    target port count : 02
    Relative target port ids:
      0x01
      0x02
(--snip--)

From this output, we see specifically that target port group 1 is on an active/non-optimized ALUA path, both from the "asymmetric access state" line and from the "status code". We also see there are two paths identified, with target port IDs 1,1 and 1,2.

There is a slew of additional "sg" commands, such as sg_inq (often used with the flag "-p 0x83" to get the VPD (vital product data) page of interest), sg_rdac, and others. The sg_inq command will in general return TPGS > 0 for devices that support ALUA; more on that later in this article. One additional command of particular interest, because not all storage arrays in fact support target port group queries (more on this important point later!), is sg_vpd (the sg vital product data fetcher), as it does not require TPG access. The base syntax of interest here is:

sg_vpd -p 0xc9 --hex /dev/…

where “/dev/…” should be the full path to the device in question. Looking at an example of the output of a real such device, we get:

# sg_vpd -p 0xc9 --hex /dev/mapper/mpathb1
Volume access control (RDAC) VPD Page:
00     00 c9 00 2c 76 61 63 31  f1 01 00 01 01 01 00 00    ...,vac1........
10     00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00    ................
20     00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00    ................

If one reads the source code for various device handlers (see the multipath tools hardware table for an extensive list of hardware profiles, as well as the Linux SCSI device handler code for how the data are interpreted), one finds that the value of interest here is avte_cvp, part of the RDAC c9_inquiry structure and the sixth hex value in the output. Shifted right five bits and ANDed with 0x1, it indicates whether the connected device is using ALUA (in the RDAC world, known as IOSHIP mode); shifted right seven bits and ANDed with 0x1, it indicates AVT (Automatic Volume Transfer) mode; otherwise it defaults in general to basic RDAC (legacy) mode. In the case above we see "61" returned, so (0x61 >> 5) & 0x1 equals 1, and hence the above connection is indeed an ALUA RDAC-based connection.
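The bit arithmetic above is easy to check from a shell. A tiny sketch using the value seen in the example output:

# Decode the sixth hex value (avte_cvp) from the 0xc9 VPD page.
avte_cvp=0x61
echo "IOSHIP (ALUA) mode: $(( (avte_cvp >> 5) & 0x1 ))"   # 1 here, so ALUA is in use
echo "AVT mode:           $(( (avte_cvp >> 7) & 0x1 ))"   # 0 here
# Neither bit set would indicate basic RDAC (legacy) mode.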

I will revisit sg commands later on. Do note that the sg3_utils package is not installed on stock XenServer distributions and, as with any external package, installing it may void official Citrix support.

MULTIPATH CONFIGURATIONS AND REPORTS

In addition to all the information that various sg commands provide, there is also an abundance of information available from the standard multipath command. We saw a sample multipath.conf file earlier, and at least with many standard Linux OS versions and ALUA-capable arrays, information on the multipath status can be more readily obtained using stock multipath commands.

For example, on an ALUA-enabled connection we might see output similar to the following from a "multipath -ll" command (there will be a number of variations in output, depending on the version, verbosity and implementation of the multipath utility):

mpath2 (3600601602df02d00abe0159e5c21e111) dm-4 DGC,VRAID
[size=100G][features=1 queue_if_no_path][hwhandler=1 alua][rw]
_ round-robin 0 [prio=50][active]
 _ 1:0:3:20  sds   70:724   [active][ready]
 _ 0:0:1:20  sdk   67:262   [active][ready]
_ round-robin 0 [prio=10][enabled]
 _ 0:0:2:20  sde   8:592    [active][ready]
 _ 1:0:2:20  sdx   128:592  [active][ready]

Recalling the device sde from the section above, note that it falls under a path group with the lower priority of 10, indicating it is part of an active, non-optimized connection, versus 50, which indicates an active, optimized group; a priority of "1" would indicate the device is in the standby group. Depending on what mechanism is used to generate the priority values, be aware that these values will vary considerably; the most important point is that whichever path group has the higher "prio" value will be the optimized path. In some newer versions of the multipath utility, the string "hwhandler=1 alua" shows clearly that the controller is configured to allow the hardware handler to help establish the multipathing policy, as well as that ALUA is established for this device. I have read that the path priority will typically be elevated to a value of between 50 and 80 for optimized ALUA-based connections (cf. mpath_prio_alua in this Suse article), but have not seen this consistently.

The multipath.conf file itself has traditionally needed tailoring to each specific device. It is particularly convenient, however, that a generic configuration is now possible for a device that makes use of the internal hardware handler, is rdac-based, and can auto-negotiate an ALUA connection. The vendor and product entries below identify the specific device itself, but others should now work using this generic sort of configuration:

device {
                vendor                  "DELL"
                product                 "MD36xx(i|f)"
                features                "2 pg_init_retries 50"
                hardware_handler        "1 rdac"
                path_selector           "round-robin 0"
                path_grouping_policy    group_by_prio
                failback                immediate
                rr_min_io               100
                path_checker            rdac
                prio                    rdac
                no_path_retry           30
                detect_prio             yes
                retain_attached_hw_handler yes
        }

Note how this differs from the "stock" version (in XenServer 6.5) of the MD36xx multipath configuration shown below; the additional entries in the version above are detect_prio and retain_attached_hw_handler:

device {
                vendor                  "DELL"
                product                 "MD36xx(i|f)"
                features                "2 pg_init_retries 50"
                hardware_handler        "1 rdac"
                path_selector           "round-robin 0"
                path_grouping_policy    group_by_prio
                failback                immediate
                rr_min_io               100
                path_checker            rdac
                prio                    rdac
                no_path_retry           30
        }

THE CURIOUS CASE OF DELL MD32XX/36XX ARRAY CONTROLLERS

The LSI controllers incorporated into Dell’s MD32xx and MD36xx series of iSCSI storage arrays represent an unusual and interesting case. As promised earlier, we will get back to looking at the sg_inq command, which queries a storage device for several pieces of information, including TPGS. Typically, an array that supports ALUA will return a value of TPGS > 0, for example:

# sg_inq /dev/sda
standard INQUIRY:
PQual=0 Device_type=0 RMB=0 version=0x04 [SPC-2]
[AERC=0] [TrmTsk=0] NormACA=1 HiSUP=1 Resp_data_format=2
SCCS=0 ACC=0 TPGS=1 3PC=1 Protect=0 BQue=0
EncServ=0 MultiP=1 (VS=0) [MChngr=0] [ACKREQQ=0] Addr16=0
[RelAdr=0] WBus16=0 Sync=0 Linked=0 [TranDis=0] CmdQue=1
[SPI: Clocking=0x0 QAS=0 IUS=0]
length=117 (0x75) Peripheral device type: disk
Vendor identification: NETAPP
Product identification: LUN
Product revision level: 811a

In the output above, we see that TPGS is reported to have a value of 1. The MD36xx has supported ALUA since RAID controller firmware 07.84.00.64 and NVSRAM N26X0-784890-904; however, even with that (or a newer) revision level, sg_inq returns the following for this particular storage array:

# sg_inq /dev/mapper/36782bcb0002c039d00005f7851dd65de
standard INQUIRY:
  PQual=0  Device_type=0  RMB=0  version=0x05  [SPC-3]
  [AERC=0]  [TrmTsk=0]  NormACA=1  HiSUP=1  Resp_data_format=2
  SCCS=0  ACC=0  TPGS=0  3PC=1  Protect=0  BQue=0
  EncServ=1  MultiP=1 (VS=0)  [MChngr=0]  [ACKREQQ=0]  Addr16=0
  [RelAdr=0]  WBus16=1  Sync=1  Linked=0  [TranDis=0]  CmdQue=1
  [SPI: Clocking=0x0  QAS=0  IUS=0]
    length=74 (0x4a)   Peripheral device type: disk
 Vendor identification: DELL
 Product identification: MD36xxi
 Product revision level: 0784
 Unit serial number: 142002I

Various attempts to modify the multipath.conf file to try to force TPGS to appear with any value greater than zero all failed. Above all, it seemed that without access to the TPGS command, there was no way to query the device for ALUA-related information.  Furthermore, the command mpath_prio_alua and similar commands appear to have been deprecated in newer versions of the device-mapper-multipath package, and so offer no help.

This proved to be a major roadblock. Ultimately, the key to checking for ALUA connectivity in this particular case turned out, oddly, to be ignoring what TPGS reports and instead focusing on what the MD36xx controller is doing. The hardware handler is taking over control, and the clue comes from the sg_vpd output shown above. To see how a LUN is mapped for these particular devices, one needs to hunt back through the /var/log/messages file for entries that appeared when the LUN was first attached. Since the MD36xx uses the internal "rdac" mechanism for the hardware handler, a grep for "rdac" in /var/log/messages around the time the connection to a LUN was established should reveal how it was established.
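In other words, something along these lines (the log path is the standard one in dom0; older entries may have rotated into /var/log/messages.*):

# Look for rdac device-handler messages logged when the LUNs were attached.
grep -i rdac /var/log/messages | less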

Sure enough, in a case where the connection is known not to be making use of ALUA, you might see entries such as these:

[   98.790309] rdac: device handler registered
[   98.796762] sd 4:0:0:0: rdac: AVT mode detected
[   98.796981] sd 4:0:0:0: rdac: LUN 0 (owned (AVT mode))
[   98.797672] sd 5:0:0:0: rdac: AVT mode detected
[   98.797883] sd 5:0:0:0: rdac: LUN 0 (owned (AVT mode))
[   98.798590] sd 6:0:0:0: rdac: AVT mode detected
[   98.798811] sd 6:0:0:0: rdac: LUN 0 (owned (AVT mode))
[   98.799475] sd 7:0:0:0: rdac: AVT mode detected
[   98.799691] sd 7:0:0:0: rdac: LUN 0 (owned (AVT mode))

In contrast, an ALUA-based connection to LUNs on an MD3600i with firmware new enough to support ALUA, accessed from a client that also supports ALUA and has a properly configured entry in /etc/multipath.conf, will instead show the IOSHIP connection mechanism, as seen below (see p. 124 of this IBM System Storage manual for more on I/O Shipping):

Mar 11 09:45:45 xs65test kernel: [   70.823257] scsi 8:0:0:1: rdac: LUN 1 (IOSHIP) (owned)
Mar 11 09:45:46 xs65test kernel: [   71.385835] scsi 9:0:0:0: rdac: LUN 0 (IOSHIP) (unowned)
Mar 11 09:45:46 xs65test kernel: [   71.389345] scsi 9:0:0:1: rdac: LUN 1 (IOSHIP) (owned)
Mar 11 09:45:46 xs65test kernel: [   71.957649] scsi 10:0:0:0: rdac: LUN 0 (IOSHIP) (owned)
Mar 11 09:45:46 xs65test kernel: [   71.961788] scsi 10:0:0:1: rdac: LUN 1 (IOSHIP) (unowned)
Mar 11 09:45:47 xs65test kernel: [   72.531325] scsi 11:0:0:0: rdac: LUN 0 (IOSHIP) (owned)

Hence, we happily recognize that indeed, ALUA is working.

The even better news is that not only is ALUA now functional in XenServer 6.5, it should also work with a large number of ALUA-capable storage arrays, both those needing custom configurations and potentially many that work generically. Another surprising find was that for the MD3600i arrays tested, even the "stock" version of the MD36xxi multipath configuration entry provided with XenServer 6.5 creates ALUA connections. The reason is that the hardware handler is used consistently, provided no specific profile overrides intervene, so the storage device is primarily doing the negotiation itself rather than being driven by the file-based configuration. This is also what made determining ALUA connectivity more difficult: the TPGS setting was never changed from zero and consequently could not be used to query the group settings.

CONCLUSIONS

First off, it is really nice to know that many modern storage devices support ALUA and that XenServer 6.5 now provides an easier means to leverage this protocol. It is also a lesson that documentation can be hard to find and, in some cases, in need of updating to reflect the current state. Individual vendors will generally provide specific instructions regarding iSCSI connectivity, which should of course be followed. Experimentation is best carried out on non-production servers where a major faux pas will not have catastrophic consequences.

To me, this was also a lesson in persistence as well as an opportunity to share the curiosity and knowledge among a number of individuals who were helpful throughout this process. Above all, among many who deserve thanks, I would like to thank in particular Justin Bovee from Dell and Robert Breker of Citrix for numerous valuable conversations and information exchanges.


Join XenServer Day Fortaleza 2015

Join XenServer Day Fortaleza 2015

Fortaleza | Ceará | Brazil | 27/02/2015 | 14:00
UFC - Universidade Federal do Ceará, Campus do Pici | Block 902 |
Auditório Reitor Ícaro de Souza Moreira (Science Center Auditorium)

XenServer Day is an event created for corporate users, developers, service providers and enthusiasts of Citrix™ XenServer™. The event will be held in the city of Fortaleza/CE at the Universidade Federal do Ceará, Auditório Reitor Ícaro de Souza Moreira (Science Center Auditorium), on 27/02/2015. This edition of XenServer Day will be held in conjunction with the XenServer Creedence World Tour, an initiative of the open source XenServer.org community for the launch of XenServer 6.5 (codename Creedence).

The talk schedule is below:

14:00 - Event opening
14:20 - What's new in Citrix XenServer 6.5 - Lorscheider Santiago (Quales Tecnologia)
15:30 - Unitrends talk - André Favoretto (Globix)
16:10 - Managing virtual cloud, server and desktop infrastructures with Citrix XenServer 6.5 - Lorscheider Santiago (Quales Tecnologia)
17:00 - IT infrastructure with security: what it means for your business - Vinícius Minneto (Ascenty)
17:30 - Closing

The first attendees to arrive at the event will receive the official XenServer Creedence World Tour shirt (limited stock).

For more information about the event and to register, click here.

On social media? Share the hashtag: #XenServerDayFortaleza

Lorscheider Santiago - @lsantiagos


XenServer at FOSDEM

With Creedence just released as XenServer 6.5, 2015 has definitely started off with a bang. In 2014 the focus for XenServer was on a platform refresh, and creating a solid platform for future work. For me, 2015 is about enabling the ecosystem to be successful with XenServer, and that's where FOSDEM comes in. For those unfamiliar with FOSDEM, it's the Free and Open Source Software Developers' European Meeting, and many of the most influential projects will have strong representation. Many of those same projects have strong relationships with other hypervisors, but not necessarily with XenServer. For those projects, XenServer needs to demonstrate its relevance, and I hope through a set of demos within the Xen Project stand to provide exactly that.

Demo #1 - Provisioning Efficiency

XenServer is a hypervisor, and as such is first and foremost a provisioning target. That means it needs to work well with provisioning solutions and their respective template paradigms. Some of you may have seen me present at various events on the topic of hypervisor selection in various cloud provisioning tools. One of the core workflow items for all cloud solutions is the ability to take a template and provision it consistently to the desired hypervisor. In Apache CloudStack with XenServer, for example, those templates are VHD files. Unfortunately, XenServer by default exports XVA files, not native VHD, which makes the template process for CloudStack needlessly difficult.
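As an illustration of that default behaviour, exporting a VM from the CLI produces an XVA archive rather than a VHD; the VM name and target path here are illustrative:

# Export a halted VM as an XVA archive (the default XenServer export format).
xe vm-export vm=centos7-template filename=/tmp/centos7-template.xva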

This is where a technology like Packer comes in. Some of the XenServer engineers have been working on a Packer integration to support Vagrant. That's cool, but I'm also looking at this from the perspective of other tools and so will be showing Packer creating a CentOS 7 template which could be used anywhere. That template would then be provisioned and as part of the post-provisioning configuration management become a "something" with the addition of applications.

Demo #2 - Application Containerization

Once I have my template from Packer, and have provisioned it into a XenServer 6.5 host, the next step is application management. For this I'm going to use Ansible to personalize the VM, and to add in some applications which are containerized by Docker. There has been some discussion in the marketplace about containers replacing VMs, and I really see proper use of containers as being efficient use of VMs not as a replacement for a VM. Proper container usage is really proper application management, and understanding when to use which technology. For me this means that a host is a failure point which contains VMs. A VM represents a security and performance wrapper for a given tenant and their applications. Within a VM applications are provisioned, and where containerization of the applications makes sense, it should be used.

System administrators should be able to directly manage each of these three "containers" from the same pane of glass, and as part of my demo, I'll be showing just that using XenCenter. XenCenter has a simple GUI from which host and VM level management can be performed, and which is in the process of being extended to include Dockerized containers.

With this as the demo backdrop, I encourage anyone planning on attending FOSDEM to please stop by and ask about the work we've done with Creedence and also where we're thinking of going. If you're a contributor to a project and would like to talk more about how integrating with XenServer might make sense, either for your project or as something we should be thinking about, please do feel free to reach out to me. Of course if you're not planning on being at FOSDEM, but know folks who are, please do feel free to have them seek me out. We want XenServer to be a serious contender in every data center, but if we don't know about issues facing your favorite projects, we can't readily work to resolve them.

By the way, if you'd like to plan anything around FOSDEM, please either comment on this blog or contact me on Twitter as @XenServerArmy.

-tim     


XenServer Support Options

Now that Creedence has been released as XenServer 6.5, I'd like to take this opportunity to highlight where to obtain what level of support for your installation.

Commercial Support

Commercial support is available from Citrix and many of its partners. A commercial support contract is appropriate if you're running XenServer in a production environment, particularly if downtime is a critical component of your SLA. It's important to note that commercial support is only available if the deployment follows the Citrix deployment guidelines, uses third party components from the Citrix Ready Marketplace, and is operated in accordance with the terms of the commercial EULA. Of course, since your deployment might not precisely follow these guidelines, commercial support may not be able to resolve all issues and that's where community support comes in.

Community Support

Community support is available from the Citrix support forums. The people on the forum are both Citrix support engineers and also your fellow system administrators. They are generally quite knowledgeable and enthusiastic to help someone be successful with XenServer. It's important to note that while the product and engineering teams may monitor the support forums from time to time, engineering level support should not be expected on the community forums.

Developer Support

Developer level support is available from the xs-devel list. This is your traditional development mailing list and really isn't appropriate for general support questions. Many of the key engineers are part of this list, and do engage on topics related to performance, feature development and code level issues. It's important to remember that the XenServer software is actually built from many upstream components, so the best source of information might be an upstream developer list and not xs-devel.

Self-support tool

Citrix maintains a self-support tool called Citrix Insight Services (CIS), formerly known as Tools-as-a-Service (TaaS). Insight Services takes a XenServer status report and analyzes it to determine if there are any operational issues present in the deployment. A best practice is to upload a report after installing a XenServer host to determine if any issues are present which could result in latent performance or stability problems. CIS is used extensively by the Citrix support teams, but doesn't require a commercial support contract for end users.
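A status report can be generated either from XenCenter (Server Status Report) or directly in dom0 and then uploaded to CIS. A minimal sketch, assuming the stock xen-bugtool utility present on XenServer hosts:

# Generate a server status report bundle non-interactively; the tool prints
# the location of the resulting archive, which can then be uploaded to CIS.
xen-bugtool --yestoall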

Submitting Defects

If you believe you have encountered a defect or limitation in the XenServer software, simply using one of these support options isn't sufficient for the incident to be added to the defect queue for evaluation. Commercial support users will need to have their case triaged and potentially escalated, with the result potentially being a hotfix. All other users will need to submit an incident report via bugs.xenserver.org. Please be as detailed as possible with any defect reports so that they can be reproduced, and it doesn't hurt to include the URL of any forum discussion or the TaaS ID in your report. Also, please be aware that while the issue may be urgent for you, any potential fix may take some time to be created. If your issue is urgent, you are strongly encouraged to follow the commercial support route, as Citrix escalation engineers have the ability to prioritize customer issues.

Additionally, it's important to point out that submitting a defect or incident report doesn't guarantee it'll be fixed. Some things simply work the way they do for very important reasons; other things may behave the way they do because of the way components interact. XenServer is tuned to provide a highly scalable virtualization platform, and if addressing an incident would require destabilizing that platform, it's unlikely to be changed.


About XenServer

XenServer is the leading open source virtualization platform, powered by the Xen Project hypervisor and the XAPI toolstack. It is used in the world's largest clouds and enterprises.
 
Commercial support for XenServer is available from Citrix.