Virtualization Blog

Discussions and observations on virtualization.
Tobias Kreidl was originally an astronomer by vocation and one of the early users of CCD technology for imaging space objects; his major interests have long included image processing and data analysis along with programming and systems and application administration. Since 1994, he has been a full-time employee in the IT department of Northern Arizona University (28.5k students, 3.5k faculty/staff), primarily involved in the academic area, and he has been a Team Lead / Senior Programmer since 1998 as well as adjunct faculty in astronomy since 1988. His interests include computer security, any kind of student services (lab technology, distance education, authentication/authorization, providing applications as services, etc.), assisting faculty with special Web hosting projects (in particular involving LAMP), open source applications, server/desktop virtualization (since 2004), ubiquitous computing, storage technologies, and GPU/vGPU technologies. His team has run multiple XenServer/XenDesktop/XenApp configurations since 2009. Tobias holds a PhD in astronomy from the University of Vienna, Austria, as well as CCA certification in XenServer 6. He can be encountered regularly on the Citrix Discussion Forum and lurks as well on LinkedIn and, more recently, CUGC, where he also serves as a Steering Committee member and is on the Content Working Group. He received the Citrix Technology Professional award in 2016.

XenServer High-Availability Alternative HA-Lizard


WHY HA AND WHAT IT DOES

XenServer (XS) contains a native high-availability (HA) option which allows quite a bit of flexibility in determining the state of a pool of hosts and under what circumstances Virtual Machines (VMs) are to be restarted on alternative hosts when a host loses the ability to serve VMs. HA is a very useful feature that protects VMs from staying failed in the event of a server crash or other incident that makes VMs inaccessible. Allowing a XS pool to help itself maintain the functionality of VMs plays a large role in sustaining as much uptime as possible. Permitting the servers to deal with fail-overs automatically makes system administration easier and allows for more rapid reaction times to incidents, leading to increased uptime for servers and the applications they run.

XS allows for the designation of three different treatments of Virtual Machines: (1) always restart, (2) restart if possible, and (3) do not restart. The VMs designated with the highest restart priority will be the first to be restarted, and all will be handled provided adequate resources (primarily, host memory) are available. A specific start order, allowing for some VMs to be checked to be running before others, can also be established. VMs will be automatically distributed among whatever remaining XS hosts are considered active. Note that, where necessary, VMs configured with expandable (dynamic) memory will be shrunk down to accommodate additional VMs, and those VMs designated to be restarted will also be run with reduced memory. If additional capacity exists to run more VMs, those designated as “restart if possible” will be brought online. VMs that are not considered essential typically will be marked as “do not restart” and hence will be left “off” even if they had been running before; any of those that are wanted must be restarted manually, resources permitting.
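The ordering described above can be pictured with a small Python sketch. It is only an illustration under simplifying assumptions: the class, the field names and the greedy memory check are hypothetical, and XenServer's real HA planner is considerably more involved.

```python
from dataclasses import dataclass

# Hypothetical labels mirroring the three treatments described above.
ALWAYS = "always restart"
IF_POSSIBLE = "restart if possible"
NEVER = "do not restart"

@dataclass
class VM:
    name: str
    priority: str
    memory_gb: int
    order: int = 0  # lower values are attempted first within a priority

def plan_restarts(vms, free_memory_gb):
    """Return the names of VMs to restart, in order, given spare host memory."""
    plan = []
    # "do not restart" VMs are never considered.
    for priority in (ALWAYS, IF_POSSIBLE):
        candidates = sorted((v for v in vms if v.priority == priority),
                            key=lambda v: v.order)
        for vm in candidates:
            if vm.memory_gb <= free_memory_gb:
                plan.append(vm.name)
                free_memory_gb -= vm.memory_gb
    return plan

failed = [
    VM("db", ALWAYS, 8, order=0),
    VM("web", ALWAYS, 4, order=1),
    VM("test", IF_POSSIBLE, 8),
    VM("scratch", NEVER, 2),
]
print(plan_restarts(failed, free_memory_gb=14))  # ['db', 'web']
```

With 14 GB free, the two “always restart” VMs fit and the “restart if possible” VM does not; with more spare memory it would be appended after them.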

XS also allows for specifying the number of host failures the pool should be able to tolerate; larger pools that are not overly populated with VMs can readily accommodate even two or more host failures.

The election of which hosts are “live” and should be considered active members of the pool follows a rather involved process that combines network accessibility with access to an independent, designated pooled Storage Repository (SR) that serves as an additional metric. The pooled SR can also be a Fibre Channel device, which is independent of Ethernet connections. A quorum-based algorithm is applied to establish which servers are up and active as members of the pool and which -- in the event of a pool master failure -- should be elected the new pool master.

 

WHEN HA WORKS, IT WORKS GREAT

Without going into more detail, suffice it to say that this methodology works very well, though it has a few prerequisites that need to be taken into consideration. First of all, the mandate that a pooled storage device be available clearly means that a pool consisting of hosts that only make use of local storage is precluded. Second, for a quorum to be possible, the pool must have a minimum of three hosts, or HA results will be unpredictable because the election of a pool master can become ambiguous. This comes about because of the so-called “split brain” issue (http://linux-ha.org/wiki/Split_Brain), which is endemic in many different operating system environments that employ a quorum as a means of making such a decision. Furthermore, while fencing (the process of isolating the host; see for example http://linux-ha.org/wiki/Fencing) is the typical recourse, the lack of intercommunication can result in a wrong decision being made and hence loss of access to VMs. Having experimented with two-host pools and the native XenServer HA, I would estimate that it works about half the time, which from a statistical viewpoint is pretty much what you would expect.
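To see why three hosts matter, consider a deliberately simplified strict-majority rule. This sketch is an assumption made for illustration only; it ignores the heartbeat SR and is not XenServer's actual election algorithm.

```python
def has_quorum(reachable_hosts, pool_size):
    """A partition holds quorum only if it contains a strict majority of the pool."""
    return reachable_hosts > pool_size / 2

# Three-host pool split 2/1: exactly one side may continue.
print(has_quorum(2, 3), has_quorum(1, 3))  # True False

# Two-host pool split 1/1: neither side has a majority, so the outcome is ambiguous.
print(has_quorum(1, 2), has_quorum(1, 2))  # False False
```

In the two-host case each node sees only itself, so a pure majority vote can never pick a winner, which is the split-brain ambiguity described above.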

These limitations are, however, of immediate concern to those with no pooled storage and/or only two hosts in a pool. With a little bit of extra network connectivity, a relatively simple and inexpensive solution to the external SR requirement can be provided by making a very small NFS-based SR available. The second condition, however, is not readily rectified without the expense of at least one additional host and all the connectivity associated with it. In some cases, this may simply not be an affordable option.

 

ENTER HA-LIZARD

For a number of years now, an alternative method of providing HA has been available through the program package provided by HA-Lizard (http://www.halizard.com/), a community project that provides a free alternative that neither depends on external SRs nor requires a minimum of three hosts within a pool. This blog focuses on the standard HA-Lizard version and, because it is the particularly hard-to-handle case, on the two-node pool.

I had been experimenting for some time with HA-Lizard and found in particular that I was able to create failure scenarios whose handling needed some improvement. HA-Lizard’s Salvatore Costantino was more than willing to lend an ear to the cases I had found, and this led to a very productive collaboration on investigating and implementing means to deal with a number of specific cases involving two-host pools. The result of these several months of effort is a new HA-Lizard release that manages to address a number of additional scenarios above and beyond its earlier capabilities.

It is worthwhile mentioning that there are two ways of deploying HA-Lizard:

1) Most use cases combine HA-Lizard and iSCSI-HA, which creates a two-node pool using local storage while maintaining full VM agility, with VMs being able to run on either host. This type of deployment uses DRBD (http://www.drbd.org/), and it works very well making use of the real-time storage replication.

2) HA-Lizard alone is used with an external Storage Repository (as in this particular case).

Before going into details of the investigation, a few words should go towards a brief explanation of how this works. Note that there is only Internet connectivity (the use of a heuristic network node) and no external SR, so how is a split brain situation then avoidable?

This is how I'd describe the course of action in this two-node situation:

If a node sees the gateway, assume it's alive. If it cannot, assume it's a good candidate for fencing.

If the node that cannot see the gateway is the master, it should internally kill any running VMs, surrender its ability to be the master and fence itself. The slave node should promote itself to master and attempt to restart any missing VMs. Any that were on the previous master will probably fail at first, though, because there is no communication with the old master. If those VMs cannot be restarted right away, the new master will eventually be able to restart them regardless after a toolstack restart.

If the slave node fails by not being able to communicate with the network, then as long as the master still sees the network but not the slave, it can assume the slave will fence itself and kill off its VMs, and that those VMs are to be restarted on the current master. The slave, for its part, needs to realize it cannot communicate out, and therefore should kill off any of its VMs and fence itself.
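The course of action just described can be condensed into a small decision function. This is my own minimal sketch in Python, not HA-Lizard's code; the function name, arguments and returned strings are all hypothetical.

```python
def next_action(role, sees_gateway, sees_peer):
    """Decide what a node in a two-node pool does, using only what it can observe itself."""
    if not sees_gateway:
        # An isolated node assumes it is the failed one, whatever its role.
        return "kill local VMs and self-fence"
    if sees_peer:
        return "continue normally"
    if role == "slave":
        # The master is assumed to have fenced itself.
        return "promote to master and restart missing VMs"
    # The slave is assumed to have fenced itself.
    return "remain master and restart the slave's VMs"

print(next_action("master", sees_gateway=False, sees_peer=False))
print(next_action("slave", sees_gateway=True, sees_peer=False))
```

Each node reaches its decision without hearing from the other, which is why the agreed-upon rules must be followed by both sides.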

Naturally, the trickier part comes with the timing of the various actions, since each node has to blindly assume the other is going to conduct a sequence of events. The key here is that these are all agreed on ahead of time and as long as each follows its own specific instructions, it should not matter that each of the two nodes cannot see the other node. In essence, the lack of communication in this case allows for creating a very specific course of action! If both nodes fail, obviously the case is hopeless, but that would be true of any HA configuration in which no node is left standing.

Various test plans were worked out for various cases and the table below elucidates the different test scenarios, what was expected and what was actually observed. It is very encouraging that the vast majority of these cases can now be properly handled.

 

Particularly tricky here was the case of rebooting the master server from the shell, without first disabling HA-Lizard (something one could readily forget to do). Since the fail-over process takes a while, a large number of VMs cannot be handled before the communication breakdown takes place, hence one is left with a bit of a mess to clean up in the end. Nevertheless, it’s still good to know what happens if something takes place that rightfully shouldn’t!

The other cases, whether intentional or not, are handled predictably and reliably, which is of course the intent. Typically, a two-node pool isn’t going to have a lot of complex VM dependencies, so the lack of a start order of VMs should not be perceived as a big shortcoming. Support for this feature may even be added in a future release.

 

CONCLUSIONS

HA-Lizard is a viable alternative to the native Citrix HA configuration. It’s straightforward to set up and can handle standard failover cases with a selective “restart/do not restart” setting that can be applied to each VM or configured globally. There are quite a number of configuration parameters, which the reader is encouraged to research in the extensive HA-Lizard documentation. There is also an online forum which serves as a source for information and prompt assistance with issues. The most recent release, 2.1.3, is supported on both XenServer 6.5 and 7.0.

Above all, HA-Lizard shines when it comes to handling a non-pooled storage environment and, in particular, all variants of the dreaded two-node pool. From my direct experience, HA-Lizard now handles the vast majority of issues involved in a two-node pool and can do so more reliably than the unsupported two-node pool using Citrix’s own HA application. It has been possible to conduct a lot of tests with various cases and, importantly, to do so multiple times to ensure the actions are predictable and repeatable.

I would encourage taking a look at HA-Lizard and giving it a good test run. The software is free (contributions are accepted) and it is in extensive use and has a proven track record.  For a two-host pool, I can frankly not think of a better alternative, especially with these latest improvements and enhancements.

I would also like to thank Salvatore Costantino for the opportunity to participate in this investigation and am very pleased to see the fruits of this collaboration. It has been one way of contributing to the Citrix XenServer user community that many can immediately benefit from.

 

 

 

 

 

 


NAU VMbackup 3.0 for XenServer


By Tobias Kreidl and Duane Booher

Northern Arizona University, Information Technology Services

Over eight years ago, back in the days of XenServer 5, not a lot of backup and restore options were available, either as commercial products or as freeware. We quickly came to the realization that data recovery was a vital component of a production environment and hence that we needed an affordable and flexible solution. The conclusion at the time was that we might as well build our own, and though the availability of options has grown significantly over the last number of years, we’ve stuck with our own home-grown solution, which leverages the Citrix XenServer SDK and XenAPI (http://xenserver.org/partners/developing-products-for-xenserver.html). Early versions were created from the contributions of Douglas Pace, Tobias Kreidl and David McArthur. During the last several years, the lion’s share of development has been performed by Duane Booher. This article discusses the latest 3.0 release.

A Bit of History

With the many alternatives now available, one might ask why we have stuck with this rather un-flashy script and CLI-based mechanism. There are numerous reasons. For one, in-house products allow total control over all aspects of their development and support. The financial outlay consists entirely of people’s time and, since there are no contracts or support fees, it’s very controllable and predictable. We also found from time to time that various features were not readily available in other sources we looked at. We also felt early on as an educational institution that we could give back to the community by freely providing the product along with its source code; the most recent version is available via GitHub at https://github.com/NAUbackup/VmBackup for free under the terms of the GNU General Public License. There was a write-up at https://www.citrix.com/blogs/2014/06/03/another-successful-community-xenserver-sdk-project-free-backup-tools-and-scripts-naubackup-restore-v2-0-released/ when the first GitHub version was published. Earlier versions were made available via the Citrix community site (Citrix Developer Network), sometimes referred to as the Citrix Code Share, where community contributions were published for a number of products. When that site was discontinued in 2013, we relocated the distribution to GitHub.

Because we “eat our own dog food,” VMbackup gets extensive and constant testing: we rely on it ourselves as the means to create backups and provide for restores in cases of accidental deletion, unexpected data corruption, or in the event that disaster recovery might be needed. The mechanisms are carefully tested before going into production and we perform frequent tests to ensure the integrity of the backups and that restores really do work. A number of times, we have had to recover from our backups and it has been very reassuring that these recoveries have been successful.

What VMbackup Does

Very simply, VMbackup provides a framework for backing up virtual machines (VMs) hosted on XenServer to an external storage device, as well as the means to recover such VMs for whatever reason that might have resulted in loss, be it disaster recovery, restoring an accidentally deleted VM, recovering from data corruption, etc.

The VMbackup distribution consists of a script written in Python and a configuration file. Apart from a README document, that’s it, other than the XenServer SDK components, which one needs to download separately; see http://xenserver.org/partners/developing-products-for-xenserver.html for details. There is no fancy GUI to become familiar with; instead, just a few simple things need to be configured and a destination for the backups needs to be made accessible (this is generally an NFS share, though SMB/CIFS will work, as well). Using cron job entries, a single host or an entire pool can be set up to perform periodic backups. Configurations are needed on the individual hosts in a pool because the pool master performs the majority of the work and the master role can readily move to a different XenServer host; individual host-based instances are also needed when local storage is used, since any local SR can only be accessed from its own XenServer host. A cron entry and numerous configuration examples are given in the documentation.

To avoid interruptions of any running VMs, the process of backing up a VM follows these basic steps:

  1. A snapshot of the VM and its storage is made
  2. Using the xe utility vm-export, that snapshot is exported to the target external storage
  3. The snapshot is deleted, freeing up that space
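The three steps above might be sketched as follows. This is an illustration rather than VMbackup's actual code: the function names and the snapshot label are hypothetical, the sketch assumes a Python 3 interpreter with the xe CLI on the path, and the template-param-set step is an assumption borrowed from common practice so that the exported snapshot restores as a regular VM.

```python
import subprocess

def run_xe(args):
    """Run an xe command on the XenServer host and return its trimmed output."""
    return subprocess.check_output(["xe"] + args, text=True).strip()

def backup_vm(vm_name, target_file, run=run_xe):
    """Snapshot a VM, export the snapshot, then remove the snapshot."""
    # Step 1: vm-snapshot prints the UUID of the new snapshot.
    snap_uuid = run(["vm-snapshot", "vm=" + vm_name,
                     "new-name-label=" + vm_name + "-backup"])
    try:
        # Snapshots are stored as templates; flag this one as a VM before export.
        run(["template-param-set", "is-a-template=false", "uuid=" + snap_uuid])
        # Step 2: export the snapshot to the external storage.
        run(["vm-export", "uuid=" + snap_uuid, "filename=" + target_file])
    finally:
        # Step 3: always free the snapshot's space, even if the export failed.
        run(["vm-uninstall", "uuid=" + snap_uuid, "force=true"])
    return snap_uuid
```

Passing a different `run` callable makes it possible to dry-run the sequence without touching a real pool.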

In addition, some VM metadata are collected and saved, which can be very useful in the event a VM needs to be restored. The metadata include:

  • vm.cfg - includes name_label, name_description, memory_dynamic_max, VCPUs_max, VCPUs_at_startup, os_version, orig_uuid
  • DISK-xvda (for each attached disk)
    • vbd.cfg - includes userdevice, bootable, mode, type, unplugable, empty, orig_uuid
    • vdi.cfg - includes name_label, name_description, virtual_size, type, sharable, read_only, orig_uuid, orig_sr_uuid
  • VIFs (for each attached VIF)
    • vif-0.cfg - includes device, network_name_label, MTU, MAC, other_config, orig_uuid

An additional option is to create a backup of the entire XenServer pool metadata, which is essential in dealing with the aftermath of a major disaster that affects the entire pool. This is accomplished via the “xe pool-dump-database” command.

In the event of errors, there are automatic clean-up procedures in place that will remove any remnants plus make sure that earlier successful backups are not purged beyond the specified number of “good” copies to retain.
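One way to picture such a retention rule is the sketch below. The policy shown (failed runs are always removed, and only successful copies count toward the quota) is an assumption made for illustration, not a description of VMbackup's internals, and the function name is hypothetical.

```python
def backups_to_purge(backups, max_backups):
    """backups: list of (timestamp, succeeded) tuples. Return timestamps safe to delete."""
    good = sorted((ts for ts, ok in backups if ok), reverse=True)
    failed = [ts for ts, ok in backups if not ok]
    # Keep the newest max_backups successful copies; failed remnants never count.
    return sorted(failed + good[max_backups:])

history = [
    ("2016-03-05", True),
    ("2016-03-12", True),
    ("2016-03-19", False),  # remnant of a failed run
    ("2016-03-26", True),
]
print(backups_to_purge(history, max_backups=2))  # ['2016-03-05', '2016-03-19']
```

The failed run is cleaned up, yet the two newest good copies survive, so an error never reduces the number of usable backups.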

There are numerous configuration options that allow you to specify which VMs get backed up, how many backup versions are to be retained, and whether the backups should be compressed (1) as part of the process, as well as optional report generation and notification setups.

New Features in VMbackup 3.0

A number of additional features have been added to this latest 3.0 release, adding flexibility and functionality. Some of these came about because of the sheer number of VMs that needed to be dealt with, SR space issues, and changes coming in the next XenServer release. These additions include:

  • VM “preview” option: To be able to look for syntax errors and ensure parameters are being defined properly, a VM can have a syntax check performed on it and if necessary, adjustments can then be made based on the diagnosis to achieve the desired configuration.
  • Support for VMs containing spaces: By surrounding VM names in the configuration file with double quotes, VM names containing spaces can now be processed. 
  • Wildcard suffixes: This very versatile option permits groups of VMs to be configured to be handled similarly, eliminating the need to create individual settings for every desired VM. Examples include “PRD-*”, “SQL*” and in fact, if all VMs in the pool should be backed up, even “*”. There are, however, a number of restrictions on wildcard usage (2).
  • Exclude VMs: Along with the wildcard option to select which VMs to back up, clearly a need arises to provide the means to exclude certain VMs (in addition to the other alternative, which is simply to rename them such that they do not match a certain backup set). Currently, each excluded VM must be named separately and any such VMs should be defined at the end of the configuration file.
  • Export the OS disk VDI, only: In some cases, a VM may contain multiple storage devices (VDIs) that are so large that it is impractical or impossible to take a snapshot of the entire VM and its storage. Hence, we have introduced the means to back up and restore only the operating system device (assumed to be Disk 0). In addition to space limitations, some storage, such as DB data, may not be able to be reliably backed up using a full VM snapshot. Furthermore, the next XenServer release (Dundee) will likely support up to as many as perhaps 255 storage devices per VM, making a vm-export even more involved under such circumstances. Another big advantage here is that currently, this process is much more efficient and faster than a VM export by a factor of three or more!
  • Root password obfuscation: So that clear-text passwords associated with the XenServer pool are not embedded in the scripts themselves, the password can be basically encoded into a file.

The mechanism for a running VM from which only the primary disk is to be backed up is similar to the full VM backup. The process of backing up such a VM follows these basic steps:

  1. A snapshot of just the VM's Disk 0 storage is made
  2. Using the xe utility vdi-export, that snapshot is exported to the target external storage
  3. The snapshot is deleted, freeing up that space
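A comparable sketch for the disk-only variant follows. Again this is illustrative and not VMbackup's code: the function names are hypothetical, a Python 3 interpreter with the xe CLI is assumed, and locating Disk 0 through a vbd-list filter on userdevice=0 as well as the choice of the VHD export format are my assumptions.

```python
import subprocess

def run_xe(args):
    """Run an xe command on the XenServer host and return its trimmed output."""
    return subprocess.check_output(["xe"] + args, text=True).strip()

def backup_os_disk(vm_name, target_file, run=run_xe):
    """Snapshot only the VDI attached as device 0, export it, then destroy the snapshot."""
    # Look up the VDI behind the VM's first disk (userdevice 0).
    vdi_uuid = run(["vbd-list", "vm-name-label=" + vm_name, "userdevice=0",
                    "params=vdi-uuid", "--minimal"])
    # Step 1: snapshot just that VDI.
    snap_uuid = run(["vdi-snapshot", "uuid=" + vdi_uuid])
    try:
        # Step 2: export the snapshot to the external storage.
        run(["vdi-export", "uuid=" + snap_uuid,
             "filename=" + target_file, "format=vhd"])
    finally:
        # Step 3: remove the snapshot to free up its space.
        run(["vdi-destroy", "uuid=" + snap_uuid])
    return vdi_uuid, snap_uuid
```

Only one disk is snapshotted, so the space needed on the SR during the backup is limited to that single VDI.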

As with the full VM export, some metadata for the VM are also collected and saved for this VDI export option.

These added features are of course subject to change in future releases, though later editions generally continue to support the behavior of previous versions to preserve backwards compatibility.

Examples

Let’s look at the configuration file weekend.cfg:

# Weekend VMs
max_backups=4
backup_dir=/snapshots/BACKUPS
#
vdi-export=PROD-CentOS7-large-user-disks
vm-export=PROD*
vm-export=DEV-RH*:3
exclude=PROD-ubuntu12-benchmark
exclude=PRODtestVM

Comment lines start with a hash mark and may appear anywhere within the file. The hash mark must be the first character in the line. Note that the default number of retained backups is set here to four. The destination directory is set next, indicating where the backups will be written. We then see a case where only the OS disk is being backed up for the specific VM "PROD-CentOS7-large-user-disks" and below that, all VMs beginning with “PROD” are backed up using the default settings. Just below that, a definition is created for all VMs starting with "DEV-RH" and the number of backups retained for all of these is reduced from the global default of four down to three. Finally, we see two excludes for specific VMs that fall into the “PROD*” group but should not be backed up at all.
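To make the selection rules concrete, here is a small Python sketch that interprets a configuration like weekend.cfg. It is not the parser used by VmBackup.py; the function names are hypothetical, the fallback of four backups when max_backups is absent is an assumption, and so is the rule that the first matching line wins. Quoted names containing spaces are not handled.

```python
def matches(name, pattern):
    """Only a single trailing wildcard is supported, as described in footnote (2)."""
    if pattern.endswith("*"):
        return name.startswith(pattern[:-1])
    return name == pattern

def select(vm_names, config_text):
    """Map each selected VM name to (export type, number of backups to keep)."""
    settings, rules, excludes = {}, [], []
    for line in config_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.strip().partition("=")
        if key in ("vm-export", "vdi-export"):
            pattern, _, keep = value.partition(":")
            rules.append((key, pattern, int(keep) if keep else None))
        elif key == "exclude":
            excludes.append(value)
        else:
            settings[key] = value
    default_keep = int(settings.get("max_backups", 4))  # assumed fallback
    chosen = {}
    for kind, pattern, keep in rules:
        for name in vm_names:
            if name in excludes or name in chosen:
                continue  # excluded, or already claimed by an earlier line
            if matches(name, pattern):
                chosen[name] = (kind, keep if keep is not None else default_keep)
    return chosen

WEEKEND_CFG = """# Weekend VMs
max_backups=4
backup_dir=/snapshots/BACKUPS
#
vdi-export=PROD-CentOS7-large-user-disks
vm-export=PROD*
vm-export=DEV-RH*:3
exclude=PROD-ubuntu12-benchmark
exclude=PRODtestVM
"""

pool = ["PROD-web", "PROD-CentOS7-large-user-disks", "PRODtestVM",
        "PROD-ubuntu12-benchmark", "DEV-RH7-build", "TEST-misc"]
for name, plan in sorted(select(pool, WEEKEND_CFG).items()):
    print(name, plan)
```

Run against the hypothetical pool above, the two excluded VMs and TEST-misc are skipped, the CentOS 7 VM gets a disk-only export, and DEV-RH7-build keeps three copies instead of four.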

To launch the script manually, you would issue from the command line:

./VmBackup.py password weekend.cfg

To launch the script via a cron job, you would create a single-line entry like this:

10 0 * * 6 /usr/bin/python /snapshots/NAUbackup/VmBackup.py password /snapshots/NAUbackup/weekend.cfg >> /snapshots/NAUbackup/logs/VmBackup.log 2>&1

This would run the task at ten minutes past midnight on Saturday and append its output, including any errors, to the log file VmBackup.log. This cron entry would need to be installed on each host of a XenServer pool.

Additional Notes

It can be helpful to stagger when backups are run so that they don’t all have to be done at once, which may be impractical, may take so long as to impact performance during the day, or may need to be coordinated with whatever time is best for specific VMs (such as before or after patches are applied). These situations are best dealt with by creating separate cron jobs for each subset.

There is a fair load on the server, comparable to any vm-export, and hence the queue is processed linearly with only one active snapshot and export sequence for a VM being run at a time. This is also why we suggest you perform the backups and then asynchronously perform any compression on the files on the external storage host itself to alleviate the CPU load on the XenServer host end.

For even more redundancy, you can readily duplicate or mirror the backup area to another storage location, perhaps in another building or even somewhere off-site. This can readily be accomplished using various copy or mirroring utilities, such as rcp, sftp, wget, nsync, rsync, etc.

This latest release has been tested on XenServer 6.5 (SP1) and various beta and technical preview versions of the Dundee release. In particular, note that the vdi-export utility, while dating back a while, is not well documented and we strongly recommend not trying to use it on any XenServer release before XS 6.5. Doing so is clearly at your own risk.

The NAU VMbackup distribution can be found at: https://github.com/NAUbackup/VmBackup

In Conclusion

This is a misleading heading, as there is not really a conclusion in the sense that this project continues to be active and as long as there is a perceived need for it, we plan to continue working on keeping it running on future XenServer releases and adding functionality as needs and resources dictate. Our hope is naturally that the community can make at least as good use of it as we have ourselves.

Footnotes:

  1. Alternatively, to save time and resources, the compression can potentially be handled asynchronously by the host onto which the backups are written, hence reducing overhead and resource utilization on the XenServer hosts, themselves.
  2. Certain limitations exist currently with how wildcards can be utilized. Leading wildcards are not allowed, nor are multiple wildcards within a string. This may be enhanced at a later date to provide even more flexibility.

This article was written by Tobias Kreidl and Duane Booher, both of Northern Arizona University, Information Technology Services. Tobias' biography is available at this site, and Duane's LinkedIn profile is at https://www.linkedin.com/in/duane-booher-a068a03 while both can also be found on http://discussions.citrix.com primarily in the XenServer forum.     


Review: XenServer 6.5 SP1 Training CXS-300

A few weeks ago, I received an invitation to participate in the first new XenServer class to be rolled out in over three years, namely CXS-300: Citrix XenServer 6.5 SP1 Administration. Those of you with good memories may recall that XenServer 6.0, on which the previous course was based, was officially released on September 30, 2011. Being an invited guest in what was to be only the third time the class had ever been held was something that just couldn’t be passed up, so I hastily agreed. After all, the evolution of the product since 6.0 has been enormous. Plus, I have been a huge fan of XenServer since first working with version 5.0 back in 2008. Shortly before the open-sourcing of XenServer in 2013, I still recall the warnings of brash naysayers that XenServer was all but dead. However, things took a very different turn in the summer of 2013 with the open-source release and subsequent major efforts to improve and augment product features. While certain elements were pulled and restored and there was a bit of confusion about changes in the licensing models, things have stabilized and all told, the power and versatility of XenServer with the 6.5 SP1 release is at a level now some thought it would never reach.

FROM 6.0 TO 6.5 – AND BEYOND

XenServer (XS for short) 6.5 SP1 made its debut on May 12, 2015. The feature set and changes are – as always – incorporated within the release notes. There are a number of changes of note that include an improved hotfix application mechanism, a whole new XenCenter layout (since 6.5), increased VM density, more guest OS support, a 64-bit kernel, the return of workload balancing (WLB) and the distributed virtual switch controller (DVSC) appliance, in-memory read caching, and many others. Significant improvements have been made to storage and network I/O performance and overall efficiency. XS 6.5 was also a release that benefited significantly from community participation in the Creedence project and the SP1 update builds upon this.

One notable point is that XenServer has been found to now host more XenDesktop/XenApp (XD/XA) instances than any other hypervisor (see this reference). And, indeed, when XenServer 6.0 was released, a lot of the associated training and testing on it was in conjunction with Provisioning Services (PVS). Some users, however, discovered XenServer long before this as a perfectly viable hypervisor capable of hosting a variety of Linux and Windows virtual machines, without having even given thought to XenDesktop or XenApp hosting. For those who first became familiar with XS in that context, the added course material covering Provisioning Services had in reality relatively little to do with XenServer functionality as an entity. Some viewed PVS as an overly emphasized component of the course and exam. In this new course, I am pleased to say that XS’s original role as a versatile hypervisor is where the emphasis now lies. XD/XA is of course discussed, but the many features that are fundamental to XS itself are what the course focuses on, and it does that well.

COURSE MATERIALS: WHAT’S INCLUDED

The new “mission” of the course from my perspective is to focus on the core product itself and not only understand its concepts, but to be able to walk away with practical working knowledge. Citrix puts it that the course should be “engaging and immersive”. To that effect, the instructor-led course CXS-300 can be taken in a physical classroom or via remote GoToMeeting (I did the latter) and incorporates a lecture presentation, a parallel eCourseware manual plus a student exercise workbook (lab guide) and access to a personal live lab during the entire course. The eCourseware manual serves multiple purposes, providing the means to follow along with the instructor and later enabling an independent review of the presented material. It adds a very nice feature of providing an in-line notepad for each individual topic (hence, there are often many of these on a page) and these can be used for note taking and can be saved and later edited. In fact, a great takeaway of this training is that you are given permanent access to your personalized eCourseware manual, including all your notes.

The course itself is well organized; there are so many components to XenServer that five days works out in my opinion to be about right. This is partly because question and answer sessions with the instructor will often take up more time than one might guess, and also because in some cases all participants may already have some familiarity with XS or another hypervisor, which makes it possible to go into added depth in some areas. There will always need to be some flexibility depending on the level of students in any particular class.

A very strong point of the course is the set of diagrams and illustrations that are incorporated, some of which are animated. These complement the written material very well and the visual reinforcement of the subject matter is very beneficial. Below is an example, illustrating a high availability (HA) scenario:

[Image: XS6.5SP1_course_image.jpg]

 

The course itself is divided into a number of chapters that cover the whole range of features of XS, reinforced by some in-line Q&A examples in the eCourseware manual and by related lab exercises. Included as part of the course are not only important standard components, such as HA and XenMotion, but some that require plugins or advanced licenses, such as workload balancing (WLB), the distributed virtual switch controller (DVSC) appliance and in-memory read caching. The immediate hands-on lab exercises in each chapter with the just-discussed topics are a very strong point of the course and the majority of exercises are really well designed to allow putting the material directly to practical use. For those who already have some familiarity with XS and are able to complete the assignments quickly, the lab environment itself offers a great sandbox in which to experiment. Most components can readily be re-created if need be, so one can afford to be somewhat adventurous.

The lab, while relying heavily on the XenCenter GUI for most of the operations, does make a fair amount of use of the command line interface (CLI) for some operations. This is a very good thing for several reasons. First off, one may not always have access to XenCenter and knowing some essential commands is definitely a good thing in such an event. The CLI is also necessary in a few cases where there is no equivalent available in XenCenter. Some CLI commands offer some added parameters or advanced functionality that may again not be available in the management GUI. Furthermore, many operations can benefit from being scripted and this introduction to the CLI is a good starting point. For Windows aficionados, there are even some PowerShell exercises to whet their appetites, plus connecting to an Active Directory server to provide role-based access control (RBAC) is covered.

THE INSTRUCTOR

So far, the materials and content have been the primary points of discussion. However, what truly can make or break a class is the instructor. The class happened to be quite small, with most individuals attending remotely. Attendees were in fact from four different countries in different time zones, making it a very early start for some and very late in the day for others. Roughly half of those participating in the class were not native English speakers, though all had admirable skills in both English and some form of hypervisor administration. Because everyone was able to keep up a common general pace, the class flowed exceptionally well. I was impressed with the overall abilities and astuteness of each and every participant.

The instructor, Jesse Wilson, was first class in many ways. First off, knowing the material and being able to present it well are primary prerequisites. But above and beyond that was his ability to field questions related to the topic at hand and even to go off onto relevant tangential material, juggling all of that while still making sure the class stayed on schedule. Keeping the flow going and being entertaining enough to hold students’ attention are both key to a successful class. When elements of a topic became more of a debatable issue, he was quick not only to tackle the material under discussion, but also to try it out right away in the lab environment to resolve it. The same pertained to demonstrating some themes that could benefit from a live demo as opposed to explaining them just verbally. Another strong point was his adding his own drawings to the material to further clarify certain illustrations, where additional examples and explanations were helpful.

SUMMARY

All told, I found the course well structured, very relevant to the product and the working materials to be top notch. The course is attuned to the core product itself and all of its features, so all variations of the product editions are covered.

Positive points:

  • Good breadth of material
  • High-quality eCourseware materials
  • Well-presented illustrations and examples in the class material
  • Q&A incorporated into the eCourseware book
  • Ability to save course notes and permanent access to them
  • Relevant lab exercises matching the presented material
  • Real-life troubleshooting (nothing ever runs perfectly!)
  • Excellent instructor

Desiderata:

  • More “bonus” lab materials for those who want to dive deeper into topics
  • More time spent on networking and storage
  • A more responsive lab environment (which was slow at times)
  • More coverage of more complex storage XenMotion cases in the lecture and lab

In short, this is a class that fulfills the needs of anyone from just learning about XenServer to even experienced administrators who want to dive more deeply into some of the additional features and differences that have been introduced in this latest XS 6.5 SP1 release. CXS-300: Citrix XenServer 6.5 SP1 Administration represents a makeover in every sense of the word, and I would say the end result is truly admirable.


Configuring XenApp to use two NVIDIA GRID engines

SUMMARY

The configuration of a XenApp virtual machine (VM) hosted on XenServer that supports two concurrent graphics processing engines in passthrough mode is shown to work reliably. It allows a single XenApp VM to use both engines rather than having to spread access to them over two separate XenApp VMs. This in turn provides more flexibility, saves operating system licensing costs and, ostensibly, could be extended to incorporate additional GPU engines.

INTRODUCTION

A XenApp virtual machine (VM) that supports two or more concurrent graphics processing units (GPUs) has a number of advantages over running separate VM instances, each with its own GPU engine. For one, if users happen to be unevenly distributed across particular XenApp instances, some XenApp VMs may idle while other instances are overloaded, to the detriment of users associated with busy instances. It is also simpler to add capacity to such a VM as opposed to building and licensing yet another Windows Server VM. This study made use of an NVIDIA GRID K2 (driver release 340.66), comprising two Kepler GK104 engines and 8 GB of GDDR5 RAM (4 GB per GPU). It is hosted in a base system that consists of a Dell R720 with dual Intel Xeon E5-2680 v2 CPUs (40 VCPUs in total, hyperthreaded) hosting XenServer 6.2 SP1, with XenApp 7.6 running as a VM with 16 VCPUs and 16 GB of memory on Windows 2012 R2 Datacenter.

PROCEDURE

It is important to note that these steps constitute changes that are not officially supported by Citrix or NVIDIA and are to be regarded as purely experimental at this stage.

Registry changes to XenApp were made according to these instructions provided in the Citrix Product Documentation.

On the XenServer, first list devices and look for GRID instances:
# lspci|grep -i nvid
44:00.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K2] (rev a1)
45:00.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K2] (rev a1)

Next, get the UUID of the VM:
# xe vm-list
uuid ( RO)           : 0c8a22cf-461f-0030-44df-2e56e9ac00a4
     name-label ( RW): TST-Win7-vmtst1
    power-state ( RO): running
uuid ( RO)           : 934c889e-ebe9-b85f-175c-9aab0628667c
     name-label ( RW): DEV-xapp
    power-state ( RO): running

Get the address of the existing GPU engine, if one is currently associated:
# xe vm-param-get param-name=other-config uuid=934c889e-ebe9-b85f-175c-9aab0628667c
vgpu_pci: 0/0000:44:00.0; pci: 0/0000:44:0.0; mac_seed: d229f84d-73cc-e5a5-d105-f5a3e87b82b7; install-methods: cdrom; base_template_name: Windows Server 2012 (64-bit)
(Note: ignore any vgpu_pci parameters that are irrelevant now to this process, but may be left over from earlier procedures and experiments.)

Dissociate the GPU via XenCenter or via the CLI by setting the GPU type to “none”.
Then, add both GPU engines following the recommendations in assigning multiple GPUs to a VM in XenServer using the other-config:pci parameter:
# xe vm-param-set uuid=934c889e-ebe9-b85f-175c-9aab0628667c \
   other-config:pci=0/0000:44:0.0,0/0000:45:0.0
In other words, do not use the vgpu_pci parameter at all.
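As a rough illustration of the step above, the following Python sketch pulls the PCI addresses of the GRID engines out of lspci output and assembles the comma-separated value for the other-config:pci parameter. The function name is hypothetical and the "0/0000:" prefix is simply copied from the values shown above. Note also that the sketch keeps the two-digit device number exactly as lspci prints it (44:00.0), whereas the command above abbreviates it (44:0.0); that both forms are accepted is an assumption.

```python
import re

# Matches the bus:device.function prefix of an lspci line, e.g. "44:00.0".
BDF_RE = re.compile(r"^([0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s")


def build_pci_value(lspci_output, keyword="nvidia", prefix="0/0000:"):
    """Return a comma-separated other-config:pci value for matching devices.

    The prefix is an assumption taken from the example values in the text.
    """
    addresses = []
    for line in lspci_output.splitlines():
        match = BDF_RE.match(line)
        if match and keyword in line.lower():
            addresses.append(prefix + match.group(1))
    return ",".join(addresses)


sample = """\
44:00.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K2] (rev a1)
45:00.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K2] (rev a1)
"""

print("other-config:pci=" + build_pci_value(sample))
```

The printed string could then be pasted into the xe vm-param-set command, after checking it by eye against the lspci listing.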

Check if the new parameters took hold:
# xe vm-param-get param-name=other-config uuid=934c889e-ebe9-b85f-175c-9aab0628667c params=all
vgpu_pci: 0/0000:44:00.0; pci: 0/0000:44:0.0,0/0000:45:0.0; mac_seed: d229f84d-73cc-e5a5-d105-f5a3e87b82b7; install-methods: cdrom; base_template_name: Windows Server 2012 (64-bit)
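If this check is to be repeated on several VMs, the other-config string can be split apart programmatically. The Python sketch below is only an illustration with hypothetical function names; it assumes the "key: value; key: value" layout seen in the output above and that no value contains a semicolon.

```python
def parse_other_config(text):
    """Split 'key: value; key: value' output from xe into a dictionary."""
    result = {}
    for item in text.split(";"):
        key, sep, value = item.strip().partition(": ")
        if sep:
            result[key] = value
    return result


def passthrough_devices(text):
    """Return the list of PCI devices named under the 'pci' key, if any."""
    pci = parse_other_config(text).get("pci", "")
    return [dev for dev in pci.split(",") if dev]


sample = (
    "vgpu_pci: 0/0000:44:00.0; pci: 0/0000:44:0.0,0/0000:45:0.0; "
    "mac_seed: d229f84d-73cc-e5a5-d105-f5a3e87b82b7; install-methods: cdrom; "
    "base_template_name: Windows Server 2012 (64-bit)"
)

devices = passthrough_devices(sample)
print(len(devices), "passthrough device(s):", devices)
```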
Next, turn GPU passthrough back on for the VM in XenCenter or via the CLI and start up the VM.

On the XenServer you should now see no GPUs available:
# nvidia-smi
Failed to initialize NVML: Unknown Error
This is good, as both K2 engines now have been allocated to the XenApp server.
On the XenServer you can also run the “xn -v pci-list” command with the UUID of the VM and should see the same two PCI devices allocated:
# xn -v pci-list 934c889e-ebe9-b85f-175c-9aab0628667c
id         pos bdf
0000:44:00.0 2   0000:44:00.0
0000:45:00.0 1   0000:45:00.0
More information can be gleaned from the “xn diagnostics” command.

Next, log onto the XenApp VM and check settings using nvidia-smi.exe. The output will resemble that of the image in Figure 1.

 

 GRID-Fig-1.jpg
Figure 1. Output From the nvidia-smi utility, showing the allocation of both K2 engines.


Note the output shows correctly that 4096 MiB of memory are allocated for each of the two engines in the K2, totaling its full capacity of 8192 MiB. XenCenter will still show only one GPU engine allocated (see Figure 2) since it is not aware that both are allocated to the XenApp VM and currently has no way of making that distinction.

 

GRID-Fig-2.jpg
Figure 2. XenCenter GPU allocation (showing just one engine – all XenServer is currently capable of displaying)

 

So, how can you tell if it is really using both GRID engines? If you run the nvidia-smi.exe program on the XenApp VM itself, you will see it has two GPUs configured in passthrough mode (see the earlier screenshot in Figure 1). Depending on how apps are launched, you will see one or the other or both of them active. As a test, we ran two concurrent Unigine “Heaven” benchmark instances. Both produced metrics within 1% of each other and of those obtained when just one instance was run, and both engines showed as being active. Displayed in Figure 3 is a sample screenshot of the Unigine “Heaven” benchmark running with one active instance; note that it sees both K2 engines present, even though the process is making use of just one.


GRID-Fig-3.jpg
Figure 3. A sample Unigine “Heaven” benchmark frame. Note the two sets of K2 engine metrics displayed in the upper right corner.


It is evident from the display in the upper right hand corner that one engine has allocated memory and is working, as evidenced by the correspondingly higher temperature reading and memory frequency. The result of a benchmark using OpenGL and a 1024x768 pixel resolution is seen in Figure 4. Note again the difference between what is shown for the two engines, in particular the memory and temperature parameters.

 GRID-Fig-4.jpg

Figure 4. Outcome of the benchmark. Note the higher memory and temperature on the second K2 engine.

 

When another instance is running concurrently, you see its memory and temperature also rise accordingly in addition to the load evident on the first engine, as well as activity on both engines in the output from the nvidia-smi.exe utility (Figure 5).


CaptureDualK2c.JPG
Figure 5. Two simultaneous benchmarks running, using both GRID K2 engines, and the nvidia-smi output.

You can also see with two instances running concurrently how the load is affected. Note in the performance graphs from XenCenter shown in Figure 6 how one copy of the “Heaven” benchmark impacts the server and then about halfway across the graphs, a second instance is launched.

 GRID-Fig-6.jpg

Figure 6. XenCenter performance metrics of first one, then a second concurrent Unigine “Heaven” benchmark.


CONCLUSIONS

The combination of two GRID K2 engines associated with a single, hefty XenApp VM works well for providing adequate capacity to support a number of concurrent users in GPU passthrough mode without the need to host additional XenApp instances. As there is a fair amount of leeway in the allocation of CPUs and memory to a virtualized instance under XenServer (up to 16 vCPUs and 128 GB of memory under XenServer 6.2 when these tests were run), one XenApp VM should be able to handle a reasonably large number of tasks. As many as six concurrent sessions of this high-demand benchmark with 800x600 high-resolution settings have been tested with the GPUs still not saturating. A more typical application, like Google Earth, consumes around 3 to 5% of the cycles of a GRID K2 engine per instance during active use, depending on the activity and size of the window, which is fairly minimal. In other words, twenty or more sessions could be handled by each engine, or potentially 40 or more for the entire GRID K2 with a single XenApp VM, provided of course that the XenApp VM’s memory and its own CPU resources are not overly taxed.
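The session estimate quoted above is simple arithmetic, which the following Python sketch reproduces. It is a back-of-the-envelope calculation only: it assumes GPU load scales linearly with the number of sessions and ignores the CPU and memory limits of the VM mentioned above. The function name is illustrative.

```python
def sessions_per_engine(percent_per_session):
    """Whole number of sessions that fit before one engine reaches 100% use."""
    return int(100 // percent_per_session)


ENGINES_PER_K2 = 2  # a GRID K2 carries two GK104 engines

for pct in (3, 5):
    per_engine = sessions_per_engine(pct)
    print(f"{pct}% per session: {per_engine} per engine, "
          f"{per_engine * ENGINES_PER_K2} per K2 card")
```

At the upper figure of 5% per session this gives the twenty sessions per engine (40 per card) cited above; at 3% the ceiling is correspondingly higher.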

XenServer 6.2 already supports as many as eight physical GPUs per host, so as servers expand, one could envision having even more available engines that could be associated with a particular VM. Under some circumstances, passthrough mode affords more flexibility and makes better use of resources compared to creating specific vGPU assignments. Windows Server 2012 R2 Datacenter supports up to 64 sockets and 4 TB of memory, and hence should be able to support a significantly larger number of associated GPUs. XenServer 6.2 SP1 has a processor limit of 16 VCPUs and 128 GB of virtual memory. XenServer 6.5, officially released in January 2015, supports up to four K2 GRID cards in some physical servers and up to 192 GB of RAM per VM for some guest operating systems, as does the newer release documented in the XenServer 6.5 SP1 User's Guide, so there is a lot of potential processing capacity available. Hence, a very large XenApp VM could be created that delivers a lot of raw power with substantial Microsoft server licensing savings. The performance meter shown above clearly indicates that VCPUs are the primary limiting factor in the XenApp configuration: with just two concurrent “Heaven” sessions running, about a fourth of the available CPU capacity is consumed, compared to less than 3 GB of RAM, which is only a small amount of memory above that allocated by the first session.

These same tests were run after upgrading to XenServer 6.5 and with newer versions of the NVIDIA GRID drivers and continue to work as before. At various times, this configuration was run for many weeks on end with no stability issues or errors detected during the entire time.

ACKNOWLEDGEMENTS

I would like to thank my co-worker at NAU, Timothy Cochran, for assistance with the configuration of the Windows VMs used in this study. I am also indebted to Rachel Berry, Product Manager of HDX Graphics at Citrix and her team, as well as Thomas Poppelgaard and also Jason Southern of the NVIDIA Corporation for a number of stimulating discussions. Finally, I would like to greatly thank Will Wade of NVIDIA for making available the GRID K2 used in this study.


XenServer 6.5 and Asymmetric Logical Unit Access (ALUA) for iSCSI Devices

INTRODUCTION

There are a number of ways to connect storage devices to XenServer hosts and pools, including local storage, HBA SAS and fiber channel, NFS and iSCSI. With iSCSI, there are a number of implementation variations including support for multipathing with both active/active and active/passive configurations, plus the ability to support so-called “jumbo frames” where the MTU is increased from 1500 to typically 9000 to optimize frame transmissions. One of the lesser-known and somewhat esoteric iSCSI options available on many modern iSCSI-based storage devices is Asymmetric Logical Unit Access (ALUA), a protocol that has been around for a decade and that is all the more intriguing because it can be used not only with iSCSI, but also with fiber channel storage. This article attempts to clarify and outline how ALUA can now be used more flexibly with iSCSI on XenServer 6.5.

HISTORY

ALUA support on XenServer goes way back to XenServer 5.6 and initially only with fiber channel devices. The support of iSCSI ALUA connectivity started on XenServer 6.0 and was initially limited to specific ALUA-capable devices, which included the EMC Clariion, NetApp FAS as well as the EMC VMAX and VNX series. Each device required specific multipath.conf file configurations to properly integrate with the server used to access them, XenServer being no exception. The upstream XenServer code also required customizations. The "How to Configure ALUA Multipathing on XenServer 6.x for Enterprise Arrays" article CTX132976 (March 2014, revised March 2015) currently only discusses ALUA support through XenServer 6.2 and only for specific devices, stating: “Most significant is the usability enhancement for ALUA; for EMC™ VNX™ and NetApp™ FAS™, XenServer will automatically configure for ALUA if an ALUA-capable LUN is attached”.

It was announced in the XenServer 6.5 Release Notes that XenServer will automatically connect to one of these aforementioned documented devices and that it now runs the updated device mapper multipath (DMMP) version 0.4.9-72. This rekindled my interest in ALUA connectivity and after some research and discussions with Citrix and Dell about support, it appeared this might now be possible specifically for the Dell MD3600i units we have used on XenServer pools for some time now. What is not stated in the release notes is that XenServer 6.5 now has the ability to connect generically to a large number of ALUA-capable storage arrays. This will be discussed in detail later. It is also of note that MPP-RDAC support is no longer available in XenServer 6.5 and DMMP is the exclusive multipath mechanism supported. This was in part because of support and vendor-specific issues (see, for example, the XenServer 6.5 Release Notes or this document from Dell, Inc.).

But first, how are ALUA connections even established? And perhaps of greater interest, what are the benefits of ALUA in the first place?

ALUA DEFINITIONS AND SETTINGS

As the name suggests, ALUA is intended to optimize storage traffic by making use of optimized paths. With multipathing and multiple controllers, there are a number of paths a packet can take to reach its destination. With two controllers on a storage array and two NICs dedicated to iSCSI traffic on a host, there are four possible paths to a storage Logical Unit Number (LUN). On the XenServer side, LUNs then are associated with storage repositories (SRs). ALUA recognizes that, once an initial path is established to a LUN, any multipathing activity destined for that same LUN is better served if routed through the same storage array controller. It attempts to do so as much as possible, unless of course a failure forces the connection to take an alternative path. ALUA connections fall into five self-explanatory categories (listed along with their associated hex codes):

  • Active/Optimized : x0
  • Active/Non-Optimized : x1
  • Standby : x2
  • Unavailable : x3
  • Transitioning : xf
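For scripting purposes, the five categories and their codes listed above can be captured in a small lookup. The Python sketch below is merely an illustration; the class and function names are my own, and any code outside the five listed is simply reported as unknown.

```python
from enum import IntEnum


class AluaState(IntEnum):
    """Asymmetric access states and their hex codes as listed above."""
    ACTIVE_OPTIMIZED = 0x0
    ACTIVE_NON_OPTIMIZED = 0x1
    STANDBY = 0x2
    UNAVAILABLE = 0x3
    TRANSITIONING = 0xF


def describe_state(code):
    """Translate an access state code (low four bits) into a readable name."""
    value = code & 0x0F
    try:
        return AluaState(value).name.replace("_", " ").lower()
    except ValueError:
        return "reserved or unknown (0x%x)" % value


for code in (0x0, 0x1, 0x2, 0x3, 0xF):
    print(hex(code), describe_state(code))
```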

For ALUA to work, it is understood that an active/active storage path is required and furthermore that an asymmetrical active/active mechanism is involved. The advantage of ALUA comes from less fragmentation of packet traffic by routing, if at all possible, both paths of the multipath connection via the same storage array controller, as the extra path through a different controller is less efficient. It is very difficult to locate specific metrics on the overall gains, but hints of up to 20% can be found in on-line articles (e.g., this openBench Labs report on Nexsan), hence this is not an insignificant amount and potentially more significant than gains reached by implementing jumbo frames. It should be noted that the debate continues to this day regarding the benefits of jumbo frames and to what degree, if any, they are beneficial. Among numerous articles to be found are: The Great Jumbo Frames Debate from Michael Webster, Jumbo Frames or Not - Purdue University Research, Jumbo Frames Comparison Testing, and MTU Issues from ESNet. Each installation environment will have its idiosyncrasies and it is best to conduct tests within one's unique configuration to evaluate such options.

The SCSI Primary Commands standard (SPC-3) defines the commands used to determine paths. The mechanism by which this is accomplished is target port group support (TPGS). The characteristics of a path can be read via an RTPG (REPORT TARGET PORT GROUPS) command or set with an STPG (SET TARGET PORT GROUPS) command. With ALUA, non-preferred controller paths are used only for fail-over purposes. This is illustrated in Figure 1, where an optimized network connection is shown in red, taking advantage of routing all the storage network traffic via Node A (e.g., storage controller module 0) to LUN A (e.g., 2).

 

b2ap3_thumbnail_ALUAfig1.jpg

Figure 1.  ALUA connections, with the active/optimized paths to Node A shown as red lines and the active/non-optimized paths shown as dotted black lines.

 

Various SPC commands are provided as utilities within the sg3_utils (SCSI generic) Linux package.

There are other ways to make such queries. For example, VMware has an “esxcli nmp device list” command and NetApp appliances support “igroup” commands that will provide direct information about ALUA-related connections.

Let us first examine a generic Linux server containing ALUA support connected to an ALUA-capable device. In general, note that this will entail a specific configuration to the /etc/multipath.conf file and typical entries, especially for some older arrays or XenServer versions, will use one or more explicit configuration parameters such as:

  • hardware_handler "1 alua"
  • prio "alua"
  • path_checker "alua"

Consulting the Citrix knowledge base article CTX132976, we see for example the EMC Corporation DGC Clariion device makes use of an entry configured as:

        device{
                vendor "DGC"
                product "*"
                path_grouping_policy group_by_prio
                getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
                prio_callout "/sbin/mpath_prio_emc /dev/%n"
                hardware_handler "1 alua"
                no_path_retry 300
                path_checker emc_clariion
                failback immediate
        }

To investigate the multipath configuration in more detail, we can make use of the TPGS setting. The TPGS setting can be read using the sg_rtpg command. By using multiple “v” flags to increase verbosity and “d” to specify the decoding of the status code descriptor returned for the asymmetric access state, we might see something like the following for one of the paths:

# sg_rtpg -vvd /dev/sde
open /dev/sde with flags=0x802
    report target port groups cdb: a3 0a 00 00 00 00 00 00 04 00 00 00
    report target port group: requested 1024 bytes but got 116 bytes
Report list length = 116
Report target port groups:
  target port group id : 0x1 , Pref=0
    target port group asymmetric access state : 0x01 (active/non optimized)
    T_SUP : 0, O_SUP : 0, U_SUP : 1, S_SUP : 0, AN_SUP : 1, AO_SUP : 1
    status code : 0x01 (target port asym. state changed by SET TARGET PORT GROUPS command)
    vendor unique status : 0x00
    target port count : 02
    Relative target port ids:
      0x01
      0x02
(--snip--)

In the output above, we see specifically that target port group ID 1 is an active/non-optimized ALUA path, as reported on the “target port group asymmetric access state” line that follows the “target port group id” line; the “status code” line additionally records that this state was set by a SET TARGET PORT GROUPS command. We also see there are two paths identified, with target port IDs 1,1 and 1,2.
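When many paths need to be checked, the relevant fields can be pulled out of the sg_rtpg text rather than read by eye. The Python sketch below is an illustration only: it assumes the output layout shown above (which may differ between sg3_utils versions), and the function and key names are hypothetical.

```python
import re

GROUP_RE = re.compile(
    r"target port group id\s*:\s*(0x[0-9a-fA-F]+)\s*,\s*Pref=(\d)")
STATE_RE = re.compile(
    r"asymmetric access state\s*:\s*(0x[0-9a-fA-F]+)\s*\((.*?)\)")


def parse_rtpg(text):
    """Collect the group id, Pref bit and access state of each port group."""
    groups = []
    for line in text.splitlines():
        match = GROUP_RE.search(line)
        if match:
            groups.append({"id": int(match.group(1), 16),
                           "pref": int(match.group(2)),
                           "state": None, "state_text": None})
            continue
        match = STATE_RE.search(line)
        if match and groups:
            # The state line belongs to the most recently seen group.
            groups[-1]["state"] = int(match.group(1), 16)
            groups[-1]["state_text"] = match.group(2)
    return groups


sample = """\
Report target port groups:
  target port group id : 0x1 , Pref=0
    target port group asymmetric access state : 0x01 (active/non optimized)
    T_SUP : 0, O_SUP : 0, U_SUP : 1, S_SUP : 0, AN_SUP : 1, AO_SUP : 1
"""

for group in parse_rtpg(sample):
    print(group)
```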

There are a slew of additional “sg” commands, such as the sg_inq command, often used with the flag “-p 0x83” to get the VPD (vital product data) page of interest, sg_rdac, etc. The sg_inq command will in general return, in fact, TPGS > 0 for devices that support ALUA. More on that will be discussed later on in this article. One additional command of particular interest, because not all storage arrays in fact support target port group queries (more also on this important point later!), is sg_vpd (sg vital product data fetcher), as it does not require TPG access. The base syntax of interest here is:

sg_vpd -p 0xc9 --hex /dev/…

where “/dev/…” should be the full path to the device in question. Looking at an example of the output of a real such device, we get:

# sg_vpd -p 0xc9 --hex /dev/mapper/mpathb1
Volume access control (RDAC) VPD Page:
00     00 c9 00 2c 76 61 63 31  f1 01 00 01 01 01 00 00    ...,vac1........
10     00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00    ................
20     00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00    ................

If one reads the source code for various device handlers (see the multipath tools hardware table for an extensive list of hardware profiles as well as the Linux SCSI device handler regarding how the data are interpreted through the device handler), one can determine that the value of interest here is that of avte_cvp (part of the RDAC c9_inquiry structure). In that structure, avte_cvp follows the four header bytes and the four-byte page identifier (“vac1” in the ASCII column above), which makes it the ninth hex value of the page. It indicates whether the connected device is using ALUA, known in the RDAC world as IOSHIP mode (if shifted right five bits together with a logical AND with 0x1), AVT, or Automatic Volume Transfer mode (if shifted right seven bits together with a logical AND with 0x1), or otherwise defaults in general to basic RDAC (legacy) mode. In the case above we see “f1” returned, so (0xf1 >> 5 & 0x1) is equal to 1, and hence the above connection is indeed an ALUA RDAC-based connection.
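The bit tests just described are easy to get wrong by hand, so here is a minimal Python sketch of them. It takes the avte_cvp byte as its input and applies the checks in the order given above (IOSHIP first, then AVT, otherwise legacy RDAC). The function name and the sample byte values are illustrative only, not taken from any tool.

```python
def rdac_mode(avte_cvp):
    """Classify an RDAC connection from the avte_cvp byte of VPD page 0xc9."""
    if (avte_cvp >> 5) & 0x1:
        return "IOSHIP (ALUA)"
    if (avte_cvp >> 7) & 0x1:
        return "AVT"
    return "RDAC (legacy)"


# Illustrative byte values; only the bit pattern matters here.
for value in (0x61, 0xF1, 0x80, 0x01):
    print("0x%02x -> %s" % (value, rdac_mode(value)))
```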

I will revisit sg commands once again later on. Do note that the sg3_utils package is not installed on stock XenServer distributions and, as with any external package, installing it may void any official Citrix support.

MULTIPATH CONFIGURATIONS AND REPORTS

In addition to all the information that various sg commands provide, there is also an abundance of information available from the standard multipath command. We saw a sample multipath.conf file earlier, and at least with many standard Linux OS versions and ALUA-capable arrays, information on the multipath status can be more readily obtained using stock multipath commands.

For example, on an ALUA-enabled connection we might see output similar to the following from a “multipath -ll” command (there will be a number of variations in output, depending on the version, verbosity and implementation of the multipath utility):

mpath2 (3600601602df02d00abe0159e5c21e111) dm-4 DGC,VRAID
[size=100G][features=1 queue_if_no_path][hwhandler=1 alua][rw]
\_ round-robin 0 [prio=50][active]
 \_ 1:0:3:20  sds   70:724   [active][ready]
 \_ 0:0:1:20  sdk   67:262   [active][ready]
\_ round-robin 0 [prio=10][enabled]
 \_ 0:0:2:20  sde   8:592    [active][ready]
 \_ 1:0:2:20  sdx   128:592  [active][ready]

Recalling the device sde from the section above, note that it falls under a path group with a lower priority of 10, indicating it is part of an active, non-optimized network connection vs. 50, which indicates being in an active, optimized group; a priority of “1” would indicate the device is in the standby group. Depending on what mechanism is used to generate the priority values, be aware that these priority values will vary considerably; the most important point is that whatever path has a higher “prio” value will be the optimized path. In some newer versions of the multipath utility, the string “hwhandler=1 alua” shows clearly that the controller is configured to allow the hardware handler to help establish the multipathing policy as well as that ALUA is established for this device. I have read that the path priority will typically be elevated to a value of between 50 and 80 for optimized ALUA-based connections (cf. mpath_prio_alua in this SUSE article), but have not seen this consistently.
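Since the rule of thumb is simply that the group with the highest “prio” value is the optimized one, this comparison lends itself to a short script. The Python sketch below is only an illustration and assumes the older bracketed output format shown above; newer multipath versions print a different layout and would need different patterns. All names are hypothetical, and the patterns do not depend on the tree-drawing characters at the start of each line.

```python
import re

GROUP_RE = re.compile(r"\[prio=(\d+)\]\[(\w+)\]")
PATH_RE = re.compile(r"(\d+:\d+:\d+:\d+)\s+(sd[a-z]+)\s")


def parse_path_groups(text):
    """Return the path groups with their priority, status and member devices."""
    groups = []
    for line in text.splitlines():
        group = GROUP_RE.search(line)
        if group:
            groups.append({"prio": int(group.group(1)),
                           "status": group.group(2),
                           "devices": []})
            continue
        path = PATH_RE.search(line)
        if path and groups:
            groups[-1]["devices"].append(path.group(2))
    return groups


def optimized_group(groups):
    """Pick the group with the highest priority, or None if there are none."""
    return max(groups, key=lambda g: g["prio"], default=None)


sample = """\
mpath2 (3600601602df02d00abe0159e5c21e111) dm-4 DGC,VRAID
[size=100G][features=1 queue_if_no_path][hwhandler=1 alua][rw]
_ round-robin 0 [prio=50][active]
 _ 1:0:3:20  sds   70:724   [active][ready]
 _ 0:0:1:20  sdk   67:262   [active][ready]
_ round-robin 0 [prio=10][enabled]
 _ 0:0:2:20  sde   8:592    [active][ready]
 _ 1:0:2:20  sdx   128:592  [active][ready]
"""

groups = parse_path_groups(sample)
best = optimized_group(groups)
print("optimized group:", best["devices"], "prio", best["prio"])
```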

The multipath.conf file itself has traditionally needed tailoring to each specific device. It is particularly convenient, however, that using a generic configuration is now possible for a device that makes use of the internal hardware handler, is rdac-based and can auto-negotiate an ALUA connection. The vendor and product entries below identify the specific device itself, but others should now work using this generic sort of connection:

device {
                vendor                  "DELL"
                product                 "MD36xx(i|f)"
                features                "2 pg_init_retries 50"
                hardware_handler        "1 rdac"
                path_selector           "round-robin 0"
                path_grouping_policy    group_by_prio
                failback                immediate
                rr_min_io               100
                path_checker            rdac
                prio                    rdac
                no_path_retry           30
                detect_prio             yes
                retain_attached_hw_handler yes
        }

Note how this differs (the additional entries above are detect_prio and retain_attached_hw_handler) from the “stock” version (in XenServer 6.5) of the MD36xx multipath configuration:

device {
                vendor                  "DELL"
                product                 "MD36xx(i|f)"
                features                "2 pg_init_retries 50"
                hardware_handler        "1 rdac"
                path_selector           "round-robin 0"
                path_grouping_policy    group_by_prio
                failback                immediate
                rr_min_io               100
                path_checker            rdac
                prio                    rdac
                no_path_retry           30
        }
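The difference between the two stanzas can also be found mechanically, which is handy when comparing longer multipath.conf entries. The following Python sketch is a deliberately simplified parser with hypothetical names; it assumes one "key value" pair per line, as in the two stanzas above, and does not handle the full multipath.conf syntax.

```python
def parse_device_stanza(text):
    """Turn the 'key value' lines of one device block into a dictionary."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and the opening/closing braces.
        if not line or line.startswith("#") or line.endswith("{") or line == "}":
            continue
        parts = line.split(None, 1)
        settings[parts[0]] = parts[1].strip('"') if len(parts) > 1 else ""
    return settings


STOCK = """
device {
    vendor                  "DELL"
    product                 "MD36xx(i|f)"
    features                "2 pg_init_retries 50"
    hardware_handler        "1 rdac"
    path_selector           "round-robin 0"
    path_grouping_policy    group_by_prio
    failback                immediate
    rr_min_io               100
    path_checker            rdac
    prio                    rdac
    no_path_retry           30
}
"""

GENERIC = STOCK.replace("}", """\
    detect_prio             yes
    retain_attached_hw_handler yes
}""")

stock = parse_device_stanza(STOCK)
generic = parse_device_stanza(GENERIC)
added = sorted(set(generic) - set(stock))
print("entries only in the generic stanza:", added)
```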

THE CURIOUS CASE OF DELL MD32XX/36XX ARRAY CONTROLLERS

The LSI controllers incorporated into Dell’s MD32xx and MD36xx series of iSCSI storage arrays represent an unusual and interesting case. As promised earlier, we will get back to looking at the sg_inq command, which queries a storage device for several pieces of information, including TPGS. Typically, an array that supports ALUA will return a value of TPGS > 0, for example:

# sg_inq /dev/sda
standard INQUIRY:
PQual=0 Device_type=0 RMB=0 version=0x04 [SPC-2]
[AERC=0] [TrmTsk=0] NormACA=1 HiSUP=1 Resp_data_format=2
SCCS=0 ACC=0 TPGS=1 3PC=1 Protect=0 BQue=0
EncServ=0 MultiP=1 (VS=0) [MChngr=0] [ACKREQQ=0] Addr16=0
[RelAdr=0] WBus16=0 Sync=0 Linked=0 [TranDis=0] CmdQue=1
[SPI: Clocking=0x0 QAS=0 IUS=0]
length=117 (0x75) Peripheral device type: disk
Vendor identification: NETAPP
Product identification: LUN
Product revision level: 811a

We see in the case above that TPGS is reported to have a value of 1. The MD36xx has supported ALUA since RAID controller firmware 07.84.00.64 and NVSRAM N26X0-784890-904. However, even with that (or a newer) revision level, an sg_inq returns the following for this particular storage array:

# sg_inq /dev/mapper/36782bcb0002c039d00005f7851dd65de
standard INQUIRY:
  PQual=0  Device_type=0  RMB=0  version=0x05  [SPC-3]
  [AERC=0]  [TrmTsk=0]  NormACA=1  HiSUP=1  Resp_data_format=2
  SCCS=0  ACC=0  TPGS=0  3PC=1  Protect=0  BQue=0
  EncServ=1  MultiP=1 (VS=0)  [MChngr=0]  [ACKREQQ=0]  Addr16=0
  [RelAdr=0]  WBus16=1  Sync=1  Linked=0  [TranDis=0]  CmdQue=1
  [SPI: Clocking=0x0  QAS=0  IUS=0]
    length=74 (0x4a)   Peripheral device type: disk
 Vendor identification: DELL
 Product identification: MD36xxi
 Product revision level: 0784
 Unit serial number: 142002I

Various attempts to modify the multipath.conf file to try to force TPGS to appear with any value greater than zero all failed. Above all, it seemed that without TPGS being reported, there was no way to query the device for ALUA-related information. Furthermore, the command mpath_prio_alua and similar commands appear to have been deprecated in newer versions of the device-mapper-multipath package, and so offer no help.
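To compare what different arrays report, the TPGS field can be extracted from the sg_inq text with a few lines of Python. This is a sketch with hypothetical names; the meanings attached to the values 0 through 3 follow the SPC-3 definition of the TPGS field (none, implicit, explicit, or both).

```python
import re

TPGS_MEANING = {
    0: "asymmetric access not reported",
    1: "implicit ALUA only",
    2: "explicit ALUA only",
    3: "implicit and explicit ALUA",
}


def tpgs_value(sg_inq_output):
    """Return the TPGS field from sg_inq output, or None if it is absent."""
    match = re.search(r"\bTPGS=(\d+)", sg_inq_output)
    return int(match.group(1)) if match else None


netapp_line = "SCCS=0 ACC=0 TPGS=1 3PC=1 Protect=0 BQue=0"
dell_line = "  SCCS=0  ACC=0  TPGS=0  3PC=1  Protect=0  BQue=0"

for label, line in (("NETAPP", netapp_line), ("DELL MD36xxi", dell_line)):
    value = tpgs_value(line)
    print(label, "TPGS =", value, "->", TPGS_MEANING.get(value, "unknown"))
```

As the two sample lines show, a result of zero does not by itself rule out ALUA for arrays like the MD36xx, which is exactly the pitfall described here.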

This proved to be a major roadblock in making any progress. Ultimately it turned out that the key to looking for ALUA connectivity in this particular case comes oddly from ignoring what TPGS reports, and rather focusing on what the MD36xx controller is doing. What is going on here is that the hardware handler is taking over control and the clue comes from the sg_vpd output shown above. To see how a LUN is mapped for these particular devices, one needs to hunt back through the /var/log/messages file for entries that appear when the LUN was first attached. To investigate this for the MD36xx array, we know it uses the internal “rdac” connection mechanism for the hardware handler, so a Linux grep command for “rdac” in the /var/log/messages file around the time the connection was established to a LUN should reveal how it was established.

Sure enough, in a case where the connection is known not to be making use of ALUA, one might see entries such as these:

[   98.790309] rdac: device handler registered
[   98.796762] sd 4:0:0:0: rdac: AVT mode detected
[   98.796981] sd 4:0:0:0: rdac: LUN 0 (owned (AVT mode))
[   98.797672] sd 5:0:0:0: rdac: AVT mode detected
[   98.797883] sd 5:0:0:0: rdac: LUN 0 (owned (AVT mode))
[   98.798590] sd 6:0:0:0: rdac: AVT mode detected
[   98.798811] sd 6:0:0:0: rdac: LUN 0 (owned (AVT mode))
[   98.799475] sd 7:0:0:0: rdac: AVT mode detected
[   98.799691] sd 7:0:0:0: rdac: LUN 0 (owned (AVT mode))

In contrast, an ALUA-based connection looks different. The LUNs shown below are on an MD3600i with firmware new enough to support ALUA, attached to a client that also supports ALUA and has a properly configured entry in the /etc/multipath.conf file. Here the log instead shows the IOSHIP connection mechanism (see p. 124 of this IBM System Storage manual for more on I/O Shipping):

Mar 11 09:45:45 xs65test kernel: [   70.823257] scsi 8:0:0:1: rdac: LUN 1 (IOSHIP) (owned)
Mar 11 09:45:46 xs65test kernel: [   71.385835] scsi 9:0:0:0: rdac: LUN 0 (IOSHIP) (unowned)
Mar 11 09:45:46 xs65test kernel: [   71.389345] scsi 9:0:0:1: rdac: LUN 1 (IOSHIP) (owned)
Mar 11 09:45:46 xs65test kernel: [   71.957649] scsi 10:0:0:0: rdac: LUN 0 (IOSHIP) (owned)
Mar 11 09:45:46 xs65test kernel: [   71.961788] scsi 10:0:0:1: rdac: LUN 1 (IOSHIP) (unowned)
Mar 11 09:45:47 xs65test kernel: [   72.531325] scsi 11:0:0:0: rdac: LUN 0 (IOSHIP) (owned)

Hence, we happily recognize that indeed, ALUA is working.
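The log check described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a definitive tool: the function name classify_rdac_lines is hypothetical, and the regular expression is modeled solely on the rdac log lines quoted in this article, so other kernel versions may format these messages differently. Following the article, AVT mode is taken to indicate a non-ALUA connection and IOSHIP an ALUA one.

```python
import re

# Matches rdac hardware-handler LUN lines such as:
#   sd 4:0:0:0: rdac: LUN 0 (owned (AVT mode))
#   scsi 8:0:0:1: rdac: LUN 1 (IOSHIP) (owned)
RDAC_LUN = re.compile(
    r"\b(?:sd|scsi) (?P<hctl>\d+:\d+:\d+:\d+): rdac: "
    r"LUN (?P<lun>\d+) \((?P<rest>.*)\)\s*$"
)

def classify_rdac_lines(lines):
    """Yield (hctl, lun, mode, ownership) for each rdac LUN line."""
    for line in lines:
        m = RDAC_LUN.search(line)
        if not m:
            continue  # e.g. "rdac: AVT mode detected" carries no LUN info
        rest = m.group("rest")
        if "IOSHIP" in rest:
            mode = "IOSHIP"   # ALUA connection, per the article
        elif "AVT" in rest:
            mode = "AVT"      # non-ALUA connection, per the article
        else:
            mode = "unknown"
        if "unowned" in rest:
            ownership = "unowned"
        elif "owned" in rest:
            ownership = "owned"
        else:
            ownership = "unknown"
        yield m.group("hctl"), int(m.group("lun")), mode, ownership

# Sample lines copied from the log excerpts above.
sample = [
    "[   98.796762] sd 4:0:0:0: rdac: AVT mode detected",
    "[   98.796981] sd 4:0:0:0: rdac: LUN 0 (owned (AVT mode))",
    "Mar 11 09:45:45 xs65test kernel: [   70.823257] scsi 8:0:0:1: rdac: LUN 1 (IOSHIP) (owned)",
    "Mar 11 09:45:46 xs65test kernel: [   71.385835] scsi 9:0:0:0: rdac: LUN 0 (IOSHIP) (unowned)",
]
results = list(classify_rdac_lines(sample))
for hctl, lun, mode, ownership in results:
    print(hctl, lun, mode, ownership)
```

To run this against a real host, the open /var/log/messages file object can be passed in place of the sample list, since the function only needs an iterable of lines.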

The even better news is that not only is ALUA now functional in XenServer 6.5, but it should in fact work with a large number of ALUA-capable storage arrays, both those with custom configuration needs and potentially many that may work generically. Another surprising find was that for the MD3600i arrays tested, even the “stock” version of the MD36xxi multipath configuration entry provided with XenServer 6.5 creates ALUA connections. The reason is that the hardware handler is used consistently, provided no specific profile overrides are in place, and so the storage device itself is primarily doing the negotiation instead of being driven by the file-based configuration. This is what made determining ALUA connectivity more difficult: the TPGS setting never changed from zero and consequently could not be used to query for the group settings.

CONCLUSIONS

First off, it is really nice to know that many modern storage devices support ALUA and that XenServer 6.5 now provides an easier means to leverage this protocol. It is also a lesson that documentation can be hard to find and, in some cases, needs updating to reflect the current state. Individual vendors will generally provide specific instructions regarding iSCSI connectivity, and these should of course be followed. Experimentation is best carried out on non-production servers, where a major faux pas will not have catastrophic consequences.

To me, this was also a lesson in persistence as well as an opportunity to share the curiosity and knowledge among a number of individuals who were helpful throughout this process. Above all, among many who deserve thanks, I would like to thank in particular Justin Bovee from Dell and Robert Breker of Citrix for numerous valuable conversations and information exchanges.

