
Administering VMware™

Site Recovery Manager™ 4.0

By Mike Laverick

© Mike Laverick Ltd

With Contributions and Assistance from

Adam Carter (Lefthand Networks/HP) Chad Sakac (EMC) Alex Tanner (EMC)

Vaughn Stewart (NetApp) Luke Reed (NetApp)

Lee Dilworth (VMware) Cormac Hogan (VMware)

Jeff Drury Tim Oudin Luc Dikens Al Renouf

Dave Medvitz

Report Errors: [email protected] Follow on Twitter: http://twitter.com/Mike_Laverick

Mike Laverick Podcast: http://www.rtfm-ed.co.uk/podcasts/podcast.xml


Administering VMware’s Site Recovery Manager 4.0 Copyright © 2010 by Mike Laverick Ltd All rights reserved. No part of this book shall be reproduced, stored in a retrieval system, or transmitted by any means, electronic, mechanical, or otherwise, without written permission from Mike Laverick Ltd. No patent liability is assumed with respect to the use of the information contained herein. Although every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions. Neither is any liability assumed for damages resulting from the use of the information contained herein.


Table of Contents

Chapter 1: Introduction
  Charitable Donation
  Acknowledgement
  About this Book
  About You, the Reader
  About Hyperlinks
  Disclaimer
  What's New in SRM 4.0
  A Brief History of Life - before VMware SRM
  What is not a DR Technology?
  What is VMware SRM?
  What about File Level Consistency?
  Principles of Storage Management and Replication
  Storage Vendor Guides
  Summary

Chapter 2: Getting started with EMC Celerra Replicator
  Creating an EMC Celerra iSCSI Target
  Granting Access to ESX hosts to EMC Celerra iSCSI Target
  Creating a New File System
  Creating an iSCSI LUN
  Configuring Celerra Replication
  Conclusion

Chapter 3: Getting started with EMC Clariion MirrorView/S
  Creating EMC LUN
  Configure EMC MirrorView
  Creating a Snapshot for SRM Tests
  Creating Consistency Groups (Recommended)
  Granting Access to ESX hosts to Clariion LUNs
  Conclusion

Chapter 4: Getting started with HP LeftHand Scheduled Remote Copy
  Some Frequently Asked Questions about HP LeftHand VSA
  Download and Upload the VSA
  Importing the HP LeftHand VSA
  Modifying the VSA's Settings
  Licensed by Virtual MAC Address
  Primary Configuration of VSA Host
  Install Management Client
  Configure the VSA (Management Groups, Clusters & Volumes)
  Licensing the VSA
  Configuring VSA for Replication
  Monitoring your replication/snapshot
  Adding ESX hosts & Allocating Volumes to Them
  Configuring the ESX Software iSCSI
  HP LeftHand - Create a Test Volume at the Recovery Site
  Shutting Down the VSA
  Conclusion

Chapter 5: Getting started with NetApp and SnapMirror
  Provisioning NetApp storage for VMware ESX
  Creating NetApp Volumes for NFS
  Modify the Export Properties
  Granting Access to ESX hosts to NetApp Volumes
  Creating NetApp Volumes for Fibre Channel and iSCSI
  Gaining Access to NetApp LUNs to ESX
  Configure NetApp SnapMirror
  Introducing NetApp Rapid Clone Utility 3.0 and Virtual Storage Console
  Using the NetApp Virtual Storage Console


  Conclusion

Chapter 6: Installing VMware SRM
  Architecture of VMware SRM
  VMware SRM Product Limitations and Gotchas
  Licensing VMware SRM
  Setting up the VMware SRM Database with Microsoft SQL 2005
  Installing VMware SRM Server
  Installing the vSphere Client SRM Plug-in
  Failure to Connect to the SRM Server
  Conclusion

Chapter 7: Protection Site Configuration
  Pairing the Protected and Recovery Site SRM together
  Configuring Array Managers – An Introduction
  Configure Inventory Mappings
  Creating Protection Groups
  Failures to Protect a Virtual Machine
  Conclusion

Chapter 8: Recovery Site Configuration
  Creating A Basic Full Site Recovery Plan
  Testing Storage Configuration at the Recovery Site
  Overview: First Recovery Plan Test
  Practise: First Recovery Plan Test
  Controlling & Troubleshooting Recovery Plans
  EMC Celerra and Testing Plans
  EMC Clariion and Testing Plans
  NetApp and Testing Plans
  HP LeftHand and Testing Plans
  Conclusion

Chapter 9: Custom Recovery Plans
  Configuring Shutdown of Protected Virtual Machines at the Protected Site
  Configuring Priority/Order for Recovery Virtual Machines
  Parallel Host Start-Up Order and Normal/Low
  Adding Message Steps
  Adding Command Steps
  Adding Command Steps with PowerCLI
  Adding Command Steps to Call Scripts with the VM
  Configure IP Address changes for Recovery Virtual Machines
  Configure Bulk IP Address changes for Recovery Virtual Machines (DR-IP-Exporter)
  Customized VM Mappings
  Managing Changes at the Protection Site
  Managing Changes at the Recovery Site
  Other Changes in the vSphere and SRM Environment
  Creating New VMs on New Networks and on New Storage
  Storage VMotion and Protection Groups
  Virtual Machines Stored on Multiple VMFS Datastores
  Virtual Machines with Raw Device/Disk Mappings
  Multiple Protection Groups and Multiple Recovery Plans
  The Repair Array Manager's Button
  Conclusion

Chapter 10: Alarms, Exporting History and Access Control
  vCenter "Linked Mode" and Site Recovery Manager
  Alarms Overview
  Exporting & History
  Access Control
  Testing your Permissions
  Some Permission Limitations – Test & Run Plans
  VMware SRM Log Files


  Conclusions

Chapter 11: Bi-Directional and Multi-Site Configurations
  Configuring the Array Manager
  Configuring the Inventory Mappings
  Creating the Protection Group
  Creating the Recovery Plan
  Shared Site Configurations
  Decommissioning a Site
  Conclusions

Chapter 12: Failover and Failback
  Considerations before Failover and Failback
  Planned Failover – Protected Site is available
  EMC Celerra and Running Plans
  EMC Clariion and Running Plans
  HP LeftHand and Running Plans
  NetApp and Running Plans
  Planned Failback – Protected Site is available
  Clean-Up of the Planned Failback
  Unplanned Failover - Protected Site is DEAD
  Planned Failback – Protected Site is BACK
  Conclusions

Chapter 13: Scripted Site Recovery
  A Very Special Acknowledgement
  Part 1: Introduction - Automating VMware SRM
  Part Two: Introduction – Recovery with Site Recovery Manager
  Conclusions

The End – Final Conclusions


Chapter 1: Introduction


Charitable Donation

This book took six months to write, and contains an additional 100 pages of text and images compared with the first edition. However, the digital PDF version of this book is available for free – and the print version is delivered at cost via Lulu. I do not intend to profit personally from this book. I would, however, strongly urge you to donate money to my chosen charity. That charity is UNICEF, and this is what they do:

“UNICEF works with families, communities and governments in more than 190 countries worldwide to protect and promote the rights of all children. We are guided throughout our work by the UN Convention on the Rights of the Child, which guarantees every child the same rights: to an education, to a childhood, to be as healthy as possible, to be treated fairly and to be heard. UNICEF works in all these areas, and does so in a joined up way to achieve the best possible outcomes for children.”

So before you begin to read this book, please pause to think of the millions of children you can help by making a relatively small donation. The recommended donation is $10 (US Dollars) or the equivalent in your currency. http://www.supportunicef.org/forms/whichcountry2.html

Acknowledgement

Before I begin this book I would like to thank the many people who helped me along the way. Firstly, I would like to thank Carmel Edwards, my partner. She puts up with me ranting and raving about VMware and virtualization generally. Carmel is the first to read my works and does the first proof-read of the document. Secondly, I would like to thank Adam Carter of LeftHand Networks, Chad Sakac of EMC and Vaughn Stewart of NetApp. All three were invaluable in allowing me to bounce ideas around, and to ask newbie-like questions – not just with reference to their technologies, but storage issues generally. If I sound like some kind of storage guru in this book, I have these guys to thank for that. Actually, I'm not a guru at all, even in VMware products. I can't even stand the use of the word. Within EMC I would like to especially thank Alex Tanner, who is part of "Chad's Army" and was instrumental in getting me set up with the EMC NS20 systems – and for giving me ongoing help and support as I re-wrote the book. I would also like to thank Luke Reed of NetApp, who helped in a very similar capacity. Thirdly, I would like to personally thank Mornay Van Der Walt of VMware and the SRM team generally. Mornay is the Director for Enterprise & Technical Marketing. I first met Mornay at Cannes in 2008, and he was very helpful in introducing me to Adam at LeftHand Networks. He was also very helpful in assisting me with my more obscure technical questions surrounding the SRM 1.0 product, without which the idea of writing an SRM 4.0 book would have been impossible. I would also like to thank Lee Dilworth of VMware in the UK. Lee has been very helpful in my travels with SRM, and it's to him that I direct my emails when even I can't work out what is going on! I would like to thank Cormac Hogan, Tim Oudin and Jeff Drury for their feedback. I'm often asked what kind of technical review books like mine go through. The answer is not much, I'm afraid. People often offer to review my work, but almost never have the time to do it. The reality is that you have to employ and pay someone for proper technical and typographical review – something this model of publication doesn't have a budget for. So I


would like to thank these guys for taking the time out and giving me their valuable feedback. Note: I had hoped to include HP EVA and Dell EqualLogic in this book. Unfortunately, there wasn't time to do this. Clearly, it's impossible for me to cover every storage vendor. If you are reading this book and fancy a go at writing, then read one of my chapters on storage replication and see if you can mimic my style and conventions. You never know – if it cuts the mustard, you might find it included in the next edition and be given an authorial credit.

About this Book

This is a complete guide to using VMware Site Recovery Manager (SRM). The versions of ESX and vCenter used are both 4.0. This book was tested against the ESX4i release. This is in marked contrast to the first edition of this book and the SRM product, where ESXi was not initially supported. In the previous edition of the book I used abstract names for my vCenter structures – literally calling the vCenter in the Protected Site virtualcenterprotectesite.rtfm-ed.co.uk. Later I used two cities in the United Kingdom (London and Reading) to represent a Protected Site and a Recovery Site. This time around I have done much the same thing, but the protected location is New York and the recovery location is New Jersey. I thought that as most of my readers are from the US, and there isn't a person on the planet who hasn't heard of these locations, people would latch on quicker. The screen grab below shows my structure – with one domain (corp.com) being used in New York and New Jersey. Each site has its own Microsoft Active Directory domain controller, and there is a router between the sites. Each site has its own vCenter, Microsoft SQL 2005 Server and SRM Server. In this case I've chosen not to use the new "linked mode" feature of vCenter4 – this configuration will be introduced later in the book. I've taken this decision merely to keep the distinction clear – that I have two separate locations or sites. The reality is that all the hardware is safely located in a colocation facility in the UK, in a city called Nottingham.

About You, the Reader

I have a very clear idea of the kind of person reading this book. Ideally, you have been working with VMware vSphere4 for some time – perhaps you have attended an authorized course in vSphere4 such as the "Install, Configure and Manage" class or even the "Fast Track" class. On top of this, perhaps you have pursued the VMware Certified Professional (VCP) certification. So what am I getting at? This is not a dummies' or idiots' guide to SRM. You are going to need some background, or at least read my other guides or books to get up to speed. Apart from that I will be gentle with you – assuming that you have forgotten some of the material from those courses, such as VMFS Metadata, UUIDs


and VMFS Resignaturing, and that you have just a passing understanding of storage replication. Lastly, if you are a VMware Certified Instructor you might find this book very useful, because it is currently quite heavily based on the LeftHand Networks VSA – which is also used in the official VMware courses. This use of the LeftHand Networks VSA shouldn't be construed as a recommendation of their products. I just happened to meet the LeftHand Networks guys at VMworld Europe 2008 in Cannes, and they very kindly offered me two free NFR licenses for their storage technologies. The other storage vendors who helped me whilst writing this book have been just as helpful – it's just that LeftHand Networks got there first. In 2008 both Chad and Vaughn arranged for my lab environment to be kitted out with the very latest versions of their Clariion/Celerra and NetApp FAS systems. This empowered me to be much more "storage neutral" than in the previous edition.

About Hyperlinks

The internet is a fantastic resource, as we all know. However, printed hyperlinks are often quite lengthy, difficult to type correctly and frequently change. I've created a very simple webpage which contains all the URLs in this book. I will endeavour to keep this page up-to-date to make life easy for everyone concerned. The single URL you need for all the links and online content is here: http://www.rtfm-ed.co.uk/srm.html

Disclaimer

No book on an IT product would be complete without a disclaimer. Here's mine: Although every precaution has been taken in the preparation of this book, the contributors and author assume no responsibility for errors or omissions. Neither is any liability assumed for damages resulting from the use of the information contained herein. Phew, glad that's over with!

What's New in SRM 4.0

At this point I would like to flag up what's new in the SRM product. This will form the basis of the new content in this book, and hopefully new chapters as well. It's especially relevant to people who purchased my previous book, as it's these changes that made it worthwhile updating my old book to be compatible with SRM 4.0. In the headings below I've listed what I feel are the major enhancements to the SRM product. I've chosen not to include a change-log style list of every little change. So here I'm looking at new features that might sway a customer or organization in adopting SRM. These changes address flaws or limitations in the previous product which may have "blackballed" an SRM implementation from the get-go.

Compatible with vSphere4: This might seem like a small matter, but when vSphere4 was released many of the advanced management systems such as SRM and View were incompatible with the new platform. This "delay" or lag between the vSphere4 release and the SRM 4.0 release was regarded by some in the community as a poor show by VMware. I'm inclined to be a bit more forgiving than some. Firstly, I think many people underestimate what a huge undertaking vSphere actually is from a development perspective – VMware isn't as big as some of the ISVs it competes with, so it has to be strategic in where it spends its development resources. Secondly, saturating the market with product release after product release can alienate customers who feel overwhelmed by too much change too quickly. Finally, I would personally prefer VMware takes its time with product releases and properly QAs the software rather than rolling out new versions injudiciously. The same


people who complained about the delay would have complained that it was a rush job had the software been released sooner. Most of the people who complained the most viciously about this were contractors whose livelihoods depended on projects being signed off. In short, they were often looking out for themselves, not their customers. Most of my big customers didn't have plans for a roll-out of vSphere until 2010, so the release of SRM in Q4 of 2009 didn't really interfere with their long-term plans.

Support for NFS: As stated earlier, SRM 1.0 was only compatible with fibre-channel and iSCSI SANs. Going forward, SRM 4.0 is compatible with NFS as well.

Shared Recovery Sites: In SRM 1.0 there was a one-to-one pairing of a "Protected Site" with a "Recovery Site". This clearly precluded so-called hub-and-spoke configurations where one DR location could offer up failover resources for many different locations. SRM 4.0 introduces the concept of "shared Recovery Sites" which allows for this configuration. This introduces new capacity planning requirements – the DR location might now need enough resources to deal with multiple outages of the protected locations.

Improved Scalability: Like all new products, the scalability numbers have gone up – in the past this could have been a barrier to adopting SRM. The new version now supports up to 1,000 virtual machines in a single Protection Group. The re-write of SRM has allowed for this jump, as have the improvements in vCenter generally.

Compatibility with Distributed Power Management (DPM): The early version of SRM was not always compatible with some of the advanced clustering features from VMware. A case in point is the DPM feature, which allows VMware to shut down ESX hosts if they are not required from a performance perspective. This new integration allows customers who host their own dedicated DR resources to keep the ESX hosts in the DR site in a powered-down state – and only power them on when carrying out tests or triggering a real DR event. Remember, servers contribute about 60% of the total power consumption in the datacenter.

DR-IP-Customizer Utility: Technically, this is not a new feature to SRM. It was actually introduced in SRM 1.0 Update 1. I took a personal decision not to re-issue an update to the book at the time, as SRM U1 did not introduce much new functionality. But permit me to treat this feature as if it were brand new. Early versions of SRM used vCenter's integration with Sysprep and the "Guest Customization Wizard" to re-IP VMs when they were brought online at the DR location. So as well as being patched into a different network, the DR-based VMs would be given a valid IP address. As you know, Microsoft Sysprep isn't the fastest engine in the world, especially if you are merely using it to change a couple of all-important octets. Additionally, having to configure "Guest Customization" settings for each and every SRM-protected VM was especially administratively intensive (despite a handy copy functionality). To address this, VMware developed the command-line based DR-IP Customizer utility (I sketch the kind of in-guest re-IP work it automates after the list below).

...and a whole host of other improvements and enhancements including:

• A new resilience built into the SRM product, which protects the core service should the vCenter service fail during the running of a Recovery Plan

• A new repair mode feature which allows you to change configuration settings gathered during the installation, such as the vCenter credentials, database details and security certificates used

• A new GUI front-end to the advanced settings of SRM, which were historically held in an .XML file


• Support for IBM DB2

• If you're using certificate-based authentication, SRM validates the certificate's integrity before allowing you to continue

• Support for the VMware Fault Tolerance feature, such that FT-enabled VMs can be included in Recovery Plans. However, FT will need re-enabling on the SRM-protected VM when the Recovery Plan is executed

• Improved context-sensitive help, together with new PDFs held on the SRM 4.0 CD
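To give a feel for the re-IP problem that Sysprep and the DR-IP Customizer address, below is a minimal PowerCLI sketch of the manual alternative: pushing a new static address into a recovered Windows VM via VMware Tools. The VM name, credentials and addresses are hypothetical placeholders, and the real utility drives this kind of change in bulk from a CSV file – consult the SRM documentation for its exact syntax.

    # A minimal sketch, assuming PowerCLI is installed and you are connected
    # to the Recovery Site vCenter. All names, credentials and addresses
    # below are hypothetical.
    Connect-VIServer -Server vcnj.corp.com

    $vm = Get-VM -Name "web01"

    # Push a new static IP into the guest via VMware Tools - the sort of
    # per-VM change that Sysprep or DR-IP Customizer automates at scale
    $netsh = 'netsh interface ip set address name="Local Area Connection" ' +
             'static 192.168.4.10 255.255.255.0 192.168.4.1 1'

    Invoke-VMScript -VM $vm -ScriptText $netsh `
        -GuestUser "corp\administrator" -GuestPassword "Password1" `
        -ScriptType Bat

Run against hundreds of VMs, even this single-adapter case quickly justifies the CSV-driven approach the utility takes.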

A Brief History of Life - before VMware SRM

To really appreciate the impact of VMware's SRM, it's perhaps worth pausing for a moment to think about what life was like before virtualization and VMware SRM were released. Until virtualization became popular, conventional DR meant dedicating physical equipment at the DR location on a one-to-one basis. So for every business-critical server or service there was a duplicate at the DR location. By its nature this was expensive and difficult to manage – the servers were only there as standbys waiting to be used if a disaster happened. For those who lacked those resources internally, it meant hiring out rack space at a commercial location, and if that included servers as well, it often meant the hardware being used was completely different from that at the production location. Although DR is likely to remain a costly management headache, virtualization goes a long way to reducing the financial and administrative penalties of DR planning. In the main, virtual machines are cheaper than physical machines. We can have many instances of software, Windows for example, running on one piece of hardware – reducing the amount of rack space required for a DR location. We no longer need to worry about dissimilar hardware; as long as the hardware at the DR location supports VMware ESX, our precious time can be dedicated to getting the services we support up and running in the shortest time possible. One of the most common things I've heard on courses and at conferences from people who are new to virtualization is, among other things: "We're going to try virtualization in our DR location, before rolling out into production." This is a cautious approach often taken by businesses adopting virtualization technologies for the first time. Whenever this is said to me, I always say to the individual concerned – think about the consequences of what you're saying. In my view, once you go down the road of virtualizing your DR, it is almost inevitable that you will want to virtualize your production systems, for two main reasons. Firstly, you will be so impressed and convinced by the merits of virtualization that you will want to do it anyway. Secondly, and more importantly in the context of this book – if your production environment is not already virtualized, how are you going to keep your DR location synchronized with the primary location? There are currently a couple of ways of achieving this. You could rely solely on conventional backup and restore – but that won't be very slick or very quick. A better alternative might be to use some kind of P2V technology. In recent years many of the P2V (Physical to Virtual Conversion) providers, such as PlateSpin and LeoStream, have repositioned themselves as "availability tools", the idea being that you use P2V software to keep the production environment synchronized with the DR location. These technologies do work, and there will be some merit in adopting this strategy – say, for services that must for whatever reason remain on a physical host at the "primary" location. But generally I am sceptical about this approach. I subscribe to the view that you should use the right tools for the right job. Never take a spanner to do the work of a hammer. From its very inception and design you will discover flaws and problems – because you are using a tool for a purpose for which it was never designed. For me P2V is P2V; it isn't about DR – although it can be re-engineered to do this task. I guess the proof is in the quality of the


re-engineering. On top of this, you should know that in the long term VMware has plans to integrate its "VMware Converter" technology into SRM to allow for this very functionality. Another approach to this problem has been to virtualize production systems before you virtualize the DR location. By doing this you merely have to use your storage vendor's replication or snapshot technology to pipe the data files that make up a virtual machine (vmx, vmdk, nvram, log, snapshot, swapfile) to the DR location. Although this approach is much neater, it in itself introduces a number of problems – not least getting to grips with your storage vendor's replication technology and ensuring there is enough bandwidth available from the Protected Site to the Recovery Site to make it workable. Additionally, it introduces a management issue. The guys who manage the virtualization layer and test the Recovery Plan are not, in the larger corporates, the same people who manage the storage layer. So a great deal of liaising, and sometimes cajoling, has to take place to make these two teams speak and interact with each other effectively. But putting these very important storage considerations to one side for the moment – there would still be a lot of work to do at the virtualization layer to make this sing. Not least, these "replicated" virtual machines need to be "registered" on an ESX host at the Recovery Site, and associated with the correct folder, network and resource pool at the destination. They must be contained within some kind of management system to be powered on, such as vCenter – and additionally, to power on the virtual machine, the "metadata" held within the VMX file might need modifying by hand for each and every virtual machine. Once powered on (in the right order), their IP configuration might need modification. Although some of this could be scripted, it would take a great deal of time to create and verify those scripts. Additionally, as your production environment started to evolve and change, those scripts would need constant maintenance and revalidation. For organizations that create hundreds of virtual machines a month, this can quickly become unmanageable. It's worth saying that if your organization has already invested a lot of time in scripting this process and making a bespoke solution – you might find that SRM does not meet your entire needs. This is a kind of truism. Any bespoke system created internally is always going to be more finely tuned to the business's requirements – the problem then becomes maintaining it, testing it and proving to auditors that it works reliably. It was within this context that VMware engineers began working on the first release of SRM. It has a lofty goal: to create a push-button, automated DR system that simplifies the process greatly. Personally, compared with the alternatives that came before it, I'm convinced that of all the plethora of management tools added to the VMware stable in recent years, VMware SRM is the one with the clearest agenda and remit. People more or less understand and appreciate its significance and importance. At last we can finally use the term "virtualizing DR" without it being a throw-away marketing term. If you want to learn more about this manual DR approach, VMware has written a VMbook about virtualizing DR called "A Practical Guide to Business Continuity & Disaster Recovery with VMware Infrastructure".
It is free and available online here: http://www.vmware.com/files/pdf/practical_guide_bcdr_vmb.pdf I recommend reading this guide, perhaps before even reading this book. It has a much broader brief than mine, which is narrowly focused on the SRM product.
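To give a flavour of the scripting effort this manual approach involves, below is a minimal PowerCLI sketch that registers and powers on replicated VMs at the Recovery Site. The vCenter, host, datastore and VMX paths are hypothetical, and a real script would also need to handle start-up ordering, network reconfiguration and re-IP – precisely the residual work described above.

    # A minimal sketch of "manual" DR, assuming the replicated LUN has
    # already been presented to the ESX hosts and resignatured. All names
    # and paths are hypothetical.
    Connect-VIServer -Server vcnj.corp.com

    $recoveryHost = Get-VMHost -Name "esx3.corp.com"
    $vmxPaths = @(
        "[replica_datastore] db01/db01.vmx",
        "[replica_datastore] web01/web01.vmx"
    )

    foreach ($path in $vmxPaths) {
        # Register the replicated VM with vCenter...
        $vm = New-VM -VMFilePath $path -VMHost $recoveryHost
        # ...and power it on (a real plan would impose a start-up order here)
        Start-VM -VM $vm -Confirm:$false
    }

Even this toy example hides plenty of complexity – answering the "moved or copied?" question, ordering multi-tier applications and fixing IP addresses – which is exactly the work SRM was built to automate.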

What is not a DR Technology?

In my time of using VMware technologies, various features have come along which people often either confuse with, or try to engineer into being, a DR technology. The kind of thing I'm getting at here is when someone tries to take a technology and make it do something


it wasn't designed for. Personally, I'm in favour of using the right tools for the right job. Let's take each of these technologies in turn and try to make a case for their use in DR.

VMotion:

In the early days of me using VMware, I would often hear my clients say that they intended to use VMotion as part of their DR plan. Most of them understood that such a statement could only be valid if the outage was in the category of a planned DR event, such as a power outage or the demolition of a nearby building. Increasingly, VMware and the network and storage vendors are postulating the concept of long-distance VMotion. In fact one of the contributors to this book – Chad Sakac of EMC – had a session at VMworld San Francisco 2009 about this topic. Technically, it is possible to do VMotion across large distances, but the technical challenges are not to be underestimated or taken lightly given the requirements of VMotion for shared storage and shared networking. We will no doubt get there in the end – it's the next logical step, especially if we want to see the move from an internal cloud to an external cloud become as easy as moving a VM from one ESX host in a blade enclosure to another. But putting all this aside, VMware has never claimed that VMotion constitutes a DR technology, despite the FUD that emanates from its competitors. As an indication of how misunderstood both VMotion and the concept of what constitutes a DR location are – one of these clients said they could carry out VMotion from their primary site to their Recovery Site. I asked him how far away the DR location was. The answer was that it was a couple of hundred feet away. This kind of wonky thinking and misunderstanding will not get you very far down the road of an auditable and effective DR plan. The real usage of VMotion currently is being able to claim a maintenance window on an ESX host without affecting the uptime of the VMs within a site. Once coupled with VMware's DRS technology, it becomes an effective performance optimization technology too.

VMware HA Clusters:

Occasionally, I've been asked by customers about the possibility of using VMware HA technology across two sites. Essentially, what they are describing is a "stretched cluster" concept. This is certainly possible, but it suffers from the technical challenges that confront geo-based VMotion – access to shared storage and shared networking. There are certainly storage vendors who will be happy to sell you technology to achieve this configuration – for example, NetApp's MetroCluster technology. The operative word here is "metro". This type of clustering is often limited by distance (say from one part of a city to another). So, as in my anecdote about my client, the distances involved may be too short for the second site to be regarded as a DR location. When VMware put together the design of HA, the goal was to be able to restart VMs on another ESX host. Its primary goal was merely to "protect" VMs from a failed ESX host, which is far from being a DR goal. It was in part VMware's first attempt to address the "eggs in one basket" anxiety that came with many of the server consolidation projects we did in the early part of the last decade. Again, VMware has never made claims that HA clusters constitute a DR solution. Fundamentally, HA lacks the bits and pieces to make it work as a DR technology – for example, unlike SRM there is really no way to order its power-on events or to halt a power-on event to allow manual operator intervention, and it doesn't contain a scripting component to allow you to automate residual reconfiguration when the VM gets started at the other site. The other concern I have is when customers try to combine technologies in a way that is not endorsed or QA'd by the vendor. For example, some folks think about overlaying a stretched VMware HA cluster on top of their SRM deployment. The theory is that they can get the best of both worlds. The trouble is that the requirements of stretched VMware HA and SRM are at odds with each other. In SRM the architecture demands two separate vCenters managing distinct ESX hosts. HA, on the other hand, requires the two or more hosts that make up an HA cluster to be managed by just one vCenter. Now, I dare say that with a little bit of planning and forethought this configuration could be engineered. But the real usage of VMware HA is to restart VMs when an ESX host fails within a site – something that most people would not regard as a DR event.


VMware Fault Tolerance:

VMware FT is a new feature of vSphere4. It allows a primary VM on one host to be "mirrored" on a secondary ESX host. Everything that happens on the primary VM is replayed in "lockstep" on the secondary VM on the different ESX host. In the event of an ESX host outage, the secondary will immediately take over the primary's role. It requires a modern CPU chipset to provide this functionality, together with two gigabit vmnics dedicated to the FT logging network, which is used to send the lockstep data to the secondary VM. FT scales to about four primaries and four secondaries per ESX host, and is currently limited to VMs with a single vCPU. VMware FT is really an extension of VMware HA (in fact FT requires HA to be enabled on the cluster) which offers much better availability than HA, because there is no "restart" of the VM. As with HA, VMware FT has quite high requirements: as well as shared networking and shared storage, there are additional requirements such as bandwidth and network redundancy. Critically, FT requires very low-latency links to maintain the lockstep functionality – and in most environments it will be cost prohibitive to provide the bandwidth to protect the same number of VMs that SRM currently protects. The real usage of VMware Fault Tolerance is to provide a much better level of availability to a select number of VMs within a site than is currently offered by VMware HA.

What is VMware SRM?

Currently, SRM is a DR automation tool. It automates the testing and invocation of "disaster recovery" (DR) or, as it is now called in the preferred parlance of the day, "business continuity" (BC) of virtual machines. Actually, it's more complicated than that – for many, DR is a procedural event. A disaster occurs and steps are required to get the business functional and up and running again. On the other hand, business continuity is more of a strategic concern, dealing with the long-term prospects of the business post-disaster, and it should include a plan for how the business might one day return to the primary site or carry on in another location entirely. Someone could write an entire book on this topic – indeed, books have been written along these lines. So I do not intend to ramble on about Recovery Time Objectives (RTO), Recovery Point Objectives (RPO) and Maximum Tolerable Downtimes (MTD). That's not really the subject of this book. In a nutshell, VMware SRM isn't a "silver bullet" for DR or BC, but a tool that facilitates those decision processes planned way before the disaster occurred. This book is about how to get up and running with VMware's SRM. I started this paragraph with the word "currently". Whenever I do that, I'm giving you a hint that either the technology will change or I believe it will. Personally, I think VMware's long-term strategy will be to lose the "R" in SRM, and for the product to evolve into a site management utility. This would enable people to move VMs from the internal/private cloud to an external/public cloud. It might also assist in datacenter moves from one geographical location to another – for example, because a lease will expire and either it can't be renewed or it is too expensive to renew. With VMware SRM, if you lose your primary or "Protected Site", the goal is to be able to go to the secondary or "Recovery Site", click a button, and find your VMs being powered on at the Recovery Site. To achieve this, your third-party storage vendor must provide an engine for replicating your VMs from the Protected Site to the Recovery Site – your storage vendor will also provide a "Site Recovery Adapter" (SRA) which is installed on your SRM server. At the time SRM 1.0 was released, it only supported fibre-channel SAN and iSCSI. In SRM 4.0 there is now full support for NFS. Anything that increases your options at the storage layer by being protocol agnostic is a good thing. Increasingly we find ourselves living in a converged world – where the storage and the network become one entity. It is still the case that "host-based" replication is not supported with SRM – by which I mean software installed in the ESX host's console which is used to replicate your virtual machines around. In the short term this doesn't look like it will change. VMware's publicly stated goal is to deprecate the "Service Console" version of ESX (which I sometimes refer to as ESX "Classic") in favour of the more pared-down ESXi. Neither version of ESX is promoted as a "development" environment by VMware, so it does


leave a big question mark over the long-term viability of such "third-party software" based replication. However, the door is open to VMware developing host-based replication in future editions of ESX. This might be attractive to SMBs who find storage-vendor-based solutions cost prohibitive. As replication or snapshots are an absolute requirement for SRM to work, I felt it was a good idea to begin by covering a couple of different storage arrays from the SRM perspective. This will give people a basic run-through on how to get the storage replication or snapshot piece working – especially for those like myself who would not class themselves as storage experts. Remember that VMware's SRM does not currently provide a replication or snapshot engine, or control that engine. This book does not constitute a replacement for good training and education in these technologies, ideally directly from the storage array vendor. If you are already confident with your particular vendor's storage array replication or snapshot features, you could decide to progress to Chapter 6: Installing VMware SRM. It was my very good fortune to be introduced, via the product manager for SRM, to guys from LeftHand Networks at the VMworld Europe event in Cannes in 2008. From that introduction I was offered two free NFR licenses for LeftHand Networks' iSCSI SAN virtual appliance, called the VSA, for testing purposes. Some time later I was introduced to guys from both EMC and NetApp. I became very interested in these storage technologies from both an SRM and a VDI perspective. I was fortunate in having access to two EMC NS20 and NetApp FAS2020 systems whilst writing this book – it was this that allowed me to spread my knowledge into other storage vendors' replication technologies. In terms of the initial setup, I will deliberately keep it simple – starting with a single LUN/volume replicated to another array. However, later on I will change the configuration so I have multiple LUNs/volumes with virtual machines that have virtual disks on those LUNs. Clearly, managing the frequency of replication will be important. If we have multiple VMDK files on multiple LUNs/volumes, the parts of the VM could easily become unsynchronized or even be missed out of the replication strategy altogether – thus creating half-baked, half-complete VMs at the DR location. Additionally, at the VMware ESX host level, if you use VMFS extents but fail to include all the LUNs/volumes that make up those extents, the extent would be broken at the recovery location and the files making up the VM corrupted. So how you use LUNs and where your VMs are stored can be more complicated than this simple example will first allow. Our focus is on VMware SRM, not storage. With this said, however, a well-thought-out storage and replication structure is fundamental to an implementation of SRM.
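As a quick sanity check against the multi-LUN problem just described, the following PowerCLI sketch (assuming an existing connection to the Protected Site vCenter) reports any VM whose virtual disks span more than one datastore – prime candidates for arriving half-baked at the DR location if those datastores sit on different replication schedules.

    # A sketch: list VMs whose VMDKs live on more than one datastore.
    # Such VMs need all their datastores kept in the same replication
    # schedule or consistency group.
    Get-VM | ForEach-Object {
        # Extract the datastore name from each "[datastore] path/vm.vmdk"
        $dsNames = $_ | Get-HardDisk |
            ForEach-Object { $_.Filename -replace '^\[([^\]]+)\].*', '$1' } |
            Sort-Object -Unique
        if (@($dsNames).Count -gt 1) {
            "{0} spans: {1}" -f $_.Name, ($dsNames -join ', ')
        }
    }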

What about File Level Consistency?

One concern you will, and should, have is what level of consistency the recovery will have. This is very easy to answer: the same level of consistency you would have had if you had not virtualized your DR. Through the storage layer we could be replicating the virtual machines from one site to another synchronously. This means the data held at both sites is going to be of a very high quality. However, what is not being synchronized is the memory state of your servers at the production location. What this means is that if a real disaster occurs, that memory state will be lost. So whatever happens, there will be some kind of data loss incurred unless your storage vendor has a way to quiesce the applications and services inside your virtual machine. This level of awareness inside the virtual machine was historically limited to your backup software, but increasingly storage vendors such as EMC and NetApp are hooking into the new VMware vStorage APIs – which allow for such things as VMware snapshots to be created. So although you may well be able to power on virtual machines in a recovery location, you may still need to use your application vendor's tools to repair these systems from this "crash consistent" state – indeed, if those vendor tools fail, you may be forced to repair them with something called a BACKUP. With applications like Microsoft SQL and Exchange


this could take a long time, depending on whether the data is inconsistent and the quantity of it to be checked and then repaired. You should really factor this issue into your recovery time objectives. The first thing to ensure in your DR plan is that you have an effective backup and restore strategy to handle possible data corruption and virus attacks.
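For illustration, here is a hedged PowerCLI sketch of the quiescing just described: taking a VMware Tools assisted, quiesced snapshot of a VM – the same mechanism storage vendors can hook via the vStorage APIs before the array replicates the LUN. The VM name is a placeholder, and whether the result is truly application-consistent (rather than merely file-system consistent) still depends on the in-guest VSS or vendor agents.

    # A sketch: take a quiesced snapshot before array replication. -Quiesce
    # asks VMware Tools to flush in-guest I/O; application-level consistency
    # still depends on VSS/vendor agents inside the guest.
    $vm = Get-VM -Name "sql01"
    New-Snapshot -VM $vm -Name "pre-replication" -Quiesce -Confirm:$false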

Principles of Storage Management and Replication

In the next few chapters I will document in detail a series of different storage systems. Before I do that, I want to write very briefly and generically about how storage management is handled by the vendors – and how they commonly manage duplication of data from one location to another. By necessity, the following section is going to be very vanilla and not vendor specific. To address that, I will end with a whole series of web links from these many and varied storage vendors that point to their specific documentation outlining the requirements for their arrays and their configuration with VMware's Site Recovery Manager. When I started writing the first edition of this book I had some very ambitious, I would say outlandish, hopes that I would be able to cover the basic configuration of every storage vendor and how to get VMware's SRM communicating with them. However, after a short time I recognised how unfeasible and unrealistic this ambition was! In this second edition I hope to outsource this content to people in the VMware/storage community and release that material as PDFs, as a companion to this book. After all, this is a book about VMware's SRM – not storage. But storage and duplication are an absolute requirement for VMware's SRM to function, so I would feel it remiss of me not to at least outline some basic concepts and caveats for those for whom storage is not the daily meat and drink.

Caveat Number 1: In essence all storage management systems are the same; it's just that storage vendors confuse the hell out of everyone (and me in particular) by using their own vendor-specific terms. The storage vendors have never got together and agreed on terms. So for some vendors a "storage group" is a "device group", whereas others will call this a "volume group". For others a volume is a LUN, but with another storage vendor "volumes" are collections of LUNs. Indeed, some storage vendors think that the word LUN is some kind of dirty word, and storage teams will look at you like you are from Planet Zog if you use it. In short, download the documentation from your storage vendor and immerse yourself in their terms and language, so that they become almost second nature to you. This will stop you feeling confused, and reduce the number of times you put your foot in inappropriate places when discussing data replication concerns with your storage guys.

Caveat Number 2: All storage vendors sell replication. In fact they may well support three different types, and a fourth legacy type that they inherited from a previous development or acquisition – and, oh, they will have their own unique trademarked product names! Some vendors will not implement or support ALL their types of replication with VMware SRM. So you may have a license for replication type A, but your vendor only supports types B, C and D – this may force you to upgrade your licenses, firmware or management systems to support type B, C or D. Indeed, in some cases you may well need a combination of features, forcing you to buy B and C, or C and D. In fairness to the storage vendors, as SRM has matured you will find they support all their different types of replication; this has mainly been triggered by responding to their competition. In a nutshell, it could well cost you money to make the switch to the right type of replication.
Alternatively, you might find that although the type of replication you have is supported, it isn't the most efficient from an I/O or storage capacity perspective. A good example of this situation is with EMC's Clariion systems. On the Clariion system you can use a replication technology called MirrorView. In 2008, MirrorView was supported by EMC with VMware's SRM, but only in a synchronous mode, not in an asynchronous mode. However, by the end of 2008 this support changed. The


reason this was so significant to EMC customers is because of the practical limits imposed by synchronous replication. Although synchronous replication is highly desirable, it is frequently limited by the distance between the Protected and Recovery Sites. In short, the Recovery Site is perhaps too close to the Protected Site to be regarded as a true DR location. At the upper level, synchronous replication's maximum distance is in the range of 400-450km; however, in practice real-world distances can come down as low as 50-60km. The upshot of this limitation is that without asynchronous replication it becomes increasingly difficult to class the Recovery Site as a genuine DR location. Distance is clearly relative – in the US these limitations become especially significant, as the recent hurricanes have demonstrated, but in my postage-stamp-sized country they are perhaps less pressing! If you're looking for another example of these vendor-specific support differences, HP EVAs are supported with SRM; however, you must have licenses for both their "Business Copy" feature and their "Continuous Access" technology for it to function properly. The Business Copy license is only used when snapshots are created during the testing of an SRM Recovery Plan. The Continuous Access license enables the replication of what HP rather confusingly calls "vdisks" in the storage groups.

Caveat Number 3: Storage management systems have lots of containers, which contain other containers, and so on. This means the system can be managed very flexibly. You can see this as a bit like Microsoft offering a rich and varied set of group structure options in Active Directory. Beware that sometimes this means storage replication is limited to a particular type of container or level. This means you or your storage team have to sit down very carefully and think about how you will group your LUNs, to ensure that you ONLY replicate what you need to, and that your replication process doesn't in itself cause corruption through mismatched replication schedules. Critically, some storage vendors have VERY specific requirements about the relationships between these various containers when used with VMware SRM. Additionally, some storage vendors impose naming requirements for these objects and snapshots. If you deviate from these recommendations, you might find that you can't even get SRM communicating with your storage correctly. In a nutshell, it's the combination of the right type of replication with the right management structures that will make it work – and you can only know that by consulting the documentation from your storage vendor. In short – RTFM!

Now that we have these caveats in place, I want to map out the structure of how most storage vendors' systems work, and then outline some storage planning considerations. I will initially use non-vendor-specific terms. Below is a diagram of a storage array which contains many drives.


In this case:

A. is the array you are using; whether it is fibre-channel, iSCSI or NFS isn't dreadfully important here.
B. shows that, even before allowing access, many storage vendors allow disks in the array to be grouped. For example, NetApp refers to this grouping as a disk aggregate, and this is often your first opportunity to set a default RAID level.
C. is another group - referred to by some vendors as a storage group, device group or volume group.
D. Within these groups we can have blocks of storage, and most vendors do call these LUNs. Some vendors stop at this point, and replication is enabled at group type C, indicated by arrow E.
E. In this case every LUN within the group is replicated to the other array - and if this was incorrectly planned you might find LUNs that did not need replicating were being unnecessarily duplicated to the recovery location, wasting valuable bandwidth and space.
F. Some storage vendors allow for another sub-group. These are sometimes referred to as recovery groups, protected groups, contingency groups or consistency groups. In this case only LUNs contained in sub-group F are replicated to the other array; LUNs not included in sub-group F are not replicated. If you like, group C is the rule, but group F represents an exception to the rule.
G. The last group is a group of ESX hosts that are allowed access either to group C or to sub-group F, depending on what the array vendor supports. These ESX hosts will be added to group G by either their fibre-channel WWN, iSCSI IQN, or IP address or hostname.

The vendors who develop their Site Recovery Adapter (the software that allows SRM to communicate with the storage layer) often have their own rules and regulations about the creation of these groupings; for instance, they may state that no sub-group (F) can be a member of more than one group (C) at any time. Breaking such rules can result in the SRA failing to return all the expected LUNs to the ESX hosts. Some vendors' SRAs automatically allow the hosts access to the replicated LUNs/volumes at the Recovery Site array; others do not, and you may have to allocate these units of storage to the ESX hosts prior to doing any testing. This grouping structure can have some important consequences. A good example of this is when you place virtual machines on multiple LUNs. This is a general recommendation by VMware for performance reasons, as it can allow different spindles and RAID levels to be adopted. If incorrectly planned, you could cause corruption of the virtual machines.

In the example above, the two virtual disks that make up the virtual machine (SCSI 0:0 and SCSI 0:1) have been split across two LUNs in two different groups. The schedule for one group has a latency of 15 minutes, whereas the other has no latency at all. In this case we could get corruption of log files, date stamps and file creation, as the virtual machine's operating system would not be recovered in the same state as the file data. We can see another example of this if you choose to use VMFS extents. As you may know, ESX has the ability to add space to a VMFS volume that is running out of capacity, or to break through the 2TB limitation on the maximum size of a single VMFS volume. This is achieved by "spanning" a VMFS volume across multiple blocks of storage or LUNs.

In this case the problem is subtler than deliberately storing the virtual machine on two separate LUNs in two separate groups. The impression from the vSphere Client would be that the virtual machine is stored on one VMFS datastore; unless you were looking very closely at the storage section of the vSphere Client, you might not notice that the virtual machine's files were being spanned across two LUNs in two different groups. This wouldn't just cause a problem with the virtual machine; more seriously, it would completely undermine the integrity of the VMFS extent. That said, VMFS extents are generally frowned upon by the VMware community at large, but they are occasionally used as a temporary "band aid" to fix a problem in the short term. I would ask you this question: how often in IT does a band aid remain the way we do things - weeks, months or years beyond the time frame we originally agreed? However, I do recognise that some folks are given such small volume sizes by their storage teams that they have no option but to use extents in this manner. It's often caused by quite harsh policies imposed by the storage team in an effort to save space - the reality is that if the storage admins only give you 50GB LUNs, you find yourself asking for 10 of them to create a 500GB extent! If you do, then fair enough - but please do the due diligence to make sure all the LUNs that make up a VMFS extent are being replicated. My only message is to proceed with caution, otherwise catastrophic situations could occur. The ESX host is largely ignorant of the underlying structure, and this lack of awareness could mean you create an extent which includes a LUN which isn't even being replicated. The result would be a corrupted VMFS volume at the destination.

Clearly, there will be times when you feel pulled in two directions. For ultimate flexibility, one group with one LUN allows you to control the replication cycles. Firstly, if you intend to take this strategy, beware of virtual machine files spanned across multiple LUNs and of VMFS extents, because different replication cycles would cause corruption; the people using vSphere - say, your average server guy who only knows how to make a new virtual machine - may have little awareness of the replication structure underneath. Secondly, if you go for many LUNs contained in a single group, beware that this offers less flexibility; if you're not careful you may include LUNs which do not need replicating, or limit your capacity to replicate at the frequency you need. These storage management issues are going to be a tough nut to crack, because no one strategy will suit everyone. But in my own mind I would imagine some organizations could have three groups designed with replication in mind - one might use synchronous replication, and the other two might have intervals of 30 minutes and 60 minutes. It depends greatly on what your "recovery point objectives" are. Such an organization would then create virtual machines on the right VMFS volumes, replicated at the right frequency suited to their recovery needs. I think enforcing this as a strategy would be tricky: how would our virtual machine administrators know the correct VMFS volumes on which to create the virtual machines? Fortunately, in vSphere we are now able to create folders that contain volumes and set permissions on those, so it is possible to guide the people who create VMs to store them in the correct locations. One method would be to create storage groups in the array management software that map to different virtual machines and their functionality, with the VMFS volume names reflecting those different purposes. Additionally, in VMware SRM we can create what are called "Protection Groups"; these Protection Groups could map directly to these VMFS volumes and their storage groups in the array. The simple diagram below illustrates the approach I am proposing.


In this case I could have two "Protection Groups" in VMware Site Recovery Manager - one for the boot/data VMFS volumes for Exchange, and one for the boot/data VMFS volumes for SQL. This would also allow for three types of SRM Recovery Plan: a Recovery Plan to failover just Exchange, a Recovery Plan to failover just SQL, and a Recovery Plan to failover all the virtual machines. Now that I have outlined the principles of storage management, I want to point you in the direction of some extremely important PDF files from various storage vendors, which outline in more detail than I can in this book the storage replication and management requirements of their various technologies. Some of these guides are included with the Site Recovery Adapters when you download them from VMware's website. I'm sorry to say that many of these guides relate to SRM 1.0 rather than SRM 4.0. There are a number of reasons for this. Some vendors haven't yet got round to updating their white papers and best practises. Some vendors persist in only allowing access to the documents to customers - and if you're not a customer you cannot access the documents. I've tried to reach out to all the vendors on this issue, but I received a mixed response. If you find a better link for a document than the ones I provide here, then drop me an email and I will endeavour to update the PDF version of this book.

Storage Vendor Guides

3PAR SRA for VMware Site Recovery Manager Overview
http://www.3par.com/SiteObjects/2A5DF027BB6B45E39686254D991ED27B/3PAR-srm-ds-08.0.pdf

3PAR with VMware SRM White Paper
http://www.3par.com/SiteObjects/E84F0E66E650D9888CBF03BFD48481ED/3PAR_srm-wp-08.1.pdf

LeftHand Failback Procedure for VMware Site Recovery Manager
http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA2-5085ENW&cc=us&lc=en

HP disaster tolerant solutions using Continuous Access for HP StorageWorks Enterprise Virtual Array in a VMware Infrastructure 3 environment [Document ID: 4AA1-0820ENW]
http://h71028.www7.hp.com/ERC/downloads/4AA1-0820ENW.pdf

VMware Site Recovery Manager in a NetApp Environment [Document ID: TR-3671]
http://kb.vmware.com/Platform/Publishing/attachments/1007098_dNetApp-SRM-tr-3671.pdf
http://media.netapp.com/documents/tr-3671.pdf

Disaster Recovery Using Dell EqualLogic PS Series Storage and VMware Site Recovery Manager [Document ID: TR1039]
http://www.equallogic.com/uploadedFiles/Resources/Tech_Reports/TR1039-Dell-EqualLogic-PS-Series-SAN-and-VMware-SRM.pdf
http://www.vmware.com/files/pdf/partners/dell/ASSET_7_SB122_VMware_SRM_DellViD.PDF
http://www.equallogic.com/partnerships/default.aspx?id=6535

Improving VMware Disaster Recovery with EMC RecoverPoint [Document ID: H5582]
http://powerlink.emc.com/km/live1/en_US/Offering_Technical/Technical_Documentation/H5582-VMware_Site_Recovery_Manager_with_EMC_RecoverPoint_Implementation_Guide.pdf

Using EMC SRDF Adapter for VMware Site Recovery Manager [Document ID: H5511]
http://powerlink.emc.com/km/live1/en_US/Offering_Technical/White_Paper/H5511-using-emc-srdf-adapter-vmware-site-rcvry-mgr-wp.pdf

VMware Site Recovery Manager with EMC Celerra NS Series and Celerra Replicator Implementation Guide [Document ID: H5581]
http://powerlink.emc.com/km/live1/en_US/Offering_Technical/Technical_Documentation/H5581-VMware_Site_Recovery_Manager_with_EMC_Celerra_NS_Series_and_Celerra_Replicator_Implementation_Guide.pdf

VMware Site Recovery Manager with EMC CLARiiON CX3 and MirrorView Implementation Guide [Document ID: H5583]
http://powerlink.emc.com/km/live1/en_US/Offering_Technical/Technical_Documentation/H5583-VMware_Site_Recovery_Manager_with_EMC_CLARiiON_CX3_and_MirrorViewS_Implementation_Guide.pdf

DOWNLOAD THESE NOW! READ THEM! RTFM NOW!

Finally, you should know that over on VMware's viops.com website a VMware employee called Cormac Hogan has written a whole series of getting-started guides to configuring and enabling replication between many different array vendors. If you're new to storage replication they are well worth a read.

Steps to setup 3PAR Inserv arrays
http://viops.vmware.com/home/docs/DOC-1471

Steps to setup EMC Clariions
http://viops.vmware.com/home/docs/DOC-1227

Steps to setup EMC Celerra (iSCSI)
http://viops.vmware.com/home/docs/DOC-1233

Steps to setup EMC Celerra NAS Replication
http://viops.vmware.com/home/docs/DOC-1602

Steps to setup IBM SVC
http://viops.vmware.com/home/docs/DOC-1601

Steps to setup NetApp NAS Replication
http://viops.vmware.com/home/docs/DOC-1603

Steps to setup NetApp arrays
http://viops.vmware.com/home/docs/DOC-1229

Summary

Well, that's it for this brief introduction. Before we dive into SRM, I want to spend the next couple of chapters looking at the configuration of this very same storage layer, to make sure it is fit for use with the SRM product. I will cover each vendor alphabetically (EMC, HP, NetApp) to avoid being accused of vendor bias. In time I hope that other vendors will step forward to add additional PDFs covering the configuration of their storage systems too. Please don't see these chapters as utterly definitive guides to these storage vendors' systems. This is an SRM book after all, and the emphasis is squarely on it. If you are comfortable with your particular storage vendor's replication technologies you could bypass them and head directly to chapter 6. Alternatively, you could jump to the chapter that reflects your storage array and then head off to chapter 6. I wouldn't expect anyone to read all the chapters from 2 to 5, unless you're a consultant who needs to be familiar with as many different types of replication as possible - or you're a masochist. With that said, some folks say that being a consultant and being a masochist are much the same thing.


Chapter 2: Getting started with EMC Celerra Replicator


EMC is a company that provides physical storage appliances, and is probably best known for its arrays in the Fibre-Channel market. However, like many storage vendors its systems are aware of multiple storage protocols, and will support iSCSI and NFS connectivity using the "Celerra" system. Like other vendors, EMC does have publicly available virtual appliance versions of its iSCSI/NAS storage systems - specifically, the "Celerra" system is available as a virtual machine. If you want to learn more about the setup of the Celerra VSA, Cormac Hogan of VMware has written a getting-started guide on the viops.com website: http://viops.vmware.com/home/docs/DOC-1233 Additionally, the virtualgeek website run by Chad Sakac has a whole series of blog posts and videos on how to set up the Celerra VSA: http://virtualgeek.typepad.com/virtual_geek/2009/04/new-celerra-vsa.html The whole configuration is called "EMC Celerra NS20FC System with a dual blade with FC Option Enabled" and looks like this:

From the rear, the NS20FC diagram shows you how the Celerra system works.

The system is managed by what's called the "Control Station". This is purely a management node, and is not involved in any I/O. The Celerra is actually two "blades" which contain the code that allows the iSCSI/NAS protocols to work. They are referred to as "Data Movers" because they are responsible for moving data between the ESX hosts and the storage layer; the reason there are two is for redundancy. They find their storage by being in turn uplinked to the Clariion CX3. The complete package, when bought together, allows for all three protocols (Fibre-Channel, iSCSI and NAS) to be used. The EMC Clariion uses a concept called "RAID Groups" to describe a collection of disks with a certain RAID level. In my case "RAID Group 0" is a collection of drives used by the Celerra hosts using RAID5. Allocating physical storage to a Celerra system when it has fibre connectivity to the Clariion is not unlike giving any host (ESX, Windows, Linux) access to the storage in a RAID Group. You can see the Celerra host registered in the Navisphere management system like any other host.

In the screen grab of the Celerra management console above you can see I have two Celerra systems, which have been uplinked to two different EMC Clariion CX3s - new-york-celerra1.corp.com represents the Protected Site (New York) and new-jersey-celerra1.corp.com represents the Recovery Site (New Jersey). I'm going to assume that a similar configuration is already in place.


Creating an EMC Celerra iSCSI Target

Before you begin, it's perhaps worth checking that the Celerra is properly licensed for the features and protocols you want to use. Log in to the Control Station for the Celerra using a web browser, select the system name, and then select the Licenses tab:

Once the Celerra is licensed for iSCSI, we can set about creating an iSCSI Target. The iSCSI Target is the listener that allows inbound iSCSI requests from initiators to be received and processed. Many arrays come ready-configured with an iSCSI Target, but with the Celerra you have complete control to define its properties as you see fit. Like many iSCSI systems, the Celerra supports many iSCSI Targets, each with different aliases. It's important to know that the Celerra's Control Station IP address is used purely for management. The I/O generated by ESX hosts reading and writing to a volume will be driven by the Data Mover's interfaces. Similarly, the I/O generated by Celerra Replication will be driven by the Data Mover's interfaces. If you are unsure what IP addresses the Data Mover's interfaces have, you can see them under +Data Movers and Network

So in my case, on the New York Celerra, 172.168.3.75 (Prod_Replication) will be used to drive replication traffic, whereas 172.168.3.76 (Prod_Access) will be used by the ESX hosts to access the iSCSI LUNs.
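You can also list the Data Mover interfaces and their IP addresses from the Control Station command line. A minimal sketch, assuming the Data Mover is server_2 (adjust the mover name to suit your own system):

# list all configured network interfaces (and their IPs) on the Data Mover
server_ifconfig server_2 -all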

1. In Celerra Manager on the Protected Site Celerra (New York)


2. Select the Wizards node, and select the New iSCSI Target button

3. Click Next in the wizard to select the default Data Mover

4. Enter a unique "target alias" such as newyorkcelerra1 - this is the friendly name by which the iSCSI Target will be known in the management tools. By default, if you leave the "Auto Generate Target Qualified Name" option enabled, the system will create an IQN for the target that looks like this: iqn.1992-05.com.emc:ck2000734004790000-22 Alternatively, you could remove this tick box and set your own custom IQN, such as: iqn.2009-10.com.corp:newyorkcelerra1

5. Next click the Add button to include the network interfaces used by the Data Mover; these network interfaces have IP addresses, and will listen by default on the iSCSI TCP port of 3260 for inbound requests from initiators - in our case this will be the Software iSCSI Initiator which is built into ESX

Note: cge0 with the IP address of 172.168.3.76 is my Prod_Access Data Mover interface

6. Click Finish and Close
Note: This wizard essentially automates the process of creating an iSCSI Target. It could have been created manually, and modified at any time, by navigating to +Data Movers, selecting the Data Mover (in my case nyc_datamover2), then +iSCSI and the Target tab

Note: Now repeat this step at the Recovery Site Celerra (New Jersey)

Granting Access to ESX hosts to EMC Celerra iSCSI Target

Now we have created the iSCSI Target, it's a good idea to enable the Software iSCSI Initiator on the ESX hosts. This means that when we come to create an iSCSI LUN the hosts will be "pre-registered" on the Celerra system, so we will just need to select them. If you have a dedicated iSCSI hardware adapter you can configure your IP settings and IQN directly on the card. One advantage of this is that if you wipe your ESX host, your iSCSI settings remain on the card - however, these cards are quite pricey. Many VMware customers prefer to use the ESX host's iSCSI Software Initiator. The iSCSI stack in ESX4 was recently overhauled; it is now easier to set up and offers better performance. The following instructions explain how to set it up to speak to the Celerra iSCSI Target we recently created. Before you enable the Software Initiator/Adapter in the ESX host you will need to create a VMkernel Portgroup with the correct IP data to communicate with the Celerra iSCSI Target. In the past you also needed a second "Service Console" connection; this is no longer the case in ESX4.


The diagram below shows my configuration for esx1 and esx2; notice the vSwitch has two NICs for fault tolerance.

Before proceeding with the configuration of the VMware software initiator/adapter, you might wish to confirm that you can communicate with the Celerra, using a simple ping and vmkping test against the IP address of the Data Mover. Note: Depending on the version of ESX 4.x.x you are using, you may or may not need to manually open the iSCSI Software TCP port on the ESX firewall. I've always done this manually - to be 100% certain there is no communication barrier between the ESX hosts and the iSCSI Target
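From the Service Console those checks look something like this - a quick sketch, assuming classic ESX 4.0 (not ESXi) and my example addressing:

# confirm VMkernel connectivity to the Data Mover's Prod_Access interface
vmkping 172.168.3.76
# query, then open, the software iSCSI client port (TCP 3260) on the ESX firewall
esxcfg-firewall -q swISCSIClient
esxcfg-firewall -e swISCSIClient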

1. Select the ESX host, and the Configuration tab
2. Select the Security Profile link in the Software section
3. Click the Properties... link
4. In the dialog box open the TCP port (3260) for the iSCSI Software Client

5. Next click the Storage Adapters link and select the iSCSI Software Adapter
6. Choose Properties...
7. In the dialog box click the Configure button
8. Enable the option under Status, as shown below


Note: This can take some time. Be patient. You will not be able to set a custom IQN until you click OK. VMware will try to help you out by setting a default IQN.

9. Click the Configure button again, and replace the auto-generated IQN with one that follows your own standards, like so:

Note: After clicking OK this time, a dialog box may appear indicating you must reboot the ESX host

This is true, but you can defer the reboot until the configuration is finished completely


10. Next select the Dynamic Discovery tab, and click the Add button
11. Type in the IP address of the iSCSI Target which is serviced by the two NICs of the Data Mover - in my case 172.168.3.76

Note: Static discovery is only supported with hardware initiators. Notice here the IP address of the Celerra iSCSI Target is that of the Prod_Access Data Mover interface, not the Control Station which is there purely for management.

12. Click OK
Note: This can take some time as well

13. After clicking Close to the main dialog box, you will be asked if you want to rescan the Software iSCSI virtual HBA (in my case vmhba34). Click Yes
Note: Occasionally, I've noticed that some changes to the Software iSCSI Initiator after this initial configuration may require a reboot of the ESX host. So try to limit your changes where possible, and think through what settings you require up front to avoid this

Note: Now repeat this setup for the Recovery Site ESX hosts – changing the IP addresses relative to the location
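If you prefer, the same discovery and rescan steps can be carried out from the Service Console. A sketch only, assuming the Software iSCSI Adapter is vmhba34 as in my lab:

# add the Data Mover's Prod_Access interface as a send target for dynamic discovery
vmkiscsi-tool -D -a 172.168.3.76 vmhba34
# rescan the software iSCSI adapter so any presented LUNs appear
esxcfg-rescan vmhba34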

Creating a New File System

As with other storage vendors, the Celerra system has its own "file system" within which LUNs can be created. Often storage vendors have their own file system to allow for advanced features such as de-duplication and thin provisioning. It is possible to create the configuration manually, but again I prefer to use the wizards in the Celerra to guide me through the process.

1. In Celerra Manager on the Protected Site Celerra (New York)
2. Select the Wizards node, and select the New File System button

3. Click Next in the wizard to select the default Data Mover
4. Select Storage Pool as the Volume Management Type
5. Select a Storage Pool - in my case I have just one, clar_r5_performance, with 933,494MB of space available

6. Next specify a friendly name for the File System you are creating. In my case I used "newyorkcelerraFS1", which is 200GB in size. It would be possible for me to make a file system almost 1TB in size, and then populate that file system with many iSCSI LUNs. Notice how the value is specified in MB.

7. The rest of the wizard allows for more advanced settings; just accept the defaults, and click Finish and then Close
Note: This wizard essentially automates the process of creating a file system to hold iSCSI LUNs. It could have been created manually, and modified at any time, by navigating to +File Systems and selecting the File Systems tab

IMPORTANT: Now repeat these steps at the Recovery Site Celerra (New Jersey), adjusting your naming convention to reflect the location. However, when you do so there is one critical caveat: the file system at the Recovery Site needs to be slightly larger, to account for the snapshots that are generated by the replication process. How much larger? EMC recommends that if you expect to see a low volume of changes, the file system at the Recovery Site will need to be 20% larger (in my case 240GB). In the worst-case scenario, where you experience a high volume of changes, this reservation can be as high as 150%. If you don't reserve space for the snapshots then you may receive a "Version Set out of space" error message. The fortunate thing is that if you get this wrong you can increase the size of the file system as required, and it doesn't disrupt your production environment. The screen grab below shows the file system on the Recovery Site Celerra (New Jersey) as 250GB. It could be made larger using the Extend button
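For reference, the same check and resize can be done from the Control Station command line. A sketch only - the file system name is my example, and the size increment is illustrative:

# list the file systems and confirm their sizes
nas_fs -list
# extend the Recovery Site file system by a further 50GB
nas_fs -xtend newjerseycelerraFS1 size=50G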


Creating an iSCSI LUN

In this section I will create an iSCSI LUN. Once that is created I will then set up asynchronous replication between the New York Celerra and the New Jersey Celerra using the "ReplicatorV2" technology, for which I have a license. It is possible to create the configuration manually, but in this case I prefer to use the wizards in the Celerra to guide me through the process.

1. In Celerra Manager on the Protected Site Celerra (New York)
2. Select the Wizards node, and select the New iSCSI LUN button

3. Click Next in the wizard to select the default Data Mover

4. Click Next, to accept the Target we created in the previous steps


5. Next select the file system within which the iSCSI LUN will reside. In my case this is the newyorkcelerraFS1 file system that was created in the previous wizard. Notice how although I asked for a file system of 200GB, not all of it is available, as some of the space is needed for the file system metadata itself

6. Next set your LUN number - in my case I chose 100. The "Create Multiple LUNs" option allows you to create many LUNs in one go that all reside on the same file system, each being the same size. In my case, if I did that, they would each be 100GB in size.

7. In the LUN Masking part of the wizard, select the option to "Enable Multiple Access", and use the Grant button to add the known initiators into the list. My ESX hosts are already listed here because I enabled the Software iSCSI Initiator on the ESX hosts and carried out a rescan

8. The next dialog box allows you to set CHAP access (optional). If you do enable this, as you grant each ESX host access you will receive a prompt to set the CHAP secret for that host. Additionally, you would need to revisit the ESX host's Software iSCSI Initiator configuration and configure CHAP there too.

9. Click Finish and Close
Note: This wizard essentially automates the process of creating a LUN and allocating ESX hosts to it by their IQNs. It could have been created manually, and modified at any time, by navigating to +Data Movers, and selecting the iSCSI node and the LUNs tab.

Additionally, if you select the +iSCSI node from the top level, select the Targets tab, right-click the Target, select Properties and then the LUN Mask tab, you can see the ESX hosts that have been allocated to the LUN, like so:

Note: If you return to the ESX hosts that were allocated to the iSCSI Target, they should now have the iSCSI LUN available


At this stage it would be a very good idea to format the iSCSI LUN with VMFS, and populate it with some virtual machines; we can then proceed to replicating the LUN to the Celerra in the Recovery Site (if you prefer the command line to the vSphere Client, see the sketch below).
IMPORTANT: Now repeat these steps at the Recovery Site Celerra (New Jersey), adjusting your names and IP addresses to reflect the location. When you create the iSCSI LUN, remember to set the LUN to be "read-only". When you run a Recovery Plan in SRM, the Celerra SRA will automatically make the LUN read-writable.

There is no need to format the volume or populate it with VMs; this empty volume created at the Recovery Site will be in receipt of replication updates from the Protected Site Celerra (New York).
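Back on the Protected Site, formatting from the Service Console would look something like this. A sketch only - the datastore label is mine, and the device path is hypothetical (substitute the real device path for your LUN, which must already contain a partition):

# rescan so the new LUN is visible to the host
esxcfg-rescan vmhba34
# create a VMFS3 datastore on the first partition of the LUN (device path is an example)
vmkfstools -C vmfs3 -S celerra_virtualmachines /vmfs/devices/disks/naa.600601601234567890:1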

Configuring Celerra Replication

So by now you should have two Celerra systems up and running, each with an iSCSI Target, a file system and an iSCSI LUN. One iSCSI LUN is read-write on the Protected Site Celerra (New York), whilst the other is read-only on the Recovery Site Celerra (New Jersey). Again we can use a wizard to configure replication. Generally, the replication setup is a three-phase process:

• Create a trust between the Protected Site and Recovery Site Celerras in the form of a shared secret/password

• Create a Data Mover Interconnect to allow the Celerras to replicate to each other
• Enable the replication between the Protected and Recovery Sites


Before you begin with the wizard, you may wish to confirm that the two Celerras can see each other via the cge interface that you intend to use for replication. You can carry out a simple "ping" test using the Celerra administration web pages under +Data Movers, +Network and the "Ping" tab, like so:
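The same test can be run from the Control Station command line. A sketch, assuming the Data Mover is server_2 and pinging the other Celerra's replication interface (the address shown is from my lab):

# ping the remote Celerra's replication interface from the local Data Mover
server_ping server_2 172.168.3.75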

1. On the Protected Site Celerra (New York) open the Wizards node
2. Click the New Replication button

3. In Replication Type, select iSCSI LUN
4. In the "Specify Destination Celerra Network Server" part of the wizard, click the New Destination Celerra button

5. In the following dialog box type in a friendly name to represent the Recovery Site Celerra (New Jersey), together with the IP address of its Control Station. The passphrase creates a trust between the Protected and Recovery Sites such that they are then able to send data and block updates


6. Next type the credentials used to authenticate to the Recovery Site Celerra (New Jersey)

7. Now that the Protected Site Celerra (New York) knows of the Recovery Site Celerra (New Jersey), the next part of the wizard, "Create Peer Celerra Network Server", informs the Recovery Site Celerra of the identity of the Protected Site Celerra (New York) - effectively creating a two-way trust between the two Celerras

Note: After clicking Next you will receive a summary of how the systems will be paired together. After clicking the Submit button the Celerras will trust each other with the shared passphrase. When it completes, it will take you back to the beginning of the wizard, where you were asked to select a destination Celerra server.


Select the Recovery Site Celerra (New Jersey) and click Next

8. Now that the Celerras are trusted, we need to create a Data Mover Interconnect that will allow the Data Mover in the Protected Site to send data and block updates to the Recovery Site Celerra (New Jersey). Click the New InterConnect button in the Data Mover Interconnect part of the wizard

9. Then type in a friendly name for the Data Mover Interconnect

In my case I called the source (the Protected Site Celerra) Interconnect "new-york-celerra1-to-new-jersey-celerra1", and I did the reverse for the destination Data Mover Interconnect. Notice how I enabled the advanced settings, so I could select the Prod_Replication interface with the IP address of 172.168.3.75 used by cge1

10. The Interconnect Bandwidth Schedule allows you to control how much bandwidth to allocate to the replication cycle – and to control when replication happens.


Set your schedule as you see fit and click Submit to create the Data Mover Interconnect

11. Now that the Data Mover Interconnect has been created, you can select it in the wizard

12. In the Replication Session Interface you can select which IP address (and therefore which network interface) will take the replication traffic. In my case I selected the 172.168.4.75 address, which is dedicated to Prod_Replication on cge1

13. Next set a friendly Replication Session Name for the session; additionally, select the iSCSI LUN that you wish to replicate. In my case this is the 100GB LUN I created earlier


14. In the Select Destination part of the wizard, select the destination LUN which will receive replication updates from the source LUN

15. The Update Policy allows you to configure your tolerance for what happens if replication is unavailable for a period of time. This reflects your RPO (Recovery Point Objective): if you select 10 minutes, the Recovery Site would be up to 10 minutes behind your Protected Site

16. Finally, click Submit – and the replication process will begin.

Note: This wizard essentially automates the manual process of pairing the Celerras together, creating the Data Mover Interconnects and the Replication Session. You can view and modify these entries from the +Replication node in the Celerra administration web pages. The "Celerra Network Servers" tab is where the Control Station IP address references are held, so that the Protected Site Celerra (New York) knows how to communicate with the Recovery Site Celerra (New Jersey)

The “Data Mover Interconnects” tab shows the network pipe between the two Celerras which is used to transfer replication updates. You can right-click and “validate” these connections, and also modify their properties as you wish


The “Replications” tab shows you the replication session created by the wizard – it has buttons that allow you to stop, start, reverse, switch-over and failback the replication relationships.
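The same sessions can also be inspected from the Control Station command line. A sketch - the session name here is purely illustrative:

# list all replication sessions, then show the detail of one of them
nas_replicate -list
nas_replicate -info newyorkcelerra1_LUN100_session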

Conclusion

In this section I have quickly shown you how to set up the EMC Celerra iSCSI Replicator, which is suitable for use with VMware SRM. We configured two Celerra systems and then configured them for replication. From this point onwards I would recommend you create virtual machines on the VMFS iSCSI LUN on the Protected Site Celerra, so you have some test VMs to use with VMware SRM. SRM is designed to only pick up on LUNs/volumes that are accessible to the ESX hosts and contain virtual machine files. In previous releases, if you had a volume which was blank it wouldn't be displayed in the SRM Array Manager Configuration wizard; in the new release it does warn you if this is the case. This was apparently a popular error people had with SRM 1.0, but one that I rarely saw - mainly because I always ensured my replicated volumes had virtual machines contained on them. I don't see any point in replicating empty volumes! Since ESX 3.5 and vCenter 2.5 you have been able to relocate the virtual machine swap file (.vswp) onto different datastores, rather than locating it in the default location. A good tip is to relocate the virtual machine's swap file onto shared but not replicated storage. This will reduce the amount of replication bandwidth needed. It does not reduce the amount of disk space used at the Recovery Site, as the swap file will be automatically generated on the storage at the Recovery Site. In SRM 4.0 the Array Manager Configuration wizard now displays this error if you fail to populate the datastore with virtual machines:

In my demonstrations I mainly use virtual disks. VMware’s RDM feature is now fully supported by SRM (it wasn’t when version 1.0 was first released). I will be covering RDMs later in this book because it is an extremely popular VMware feature.


Chapter 3: Getting started with EMC Clariion MirrorView/S


EMC is a company that provides both physical and virtual storage appliances, and is probably best known for its arrays in the fibre-channel market. However, like many storage vendors its systems are aware of multiple storage protocols, and will support iSCSI and NFS connectivity using the "Celerra" system. Like some other vendors, EMC does have publicly available virtual appliance versions of its iSCSI/NAS storage systems - specifically, the "Celerra" system is available as a virtual machine. However, at the moment there is no publicly available virtual appliance version of the Clariion system. In 2009 EMC very kindly allocated to me two Clariion CX3s. In fact the whole configuration is the "EMC Celerra NS20FC System with a dual blade with FC Option Enabled" and looks like this:

I have two of these NS20FC systems, and they have each been added as separate "domains" in the Navisphere management console. In the screen grab of the Navisphere management console above you can see I have two Clariion CX3 systems, where the "Local Domain" represents the Protected Site (New York) and the domain called "New Jersey" represents the Recovery Site. I'm going to assume that this is the configuration that you already have in place. If you don't have the two Clariions listed in the same view, take a look at the File menu, which allows you to configure "Multi-Domain Management". Also, I'm going to assume that the work needed at the fabric layer (WWNs and zoning) has already been carried out correctly.


In SRM 1.0 the EMC Clariion SRA required a "Consistency Group" for the SRA to work. This is no longer a requirement in SRM 4.0. Consistency Groups are used when you have multiple LUNs, to ensure that replication with MirrorView keeps the state of those multiple LUNs in synch. Although Consistency Groups are no longer a hard requirement, I will show you how to create them so you know how - because in the main they're a good thing. The requirement in SRM 1.0 for Consistency Groups did create some problems for some people, because at the time there was a limit to the number of groups that could be created.

Creating an EMC LUN

EMC uses a concept called "RAID Groups" to describe a collection of disks with a certain RAID level. In my case "RAID Group 1" is a collection of drives used by ESX hosts using RAID5. In this section I will create a LUN in this RAID Group called "LUN_60_100GB_VIRTUALMACHINES". Once that is created I will then set up synchronous replication between the New York Clariion and the New Jersey Clariion using EMC's "MirrorView" technology

1. Open a web browser to the IP address used to access Navisphere - in my case this is https://172.168.3.79

2. Log in with your username and password
3. Expand + Local Domain (the Protected Site - my "Local Domain")
4. Expand + CKNNNNNNNNNNNN (this is the serial number of the CX3)
5. Expand + RAID Groups
6. Right-click a RAID Group, and select Bind LUN. In my case this is RAID Group 1

Note: "Bind LUN" simply means "create a LUN"

7. In the dialog box select a free LUN ID number - in my case I used LUN ID 60
8. In the same dialog box, under default owner, select the Auto option
9. Finally, set the LUN size - in my case I used 100GB

Note: This LUN number is not necessarily the LUN ID that will actually be presented to the ESX host. The LUN number could be 300, but when it's later allocated to the ESX hosts the number must reside between the values of 0-255, as this is the maximum number of LUNs an ESX4 host can currently see. The Auto option instructs the Clariion to allocate the LUN to one of the storage controllers (either SPA or SPB). The idea of Auto is to stop just one controller (SPA or SPB) owning all the LUNs and thus degrading performance. This is referred to as pseudo-active, where both SPA and SPB can take an I/O load, but only one SP can own a LUN at any one time. If an SP goes down (say, for a firmware update) all its LUNs are transferred to the other controller, in a process referred to as a "trespass". It's worth occasionally checking this ownership to make sure that one SP isn't overworked by too many LUNs or by a couple of I/O-intensive LUNs

10. Click Apply in the Bind LUN dialog box, and click Yes in the second dialog box to confirm your change. Once the LUN has been created, click OK to confirm the successful operation. The Bind LUN dialog box will stay on screen (it assumes you want to create a number of LUNs - click Cancel to dismiss it). You can then right-click the new LUN and choose Properties to give it a more meaningful name; in my case I renamed it LUN_60_100GB_VIRTUALMACHINES.

Note: You can see a number of LUNs in the list. For example, you can see my ISOs/Templates LUN. You can more or less ignore these LUNs, as none of them will be replicated to the Recovery Site (New Jersey). They're used internally by me when I do my development work or other activities.
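Incidentally, the same LUN could have been bound from the Navisphere Secure CLI rather than the GUI. A sketch, assuming naviseccli is installed on a management host, and using my lab's SP address, RAID Group and LUN ID:

# bind a 100GB RAID5 LUN with ID 60 inside RAID Group 1
naviseccli -h 172.168.3.79 bind r5 60 -rg 1 -cap 100 -sq gb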

Configure EMC MirrorView

1. Right-click the LUN you wish to replicate - in my case LUN_60_100GB_VIRTUALMACHINES

2. In the menu select MirrorView and Create Remote Mirror

3. In the dialog box select Synchronous, type in a friendly name, and click OK

4. After clicking OK you will receive this dialog

Note: For MirrorView to be successfully set up, a secondary mirror image needs to be created on the Recovery Site (New Jersey). Clicking Yes to the dialog box above will create a "Primary Image" in the Remote Mirrors node, like so:


Notice how it says in the [brackets] that there are No Secondary Images. Yet!

5. The next step is to create this Secondary Image LUN in the Recovery Site Clariion (New Jersey)

6. Right-click the same LUN again, select MirrorView and this time select Create Secondary Image LUN

7. This will open a dialog box showing the other (Recovery Site) Clariion array visible in Navisphere. Select a free LUN number to be used to create the Secondary Image LUN

Note: After clicking OK, and confirming the other ancillary dialog boxes, this will create the LUN at the Recovery Site Clariion (New Jersey). It can be renamed if you wish, to make it more meaningful; I called mine LUN60_100GB_VIRTUALMACHINES_MIRRORVIEW

8. Next, this new Secondary Image LUN must be added to the Remote Mirror created earlier. On the Protected Site Clariion (New York), right-click the Remote Mirror - in my case "Replica of LUN60 - virtualmachines" - and select Add Secondary Image


9. In the corresponding dialog box select the Recovery Site Clariion (New Jersey) and expand +SPA or +SPB to select the Secondary Image LUN created earlier.

Note: You can reduce the time it takes to synchronize by setting the "Synchronization Rate" to High. This does increase the load on the array, so if you do change it, it's worth setting it back once the first synchronization has completed. Later on we will see that we remove the "Initial Sync Required" option during the SRM "failback" process to reduce this time even further. The Initial Sync Required option causes a full synchronization of data from one LUN to another, so both LUNs have the same data state. The Recovery Policy controls what happens if the Secondary Mirror Image is inaccessible: Automatic forces a re-sync of the data without the administrator intervening, whereas the Manual option would require the administrator to manually sync up the LUNs. The Synchronization Rate controls the speed of writes between the Protected Site Clariion (New York) and the Recovery Site Clariion (New Jersey). After clicking OK, you should see the icons change within Navisphere, indicated with a capital T.

The “T” indicates that MirrorView is in a “transition” state and is in the process of syncing the Primary and Secondary LUNs together.
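You can keep an eye on the mirror's state from the Navisphere CLI as well. A sketch against my lab's SP address:

# list the MirrorView/S mirrors and their current state
naviseccli -h 172.168.3.79 mirror -sync -list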

Creating a Snapshot for SRM Tests

During a test of an SRM Recovery Plan, the replicated volume - in my case "LUN60_100GB_VIRTUALMACHINES_MIRRORVIEW" - is not directly mounted and made accessible to the ESX hosts. Instead, a snapshot of the Secondary LUN is taken, and this snapshot is presented to the Recovery ESX hosts during the test. This allows tests of the Recovery Plan to happen during the day without interrupting normal operations or the synchronous replication between the Primary and Secondary LUNs. This snapshot is not created automatically. Instead it's created manually, and when created it must be named in the correct way in order for the EMC Clariion SRA to locate it. The name of the snapshot must begin with the text "VMWARE_SRM_SNAP". This procedure is carried out at both the Protected Site (New York) and the Recovery Site Clariion array (New Jersey). This will allow for both tests and runs of the Recovery Plans, as well as tests and runs of the failback process.

1. Expand +RAID Groups and locate the Secondary Image LUN
2. Right-click the LUN, and from the menu select SnapView and Create Snapshot

3. In the dialog box type in the name of the snapshot, such as VMWARE_SRM_SNAP_LUN60

Note: It is possible to allocate the snapshot to a storage group, but it is not necessary at this stage. Snapshots use what is called a "Reserved LUN Pool" (RLP). You can see the RLP as an allocation of storage purely for snapshots - as they are not "free". EMC documentation indicates you should reserve 20-30% of the size of the source LUN in the Reserved LUN Pool for the snapshot data; so in my case the 100GB volume would need around 20-30GB in the RLP. Under +SnapView this should create a snapshot following the naming convention outlined a moment ago.

IMPORTANT: If you do engage DR for real, or hard-test your plan, it is likely you will want to test the failback procedure before carrying out failback for real. For this test of failback to be successful you will need a similar snapshot ready at the Protected Site (New York). So repeat this process for the LUN at the Protected Site also
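From the CLI, creating the snapshot with the required naming convention would look something like this. A sketch only, using my lab's SP address and LUN number:

# create a SnapView snapshot of LUN 60 with the SRA-required name prefix
naviseccli -h 172.168.3.79 snapview -createsnapshot 60 -snapshotname VMWARE_SRM_SNAP_LUN60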


Creating Consistency Groups (Recommended)

Remember that, strictly speaking, "Consistency Groups" are no longer required by the EMC Clariion SRA. However, you may find them useful, especially if you are replicating multiple MirrorView-enabled volumes.

1. On the Protected Clariion Array (New York) right-click Consistency Groups and select Create Group

2. Change the Mirror Type to be Synchronous
3. In the same dialog type in a friendly name for the group, then add the Available Remote Mirrors into the Selected Remote Mirrors list, and click OK

Note: After clicking OK, this will create a consistency group containing, in my case, just one LUN. As my system grows and I create more MirrorView-protected LUNs, I could add them to the same Consistency Group or create different consistency groups for different types of application. As you will see later, consistency groups map almost directly to the "Protection Group" object/concept in VMware SRM.

After clicking OK, the consistency group will also be created at the Recovery Site Clariion (New Jersey). Once synchronization is in place it can take some time for the "transition" period, indicated by the letter T, to complete. Wait until these have cleared before setting up the SRA for the SRM Server, as it might return a negative result until the two arrays are in synch.

Granting Access to ESX hosts to Clariion LUNs

Now we have created our Primary, Secondary and Snapshot objects, we can set about making them available to the ESX hosts. This should be a simple procedure of locating the Storage Groups which contain the ESX hosts, and then allocating the correct volumes to them.


At the Recovery Site Clariion (New Jersey)

At the Recovery Site the ESX hosts will need to be granted access to both the MirrorView Secondary Image LUN and the Snapshot created earlier – for both tests and runs of the SRM Recovery Plans to work correctly.

1. Expand + Storage Groups, and right-click the Storage Group that contains your ESX hosts – in my case this is called New_Jersey_Cluster1

2. Choose Select LUNs in the menu
3. Expand +Snapshots and select the Snapshot created earlier - in my case I called this VMWARE_SRM_SNAP_LUN60

4. Expand +SPA or +SPB and locate the LUN you created earlier
5. Select the LUN in the list - in my case I called this LUN60_100GB_VIRTUALMACHINES_MIRRORVIEW

6. Scroll down the Selected LUNs list, and under "Host ID" allocate the LUN number that the ESX hosts will use. In my case, as Host ID 60 was available, I used it

Note: After clicking OK and confirming the usual Navisphere dialog boxes, you should see the LUN appear in the LUNs list in the Storage Group. Notice how the description indicates this LUN is merely a Secondary Copy


Note: The snapshot will only become "active" when you test your Recovery Plans in SRM

At the Protected Site Clariion (New York)

1. Expand + Storage Groups, and right-click the Storage Group that contains your ESX hosts – in my case this is called New_York_Cluster1

2. Choose Select LUNs in the menu
3. Expand +Snapshots and select the Snapshot created earlier - in my case I called this VMWARE_SRM_SNAP_LUN60

4. Expand +SPA or +SPB and locate the LUN you created earlier
5. Select the LUN in the list - in my case I called this LUN_60_100GB_VIRTUALMACHINES

6. Scroll down the Selected LUNs list, and under "Host ID" allocate the LUN number that the ESX hosts will use. In my case, as Host ID 60 was available, I used it


Note: After clicking OK and confirming the usual NaviSphere dialog boxes, you should see the LUN appear in the LUNs list in the Storage Group. Notice how the description indicates this LUN is mirrored

Note: You should now be able to carry out a rescan of the ESX hosts in the Protected Site and format this LUN. In vSphere4 we can now request a rescan of all the affected ESX hosts in the VMware HA/DRS cluster with a single right-click.

After the rescan – format the LUN with VMFS and create some virtual machines.

Conclusion

In this section I have quickly shown you how to set up EMC Clariion MirrorView, which is suitable for use with VMware SRM. From this point onwards I would recommend you create virtual machines on the VMFS volume, so you have some test VMs to use with VMware SRM. SRM is designed to only pick up on LUNs/volumes that are accessible to the ESX hosts and contain virtual machine files. In previous releases, if you had a volume which was blank it wouldn't be displayed in the SRM Array Manager Configuration wizard; in the new release it does warn you if this is the case. This was apparently a popular error people had with SRM 1.0, but one that I rarely saw - mainly because I always ensured my replicated volumes had virtual machines contained on them. I don't see any point in replicating empty volumes! In SRM 4.0 the Array Manager Configuration wizard now displays this error if you fail to populate the datastore with virtual machines:

Since ESX 3.5 and vCenter 2.5 you have been able to relocate the virtual machine swap file (.vswp) onto different datastores, rather than locating it in the default location. A good tip is to relocate the virtual machine's swap file onto shared but not replicated storage. This will reduce the amount of replication bandwidth needed. It does not reduce the amount of disk space used at the Recovery Site, as the swap file will be automatically generated on the storage at the Recovery Site. In my demonstrations I mainly use virtual disks. VMware's RDM feature is now fully supported by SRM (it wasn't when version 1.0 was first released). I will be covering RDMs later in this book because it is an extremely popular VMware feature.


Chapter 4: Getting started with HP LeftHand Scheduled Remote Copy


Lefthand Networks is a company that provides both physical and virtual IP-based storage appliances in the iSCSI SAN market, and was acquired by HP in 2009. In particular they have a virtual appliance called the VSA, which is downloadable from their website for a 60-day evaluation period. In this respect it's ideal for any jobbing server guy to download and play with in conjunction with VMware's SRM. If you follow this chapter to the letter you should end up with a structure that looks like this in the VSA's management console, with the friendly names adjusted to suit your own conventions.

In this screen grab of the HP Lefthand Management Console above you can see I have two VSAs (vsa1.corp.com and vsa2.corp.com) each in their own management group (NYC_Group and NJ_Group). As you can see I have a volume called “virtualmachines” and it is replicating the data from vsa1 to vsa2 to the volume called “replica_of_virtualmachines”. It is a very simple setup indeed, but is enough to get us started with the SRM product.

Some Frequently Asked Questions about HP LeftHand VSA

1. What are the recommended minimums for memory and CPU?

1GB RAM and 1 vCPU offering 2GHz of CPU time or more. Adding additional vCPUs does not significantly improve performance

2. Should the VSA be stored on a local VMFS volume or a shared VMFS volume?
It depends entirely on the quality of the storage. If your local storage is faster and offers more redundancy than any remote storage you have, then you would use local storage. In some environments you might prefer to use shared storage to facilitate backup and deployment, and to allow for high availability with VMware HA. As I don't have much in the way of local storage, I opted to place my VSAs on my fibre-channel SAN

3. VSA is licensed by MAC address - should you use a static MAC address?
It is recommended to use a static MAC address if you decide to purchase the VSA. If you are just evaluating the VSA, or simply using it to evaluate SRM, it is not required - just recommended


4. Can you use vCenter cloning to assist in creating multiple VSAs?
Yes, but the VSA must not be configured or in a management group. If you have procured a licensed version of the VSA, be aware that the template deploy process generates a new MAC address for the new VM, which as such will need licensing or relicensing after being deployed

5. Setting up two VSAs in a management group with all the appropriate settings takes some time. Can you use the clone feature in vCenter to reset lab environments?
Yes - build up the two VSAs to the state you need, then simply right-click the Management Group and choose the option to Shutdown Management Group. You can then clone, delete and re-clone again. You should be careful, as the cloning process changes the MAC address, as does the template process. An alternative to this approach is to learn the HP Lefthand CLI, which allows you to script this procedure; that is not covered in this book

6. Can you capture the configuration of the VSAs and restore it?
No. You can capture the configuration for support purposes, but not for restoring a configuration.

Download and Upload the VSA

The VSA is available to download in the OVF ("Open Virtual Machine Format") format; however, at the time of writing it has yet to be listed on VMware's Virtual Appliance Market Place. You can always download the "ESX" version of the VSA from the HP Lefthand website, either as a zip file or as the OVF format in a zip file. You will see there is an ESX version and a version that will run on a laptop with VMware Workstation or Server; I'm using the ESX version in this book. http://www.hp.com/go/tryvsa It's up to you how you upload the VSA to your ESX host, as much depends on what flavour of ESX you are using. If you are using ESX 4.0 with the Service Console, it is probably easier and quicker to upload the archive un-extracted, and extract it at the Service Console. If on the other hand you are using the ESX4i version, you might find it better to extract it in Windows first, and then upload it to the ESX host using the datastore browser utility - or use the import feature if you have decided to use the OVF method.
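On a classic ESX 4.0 host the extraction might look like this. A sketch only - the datastore path and archive filename are examples, and if your download turns out to be a plain .zip you would use unzip rather than tar:

# change to a datastore and extract the VSA archive (names are examples)
cd /vmfs/volumes/esx1_local
tar -xzf VSA_ESX.tar.gz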

Once suitably extracted, it can then be added to the vCenter inventory by right-clicking the VSA's VMX file, like so:


Personally, I'm trying my best to use the new OVF format wherever possible, especially as more than half of my servers now run ESX4i. That isn't especially difficult, as I currently only have 4 ESX hosts! But the main reason for me adopting the .OVF format is to encourage its use amongst virtual appliance vendors - when it works it is a joy to use.

Importing the HP Lefthand VSA

1. Once you have downloaded the VSA, extract the .ZIP file
2. Open the vSphere Client to your vCenter
3. In the File menu, select Deploy OVF Template
4. Select the Deploy from file radio button and browse to the .OVF file in the extracted zip

5. Accept the OVF Template details 6. Accept the EULA agreement 7. Type in a friendly VM name for the appliance, and select a VM Folder location

to hold it. In my case I called it vsa1, and placed into the “Infrastructure” folder

8. Select an ESX host or cluster upon which the VSA will run


9. If you have one, select a Resource Pool

10. Next select a datastore to hold the VSA’s vmx and VMDK files

11. Finally, select a vSwitch Port Group for the VSA.

Note: Remember this network must be accessible to the ESX hosts to allow the Software iSCSI stack that exists in ESX to speak to the VSA and the iSCSI volumes it presents. Incidentally, the port groups (vlan11, vlan12, vlan13) in this screen grab are port groups on VMware's new Distributed vSwitch. SRM does work with the older Standard vSwitches, but I thought it would be interesting to use the new networking features of vSphere in this book.
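Alternatively, if you prefer the command line, VMware's free OVF Tool can perform the same import. A minimal sketch – the datacenter, datastore and vCenter names below are my assumptions, not values taken from the VSA download:

ovftool --name=vsa1 --datastore=sanlun1 --network=vlan11 VSA.ovf vi://administrator@vcnyc.corp.com/NYC_DataCenter/host/esx1.corp.com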

Modifying the VSA’s Settings

Add a Virtual Disk for Storage:

1. Next add a virtual disk. This disk will be a volume presented to your ESX hosts and used to store virtual machines protected by SRM. As such you will want to make it as big as possible, as you will create VMs here. Additionally, it must be located on SCSI 1:0


Note: Later, when we create volumes in the VSA, you will see it does support "thin provisioning", allowing you to present this volume as any size you like despite not having the actual disk space to hand. Despite this, the larger this third disk is, the more space you will have for your virtual machines. Earlier releases of the VSA could only contain three virtual disks (including the two included in the zip file). The VSA can now support 5 virtual disks on SCSI1:, and with each of the virtual disks being a maximum of 2TB, one single VSA can present up to 10TB of storage.

Licensed by Virtual MAC Address

Before you power on the VSA for the first time, you might want to consider how the product is licensed, should you wish to use the VSA beyond the 60-day evaluation period. The VSA is licensed by the virtual MAC address of the VM, generated by VMware at power on. Whilst this auto-generated MAC address shouldn't change, it can do in some cases – for example where you manually register and unregister a VM from one ESX host to another. Additionally, if you fail to back up the VMX you could lose this information forever. Lastly, if for whatever reason you clone the VSA with the vCenter clone/clone-to-template facility, a brand-new MAC address is generated at that point. To be 100% sure you might prefer to set and record a static MAC address on your VSA. It depends on your circumstances and requirements. If you wish you can set a static MAC address in the range provided by VMware. Since Vi3.5 you have been able to set a static MAC address in the GUI, and we no longer need to edit the VMX file directly.

Whatever you choose, static or dynamic, be sure to make a record of the MAC address so your license key (if you have purchased one) will be valid if you need to completely rebuild the VSA from scratch. HP Lefthand recommends a static MAC address.
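Whether you set it in the GUI or directly in the VMX file, a static MAC address amounts to two lines like the sketch below. The address itself is just an example, but it must fall within VMware's static range of 00:50:56:00:00:00 to 00:50:56:3F:FF:FF:

ethernet0.addressType = "static"
ethernet0.address = "00:50:56:3F:12:34"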


Primary Configuration of VSA Host

Before you carry out the first power on and the primary configuration, you might want to consider your options for creating your second VSA. Although it doesn't take long to add in the VSA, we currently have a VSA which is in a clean and unconfigured state. To rapidly create a second VSA, you could run a vCenter "clone" operation to duplicate the current VSA VM configuration. You can do this even if the VM is located on local storage, as is the case with my VSA1. HP Lefthand does not support cloning the VSA once it is in a management group set up with the client console used to manage the system. The primary configuration involves setting the hostname and IP settings for the VSA from the VMware Remote Console window. You can navigate this utility by a combination of keystrokes such as the tab key, spacebar and enter/return keys. It is very simple to use – just keep away from the cursor keys for navigation, they don't work.

1. Power on both VSA VMs
2. Open a VMware Remote Console
3. At the Login prompt, type start and press [Enter]

Note: The image above has been inverted here to ease printing. The VSA presents a black background/white text environment. You can navigate around the VSA’s console using the tab key and [ENTER]

4. Press [Enter] at the Login prompt

5. In the menu select “Network TCP/IP Settings” and press [Enter]

6. Cursor up, and select < eth0 > and press [Enter]


7. Change the Hostname and Set a static IP address

Note: When I repeated this process for my second VSA, I set the name to be vsa2.corp.com with an IP address of 172.168.4.99/24 with a default gateway of 172.168.4.1. Although all my equipment is in the same rack, I’ve tried to use different IP ranges with routers to give the impression that NYC and NJ represent two distinct sites with different network identities.

8. Press [Enter] to confirm the warning about the restart of networking
9. Use the Back options to return to the main green/blue login page

Note: Repeat this process for the other VSA; in my case I used the IP address of 172.168.4.99 for the second VSA

TIP: You might wish to update your DNS configuration to reflect these hostnames and IP addresses so you can use an FQDN in various HP Lefthand management tools

Install Management Client

Advanced configuration is done via the HP Lefthand Centralized Management Console. This is a simple application used to configure the VSA – there is also a Linux version. Your PC must have a valid or routable IP address to communicate with the two VSAs. You will find the HP Lefthand Centralized Management Console free to download from the HP Lefthand download page. I will be using the Windows version, CMC Windows (.exe)


The install of the Management Console is very simple and isn't worth documenting here; a typical installation should be sufficient for the purposes of this book.

Configure the VSA (Management Groups, Clusters & Volumes)

Adding the VSAs into the Management Console

Note: Before you begin you might as well test your management PC can actually ping the VSAs. You’re not going to get very far in the next step if you can’t.

1. Load the CMC, and the Welcome to Find Modules Wizard will start
2. Choose to search By IP Address or Hostname
3. Click the Add button and type in the IP Address or Hostname of the VSAs
4. Click Finish
5. Click Close

Note: I repeated this add process so I have one management console showing two VSAs (vsa1/2) in two different locations

Adding the VSAs to Management Groups

Each VSA will be in its own management group. During this process you will be able to set friendly names for the groups and volume names. It clearly makes sense to use names that reflect the purpose of the unit in question such as:

• NYC_Group and NJ_Group
• NYC_Cluster and NJ_Cluster
• Virtual_Machines Volume
• Replica_Of_Virtual_Machines Volume
• Alternatively, you might prefer to indicate that the two VSAs are in two different locations – such as London and Reading, or Chicago and New York

Of course it is entirely up to you what naming convention you adopt, but these names are not allowed to contain a space as a character

1. In the Getting Started node, click Management Groups, Clusters and Volumes and click Next at the Welcome page
2. Choose New Management Group
3. For the management group name type something meaningful like NYC_Group, and select the VSA you wish to add – in my case vsa1.corp.com

Note: In a production set-up you could theoretically have 5 VSAs which replicate to each other asynchronously in the Protected Site, and another 5 VSAs in the Recovery Site that replicate to each other and with the Protected location in an asynchronous manner. Spaces are NOT allowed in the Management Group Name. You can use CamelCase or the under_score character to improve readability.

4. Next set a username and password.


Note: This username and password is stored in a separate database internal to the VSA. The DB is in a proprietary binary format and is copied to all VSAs in the same management group. If you are the forgetful kind you might want to make some record of these values. It is in no way connected to the logins of your vCenter or Active Directory environment.

5. Choose Manually set time

Note: As the VSA is a virtual appliance it should receive time updates from the ESX host, which is in turn configured for NTP. To enable this I edited the VMX file of my two VSAs and enabled the tools.syncTime = "TRUE" option

Create a Cluster

The next stage of the wizard is to create a cluster. In our case we will have one VSA in one management group within one cluster – and a separate VSA in a different management group within its own cluster. The cluster is intended for multiple VSAs within one management group; however, we cannot set up replication or snapshots between two VSAs in different sites without one.

1. Choose Standard Cluster
2. Type in a cluster name such as NYC_Cluster
3. Next set a virtual IP. This is mainly used by the cluster when you have two VSAs within the same management group, and strictly speaking isn't required in our case – but it's best practice to set this now for possible future use. In my case I used my next available IP of 172.168.3.98


Note: When I set the virtual IP on my other VSA, I used the IP address of 172.168.4.98

Create a Volume

The next step is creating a volume. Volume is another word for a LUN. Whichever word you are familiar with, we are creating an unformatted block of storage which can be addressed by another system (in our case ESX); once formatted, files can be created on it. Some storage vendors refer to this process as creating a "file system". This can be a little confusing as many people associate this with using EXT3, VMFS or NTFS. A volume or file system is another layer of abstraction between the physical storage and its access by the server. It allows for advanced features such as thin provisioning or virtual storage. A volume can either be fully or thinly provisioned. With thinly provisioned volumes, the disk space presented to a server or operating system can be greater than the actual physical storage available. So a volume can be 1TB in size even though you only have 512GB of actual disk space. You might know this concept as virtual storage, where you procure disk space as you need it rather than up front. The downside is you must track your actual storage utilization very carefully. You cannot save files in thin air. You can switch from Full to Thin provisioning and back again after you have created the volume.

1. Type in a volume name such as: virtualmachines
2. Set the volume size
3. Choose Full or Thin for Provisioning


Note: In this case I created a volume called virtualmachines, which is used to store VMs. The size of the "physical" disk is 37.1GB, but with thin-provisioning I could present this storage as if it were a 1TB volume/LUN. The Replication level option would be used if I were replicating within a management group. In the case of this configuration it is irrelevant, because we are replicating between management groups. When I repeated this process for VSA2 I selected the option to "Skip Volume Creation", as when I set up replication between VSA1 and VSA2 the replication wizard will create for me a "remote volume" to accept the updates from VSA1. At the end of some quite lengthy status bars the management group, cluster and volume will have been created.

Note: Now we must repeat this process for VSA2, but using unique names and IP addresses:

Management Group Name: NJ_Group
Cluster Name: NJ_Cluster
Volume Name: Select "Skip Volume Creation"

Note: Different vendors use different terms for copying a piece of data to another location – such as remote copy, replication and snapshot. All of these terms in the world of storage mean roughly the same thing: a duplication of data from one location to another.

Note: At the end of this process you should have a view which looks similar to this:


Licensing the VSA

Although the VSA is free to evaluate for 60 days without any license key at all, certain advanced features will need a license applied to them to be activated. You can find the Feature Registration tab when you select a VSA from the Storage Node category. License keys are plain text values that can be cut and pasted from the license fulfilment webpage, or obtained by using email/fax to submit your HP order number:

On this window of the interface there is a “Feature Registration Task” with the option to edit the license key


Configuring VSA for Replication

It is very easy to set up replication or a snapshot between two VSAs in two different management groups. With the HP Lefthand VSA we use a "Schedule Remote Snapshot". This allows for asynchronous replication between two VSAs at an interval of your choosing – 30 minutes or higher. A much smaller cycle of replication is supported between two VSAs in the same management group, but these do not work with SRM and were never intended for use across two sites. In the HP Lefthand VSA the snapshot process begins with a local snapshot at the protected location; once completed, this snapshot is then copied to the recovery location. After the first copy the only data transferred is the changes – or deltas. We have a setting to control the retention of this data: we can control how long to retain the snapshot data at both the Protected and Recovery management groups.

1. In the NYC_Group, Protected Cluster, Volumes
2. Right-click your volume and choose New Schedule to Remote Snapshot a Volume

3. Set the Recur Every to be every 30 minutes
4. Under "Primary Snapshot Setup" enable the option to be Retained for a maximum of 3 snapshots

Note: It's really up to you how long you keep your snapshots. In this configuration I would have 3 snapshots in 180 minutes; when the fourth snapshot was taken, the oldest one would be purged. The longer you retain your snapshots and the more frequently you take them, the more options exist for data recovery. In the test environment we are configuring you probably won't want to hang on to this data for too long. The more frequently you take snapshots and the longer you retain them, the more storage space you will require. For testing purposes you might find much less frequent intervals appropriate – as you need less space to retain the snapshots

5. Click Retain maximum of: and set the value to be 3 snapshots
6. Under "Remote Snapshot Setup", make sure the NJ_Group is selected


7. Click the New Remote Volume button. This will start a separate wizard which will create a remote volume on the VSA in the NJ_Group. It will be the recipient of block updates from the other VSA in the NYC_Group.

8. Select the NJ_Group in the Management Groups, Clusters, and Volumes Wizard

9. Click Next, and select the radio button for Existing Cluster and Add a volume to an existing cluster


10. In the next dialog box select the NJ_Cluster (I know it feels odd that you have selected this once already!)
11. In the Create Volume dialog box, type in a friendly name for the volume such as "replica_of_virtualmachines". Notice how the type of this volume is not "Primary" but "Remote"

Note: Click Finish and Close. At the end of this process the New Schedule to Remote Snapshot a volume dialog box will have updated to reflect the creation of the remote volume


IMPORTANT: You will notice that despite setting all these parameters the OK button has not been enabled. This is because we have yet to set a 'start date' or 'time' for the first snapshot.

CAUTION: The frequency of the snapshot and retention values is important. If you create too shallow a replication cycle, as I have done here, you could be mid-way through a test of your Recovery Plan only to find the snapshot you were currently working on is purged from the system. In the end, because of lack of storage, I adjusted my frequency to be an hour – about midway through writing this book I ran out of storage, and that was with a system that wasn't generating much in the way of new files or deleting old files. So my schedule is not an indication of what you should set in the real world if you were using HP Lefthand storage – merely a way of getting the replication working sufficiently for us to get started with SRM.

12. Next to Select 'Start At' time, click the Edit button – and using the date and time interface set when you would like the replication/snapshot process to begin.

13. Click OK

IMPORTANT: If you have not licensed the VSA this feature will work, but only for another 60 days. You may receive warnings about this if you are working with an evaluation version of the VSA.

Monitoring your replication/snapshot

Within the VSA

Of course, you will be wondering if it's working. There are a couple of ways to tell. Expanding the volumes within each Management Group will expose the snapshots. You might see the actual replication in progress, with animated icons, like in the screen grab below


After selecting the remote snapshot, you will see a tab on the right-hand side labelled “Remote Snapshots”. This will tell you how much data was transferred and how long it took to complete.

Replication Frequency Generally

As you have seen, my cycle of replication is not especially frequent, and because of the retention of the snapshots you could regard the HP Lefthand VSA method of replication as offering a series of "undo" levels. To some degree this is true – if we have three snapshots (SS1, SS2, SS3) each separated by an hour, we have the ability to go back to the last snapshot and the one created an hour before it. However, most SRAs default to using the most recent snapshot created, or to creating a snapshot on-the-fly – so if you wanted to utilize these "levels of undo" you would need to know your storage management tools well enough to promote an old snapshot to the top of the stack. In other words, SS1 would become SS4. Lastly, it's worth saying that many organizations will want to use synchronous replication where bandwidth and technology allow. Synchronous replication offers the highest level of integrity because it is constantly trying, in real time, to keep the disk state of the Protected Site and Recovery Site together. Often with this form of replication you are less restricted in the time to which you can roll back your data. You should know, however, that this functionality is not automated or exposed to the VMware SRM product and was never part of the design. As such it's functionality that could only be achieved by manually managing the storage layer. A good example of a storage vendor who offers this level of granular control is EMC, whose "RecoverPoint" technology allows you to roll back the replication cycle on a second-by-second basis. Also remember that synchronous replication is frequently restricted in distance, such that it may be unfeasible given your requirements for a DR location.

Adding ESX hosts & Allocating Volumes to Them

Clearly, there would be little security if you could just give your ESX hosts an IP address and "point" them at the storage. To allow your ESX hosts access to the storage, each must be allocated an IQN (iSCSI Qualified Name) – the IQN is used within the authentication group to identify an ESX host. In case you have forgotten, the IQN is a convention rather than a hard-coded unique name (unlike the WWNs found on Fibre-Channel devices) and takes the format of iqn-date-reverse-fqdn:alias. As a domain name can only be registered once on a particular date (albeit it can be transferred or sold to another organization) it does impose a level of uniqueness fit for its purpose. An example IQN would be: iqn.2009-10.com.corp:esx1 In this simple set-up my ESX hosts are in the NYC Site, and they are imaginatively called esx1.corp.com and esx2.corp.com. My other two ESX hosts (yes, you guessed it – esx3 and esx4) are at the NJ Site and do not need access to the replicated volume in the NJ_Group management group. When I test or invoke DR/BC with SRM, the HP Lefthand SRA will grant them access to the latest snapshot of replica_of_virtualmachines. For the moment ESX3 and ESX4 need no access to the VSAs at all.

Adding an ESX Host

1. Select the NYC_Group
2. Right-click the Servers icon, and select New Server

3. Type in the FQDN of the ESX host as a friendly identifier, and in the edit box under “CHAP not required” enter the IQN of the ESX Host


Note: The VSA does now support CHAP when used with SRM. But for simplicity I’ve chosen not to enable that support here.

Allocating Volumes to ESX Hosts

Now that we have the ESX hosts listed in the VSA, we can consider giving them access to the "virtualmachines" volume I created earlier. There are two ways of carrying out this task. You can right-click a host, and use the "Assign and Unassign Volumes and Snapshots" menu option. This is useful if you have just one volume you specifically want a host to access. Alternatively, the same menu option can be found on the right-click of a volume – this is better for ESX hosts, because in VMware all ESX hosts need access to the same VMFS-formatted volumes for features such as VMotion, DRS, HA and FT.

1. Right-click the volume, in my case virtualmachines
2. Select in the menu the option to Assign and Unassign Volumes and Snapshots
3. In the Assign and Unassign Servers dialog box enable the "Assigned" option for all ESX hosts in the datacenter/cluster that require access, like so:


Note: When you do this you will receive a warning stating that this configuration is only intended for clustered systems or clustered file systems. VMFS is a clustered file system, where more than one ESX host can access the volume at the same time without corruption occurring. So it is safe to continue

Conclusion

For now this completes the configuration of the VSA – all we need to do now is configure the ESX host connection to the VSA. Currently our ESX hosts in the recovery location have no access to the VSA, and will not need it until we test or invoke our DR/BC plan.

Configuring the ESX Software iSCSI

If you have a dedicated iSCSI hardware adapter you can configure your IP settings and IQN directly on the card. One advantage of this is that if you wipe your ESX host, your iSCSI settings remain on the card – however such cards are quite pricey. Many VMware customers prefer to use the ESX host's Software Initiator. The iSCSI stack in ESX4 has recently been overhauled, and it is now easier to set up and offers better performance. The following instructions explain how to set it up to speak to the VSA. Before you enable the Software Initiator/Adapter in the ESX host you will need to create a VMkernel port group with the correct IP data to communicate with the VSA. In the past you also needed to have a second "Service Console" connection. This is no longer the case in ESX4. The diagram below shows my configuration for esx1 and esx2; notice the vSwitch has two NICs for fault-tolerance.

Before proceeding with the configuration of the VMware software initiator/adapter you might wish to confirm you can communicate with the VSA by using a simple test using ping and vmkping.
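From the Service Console this is a one-line test per path – ping exercises the Service Console network, while vmkping exercises the VMkernel port that the iSCSI traffic will actually use (the IP address is the one set on my VSA earlier):

ping 172.168.3.99
vmkping 172.168.3.99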


Enabling the iSCSI Initiator

Note: Depending on the version of ESX 4.x you are using, you may or may not need to manually open the iSCSI Software TCP port on the ESX firewall. I've always done this manually – to be 100% certain there is no communication barrier between the ESX hosts and the iSCSI target.
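On classic ESX (with a Service Console) the firewall port and software initiator can also be enabled from the command line; a sketch, assuming the standard service names:

esxcfg-firewall -e swISCSIClient     # open TCP port 3260 for the software iSCSI client
esxcfg-swiscsi -e                    # enable the software iSCSI initiator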

1. Select the ESX host, and the Configuration Tab
2. Select the Security Profile link, in the Software Tab
3. Click the Properties... link
4. In the dialog box open the TCP Port (3260) for the iSCSI Software Client
5. Next click the Storage Adapters link and select the iSCSI Software Adapter
6. Choose Properties...
7. In the dialog box click the Configure button
8. Enable the option under status, as shown below

Note: This can take some time. Be patient. You will not be able to set a custom IQN until you click OK. VMware will try to help you out by setting a default IQN.

9. Click the Configure button again, and replace the auto-generated IQN with your own standards, like so:

Note: After clicking OK, this time – a dialog box will appear indicating you must reboot the ESX host

This is true, but we will defer the reboot until we are finished completely

10. Next select the Dynamic Discovery Tab, and click the Add button
11. Type in the IP address of the VSA (in my case the VSA in the NYC_Group), 172.168.3.99


Note: Static discovery is only supported with hardware initiators. Notice here the IP address of the VSA is the same as its management port. This means that I/O or read/writes to the volume will go through the same virtual NICs and vmnics. In a production environment you would expect to have many NIC ports which you could segment into different usages such as management, regular I/O and replication traffic

12. Click OK

Note: This can take some time as well

13. After clicking Close to the main dialog box you will be asked if you want to rescan the Software iSCSI virtual HBA (in my case vmhba34). Click Yes

Note: Occasionally, I've noticed that some changes to the Software iSCSI Initiator after this initial configuration require a reboot of the ESX host. So try to limit your changes, and think through what settings you require up front, to avoid reboots where possible

Monitoring your iSCSI Connections

There are many places where you can confirm you have a valid iSCSI connection. This is important because networks can and do fail. In the first instance you should be able to see the volume/LUN when you select the virtual iSCSI HBA in the Storage Adapters view in the vSphere Client:


Note: On the properties of the "virtual" HBA – in this case vmhba32 or vmhba40 (different hosts return a different "virtual" HBA number) – the "Manage Paths" dialog box will display a green diamond to indicate a valid connection, together with other meaningful information which will help you identify the volumes returned.

Note: In the dialog box above you can see the target details show that I’m connected to VSA in the “NYC-Group” and the volume is called “Virtual machines”. Additionally, when you use the Add Storage wizard you should see the volume/LUN as can be seen in this screen grab below:


However, more specifically you can see the status of your iSCSI connections from the VSA's management console

1. Expand the NYC_Group
2. Select the NYC_Cluster
3. Select the Volumes and Snapshots node
4. Select the volume in the list and then click the iSCSI Sessions Tab

Note: In this case you can see there were 2 sessions but one of them failed. This was caused by placing esx2.corp.com into maintenance mode (to remove all the VMs on it), and then doing a reboot whilst it was connected to the VSA

HP Lefthand - Create a Test Volume at the Recovery Site

I want to confirm my ESX hosts at the Recovery Site can communicate with my second VSA. To do this I'm going to create, and give them access to, a blank LUN – so I am satisfied they can all see this "test" volume. For Recovery Plans to work, the ESX hosts in the Recovery Site (New Jersey) do need to be listed in the management system. However, they do not need to be manually assigned access to the replicated volumes – that's something the HP Lefthand SRA will do automatically whenever we carry out a test or a run of the Recovery Plans. Essentially, the steps below repeat the configuration carried out at the Protected Site (New York), but for a different VSA and different ESX hosts.

1. Open the HP Lefthand Centralized Management Console
2. Select the NJ_Group and logon
3. Expand the +NJ_Cluster, +Volumes
4. Right-click +Volumes and choose New Volume


5. In the New Volume dialog box, type in a Volume name such as TestVolume
6. Type in a Volume Size, making sure it is more than 2GB

Note: Although we will not be formatting this LUN, ESX itself cannot format a volume which is less than 2GB in size

7. Click OK

Adding an ESX Host

1. Select the NJ_Group
2. Right-click the Servers icon, and select New Server


3. Type in the FQDN of the ESX host as a friendly identifier, and in the edit box under “CHAP not required” enter the IQN of the ESX Host

Allocating Volumes to ESX Hosts

1. Right-click the volume, in my case TestVolume
2. Select in the menu the option to Assign and Unassign Volumes and Snapshots
3. In the Assign and Unassign Servers dialog box enable the "Assigned" option for all ESX hosts in the datacenter/cluster that require access, like so:


Enabling the iSCSI Initiator

1. Select the ESX host, and the Configuration Tab
2. Select the Security Profile link, in the Software Tab
3. Click the Properties... link
4. In the dialog box open the TCP Port (3260) for the iSCSI Software Client
5. Next click the Storage Adapters link and select the iSCSI Software Adapter
6. Choose Properties...
7. In the dialog box click the Configure button
8. Enable the option under status, as shown below

Note: This can take some time. Be patient. You will not be able to set a custom IQN until you click OK. VMware will try to help you out by setting a default IQN.

9. Click the Configure button again, and replace the auto-generated IQN with your own standards, like so:


Note: After clicking OK, this time – a dialog box may appear indicating you must reboot the ESX host

This is true, but we will defer the reboot until we are finished completely

10. Next select the Dynamic Discovery Tab, and click the Add button
11. Type in the IP address of the VSA in your NJ_Group – in my case, 172.168.4.99


12. Click OK

Note: At the end of this process all the ESX hosts in the Recovery Site (New Jersey) should be able to see the TestVolume from the HP Lefthand VSA.

Shutting Down the VSA

It is recommended to use the VSA Management Console to take a VSA offline.

1. Right-click the VSA in the Storage Nodes
2. Select in the menu Power off or reboot

Conclusion

In this section I have quickly shown you how to set up a 60-day evaluation copy of the virtual appliance, which is suitable for use with VMware SRM. We set up two HP Lefthand VSAs and then configured them for replication or "Schedule Remote Copy". Lastly, we connected an ESX host to that storage. From this point onwards I would recommend you format the volume/LUN with VMFS and create virtual machines. You might wish to do this so you have some test VMs to use with VMware SRM. SRM is designed to only pick up on LUNs/volumes that are formatted with VMFS and contain virtual machine files. In previous releases, if you had a VMFS volume which was blank it wouldn't be displayed in the SRM Array Manager Configuration wizard – in the new release it does warn you if this is the case. This was apparently a frequent error people had with SRM 1.0, but one that I rarely saw – mainly because I always ensured my replicated VMFS volumes had virtual machines contained on them. I don't see any point in replicating empty VMFS volumes! In SRM 4.0 the Array Manager Configuration wizard now displays this error if you fail to populate the datastore with virtual machines:

Since ESX 3.5 and vCenter 2.5 you have been able to relocate the virtual machine swap file (.vswp) onto different datastores, rather than locating it in the default location. A good tip is to relocate the virtual machine's swap file onto shared but not replicated storage. This will reduce the amount of replication bandwidth needed. It does not reduce the amount of disk space used at the Recovery Site, as the swap file will be automatically generated on the storage at the Recovery Site. In my demonstrations I mainly use virtual disks. VMware's RDM feature is now fully supported by SRM (it wasn't when version 1.0 was first released). I will be covering RDMs later in this book because it is an extremely popular VMware feature.


Chapter 5: Getting started with NetApp and SnapMirror


NetApp is probably best known for providing physical storage arrays offering data deduplication for the NAS market. Actually, their (physical) storage appliances are multiple-protocol arrays that support fibre-channel, fibre-channel over Ethernet, and iSCSI SANs, as well as NFS & SMB NAS connectivity. Unlike other vendors, NetApp does not have a publicly available virtual appliance version of their storage systems. However, customers and people who attend their training courses can frequently acquire the ONTAP Simulator, which runs inside a VMware virtual machine. As I have access to the real deal in my lab environment, and the ONTAP Simulator is not publicly downloadable, I've chosen not to cover the setup of the Simulator. If you do have access to it, Cormac Hogan on the viops.com website has created a quick guide to getting started with it: http://viops.vmware.com/home/docs/DOC-1603 At the time of this writing my sources in NetApp are assuring me that a NetApp simulator should appear with Data ONTAP 8.0. I wonder if this will be a publicly available VSA – we can only hope. In 2009 NetApp very kindly allocated me two NetApp FAS2020 systems. They are racked up in my collocation facility – and they look very much like two 2U servers with vertically mounted disks behind the bezel. From what I understand, once you know one NetApp array, you know them all. As such, what we cover in this chapter should apply to all NetApp deployments, large and small.

In the main I manage them using the new "NetApp System Manager". While this management console is very friendly, it currently lacks the ability to configure SnapMirror replication between multiple arrays. So in this chapter I will use "FilerView", which is a web-based administration tool natively built into every controller.

In the screen grab of the NetApp System Manager console above you can see I have two NetApp FAS2020 systems (new-york-filer1.corp.com and new-jersey-filer1.corp.com). As you can see I have a volume called "virtualmachines" and it is replicating the data from one to the other – or what NetApp call "SnapMirror". It is a very simple setup indeed, and is enough to get us started with the SRM product. NetApp System Manager is used for day-to-day management of the NetApp Filer. If you want to configure NetApp SnapMirror, which is required for SRM work, then you need to fire up "FilerView", which as I stated a moment ago is a web-based system available on each FAS system.


Provisioning NetApp storage for VMware ESX

Every NetApp storage system has the ability to serve storage over multiple protocols, so you can attach storage to ESX servers and clusters over NFS, Fibre Channel, FCoE and iSCSI all from one NetApp box. To make provisioning a lot faster, NetApp have created a vCenter plug-in called the Rapid Cloning Utility which, in addition to cloning virtual machines, lets you create, resize and deduplicate datastores and storage volumes. However, in this example we'll show you how to provision storage the old-fashioned way. The process will be slightly different depending on whether you're provisioning storage over NFS, or provisioning LUNs over Fibre Channel and iSCSI – so we'll cover those in separate sections.

As the above diagram illustrates, it is also possible to connect storage, such as LUNs over iSCSI and file shares, directly into virtual machines over a virtual network. Storage presented in this manner will be invisible to VMware Site Recovery Manager, and so will not be covered in this book.

Creating NetApp Volumes for NFS

NetApp uses a concept called "Aggregates" to describe a collection or pool of disk drives of similar size and speed. Within the aggregate, disks are automatically arranged into RAID groups. In my case aggr0 is a collection of drives used to store Data ONTAP, which is the operating system that runs on all NetApp storage controllers. Aggr1 is the remainder of my storage, which I can present to the ESX hosts in the New York Site. To create volumes you login to the "NA Admin" web pages to access the "FilerView" interface.

1. Open a web-browser to your Protected Site NetApp Filer (New York): https://nameofyourfiler/na_admin – in my case this is https://new-york-filer1.corp.com/na_admin
2. Login with root and your password
3. Click the FilerView icon


4. Open the category called Volumes, and select Add. This will open the Volume Wizard
5. In the Volume Wizard window, click Next to add a new volume
6. Select Flexible as the volume type

Note: Flexible volumes (sometimes referred to as FlexVols) offer significant management improvements over traditional volumes and cache volumes (traditional volumes appear here purely for backwards-compatibility reasons)

7. Next type in the name of the volume; in my case I called mine vol1_virtualmachines


8. Next select the aggregate that will hold the volume; in my case I selected aggr1
9. The next part of the wizard allows for a lot of options. In my case I wanted to create a volume that the ESX host would see as 100GB, with 20% reserved on top for temporary snapshot space.

Note: If I had left the radio button in the total size location, this would create a 100GB volume with 100GB useable by the ESX host, and no space reserved for snapshots.

10. Click Commit to create the volume


Note: Once the commit process completes the volume should appear in the list if you click the Manage link under volumes

Next login to the filer in the Recovery Site, in my case new-jersey-filer1.corp.com, and repeat this volume creation process. This volume will be used to receive updates from the Protected Site NetApp Filer (new-york-filer1.corp.com). The only difference at the Recovery Site NetApp Filer (new-jersey-filer1.corp.com) was that I decided to call the volume vol1_replica_of_virtualmachines.

This volume must be the same size or larger than the volume previously created for SnapMirror to work. So watch out with the MB/GB pull-down lists, as it's quite easy to create a volume that's 100MB, which then cannot receive updates from a volume that is 100GB. It sounds a pretty idiotic mistake to make, and it is easily corrected by resizing the volume, but you'd be surprised how easily it is done. The way I know is that this idiot (me) has done it quite a few times!

Modify the Export Properties

By default when you create a volume, NetApp auto-magically makes that volume available using NFS. However, the permissions required to make the volume accessible do need to be modified. As you might know, ESX hosts must be granted access to the NFS export by their IP address, and they also need "root" access to the NFS export. We only need to export the volume in the Protected Site, as Site Recovery Manager, with the NetApp SRA, will handle exporting the SnapMirror destination volume when necessary.

1. In FilerView, open the NFS option
2. Select the Manage Export option
3. Click the link that shows the existing NFS Export settings – in this case it says Read-Write Access (All Hosts) Security (sys)
4. In the NFS Export Wizard, enable the Root Access option, and click Next
5. Accept the default export path, in my case /vol/vol1_virtualmachines
6. In the NFS Export Wizard – Read-Write Access, remove the default that All Hosts have Read-Write access
7. Click the Add button and type in the IP address of your ESX hosts


Note: This is a VMKernel port on your ESX hosts in the Protected Site (in my case New York hosts esx1.corp.com and esx2.corp.com) used to mount the NFS export to the ESX host. Repeat this task for each ESX host in the Protected Site that will need access to this NFS Export.

8. After clicking Next, in the NFS Export Wizard – Root-Access page, repeat the same process
9. Accept the default security flavour
10. At the end of the wizard click Commit to apply your changes


Note: At the end of this process the permissions on the NFS Export will have been updated; you will need to click the Refresh button in FilerView to see this change

This graphical process can become somewhat long-winded if you have lots of volumes and ESX hosts to manage. You might prefer the command-line options to handle this at the NetApp Filer itself, granting both hosts read-write and root access in one line, like so:

exportfs -p rw=172.168.3.101:172.168.3.102,root=172.168.3.101:172.168.3.102 /vol/vol1_virtualmachines

Granting ESX Hosts Access to NetApp Volumes

The next stage is to mount the "virtualmachines" volume we created on the Protected Site NetApp Filer (new-york-filer1). Before you mount an NFS export at the ESX host you will need to create a VMkernel port group with the correct IP data to communicate with the NetApp Filer. In the past you also needed to have a second "Service Console" connection. This is no longer the case in ESX4. The diagram below shows my configuration for esx1 and esx2; notice the vSwitch has two NICs for fault-tolerance.

Before proceeding with the next part of the configuration you might wish to confirm you can communicate with the NetApp Filer by using a simple test using ping and vmkping.

Page 102: Administering VMware SRM 4.0

102

1. In vCenter, select the ESX host and click the Configuration tab
2. In the Hardware pane, select the Storage link
3. Click the Add Storage link in the far right-hand corner
4. Select Network File System in the dialog box
5. Type in the IP address or name of the Protection Site Filer
6. Type in the name of the volume using /vol/ as the prefix
7. Type in a friendly datastore name

Note: For reasons of clarity I've used the FQDN of new-york-filer1.corp.com. In reality I prefer to use a raw IP address – I've found the mounting and browsing of NFS datastores much quicker if you use an IP address in the dialog box rather than an FQDN. Although quite frankly this might have more to do with my crazy DNS configurations within RTFM. Within a single week my lab environment can have up to 4 different identities – from this point onwards I'm sticking with corp.com. That said, it's an event that has caused some issues!
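On classic ESX the same mount can also be scripted from the Service Console; a sketch, where the datastore label ("netapp_virtualmachines") is my assumption:

esxcfg-nas -a -o new-york-filer1.corp.com -s /vol/vol1_virtualmachines netapp_virtualmachines
esxcfg-nas -l     # list NAS datastores to confirm the mount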

Creating NetApp Volumes for Fibre Channel and iSCSI

Storage presented over Fibre Channel and iSCSI requires block-based access – this means that the host will see what appears to be a disk drive. To achieve this, many storage vendors carve out areas from their physical disk drives into logical units (LUNs). NetApp takes a slightly different approach. Instead, LUNs are created inside Flexible Volumes, which allows features like deduplication to work with LUNs. So, to provision a LUN to be used over Fibre Channel or iSCSI, we first need to create a FlexVol for the LUN to live in.

1. Open FilerView, expand the Volumes category, and select Add. This will open the Volume Wizard


2. Follow the same process as described in the "Creating NetApp Volumes for NFS" section to create a new Flexible Volume; however, this time be sure to select a Snapshot Reserve of zero.

3. We'll also want to disable automatic snapshots on the volume, since we're going to configure SnapMirror later, which will take its own snapshots. Navigate to the Volumes – Snapshots – Configure section in FilerView:

4. In the Configure Snapshots page, select your volume from the drop down at the top, and then uncheck the box for scheduled snapshots.

5. So now we have a FlexVol ready for our LUN. Navigate to the LUNs section in FilerView, and open the LUN Wizard.


6. In the LUN wizard, enter the path name for where you want the LUN to be created. This will be /vol/<yourvolumename>/<yourLUNname>

7. Enter the size for your LUN, and select the LUN type of VMware. Also, uncheck the “Space Reserved” box, as this will allow our LUN to be thin-provisioned.

8. Next, we can configure the initiator group (or igroup) that we’ll map the LUN to. An igroup is a collection of initiators that will be given access to our LUN. Click Add Group to create a new igroup. When you reach the “Specify New Group Parameters” section of the wizard, you can select FCP to use the igroup for Fibre Channel initiators, or iSCSI to use the igroup for iSCSI.


9. When you reach the "Add Initiators" stage of the wizard, click Add Initiator. You will then be able to insert the WWPNs (World Wide Port Names) for your ESX server's HBAs.


10. If you're not sure of your ESX server's HBA WWPNs, you can easily locate them within vCenter and just copy/paste them.

11. Continue through the LUN wizard. When prompted to assign a LUN number, you can leave this blank, and the next available LUN number will be assigned.


12. At the final step of the wizard, review the summary, and click Commit to apply the changes.
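The same provisioning can also be carried out from the filer console; a sketch, where the igroup name, WWPN, LUN path and size are all my assumptions:

igroup create -f -t vmware nyc_esx_hosts 21:00:00:e0:8b:12:34:56
lun create -s 100g -t vmware /vol/vol2_virtualmachines/lun0
lun map /vol/vol2_virtualmachines/lun0 nyc_esx_hosts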

Gaining ESX Access to NetApp LUNs

Now that we've created a LUN, and presented it to our ESX server HBAs, we can create a datastore to use the LUN.

1. The first thing we need to do is tell our ESX servers to Rescan their HBAs to detect the new LUN. We can do this from vCenter.

2. For QLogic HBAs, you might need to rescan a second time before the LUN is detected. You’ll see a new LUN listed under the HBA’s devices once the rescan has completed.


3. So now our ESX server can see our LUN, and we can create a new VMFS Datastore to use it.

Configure NetApp SnapMirror

A Quick introduction to SnapMirror

SnapMirror is the main data replication feature used with NetApp systems. It can perform synchronous, asynchronous, or semi-synchronous replication in either a Fibre Channel or IP network infrastructure. We're going to configure SnapMirror to replicate asynchronously between the volumes for NFS and Fibre Channel which we created earlier on the Protected and Recovery site Filers – but there are a couple of things we need to set up first.

Confirm IP Visibility (Mandatory) and Name Resolution (Optional)

Before beginning with the setup of replication between the two NetApp Filers it's worth confirming that they can see each other through your IP network. I like to enable SSH support on my filers so I can use PuTTY with them, as I do with my ESX hosts. It means I can carry out interactive commands without resorting to the BMC card. NetApp Filers obviously support the "ping" command – and using this with both the IP address and host name of the Recovery Site NetApp Filer (New Jersey) you can work out whether they can see each other, and also whether name resolution is correctly configured.

If you fail to receive positive responses in these tests, then check out the usual suspects. Begin with confirming the DNS configuration on the filers under Network and Configure Host Name and Resolution.

Enable SnapMirror (Both the Protected & Recovery Filer)

On newly installed NetApp systems, the SnapMirror feature is likely to be disabled. For SnapMirror to function it needs to be licensed and enabled on both systems.

1. Login to FilerView
2. Expand the SnapMirror option
3. Click the Enable/Disable link


4. Click the Enable SnapMirror button

Note: Repeat this task on the Recovery Site NetApp Filer (New Jersey). If the filer you are working with is new, then you might want to confirm you have the correct licenses in place for SnapMirror.

Enable Remote Access (Both the Protected & Recovery Filer)

In order for us to configure NetApp SnapMirror we need to allow the filer from the Recovery Site (New Jersey) access to the Protected Site NetApp Filer (New York). When we configure this we can use either an IP address or FQDN. Additionally, we can indicate if the Recovery Site NetApp Filer (New Jersey) is allowed remote access to all volumes or just selected ones.

1. Login to FilerView on the Protected Site NetApp Filer, in my case new-york-filer1
2. Select SnapMirror
3. Click Remote Access and click Add
4. In the left-hand window, type in the IP address or name of the Recovery Site NetApp Filer (New Jersey) and then click Add


Note: Repeat this task at the Recovery Site NetApp Filer (New Jersey) to allow the Protected Site access to the Recovery Site’s volumes.
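Both the enabling of SnapMirror and the remote-access grant can also be done from the filer console; a sketch, to be run on each filer with the other filer's name substituted:

options snapmirror.enable on
options snapmirror.access host=new-jersey-filer1.corp.com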

Configure SnapMirror on the Recovery Site NetApp Filer (New Jersey)

The next thing is to log in to the NetApp Filer on the Recovery Site (New Jersey), and enable the replication. We’ll need to restrict our destination volume, in the Recovery Site, so that only SnapMirror can make changes to it. Then we can create the SnapMirror relationship.

1. In FilerView, select Volumes and Manage
2. In the corresponding pane, select the volume that will be receiving updates, in my case vol1_replica_of_virtualmachines
3. Click the Restrict button and click OK in the confirmation dialog box

Note: After carrying out this task the web-page should refresh, indicating the volume is now restricted
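The equivalent from the destination filer's console is a one-liner:

vol restrict vol1_replica_of_virtualmachines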

Note: Next we configure a SnapMirror relationship so that the Recovery Site NetApp Filer (New Jersey) receives updates from the Protected Site NetApp Filer (New York)

4. Under SnapMirror in FilerView, click the Add button

5. In the SnapMirror Wizard, select the volume which will receive updates from the Protected Site – in my case vol1_replica_of_virtualmachines. Later on in the wizard you will be able to select the source for those updates – in my case new-york-filer1's volume called "virtualmachines"

Note: Initially, it might feel odd that the SnapMirror configuration is controlled at the Recovery Site NetApp Filer (New Jersey), and that in the wizard you specify the destination location before the source location. But if you think about it – in the event of a real disaster it would be the destination filer, at the DR location, from which you would be managing the storage layer.

6. In the next part of the wizard, type in the IP or name of the Protected Site NetApp Filer which will be the source of updates for the volume.

7. The next part of the wizard, called "Set Schedule Arguments", controls two main parameters – Restart Mode and Maximum Transfer Rate

Note: The Restart Mode controls how the NetApp Filer will behave should the transfer of data be interrupted due to a network outage. The pull-down list allows for "Schedule Priority", "Always" and "Never". With the schedule priority, replication will resume at the point it was interrupted at the next scheduled time. With "Always", replication will resume from the point at which it was interrupted. With "Never", replication will restart from the beginning rather than resuming at the point of interruption. Changing the Maximum Transfer Rate allows you to throttle the network bandwidth allocated to the replication in KB/sec. This can be useful if you are using existing WAN links (rather than dedicated ones) for replication, and you do not want the replication to adversely affect the bandwidth needed for other applications. Of course, doing this will slow down replication and it will take longer (relative to the changes that are taking place) for replication to complete. As you can see from the wizard, leaving this field blank allows the replication to use ALL the bandwidth available to it.

8. The next part of the wizard, called "Set Schedule", allows you to configure the frequency of the replication

Note: Clearly, this allows you to control how often replication will happen. For DR purposes you will probably find the daily, weekly, and monthly presets inadequate for your RTO/RPO goals. That really leaves you with just the Every Minute and Every Hour options. If you select the Every Hour option under "Repeat Mirror", you will be able to control whether the replication happens on the hour, or 5, 10, 15, 20, 25, 30, 40, 45, 50, 55 minutes past the hour. You might find using a custom schedule allows for greater flexibility – replicating or mirroring specific volumes with differing frequencies, again based on your RPO/RTO goals. The Help button next to the "Schedule Format" is particularly helpful as it offers examples of how to configure this. Each part of the entry (separated by a space) in the Custom Schedule Format represents a time period – minute, hour, day-of-month, day-of-week. The values can be a range (say from 11pm-1am) and multiple values for a time period can be separated with a comma. The asterisk is used as a wildcard to indicate that you accept the default for that period. In fact this is just using the "cron" format, which you may be familiar with from Linux. For example, to configure a schedule that would replicate at 5 minutes past every hour, Monday through Friday, you would type: 5 * * 1,2,3,4,5 If you wanted to replicate at 5 minutes past every hour, every day, you would simply type: 5 * * * Additionally, you can configure the relationship to replicate synchronously or semi-synchronously via the command line.
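The wizard ultimately writes this schedule into /etc/snapmirror.conf on the destination filer, so you can also create the relationship there directly. A sketch using my filer and volume names (the "-" accepts the default transfer arguments):

new-york-filer1:vol1_virtualmachines new-jersey-filer1:vol1_replica_of_virtualmachines - 5 * * 1,2,3,4,5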

9. Once you have configured the schedule, click Next – and then click Commit
10. If you select SnapMirror and the Manage option you will see the relationships there. Notice how in the screen grab below the relationship is currently "unknown"

This is because the relationship has yet to be "initialized" or triggered. The initialization will perform a baseline copy from the source volume to the destination. After this first copy, SnapMirror will replicate changes only. To trigger the initialization process, click the Advanced link in this window, and then click the Initialize link

While the SnapMirror relationship is initializing, its status will appear in the main view like so. Once the initialization has completed, the status should change to "idle" with a state of "snapmirrored". Occasionally when you refresh the page the status will change from "idle" to "transferring" – this is because an update is being transferred.
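The baseline copy can also be kicked off and monitored from the destination filer's console; a sketch:

snapmirror initialize -S new-york-filer1:vol1_virtualmachines new-jersey-filer1:vol1_replica_of_virtualmachines
snapmirror status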


Introducing NetApp Rapid Clone Utility 3.0 and Virtual Storage Console

If you follow my RTFM blog frequently you will know I've been playing around with the new vStorage plug-ins to vCenter for some time. Well, this week I was very fortunate to be granted access to NetApp's latest and greatest version of the Rapid Clone Utility 3.0. It's currently in beta but will GA very soon, and I'm very lucky to be able to give you a sneak preview. Of course, you do need to be a NetApp customer for the plug-in to work, but the beauty of the RCU is that it's totally free. And in my book, and in the current economic climate, free is always good! I decided to begin this chapter with creating volumes manually using FilerView. However, you might find the RCU is a more efficient way of creating and mounting volumes to many ESX hosts. The RCU is a neat little tool – albeit quite poorly named, because it doesn't just clone virtual machines. It will help you with your day-to-day administration. And if you're more of a NetApp admin than a VMware person, it's a way of giving your VMware people the ability to manage their own storage (if that doesn't scare the living daylights out of you!) rather than bugging you all the time – safe in the knowledge that if anything goes wrong, it's their fault not yours! [JOKE!]. If you couple the RCU with NetApp's Virtual Storage Console (VSC) you will be giving your vCenter much more flexibility. Anyway, without any further ado, let's get this puppy installed and see what it can do for you. You can download the utilities from: http://now.netapp.com

Installing and Post-Configuration

The NetApp RCU and Virtual Storage Console are both installed to your vCenter server, and run as services on VMware’s management platform. There’s no client as such to install, which I actually like. The UI of the RCU is served up as a series of web pages embedded into vCenter. Normally, I’m a bit sceptical about this – after all, presenting a web page in vCenter is no great shakes, as the HP Insight Manager plug-in demonstrates. The difference with the RCU is that you get the feel of a real application inside vCenter. For the installation of the RCU and VSC you pretty much get a next-next installation routine. During the installation of the RCU the wizard asks you for the authentication details of the vCenter – such that the RCU can interact with vCenter from a plug-ins/extensions perspective:

With the VSC the installer opens a web page where you input the same information. It’s not clear why the plug-ins have different front-ends. I suspect different teams within NetApp wrote them.

It is possible to install the vStorage plug-ins to separate management servers if you wish. But I’ve found the RCU/VSC quite lightly loaded, and they didn’t appear to affect the performance of the core vCenter services. After the installation it’s time to crank up the vSphere Client. There will be two security prompts (if you have installed both the RCU and the VSC) to confirm you trust the client’s connectivity to these new services. In the Plug-in Manager dialog box you should see two new extensions – the Virtual Storage Console and “Kamino”. Kamino is the internal project name for the RCU. I guess at some stage NetApp will change this to something like “Rapid Clone Utility”. Incidentally, for the uninitiated, Kamino is a planet that appears in “Attack of the Clones”. According to Wikipedia they are

a “race of tall, elegant, long-necked creatures…who keep to themselves and are known for their cloning technology.” http://en.wikipedia.org/wiki/Kamino

The VSC requires no post-configuration tasks before an admin can use it, but the RCU does. For it to work you are going to have to tell the RCU the name/IP address of your filer, and how to authenticate to it. To do this you need to crank up the RCU admin page. In the vSphere Client navigate to Home \ Solutions and Applications Tab \ Rapid Clone Utility icon and click the Storage Controllers tab. The add… link over on the far right will allow you to kick things off. Start by typing either the name or IP address of the management interface of the filer, together with the username and password. In this case I’m being a very naughty boy by using root/password. You might want to create an aggregate(s) just for VMware, and then use permissions in the NetApp filer to restrict the VMware folks to just their aggregate(s) – effectively making a little sandpit where they can do stuff without fear of affecting other subscribers to the storage array.

At this stage the RCU interrogates the filer and reports on all network interfaces, existing volumes and aggregates. By default the RCU also adds in aggr0. I personally like to remove this from the list – as aggr0 is normally where the system software that makes the filer run (Data ONTAP) is stored – and it’s not a good idea to create volumes there, even if there is free space. Use the arrows to add and remove interfaces, volumes and aggregates from the configuration to restrict what the RCU can see and do.

Using the NetApp DataStore Provisioning Wizard

That completes the post-configuration stages; we’re now ready to play. Remember how I said earlier that the RCU is much more useful than a mere tool for cloning virtual machines? The RCU can facilitate daily management. Here’s how. Consider if you need to create a new volume because your ESX hosts need a new chunk of storage; without the RCU, here’s a 1,000-foot view of the steps you would have to take.

1. Load up the NetApp management tool (login etc, etc)
2. Create a new volume
3. Set up access (adding ESX hosts) and set the permissions
4. Go back to the vSphere Client on the first ESX host to complete the Add Storage Wizard (if you’re using NFS – the per-host command-line equivalent is sketched after this list)
5. In the wizard correctly type three fields (filer name/IP, volume path, datastore name)
6. Repeat 1-5 for every single damn ESX host you have. By the way, you’ve got 32 in a VMware HA/DRS cluster
7. Call your partner and tell them that because of 1-6 you won’t be home in time for your evening meal
8. Spend the weekend in the dog house
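Incidentally, steps 4 and 5 boil down to one esxcfg-nas command per host from the service console – a sketch with hypothetical filer, export and datastore names:

   # Mount the NFS export as a datastore on this host (repeat on every host in the cluster)
   esxcfg-nas -a -o new-york-filer -s /vol/vol_vms new_datastore

   # Confirm the mount
   esxcfg-nas -l

It’s one line, but it’s still one line multiplied by 32 hosts – which is exactly the drudgery the RCU removes.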

Now, let’s compare that to using the RCU.

1. Select your VMware HA/DRS cluster, and click the NetApp Datastore Provisioning Wizard icon:

2. After clicking Next, at the Storage Controller stage select either NFS or VMFS as the datastore type. For the purposes of this article I selected NFS. The subsequent dialog is pretty straightforward. You specify a size (100GB) and a datastore name. I called mine “homeintimefordinner” because that’s my main goal. I was only able to select aggr1 because that’s all that is allowed in the Storage Controllers configuration. What’s really cute about the RCU is the access to additional parameters such as thin-provisioning and auto-grow, allowing me to be really storage efficient whilst covering-my-ass at the same time. So if my VMware admin puts more files in the volume than expected he won’t hit a brick wall. The volume will grow in increments of 50GB, until it hits the real wall of 200GB.

3. Click the Finish button – and watch RCU take care of all the rest. And here’s the important part – it does this FOR EVERY SINGLE ESX HOST IN THE CLUSTER!

4. Call up your favourite restaurant. As I’m British and it’s a Friday – I decide it’s got to be the timeless classic of “Fish and Chips”.

5. Ring up your partner, tell them you will be home early – and you’re “cooking”
6. Spend the weekend basking in newly earned brownie points

And that is two steps less than the manual method. How neat is that?

Using the NetApp Virtual Storage Console

Like EMC’s “Storage Viewer”, the main job of the VSC is to improve the quality of the information provided in the vSphere Client surrounding storage use. Once the VSC is installed (I won’t bore you with that because it is too easy!) it will add an additional tab to each ESX host in vCenter. After a brief discovery process the main overview page will open, and from here you can see your NetApp filers and the ESX hosts connected to them. The main overview page will detail:

• The NetApp filer name and IP address
• The version of ONTAP in use
• How much free capacity there is
• Protocols supported
• ESX hosts and the IP addresses connected to the NetApp filer
• ESX host status settings

In the screen grab below you can see that one of my ESX hosts has an alert on the NFS settings. The VSC can assist in the process of applying NetApp-recommended settings to an ESX host.

If I right-click the alert above, and select “Set Recommended Values”

A second dialog box will appear which will allow me to enable the default settings as recommended by NetApp.

Some changes made this way will require a reboot, but the VSC does refresh itself to tell you if this is the case. Under the other links, called Storage Details – SAN and Storage Details – NFS, you can see more detail about the storage protocols supported by the array. For example, under Storage Details – SAN you can see I have just one 10GB LUN, which I’m using for an RDM demonstration. Notice how ALUA and deduplication are in this case disabled. Equally interesting is that you can see which aggregate the volume resides in (aggr1), together with volume and aggregate usage. The “View Mapped Hosts” link will tell you which other ESX hosts have access to the volume.

In Storage Details – NAS you receive a very similar view, except of course it gives data you would only expect to see presented for NFS, such as the “export” name and the privileges of the host (R/O, R/W and Root Access).

The Data Collection node allows you to pull log files associated with your NetApp array down to your management PC – these are downloaded as tar.gz files ready to be uploaded to NetApp Support. Data Collection can gather support files for the array, the ESX host and the physical switch layer:

And finally, there’s the Tools menu. The NetApp VSC allows you to download a number of interesting utilities that address the thorny issue of disk alignment.

They have two utilities which can be downloaded and copied to your ESX hosts. The mbrscan utility checks to see if the disk is aligned, and the mbralign utility fixes the disk alignment problem. For the utilities to work your VM must be powered off so the utilities can gain access to the .VMDK file. If it isn’t, they simply return a “Device or Resource busy” message. You download the tools to your management PC, and then upload them to your ESX host. Once there you can extract them from the tar ball, and make them executable, with:

tar -xzvf mbrtools.tar.gz
chmod 555 mbrscan
chmod 555 mbralign

You can then use mbrscan to check the VMDK files of the VM. So for a VM called cs03 in the netapp-virtualmachines NFS mount I would run this command:

/root/mbrscan /vmfs/volumes/netapp-virtualmachines/cs03/cs03-flat.vmdk

Notice how I’m running the command against the -flat file. This is where the data of the virtual disk actually resides; the first file of the virtual disk (cs03.vmdk) is merely a text file which acts as a descriptor. You can view its contents with the Linux cat command. Below you can see the results of that mbrscan – as you can see p(artition)1 is an NTFS volume, which is not aligned. In this case, it is caused by the MBR record at the beginning of a boot disk.

To fix this misalignment you can run the other utility. This does take some time to complete because the utility has to move blocks within the virtual disk (.vmdk) file. Before you run the mbralign utility it’s worth confirming that no snapshots or linked clones are set on the VM, as these can cause problems – the utility pauses at a warning about this fact, and won’t proceed until you press [Y]. Additionally, you will need some free space – because the first thing mbralign does is take a backup copy of the target virtual disk:

The command is:

/root/mbralign /vmfs/volumes/netapp-virtualmachines/cs03/cs03-flat.vmdk

Once the disk is properly aligned you can then opt to run scripts inside the guest operating system of the VM to carry out further optimization tasks. The scripts are contained inside .ISOs which can be downloaded, uploaded to the storage array, and then attached to the target VM. You should be able to download these .ISO files from wherever you installed the VSC (in my case I installed it to my vCenter). However, I found I couldn’t retrieve them using the URLs listed in the VSC. In the end I copied them directly from: C:\Program Files\NetApp\Virtual Storage Console\webapps\public When you finally have the .ISO file attached to the VM, and you go to open the CD in Windows Explorer, an autorun.inf triggers an import of a Microsoft Registry File (.reg)

Conclusion

In this section I have quickly shown you how to set up NetApp SnapMirror, which is suitable for use with VMware SRM. We configured two NetApp FAS2020s and then configured them for replication – or SnapMirror. Lastly, we connected an ESX host to that storage. Additionally, I showed how the new plug-ins from NetApp allow you to create volumes and mount them efficiently to the ESX hosts. From this point onwards I would recommend you create virtual machines on the NFS mount point, so you have some test VMs to use with VMware SRM. SRM is designed to only pick up on LUNs/Volumes that are accessible to the ESX host and contain virtual machine files. In previous releases of SRM, if you had a volume which was blank, it wouldn’t be displayed in the SRM Array Manager Configuration wizard – the new release does warn you if this is the case. This was apparently a popular error people had with SRM 1.0, but one that I rarely saw – mainly because I always ensured my replicated volumes had virtual machines contained on them. I don’t see any point in replicating empty volumes!

Since ESX 3.5 and vCenter 2.5 you have been able to relocate the virtual machine swap file (.vswp) onto different datastores, rather than locating it in the default location. A good tip is to relocate the virtual machine’s swap file onto shared but not replicated storage. This will reduce the amount of replication bandwidth needed. It does not reduce the amount of disk space used at the Recovery Site, as the swap file will automatically be generated on the storage at the Recovery Site. In my demonstrations I mainly use virtual disks. VMware’s RDM feature is now fully supported by SRM (it wasn’t when version 1.0 was first released). I will be covering RDMs later in this book because RDM is an extremely popular VMware feature.


Chapter 6: Installing VMware SRM

Architecture of VMware SRM

Before you begin the process of setting up and configuring SRM for the first time it’s important to understand the structure of the product and its basic requirements.

One major challenge of the architecture above is that the SRM server is unlikely to reside on one network – the array management system will quite possibly be patched into a different network. In other words the SRM server has four main communication paths coming to and from it:

• To/From the SRM database backend (SQL or Oracle)
• To/From the storage array, via the Storage Replication Adapter (SRA) written by your storage vendor
• To/From the vCenter and License Server
• To/From the vCenter Server at the Recovery Site, which in turn communicates with the SRM Server at the Recovery Site. If you like, vCenter acts as a “proxy” to its respective SRM server

Note: Of course it is possible to host all these roles in one Windows instance (database, SRM, vCenter, License Server). They have been represented as separate roles for clarity in the diagram and, later, to show the port numbers used

Note: With some arrays you have the option of presenting the “management” LUN/Volume to the SRM server using VMware’s RDM feature. This allows you to leverage the storage vendor’s management tools natively on the SRM Server. Personally, I would prefer to use networking to facilitate this communication – although it is possible to do this, for example with EMC Symmetrix arrays

Below is a list of the port numbers for the relevant communication paths shown in the diagram:

1. The SRM Server communicates with the backend database on the proprietary ports used by Microsoft SQL, Oracle or IBM DB2

2. With vSphere 4 there is no “license” server providing license files as there was in VI3.5. In vSphere 4, licenses have reverted to text strings stored in the main vCenter database. It is still the case that you need a license for vCenter and the ESX hosts in the DR location. However, if you run vCenter in “linked mode”, licenses can be transferred between sites. In my case both the New York and New Jersey sites are licensed

2/4. SRM communicates with the vCenter servers at both the Recovery and Protected Sites on TCP port 443. It communicates with its own vCenter and the vCenter server at the Recovery Site; the vCenter acts as a proxy between the two SRM Servers. The SRM service listens on SOAP-based TCP port 8095. Users of the vSphere Client download the SRM plug-in from the SRM service on a custom HTTP port of 8096. If you choose to use the API then communication is on TCP ports 9007 and 9008 (SOAP and custom HTTP respectively)

3. The SRM Server, via its vendor-specific Storage Replication Adapter, communicates on a range of ports dictated by the storage vendor. Please consult vendor-specific documentation.

During the configuration of the SRM “Array Manager”, SRM uses special software written by your storage vendor called an SRA (Storage Replication Adapter) to discover the LUNs/Volumes being replicated. This will be network communication, either to the UTP uplinks on the fibre-channel array or directly to the management ports on an iSCSI or NFS target. In a production environment you would need to configure routing or inter-VLAN communication to allow the SRA to communicate with your array. Additionally, the other network challenge is making sure firewalls allow for the vCenter-to-vCenter communication and the SRM-to-SRM communication that inherently exists. Finally, the last challenge is actually getting the two arrays to communicate with each other for the purposes of replication/snapshots.
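If the Windows firewall is enabled on the SRM server itself, the listener ports from the list above need opening. A minimal sketch using the Windows 2003 netsh syntax (adjust the rule names and scope to your own environment):

   netsh firewall add portopening TCP 8095 "VMware SRM (SOAP)"
   netsh firewall add portopening TCP 8096 "VMware SRM plug-in download (HTTP)"
   netsh firewall add portopening TCP 9007 "VMware SRM API (SOAP)"
   netsh firewall add portopening TCP 9008 "VMware SRM API (HTTP)"

The 443 and database ports will usually already be open on the vCenter and database servers respectively.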

Storage Replication Components

SRM assumes you have two or more geographically dispersed locations. The first is your “Protected Site”. You might know this better as the primary site – the location where all your business critical functions exist. If you lose this, the business cannot operate, so an investment is made in a “Recovery Site” which can be used in the event of such failures. You might know this better as your secondary location, or as your DR/BC location. Quite frequently companies hire rack space at commercial rates to create a Recovery Site location if they lack those resources internally in the business.

In my case I am going to begin by using very clear names for the primary and secondary site. I’m going to assume that we have a dedicated location for recovery – perhaps we hire rack space for this – and the failover is unidirectional. That is to say, the primary site always fails over to the secondary site. There is another distinct configuration – bidirectional. In this case the secondary site’s DR location is the primary site, and the primary site’s DR location is the secondary site. A bidirectional approach would be used in a large business where New York’s DR location might be the New Jersey office, and the New Jersey office’s DR location would be New York. I will be looking at the configuration of SRM for bidirectional DR in Chapter 13. Another way of describing the difference between a unidirectional and a bidirectional configuration is with the more conventional active/standby or active/active terms, although in fact these terms are probably more popular in our industry for describing the relationships we create for redundancy. I will stick with the terms unidirectional and bidirectional because these are also the terms you will find in the official documentation from VMware.

At one of these locations there are ESX hosts with virtual machines that need protection. The “Protected Site” VMs are being replicated to the recovery location on a frequency which is a balance between your bandwidth and your tolerance for loss of data. Put very crudely, the more bandwidth between the Protected and Recovery Site, the more frequently you can replicate between the two locations. Larger companies can and often do have a blend of replication technologies and cycles to facilitate shifting the data away from the Protected Site. Perhaps they have high-speed fibre-channel links between SiteA and SiteB, but then use slower network pipes from SiteB to SiteC. In this configuration the replication between SiteA and SiteB could be synchronous, because the latency is low. So as a write is being committed to a disk in SiteA it is already being written to a disk in SiteB. Such frequency of replication allows for a very low chance of data loss. The replication from SiteB to SiteC will have a larger latency, but is frequently selected as the best method of getting the data a significant distance away from the Protected Site in an economical way. Currently, SRM is limited to a one-to-one pairing of sites. It is not currently possible to create a hub-and-spoke style structure. It’s hoped in future releases that this type of configuration will be possible.

VMware Components

Laying these storage considerations to one side, there are a number of VMware components that need to be configured. You may already have some of these components in place if you have been using VMware technologies for some time. At both the Protected and Recovery Site you need:

• ESX 4.0 “Classic” or ESX 4i (SRM 4.0 is backwards compatible with ESX 3.0.3 and higher. There are no plans to make SRM 1.0 forwards compatible with ESX 4.0)
• vCenter 4.0
• A database for the SRM server at both the Protected and Recovery Sites:
  o VMware supports SQL 2005 Standard or higher (SQL Express is supported as well)
  o Oracle 10g Standard Release 1 or higher
  o IBM DB2 Express C or higher
• A protected SRM and a recovery SRM instance, each running Windows 2003 SP1 or higher. SRM is certified to run on the 64-bit versions of Windows 2003 and Windows 2008
• The SRM Adapter (SRA) from your storage vendor installed on both SRM servers
• The SRM vSphere Client management plug-in

• LUN Masking – ESX hosts at Protected Site see “live” LUNs, but the ESX hosts at the Recovery Site only see “replicated” LUNs or snapshots. This allows for tests without interrupting normal operations, and does not interrupt the normal cycle of replication between the two sites

• DNS Name Resolution – As with vSphere 4 generally, it is recommended you thoroughly test all methods of name resolution – short hostname, long FQDN, and reverse lookups (a quick sanity check is sketched after this list)
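That sanity check takes thirty seconds from each SRM and vCenter server – a sketch using my lab names (substitute your own; the IP is deliberately left as a placeholder):

   nslookup vc4nj                   (short hostname)
   nslookup vc4nj.corp.com          (FQDN)
   nslookup <IP address of vc4nj>   (reverse lookup)

If any one of the three fails, fix DNS before you install SRM – it will save you a world of pain later.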

This is just a general list of requirements – for a more specific and comprehensive list, including specific patches, read the documents here: http://www.vmware.com/support/pubs/srm_pubs.html In particular, the compatibility matrix PDF gives a comprehensive overview of what is supported and required for SRM to function. As this is likely to change, you are much better off consulting the online guides rather than this book: http://www.vmware.com/pdf/srm_compat_matrix_4_0.pdf

A common question is whether it is possible to replicate the vCenter database at the Protected Site to the Recovery Site. The answer is, if you intend to use SRM – NO. SRM assumes the two vCenter databases are running independently of each other. In fact one of the management tasks required during the configuration of SRM is “pairing” the protected SRM to the recovery SRM – and then “mapping” the vCenter objects (folders, resource pools, networks) in the Protected Site to the Recovery Site. Currently, the structure of the vCenter database does not allow for the use of SQL or Oracle replication to duplicate it at the Recovery Site. In years to come I imagine VMware might develop a model for vCenter which resembles Active Directory domain controllers – i.e. a multiple-master replication model. This could potentially negate the need for database replication and perhaps reduce the complexity of the “Inventory Mapping” process inside SRM. At the moment, if you are worried about the availability of the vCenter service, you could run it in a VM on a DRS/HA and FT enabled environment. Alternatively, if you are running vCenter on a physical machine (the idea of which personally boggles me) you could use MSCS or the new vCenter Heartbeat service.

For purposes of brevity I am going to assume you know how to set up ESX and vCenter so I can more closely focus on the SRM part of the installation and configuration process. If this is something you are not familiar with, you may wish to read my other guides which cover this in some detail. Alternatively, you might wish to purchase my vSphere book, published by McGraw-Hill in February 2010: http://www.amazon.com/VMware-vSphere-Implementation-Mike-Laverick/dp/0071664521/ref=sr_1_4?ie=UTF8&s=books&qid=1249125438&sr=8-4

In my case I used the following names for my components:

The “Protected” Site:
dc01nyc.corp.com
vc4nyc.corp.com
srm4nyc.corp.com
sql4nyc.corp.com

The “Recovery” Site:
dc01nj.corp.com
vc4nj.corp.com
srm4nj.corp.com
sql4nj.corp.com

The screen capture below shows the full configuration of my ESX hosts, VMware DRS/HA clusters and other vCenter objects, including folders and resource pools, prior to starting the SRM installation. If you’re following this book you don’t necessarily have to adopt my structures and naming conventions – but I will be using them throughout this book. Of course none of these virtual machines are actually running on a live system in a production environment – this is merely a sandbox environment for playing with SRM and demonstrating the functionality of the product. For obvious reasons I would recommend this approach before rolling out SRM in a production environment. As you can see I am running the HP Lefthand VSA and the components required to make SRM work. These VMs will not be replicated/snapshotted to the Recovery Site – indeed you could extend this concept to include “local” services that are not required or applicable at the Recovery Site, such as DNS, DHCP, WINS and print services. They will remain in folders and resource pools separate from my “production” VMs. Generally, in my screen grabs you will find these non-replicated VMs in an “infrastructure” resource pool and VM folder. I could have hidden these VMs using permissions – but in an effort to be open-handed I wanted you the reader to see exactly what my configuration is.

As you can see I’ve almost exactly mirrored the resource pool structure of the NYC DataCenter, by creating similar sub-resource pools in the NYC_DR resource pool. This is to allow for a one-to-one “inventory mapping” in SRM from the Protected Site to the Recovery Site. I’ve carried out a similar routine with the VM folders too:

As I’ve said before, you can regard all the VMs in the “infrastructure” folders/resource pools as “local” virtual machines which will not be replicated to the recovery location. Additionally, my Test & Dev resource pools/folders represent VMs that are not business critical – and do not form a part of my Recovery Plan. As such I have not created a Test & Dev resource pool or VM folder in the DR location of New Jersey. You might in the first instance dismiss this as just being overly fastidious. But please don’t underestimate the importance of these structures and hierarchies. Get them wrong and you will find that the failback process has unexpected consequences. Without a good folder and resource pool structure you will find yourself dumping all your VMs in a cluster or a folder, and potentially having to manually relocate 1,000s of VMs to the correct location.

More Detailed Information about Hardware and Software Requirements

As you know, software requirements and patch levels are endlessly changing – at the very least you will want to know if your storage has been tested with SRM and is supported. It seems very silly for me to list these requirements at length in a book – as this information will go very stale, very quickly. So instead visit this URL:

http://www.vmware.com/products/srm/

On this page you will find all manner of useful information – PDFs, whitepapers, guides, webcasts and so on. A little bit more direct than that is:

http://www.vmware.com/support/pubs/srm_pubs.html

This is where you will find the official administration guide (which, despite you having got hold of this book, is always worth a read!) and some other guides including:

• VMware vCenter Site Recovery Manager Release Notes
• Getting Started with VMware vCenter Site Recovery Manager
• VMware vCenter Site Recovery Manager Administration Guide
• Installing, Configuring, and Using Shared Recovery Site Support
• Adding a DNS Update Step to a Recovery Plan
• VMware Site Recovery Manager API
• Site Recovery Manager Storage Partner Compatibility Matrix
• VMware vCenter Site Recovery Manager Compatibility Matrixes

The compatibility matrixes tell you everything you need to know about what is and is not supported, such as:

• What version of ESX and vCenter is supported and what patches are needed
• What Windows operating systems and service packs are required
• SRM database compatibility
• Guest operating systems protected by SRM

• Guest operating system customization (for example, Solaris is not currently on the list of supported GOS), which allows for the change of the IP address of the virtual machine

• Storage array compatibility

Treat the compatibility matrix like the now legendary VMware HCL for ESX servers: if it’s not on the list, it’s unsupported. Your configuration may well work, but if it breaks or is unreliable don’t expect VMware Support to help you much.

As for hardware requirements (either physical or virtual), VMware currently recommends these minimums as a starting point:

Processor – 2.0GHz or higher Intel or AMD x86 processor
Memory – 2GB
Disk Storage – 2GB
Networking – Gigabit recommended

Scalability of VMware SRM

Another concern you will have about VMware SRM is whether it has any limits in terms of the number of ESX hosts and virtual machines it can protect, and how many Recovery Plans it can hold and run. A moment ago we talked about minimums, but it is worth mentioning SRM’s current maximums. SRM is tested up to a maximum of 1,000 protected virtual machines per site – that’s about double the number of VMs of the previous release. You can create a maximum of 150 Protection Groups – which in turn are linked to 150 replicated LUNs/Volumes. Each Protection Group can contain a maximum of 500 VMs. You can run up to three Recovery Plans concurrently. The operative term here is “per site” – you could have thousands of virtual machines spread over many sites, in which case you would be scaling out the SRM product to reflect that volume of virtual machines. As with all products, you should expect these numbers to increase in subsequent releases.

Designed for both Failover and Failback?

As you can see, SRM was designed from day one to automate failover from the Protected Site to the Recovery Site. It might surprise you to know that it was never part of the design strategy to automate failback to the Protected Site. You might regard that as a tremendously important oversite – if you forgive the pun! In theory, invoking failback should be as simple as inverting the replication process to make sure we get the most recent set of virtual machine files, then taking your plan for DR/BC and reversing its direction. I’m sorry to say it is in no way as simple as that. Invoking failover and failback are huge, huge decisions – not lightly taken or achieved, with or without virtualization software. Despite this we are going to try it, because I think it will be fun to do so, and we will learn a large amount about how the new release of SRM works, and what is achievable with the software as it stands.

From my discussions with VMware and customers, the fact there isn’t a big button in SRM that says “Failback” is sometimes regarded positively – not negatively. Although this doesn’t stop some customers rather glibly and dismissively stating “ah, where is your ‘failback’ button”! Here are some reasons why. In many respects failback is more dangerous than failover. With failover there is really no choice but to hit the big red button and start the process. After all, if a fire, flood or terrorist attack has destroyed or partially destroyed your primary location – you will have no choice but to engage DR. Now, let’s say that DR is successful and you’re now at the Recovery Site. As long as you are operating successfully at the DR location, and you can continue to do so for some time – what pressure or urgency is there to return? Firstly, your sales people are creating new orders and accounts are processing the invoices. They are generating income and revenue for the organisation.

Secondly, the application owners are happy – because they managed to survive the disaster and get their services online. Due to these circumstances you are more likely to want to gradually and carefully return to the primary location (if you can!). You certainly don’t want failback to be a casual click-of-a-button affair. The very act of carrying out a failback in such a cavalier manner could actually undo all the good of a well-executed Recovery Plan. After all, you might take the view that if your primary location was severely destroyed during a disaster – you may never want to return to that locale. In some respects failback is more complicated, because there are many possible causes of the failover – each of which will result in different preparation. For example you may need to procure new servers, networking and storage – and critically, that new storage may have to replicate terabytes of data before the failback process itself can begin. What I hope to see is the failback process made simpler and easier to do – with fewer steps and stages than currently required, especially in the array of clean-up and clean-out which is currently needed – not least so we can concentrate on what really matters.

Towards the end of the life of SRM 1.0 Update 1 we began to see the storage vendors, especially EMC and NetApp, develop their own failback plug-ins – to some degree they are very like the SRAs that you have with failover. They are by no means perfect, nor do they automate every task – but a beacon has been lit that shows the direction of the product, with storage vendors contributing their own plug-ins to automate as much of the failback process as possible. I wouldn’t be surprised if, when SRM 5.0 is released, we have a failback feature in the core product.

For many people in the world of corporate compliance, audit trails and insurance – being able to press the “test” button is what they are really buying VMware SRM for. It means they can say to the business, their managers and their auditors – look, we have a plan and it’s been tested. However, to really, really test a Recovery Plan – the only real test is one that invokes DR for real. Some corporate environments hard-test their Recovery Plan bi-annually. For these organisations the lack of an easy failback option is quite off-putting. Now, I’m not saying failback isn’t possible with SRM; it’s just that currently it’s much more of a manual process than the simple test button that you see in the main SRM product.

VERY IMPORTANT: A Word About Resignaturing VMFS Volumes

This section is for people who have not attended the VMware vSphere authorized courses – or conversely, for those who have and promptly forgotten most of what was said to them! Before I begin it’s important to emphasise that this does NOT apply to NFS volumes; it only applies to VMware’s File System (VMFS) accessed by block-level storage such as Fibre-Channel or iSCSI. It’s important you understand what the concept of resignaturing is and why SRM does this automatically. This will help you understand some of the “strangeness” that SRM sometimes displays. In fact this strangeness is not strange at all. It’s by design. It’s the way it is. First let’s begin with a bit of revision about the properties of VMFS volumes. Before and after a format of a VMFS volume on a Fibre-Channel or iSCSI LUN, the storage can be addressed in many different ways:

• By its Linux device name: /dev/sdk
• By its VMkernel “runtime” device name: vmhba1:0:15
• By its unique Network Address Authority (NAA) value: naa.6000...
• By its volume name, which has to be unique to the ESX host: myvmfs
• By its datastore name, which has to be unique to vCenter: myvmfs
• By its UUID: 47877284-77d8f66b-fc04-001560ace43f

It’s important to know that, as with the NAA, the UUID value must be completely unique: an ESX host cannot have two identical UUIDs presented to it at the same time. UUIDs are generated using three core variables, including date, time and LUN number, in order to guarantee that the UUID value is absolutely unique. This can cause some interesting, or shall I say unpleasant, consequences if you are not consistent in your LUN numbering. That is to say, problems can occur if ESX1 believes the LUN/Volume number is 15, and another ESX host believes the same block of storage is LUN/Volume 20. It’s also worth saying that virtual machines do not currently find their VMDK and VSWP files using the friendly volume/datastore name. If you examine the contents of a .VMX file you will see references to the UUID value.
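A reconstructed sketch of the kind of entries you would see (UUIDs abbreviated, paths from my lab – illustrative only):

   scsi0:0.fileName = "ctx01.vmdk"
   scsi0:1.fileName = "/vmfs/volumes/...ea5bb/ctx01/ctx01_1.vmdk"
   sched.swap.derivedName = "/vmfs/volumes/...ea5bc/ctx01/ctx01-xxxxxxxx.vswp"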

Note: Here you can see that the second virtual disk (ctx01_1.vmdk) is being stored on a different volume (..ea5bb) to the VM’s VMkernel swap file (...ea5bc). If the virtual disk is stored in the same location as the VMX file, then the /vmfs/volumes path is not displayed in the VMX file, but it is used. As you can see, UUID numbers are very important. The requirement for unique UUIDs does present some interesting challenges to DR. By definition, any snapshot or replication process configured in the array is intended to create an exact duplicate of the VMFS volume, which would by definition include the UUID value. In normal day-to-day operations an ESX host in the Protected Site should not get to see both the original LUN and the replicated LUN/snapshot at the same time. If it did, ESX would suppress the second LUN/Volume. If an ESX host were allowed to see both LUNs/Volumes at the same time, ESX would be very confused – and not at all happy. It wouldn’t know which LUN/Volume to send its reads/writes to. In ESX 3.5 the host would print a hard console error message suggesting you may need to do a resignature of the VMFS volume.

In ESX 4.0 this hard console message has been deprecated and is no longer printed, which I think is a bit of a shame. Note: I’m aware that this image is difficult to read in black and white – and that the contrast of blue on a black background may not reproduce very well when this book is printed. For your information the text states “1c6953435349344:1 May be snapshot: disabling access. See resignaturing section of the SAN Administration Guide”. In previous versions, if this was a replicated/snapshot LUN or volume, the way to resolve this would be to modify the advanced settings in ESX to enable a resignature and issue a rescan of the HBA. In the new release of vSphere 4 there are two ways of issuing a resignature to a snapshot volume to make it visible to the ESX host. You can now issue a resignature from the GUI. If a volume is a snapshot or a replica, and you present it to an ESX host using your storage vendor’s management tools, it will appear in the Add Storage wizard with an existing VMFS volume label. To demonstrate this manual approach of presenting the storage – which is what you would use if you were testing your DR plan without SRM – I temporarily gave one of my ESX hosts a valid IP address to communicate with the vsa1.corp.com host. Then in the HP Lefthand VSA, I added the ESX host as a server to the NJ_Group and assigned one of the snapshots to it. Then I ran the Add Storage wizard on the ESX host like so:

As you can see the volume is not blank, as it has a valid VMFS volume label. When it is selected, the ESX host’s Add Storage wizard realises this is a replicated volume and offers me the chance to carry out a resignature manually.

Alternatively, if you’re a dab hand at the command line you should know that the new esxcfg-volumes command supports a -l switch to list all volumes that have been detected as snapshots, and a -r switch to issue the instruction to resignature the volume. The example below is the command:

esxcfg-volumes -l

This lists the snapshots/replicated volumes the ESX host has discovered:

As you can see, the command shows that the VMFS cannot be mounted because the original volume is still online, but the volume is available to the resignature process. So if I then followed through with:

esxcfg-volumes -r lefthand-networks-virtualmachines

Note: where lefthand-networks-virtualmachines is the VMFS volume name. The ESX host would then resignature the volume and mount it. When this happens the volume is given a new UUID and volume name (snap-<hexID>-originalvolumename) to allow you to identify it in the vSphere environment

Of course these command-line actions are also reflected in the vSphere Client too:

This behaviour is the SAME for all storage vendors; I’m just using HP Lefthand as an example. In fact I first saw this new way of managing resignaturing whilst working with EMC and their Replication Manager software. This kind of behaviour might have some very undesirable consequences if you are carrying out manual DR without the SRM product. The volume/datastore name would be changed, and a new UUID value generated. If virtual machines were registered on that VMFS volume there would be a problem – all the VMX files for those virtual machines would be “pointing” at the old UUID rather than the new one. The VM would need removing from the vCenter inventory, and re-registering to pick up the new UUID. So when you carry out manual DR, the volume must be resignatured first, and then VMs are registered according to the new volume name and UUID. Much the same sequence of events happens when you test Recovery Plans. Are you with me so far?

Now, the good news is that SRM automatically resignatures volumes for you – but only in the Recovery Site – and it auto-magically fixes any issues with the VMX files. As the ESX hosts in the Recovery Site could have been presented different snapshots taken at different times, SRM defaults to automatically resignaturing. It then corrects the VMX files of the recovery virtual machines to ensure they power on without error. In the early beta release of SRM 1.0 VMware did an automatic rename of the VMFS volume back to the original name. However, in the GA releases of SRM 1.0 and 4.0 this renaming process was dropped. If you do want SRM to rename the snapshot of the VMFS to the original volume name, this can be enabled by editing the vmware-dr.xml file or by modifying the .xml via the Advanced Settings dialog box, accessed by right-clicking the Site Recovery node in the vSphere Client:
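For reference, the vmware-dr.xml change looks something like the fragment below. The element name is taken from my SRM 4.0 build, so treat it as an assumption and verify it against your own install (and expect to restart the SRM service for a file edit to take effect):

   <!-- fragment of vmware-dr.xml on the SRM server; verify the key against your build -->
   <SanProvider>
      <fixRecoveredDatastoreNames>true</fixRecoveredDatastoreNames>
   </SanProvider>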

This mandatory resignature could be regarded by some as being somewhat “over cautious” of VMware, but it does guarantee fewer errors caused by an ESX host potentially being presented the same UUID more than once. If this automatic resignaturing did not occur, and an ESX host was presented with two LUNs/volumes with the same VMFS volume name, datastore name and UUID values, the administrator would receive a hard error – and it would be up to the SRM administrator to resolve the problem. Some people might take the position that these sorts of replication problems are best avoided altogether, rather than taking unnecessary risks with data or adding a layer of unnecessary manual configuration. It’s perhaps worth mentioning that there are indeed products in the storage arena where an ESX host might see both the original LUN and its snapshot at the same time. I’m thinking of products such as HP’s CrossLink/Continuous Access and EMC TimeFinder. These technologies are designed to protect your system against the loss of a SAN. With these technologies the ESX host would have connectivity to two arrays which would be constantly replicating to each other. The idea is that if an entire storage array failed, the host would still be able to access the LUN on the other array. It’s probably for this reason that VMware SRM defaults to resignaturing LUNs, to stop potential corruption. Individually, you might not accept this position – but for the moment this is how SRM works.

The Big Plan

My big master plan is to be able to crash all the ESX hosts in the New York Protected Site – simulating the loss of all infrastructure components – and to invoke DR/BC at the New Jersey Recovery Site. Initially, we will begin with tests of this; once we are happy the SRM infrastructure is working, we will work through a failback to the Protected Site, using SRM to automate as much of the process as possible. I will be starting out with a unidirectional model for SRM, and then progress on to a bidirectional model. After that I will be adding a new chapter to the book covering the new “shared site” feature.

VMware SRM Product Limitations and Gotchas

What follows below is a straight cut and paste from the release notes for SRM. Do you read release notes? No, I thought not. Despite my RTFM credentials, I’ve been caught not reading them for VMware products too. After being burnt twice, I now quickly give them a run-through to spot potential problems before I spend a week trying to resolve something that was mentioned in the release notes. In other words, if only I’d RTFM’d! The ones I’ve highlighted in red are problems I’ve experienced for real, and then discovered that they were in the release notes after all. See, Mr RTFM sometimes forgets to RTFM… I think we all know the moral of the story here, don’t we?

Uninstall Removes Stored Array Manager Credentials
Uninstalling SRM Server software removes the array manager credentials that are stored on the host. If you reinstall SRM on that host, you must configure the array managers again.

Repair-Mode Installation Fails if SRM Server Log File is Open
If you run the installer in repair mode while any SRM server log file is open, the installation fails. Workaround: Close the log file and retry the installation.

Installer Drop-Down Does Not Refresh to Show a Newly-Created DSN
When you create a new DSN during installation, the newly-created DSN is not visible in the list of available DSNs. Workaround: Cancel and restart the installation.

Length and Character Set Requirements for Passwords
SRM PKCS#12 certificate passwords cannot contain more than 31 characters, and must consist entirely of ASCII characters.

Recovery Virtual Machine Administrator Role has Inadequate Privileges to Protect a Template
You must manually add the following privileges to this role:

VirtualMachine.Provisioning.Mark As Template

VirtualMachine.Provisioning.Mark As Virtual Machine

Recovery Plan Administrator Must Have Read Permission for All Recovery Plans
A user who has administrator permission for any Recovery Plan must be granted read permission for all Recovery Plans. Assigning read permission for all Recovery Plans enables the user to access hidden metadata that must be read when an administrator role accesses a specific Recovery Plan.

Remote vCenter Login Credentials Must Include Domain Name When a Domain Account is Specified
If you specify a Windows Domain account when the SRM client prompts you for vCenter credentials, you must include the domain name in the user name (for example, DOMAIN\user).

Protected Site Shows "Unable to Connect" After Successful Connection
After a successful connection between the protected and Recovery Sites, the Protected Site reports "Unable to Connect" and eventually reports the error: "Low Resources on Pair..." Workaround:

1. Restart the SRM Service.
2. Close the vSphere Client for the Recovery Site.
3. Break the connection and configure the connection from the protected Summary page.
4. Start the vSphere Client and log in to the Recovery Site.
5. Select Site Recovery and configure the connection from the remote site.

When Pairing Sites, Use Trusted Certificates
When pairing sites, if the certificates of the recovery-site vCenter Server and SRM Server are not trusted by the protection-site SRM server, yellow warning triangles, rather than green check boxes, appear to the left of the Certificate Validation steps. The yellow warning triangles warn the user that the given certificates did not pass the validation requirements that the certificates be signed by a trusted Certificate Authority (CA) and have a DNS value matching the address of the server. During the pairing, the user indicated that the certificates should be accepted based on their SHA-1 thumbprints. It is a serious security violation to accept certificates based on their thumbprints without verifying that the thumbprints are correct. Workaround: Ensure that both vCenter Servers and both SRM Servers use trusted certificates.

Problems Customizing Certain Linux Guest Configurations During Recovery
Linux guests that are not running an ext2, ext3, or ReiserFS file system may experience customization failures when recovered.

SRM Reports the Error "Cannot execute scripts" When Customizing Windows Virtual Machines During Recovery
During test recovery or recovery, when Windows guests are customized, occasionally the virtual machines attempt to shut down gracefully and SRM reports the error "Cannot execute scripts." This results in a hard shut down after customization is complete, and the virtual machine remains powered off regardless of its Recovery Plan priority. Workaround: Manually power on the Windows virtual machines that report this error.

A Stop Button Appears After Starting a Recovery Plan Test
Occasionally, after you start a recovery test for the first time, a Stop button appears with the message: "Stop Recovery. Are you sure you want to stop this Recovery Plan? This process may take several minutes." Workaround: Click "No." The test proceeds and completes successfully. Note: The interesting thing about this bug is that it’s been around since SRM 1.0

Some Arrays Might Require a Second Rescan
Some storage arrays might require a second rescan to discover LUNs. HP arrays have been identified as having this requirement. To enable the additional rescan, use the SAN Provider settings page of the Advanced Settings dialog box. Note: I’ve seen this issue surface on the SRM forums, and it was a popular forum topic in the first release of SRM 1.0. It can be “fixed” by changing a parameter in the Advanced Settings dialog box:

RDM Descriptors Must be Placed on a Replicated Datastore
In order to protect virtual machines that use raw device mapping (RDM) devices, ensure that the RDM descriptor files reside on replicated datastores. Placing the RDM descriptor files in the same datastore as the .vmx file that refers to them is highly recommended.

Licensing VMware SRM

SRM is licensed by inputting a valid license string. It’s perhaps worth explaining at this point that VMware SRM has two different licensing models, because it can be configured in two separate ways – unidirectional (active/standby) and bidirectional (active/active). With a unidirectional configuration, you only need an SRM license for the virtual machines protected by the SRM server at the Protected Site. You don’t need an SRM license at the Recovery Site. Now, this does not mean that you can run vSphere 4 at the Recovery Site for free, gratis. Far from it! If you are running ESX 4.0 and vCenter 4.0 at the Recovery Site you will need licenses for that. If you trigger your Recovery Plan for real, and in doing so fail over to the Recovery Site, you will need CPU socket licenses to run at that site for any length of time. When you come to use SRM to facilitate failback, you are allowed to temporarily take your SRM license from the Protected Site to the Recovery Site to begin the process. This is legal and well within the terms and conditions of the VMware SRM EULA. It’s perhaps realistic to state that, like many ISVs, VMware would have no way of policing or enforcing this trusted model for licensing. If you are configuring a bidirectional setup you need an SRM license in BOTH locations. In my case I will be doing that, because at a later stage I will be walking you through a bidirectional configuration.

Given the current complexities surrounding licensing, some customers have proposed scenarios where this “per socket” licensing model fails. Here is a good example. Let’s say I have a 32-node DRS/HA cluster where each physical ESX host has 4 sockets totalling 16 cores. That would be 128 socket licenses I would have to purchase at the Protected Site. But what if I only had 5 virtual machines that needed protection? This has led some pundits to suggest that a per-virtual-machine model would have been better for SRM – I would only pay for what I protected.

Firstly, whilst I see this point of view (it’s an attempt to save money on licenses, after all), it is rather unrealistic to suggest that an organisation with this number of ESX hosts would have such a small number of virtual machines needing protection – it’s a quite unrealistic example. Secondly, it would be hard for VMware to implement this quickly, since the main licensing module persists in counting the number of physical sockets, rather than the number of vCPUs in use. It would require a total overhaul of VMware’s licensing system. My final word on this area of debate is that I guess this problem does illustrate how dated counting CPU sockets or cycles has become as a way of licensing products – not least because the very act of virtualization has made licensing by CPUs seem increasingly arcane. I think it’s very revealing that one of VMware’s competitors, Citrix XenServer, chose a per-physical-server licensing model rather than a per-socket approach. The trouble is that high-level management products like vCenter and VMware SRM are still tied to the older, somewhat Jurassic, per-socket model. Anyway, these last couple of paragraphs on licensing might have left you feeling more confused than when we started; in a way that is my point. Many vendors’ licensing systems are confusing and opaque – they are often like comparing one cell phone tariff to another. So here’s a simple adage for the moment that works for VMware SRM: wherever you create a Protection Group (to protect virtual machines) you need a license.

Setting up the VMware SRM Database with Microsoft SQL 2005

Community Request: I don’t have a license for, or any experience of, either Oracle or IBM DB2. If anyone reading this book would like to supply the step-by-step procedures for doing this I would be very grateful. I would credit your work in the main text here, and also acknowledge your contribution at the beginning.

Note: It is possible to configure SRM to use Microsoft SQL Express. If you want to learn how, VMware has a PDF guide that takes you through the process. As for me, I’m using the commercially available version of Microsoft SQL 2005 Server. This document formed the backbone of the instructions below, and I’m indebted to its authors – I only wish this documentation was included in the main SRM Administration PDF guides from VMware: http://communities.vmware.com/docs/DOC-11547

SRM requires two databases: one database at the Protected Site (in my case New York City), and another, separate database at the Recovery Site (in my case New Jersey). Currently, SRM only supports SQL Authentication if you intend to run the SQL Server DB separately from the SRM server, as I do. However, both SQL and SRM must be part of the same domain. In my lab environment I was successful in engineering a Windows Authentication configuration where the SQL server was on a separate server to the SRM Server. In fact I’ve been experimenting with this with a number of VMware products, all the way back to vCenter 1.0 which didn’t support Windows Authentication on a separate SQL host. As with vCenter 1.0, if you try to use Windows Authentication with SRM 4.0 the core SRM service will fail to start. It’s a relatively trivial task to grant the SRM DB user account the right to log in as a service on the SRM service from the Microsoft Services MMC. It’s dead easy, works like a charm. But it’s not supported, and doing this configuration will cause issues during upgrades (which I have seen and worked around) – fine if you are prepared to live with them. But personally, until we get official support from VMware, I would give such a configuration a wide berth.

As for permissions and rights: for SQL Server, the SRM database user does not need the DB_OWNER permission as you do in the vCenter database. As long as the schema has the same name as the database user name, is the default schema for that user, and is owned by that user, then you should be fine. After that it’s just a question of setting the permissions so the database user account can read and write to the database correctly. The requirements for the database schema and database user are as follows:

• It must be owned by the SRM database user (the database user name you supply when configuring the SRM database connection)

• It must have the same name as the SRM database user • It must be the default schema for the SRM database user • The SRM database user must have database administrator privileges • The SRM database user must be granted the following permissions:

o Bulk o Insert o Connect o Create table o Create view

• If you are using Windows authentication, the SRM server and database server must

run on the same host • If the SRM server and database server run on different hosts, you must use mixed

mode authentication. • If SQL Server is installed locally, you might need to disable the Shared Memory

network setting on the database server.
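If you prefer to script the permissions rather than use the Management Studio dialog boxes in the steps below, the following is a minimal Transact-SQL sketch of the grants listed above, to be run once the database, login and user have been created (steps 5–7 below). It assumes the database, user and schema are all named srm4nycdb, as in my lab; adjust the names for your own convention.

-- Database-level permissions for the SRM database user
USE srm4nycdb ;
GRANT CONNECT, CREATE TABLE, CREATE VIEW TO srm4nycdb ;
-- "Bulk insert" is a server-level permission in SQL Server 2005; granting it
-- to the login has much the same effect as the Bulk Admin role set in step 11
USE master ;
GRANT ADMINISTER BULK OPERATIONS TO srm4nycdb ;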

Creating the Database and Setting Permissions

1. Open Microsoft SQL Server Management Studio
2. Login with the SA account details created during the install of Microsoft SQL 2005
3. Click the New Query button to open a new Transact-SQL query window. This will allow you to type the commands that carry out the creation of the schema, database, defaults, ownership and permissions


4. Type the command:

CREATE SCHEMA srm4nycdb ;

Note: Remember that in the case of Microsoft SQL 2005 and SRM the schema, database and user account will all have the SAME name – in my case srm4nycdb. After executing the command, the Messages tab in Microsoft SQL 2005 should read "Command Completed Successfully" and a green tick should appear next to "Query Executed Successfully"


5. Next we will create the database, simply by editing and executing the query:

CREATE DATABASE srm4nycdb ;

Note: If you use the "Refresh" option on the "Databases" node, you should see that a database has indeed been created:

6. Next we need to select the database in order to make changes to it – such as creating the user account and setting the default schema. You do this with the "use" command:

USE srm4nycdb ;

7. The next set of commands creates the user account and associates the login with the schema:

CREATE LOGIN srm4nycdb WITH PASSWORD = '??????????' ;
CREATE USER srm4nycdb FOR LOGIN srm4nycdb WITH DEFAULT_SCHEMA = srm4nycdb ;

Note: Although it seems a trite thing to say – make a note of this password, as you will need it. I make my passwords comply with Microsoft Active Directory standards for password complexity. The commands above should create a user under the +Security and +Logins node:

8. The next stage is to configure the ownership, default permissions and rights for the database user to the database and the schema itself. First let's set the Database Role Membership. To do this, navigate to +Databases, +srm4nycdb, +Security, +Users – and then select the properties of the srm4nycdb user account:

Note: Confirm in the dialog box that the username and default schema are correct – they should read srm4nycdb, or whatever naming convention you decided on for your organization

9. Under the General tab, under Database Role Membership, grant the db_owner privilege:


Click OK

10. Next we will set the default database for the database user AND set some user rights. Navigate to +Security, +Logins and select the database user created with the CREATE LOGIN command earlier. In the dialog box, change the default database from "master" to the database created with the CREATE DATABASE command:


11. In the same dialog box, select the Server Roles option and enable the "Bulk Admin" privilege:

12. The next step is a “sanity check”. Using the “User Mapping” node confirm that the database, user and default schema names ALL match:


13. The final step in this process is to login as the SRM database user account and properly test the configuration and privileges that have been set. You can do this by closing Microsoft SQL Server Management Studio and re-opening it with the SRM database user account like so:

Next open a new query window, and type the command:

CREATE SCHEMA srm4nycdb AUTHORIZATION srm4nycdb ;

Next create a dummy table inside the database with:

USE srm4nycdb
go
CREATE TABLE test(data varchar(10))
go

Note: This should produce a success message in the query window, and under the +Databases, +srm4nycdb, +Tables node you should see a table called srm4nycdb.test


If you receive an error, check the syntax of the Transact-SQL commands and double-check your permissions.
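Once the test has succeeded, the dummy table has served its purpose. As a purely optional clean-up step (my suggestion, not part of VMware's procedure), you can remove it with:

-- Remove the dummy table created during the step 13 test
USE srm4nycdb
go
DROP TABLE test
go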

Configure a DSN Connection on the SRM Server(s)

1. Login to the Protected Site SRM server (in my case the srm4nyc virtual machine) with the SRM database user account, in my case srmdbuser-nyc

2. Open the ODBC Data Source Administrator from Administrative Tools on the Start Menu

3. In the ODBC Data Source Administrator choose the System DSN tab
4. Click the Add button
5. From the end of the list choose SQL Native Client, and select Finish

Warning: Be careful not to select the SQL Server driver from the list

6. In the name field of the Create a New Data Source to SQL Server dialog box, type VMware SRM

7. From the drop-down list select your Protected SQL server and click Next

8. Select “With Integrated Windows Authentication…” and click Next 9. Enable “Change the default database to” and select the Protected SRM

Database you created earlier


10. Click Next and Finish

Note: You should now be able to confirm all the dialog boxes associated with the ODBC setup – and also test that you have connectivity to the database server. This test is nearly always successful, but be aware that it only proves connectivity; it does not validate your permissions and rights to the database.

Note: Repeat this DSN setup for the Recovery Site SRM server.
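If you want a slightly deeper check than the DSN test, the sqlcmd utility that ships with the SQL Server 2005 client tools can confirm that you can actually authenticate and select the SRM database from the SRM server. A minimal example (my own sanity check, not a VMware requirement – substitute your own SQL server name, and use -U/-P in place of -E if you are using SQL authentication):

sqlcmd -S <your-sql-server> -E -d srm4nycdb -Q "SELECT @@SERVERNAME"

If this returns the SQL server's name rather than a login error, the DSN should work too.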

Installing VMware SRM Server

Caution: This is not desperately important at this stage, but you can install SRM in a shared-site mode, where one Recovery Site offers DR resources for many Protected Sites. If this is the configuration you intend, you might want to take a look at Chapter 13 on multi-site configurations. Now, don't panic: you can take a unidirectional configuration and convert it to a shared-site configuration at any time. It's just not a configuration that is terrifically popular at the moment, and I don't want to make the setup any more complex than it needs to be at this stage.

Installation of SRM Software

The installation of SRM is the same for both the Protected Site SRM server and the Recovery Site SRM server. During the install you will need to supply the following:

• The vCenter FQDN
• A username and password valid for authenticating to vCenter
• A default certificate (accepted) or one you have generated yourself
• Site identification values, such as the name of the site and email contact details
• SQL/Oracle/DB2 DSN credentials for the correct database

1. Login to the Protected Site SRM server, in my case SRM4NYC
2. Run the SRM installer .exe
3. Click Next through the usual suspects of the Welcome Screen and EULA
4. Select a disk location for the SRM software
5. In the SRM to vCenter registration dialog box, enter the name of your Protected Site vCenter and valid credentials to authenticate to vCenter


Note: One anomaly with this dialog box is that although the default port used to communicate is TCP port 80, when you look at the details of a completed SRM install the system communicates with vCenter on port 443. You must have port 80 open for this dialog box to work, and if you try to alter the port in the dialog box to 443 you will receive an error. TCP port 80 is essentially a redirection port which redirects communication to port 443 anyway.

Note: Additionally, I have cheated by using the built-in administrator account in Windows. It is recommended that you create a dedicated account for this purpose and exclude it from any password reset policies you have on your domain. In short, regard it as a service account.

6. After a short while a certificate security warning dialog box may appear. Choose Yes

Note: As mentioned a moment ago, although the dialog box uses port 80 by default, there is an exchange of certificate details. This is used to confirm that the Protected Site SRM "trusts" the vCenter system. The warning is caused by the auto-generated certificate on the vCenter server, which does not match the FQDN of the vCenter system. To stop this message appearing you would have to generate trusted certificates for both the Protected and Recovery Site vCenters


You can confirm that the certificate thumbprint presented in this dialog box is the same as your vCenter server's by opening a web browser on the vCenter server and viewing the certificate via your web browser's controls.

Here you can see the last 3 blocks of the thumbprint match the SRM installation dialog box (B4:E9:F4). It's by no means a requirement to use trusted certificates, either from an internal root certificate authority or a commercial certificate authority – I've found the built-in auto-generated certificates work fine for my purposes. However, if you do want to generate your own certificates and have them trusted and used by SRM, then VMware has written a guide that explains exactly how to do that: http://communities.vmware.com/servlet/JiveServlet/download/2746-201687-1209882-20726/How_to_use_trusted_Certificates_with_VMware_vCenter_Site_Recovery_Manager_v1.0.2.pdf

7. The next dialog box also concerns security. It is possible for the SRM install to generate a certificate to prove the identity of the SRM server itself; alternatively you can create certificates of your own. Select Automatically generate a certificate and click Next


8. As part of the auto-generation of the SRM certificate you must supply your organisation and organisational unit details

Warning: Alphanumerics, spaces, commas and periods are all valid characters. Invalid characters include the dash and the underscore

9. Next set the Site Information details. In this dialog box I changed the default name of the site from "Site Recovery for vc4nyc.corp.com" to the friendlier "New York Site", and then I added the email address details. I also removed the IP address that appeared in the "local host" field – the IP address that appears there is the IP of the SRM server. I've never been a big fan of hard-coded IP addresses in any product, and I always replace them with an FQDN.


Note: SOAP/HTTP Listener API ports (9007/9008) are only used if you choose to use the software development kit (SDK) to create applications or scripts that further automate SRM. The SOAP listener port (8095) is used to send and receive requests/acknowledgements from the SRM service. The HTTP listener port (8096) is used in the process of downloading the SRM plug-in. The email settings can be found in the extension.xml file located on the SRM server after the installation.
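Once the installation has completed, a quick (and entirely unofficial) way to confirm these listener ports are actually bound on the SRM server is from a command prompt:

netstat -ano | findstr "8095 8096 9007 9008"

Each port that appears in a LISTENING state confirms the corresponding listener described above is up.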

10. Next, complete the database connection information dialog box

Note: Remember these credentials have nothing whatsoever to do with the username and password used to authenticate to vCenter – they are the credentials to authenticate to the database. The "Database Client" option allows you to select the type of database you are using (SQL, Oracle, IBM) and the Data Source Name allows you to select the DSN created earlier. Although you can open the Microsoft ODBC DSN Setup wizard from the installer, I personally prefer to have this configured and verified before I begin – it means less work during the installation phase. The "Connection Count" option is used to set the initial pool size you want opened to connect to the database. The "pool" manages open connections to the database: it is a waste of resources to generate a new connection (and then tear it down) for each database operation, so existing connections are reused. The "Connection Count" is the initial size of the pool. If more connections are needed, the system will generate them (expanding the pool), up to the maximum specified by the "Max Connections" field. After that limit is reached, SRM database operations will pause until a connection is available. It is possible that the Database Administrator may restrict how many open connections a database user may have at a given time. If that is the case, then "Max Connections" should be set so as not to exceed that number. The default values have been "reasonable" for my testing, but they can certainly be increased as needed.

11. After completing this dialog box you can proceed to the installation – where the main file copy process occurs. Time for a lunch or tea break.

Installation of a Site Recovery Adapter – Example HP Lefthand SRA

A Site Recovery Adapter (SRA) is a third-party plug-in provided by your storage array vendor. In the early beta versions of SRM some of the SRAs were built into the SRM product, whilst others had to be downloaded and installed separately. However, from General Availability (GA) onwards your storage array vendor's SRA must always be downloaded and installed separately. Without the SRA installed, the options that control how SRM works with the replication engine are unavailable. You can download your SRA from the SRA section of VMware's website. In the first release of SRM 1.0 there was some delay in the storage vendors' SRAs appearing on VMware's website. This was caused by the fact that VMware has its own "internal" QA process which runs independently of the vendor. If you cannot find your storage vendor's SRA and you know it is supported, it's worth approaching the storage vendor directly. This still seems to be the case in SRM 4.0: the storage vendor often has an SRA which is newer than the one available on the VMware site. Frequently the blame is laid at VMware's door, which is sometimes unwarranted – after all, they can only QA what the storage vendors submit. Installing an SRA extends the functionality of the vSphere Client to allow for the configuration of the "Array Manager" piece of SRM. Without an SRA installed you would not be able to complete the post-configuration part of the SRM setup; the dialog box shown below would have no options on the drop-down list

Once the SRA has been installed, it allows VMware’s SRM to discover LUNs/Volumes on the Protected and Recovery Site – and compute which LUNs/Volumes are replicated. The real idea of this is to allow the Recovery Site administrator to run Recovery Plan tests without having to liaise with or manage the storage layer directly. The SRA will automate the process of presenting the right replicated LUNs or the latest snapshot to the ESX hosts at the Recovery Site when they are needed.


The SRA itself is often merely a collection of scripts, which carry out three main tasks:

• Communicate and authenticate to the array
• Discover which LUNs are being replicated – and select/create a snapshot prior to a test, or stop the normal pattern of replication and promote volumes to be read-writable at the point a SRM test is run, or during what you might call a site failover
• Work with SRM to initiate tests, clean up after tests, and trigger genuine failovers

It's worth just pointing out that some SRAs have other software or licensing requirements, for example:

• The FalconStor SRA currently requires a license string to be inputted during the installation
• The EMC SRDF SRA requires the EMC Solutions Enabler software to be installed prior to installing the SRA
• The EMC MirrorView SRA needs .NET 2.0 and the EMC Solutions Enabler to be installed prior to installing the SRA
• The EMC RecoverPoint (Santorini) SRA will install Java 2 Standard Edition Runtime 6.0 U5
• The 3Par SRA requires the Inform CLI for Windows to be installed before the SRA. Additionally, it requires .NET 2.0, which it will attempt to download from the internet
• The Compellent StorageCenter SRA requires .NET 2.0 and will attempt to download it from the internet. After the install you will be given the option to restart the SRM service
• The IBM N-Series SRA automatically restarts the SRM service after installation, whereas most SRAs require you to manually restart the SRM service after they have been installed

In my examples I am using the Lefthand VSA virtual appliance, so I need to download and install the SRA from HP Lefthand. The installation of an SRA is very simple – in most cases it is a next-next-finish exercise, together with a restart of the VMware SRM service.

1. Download the HP Lefthand SRA from vmware.com
2. Double-click the .exe
3. After the extraction process you will see a Welcome Screen
4. Click Next
5. Accept the License Agreement


6. Open the Services console and restart the VMware Site Recovery service; alternatively you can restart SRM at the command prompt with:

net stop vmware-dr
net start vmware-dr

Note: Repeat this installation on the Recovery Site SRM server, in my case the srm4nj.corp.com server. After the install of the HP Lefthand SRA you will find a folder called HP Lefthand on your Start Menu under Programs, which holds a link to documentation in PDF format.

Installing the vSphere Client SRM Plug-in

As with the installation of VMware Update Manager or VMware Converter, an installation of SRM "extends" the vSphere Client with additional management functionality in the form of a "plug-in". After the successful installation of SRM you should find there is a Site Recovery Manager plug-in available on the plug-ins menu. This needs to be installed to carry out the first configuration, or post-configuration, of the SRM service

1. Login to the Protected vCenter with the vSphere Client
2. In the menu choose Plug-ins and Manage Plug-ins
3. Under "Available Plug-ins", click the Download and Install link next to the vcDr plug-in

Note: There is nothing to the install of a plug-in apart from accepting the EULA and clicking the Next button. Although occasionally I find I have to close and reload the vSphere Client once it has been installed for the SRM icon to appear in the "Solutions and Applications" tab in the Home location

When you click the Site Recovery button for the first time you will receive a security warning very similar to the warnings you receive when loading the vSphere Client. This warning is caused by the use of the auto-generated certificate for SRM


If you do not wish this message to appear again, enable the option Do not display any security warnings for "w.x.y.z" and click the Ignore button.

Failure to Connect to the SRM Server

If you lose connectivity to, or restart, the SRM service on either the Protected or Recovery location while you have the vSphere Client open, you will receive an error dialog box. If a failure to connect to SRM occurs you will see this error when you click the Site Recovery icon. If this happens, confirm that the SRM service has started; if the SRM service will not start, confirm connectivity to the SQL database – and other dependencies such as IP settings and DNS name resolution.

Additionally, if the Protected Site cannot connect to the Recovery Site’s vCenter (perhaps you have lost connectivity to the Recovery Site) you will see this error message in the Protection Setup part of the Site Recovery Manager window


If this happens to you, check the usual suspects such as a failure of the vCenter service at the Recovery Site, and then click the Configure link and resupply the credentials for the vCenter at the Recovery Site. If you do get an outage of the Recovery Site's vCenter or SRM you will find the Protection Setup pane changes the Connection Status to read "Not Connected". Once you have resolved the communication problem you should find it reconnects – if not, you may find yourself running through the configure wizard again to force a re-pairing of the sites.
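When chasing this kind of failure I find it quickest to check the service itself from a command prompt on the affected SRM server, using the same vmware-dr service name we used earlier (a simple sketch of my own troubleshooting routine, not an official procedure):

sc query vmware-dr
net start vmware-dr

If sc query reports the service as STOPPED and net start then fails, the error it returns usually points at the underlying dependency – most often the SQL database connection.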

Conclusion

In this chapter I've tried to jump-start you through the primary stages of getting the SRM service up and running. Essentially, if you can create a database and point Windows to that database, you can install SRM – it's very similar in that respect to installing VMware Update Manager. Remember, your biggest challenge with SRM is getting the network communications working between the Protection Site and Recovery Site environments, and that's not just an IP and DNS issue – there are potential firewall considerations to be taken into account as well. That's where we're headed in the next chapter – the post-configuration stages of the SRM product, which are initiated at the Protected Site vCenter. In the next chapter we will be pairing the sites together and configuring both "Inventory Mappings" and "Protection Groups".


Chapter 7: Protection Site Configuration


Pairing the Protected and Recovery Site SRMs together

One of the main tasks carried out in the first configuration of SRM is pairing the Protected Site SRM to the Recovery Site SRM. It's at this point that you configure a relationship between the two – and really this is the first time that we indicate which is the Protected Site and which is the Recovery Site. When doing this first configuration I personally prefer to have two vSphere Client windows open – one on the protected vCenter and the other on the recovery vCenter. This way I get to monitor both parts of the pairing process. This is something I did very often in my early use of SRM, so I could see in real time the effect of a change at the Protection Site on the Recovery Site. Of course, you can simplify things greatly by using the new "linked mode" feature in vSphere 4. For the moment I'm keeping the two vCenters separate so it's 100% clear that one is the Protected Site and the other is the Recovery Site.

Protected Site: New York
Recovery Site: New Jersey

As you might suspect, the pairing process clearly means that the Protection Site SRM and Recovery Site SRM will need to communicate with each other to share information. It is possible to have the same IP range used at two different geographical locations – a networking concept called "stretched VLANs". Stretched VLANs, if implemented, can greatly simplify the pairing process as well as the networking of virtual machines when you run tests or invoke your Recovery Plans. If you have never heard of the concept of a stretched VLAN it's well worth brushing up on it – and considering its usage to facilitate DR/BC. This type of configuration, as we will see later, can actually ease the administrative burden when running test plans or invoking DR for real. Other methods of simplifying communications, especially when testing and running Recovery Plans, are the use of network address translation (NAT) systems or modifying the routing configuration between the two locations. This can remove the need to re-IP the virtual machines as they boot in the DR location – something we look at closely in subsequent chapters. The pairing process is sometimes referred to as "Establishing Reciprocity". In the first release of SRM the pairing process was strictly one-to-one, and it was not possible to create hub-and-spoke configurations where one site is paired to many sites; the structure of the SRM 1.0 product prevented many-to-many SRM pairing relationships. In SRM 4.0 the product has evolved to support a shared-site configuration where one DR location can provide resources for many Protected Sites. However, in these early stages I want to keep with the two-site configuration.


Later in this book we will reconfigure vCenter to use "linked mode" with many sites. Installing the SRM and vCenter software on the same instance of Windows can save you a Windows license. However, some people might consider this approach to increase the dependence on the vCenter management system – if you like, there is a worry or anxiety about creating an "eggs-in-one-basket" scenario. If you follow this rationale to its logical extreme, your management server will have many jobs to do, such as being the:

• vCenter Server
• Web Access Server
• Guided Consolidation Server
• Converter Server
• Update Manager Server

My main point really is that if the pairing process fails, it's more likely to be down to IP communications, DNS name resolution or firewalls than anything else. IP visibility from the Protected Site to the Recovery Site is required to set SRM up. When connecting the sites together you always login to the Protected Site and connect it to the Recovery Site. This starting order dictates the relationship between the two SRM servers.
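Before starting the wizard below, it's worth a 30-second sanity check of name resolution and port reachability from the Protected Site SRM server to the Recovery Site. A quick sketch from a command prompt – I'm assuming here that the Recovery Site vCenter is called vc4nj.corp.com, which follows my lab naming convention; substitute your own hostname:

nslookup vc4nj.corp.com
ping vc4nj.corp.com
telnet vc4nj.corp.com 443

If DNS resolution fails, or the telnet connection to port 443 is refused, fix that (and any firewall rules in between) before attempting the pairing.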

1. Login with the vSphere Client to the vCenter server for the Protected SRM Site (New York)

2. Select the Home icon and click the Site Recovery icon
3. In the Summary tab, in the Protection Setup pane, click Configure next to the Connection option

4. In the dialog box type in the name of the vCenter for the Recovery Site


Warning: When you enter the hostname for the vCenter, use lowercase. The vCenter host must be entered exactly the same way during pairing as it was during installation (for example, either fully qualified in all cases or not fully qualified in all cases). Additionally, although you can use either a name or an IP address during the pairing process, be consistent: don't use a mix of IP addresses and FQDNs together, as this only confuses SRM.

Note: As we saw earlier during the installation, despite typing port 80 to connect to the vCenter system, it does appear that communication is actually on port 443

Note: Again if you are using the un-trusted auto-generated certificates that come with a default installation of vCenter you will receive a certificate security warning dialog box


5. Then specify the username and password for the vCenter Server at the Recovery Site

Note: Again, if you are using the untrusted auto-generated certificates that come with a default installation of SRM, you will receive a certificate security warning dialog box

Warning: Although these two warning dialog boxes look the same, they are warnings about completely different servers – the vCenter and SRM servers of the Recovery Site. Authentication between sites can be difficult if the Protected and Recovery Sites are in different domains and there is no trust relationship between them. In my case I opted for a single domain that spanned both the Protected and Recovery Sites.

6. At this point the SRM wizard will attempt to Complete Connections – and a dialog box will show you the progress of this task

Also on the Remote Task bar of the Protected vCenter you will see a progress bar


At the end of the process you will be prompted to authenticate the vSphere Client against the remote (i.e. the Recovery) site. If you have two vSphere Clients open at the same time, on both the Protected and Recovery Sites, you will receive two dialog box prompts

Once again you may receive a security warning if you have used un-trusted auto-generated certificates for vCenter. Notice how in the above dialog box I'm using the full NT Domain style login of DOMAIN\Username.

Note: At the end of this first stage you should check that the two sites are flagged as being connected – together with values for both the local site and the paired site. Additionally, you will see there is an option to break the pairing between the two SRMs

Note: The Break button is the reverse of the pairing process. It's hard to think of a usage case for this option, but I guess you may at a later stage un-pair two sites and create a different relationship. In an extreme case, if you had a real disaster, the original Protected Site might be irretrievably lost; in this case you would have no option but to seek a different site to maintain your DR planning.

TIP: This window can also give you useful status information about a lack of resources between the pair. This can also mean you need to edit the default parameters that control this alert in the vmware-dr.xml file. As you may remember from the release notes section in Chapter 6, this alert can be a false positive – an alert triggered by accident.


Note: From this point onwards, whenever you load the vSphere Client for the first time and click the Site Recovery Manager icon, you will be prompted for a username and password for the remote vCenter. The same dialog box appears on the Recovery Site SRM. This does not happen if you have two or more vCenters in the new "linked mode" configuration. Although the new vSphere Client has the ability to pass through your user credentials, this does not happen for SRM – mainly because you might need totally different credentials at the Recovery Site anyway.

Configuring Array Managers – An Introduction

The next essential part of the post-configuration of SRM is enabling the Array Manager piece of the product. The Array Manager is often just a graphical front-end for supplying variables to the SRA. It's in this part that you inform SRM what engine you are using to replicate your virtual machines from the Protected to the Recovery Site. In this process the SRA interrogates the array to discover which LUNs are being replicated, and enables the Recovery Site SRM to "mirror" your virtual machines to the recovery array. You must configure each array at the Protected Site that will take part in the replication of virtual machines – if a new array is added at a later stage it must be configured here. The Array Manager will not show every LUN/Volume replicated on the storage array – just the ones used by your ESX hosts. The SRAs work this out by looking at the files that make up the VMs and only reporting LUNs/Volumes which are in use by VMs on ESX hosts. This is why it's useful, once you have set up the replication part of the puzzle, to populate the LUNs/Volumes with VMs. Clearly, the configuration of each Array Manager will vary from vendor to vendor. Much though I would like to be vendor-neutral at all times, it's not possible for me to validate every Array Manager configuration because that would be cost- and time-prohibitive. As you can see from the four following screen grabs, the UI for each vendor's SRA will be slightly different. However, if you look closely at the screen grabs for each of the SRAs I've included in this guide, you can see that they all share two main things in common. Firstly, you must provide an IP address or URL to communicate with the storage array, and secondly you must provide user credentials to authenticate with it. Most SRAs will have two fields for two IP addresses – these are usually for the 1st and 2nd storage controllers, which offer redundant connections into the array, whether it be fibre-channel or iSCSI based. Different vendors label these storage controllers differently – so if you're familiar with NetApp perhaps the term "Storage Heads" is what you are used to, or if it's an EMC Clariion you use the term "Storage Processor". Clearly, for the SRA to work there must be a configured IP address for these storage controllers and it must be accessible to the SRM server.


Anyway, what follows below might get very repetitive. It's blow-by-blow, vendor-specific coverage of configuring the Array Manager for the main storage vendors. If I were you, I would skip to the section heading that relates to the specific array vendor you are configuring – because, as I've said before, running through one Array Manager wizard is very similar to running through another.

Configuring Array Managers – EMC Celerra

Installing the EMC Celerra Replicator SRA is a relatively simple affair. In this example I will be walking you through the configuration of the EMC Celerra Replicator SRA with VMware SRM. With EMC Celerra systems the SRM server communicates with the Celerra at the Protected Site (New York) to collect volume information. It's therefore necessary either to configure a valid IP address for the SRM server to allow this to occur, OR to allow routing/inter-VLAN communication if your SRM server and the Celerra reside on different networks. This is one of the challenges of installing your SRM and vCenter on the same instance of Windows. Another workaround is to give your SRM server two network cards – one used for general communication and the other used specifically for communication with the Celerra. If you have no communication between the SRA and the Celerra you will receive this error message.


Warning: Confirm you can ping both the Protected Site array and the Recovery Site array via their Celerra Control Station IPs from the Protected Site (New York) SRM server before starting this part of the configuration

1. Logon with the vSphere Client to the Protected Site’s vCenter, in my case the vc4nyc.corp.com

2. Click the Site Recovery icon
3. In the Summary tab, in the Protection Setup pane, click Configure next to the Array Managers option

4. In the Protection Side Array Managers dialog box, click the Add button


5. In the Add Array Manager dialog box, type in a friendly name for this manager such as Array Manager for Protected Site

6. Select EMC Celerra Replicator as the Manager Type
7. Type the IP address of the Control Station at the Protected Site into the IP Address field; in my case this is my New York Celerra system with the IP address of 172.168.3.77

Note: If you are unsure of the IP address to use, you can find it using the Celerra administration web pages. If you select the Celerra in the list and click the "Control Station Properties" tab you can find the Celerra's IP address.

8. Supply the Username/Password for the manager
9. Click the Connect button

Note: This should connect the SRM server to the Celerra system – and show the name of the Data Mover in the Array ID field together with its model number like so:

IMPORTANT: In this case I've used the "nasadmin" account – this will allow me to see any Celerra ReplicatorV2-protected LUN even if it doesn't hold a virtual machine or isn't used by ESX. It's perhaps better to use the Celerra's permissions system to restrict the SRM server to only being able to enumerate volumes used by the ESX hosts.

10. Click OK.


Note: SRM will then begin the Discover Array and Recompute Datastore Groups process

Note: In this dialog box above you can see the Array Manager has discovered my Protected Site Celerra (New York). The device count here is 1 – the one iSCSI LUN configured for ReplicatorV2. The EMC Celerra SRA will ignore LUNs which are not configured for ReplicatorV2. For testing purposes I created a single LUN, and then populated it with virtual machines. Notice how it has discovered that its peer in the replication process is the Recovery Site Celerra (New Jersey).

11. Click Next

IMPORTANT: In this next stage we're going to tell the Recovery Site's SRM what the IP address of the Recovery Site's Celerra is. Again, the Recovery Site's SRA will need a valid IP address to connect to its Celerra – just as the Protected Site's SRA needs a valid IP address to connect to its own. The configuration of the Add Array Manager dialog box for the Recovery Site SRA is practically the same


Note: Although we are running the Array Manager wizard from the Protected Site, at this stage we are actually configuring the Recovery Site SRM – and making it aware of the storage array at the Recovery Site.

12. Click OK

Note: Notice how the device count is 1. This is the Celerras at the Protected and Recovery Sites finding my one iSCSI LUN, which I gave the LUN ID of 100. If you create new replicated volumes and present them to a VMware ESX host then this number should increment accordingly. For this to happen you will have to use the Rescan Arrays button, which you see at the end of the Array Manager wizard

13. Click Next, and review the datastore information and click Finish


Note: You can re-run this configuration to add additional arrays, and rescan arrays to force discovery of new LUNs/Volumes, at any time you wish by clicking the "Configure" link in SRM.

Note: Remember, for datastores to appear this way they must be in use by a virtual machine in some shape or form – either storing a virtual disk on a VMFS volume or as an RDM (Raw Device Mapping). After all, you may be using your array to replicate other systems; our concern is only for LUNs/Volumes used by VMware ESX and our virtual machines.

Important: Supplying the IP and authentication details of both the Protected and Recovery Sites allows SRM to automate processes that would normally require the interaction of the storage management team or interaction with the storage management system. This is used specifically in SRM when a Recovery Plan is tested, as the ESX hosts' HBAs in the recovery location are rescanned, and the SRA from the storage vendor allows them access to the replicated LUNs/Volumes to allow the test to proceed. However, this functionality does vary from one storage array vendor to another. For example, with some arrays these privileges would allow for the dynamic creation and destruction of temporary snapshots, as is the case with the EMC Celerra or NetApp filers. With other vendors someone in the "storage team" would have to grant access to the LUN and snapshot for this to be successful, as is the case with the EMC Clariion. You might think allowing this level of access to the storage layer would be deeply political; indeed, it could well be. However, in my discussions with VMware and those people who were amongst the first to try out SRM, this hasn't always been the case. In fact many storage teams are more than happy to give up this control if it means fewer requests for manual intervention from the server or virtualization teams. You see, many storage guys get understandably irritated if people like us are forever ringing them up to ask them to carry out mundane tasks like creating a snapshot and then presenting it to a number of ESX hosts. The fact that we as SRM Administrators can do that safely and automatically without their help takes a burden away from the storage team so they can have time for other tasks.


Unfortunately, for some companies this still might be a difficult pill for the storage team to swallow without explaining to them fully beforehand the remit of the SRA. If there has been any annoyance for the storage team, it has often been caused by the poor and hard-to-find documentation from the storage vendors. That has left some SRM Administrators and storage teams struggling to work out the requirements to make the vendor's SRA function correctly.

Configuring Array Managers – EMC Clariion

Installing the EMC Clariion SRA is a relatively simple affair. It has two main requirements: that Microsoft .NET 2.0 is installed, and that EMC's "Solutions Enabler" software is installed beforehand. After the installation of the EMC Clariion SRA, you will find you have an additional management tool called "VMware Insight for MirrorView". It's well worth taking a look at that utility to confirm the MirrorView configuration is good before configuring the SRM Array Manager.

The interesting thing about MirrorView Insight for VMware is that it is almost a mini-version of SRM. After running the discovery process, the "Next" button will allow you to carry out failover and failback. Leaving that functionality aside, the "Replicated LUNs on Array" tab gives you an interesting report on the array configuration in terms of MirrorView. You might recognise from the screen grab below the configuration that was built in the earlier chapter:


The tab called "Replicated LUNs on ESX" gives you a view of your logical vCenter layout as it relates to the LUNs that have been configured for MirrorView and hold virtual machines:

The DataStore tab shows just the datastores accessible to the ESX hosts, and is very similar to the datastore view in vCenter. The difference here is that LUNs that are correctly configured for MirrorView are flagged up with a green tick and status information that confirms the configuration is correct. The Virtual Machine tab works in a similar way, except from a virtual machine perspective.

The final view of Data Store Groups brings much of this information together in a single view.


The main point here is that even if you don't use this utility to handle failover or failback (because you may well prefer SRM!) it is still a very handy utility for verifying and reporting the MirrorView configuration. Anyway – moving back to VMware SRM. In this example I will be walking you through the configuration of the EMC MirrorView SRA with VMware SRM. With EMC Clariion systems the SRM server communicates with the Clariion at the Protected Site to collect LUN information. It's therefore necessary either to configure a valid IP address for the SRM server to allow this to occur, OR to allow routing/inter-VLAN communication if your SRM server and the Clariion reside on different networks. This is one of the challenges of installing your SRM and vCenter on the same instance of Windows. Another workaround is to give your SRM server two network cards – one used for general communication and the other used specifically for communication with the Clariion. If you have no communication between the SRA and the Clariion you will receive this error message.

Warning: Confirm you can ping your Clariion Storage Processor (SPA/B) from the Protected Site SRM server before starting this part of the configuration

1. Logon with the vSphere Client to the Protected Site’s vCenter, in my case the vc4nyc.corp.com

2. Click the Site Recovery icon
3. In the Summary tab, in the Protection Setup pane, click Configure next to the Array Managers option


4. In the Protection Side Array Managers dialog box, click the Add button

5. In the Add Array Manager dialog box, type in a friendly name for this manager such as Array Manager for Protected Site

6. Select EMC MirrorView SRA as the Manager Type
7. Type the IP addresses of the Storage Processors (SP-A/SP-B) at the Protected Site into the IP Address fields; in my case this is my New York Clariion system with the addresses 172.168.3.79 and 172.168.3.78

Note: If you are unsure of the IP addresses to use, you can find them using the Navisphere management tool. If you right-click the "Domain" that contains your Clariion and select Domain Status in the menu, you can find the Clariion system's IP addresses.


8. Supply the Username/Password for the manager
9. Click the Connect button

Note: This should connect the SRM server to the Clariion system – and show the name of the Clariion in the Array ID field together with its model number, like so:

IMPORTANT: In this case I've used the "nasadmin" account – this will allow me to see any MirrorView-protected LUN even if it doesn't hold a virtual machine or isn't used by ESX. It's perhaps better to use the Clariion's permissions system to restrict the SRM server to only being able to enumerate volumes used by the ESX hosts.

10. Click OK.

Note: SRM will then begin the Discover Array and Recompute Datastore Groups process


Note: In this dialog box above you can see the Array Manager has discovered my Protected Site Clariion (New York). The device count here is 1 – the one volume configured for MirrorView. The EMC Clariion SRA will ignore volumes which are not configured for MirrorView. For testing purposes I created a single volume, and then populated it with virtual machines. Notice how it has discovered that its peer in the replication process is the Recovery Site Clariion (New Jersey).

11. Click Next

IMPORTANT: In this next stage we're going to tell the Recovery Site's SRM what the IP address of the Recovery Site's Clariion is. Again, the Recovery Site's SRA will need a valid IP address to connect to its Clariion – just as the Protected Site's SRA needs a valid IP address to connect to its own. The configuration of the Add Array Manager dialog box for the Recovery Site SRA is practically the same


Note: Although we are running the Array Manager wizard from the Protected Site, at this stage we are actually configuring the Recovery Site SRM – and making it aware of the storage array at the Recovery Site.

12. Click OK

Note: Notice how the device count is 1. This is the Clariions at the Protected and Recovery Sites finding my one LUN, called "LUN_60_100GB_VIRTUALMACHINES". The green tick indicates that the LUN/Volume has been successfully replicated and is in sync with the Protected Site array. You might think allowing this level of access to the storage layer would be deeply political; indeed, it could well be. However, in my discussions with VMware and those people who were amongst the first to try out SRM, this hasn't always been the case. In fact many storage teams are more than happy to give up this control if it means fewer requests for manual intervention from the server or virtualization teams. You see, many storage guys get understandably irritated if people like us are forever ringing them up to ask them to carry out mundane tasks like creating a snapshot and then presenting it to a number of ESX hosts. The fact that we as SRM Administrators can do that safely and automatically without their help takes a burden away from the storage team so they can have time for other tasks. Unfortunately, for some companies this still might be a difficult pill for the storage team to swallow without explaining to them fully beforehand the remit of the SRA. If there has been any annoyance for the storage team, it has often been caused by the poor and hard-to-find documentation from the storage vendors. That has left some SRM Administrators and storage teams struggling to work out the requirements to make the vendor's SRA function correctly.

Configuring Array Managers – HP Lefthand Networks

In this example I will be walking you through the configuration of the Lefthand Networks SRA. With iSCSI systems the SRM server will communicate to the iSCSI Target at the Protected Site to retrieve datastore and LUN information.


It's therefore necessary either to configure a valid IP address for the SRM server to allow this to occur, OR to allow routing/inter-VLAN communication if your SRM server and VSA reside on different networks. This is one of the challenges of installing your SRM and vCenter on the same instance of Windows. Another workaround is to give your SRM server two network cards – one used for general communication and the other used specifically for communication with the VSA. If you have no communication between the SRA and VSA you will receive this error message.

Warning: Confirm you can ping your VSA from the Protected Site's SRM server before starting this part of the configuration

1. Logon with the vSphere Client to the Protected Site’s vCenter, in my case the vc4nyc.corp.com

2. Click the Site Recovery icon
3. In the Summary tab, in the Protection Setup pane, click Configure next to the Array Managers option

4. In the Protection Side Array Managers dialog box, click the Add button


5. In the Add Array Manager dialog box, type in a friendly name for this manager such as Array Manager for Protected Site

6. Select Lefthand Networks SAN/iQ as the Manager Type
7. Type the IP address of the VSA at the Protected Site into the SAN/iQ Manager IP1 field; in my case this is my vsa1.corp.com system with the IP address of 172.168.3.99

Note: If you only have one manager at the Protected Site (as is the case for me), type the same FQDN/IP address again. You must complete both the SAN/iQ Manager IP 1 and SAN/iQ Manager IP 2 fields.

8. Supply the Username/Password for the manager
9. Click the Connect button

Note: This should connect the SRM server to the VSA Manager – and show the name of the Management Group created on the VSA


Note: More than two SAN/iQ Managers can be specified by using a comma as a separator

10. Click OK.

Note: SRM will then begin the Discover Array and Recompute Datastore Groups process

Note: In this dialog box above you can see the Array Manager has discovered my single LUN/Volume created on the Lefthand Networks VSA – hence the "Device Count" is 1. For testing purposes I created a single volume, formatted it with VMFS and then populated it with virtual machines.


11. Click Next

IMPORTANT: In this next stage we're going to tell the Recovery Site's SRM what the IP address/FQDN of the Recovery Site's iSCSI target is. Again, the Recovery Site's SRA will need a valid IP address to connect to its iSCSI target – just as the Protected Site's SRA needs a valid IP address to connect to its own. The configuration of the Add Array Manager dialog box for the Recovery Site SRA is practically the same

Note: Although we are running the Array Manager wizard from the Protected Site, at this stage we are actually configuring the Recovery Site SRM – and making it aware of the storage array at the Recovery Site.

12. Click OK


Note: Notice how the device count is 1. This is the Lefthand Networks VSA finding my one volume, called "virtualmachines". If you create new replicated volumes and use them with VMware ESX this number should increment accordingly. For this to happen you will have to use the Rescan Arrays button, which you see at the end of the Array Manager wizard

13. Click Next, and review the datastore information and click Finish

Note: You can re-run this configuration to add additional arrays, and rescan arrays to force discovery of new volumes, at any time you wish by clicking the configure option in the management console.

Note: Remember, for datastores to appear this way they must be in use by a virtual machine in some shape or form – either storing a virtual disk on a VMFS volume or as an RDM (Raw Device Mapping). After all, you may be using your array to replicate other systems; our concern is only for volumes used by VMware ESX and our virtual machines.

Important: Supplying the IP and authentication details of both the Protected and Recovery Sites allows SRM to automate processes that would normally require the interaction of the storage management team or interaction with the storage management system. This is used specifically in SRM when a Recovery Plan is tested: the ESX hosts' HBAs in the recovery location are rescanned, and the SRA from the storage vendor automatically allows them access to the replicated LUNs/Volumes to allow the test to proceed. You might think allowing this level of access to the storage layer would be deeply political; indeed, it could well be. However, in my discussions with VMware and those people who were amongst the first to try out SRM, this hasn't always been the case. In fact many storage teams are more than happy to give up this control if it means fewer requests for manual intervention from the server or virtualization teams. You see, many storage guys get understandably irritated if people like us are forever ringing them up to ask them to carry out mundane tasks like creating a snapshot and then presenting it to a number of ESX hosts.


The fact that we as SRM Administrators can do that safely and automatically without their help takes a burden away from the storage team so they can have time for other tasks. Unfortunately, for some companies this still might be a difficult pill for the storage team to swallow without explaining to them fully beforehand the remit of the SRA. If there has been any annoyance for the storage team, it has often been caused by the poor and hard-to-find documentation from the storage vendors. That has left some SRM Administrators and storage teams struggling to work out the requirements to make the vendor's SRA function correctly.

Configuring Array Managers – NetApp FAS

In this example I will be walking you through the configuration of a NetApp Filer with NFS. With NetApp systems the SRM server communicates with the Filer at the Protected Site to collect volume information. It's therefore necessary either to configure a valid IP address for the SRM server to allow this to occur, OR to allow routing/inter-VLAN communication if your SRM server and the Filer reside on different networks. This is one of the challenges of installing your SRM and vCenter on the same instance of Windows. Another workaround is to give your SRM server two network cards – one used for general communication and the other used specifically for communication with the NetApp Filer. If you have no communication between the SRA and the NetApp Filer you will receive this error message.

Warning: Confirm you can ping your NetApp Filer from the Protected Site's SRM server before starting this part of the configuration

1. Logon with the vSphere Client to the Protected Site’s vCenter, in my case the vc4nyc.corp.com

2. Click the Site Recovery icon
3. In the Summary tab, in the Protection Setup pane, click Configure next to the Array Managers option


4. In the Protection Side Array Managers dialog box, click the Add button

5. In the Add Array Manager dialog box, type in a friendly name for this manager such as Array Manager for Protected Site

6. Select NetApp Data ONTAP NAS Storage System as the Manager Type
7. Type the IP address of the NetApp Filer at the Protected Site into the IP Address field; in my case this is my new-york-filer1.corp.com system with the IP address of 172.168.3.89, and the NFS IP address of 172.168.3.89

Note: Although the connection to the NetApp Filer is expressed as a single IP address, it's likely that this IP represents a NIC team (or VIF – Virtual Interface). Most filers come with many NICs that can be bonded together for different purposes. The NetApp FAS2020, for instance, is a 2U enclosure with just two NICs and a BMC card. In my simple configuration the IP address for management and the IP address for NFS exports are the same. In more production-like configurations this may not be the case.

8. Supply the Username/Password for the manager
9. Click the Connect button

Note: This should connect the SRM server to the NetApp Filer – and show the name of the NetApp Filer in the Array ID field together with its model number, like so:

IMPORTANT: NetApp has two SRAs: one is specifically for NFS support, and the other is for FC or iSCSI SAN support. In the SRA for FC/iSCSI the manager type name is "NetApp Data ONTAP Storage System". Ensure you install the right SRA – and, if you install both, that you select the correct SRA from the list. In this case I've used the root account – this will allow me to see any SnapMirror volume even if it doesn't hold a virtual machine or isn't used by ESX. It's perhaps better to use NetApp's permissions system to restrict the SRM server to only being able to enumerate volumes used by the ESX hosts.

10. Click OK.

Note: SRM will then begin the Discover Array and Recompute Datastore Groups process


Note: In the dialog box above you can see the Array Manager has discovered my new-york-filer1. The device count here is 1 – the one volume configured for SnapMirror. The NetApp SRA will ignore volumes which are not configured for SnapMirror. For testing purposes I created a single volume, and then populated it with virtual machines. Notice how it has discovered that its peer in the replication process is new-jersey-filer1.

11. Click Next

IMPORTANT: In this next stage we're going to tell the Recovery Site's SRM what the IP address of the Recovery Site's NetApp Filer is. Again, the Recovery Site's SRA will need a valid IP address to connect to its Filer – just as the Protected Site's SRA needs a valid IP address to connect to its own. The configuration of the Add Array Manager dialog box for the Recovery Site SRA is practically the same

Note:

Page 192: Administering VMware SRM 4.0

192

Although we are running the Array Manager wizard from the Protected Site, at this stage we are actually configuring the Recovery Site SRM – and making it aware of the storage array at the Recovery Site.

12. Click OK

Note: Notice how the device count is 1. This is the NetApp Filer finding my one volume called "virtualmachines". If you create new replicated volumes and use them with VMware ESX this number should increment accordingly. For this to happen you will have to use the Rescan Arrays button, which you will see at the end of the Array Manager wizard.

13. Click Next, and review the datastore information and click Finish

Note: You can re-run this configuration to add additional arrays, and rescan arrays to force discovery of new LUNs/Volumes, at any time you wish by clicking the Configure option in the management console.

Note: Remember, for datastores to appear this way they must be in use by a virtual machine in some shape or form – either storing a virtual disk on a VMFS/NFS volume or as an RDM (Raw Device Mapping). After all, you may be using your array to replicate other systems; our concern is only for LUNs/Volumes used by VMware ESX and our virtual machines.

IMPORTANT: Supplying both the Protected and Recovery Sites' IP details and authentication details allows SRM to automate processes that would normally require the interaction of the storage management team, or interaction with the storage management system. This is used specifically in SRM when a Recovery Plan is tested. The ESX hosts' HBAs in the recovery location are rescanned, and the SRA from the storage vendor automatically allows them access to the replicated LUNs/Volumes to allow the test to proceed. You might think allowing this level of access to the storage layer would be deeply political; indeed, it could well be. However, in my discussions with VMware and those people who were amongst the first to try out SRM – this hasn't always been the case. In fact many storage teams are more than happy to give up this control if it means fewer requests for manual intervention from the server or virtualization teams. You see, many storage guys get understandably irritated if people like us are forever ringing them up to ask them to carry out mundane tasks like creating a snapshot and then presenting it to a number of ESX hosts. The fact that we as SRM Administrators can do that safely and automatically without their help takes a burden away from the storage team so they can have time for other tasks. Unfortunately, for some companies this still might be a difficult pill for the storage team to swallow without explaining to them fully beforehand the remit of the SRA. If there has been any annoyance for the storage team it has often been with the poor and hard-to-find documentation from the storage vendors. That has left some SRM Administrators and storage teams struggling to work out the requirements to make the vendor's SRA function correctly.

Configure Inventory Mappings

The next stage in the configuration is configuring inventory mappings. This involves mapping the resource pools, folders and virtual networks of the Protection Site to the Recovery Site. Ostensibly, this happens because we have two separate vCenter installations which are not linked by a common data source. This is true despite the use of "linked mode" in vSphere4. The only things that are "shared" between two or more vCenters in linked mode are licensing, roles and the search functionality. The remainder of the vCenter metadata (DataCenters, Clusters, Folders and Resource Pools) is still "locked" inside the vCenter database driven by Microsoft SQL Server, Oracle or IBM DB2.

When your Recovery Plan is invoked for test or real purposes, the SRM server at the Recovery Site needs to know your preferences for how your replicated VMs are brought online. Although the recovery location has the virtual machines' files by virtue of third-party replication software, the "metadata" that makes up the vCenter inventory is not replicated. It is up to the SRM administrator to decide how this "soft" vCenter data is handled. The SRM administrator needs to be able to indicate what resource pools, networks and folders the replicated VMs will use. This means that when VMs are recovered they are brought online in the correct location and function correctly.

Specifically, the important issue is network mappings. If you don't get this right, the VMs that are recovered might not be accessible. VMware SRM has the functionality to re-IP VMs as they are brought online in the Recovery Site – but you might want to investigate techniques like NAT, stretched VLANs and simply modifying the routing tables on your network to re-route traffic normally destined for the Protection Site to the Recovery Site. These techniques will allow you to bring up your VMs without the need to re-IP them – and may in the long term yield far better results.

Although this "global default" mapping process is optional, the reality is you will use it. If you wish, you can manually map each individual VM to the appropriate resource pool, folder and network when you create what are called "Protection Groups". The "Inventory Mappings" wizard merely speeds up this process and allows you to set your default preferences. It is possible to do this virtual machine by virtual machine – but that is very administratively intensive. Having to manually configure each virtual machine's network, folder and resource pool in the Recovery Site would be very burdensome in a location with even a few hundred virtual machines. Later in this guide we will look at these per-virtual machine inventory mappings as a way of dealing with virtual machines that have unique settings. In a nutshell, see the "Inventory Mappings" as a method of dealing with virtual machine settings as if they were groups, and the per-VM methods as if you were managing individual users.

It is perfectly acceptable for the "Inventory Mappings" to have this icon next to some of the inventory objects.

After all, there may be resource pools, folders and networks that do not need to be included in your Recovery Plan. For example, test and development virtual machines might not be replicated at all, and therefore the inventory objects that are used to manage them are not configured. Similarly, you may have "local" virtual machines that do not need to be configured; a good example might be a virtualized vCenter and its SQL instance. By definition these "infrastructure" virtual machines are not replicated at the Recovery Site because we already have duplicates of them there – that's part of the architecture of SRM, after all. Other "local" or site-specific services may include such systems as anti-virus, DNS, DHCP, Proxy, Print and, depending on your directory services structure, Active Directory domain controllers. Lastly, you may have virtual machines such as deployment services – in my case the UDA – that do not need to be replicated at the Recovery Site as they are not business critical, although I would suggest you consider how dependent you are on these ancillary virtual machines for your day-to-day operations. At this stage we are not indicating which VMs will be included in our recovery procedure. This is done at a later stage when we create SRM "Protection Groups". Let me remind you (again) of my folder, resource pool, and network structures:


My vSwitch Configuration at the Protection & Recovery Site

My Resource Pool Configuration at the Protection & Recovery Site

My VM Folder Configuration:

Note: The arrows represent graphically how I will be "mapping" these resources from the Protected Site to the Recovery Site. SRM uses the term "Compute Resources" to refer to clusters of ESX hosts and the resource pools within them.

1. Logon with the vSphere Client to the Protected Site's vCenter
2. Click the Site Recovery icon
3. In the Summary Tab, in the Protection Setup pane – click Configure next to the Inventory Mappings option

Note: This will merely take you to the Protection Groups node and the Inventory Mappings tab. The column labelled "Recovery Site Resource" which contains "None Selected" merely means there is no default mapping yet in place

4. Double-click your preferred virtual network (in my case, the port group named vlan11 on dvSwitch0). In the subsequent dialog box select the virtual network in the Recovery Site


Note: By default, when you run a test "Recovery Plan" at the Recovery Site, SRM will auto-magically put the replicated VMs into a "bubble network" which isolates them from the wider network using an internal vSwitch. This prevents possible IP and NetBIOS conflicts. Try to see this "bubble network" as a safety valve that allows you to test plans with a guarantee that you will generate no conflicts between the Protected Site and the Recovery Site. The settings above are only used in the event of triggering your Recovery Plan for real. If I mapped this "production" network to the "internal" switch, no users would be able to connect to the recovered VMs. Notice how I am not mapping the VLAN10 port group to VLAN50. This is because the VMs that reside on that network deliver "local infrastructure" resources which I do not intend to include in my Recovery Plan.

Note: Networking and DR can be more involved than you first think, and much depends on how you have the network set up. When you start powering on VMs at the Recovery Site they may be on totally different networks, requiring different IP addresses and DNS updates to allow for user connectivity. The good news is SRM can control and automate this process. One very easy way to simplify this for SRM is to implement "stretched VLANs" where two geographically different locations appear to be on the same VLAN/subnet. However, you may not have the authority to implement this – and unless it is already in place it is a major change to your physical switch configuration, to say the least. It's worth making it clear that even if you do implement stretched VLANs you may still have to create inventory mappings because of port group differences. For example, there may indeed be a VLAN 101 in New York and a VLAN 101 in New Jersey. But if the administrative team in New York call their port groups on a virtual switch NYC-101 and the team in New Jersey call theirs NJ-101, then you would still need a port group mapping in the Inventory Mappings tab. Lastly, in the top right-hand corner of the Inventory Mappings tab there are Refresh... and Remove... options. The usage of these two options is largely self-explanatory.

Note: Once you understand the principle of inventory mappings, this then becomes the quite tedious task of manually mapping the correct Protected Site vCenter objects to the Recovery Site vCenter objects like so:


As you can see, I have not configured any mappings for my test network, test & dev resource pool or test & dev virtual machine folder. Similarly, I've created no mapping between my infrastructure network (vlan10), infrastructure resource pool or infrastructure virtual machine folder. So not everything needs to be mapped to the Recovery Site, just like not every LUN/Volume in the Protected Site needs replicating to the Recovery Site.

In my early days of using SRM 1.0 I used to take all the VMs from the Protection Site, and dump them into one folder called "Recovery VMs" on the Recovery Site's vCenter. I soon discovered how limiting this would be in the failback scenario. I would recommend more or less duplicating the folder and resource pool structure at the Recovery Site, so it exactly matches the Protected Site. It offers more control and flexibility – especially when you begin the failback process. I would avoid the casual and cavalier attitude of dumping virtual machines into a flat-level folder. I've found from hard experience that this can lead to situations where you cannot return the virtual machines to their original location in the Protected Site when you carry out a failback.

Finally, in my experience it is possible to map between the two virtual switch types – distributed and standard vSwitches. This does allow you to run a lower-level SKU of the vSphere4 product in the DR location. So you could use Enterprise Plus in the Protection Site and the Advanced version of vSphere4 in the Recovery Site. People might be tempted to do this to save money on licensing. However, personally I think it is fraught with unexpected consequences. For example, an 8-way VM licensed for Enterprise Plus in the Protected Site would not start in the Recovery Site. A version of vSphere4 that doesn't support DRS clustering and the initial placement feature would mean having to map specific VMs to specific ESX hosts.

FAQ: Can you map Distributed vSwitches to Standard vSwitches – and vice versa?
A. Yes. To SRM, port groups are just labels and it doesn't care. Remember, if a VM is mapped from a Distributed vSwitch to a Standard vSwitch it may well lose functionality that only the Distributed vSwitch can provide.


Creating Protection Groups

Protection Groups are used whenever you run a test of your Recovery Plan, or when DR is invoked for real. Protection Groups are pointers to the replicated vSphere datastores that contain collections of virtual machines that will be failed over from the Protected Site to the Recovery Site. A Protection Group's relationship to VMFS volumes can be one-to-one. That is to say, one Protection Group can contain or point to one VMFS volume. Alternatively, it is possible for one Protection Group to contain many VMFS volumes – this can happen when a virtual machine's files are spread across many VMFS volumes for disk performance optimization reasons, or when a virtual machine has a mix of virtual disks and RDM mappings. In a loose way the SRM Protection Group could be compared to the "storage groups" or "consistency groups" you make in your storage array. However, what actually dictates the membership of a Protection Group is the way the VMFS volumes are utilized by the virtual machines.

An important part of the wizard for creating Protection Groups is selecting a destination "placeholder" for the Recovery Site – this is a VMFS volume at the recovery location. After the wizard has completed, SRM creates the VMX file and the other smaller files that make up the virtual machine on the "placeholder" datastore selected in the wizard. It then pre-registers these "placeholder" VMX files to the ESX hosts at the Recovery Site. This registration process also allocates the virtual machine to the default resource pool, network and folder as set in the inventory mappings section. Remember, your real virtual machines are really being replicated to a LUN/Volume on the storage array at the Recovery Site. You can treat these "placeholders" as ancillary VMs used just to complete the registration process required to get the virtual machines listed in the Recovery Site's vCenter inventory. Without the placeholder VMs there would be no objects to select when you create Recovery Plans.

If you think about it, although we are replicating our virtual machines from the Protection Site to the Recovery Site, the VMX file does contain site-specific information, especially in terms of networking. The VLAN and IP address used at the recovery location could differ markedly from the protected location. If we just used the VMX as it was in the replicated volume, some of its settings would be invalid (the port group name and VLAN for example), but others would not change (the amount of memory and CPUs).

The main purpose of the placeholder/shadow VMX files is that they help you see visually in the vCenter inventory where your virtual machines will reside prior to executing the Recovery Plan. This allows you to confirm upfront whether your inventory mappings are correct. If a virtual machine does not appear at the Recovery Site, it's a clear indication that it is not protected. It would have been possible for VMware to create the virtual machines at the Recovery Site at the point of testing the Recovery Plan, but doing it this way gives the operator an opportunity to fix problems before testing a Recovery Plan. These placeholder virtual machines are sometimes referred to as "shadow" VMs. You may occasionally see reference to this term in error messages if Recovery Plans go wrong, such as "Image for testing or recovery cannot be produced because the shadow group is currently being tested".

So before you begin creating Protection Groups – I would create a small 5-10GB volume on the storage array of your choice and present that up to all the ESX hosts that will perform DR functions. For example on my Clariion array I created a small 5GB LUN visible to my Recovery Site ESX hosts (esx3/4) called LUN_102__5GB_PLACEHOLDERVMS


I then formatted this with VMFS giving it a friendly volume name like so:

The smallest VMFS volume you can create is 1.2GB. If the volume is any smaller than this you will not be able to format it. The placeholder files do not consume much space, so a small volume should be sufficient – although you may wish to leverage your storage vendor's thin-provisioning features so you don't unnecessarily waste space. But hey, what's a couple of GB in the grand scheme of things compared to the storage footprint of the VMs themselves? On NFS you may be able to use a smaller size for your placeholder datastore; much depends on the array – for example, the smallest volume size on my NetApp FAS2020 is 20GB.

After a proper run of a test Recovery Plan, the failback (returning to the original primary site) process has a manual clean-up phase, which involves the operator deleting the "placeholder" VMX files. So you might find it useful either to remember where they are located, or to set up a dedicated place to store them – rather than mixing them up with real virtual machine files. Frequently, people find it difficult to tell the difference between the placeholder files in vCenter and real virtual machines. It is good practice to use folder and resource pool names that reflect that these "placeholder" virtual machines are not "real". It would be great to see in subsequent releases a special icon to indicate these are SRM virtual machine files.

TIP: When you create your first ever Protection Group you might like to have the vSphere Client open on both the Protection Site vCenter and the Recovery Site vCenter. This will allow you to watch the real-time events that happen on both systems – of course, if you are running in "linked mode" you will see this happening if you expand parts of the inventory
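If you prefer the ESX service console to the vSphere Client for formatting the placeholder volume, a minimal sketch follows. The adapter name and device path are placeholders – substitute the values for your own environment:

    # Rescan so the newly presented placeholder LUN is visible
    esxcfg-rescan vmhba1
    # Format it with VMFS-3 and give it a friendly volume label
    vmkfstools -C vmfs3 -S SRM_Placeholders /vmfs/devices/disks/naa.<device-id>:1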


1. Logon with the vSphere Client to the Protected Site's vCenter (New York)
2. Click the Site Recovery icon
3. In the Summary Tab, in the Protection Setup pane – click the Create link next to the Protection Groups option

4. In the Create Protection Group – Name and Description dialog box, type in a friendly name and description for your Protection Group. In my case I'm creating a Protection Group called Virtual Machines Protection Group

5. When you click next, the Protection Group wizard will show you the datastores discovered by the Array Manager


6. Next select a datastore “placeholder” for your VMs. You can use local storage for the placeholder if you so wish. You can use remote storage, but if you do it should be a stand-alone placeholder which does not take part in any replication process.

Note: It really doesn't matter what type of datastore you select for the "placeholder" VMX files. You can even use local storage – remember, these are only "temporary" files used in the SRM process. However, local storage is perhaps not a very wise choice. If that ESX host goes down, is in maintenance mode or is in a disconnected state – then SRM would not be able to access the placeholder files during the execution of a Recovery Plan. It would be much better to use storage that is shared amongst the ESX hosts in the Recovery Site. If one of your many ESX hosts were unable to access the shared storage location for placeholder files – it would merely be skipped – and no placeholder VMs would be registered on it. The size of the datastore does not have to be large – the placeholder files are the smaller files that make up a virtual machine; they do not contain virtual disks.

Note: After clicking the Finish button a number of events will take place. Firstly, at the Protected Site vCenter you will see the task bar indicate the system is busy "protecting" ALL the virtual machines that reside in the datastore included in the Protection Group


Whereas in the Recovery Site’s vCenter it will begin the process of registering the VMs in the correct location in the inventory

You will also have noticed these “new” VMs are being placed in the correct resource pool, folder and connected to the correct network. In the screen grab below notice how ctx1 is mapped to VLAN51, as opposed to VLAN11 in the Protected Site.

If you browse the storage location for these “placeholders” you can see they are just “dummy” VMX files. As I mentioned before occasionally VMware SRM refers to these “placeholder” VMs as “shadow” VMs. As you can see in the next screen grab, there is no virtual disk created for these “shadow” VMs.

In the Virtual Machines and Template view, at the Recovery Site’s vCenter the VMs have been allocated to the correct folder.


Note: SRM knows which network, folder and resource pool to put the recovery VMs into because of the default "Inventory Mapping" settings we specified earlier.

Note: If you create a template and store it on a replicated VMFS volume it will become protected as well.

This means that templates can be recovered and be part of Recovery Plans (covered in the next chapter) just like ordinary VMs

Notice how templates are not powered on when you run a Recovery Plan – because they can't be powered on anyway without being converted back to being a virtual machine.

Warning: Deleting Protection Groups at the Protected Site vCenter reverses this registration process. When you delete a Protection Group, it unregisters and destroys the placeholder files created at the Recovery Site. This does not affect the replication cycle of the virtual machines, which is governed by your array's replication software. Be very cautious with deleting Protection Groups – the action can have unexpected and unwanted consequences if they are "in use" by a Recovery Plan. This potential problem or danger is covered later in this book. To explain it you need to know more about Recovery Plans, which I have yet to cover. But basically, if you delete Protection Groups… the placeholders get deleted too… and all references to those VMs in the Recovery Plan get removed as well…!!!

Failures to Protect a Virtual Machine

Bad Inventory Mappings?

Occasionally, you might find that when you create a Protection Group the process then fails to register one or more virtual machines at the Recovery Site. This is normally caused by a user error in the previous "Inventory Mappings" process. The error is flagged up at the Protected Site with a yellow exclamation mark on the Protection Group and on the virtual machines that failed to be registered.

This error is usually caused by the virtual machine settings being outside of the scope of the "Inventory Mappings" settings defined previously – and therefore the Protection Group doesn't know how to map the virtual machine's current folder, resource pool or network membership to the corresponding location at the Recovery Site. A good example is networking – which is how I first stumbled upon this error during the beta programme. In the Inventory Mapping process I did not provide any inventory mappings for vlan10. I regarded this as a local network that contained local virtual machines that did not require protection. Accidentally, the virtual machine named ctx01 was patched into this network, and therefore did not get configured properly in the Recovery Plan. In the real world this could have been an oversight – perhaps I meant to set an inventory mapping for vlan10 but forgot to – in which case the problem wasn't my virtual machine but my bad configuration of the inventory mapping.

Another scenario could be that the inventory mapping is intended to handle default settings where the rule is always X. There could be a number of virtual machines held within the Protection Group that have their own unique settings – after all, one size does not fit all. SRM can allow for exceptions to those rules when a virtual machine has its own particular configuration that falls outside of the group, just like with users and groups. If you have this type of inventory mapping mismatch it will be up to you to decide on the correct course of action to fix it. Only you can decide if the virtual machine or the inventory mapping is at fault. You can resolve this mismatch in a number of different ways:

• Update your inventory mappings to include objects originally overlooked
• Correct the settings of the virtual machine to fall within the settings of the default inventory map
• Customize the VM with its own unique inventory mappings. This does not mean that you can have rules (Inventory Mappings) and exceptions to the rule (Custom VM settings); a VM is either covered by the default inventory mappings or it is not.

If you think the Inventory Mapping is good, and you just have an exception – it is possible to right-click the affected virtual machine and configure its protection individually.

Broken Network Settings Prevent Protection

If one of your VM's network adapter settings has become broken you may see this error message:

This is caused by lost or orphaned network settings.

The only solution to this problem is to fix the "invalid backing" error by configuring the affected VM(s) with a valid port group on the ESX host, and then trying the protection again.
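If many VMs are affected, PowerCLI can reattach them to a valid port group in bulk. This is a minimal sketch rather than an SRM feature – the vCenter, VM and port group names are from my lab, so substitute your own:

    # Point every network adapter on ctx01 back at the vlan11 port group
    Connect-VIServer vc4nyc.corp.com
    Get-VM ctx01 | Get-NetworkAdapter | Set-NetworkAdapter -NetworkName "vlan11" -Confirm:$false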

It’s not an error - it’s a naughty, naughty boy!

http://www.youtube.com/watch?v=af9EHtQMMc4

If you can forgive the Monty Python "Meaning of Life" reference – the confusing yellow exclamation mark on a Protection Group can be a benign one. It can actually be an indication that a new virtual machine has been created which is covered by the Protection Group. As I may have stated before, simply creating a new virtual machine on a replicated LUN/Volume does not automatically mean it is protected and enrolled into your Recovery Plan. I will cover this in more detail as I examine how SRM interacts with a production environment which is constantly changing and evolving.


Hopefully with these "errors" you can begin to see the huge benefit that inventory mappings offer. Remember, inventory mappings are optional – and if you chose not to configure them in SRM, then when you created a Protection Group every virtual machine would fail to be registered at the Recovery Site. This would create tens or hundreds of virtual machines with the yellow exclamation mark, and each one would have to be mapped by hand to the appropriate network, folder and resource pool.

And Finally

The last type of error looks like this on a Protection Group

You will notice that no virtual machines are listed underneath the Virtual Machines tab of the Protection Group. This can happen if the ESX hosts at the Protection Site have lost all contact with the VMFS volumes covered by the Protection Group, if the VMFS volume that contained them has been destroyed, or if all the VMs have been moved to a different datastore not covered by replication or an SRM Protection Group.

FAQ: Will Protection Groups protect templates?
A. Yes, they will be protected so long as there is an inventory mapping for them. Templates are automatically enrolled into the "No Power On" part of Recovery Plans – so they are recovered but not powered on when you test or run the plan.

Conclusion

As you have seen, one of the biggest challenges in SRM in the post-configuration stages is network communication. Not only must your vCenter/SRM servers be able to communicate with each other from the Protected Site to the Recovery Site – but the SRM server must also be able to communicate with your Array Manager. In the real world this will be a challenge which may only be addressed by sophisticated routing, NATing, intra-VLAN communication – or by giving your SRM server two network cards to speak to both networks.

It's perhaps worth saying that the communication we allow between SRM and the storage layer via the vendor's "Storage Replication Adapter" could be very contentious with the "Storage Team". Literally, via the vSphere Client you are effectively managing the storage array. Historically, this has been a manual task purely in the hands of the "Storage Team" (if you have one), and they may react negatively to the level of rights that the SRM/SRA needs to have to function under a default installation. To some degree we are cutting them out of the loop. This could also impact negatively on the internal change management procedures used to handle storage replication demands in the business or organization within which you work.

This shouldn't be something new to you. In my research I found a huge variance in companies' attitudes towards this issue – with some seeing it as a major stumbling block. Others thought it a stumbling block that could be overcome as long as senior management fully backed the implementation of SRM; in other words the storage team would be forced to accept this change. At the opposite extreme, those people who deal with the day-to-day administration of storage were quite grateful to have the workload reduced, and noted that the fewer people involved in the decision-making process, the quicker our precious virtual machines will be online.


Virtualization is a very political technology – as virtualizationists we frequently make quite big demands on our network and storage teams, which can be deemed very political. I personally don't see automating your DR procedures as being any less political. We're talking about one of the most serious decisions a business can take with its IT – invoking its DR plan. The consequences of that plan failing are perhaps even more political than a virtualization project that goes wrong. I think it may be entirely possible, if you worked closely with your storage team and your storage vendor, to modify the scripts behind the SRA to include the possibility of manually presenting the storage to the recovery ESX hosts – and so circumvent the politics this introduces.

Of course it is totally impossible for me to configure every single storage vendor's arrays and then show you how VMware SRM integrates with them – but hopefully I've given you at least a feel for what goes on at the storage level with these technologies, together with the contributions from the storage vendors to this book. What I hope is that you now have enough knowledge both to communicate your needs to the storage guys, and to understand what they are doing at the storage level to make all this work. In the real world we tend to live in boxes – I'm a server guy, I'm a storage guy, I'm a network guy – and quite frequently we live in ignorance of what each guy is doing. Ignorance and DR make for a very heady brew.

Lastly, I hope you can see how important inventory mappings and Protection Groups are going to be in the recovery process. Without them a Recovery Plan would not know where to put your virtual machines in vCenter (folder, resource pool and networking), and secondly it would not know on which LUN/Volume to find those virtual machines' files. In the next chapter we will be looking at the creation and testing of Recovery Plans. I'm going to take a two-pronged approach to this topic. Chapter 8 gets you up and running and Chapter 9 takes Recovery Plans up to their fully functional level. Don't worry; you're getting closer and closer to hitting that button that says, "Test my Recovery Plan."


Chapter 8: Recovery Site Configuration


We are very close to being able to run our first basic test plan. I'm sure you're just itching to press a button that tests failover. I want to get to that stage as quickly as possible so you get a feel for the components that make up SRM. If you like, I want to give you the "bigger picture" view before we get lost in the devil that is the detail. So far all our attention has been on the configuration held at the Protection Site's vCenter. Now we are going to change tack to look at the Recovery Site's vCenter configuration. The critical piece is the creation of a Recovery Plan. It is likely you will have multiple Recovery Plans based on the possibility of different types of failures, which yield different responses. The Recovery Plan if you lost an entire site would be very different from a Recovery Plan invoked because of the loss of an individual storage array, or for that matter a suite of applications.

Creating A Basic Full Site Recovery Plan

Our first plan is going to include every VM within the scope of our Protection Group with little or no customization. Again, we will return to creating a customized Recovery Plan in the next chapter. This is my attempt to get you to the testing part of the product as soon as possible without overwhelming you with too many customizations. The Recovery Plan contains many settings and you have the ability to configure:

• The Protection Group covered by the plan
• The delay between the power on of one VM before another, based on the VMware Tools heartbeat service or a flat value in seconds
• Network settings used during the testing of plans
• Suspending "local" VMs at the Recovery Site that are not business critical – to free up resources for the failed-over VMs

1. Logon with the vSphere Client to the Recovery Site's vCenter
2. Click the Site Recovery icon
3. In the Summary Tab, in the Recovery Setup pane – click the Create link next to the Recovery Plans option

4. In the Create Recovery Plan – Recovery Plan Information dialog box, type in a meaningful and friendly name and description for the plan such as Complete Loss of Site Plan – Simple Test


5. In the Create Recovery Plan - Protection Group dialog box, select the Protection Group which is to be covered by the Recovery Plan

6. Click Next, and in the Create Recovery Plan – Response Times dialog box, select a timeout value you think is appropriate for the power on of your recovery VMs


Note: These two timeout values work together. Initially, the system will wait 600 seconds while the network settings taken from the Inventory Mapping are applied. The second 600-second timeout waits for a signal from the VMware Tools service running in the VM. If SRM does not receive a VMware Tools heartbeat, it will mark that virtual machine as a problem in the plan and then move on to the next virtual machine. If you know SRM 1.0 well, you will notice that these values were much smaller in the previous release (30 and 300 seconds respectively).

7. Next in the Create Recovery Plan – Configure Test Networks dialog box, set the options to handle networking when you run a test.

Note: The option called "Auto" creates an "internal" standard vSwitch called "bubble". This ensures no IP or NetBIOS conflicts can occur between the Protected Site VMs and the Recovery Site VMs. As the dialog box above shows, you can override this behaviour and map it to a vSwitch that would allow communication between the VMs – but watch out for the possibility of creating conflicts with your production VMs.

Important Note about Auto: On the surface the Auto feature sounds like a good idea. It will stop conflicts based on IP or NetBIOS name from occurring. However, it can stop two virtual machines in the test that should communicate with each other from doing so. Here is an example. Say you have four ESX hosts in a DRS cluster; when the virtual machines are powered on we will have no control over where they will execute. They will "auto"matically be patched to an internal standard vSwitch, which means by definition that whilst the virtual machines on that vSwitch will be able to communicate with each other, they will be unable to speak to any virtual machine on any other ESX host in the cluster. The consequences of this are clear – despite our ability to order the power on of virtual machines to meet whatever service dependency issues we have, those network services would fail to meet those dependencies and therefore fail to start properly. Currently, there is no workaround for this – except by using your VLAN structure to isolate the virtual machines from the wider network.

The problem may be fixed in future releases by a concept called "Cross-Host Network Fencing". This allows cross communication between vSwitches from one ESX host to another – but conceptually there is a large barrier (or fence) around them, which prevents IP and NetBIOS conflicts occurring. The concept of network fencing first surfaced some time ago in the VMware Lab Manager product, where people faced a similar challenge in their test environments – multiple duplicates of virtual machines running on the same physical network. Essentially, all network fencing is a very sophisticated and automated deployment of Network Address Translation (NAT) with DHCP. This allows all the virtual machines to preserve their original IP configuration, but still communicate with each other. Even with this development we would still have issues – for example there are some protocols, such as DCOM, that don't work with network fencing.

Another alternative to cross-host network fencing could be the implementation of pVLANs, or Private VLANs. The concept would be that the pVLAN would be used to mimic the real network of the Recovery Site, but be isolated using a pVLAN. pVLANs are now supported in vSphere4, but remember that to configure them you need access to the Distributed vSwitch, which is currently an Enterprise Plus feature only. For the moment the "Auto" feature in the Recovery Plan wizard is best regarded as a "safety valve" that allows you to test a plan without any fear of generating an IP or NetBIOS name conflict in Windows VMs.

8. Finally, you can suspend VMs at the Recovery Site to free up CPU and Memory resources in the Create Recovery Plan – Suspend Local Virtual Machines dialog box. In my case I called for my Test & Dev VM to be suspended


9. Click Finish

Note: In this case I asked for a test VM called "winxp" to be suspended in the Recovery Site (New Jersey) as it won't be needed when carrying out tests. As with Protection Groups, Recovery Plans can be much more sophisticated than the plan I have just created. Once again I will return to Recovery Plans in Chapter 9.

Testing Storage Configuration at the Recovery Site

By now you're probably itching to hit the big green button in SRM that says test.

But before you do, if you want your test to complete correctly, it is worth confirming that the ESX hosts in the Recovery Site will be able to access the storage array at the Recovery Site. Previously when we were setting this up we focused on making sure that the ESX hosts in the Production Site had access to the VMFS volume. The same considerations may have to be taken into account at the Recovery Site as well. It is good practice to make sure that the ESX hosts in the Recovery Site have visibility to the storage, especially if you're using iSCSI, where post-configuration of the ESX hosts is required to allow access to the storage array. In most cases the hosts at both the Protection and Recovery Site will already have been granted access to the array at the location where they reside.

You may not even have to manually grant the ESX hosts in the Recovery Site access to a volume or LUN during the execution of the Recovery Plan. For example, the HP Lefthand SRA will automatically allocate the ESX hosts access to the very latest snapshot if you have set these up in the Scheduled Remote Copy format. The HP Lefthand VSA knows how to do this because that is one of its main jobs, and because we provided an IP address and user credentials during the Array Manager configuration at the Protected Site. This may not be the case with other storage vendors' management systems – you may well need to create management groups in the storage array, and allow your ESX hosts access to them, for SRM to then present replicated LUNs/Volumes to the ESX hosts. If you are unsure about this, check back to Chapters 2-4 where I covered array-based replication for different storage vendors.

This level of automation does vary from one storage array vendor to another. For example, with your type of storage array you may well need to use your "LUN masking" feature to grant your Recovery Site ESX hosts access to the storage group (aka volume group, contingency group, consistency group, recovery group) that contains the replicated or snapshot LUN. It is well worth double-checking the "readme" file information that often ships with an SRA to confirm its functionality. Additionally, many storage vendors have such good I/O performance that they create a snapshot on the fly for the test and then present this snapshot to the ESX hosts in the Recovery Site. At the end of the test, they will normally delete this temporary snapshot – this is the case with NetApp storage and its FlexClone technology. Below is a diagram that shows what happens at the storage layer when a test of a Recovery Plan happens.

The main thing to mention here is that all that needs to be configured on the array is for the ESX hosts in the Recovery Site to have access to the storage groups which include the replicated LUNs. As long as this is done, when the test is executed the storage vendor's SRA will send an instruction to the storage array to create a snapshot on the fly – and it will then instruct the array to present the snapshot (not the R/O replicated LUN) to the ESX hosts (this is indicated by the dashed line in the diagram). This means that when tests are executed, your production system is still replicating changes to the Recovery Site. In short, running tests is an unobtrusive process, and does not upset the usual pattern of replication that you have configured, because the ESX hosts in the Recovery Site are presented with a snapshot of the replicated volume which is marked as read-write, whereas the replicated volume is marked as read-only – and is still receiving block updates from the Protected Site storage array. In the case of storage vendors who do not create a snapshot "on the fly", they will mount the latest snapshot in the cycle of replication – this is the case with the HP Lefthand VSA.

Overview: First Recovery Plan Test

Well, we're now here! If all goes to plan you should be able to run this basic Recovery Plan we have created and find the VMs in the Recovery Site are powered on. A great many events happen at this point. If you have some software that records the screen, such as HyperCam or Camtasia, you might even want to record the events so you can play them back. If you wish to watch a video of the test you can view one I captured and uploaded to RTFM: http://www.rtfm-ed.co.uk/srm.html


What do we mean by "Test"?

Before we actually "test" our Recovery Plan I think we should really discuss what constitutes a proper test of your DR plan. In many ways the "test" button in SRM is actually testing that the SRM software works and that your SRM Recovery Plan functions as expected. For many organizations a real test would be a hard test of the Recovery Plan – literally hitting the red button – and actually failing over the Protected Site to the Recovery Site. Think of it this way. If you have a UPS system or a diesel generator at your site you could do all manner of software tests of the power management system, but you won't really know if the system behaves as hoped until you've lost power. With this attitude in mind, it's not unheard of for large companies to invoke and hard test their DR plans twice a year. This allows them to identify flaws in the plan, to update their "run books" accordingly, and also to keep the team in charge of controlling the DR plan up to date with those procedures and unexpected events that can and do happen. In short, clicking the test button in SRM does not prove or guarantee that the business IT functions will still operate after a disaster.

FAQ: Is it possible to trigger a plan automatically, perhaps with a script?
A: Currently, no. There are no PowerCLI cmdlets that would allow us to test or run a Recovery Plan. Even if there were – many would be worried about false positives: a DR plan could be acted on automatically when no disaster has occurred. Generally, people who ask this question are trying to make SRM an availability technology like VMware HA or FT. With that said, some more enterprising members of the community have investigated using the SRM API and .NET objects.

FAQ: Is it possible to run two plans simultaneously?
A. Yes and no. It depends very much on your storage array vendor's architecture. For example, on my EMC Clariion CX3 I can run two Recovery Plans almost at the same time. However, if I click the test button on both at the same time I frequently see an error message like so:

Here you can see that the SYMAPI database is locked by the first Recovery Plan. Generally, if you wait a short while for the Step 2. Prepare Storage task to complete, you can start the next Recovery Plan without an error. Of course, if you run with multiple storage vendors which have no dependency on each other, you can run more than one Recovery Plan simultaneously.

What happens during a Test of the Recovery Plan?

There are a significant number of changes that take place at the Recovery Site location when a test is run. Here's a 1,000-yard view of the process:

• The test is started
• ESX host storage is rescanned so the hosts can see the replicated storage
• Replicated VMs are registered
• SRM triggers the suspension of virtual machines marked as not needed
• Before the recovery virtual machines are powered on, an internal switch is created to prevent IP/NetBIOS conflicts
• Once all the VMs are powered on – the test is paused
• The SRM Administrator can review the findings of the test
• The operator clicks continue to process the default "message" in the test of a Recovery Plan
• Power off and clean-up of the recovery virtual machines occurs
• Suspended virtual machines are resumed
• ESX host storage is rescanned again to remove references to snapshots presented during the test

Now in More Detail

• ESX hosts run the "Prepare Storage for Test" process, which involves:
• The ESX hosts' HBAs (Fibre-Channel, iSCSI, Software iSCSI) are rescanned
• The ESX hosts discover the VMFS/NFS volume that contains the VMs replicated from the Protected Site to the Recovery Site
• The ESX hosts refresh to see the VMFS volumes
• The ESX hosts' replicated VMFS volume is resignatured and given a volume name of "snap-nnnnnnn-virtualmachines", where virtualmachines is in my case the original VMFS volume name.

In the screen grab below you can see the Recovery Plan has been started, and the storage layer of the ESX hosts is in the process of being rescanned.

In the screen grab below you can see a VMFS volume has been mounted and has been resignatured. All resignatured volumes are given the prefix snap- so they can be clearly identified. In this case the Recovery Plan was executed using the EMC Clariion array I have in my test lab.

Note: In the beta of SRM 1.0, the system would rename the snapshot of the VMFS volume to be the same datastore name that is used on the Protected Site. If you wish to re-enable the renaming of VMFS volumes, you can do this by editing the configuration XML file of VMware SRM on the Recovery Site. Locate the vmware-dr.xml file in the C:\Program Files\Site Recovery Manager\Config directory. Modify <fixRecoveredDatastoreNames>false</fixRecoveredDatastoreNames> to be <fixRecoveredDatastoreNames>true</fixRecoveredDatastoreNames>. This can now also be done by using the advanced settings option on the properties of the Site Recovery icon.
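For clarity, this is how the relevant element sits inside vmware-dr.xml – the surrounding structure is abbreviated here, as the rest of the file varies by installation (treat the <config> root shown as an assumption):

    <!-- vmware-dr.xml on the Recovery Site SRM server (other settings omitted) -->
    <config>
      ...
      <fixRecoveredDatastoreNames>true</fixRecoveredDatastoreNames>
      ...
    </config>

You will typically need to restart the VMware Site Recovery Manager service for a change to this file to take effect.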

• Virtual Machine Unregistration & Registration:
• The placeholder VMs are unregistered from the vCenter inventory
• The replicated VMs are registered to the inventory
• They are then reconfigured

• Virtual Machines marked for Suspension are suspended


• Virtual Machines are Powered on

• ESX hosts create a vSwitch called testBubble-1 vSwitch
• The test vSwitch has a port group called testBubble-1 group, if you have set the Recovery Plan to use "Auto" in test mode
• Virtual machines are reconfigured to use the testBubble-1 group

WARNING: Occasionally, when the test hangs or goes seriously wrong, I've seen the clean-up phase fail, and this failure means the vSwitch/port group is never removed. It is safe to remove it manually once the test has completed. If left behind it can create an error message the next time the Recovery Plan is tested. If your Recovery Plan fails halfway through the process you can have components left over. For example, the screen grab below shows two test-bubble vSwitches, the first created by a Recovery Plan – but not cleaned up at the end.
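From the ESX service console, removing a leftover bubble switch manually is a one-liner per object. A minimal sketch, assuming the auto-generated names shown above and that no VMs are still attached to the port group:

    # Delete the leftover test port group, then the vSwitch itself
    esxcfg-vswitch -D "testBubble-1 group" testBubble-1
    esxcfg-vswitch -d testBubble-1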

• Follow the progress of your plan:
• By selecting the plan, and selecting Recovery Steps, you can see the progress of the plan
• Errors are flagged in red
• Successes are marked in green
• Active processes are marked in red with a % value for how much has been completed


Practise: First Recovery Plan Test

1. Logon with the vSphere Client to the Recovery Site's vCenter
2. Click the Site Recovery icon
3. Open the + Recovery Plans icon
4. Select your plan, in my case it was called Complete Loss of Site Plan – Simple Test
5. Click Test Recovery Plan
6. Acknowledge the warning about the time taken to carry out the Recovery Plan

WARNING: Do not click Run Recovery Plan. This actually invokes DR proper. This is a C-Class management decision. Unless you are in a lab environment you should seek higher approval.


Note: At this point a dialog box will appear

and the Recovery Plan icon will change, indicating a Recovery Plan is in progress.

Additionally, in the Task Bar you will be able to track the changes as they occur

• Once all the VMs are powered on, the process will be paused at around 54%, and the Recovery Plan's icon will change to an "informational icon"

This icon usually indicates a "message" event has taken place. Messages can be viewed in the Recovery Steps tab of a Recovery Plan. Messages stop the plan to allow for some kind of manual intervention in the recovery steps. The Recovery Steps tab also allows you to see the process I was describing in detail at the beginning of this section

• At this point the test of your recovery has completed, and the recovery VMs should be powered on. It is useful to examine any errors as a way of troubleshooting your configuration. For example, I had two errors, the first of which appears to be benign, and the second of which was a configuration error. The first error indicated a timeout waiting for a heartbeat signal from the VM called db02.

I checked the VM at the Recovery Site, and found that VMware Tools was up to date and was running. The VM had recovered without an error. I decided this was due to the fact that in my environment I lack memory resources, and the timeout was probably caused by a surge of VMs starting simultaneously. I occasionally receive timeouts because of this issue. The second error was caused by a mistake in my inventory mappings for a template. I had not correctly allocated a valid port group for my template VM in the inventory. I had to let the Recovery Plan complete before I could review my inventory mappings.

• When you are happy to proceed, click the Continue option

Controlling & Troubleshooting Recovery Plans

Pause, Resume and Stopping Plans

Personally, if I want to pause the progress of a Recovery Plan regularly – I prefer to add a "message" to the plan at the appropriate point. You can manually control the progress of the test using the icons on the button bar


If you do choose to Pause or Stop a test the Recovery Plan icon changes accordingly

A test may be cancelled by SRM if it detects a serious error, such as being unable to access the replicated LUN/snapshot, or if SRM believes another test is in progress or has hung. The screen capture below shows such a situation

WARNING: Cancelling the test manually is not without consequences if you don’t allow the system to complete. It can leave SRM in a pending state where it thinks a test is still running when it has been cancelled.

Error: Cleaning Up Phase of the Plan does not happen with iSCSI

The clean-up and reset phase of the test plan does not always automatically stop access to the replicated iSCSI LUNs/Volumes. In my experience of using SRM, it's not unusual to see that the replicated LUN/Volume is still listed under datastores for the recovery ESX hosts after a test has completed. Of course, what can happen is that between one test and another a new snapshot could occur. By default most SRAs use the most recent snapshot. However, some SRAs do not deny access to the snapshot after the test has completed. This can lead to a situation where the VMFS volume remains visible to the ESX host after the test is completed.

Whatever happens, SRM will always prefer to use the most recent snapshot. This can cause an alert should you attempt to run the test plan multiple times, and is caused by a second attempt to resignature and rename the volume. The resignature will be successful, but the rename will fail because a VMFS volume with the same name may already exist from the previous run of the test plan. It should resignature BOTH volumes, and rename the most recent to be the original volume name.

Note: This error will only happen if you enabled the datastore rename process in the vmware-dr.xml file as outlined earlier in this chapter.


This graphic shows the error message:

and this screen grab of the datastores view on the ESX host shows the effect. If this happens, the recovery VMs will point to the VMFS volume which has the snap-nnnnnnn-virtualmachines name, which is the most recent snapshot. The actual cause of this is quite tricky to explain, as it depends on the time the test plan was run compared with the cycle of replication adopted at the storage array. The error is caused if SRM fails to resignature both volumes. There is an easy workaround for this issue – which is to rename your "older" snapshot VMFS volume something else, such as "test1-virtualmachines". This should allow the additional snapshot to be presented without causing the rename annoyance.

You might be interested in knowing why this problem specifically afflicts iSCSI, and does not affect SAN and NFS systems. I've seen this a number of times in ESX 3.5 and ESX 4.0, with or without SRM; it does seem to be the case that even if you make big changes to the ESX host (changing its IQN, blocking the iSCSI port 3260 or denying access to an iSCSI volume), the mounting of iSCSI volumes persists. Whilst at VMworld 2009, it was my very good fortune to finally meet VMware's Andy Banta. I've known Andy for some time via the forums but had never met the guy face to face. Andy was part of the team involved in developing the ESX host's iSCSI stack. So after the event I made a point of asking Andy why this happens with iSCSI and ESX. It all hinges on how iSCSI is implemented by your storage vendor and the fact that ESX hosts will keep iSCSI sessions alive – because it is too dangerous to simply tear them down when they could be in use by other targets. What appears below is Andy's explanation, albeit with a little tidying up by yours truly:

"First off, there's a distinction to be made: Clariions serve LUNs, LeftHand and EqualLogic systems present targets. The behaviour on a Clariion should be the same for both FC and iSCSI: LUNs that go away don't show up after a rescan. What you're seeing are the paths to targets remaining after the target has been removed. In this case, there are no sessions to the targets and the system no longer has any access to the storage. However, ESX does hang onto the handle for a path to storage even after the storage has gone away. The reason for this is to prevent transient target outages from allowing one piece of storage to take over the target number for another piece of storage. NMP uses the HBA, target and channel number to identify a path. If the paths change while I/O is going on, there's a chance the path won't go to the target NMP expects it to. Because of this, we maintain a persistent mapping of target numbers to paths. We also don't ever get rid of the last path to storage, so in this case, since SRM used the snapshot as storage, the ESX storage system won't get rid of it (at least for a while). In 3.5 we used an aging algorithm to let us know when we could reuse a target ID. What Dell EqualLogic finally ended up recommending to customers for 3.5 was to rescan after removal ten times (our length of aging before we gave up on the target)"
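Dell EqualLogic's advice above – rescan after removal ten times – is easy to script from the ESX 3.5 service console. A minimal sketch; the adapter name vmhba33 is an assumption, so substitute your iSCSI adapter:

    # Rescan the iSCSI adapter ten times so aged-out target IDs are released
    for i in $(seq 1 10); do esxcfg-rescan vmhba33; done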

Error: Loss of the Protection Group Settings

Occasionally, I’ve seen Recovery Plans lose their awareness of the storage setup. This is usually caused by an SRM administrator deleting the Protection Group at the Protected Site. If the administrator does this, all the “placeholder” VMs disappear. The Recovery Plan then becomes “orphaned” from the storage configuration at the other location, and doesn’t know how to contact the storage array to request access to the replicated volumes/LUNs. Effectively, it becomes a plan without any VMs to recover. The ironic thing about this is that the recovery steps actually give “success” as a result – but the virtual machines at the Recovery Site are never powered on. The dialog box below shows you this – notice how there is no + symbol next to Step 2. Prepare Storage

Whereas a Recovery Plan that does know how to communicate with the storage would normally have a + next to Step 1, as the screen grab below demonstrates.

The way to fix this problem is to reconfigure the Recovery Plan and ensure it can see the Protection Groups:

1. Right-click each Recovery Plan
2. Choose Edit
3. Click Next to accept the existing plan name and description
4. Ensure that there is a tick for the affected Protection Groups
5. Click Next, and in the Create Recovery Plan – Response Times dialog box, select a timeout value you think is appropriate for the power-on of your recovery VMs
6. Next, in the Create Recovery Plan – Configure Test Networks dialog box, set the options to handle networking when you run a test
7. Finally, you can suspend VMs at the Recovery Site to free up CPU and memory resources in the Create Recovery Plan – Suspend Local Virtual Machines dialog box. In my case I called for my Test & Dev VM to be suspended

Page 227: Administering VMware SRM 4.0

227

8. Click Finish

WARNING: Deleting Protection Groups has a huge, huge impact on your configuration, effectively destroying the hard work carried out in creating an effective Recovery Plan. For this reason you should avoid at all costs the Windows administrator’s solution to all problems – “Let’s just delete it and add it back in – and see if it works again.” I doubt very much you will pay much heed to this warning – until the day you experience it yourself first hand. Like me, you probably learn as much from your mistakes as from your successes.

Error: Getting Extra Help

This particular screen grab shows a non-reproducible error I had in SRM 1.0. However, I doubt very much that the principles have changed in this respect; if I receive a similar fatal error in SRM 4.0, I will be sure to update this book with a more contemporaneous example. I’ve found that if you let your mouse hover over the error text highlighted in red, you can frequently see verbose error information. Take the example below

Here the problem was caused by a lack of disk space for a remote snapshot – and as a consequence the SRA could not find the remote snapshot schedule for the primary volume. Additionally, if you export the results of your plan you can easily cut and paste such an error message from the export and email it to your storage buddies, who will fix all your problems and make the pain go away. That’s the theory at least – unless you’re the guy in charge of the storage layer, in which case it’s your problem!

Page 228: Administering VMware SRM 4.0

228

Error: Non-fatal error information reported during execution of array integration script: testFailover Output: "C:\Program Files\VMware\VMware Site Recovery Manager\/scripts/SAN/HP Lefthands/jre/bin/java" -cp "C:\Program Files\VMware\VMware Site Recovery Manager\/scripts/SAN/HP Lefthands/UI.jar" com.lefthandnetworks.commandline.Srm.Srm < "C:\Program Files\VMware\VMware Site Recovery Manager\/scripts/SAN/HP Lefthands/XMLinput.xml"" DELETE: The writable space on snapshot named "virtualmachines_RS_Sch_1_Rmt.593" was deleted, continuing... NOTE: Had this been a real failover the remote parent volume named "replica_of_virtualmachines" would have been changed to a primary volume, continuing... ERROR: command to address 172.168.3.98 failed because could not find the matching remote schedule for primary schedule 35188AC48F2BAEBBC018AA4C3C6C6534;ProtectedManagementGroup;514;rdm_ctx1_RS_Sch_1_Pri. Error: .

Further analysis of the storage array indicated that I was beginning to run out of space.

This little episode illustrates the dangers of improperly monitored virtual storage or thinly provisioned LUNs/Volumes. You cannot store things in mid-air – and currently the place to see whether you’re running out of space is not your OS or hypervisor, but your vendor’s storage management software.

Error: Deleted Objects at the Recovery Site, still referenced in the Recovery Plan

Another problem that can occur in the Recovery Plan is when it makes references to vCenter objects such as virtual machines that no longer exist. The screen grab below shows such an error.

As you can see, the error message states “Error: The request refers to an object that no longer exists or has never existed”. I had deleted a test virtual machine that was marked to be suspended during my Recovery Plan. The Recovery Plan has an orphaned entry for this object, which no longer exists at the Recovery Site. The same lost object creates an error further down the plan, when it tries to resume from suspended state the “test” virtual machine that was never suspended in the first place. The fix for this particular error is to edit the Recovery Plan and run through the wizard; you will eventually come to the dialog box where you can remove the orphaned virtual machine(s). Simply running through the wizard will “refresh” the Recovery Plan(s) and purge the reference to the lost virtual machines.

Error: Repairing VMs

Occasionally, I’ve seen this error on both my Protection Groups and Recovery Plans:

The main cause of this, in my experience, is that the ESX host which holds the “placeholder” VMs for the affected VMs has become “disconnected”. Generally, I have to restart the management services on the ESX host, and then use the “Repair All” option to fix the problem. It’s important to resolve the cause of the problem before clicking the Repair All option; if you don’t, I’ve seen SRM wait a very long time before it gives up on the process.

EMC Celerra and Testing Plans

During the test of a Recovery Plan you should see that a new VMFS volume is mounted and resignatured by the Celerra SRA. In my case this was a LUN with the ID of 128 at the ESX host:

This can be seen from the Celerra’s Manager console:


In the same window, under the Targets tab – Properties of the Target – LUN Mask tab, you can see that LUN 128 has been presented to the ESX hosts in the Recovery Site:

EMC Clariion and Testing Plans

During the test of a Recovery Plan you should see the snapshot on the recovery array (New Jersey) become active.

You should also see that a LUN is allocated from the Reserve LUN Pool (RLP) to hold the snapshot data.


NetApp and Testing Plans

With a NetApp FSA, when you’re testing plans you should see the ESX hosts in the Recovery Site mount the replicated NFS datastore automatically. With NFS there is no resignature process to be concerned with, as VMware’s file system (VMFS) is not in use. In the screen grab below you can see that esx3.corp.com and esx4.corp.com have both mounted a NAS datastore.

The name of the datastore is exactly the same as the name in the Protected Site location.

The NetApp filer should also create a temporary FlexClone of the volume that contains the replicated VMs. This allows the existing SnapMirror schedule to continue working during the period of the test, which means that SRM tests can be carried out during operational hours without affecting the system. Using FilerView within the NetApp FSA, you should see in Volumes >> FlexClone Volumes that a testfailoverClone is created for each volume you have set up for SnapMirror and configured for SRM, like so:


This temporary FlexClone is deleted from the NetApp FSA when the test is completed.

HP Lefthand and Testing Plans

When testing Recovery Plans with the HP Lefthand VSA, the SRA will add entries to the “servers” node within the CMC referred to as SRM_AG_1, SRM_AG_2 and so on. These servers are given access to the latest snapshot in the scheduled remote copy cycle. In the screen grab below you can see that esx3.corp.com and esx4.corp.com have been given access to the remote snapshot RMT.7.

Storage Replication Cycle Scenarios

I want us to take some time out to look at different examples or scenarios of array replication/snapshot schedules, to explain some of the “strangeness” you might occasionally see within SRM. I’m going to use the HP Lefthand VSA as an example, but this situation could affect any system that doesn’t create snapshots “on demand” but instead leverages the schedule of snapshots from an asynchronous replication cycle – one where replication is happening infrequently, say at 15-minute intervals. This situation also affects other iSCSI-based systems, for reasons I outlined earlier. Let’s take the example of a cycle that takes a snapshot every hour and retains three snapshots.


This will generate a scenario where you have a volume with three snapshots: Snapshot1, Snapshot2 and Snapshot3. Depending on the storage vendor you may at times see four snapshots, as most arrays will not delete Snapshot1 (the oldest snapshot) until the newest has been taken. During this time you would have Snapshot1, Snapshot2, Snapshot3 and Snapshot4. Once Snapshot4 has been taken, you would see Snapshot1 being purged – and the system would leave behind Snapshot2, Snapshot3 and Snapshot4. This leaves us with two different scenarios:

Scenario 1: LUN/Volume, S3, S2, S1
Scenario 2: LUN/Volume, S4, S3, S2, S1

How VMware’s SRM and your vendor’s SRA behave depends upon when a Recovery Plan is tested or run – and how that intersects with the cycle of replication.

Example 1: If SRM is run while Scenario 1 is the condition, say at 3:30 (S3 + 30 minutes), the SRA will communicate with the array, first finding the volume, then the most recent complete snapshot (S3), and configuring authentication to allow ESX to access that snapshot. When the test is done, most arrays will erase the temporary writable space that was used by SRM on that snapshot. They do this to conserve space on the recovery SAN. The SRA does not un-authenticate the ESX host from the snapshot; SRM does not currently give the SRA the information to do this.

Example 2: For argument’s sake it is now 4:01 (S4 + 1 minute) and you run your test Recovery Plan again. Once again the SRA will communicate with the array, first finding the volume, then the most recent complete snapshot – which still happens to be S3, because S4 is still copying. In this case the SRA will change nothing. S3 is already authenticated, and SRM proceeds by repeating the last test it ran. Once again, when it stops it will not un-authenticate the snapshot from ESX. You should be able to run SRM tests in fast succession as many times as you want; many of those times you’ll not actually be changing the state on the SAN if the replication cycle has not moved on yet.

Example 3: For argument’s sake it is now 4:10 (S4 + 10 minutes) and a test is run again. Again the SRA will find the most recent complete snapshot; this time S4 is authenticated, since it has completed. The ESX server now has both S4 and S3 mounted. Both were resignatured and should show as “snap-NNNNN”. SRM does not find datastores based on their names; it finds them by the exact device name which the SRA passes back to SRM. So regardless of datastore name, SRM will use the datastore it mounted for the test – S4 – and it will ignore any others, no matter how similar they are in name or content, including S3. This screen capture shows how the ESX host can see two snapshots at the same time, but renames the VMFS volume of the most recent one to be called “virtualmachines”

Note: Remember that since SRM 1.0 was released this renaming process is no longer the default behaviour, but it can be re-enabled by editing the vmware-dr.xml file or modifying the Advanced Settings dialog box. Here I can tell that the VMFS volume called “virtualmachines” is the most recent, because the HP Lefthand VSA presents the new snapshot with a higher “target” number – vmhba32:32 rather than vmhba32:31:0:1. At the end of the test, during the “clean-up” phase, SRM should resignature the volume again. It is this step which I have occasionally seen fail. The screen capture below shows a successful resignature.

Example 4: For argument’s sake, two more hours pass and it is exactly the top of the hour. In this case we have a new scenario:

Scenario 3: LUN/Volume, S6, S5, S4, S3

S6 is almost done copying and both S4 and S3 are still mounted on the ESX server. But as soon as S6 has completed the snapshot process, S3 will be deleted by the schedule. As part of the delete process it will be removed from the authentication lists for the ESX server, and iSCSI access to it will no longer be allowed. So this is when the clean-up actually occurs – and this clean-up is done by your storage array, not by SRM or the SRA. After all, the array can no longer offer ESX a snapshot which no longer exists. You could see, for a short period of time, that authentication has been removed at the storage array and ESX no longer sees S3 – but this would only be while the snapshot is actually deleting. If you monitored this process in the array management software, the snapshot in the UI would probably say “deleting” or “removing”. If you follow this to its logical conclusion: the bigger the gap you leave between one test and another, the less chance there is of an error occurring, as the clean-up by the storage array is more likely to have completed. The screen grab below shows this process in action between my two HP Lefthand VSAs:


The effect of this is not immediately apparent, but you will find that if you browse a datastore that has been de-listed in this way, although the volume/datastore label might still be present in ESX, its contents when browsed in the VI Client will be empty – which is a little disconcerting when you see it for the first time. There are really three ways to avoid this happening:

• Increase the number of snapshots you retain
• Increase the time between one snapshot and the next – so snapshots are retained for a longer period
• Temporarily pause the schedule of replication, and then resume it again when you have finished the test of the Recovery Plan

Note: In terms of the HP Lefthand VSA you can pause the schedule very easily: in the Protected Site VSA (New York), select your volume, and then select the Schedules tab.

Of course, the longer you pause your schedule, the more changes will be accruing in the storage layer – and you must remember to resume your schedule when you are finished.

Conclusion

In this section I tried to get you to the running of your Recovery Plan ASAP – in fact that’s been my intention from the very beginning, believe it or not, as I think seeing a product “in action” is the quickest way to learn it. As you saw, clicking that Test or Run button generated a great deal of activity and changes; VMware’s SRM is a very dynamic product in that respect. My hope is that you are following my configuration for real whilst you read. I know that is a big ask – so if you are not in this fortunate position I would really recommend watching the first video, which I linked to in this chapter. No matter how much I screen grab and document what happens, that’s never going to be as rich as doing it for yourself or seeing a video of the event. Secondly, I wanted to try to explain some of the “strangeness” you might see in the SRM product. It’s not strangeness at all – it’s by design: it’s how your storage layer’s cycle of replication interacts with and intersects your activity in the SRM product. Anyway, now that you are up to speed with the principles, it’s time to move away from this “Sesame Street” approach to Recovery Plans. In the next chapter you will spend most of your time with the product creating many custom plans which leverage all the features of SRM, so you can test your DR plans against each other and for different scenarios. So far this book has been about getting the SRM product to work – the next chapter is really about why your organization bought the product in the first place.


Chapter 9: Custom Recovery Plans


So far we have just accepted the default settings for the Recovery Plan. As you might expect and hope, it is possible to heavily customize the Recovery Plan. Customized Recovery Plans allow you to control the flow of the recovery process; together with customized virtual machine mappings, they allow you to completely automate the common tasks performed when invoking your DR plan. The creation of multiple Recovery Plans with different settings allows us to deal with the different causes and scenarios that trigger the use of our Recovery Site, and additionally allows us to test those plans to measure their effectiveness. With custom Recovery Plans we can control and automate a number of settings, such as:

• Power off virtual machines at the Protected Site by priority or order • Power on virtual machines at the Recovery Site by priority or order • Change the IP settings of the virtual machines • Stop the plan and issue an operator message • Stop the plan and issue a command

Additionally, in this chapter I want to delve a little deeper into SRM to discuss the consequences of such issues as:

• Creating/Renaming/Moving vCenter Objects at the Protection/Recovery Site
• Using VMware’s RDM “Raw Device Mapping” Feature
• More complicated storage scenarios where there are virtual machines with multiple virtual disks, held on multiple VMFS datastores with potential use of VMFS extents
• Adding new virtual machines
• The impact of “Cold Migration” with file relocation and “Storage VMotion”

It’s worth mentioning that whether some of these Recovery Plan settings take effect depends on whether you are merely testing your Recovery Plan or actually invoking it: some settings apply only during tests, and, more importantly, some apply only when you invoke your DR plan for real. For example, the powering down of virtual machines at the Protected Site is never included in a test of a Recovery Plan, but it is executed when you invoke your Recovery Plan for real. You can see whether a setting will take effect by looking at the Mode column in the Recovery Plan’s recovery steps window.

Here we can see that the power-off of virtual machines at the Protected Site (step 1) only happens when you run a recovery, whereas the message, clean, resume and reset processes (steps 8–9) are only carried out when you run a test. When a step is marked with neither “Recovery Only” nor “Test Only”, it is always carried out regardless of which mode you are using.


Configuring Shutdown of Protected Virtual Machines at the Protected Site

You might find it somewhat curious that this feature exists. After all, if you have decided to invoke your DR plan, isn’t this only done when everything is lost at the Protected Site? In a way you have a point: a major fire or terrorist attack may well have knocked out your primary location altogether, destroying everything in its wake. To put it bluntly, there may be nothing left to power down at the Protected Site, and indeed you may have lost all communications to that location from the Recovery Site anyway. To many this would appear to be undeniably true. So let me give you an example of a situation where a disaster occurred which didn’t result in the loss of the Protected Site – in fact it was totally untouched by the disaster – but we ended up invoking DR anyway. In this situation the power-off option makes logical sense. In 1996 I was working for a business in the UK which had its corporate HQ in the Arndale Centre in Manchester. The Arndale Centre is a very large combined shopping mall and business location in downtown Manchester. A bomb there caused widespread damage valued by insurers in the range of £411m (GBP), with reconstruction costs in the range of £1.2bn (GBP). The damage was caused by an illegally parked Ford “truck bomb” detonated by the IRA (Irish Republican Army). The bomb weighed 3300lbs and was the largest bomb exploded by the IRA at that time. You can read archive reports of the bombing on the BBC’s “On This Day” website: http://news.bbc.co.uk/onthisday/hi/dates/stories/june/15/newsid_2527000/2527009.stm

If you prefer a wiki (that excellent source of free, unbiased information!) there is a wiki page as well: http://en.wikipedia.org/wiki/1996_Manchester_City_Centre_bombing

The building was evacuated because of earlier warnings of an impending attack – a very common tactic deployed by the IRA at the time. Unfortunately, when the bomb was detonated, people behind the police security cordon were injured.


Now in the case of the business I was working for, the corporate HQ was on a floor so high up in the building that it was unaffected. In fact the Arndale Centre survived the blast, whereas other buildings near the epicentre were eventually demolished and rebuilt. The systems we had running in the Arndale Centre were unaffected, although I don’t remember now whether we had any communications to that location. I’m not intending to make any macabre claims to fame here, by the way – I was just some middle-ranking member of staff in this company, and I was working in Birmingham at the time, a long way from any real danger. It’s perhaps sobering to remember that when these events took place our first concern was not the business and our DR plan, but the safety of the members of staff who lived and worked in the area. The timing of the bomb, on a Saturday morning, was intended to cause maximum harm to civilians doing their weekly shopping. However, to be honest, another concern of a minority of the staff was whether the business could survive the situation and whether we were going to get paid at the end of the month – the payroll systems were housed at the Arndale Centre as well. Nonetheless, as you might expect, the whole location became a crime scene pretty quickly after the emergency services had done their sterling work to rescue survivors. Even though our systems were working in Manchester we could not access them, and the office facilities where the scheduling office and training rooms were housed were locked down whilst the appropriate agencies carried out forensic research around the immediate blast site. In this context the feature makes sense: we would not want to bring up recovery systems that might have the same NetBIOS names and IP addresses as those in the Protected Site running at the same time. This would, or could, create enough conflicts to stop us pursuing our recovery or stop our end users from getting the services they required. Additionally, when we invoked DR we would need to handle the clients’ systems, so they were redirected to systems at the Recovery Site which were within the scope of our management. Our other concern was to avoid conflicts if the affected site was brought back online again. You could apply this example to any “planned” invocation of the DR plan – such as a major power outage or flood which may not cause direct damage to the Protected Site but is intrusive enough to normal business operations to be regarded as warranting a DR plan. In terms of SRM, the product will power down virtual machines in the following order: Low, Normal, High. This is directly opposite to the way recovery virtual machines are powered on, as the screen grab shows.

By default, shutdown of protected virtual machines only happens when you trigger your Recovery Plan for real – and all virtual machines are by default given “Normal” as their priority, for both protected and recovery virtual machines. For the purposes of this chapter I am going to create a new Recovery Plan.


1. In the SRM Manager window on the Recovery Site vCenter, select Recovery Plans and click the Create Recovery Plan button
2. Type in a name and description for the plan such as Complete Loss of Site – Custom Plan and click Next
3. Select your Protection Group(s), and click Next
4. Set the default Virtual Machine Response Time, and click Next
5. Select your options for the Test Networks, and click Next
6. Select any virtual machines you wish to suspend during the running of the Recovery Plan, and click Finish

Note: This should create a second plan in the list as seen below

7. Select your new plan, and click the Recovery Steps tab and expand the + next to 1. Shutdown Protected Virtual Machine at Protected Site

As you can see, by default VMs are added to the Recovery Plan in no particular order (not even alphabetically), and it is left to the administrator to re-order the VMs to reflect the relationships between them.

8. Select a virtual machine and, using the Step Up/Down icons on the toolbar, relocate and reorder the virtual machines as you see fit. The screen grab below shows my new order. There is no multiple selection option here, and you cannot drag-and-drop the VMs.

Note: This feels like a very laborious process – and there is no bulk method by which you can at least put the VMs in the correct priority group. Of course the judgement of whether to stop one VM before another will always have to be a per-virtual machine setting. Personally, I would like to see the product improved to allow us to select multiple VMs into as many categories, and with as many priorities, as we wished. I think it would also be very good if we could copy an existing Recovery Plan, so we could then modify the start-up orders in the copy. This would allow for easier experimentation with Recovery Plans, and reduce the time spent re-ordering the VMs.

Note: You might notice how I put my less critical virtual machines into the Low Priority grouping, which is executed first. The idea is that I want to engage the Recovery Plan on a less important virtual machine, and validate that SRM is working properly, before engaging the plan for my other virtual machines. This is consistent with a more “planned” invocation of the DR plan. Perhaps you know some major work is being done in the local area and you are going to be without power for some days – longer than your own power generation system can cope with, assuming you have some kind of diesel generator that allows for 1–3 days’ worth of power supply. This might trigger the use of the Recovery Plan, despite the fact you haven’t “lost” the Protected Site.

Configuring Priority/Order for Recovery Virtual Machines

Of course it is much easier to explain and justify this aspect of the Recovery Plan. Our virtual machines in the Recovery Site must be brought online in the correct order to allow multi-tier applications to work. Core infrastructure systems such as domain controllers and DNS will need to come on stream first, followed perhaps by database systems. Those database services will no doubt be using domain accounts during start-up, and without the directory service running those services are likely to fail to start. Additionally, there’s little point in bringing up front-end services such as web servers if the back-end services on which they depend are not yet functional. In fact that’s the shorthand we normally use for this concept – “service dependencies”. It is often the case that for VM3 to function, VM1 and VM2 must be running; for VM2 to work, VM1 must be started first; and so on. Of course, the exact order you need for your plan to be successful is beyond the scope of this guide – that’s something highly specific to your organisation. Without this feature configured, as you might have seen, the virtual machines are more or less powered on randomly, albeit all contained under the “normal” priority. If you follow this to its logical conclusion, you could have a Recovery Plan just for a particular business-critical application. In this case you haven’t lost the Protected Site, just a critical piece of the business infrastructure. A judgement call must be made by senior management, who must decide that the loss of this system, and the time to recover it, is too significant and too long to be left to normal recovery steps such as restoring from backup. This would require you to manage your LUNs/Volumes carefully: the business-critical application would want dedicated LUNs/Volumes, with a one-to-one mapping between the LUNs that contained, say, your e-commerce website and your SRM “Protection Groups”. This would allow for a Protection Group for Web, Citrix, File Servers and so on. The UI for configuring the priority order and start-up order for recovery virtual machines works in precisely the same way as the shutdown order of the protected virtual machines.

1. Select your new plan, and click the Recovery Steps tab and expand the + next to 5. Recover Normal Priority Virtual Machines

2. Again select a virtual machine and, using the Step Up/Down icons on the toolbar, relocate and reorder the virtual machines as you see fit. The screen grab below shows my new order

Note: Here I have expanded the + on a specific virtual machine to highlight a useful feature of SRM’s Recovery Plans. It’s possible to drill down to specific events whilst a test is running. This will allow you to follow every action line by line. As you may already have seen, the Recovery Steps tab also gives you a percentage value indicating how much time is spent on each step.

Parallel Host Start-Up Order and Normal/Low

It’s important you know that the start-up order in Normal/Low works very differently from the start-up order in High. With high priority the virtual machines are started in series: vm3 will not start before vm2, and vm2 will not start before vm1. With Normal and Low priority the virtual machines are started in order, but if you have more than one ESX host (as is the case in most environments) more than one can start at the same time. So if I had six virtual machines (vm1 through vm6) and three ESX hosts, and there were enough hosts and resources, then vm1, vm2 and vm3 would start first, followed by vm4, vm5 and vm6. SRM would not power on vm4 until it was happy that vm1, vm2 and vm3 had started properly. As a consequence there is a less strict start-up order with Normal/Low, but it does allow for a quicker start-up of the virtual machines. After all, if every virtual machine was started serially rather than simultaneously, those people with very large numbers of virtual machines would have to wait a very long time to get them all up and running.

Adding Message Steps

It is possible to interrupt the flow of a Recovery Plan to send a message to the operator. If you have been following this guide as a tutorial you will have seen this already. By default, every Recovery Plan contains a built-in message which stops a test from proceeding straight to the “clean-up” phase. In this case the message is intended to give the operator an opportunity to review the results of the test and confirm/diagnose the configuration.

It’s possible to add our own messages to a custom Recovery Plan. In my case I would like a message to appear between the High and Normal priority groups, after all my primary virtual machines have been powered on – these are the ones labelled with a number 1, such as dc-1 and ctx-1 – asking me to confirm that the primaries are up and functioning before the other VMs are powered on.

Note: Messages are always added above the selected step in the Recovery Plan

1. In the Recovery Plan select + 5. Recover Normal Priority Virtual Machines
2. Next click the Add Message Step Icon


Note: You can also right-click and choose Add Message

3. In the Add Message Step dialog box, type in your message and then click OK

Note: This message should be added to the list of steps and should cause a renumbering of all the steps in the Recovery Plan.

Note: It is possible to insert messages and commands on the properties of each virtual machine. In the virtual machine tab of the Recovery Plan, each virtual machine can be edited, and per-virtual machine messages can be added.


Adding Command Steps

As with messages, it’s possible to add commands to the Recovery Plan. These commands could call scripts in a .bat, .cmd, .vbs, PowerShell or Perl format to automate other tasks. When you call these scripts you must provide the full path to the script engine and the script file in question. For example, to run a Microsoft .BAT or .CMD file you would supply:

C:\Windows\System32\cmd.exe /c c:\alarmscript.bat

These scripts are executed on the Recovery Site SRM server, and as a consequence they must be stored there. You should know that they are executed under the security context of the SRM server’s local administrator account. As a test I used the Microsoft net send command to send a message to another system. This requires the Messenger service to be enabled on the destination system – and this is quite frequently disabled, or not installed at all, as Microsoft now regards it as insecure.

@echo off
net send 192.168.2.198 Please contact [email protected] to inform him that the first recovery has completed
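Since the Messenger service is so often unavailable, you might prefer a notification that doesn’t depend on it. Below is a minimal sketch of an email alternative using PowerShell 2.0’s Send-MailMessage cmdlet – the relay name smtp.corp.com and both addresses are hypothetical stand-ins for your own:

# notify.ps1 – emails the operator instead of using net send
# (assumes an internal SMTP relay at smtp.corp.com that accepts
# unauthenticated mail from the SRM server – a hypothetical host)
Send-MailMessage -SmtpServer smtp.corp.com -From "[email protected]" -To "[email protected]" -Subject "SRM Recovery Plan" -Body "The first recovery step has completed"

Such a .PS1 file can be called from a command step via powershell.exe, in the same way as the PowerCLI scripts shown later in this chapter.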

1. In the Recovery Plan select, in my case, + 5. Message: WARNING: Please confirm that all the High Priority VMs have started and their services are functioning correctly

2. Next click the Add Command Step Icon

3. In the Add Command Step dialog box, type in the path to your command interpreter and script file, and then click OK

Note: In my case this script runs just before my message


Adding Command Steps with PowerCLI

Although using classic-style batch files to carry out tasks is nice, it’s not an especially powerful API for manipulating and modifying the vSphere platform. If you want to write more subtle scripts you really need a more robust scripting engine. Fortunately, VMware has for some time embraced the Microsoft PowerShell environment, supplementing it with “cmdlets” that are specific to managing the VMware virtual environment. Start by downloading and installing Microsoft PowerShell to the SRM server in the Recovery Site, then download and install the VMware PowerCLI tools.

http://support.microsoft.com/kb/968929
http://www.vmware.com/support/developer/windowstoolkit/

Once you have the PowerCLI environment set up you can then start to create .PS1 scripts. One of the common questions asked on the SRM Forums is how to reduce the amount of RAM used by the VMs during the recovery process. This is because people sometimes have less powerful ESX hosts in the Recovery Site – for example, with less physical memory than the production ESX hosts in the Protected Site. Using PowerCLI we can automate the process of reducing the VMs’ RAM allocation by running .PS1 scripts before the power-on event. There are a couple of ways of doing this with PowerCLI. You could have a .PS1 script for each and every VM to reduce its memory. Below is a sample script that does just that for my VM called ctx01. It uses the Set-VM cmdlet to reduce the recovery VM’s memory allocation to 1024MB; the -Confirm:$FALSE parameter prevents the script from waiting for a human operator to confirm the change:

EXAMPLE 1:

Connect-VIServer vc4nj.corp.com -User corp\administrator -Password vmware
Set-VM ctx01 -MemoryMB "1024" -Confirm:$FALSE
Disconnect-VIServer -Server vc4nj.corp.com -Confirm:$FALSE

Of course, a .PS1 script for each and every VM would be very administratively intensive. So you might prefer to search for VMs based on their name, and make changes that affect many VMs simultaneously. For example, in the script below the Get-VM cmdlet is used to find every VM whose name starts with the text “ctx”, and the result is then “pipelined” to the Set-VM cmdlet. This would modify the memory of my VMs ctx01, ctx02 and so on:

EXAMPLE 2:

Connect-VIServer vc4nj.corp.com -User corp\administrator -Password vmware
Get-VM ctx* | Set-VM -MemoryMB "1024" -Confirm:$FALSE
Disconnect-VIServer -Server vc4nj.corp.com -Confirm:$FALSE
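Whichever example you use, it’s easy to verify the result from the same PowerCLI session before disconnecting – a one-line check, assuming the same ctx* naming convention:

# List each ctx VM and its current memory allocation in MB
Get-VM ctx* | Select-Object Name, MemoryMB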


Perhaps a more sophisticated script would not set a flat amount of memory, but would instead check the amount of memory assigned to each VM and then reduce it by a certain factor. For example, perhaps I wanted to reduce the memory assigned to all the recovered VMs by half. The script below finds the current amount of memory assigned to each VM with “ctx*” in its name, halves it, and then uses the Set-VM cmdlet to apply the new value.

EXAMPLE 3:

Connect-VIServer vc4nj.corp.com -User administrator -Password vmware
Foreach ($VM in Get-VM ctx*){
    $NewMemAmount = $VM.MemoryMB / 2
    Set-VM $VM -MemoryMB $NewMemAmount -Confirm:$FALSE
}
Disconnect-VIServer -Server vc4nj.corp.com -Confirm:$FALSE

In my case I decided to use this final method as the way of controlling the amount of memory assigned to the CTX VMs. I would like to thank Al Renouf from the UK, as he helped write this last example. In case you don’t know, Al is very handy with PowerShell generally and his Virtu-Al blog is well worth a read: http://www.virtu-al.net/

The next part is getting these .PS1 files called by SRM. I prefer not to call the .PS1 script directly; instead I create a .cmd/.bat file which calls the script at the appropriate time. This helps reduce the amount of text held within the Command Script step. By using variables in the .cmd/.bat file we can re-use it to call any number of .PS1 files held on the SRM Server.

Step 1: Create a redirect.bat file

I first came across redirect.bat whilst reading Carter Shanklin’s PowerCLI blog, which discussed using .PS1 scripts with vCenter Alarms:

http://blogs.vmware.com/vipowershell/2009/09/how-to-run-powercli-scripts-from-vcenter-alarms.html

With help from Virtu-Al’s website, I was able to come up with a .bat file that would call my .PS1 scripts. The script loads the Microsoft PowerShell environment together with the PowerShell Console file (.psc1) which allows VMware’s PowerCLI to function. The variable at the end (%1) allows any .PS1 file to be called with a single redirect.bat file.

@echo off
C:\WINDOWS\system32\windowspowershell\v1.0\powershell.exe -psc "C:\Program Files\VMware\Infrastructure\vSphere PowerCLI\vim.psc1" "& '%1'"

Step 2: Copy the redirect.bat and PowerCLI .PS1 script(s) to the Recovery SRM Server

The next stage is to copy your redirect.bat and .PS1 file(s) to a location on the Recovery SRM server. It doesn’t really matter where you place them – so long as you type the path to the script correctly when you add a command to the Recovery Plan, it should execute without an error.


In this case ctx01-ram.ps1, ctx-bulk-ram.ps1 and ctx-ram-half.ps1 represent each of the different examples discussed previously.

Step 3: Add a command to the Recovery Plan

1. In the Recovery Plan, select the Recover High Priority Virtual Machines step
2. Next click the Add Command Step button

3. Next type the full path to the command interpreter (cmd.exe), the redirect.bat file and the .PS1 file you would like to execute

Note: In this case, because the dialog box is small and the path to all the files is long, the text has “wrapped” and should read:

c:\windows\system32\cmd.exe /c c:\redirect.bat c:\ctx-ram-half.ps1

This will appear in the plan like so:


Note: The location of the command step in the plan is important. It must be called before the recovery of the high, normal or low priority VMs. If not, the .PS1 script will make changes to the placeholder .vmx files rather than the genuine VM .vmx files. Remember, during the “Prepare Storage” stage the placeholder .vmx files are unregistered from vCenter and the genuine VM .vmx files take their place; any changes made to a placeholder .vmx file will simply be ignored and replaced. You might feel uncomfortable with SRM running these scripts automatically. Remember, you could just put message steps in the Recovery Plan and run these commands manually if you wish. Additionally, you may wish to review how you authenticate your PowerShell .PS1 files to your vCenter. In my case, to keep things simple, I have the username and password as plain text in the .PS1 file. There are methods of authentication within PowerShell where this is not necessary – for example, Carter Shanklin’s blog discusses the ability to store encrypted credentials for the local system account so that a username or password need not appear in the .PS1 file.

Warning: Finally, think about the consequences of using PowerCLI to modify VMs when it comes to failback. In the case of the memory allocation you are making changes to the .vmx file of the VM – and as part of the failback process you may “flip” the replication direction, so changes accrued in the Recovery Site replicate back to the Protected Site. To stop this you would need a .PS1 script that undoes the changes made by the Recovery Plan.
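Below is a minimal sketch of such an “undo” script – assuming the Recovery Plan halved the ctx* VMs’ memory as in Example 3, and that nothing else has changed the allocation in the meantime:

Connect-VIServer vc4nj.corp.com -User corp\administrator -Password vmware
# Reverse the halving performed during recovery by doubling the allocation
Foreach ($VM in Get-VM ctx*){
    Set-VM $VM -MemoryMB ($VM.MemoryMB * 2) -Confirm:$FALSE
}
Disconnect-VIServer -Server vc4nj.corp.com -Confirm:$FALSE

You would run this against whichever vCenter owns the VMs at the time, so test it carefully before relying on it as part of a failback.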

Adding Command Steps to Call Scripts within the VM

Starting with PowerCLI 4.0 Update 1, a new cmdlet is supported which allows you to call scripts within the Guest Operating System. The new cmdlet is called Invoke-VMScript. For Invoke-VMScript to work with Windows guests, Microsoft PowerShell must already be installed in the VM. The cmdlet also supports batch scripts in Windows and bash shell scripts in Linux. By default in Windows it assumes you will want to run a PowerShell script inside the VM – it is possible, using the ScriptType parameter, to specify the type of script you want to run. Below is a syntax example where I used Invoke-VMScript to list a VM’s IP address settings directly from the Windows instance running inside the VM.

Invoke-VMScript -VM ctx01 -ScriptText "ipconfig" -HostUser root -HostPassword password -GuestUser corp\administrator -GuestPassword vmware

There are two important points here. Firstly, despite authenticating PowerCLI against vCenter, Invoke-VMScript still requires credentials on the ESX host itself to function. Secondly, you clearly need to authenticate against the Guest Operating System running inside the VM for the script to work.


Used in this simple way, Invoke-VMScript could be used to restart a service within one of the recovery VMs. First, start by creating some kind of script within the VM:

@echo off
echo Stopping the Netlogon Service…
net stop netlogon
echo Starting the Netlogon Service
net start netlogon
echo Success!!!

And then call this script with the Invoke-VMScript cmdlet:

Invoke-VMScript -VM ctx01 -ScriptText "c:\restartnetlogon.bat" -HostUser root -HostPassword password -GuestUser corp\administrator -GuestPassword vmware

You could, if you wished, develop your own scripted method of changing the IP settings of virtual machines. Before you consider this you should check out the methods that VMware has developed – but if you would like an example, here is one below. Microsoft’s netsh command has the ability to export your IP settings to and from a text file, and it also supports multiple LAN interfaces. So for example the netsh command:

netsh interface dump > c:\netcfg.txt

would create a text file called netcfg.txt containing your IP configuration. This text file can now be modified to hold your desired IP configuration. Within the VM a .bat file could be created to handle the re-import of the new IP settings:

@echo off
echo Restoring your current IP settings…
netsh exec c:\netcfg.txt
ipconfig /registerdns
echo IP Settings now updated…

This script can then be called using the Invoke-VMScript cmdlet:

Invoke-VMScript -VM ctx01 -ScriptText "c:\configureLANsettings.bat" -HostUser root -HostPassword password -GuestUser corp\administrator -GuestPassword vmware
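Note: If the cmdlet tries to interpret a batch file as PowerShell, you can state the script type explicitly with the ScriptType parameter – a sketch re-using the same (assumed) file and credentials as above:

Invoke-VMScript -VM ctx01 -ScriptType Bat -ScriptText "c:\configureLANsettings.bat" -HostUser root -HostPassword password -GuestUser corp\administrator -GuestPassword vmware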

Configure IP Address Changes for Recovery Virtual Machines

One task you may wish to automate is the change of an IP address within the virtual machine. Currently VMware’s method of achieving this is to call Microsoft Sysprep from the Guest Customization Settings part of vCenter. The important thing to note is that the usual role of this component is to deploy new virtual machines; in this case all the settings of the guest customization are ignored except the IP settings, which are the only ones applied. The downside of this approach is that every virtual machine will require its own guest customization, which is very administratively intensive. It’s well worth considering other approaches which do not require a change to each and every virtual machine’s IP configuration. Such approaches could include:

• Retaining the existing IP address – and redirecting clients by IP address • Using stretched VLANs so virtual machines remain on the same network • Allocating IP address by DHCP and Client Reservations


If you wish to use the VMware method, begin by configuring a Guest Customization Configuration for the affected virtual machines on the Recovery Site’s vCenter. VMware has improved this method by introducing a comma-separated file, edited in Microsoft Excel, which allows for a bulk administrative way of creating guest customization settings. However, it still depends on Sysprep to change the IP address of the VM, which does constitute a bottleneck in the recovery process – the mini-installation wizard is not quick, and it introduces a number of reboots to take full effect.

Warning: Remember, for this method to work you do need to copy the Sysprep files to the C:\Documents and Settings\All Users\Application Data\VMware\VMware vCenter\sysprep location. If you fail to do this, or vCenter cannot find the right version of Sysprep, you will receive the timeout error message shown in the screen grab below:

Creating a manual IP Guest Customization

1. On the Recovery Site’s vCenter
2. Click the Home location
3. Under the Management tab, select the Customization Specification Manager

4. In the Customization Specification Manager, click the New button

5. Type a friendly name such as SRM: CTX-1 IP Settings and click Next


6. Fill the dialog boxes with “dummy” information until you arrive at the “Network Interface Settings” part of the wizard, then select Custom Settings and click Next

7. Select the NIC in the list and click the ellipsis button (…)

8. Set your IP Settings as required and click OK


9. Click Next and Finish in the dialog box

Note: Once you have created one guest customization it is possible to duplicate it using the Customization Specification Manager. Once copied, you can use Edit to modify the IP address applied.

Set the Virtual Machine’s Custom Settings

The next step is to configure each virtual machine with its guest customization settings


1. At the Recovery Site vCenter, select the Recovery Plan and click the Virtual Machines tab

2. Select the virtual machine in the list, in my case ctx01, and click the Edit... button

3. In the Configure Virtual Machine dialog box, click the Browse button, and select the Guest Customization/Specification Settings you created earlier

4. In the Wizard click Next 5. Additionally, we can have a command/message execute before/after a

virtual machine has been powered on


Note: If you configure these options you will see them in the Recovery Plan under the Pre-Power On and Post-Power On options like so:

Configure Bulk IP Address Changes for Recovery Virtual Machines (DR-IP-Customizer)

In SRM 1.0 Update 1, VMware introduced a new utility called dr-ip-customizer.exe, which allows you to bulk-generate the Guest Customization Settings from a .CSV file. This is better than having to manually run through the Guest Customization wizard for each and every VM.

1. Open a command prompt on the Recovery Site SRM server
2. Then cd C:\Program Files\VMware\VMware vCenter Site Recovery Manager\bin
3. Run this command:

dr-ip-customizer -cfg ..\config\vmware-dr.xml -csv c:\nyc.csv -cmd generate

4. After the first connection, if necessary choose [Y] to trust the SRM server
5. Then provide the login details for the Recovery Site vCenter Server

The dr-ip-customizer utility should generate output like that shown below, which enumerates the number of Protection Groups (aka Shadow Groups) and the number of placeholder VMs (aka Shadow VMs) within each Protection Group

Additionally, it will create a .CSV file at the C:\ which when opened in Microsoft Excel will look something like this:


As you can see, it’s a very simple file which maps the name of the virtual machine as it is known at the Protected Site to the name of the “Shadow” VM generated by the registration of the placeholder .vmx files in the Recovery Site. The VM Name column is merely there to assist you in mapping these “Shadow” VMs to the actual VMs contained in the Recovery Plan. The column called Adapter ID is used to control how settings are applied. If Adapter ID is set to 0, it acts as a global setting which is applied to ALL LANs within the VM. This setting cannot be used to change the IP address of a LAN, but can instead be used to globally set values that would be the same for all network interfaces, such as the DNS configuration. If a VM has multiple NICs, it’s possible to specify each NIC by its number (1, 2, 3) and apply different settings to it. For example, suppose you wished to give two different NIC interfaces two different IP settings, together with more than one default gateway and DNS server – you would simply add an additional row for each VM:

VM ID,VM Name,Adapter ID,IP Address,Subnet Mask,Gateway,DNS Servers
shadow-vm-70468,ctx01,1,192.168.4.31,255.255.255.0,192.168.4.1,192.168.4.130
shadow-vm-70468,ctx01,1,,,192.168.4.2,192.168.4.131
shadow-vm-70468,ctx01,2,172.168.4.31,255.255.255.0,172.168.4.1,172.168.4.130
shadow-vm-70468,ctx01,2,,,172.168.4.2,172.168.4.131

If all your VMs have a single NIC and you just need to re-IP them then you could use a much simpler configuration like so:


In this case I set the Adapter ID column to 1 for every VM. I then used the fill-series feature of Microsoft Excel to generate a unique IP address for each VM – and so on.
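As the screen grab doesn’t reproduce in text, here is an illustrative fragment of such a single-NIC file – the shadow-vm IDs other than ctx01’s are hypothetical:

VM ID,VM Name,Adapter ID,IP Address,Subnet Mask,Gateway,DNS Servers
shadow-vm-70468,ctx01,1,192.168.4.31,255.255.255.0,192.168.4.1,192.168.4.130
shadow-vm-70469,ctx02,1,192.168.4.32,255.255.255.0,192.168.4.1,192.168.4.130
shadow-vm-70470,ctx03,1,192.168.4.33,255.255.255.0,192.168.4.1,192.168.4.130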

6. To process the .CSV file you would use the following command:

dr-ip-customizer -cfg ..\config\vmware-dr.xml -csv c:\nyc.csv -cmd create

7. Provide the username and password to the vCenter, and accept the certificates of both the vCenter and SRM hosts if necessary

Note: The create command generates the Guest Customizations for you based on the contents of the .CSV file. The dr-ip-customizer utility also supports a -cmd drop parameter, which removes guest customizations from vCenter, and a -cmd recreate parameter, which applies any changes made to the .CSV file after the use of create and can be used to reconfigure existing settings. After running this command you should see it complete by creating the “shadow” guest customization settings:

Additionally, you should see in the Customization Specification Manager window that a whole new list of customizations has been added, like so:


TIP: You can edit these settings to double-check that the output is as you expect – but you should not make any changes. I did this a few times while writing this to confirm my .CSV was formatted correctly.

Important: Do not manually delete these Guest Customization entries in vCenter. Instead, if you need to remove them, use the drop command of the dr-ip-customizer utility – if you don’t, you would have to manually re-assign each and every one by hand. Using dr-ip-customizer removes the association of the Guest Customization to the “shadow” VM and then deletes the Guest Customization in vCenter:

dr-ip-customizer -cfg ..\config\vmware-dr.xml -csv c:\nyc.csv -cmd drop

Warning: Remember, as with scripts which change the configuration of the VM, when you come to flip the replication paths prior to failback you would need a similar re-IP configuration to reset the VMs back to their original IP settings.

Customized VM Mappings

As you might remember, “Inventory Mappings” are optional but incredibly useful, because without them you would have to map networks, resource pools and folders on a per-virtual machine basis. Occasionally a virtual machine will fail to be added to the Recovery Site because SRM cannot map the virtual machine to a valid network, folder or resource pool – alternatively, because you haven’t configured an inventory map at all, you will have to configure customized virtual machine mappings yourself. VMs like this are flagged with the status message “Mapping Missing”. This is a very common error, and it’s usually caused by a VM’s network settings being changed to something not included in the inventory mapping. You should really resolve these errors from the Inventory Mappings location first, unless you have a VM like the one below which is unique and needs an individual per-VM mapping configured for it.

1. In SRM select the Protection Group and click the Virtual Machines Tab

2. Select the affected virtual machine and click the Configure Protection button

3. In the Edit Virtual Machines wizard, select a folder location for the VM


Note: Notice how you can override the default settings from the inventory mappings, because this virtual machine falls outside the scope of the inventory mapping settings. It’s perhaps worth making it 100% clear that if the virtual machine’s settings are covered by the “Inventory Mappings”, this dialog box will in the main be disabled. Remember, it is not currently the case that “Inventory Mappings” represent a “global rule” with these settings allowing “exceptions” to that rule, as you would find in VMware HA or DRS. This functionality is purely reserved for VMs for which SRM cannot find a suitable inventory mapping.

4. In the Edit Virtual Machines wizard, select an ESX host or Cluster for this VM

Note: In this interface if you have a “Fully Automated” DRS cluster as I have, you will be unable to specify a specific ESX host. Instead you will only be able to select which cluster the virtual machine will reside on, and DRS will decide which ESX host is used at power on – as you might know, this is a feature called “initial placement” in DRS.

5. In the Edit Virtual Machines wizard, select a Resource Pool for this virtual machine

6. In the Edit Virtual Machines wizard, select a Network for this virtual machine


Note: In this case the Recovery Network column for this virtual machine was blank. This was a good initial indication of the source of my problem: the Protection Group didn’t know how to map the virtual machine’s primary network (vlan10) to the correct network at the Recovery Site, because it was not included in the “Inventory Mappings”.

Note: The recovery priority option controls where the virtual machine will be placed in the Recovery Plan. If you choose Normal, for instance, it will put the virtual machine into the normal category for powering off virtual machines at the Protected Site, and into the normal category for the power-on of virtual machines at the Recovery Site.

Managing Changes at the Protection Site

As you might be beginning to see, SRM is going to need continual management and maintenance. As your Protected (production) Site is constantly changing day-to-day, maintenance is required to keep the Protected and Recovery Sites properly synchronized. One of the primary maintenance tasks is making sure that newly created virtual machines that require protection are properly covered by one or more of your Recovery Plans. Simply creating a virtual machine and storing it on a replicated VMFS volume does not automatically enrol it in your Recovery Plan. After all, not all VMs may need protection. If you follow this fact to its logical conclusion you could ask the question – why create a new virtual machine on a replicated VMFS volume if you don't require protection for it? However, in previous releases it was impossible to guide or restrict a user to selecting only certain VMFS volumes when creating a new virtual machine. There was a risk that a user could unintentionally place a VM on a replicated VMFS volume when they shouldn't have, and equally there was a distinct possibility that they could store a new VM that needed protection on an unprotected volume. In vSphere4 it is now possible to set permissions on datastores within folders – so it is possible to guide the user into creating VMs in the right locations.
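As a sketch of what that permissions approach could look like in PowerCLI – the folder name, group and custom role below are entirely hypothetical, and the custom role would need to contain the "Datastore > Allocate Space" privilege:

# Allow a group to deploy VMs only to datastores held in a designated folder (names are examples, not defaults)
New-VIPermission -Entity (Get-Folder "NonReplicatedDatastores") -Principal "CORP\VM-Creators" -Role "VM-Provisioning" -Propagate:$true

By scoping the permission to a datastore folder rather than the whole inventory, users simply never see the replicated volumes as a valid target when they create new VMs.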


Creating and Protecting New Virtual Machines

You might wrongly assume that when you create a new virtual machine – so long as it is created on the right replicated storage, in the right resource pool, in the right folder and on the right network – it would automatically be picked up by SRM and protected by default. However, this is not the case. Whilst creating a new virtual machine on a replicated VMFS volume should ensure the files of the virtual machine are at least duplicated at the Recovery Site, a new virtual machine is not automatically enrolled in the virtual machine Protection Group defined at the Protected Site. You can see this if you create a new virtual machine, as I have done, within the same locations covered by the inventory map.

This behaviour is not unlike the error we saw earlier, where the inventory mapping fails to map the virtual machine's network, resource pool or folder correctly. It is very easy to fix.

1. At the Protection Site, select the Virtual Machine Protection Group, and select the virtual machine which is currently not protected – in my case this is the web-3 virtual machine

2. Next, click the Configure Protection button


Note: So long as the VM's settings are covered by the inventory mappings, the protection should complete automatically without further interaction. If, however, the VM's settings fall outside the inventory mappings, you will be confronted with a number of dialog boxes to manually set the location of the VM from a cluster, folder or resource pool perspective. The Configure All button allows you to protect multiple new VMs at once – and both methods will add the virtual machine to the Recovery Site's inventory. The VM will be enrolled in every Recovery Plan that is configured for the Protection Group in which the VM resides. As you can see below, ctx03 is a new virtual machine and now has a placeholder file held in the Citrix resource pool; it is also listed in the Normal priority category of the Recovery Plans that use the Protection Group with which it is associated.


So remember, simply "protecting" new VMs is not the end of the task – the next stage is ensuring that the VMs are correctly ordered in the Recovery Plan, and that any additional settings such as command scripts and messages are set correctly.
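A quick way to audit this is to list everything that currently lives on a replicated datastore, and compare it against what the Protection Group shows as protected. A minimal PowerCLI sketch, reusing the datastore name from my lab:

# List all VMs whose files live on the replicated volume
Get-VM -Datastore (Get-Datastore "emc-clariion-virtualmachines") | Select-Object Name

Anything in this list that is not marked as protected in the Protection Group is a candidate for the Configure Protection step above.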

Renaming and Moving vCenter Inventory Objects

Despite the use of vSphere4's new "linked mode" feature, you can see the SRM product is very much dependent on the operator correctly pairing and then mapping two separate vCenter inventories. These two vCenters do not share a common data source or database. So you might be legitimately concerned about what happens if vCenter inventory objects in either the Protected or Recovery Site are renamed or relocated. This has been a problem in some other management add-ons from VMware in the past – a notable example is VMware View: http://www.rtfm-ed.co.uk/?p=1463


There are some rules and regulations regarding renaming various objects in vCenter. In the main, renaming or creating new objects will not necessarily "break" the inventory mappings configured earlier, because the mappings actually point to Managed Object Reference (MOREF) values. Every object in the vCenter inventory is stamped with a MOREF value. You can consider these like SIDs in Active Directory – renaming an object in vCenter does not change the object's MOREF value. The only exceptions to this are port groups on Standard vSwitches, which are not allocated a vCenter MOREF; in fact their configuration and identifiers are not held by vCenter, but by the ESX host. If we examine the scenarios below we can see the effect of renaming objects in vCenter:

• Renaming Virtual Machines
Not a serious problem. Protection Groups are updated to the new VM name, as are Recovery Plans. However, placeholder/shadow virtual machine references do not automatically update. Simply waiting for the next cycle of replication or re-running the Recovery Plan does not update the placeholder/shadow virtual machine. Fortunately, this does not stop your Recovery Plan from working. I found the only way to fix this issue was to un-protect and re-protect the virtual machine. This is NOT a desirable method, as it means you lose your Recovery Plan customizations.

• Renaming DataCenters, Clusters, Folders in the Protected Site
Not a problem. The Inventory Mappings window automatically refreshes.

• Renaming Resource Pools in the Protected Site
Not a problem. The Inventory Mappings window automatically refreshes.

• Renaming Virtual Switch Port Groups in the Protected Site
Much depends on whether you are an Enterprise Plus customer with access to Distributed vSwitches. If you are, then there are no problems to report: all the VMs are automatically updated with the new Port Group name, and the inventory mappings remain in place. It's a very different story if you're using Standard vSwitches. A rename will cause all the VMs affected to "lose" the inventory mapping. The VMs do remain "protected" – and no unpleasant yellow warning message will appear on the Protection Group. As you can see in the screen grab below, there is no yellow exclamation mark next to the VMs.


This is a bad outcome because someone could rename Port Groups on ESX hosts without understanding the consequences for the SRM implementation. Without a correct inventory mapping, the Recovery Plans would execute, but they would fail for every VM that lacked a network mapping. This would create an error message when the Recovery Plan was tested or run which states: "Error: Network device needed by recovered virtual machine couldn't be found at recovery or test time." So, put very simply, renaming Port Groups on a Standard vSwitch should be avoided at all costs! If you have renamed Port Groups after configuring the Inventory Mappings, two main corrective actions need to take place to resolve the problem. Firstly, a refresh of the inventory mappings is required. This is a relatively simple task of revisiting the inventory mappings at the Protected Site, looking for the renamed Port Group(s) and establishing a new relationship. In the screen grab below the renamed Port Group (vlan_11) has no association with a Port Group in the Recovery Site.

Secondly, as you might know already, if you rename Port Groups on Standard vSwitches, the virtual machines configured at the Protected Site become "orphaned" from the port group. This "orphaning" can be seen in the screen grab below. To create this example I simply renamed the Port Group called "vlan11" to "vlan_11":


Note: This "orphaning" of the virtual machine from the virtual switch port group has been a "feature" of VMware for some time, and is not specifically an SRM issue. However, it does have a significant effect on SRM. It can cause the Protection Group process to fail – the process which creates the placeholder/shadow virtual machines at the Recovery Site. Correcting this for each and every virtual machine using the vSphere Client is very laborious. You can automate this process with a little bit of scripting, using the PowerCLI from VMware. Example:

# Find every network adapter still pointing at the old Port Group name and re-point it at the new one
Get-VM | Get-NetworkAdapter | Where-Object {$_.NetworkName -eq "vlan11"} | Set-NetworkAdapter -NetworkName "vlan_11" -Confirm:$false

Managing Changes at the Recovery Site

• Renaming DataCenters, Clusters, Folders in the Recovery Site

Not a problem. Inventory mappings window automatically refreshes.

• Renaming Resource Pools in the Recovery Site
Not a problem. Inventory mappings window automatically refreshes.

• Renaming Virtual Switch Port Groups in the Recovery Site
Once again, I found there were no issues with renaming Port Groups at the Recovery Site if you are using Distributed vSwitches: any rename of a Distributed vSwitch Port Group at the Recovery Site is reflected in the inventory mapping at the Protected Site. However, the same situation presents itself with Standard vSwitches in the Recovery Site as it did in the Production Site. I renamed my Port Groups from vlan50-54 to vlan60-64, but no manner of refreshes or restarts updated the inventory mappings window at the Production Site. The mapping window switched to "None Selected", and the only resolution was to manually remap the port groups.

Other Changes in the vSphere and SRM Environment

In my experience there are other changes that can take place in vSphere and SRM which cause the relationships we configure in SRM to break. For example, I've found that renaming VMFS volumes at the Protected Site which are covered by replication


cycles can cause issues. What can happen is that you rename a VMFS volume at the Protected Site before the change has been carried over by the next replication cycle – and then a test is run. The test fails because it expects to see the new name, yet it is still being presented with the old name at the Recovery Site. Symptoms include "File not found" error messages when the test plan is executed – rather worryingly, you can find your replicated VMFS volume appears empty! The solution I found was simply to wait until your VMFS rename had reached the Recovery Site array – in other words, wait for the next cycle of replication – and the problem went away. This issue does not affect synchronous replication, where any change to the VMFS volume is synched immediately to the Recovery Site. Setting the renaming of VMFS volumes to one side, it's worth saying that the stages of configuring SRM happen in a specific order for a reason – each stage has a dependency on the previous stage. The order of configuring SRM is the following:

1. Pair the Sites
2. Array Manager
3. Inventory Mappings
4. Protection Groups
5. Create Recovery Plan

Say, for example, you remove your Protection Group – what happens is the Recovery Plan(s) have references to a Protection Group that doesn't exist. If you create a new Protection Group, you then have to go manually to the Recovery Plan(s) and configure them to use the correct Protection Group. As deleting and recreating configurations is a very popular (if unsophisticated) way of "fixing" problems in IT generally, you must be very careful – you must understand the implications of deleting and recreating components. For example, say you delete and recreate a Protection Group, and then tell your Recovery Plan to use it – what you will discover is that all the priority/order settings in the plan are lost and reset to the default. You will find all the virtual machines re-housed into the Normal category for both the power down and power on of virtual machines. This is deeply annoying if you have spent some time getting all your virtual machines to power on at the right time in the right order. As you can tell by my tone – I've found this out through my own bitter experience!

Lastly, a word of caution – as we have seen, most changes that take place can be accommodated by SRM. However, there is currently one significant attribute of a Protected/Production virtual machine that is not propagated to the Recovery Site. If you increase or decrease the amount of memory allocated to a virtual machine after it has been covered by a Protection Group, the only way (currently) to update the Recovery Site is to remove the protection of the affected virtual machine and re-protect it – this causes the destruction of the virtual machine "placeholder" VMX at the Recovery Site, and its re-creation. The mismatch between the real VMX file and the placeholder is not technically significant, and is largely a cosmetic irritation: when the plan is tested, the amount of memory allocated to the VM at the Protected Site will be used. As we saw earlier, if you do want the recovery VMs to have different settings, you are better off using PowerCLI to make those modifications at the point of recovery.

Moral of the story: view the casual removal of inventory mappings and Protection Groups with extreme caution. What I hope is that future versions of SRM will have an export and import feature, which would allow you to back up your Recovery Plans separately from the SQL database within which they are stored. This development would also allow us to create Recovery Plans during the failback process, rather than creating them manually.


Creating New VMs on New Networks and on New Storage

As your organisation grows and changes at the Protected Site, SRM will again need updating to be made aware of these changes – if, for example, a new network or new VLAN is created at the Protected Site and populated with new virtual machines. In this case the SRM at the Protected Site will need reconfiguring for these changes. Where this shows itself particularly acutely is in the Inventory Mappings part of SRM. Additionally, as the Protected Site grows, so will its storage requirements – creating new LUNs/Volumes and replicating them to the Recovery Site will be likely. As a consequence, the Array Manager configuration will need to be refreshed to ensure that SRM is aware of these new storage units. In the example below, I created a new VLAN called VLAN15 – and a whole new series of virtual machines called db03, file03 and mail03, which were then patched into the new VLAN. The SRM administrators at the Protected and Recovery Sites were assigned the task of making sure they were protected. Additionally, it was identified that the existing VMFS volume was reaching saturation point, both in terms of I/O and free space. So a new LUN/Volume was provisioned, and the storage team (that's me by the way!) ensured it was replicated to the Recovery Site. In this case I created a new EMC MirrorView LUN on my Clariion:

As you might imagine the current configuration I have for SRM does absolutely nothing for these virtual machines. These newly created VMs are not included in the existing Array Manager, Inventory Mapping or Protection Group configuration – and as a consequence they are not covered by a Recovery Plan.


Update the Array Manager

In my experience, simply creating a new LUN/Volume on a storage array that is replicated to another location is not enough for the Array Manager part of SRM to update. It is not the case that SRM or the SRA rescans the array at an interval to discover new replicated LUNs or Volumes. This makes sense, as most SRAs are just script files. Also, in my experience, sometimes just doing a simple rescan isn't enough either. This may be a "feature" of either SRM or the vendor's SRA, and may change at a later date.

1. At the Protection Site, open SRM
2. Click the Configure link next to the Array Manager

3. Select the entry for the Protection Site Array Manager and choose Edit

4. Type in the username and password used to authenticate to the storage array, click Connect and then, after the "wheel-of-hell" has completed the connection, click Next


Note: After clicking OK to this dialog box, the “Device Count” field will increment by the number of new LUNs/Volumes that have been created.

Continue with the Array Managers wizard, and repeat this task for the Recovery Array Manager

5. In the last dialog box click the Refresh Array button – this should refresh the storage system and show the new LUN/Volume. Below you can see the datastore group for the EMC Clariion now shows two VMFS volumes being replicated – one called emc-clariion-virtualmachines and the other called emc-clariion-newvirtualmachines.


Note: In most cases your Protection Group will refresh accordingly – much depends on whether, or how, you use what are sometimes called "consistency groups". Consistency groups are used on arrays to make multiple LUNs/Volumes replicate in the same state (synchronous or asynchronous) and at the same rate. This ensures the data is held in a consistent state between volumes. There are several scenarios that can occur once you are dealing with multiple volumes:

Scenario 1: If the array you're working on doesn't support this feature, then you will need to create a Protection Group for each LUN or Volume.

Scenario 2: If your array supports consistency groups but you don't use them, you will need to create a Protection Group for each LUN or Volume. You should confirm that this configuration is supported by the storage vendor's SRA.

Scenario 3: If you create a consistency group, and then place more than one of the LUNs/Volumes used by SRM into it, the Protection Group will refer to multiple VMFS volumes and enforce the policy set in the storage array, like so:

Update Inventory Mappings

If you are creating brand new VMs on a brand new network, you might find that they are missing an inventory mapping. Clearly, there must be a corresponding network defined at the Recovery Site. Without the correct inventory mapping you will see the "Mapping Missing" warning on the VMs within the Protection Group.


Updating your inventory mappings depends on the settings you originally had, and how they have altered. For example, in my case only the network settings changed – I did not create any new resource pools or folders.

Updating Recovery Plans

Now that we have finished updating the Protection Site configuration, it’s time to turn our attention to the Recovery Plans. These new VMs will simply be dumped into the “normal” category for both shutdown and recovery events.


Note: This configuration is getting closer and closer to something resembling the real world. In reality you are likely to have many virtual machines stored across many VMFS volumes. After all, one of the VMware recommendations is to distribute virtual disks across many VMFS volumes/LUNs, in order to spread disk I/O and avoid saturating one LUN/Volume with excessive reads/writes. Of course, very careful planning is going to have to take place in setting up the replication and Protection Groups – to ensure that ALL the files that make up a virtual machine are being replicated and included in the Recovery Plan. After all, a half-complete virtual machine is not going to be of much use to anyone in the event of a disaster.

Storage VMotion and Protection Groups

VMware Virtual Infrastructure 3.5 introduced a new feature called "Storage VMotion". This allows you to relocate the files of a virtual machine from one datastore to another whilst the virtual machine is running, regardless of storage type (NFS, iSCSI, SAN) and vendor. Back then, Storage VMotion was carried out using a script in the Remote CLI tools downloadable from VMware's website – with vSphere4 you can simply drag-and-drop and run through a wizard to complete the process. Storage VMotion can and does have implications for VMware Site Recovery Manager. Frequently, I found that after a VM had been moved into or out of the scope of a Protection Group, the Storage VMotion process did not automatically force a "Recompute Datastore Groups" event; additionally, I found I had to use the "Rescan Arrays" option from within the Array Manager wizard to force an update:


Basically, there are three scenarios:

• Scenario 1: Virtual machine is moved from non-replicated storage to replicated storage, effectively joining a Protection Group

• Scenario 2: Virtual machine is moved from replicated storage, to non-replicated storage, effectively leaving a Protection Group – and as such is no longer covered by SRM

• Scenario 3: Virtual machine is moved from one replicated storage location to another, effectively moving out of the scope of one Protection Group and into the scope of another Protection Group

Let me explain and show you what happens in each of the scenarios. Scenario 1 is very straightforward – it's as if a brand new virtual machine has just been created. The Protection Group displays a yellow exclamation mark indicating the virtual machine has not been configured for protection. With Scenario 2, the outcome is less than neat. Moving a virtual machine from a replicated LUN/Volume to non-replicated storage can result in an error message in the Events tab, and the virtual machine being listed in the Protection Group as "invalid".


The "solution" to this issue is to select the VM and choose the "Remove Protection" option. With Scenario 3, I have seen similar "invalid" error messages. When a virtual machine moves from one Protection Group to another, it should be "cleaned out" of the old Protection Group. In my case I moved the VM called ss02 with Storage VMotion into the Celerra Protection Group. There was a yellow exclamation mark on the NetApp Protection Group because ss02 was still listed there, and I would be unable to protect the ss02 VM with the Celerra Protection Group.

If I did try to protect the ss02 VM, I would receive an error message like so:

Generally, if the Protection Group does not re-compute the storage properly, selecting an "invalid" virtual machine and clicking the "Remove Protection" option will fix it. In the case of Scenario 3, I found I had to "Remove Protection" from the "invalid" virtual machine in the NetApp Protection Group before I could "Configure Protection" in the Celerra Protection Group. Although on the surface these issues with Storage VMotion seem a trivial irritation, it's important to consider that every time you remove protection from the afflicted VM, this removes the placeholder VMX file from the Recovery Site – which means the VM is removed from all of your Recovery Plans. When it is re-protected it is regarded as "new" and will unceremoniously be dumped back into your Recovery Plans at the default location.
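Incidentally, the Storage VMotion used in Scenario 3 can itself be scripted with PowerCLI. This is a sketch only – the target datastore name is hypothetical – and, as noted above, you should expect to rescan the arrays and tidy up the Protection Groups afterwards:

# Relocate a running VM's files to another (replicated) datastore
Move-VM -VM (Get-VM "ss02") -Datastore (Get-Datastore "celerra-virtualmachines")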

Virtual Machines Stored on Multiple VMFS Datastores

Of course it is entirely possible to store a virtual machine's files on more than one VMFS datastore. In fact, if you know VMware well, you will know this is actually a recommendation from VMware. By storing our boot VMDK, log VMDK and data VMDK files on different LUNs we can improve disk I/O substantially, by reducing the disk contention that could otherwise take place. Even the most disk-intensive virtual machine could be stored on a LUN of its own, and as such it would not face any competition for I/O at the spindle level. You will be pleased to know that SRM does support a multiple-disk configuration – so long as all the datastores which the virtual machine is using are replicated to the Recovery Site. The VMFS volume location of a virtual disk is controlled when it is added to the virtual machine using the "Add" wizard.
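For what it's worth, the same placement decision can be made from PowerCLI rather than the "Add" wizard. A minimal sketch, reusing names from my lab (the 10GB size is arbitrary):

# Add a 10GB data disk to mail01, deliberately placing it on the dedicated data volume
New-HardDisk -VM (Get-VM "mail01") -Datastore (Get-Datastore "emc-clariion-datavirtualmachines") -CapacityKB 10485760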


These virtual disks appear seamlessly to the guest operating system, so from Windows or another supported guest operating system it is impossible to see where these virtual disks are located physically. In the situation above I added another volume to the EMC Clariion called "emc-clariion-datavirtualmachines", added it to the volume list, rescanned my ESX hosts and started to put virtual machine data disks on it, as shown below:

All I did after that was make sure that the "datavirtualmachines" VMFS volume was scheduled to replicate at exactly the same interval as my "virtualmachines" volume – I did this by using "Consistency Groups" on the EMC Clariion. If you have existing Protection Groups, these will automatically be updated to reflect the fact that virtual machines are utilizing multiple datastores. In the screen grab below you can see my "Clariion Protection Group" has been updated to reflect that there are virtual machines on the [emc-clariion-virtualmachines] VMFS volume that also have VMDK files stored on [emc-clariion-datavirtualmachines]. If you are creating a brand new Protection Group, you will see both VMFS volumes included if virtual machines are using more than one VMFS datastore – like so:


This looks very similar to the situation where both VMFS volumes have been placed into the same consistency group.

Virtual Machines with Raw Device/Disk Mappings

In this guide I began with one single VMFS volume and LUN at the Protected Site. Clearly this is a very simplistic configuration – deliberately chosen to keep our focus on the SRM product. I now want to delve into more advanced configurations, such as VMware's RDM feature and multiple-disk configurations, which more closely reflect the real-world usage of a range of VMware technologies. At this point I faced a little bit of a dilemma. Should I repeat the storage section all over again – to show you the process of creating a LUN/Volume in the storage array, configuring replication, and presenting it to the ESX hosts? Following on from this, should I also document the process of adding an RDM to a virtual machine? In the end I figured that if you as the reader have got this far in the guide, you should be able to double back to the storage section of this guide and do that on your own. For example, I added an RDM to my Lefthand Networks VSA:


This RDM was then added to the mail01 virtual machine. The thing to notice in this screen grab of the Manage Paths dialog on the virtual machine is the vmhba syntax of the RDM on the protected VM – it says the path is vmhba34:C0:T2:L0.

I want to concentrate on the specific issues of how SRM handles this addition of new storage to the system, and how it handles the RDM feature. After creating the new Volume/LUN, configuring replication, and adding the RDM to the virtual machine, the next stage is to make sure the Array Manager has correctly discovered the new RDM. It is worth stating two critical facts about RDMs and SRM. Firstly, the RDM mapping file itself must be stored on a VMFS volume on a replicated LUN. If it isn't, there simply won't be an RDM mapping file available at the Recovery Site for the recovery VM to use. Secondly, SRM resolves hardware issues with RDMs. RDM mapping files hold three main hardware values – a Channel ID (only used by iSCSI arrays), a Target ID and a LUN ID. These values held within the mapping file are likely to be totally different at the Recovery Site array. SRM fixes these references so the virtual machine is still bootable – and you can still get to your data. If you were not using SRM and were carrying out your Recovery Plan manually, you would have to remove the RDM mapping file and re-add it to the recovery virtual machine. Without this, when the replicated virtual machine was powered on it would point to the wrong vmhba path.
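If you prefer to script the addition of an RDM, PowerCLI can do this too. This is a sketch only – the device name is a placeholder you would swap for the real canonical path of the replicated LUN – and note how the -Datastore parameter is used to deliberately keep the mapping file on a replicated VMFS volume, per the first fact above:

# Add a physical-mode RDM to mail01, storing the mapping file on a replicated datastore
New-HardDisk -VM (Get-VM "mail01") -DiskType RawPhysical -DeviceName "/vmfs/devices/disks/vml.<device-id>" -Datastore (Get-Datastore "emc-clariion-virtualmachines")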


Notice in this screen grab from a "recovery" virtual machine that the mapping has been "corrected" by SRM to point to the right vmhba syntax of vmhba32:C0:T5:L0. If a new volume is created, whether it is an RDM volume or a VMFS volume, it is important to rescan the Array Manager configuration at the Protected Site to make sure it is discovered by SRM and the SRA.

1. Logon with the vSphere Client to the Protected Site's vCenter
2. Click the Site Recovery icon
3. In the Summary tab, in the Protection Setup pane, click the Configure link next to the Array Managers option
4. Click Next
5. Click Next
6. In the Review Replicated Datastores dialog box, click the Rescan Arrays button

Note: In SRM's Array Manager wizard you should see the replicated RDM appear in the list like so:

Note: You might want to know what happens if you create a new virtual machine which contains an RDM mapping to a LUN that is not replicated, or a VMDK on a volume that is not replicated. If you try to protect that virtual machine, SRM will realise that you're trying to protect a virtual machine which has access to a LUN/Volume which


is inaccessible at the Recovery Site. When you try to add such a virtual machine to the Protection Group it will fail with this error message:

When you try to protect the VM, a wizard will run to allow you to deal with the portion of the VM that cannot be protected. Ideally, you should resolve the reason why the RDM or VMDK is not replicated – but the wizard does allow you to work around the problem by detaching the VMDK during the execution of the Recovery Plan:


Multiple Protection Groups and Multiple Recovery Plans

This section is quite short, but it may be the most important one to you. Now you have a very good idea of all the components of SRM, it's time for me to show you what a very popular configuration might look like in the real world. It is perfectly possible – in fact, I would say highly desirable – to have many Protection Groups and Recovery Plans. If you recall, a Protection Group is intimately related to the LUNs/Volumes you are replicating. One model for this, suggested earlier in the book, is grouping your LUNs/Volumes by application usage so they can in turn easily be selected by an SRM Protection Group. I've set up such a situation in my lab environment to give you an insight into how such a configuration would look and feel. I'm not intending that you reproduce this configuration if you have been following this book in a step-by-step manner – it's just there to give you a feel for how a "production" SRM configuration might look and feel.

Multiple DataStores

In the real world you are likely to put your virtual machines on different datastores, to reflect the fact that those LUNs/Volumes represent different numbers of disk spindles and RAID levels. To reflect this type of configuration, I very simply created six volumes called citrix, db, file servers, mail, view and virtual desktops in the storage array – in the example below, the EMC Clariion.


Volumes Formatted with VMFS

In vCenter I then rescanned each of my ESX hosts, and proceeded to format these volumes with VMFS, using volume and datastore names that reflected their functionality.
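Creating and formatting a batch of volumes like this is also easily scripted. A hedged PowerCLI sketch – the host name and the canonical LUN name are placeholders you would substitute for your own:

# Rescan the HBAs so the new LUN is visible, then format it as a VMFS datastore
Get-VMHostStorage -VMHost (Get-VMHost "esx1.corp.com") -RescanAllHba
New-Datastore -Vmfs -VMHost (Get-VMHost "esx1.corp.com") -Name "citrix" -Path "naa.<canonical-name>"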

Additionally, I re-ran the “Array Managers” wizard on the Protected Site to ensure that SRM was aware that these LUNs/Volumes were replicated and did contain virtual machines.

Multiple Protection Groups

The storage changes outlined in this section were then reflected in the Protection Groups I created. I would now have six Protection Groups reflecting the six types of virtual


machines. When I created the Citrix Protection Group, I selected the VMFS volume I created for that application.

If we follow this to its logical conclusion, I end up creating a Protection Group for each of my replicated VMFS volumes – six in total.

Multiple Recovery Plans

These multiple Protection Groups now allow for multiple Recovery Plans – for example, a Recovery Plan just for my mail environment. Also, in the case of complete site loss, I could create a Recovery Plan that included all my Protection Groups, like so:

At the end of this process I would have a series of Recovery Plans that I could use to test each application set – and also to test a complete Recovery Plan.


Summary

As you can see, the most powerful and sensible way to use SRM is to make sure that the virtual machines behind the big infrastructure components in the business are separated out at the storage level. From an SRM perspective, it means we can separate them into logically distinct Protection Groups, and then use those Protection Groups in our Recovery Plans. It is infinitely more functional than one flat VMFS volume with just one or two Recovery Plans – and trying to use such options in the Recovery Plan as "Recover No Power On Virtual Machines" to control what is and isn't powered on during a test of a Recovery Plan. The intention of this section was not to change my configuration, but to illustrate what a "real world" SRM configuration might look and feel like. I was able to make all of these changes without powering off the virtual machines – by using Storage VMotion to relocate them onto the new LUNs/Volumes.

The Repair Array Manager's Button

If you select the Recovery Plans node on the Recovery Site SRM you will see a "Repair Array Managers" button like so:

Like me, you might find this button a little curious, given that the Array Managers configuration is set at the Protected Site, not the Recovery Site. Like me, you may wonder in what circumstances the array might be in such a state that it needs "repairing" as such. It took me some time to find the usage case for this feature because it isn't especially well flagged up in the VMware documentation, although this is very likely to change. This button doesn't repair the storage array so much as allow you to repair the configuration of the Recovery Site's communication to that array. Suppose the Protected Site has gone down due to a disaster. You then move across to the Recovery Site to invoke your Recovery Plan, only to discover that there is an error in the configuration of the SRM/SRA communication to the storage array at the Recovery Site. Examples of this include:

• The first IP used to communicate to the array is good, but the 1st controller is unavailable; when the SRA goes to use the 2nd controller it fails, because the SRM administrator typed in the wrong IP address – or indeed failed to specify one at all

• An individual at the Recovery Site changed the IP address used to communicate to the Recovery Site storage array without informing the SRM Administrator


• An individual at the Recovery Site changed either the username or password used to authenticate to the array

When you click the Repair Array Managers button, the standard "Array Managers" dialog box opens on the Recovery Site SRM, which allows you to correct these problems. You would not need to use this interface if the Protected Site array was available, as would be the case in a planned DR situation.

Conclusion

For me this is one of my biggest chapters, because it really shows what SRM is capable of – and perhaps where its limitations lie as well. One thing I found a little annoying is that there's no drag-and-drop option available to reorder virtual machines in a priority list – and clicking those up and down arrows for each and every virtual machine is going to get pretty darn annoying. For me it was irritating with just ten virtual machines, never mind hundreds.

Hopefully you got a good idea of the long-term management of SRM. After all, virtual machines do not automatically become protected simply by virtue of being stored on replicated VMFS volumes. Additionally, you saw how other changes at the Protected Site impact the SRM server – such as renaming datacenters, clusters, folders, networks and datastores – and for the most part SRM makes a good job of keeping that metadata linked to the Recovery Site. It's perhaps worth highlighting the dependencies within the SRM product, especially between Protection Groups and Recovery Plans. Additionally, the fact that we cannot yet back up our Recovery Plans to a file introduces a worry: major changes at the Protected Site – such as un-protecting a VM or even deleting a Protection Group – can lead to a damaged Recovery Plan with no quick and easy way of restoring it. As you might have seen, deleting Protection Groups is a somewhat dangerous thing to do, despite the relative ease with which they can be re-created. It un-protects all the virtual machines in that Protection Group and removes them from your Recovery Plans. Recreating the Protection Groups does not put the virtual machines back in their original location, thus forcing you to recreate all the settings associated with your Recovery Plans.

What we could really do with in the Recovery Plans area is a way of exporting and importing them, so that those settings are not lost. Indeed, it would be nice to have a "copy Recovery Plan" feature, so you could create any number of plans from a base and work out all the possible approaches to building a DR plan. Finally, I think it's a shame that events such as Cold Migration and Storage VMotion still do not fully integrate into the SRM product. Hopefully you saw there is a range of different events that can occur, which SRM reacts to with various degrees of automation. As you will see in the next chapter, it's possible to configure alarms to tell you if a new virtual machine is in need of protection.


Chapter 10: Alarms, Exporting History and Access Control


You will be very pleased to hear that SRM has a large number of configurable alarms, and a useful reporting feature as well. Alarms are especially well defined, with lots of conditions we can check on. This is a welcome enhancement to VMware products, which have in the past had quite limited alarm and reporting functionality. That said, the action we can take in the event of an alarm being triggered is still pretty much limited to sending an email, sending an SNMP trap or executing a script. It's perhaps worth stating something very obvious here: SMTP and SNMP are both networked services. These services may not be available during a real disaster – as such, you may not wish to rely on them too heavily. Additionally, you will find that SRM does not have its own specific "events" tab – instead, SRM events are included alongside your day-to-day events. I think as long as you have allocated user roles and permissions for SRM, you should be able to filter by these accounts, which should improve your traceability. After I have covered "Access Control", I will include some filtering/searching screen grabs to illustrate what I mean.

vCenter "Linked Mode" and Site Recovery Manager

Before I jump right into looking at alarms, I want to take a moment to discuss the importance of vCenter "Linked Mode" to this particular chapter. When I started to write this version of the book I was tempted to introduce this feature from the very beginning, but I was worried that the distinction between the Protected Site and Recovery Site might be obscured by the use of this feature. I think by now, even if you are new to VMware SRM, this distinction is more than clear. I would like at this point to introduce "linked mode" into the picture for one main reason: when you have your vCenters set up with "linked mode", amongst licensing data the vCenters also share the "roles" created in vCenter – and this has a direct impact on the "Access Control" part of this chapter. I personally feel that if you use "linked mode" it will dramatically cut down the number of logins and windows you will need open on your desktop, and significantly ease the burden of setting permissions and rights within the Site Recovery Manager product itself. My only worry about linked mode in the context of SRM is that you are creating a relationship, or dependency, between the Protected and Recovery Sites. SRM doesn't require the "linked mode" feature to function. However, when balanced against the advantages of "linked mode", I think this anxiety is unfounded – anything that eases administration and reduces the complexity of permissions and rights has to be embraced. Normally, you enable "linked mode" during the installation of vCenter. If you haven't, there is a wizard on the vCenter Start menu to re-run that portion of the vCenter installation. When you run the linked mode wizard you must use a domain account that has administrative rights on both the Protected Site and Recovery Site vCenters.

1. Login to the Recovery Site vCenter
2. From the Start, Programs, VMware menu, run the vCenter Linked Mode Configuration Properties option
3. After clicking Next, select the radio button to Modify linked mode configuration


4. Ensure that in the next dialog box the option to "Join vCenter server instance to an existing linked mode group or instance" is selected and click Next
5. In the Connect to a vCenter instance dialog box, type in the FQDN of the Protected Site vCenter like so:

Note: After the installation you will have a view that looks something like this:


When you click the Site Recovery Manager icon after the first login, you will still be challenged for a username and password to communicate with the Site Recovery Manager server. Switching between the Protected and Recovery Sites is simply a question of selecting them from the navigation bar, like so:

Note: When you first do this, you will still be challenged for the credentials of the SRM host. Although linked mode cuts down the number of logins to the vCenter(s), you must still authenticate to the SRM host. This is an additional layer of security, and it is worth noting that your vCenter credentials might not be the same as your SRM credentials.

Alarms Overview

Alarms cover a huge array of possible events, including but not restricted to such conditions as:

• Low Available Resources


o Disk
o CPU
o Memory

• Status of the Recovery Site
o Recovery Site SRM is up/down
o Not pingable
o Created/Deleted

• Creation of
o Protection Groups
o "Shadow" Placeholder Virtual Machines

• Status of Recovery Plans
o Created
o Destroyed
o Modified
o Pending Messages

• License Status
• Permission Status
• SAN Connectivity

Note: In SRM 1.0, the thresholds for the Disk, CPU and Memory alarms were set not within the GUI but in the vmware-dr.xml file. It is now possible to set the parameters for these from the Advanced Settings dialog box within the vSphere Client. You will see the option for Advanced Settings on the right-click of the Site Recovery node, like so:


As you would expect, some alarms are more useful than others, and they can in some respects facilitate the correct utilization or configuration of the SRM product. There are some notable cases. Additionally, you will notice that both the Recovery and Protected Sites hold the same alarms – configuring both sites would be appropriate in a bi-directional configuration. Here are some examples:

Example 1: Simply creating a new VM on a VMFS volume which is replicated does not automatically add the virtual machine to the Protection Group and Recovery Plan. An email to the SRM administrator might be helpful in prompting him/her to carry out the appropriate actions.

Example 2: Although Recovery Plans have a notification or messages feature, you will only see the message if you have the vSphere Client open with the Site Recovery Manager plug-in. It might be desirable to send an email to the appropriate person as well.

Example 3: Failure to receive a ping or response from the Recovery Site could be indicative of a mis-configuration of the SRM product – or some kind of network outage.

Example 4: SRM requires SAN connectivity and reliable replication cycles. Failure of storage may trigger the use of the DR plan, or indicate a mis-configuration. There is no point in having SRM working if the underlying storage array has failed.

Creating a New Virtual Machine To Be Protected Alarm (Script)

Note: Unlike the scripts executed by a Recovery Plan, these alarm scripts are executed by either the Protected Site vCenter or the Recovery Site vCenter. As such, the script must be created and stored on the vCenter server responsible for the event – which can be identified by the use of the word "Protected" or "Recovery" in the event name.

1. At the Protected Site, click the SRM button
2. Select the Alarms tab and double-click the alarm called VM Added
3. In the Edit Alarm dialog box, select the Actions tab
4. Click the Add button
5. From the pull-down list select Run a Script, and type:

C:\Windows\System32\cmd.exe /c c:\newvmscript.bat

Note: One condition can have many actions – so it's possible to create a condition that would send an email, send an SNMP trap and also run a script

6. On the Protected Site SRM server, create a script called newvmscript.bat with this content:

@echo off
net send 192.168.3.198 A new VM has been created at the Protected Site. The Protection Groups will need updating to include this new VM in your Recovery Plans

Note: This script is only intended as an example. I would not and do not recommend the production use of the Messenger Service
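If you would rather avoid the Messenger Service altogether, a PowerShell script that emails the SRM administrator is one alternative – assuming PowerShell 2.0 or later on the vCenter server, and with the SMTP relay and addresses below being entirely hypothetical:

# newvmscript.ps1 - email the SRM team instead of using net send
Send-MailMessage -SmtpServer "smtp.corp.com" -From "[email protected]" -To "[email protected]" -Subject "SRM: new VM created at the Protected Site" -Body "A new VM has been created. Update the Protection Group so it is included in your Recovery Plans."

You would then point the alarm action at: C:\Windows\System32\cmd.exe /c powershell.exe -File c:\newvmscript.ps1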

Creating a Message Alarm (SNMP)

1. At the Recovery Site, click the SRM button
2. Select the Alarms tab and double-click the alarm called Recovery Profile Prompt Display

Note: The Recovery Profile Prompt Display alarm means that the Recovery Plan has paused at a Message step, and is waiting for an operator to respond to it.

3. In the Edit Alarm dialog box, select the Actions tab
4. Click the Add button
5. From the pull-down list select Send a notification trap

Note: Unlike Send a notification email, the destination/recipient is not defined here, but in vCenter in the Administration menu, under vCenter Server Settings, in the SNMP section of the dialog box. By default, if you run an SNMP management tool on the vCenter in the "public" community, you will receive notifications. To test this functionality I used the free utility called TrapReceiver – VMware also use this on their training courses to test/demonstrate SNMP functionality without the need for something like HP OpenView. I installed TrapReceiver on the Recovery Site vCenter server to test the SNMP functionality. http://www.trapreceiver.com/ Over the page is the result of such an alarm sent to TrapReceiver.

Creating a SRM Service Alarm (SMTP)

Warning: Unfortunately, at the moment there is no way to control the sensitivity of the SRM alarms, so repeated alarms will fill your inbox at a rate of about one email every five minutes until the event has passed.

1. At the Recovery Site, click the SRM button


2. Select the Alarms tab and double-click the alarms called Remote Site Down and Remote Site Ping Failed

3. In the Edit Alarm dialog box, select the Actions tab
4. Click the Add button
5. From the pull-down list select Send a notification email, and type in the destination/recipient email address

Note: In the edit box, type in the email address of an individual or a group who should receive the email. Again, configuration of the SMTP service is set in vCenter in the Administration menu, under vCenter Server Settings, in the Mail section of the dialog box.

Note: Emails will be triggered when the "Not Responding" message appears on the SRM Summary page


Note: The actual emails produced by this alarm can be a little cryptic, especially the part that reads "Old Status" and "New Status", but they do the job required, as can be seen below

Exporting & History

It is possible to export a Recovery Plan out of Site Recovery Manager, and also to export the results of a Recovery Plan run. The export process supports the following formats:

• Word
• Excel
• Web Page
• CSV
• XML

Although Recovery Plans can be "exported" out of SRM, they cannot conversely be imported back into SRM. The intention of the export process is to give you a "hard copy" of the Recovery Plan, which you can share and distribute without necessarily needing access to SRM.

Warning:
Currently, SRM defaults to trying to open the exported file on the system where your vSphere Client is running. If the system where you are running the vSphere Client does


not have Microsoft Word/Excel, then this will fail. The plan is still exported, but your system will fail to open the file. In my experiments Microsoft Word Viewer 2007 worked, but Microsoft Excel Viewer 2007 did not. Additionally, Microsoft Excel Viewer could not open the CSV format either. I found I needed the full version of Excel to open these files successfully – the XLS file comes with formatting but, as you would expect, the CSV comes with no formatting whatsoever.

Exporting Recovery Plans

1. At the Recovery Site SRM, select your Recovery Plan
2. Click the Export Recovery Plan icon

3. From the Save As dialog box select the format type

Note: The output of the plan looks like so – this was taken from Microsoft Word Viewer 2007


Recovery Plan History

SRM has a History tab which shows success, failure and error summaries – and allows you to view previous runs of the Recovery Plan in HTML format, or export them in the other formats indicated earlier.

1. At the Recovery Site SRM, select a Recovery Plan
2. Click the History tab, select a previously run Recovery Plan – and click View or Export

Note: In the screen grab below I checked the history of one of my error results, viewed in HTML format. The error "Failed to recover datastore" was caused by an administrative mistake on my part: I forgot to give my ESX hosts in the Recovery Site access to the snapshot on my EMC Clariion. Without access to the snapshot, the test of my Recovery Plan kept on failing.


Access Control

Call it what you will, but permissions, access control and change management are part and parcel of most corporate environments. So far we have been managing SRM using a default "administrator" account for every task. This is not only unrealistic, it is also very dangerous – especially in the realm of DR. DR is such a dangerous undertaking that it should not be triggered lightly or accidentally. Correctly setting permissions allows the product to be configured and tested separately from the process of invoking DR for real. Although invoking DR is a high-level "C-Class" executive decision, the management of the process should be in the hands of highly competent, trained and well-paid IT staff. SRM introduces a whole raft of new roles to vCenter – and as with day-to-day vCenter rights and privileges, the SRM product displays the same "hierarchical" nature as vCenter. An additional layer of complexity is added by having two vCenter systems (the Protected and Recovery Site vCenters) that are delegated separately. It's worth saying that in a bi-directional configuration these permissions would have to be mutually reciprocal, to allow the right people to carry out their designated tasks properly.

As with the alert actions, access control is driven by authentication services – for many this will mean Microsoft Active Directory and Microsoft DNS. If these services fail or are unavailable, you may not even be able to login to vCenter to trigger your Recovery Plan. Proper planning and preparation need to be undertaken to prevent this from happening – and you may wish to develop a Plan B whereby a Recovery Plan could be triggered without the need for Microsoft's Active Directory at all. Depending on your corporate policies this could include the use of physical domain controllers – or even the use of local user accounts on your vCenter and SRM systems. From a security perspective, local user accounts are frowned upon, to say the least, in most corporate environments – so the first step is to review the default vCenter permissions, which allow full access to vCenter using the local administrator account on the vCenter server itself.

The Site Recovery Manager Roles include:

• Protection Groups Administrator
• Protection SRM Administrator
• Protection Virtual Machine Administrator
• Recovery DataCenter Administrator
• Recovery Host Administrator
• Recovery Inventory Administrator
• Recovery Plans Administrator
• Recovery SRM Administrator
• Recovery Virtual Machines Administrator


You can see these roles in the vSphere Client at Home >> Administration >> Roles. If you copy these roles to create new ones, it can take time – if you are in "linked mode" – for them to be replicated to the other vCenters in the environment.
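If you do want to copy a role, PowerCLI can clone one, including all its privileges. A minimal sketch – the new role name is arbitrary:

# Clone the privileges of an existing SRM role into a new custom role
New-VIRole -Name "Protection SRM Administrator (Copy)" -Privilege (Get-VIPrivilege -Role (Get-VIRole "Protection SRM Administrator"))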

At the time of writing the SRM 1.0 book there was little information about the privileges assigned to these roles – and sadly this situation has not changed much in SRM 4.0. I could easily find this out by clicking each role and checking the privileges by hand, but doing this and then transposing them into this guide would be quite tedious. Instead, it might be more valuable for us to think about the changes that occur in the SRM environment, to help us think about the privileges needed. If a new datastore were created, then potentially a new Protection Group would be needed. Similarly, as new virtual machines are created, they must be correctly configured for protection. We would also want to allow someone to create, modify and test Recovery Plans as our needs change. In the following scenario I'm going to create some users – Adam, Alex, Chad, Cormac, Lee and Vaughn – and allocate them to a group in Active Directory called SRM Administrators. I will then login as each of these users to test the configuration, and validate that they can carry out the day-to-day tasks they need to do. The plan is to allow these guys to ONLY carry out SRM tasks – with the minimum rights needed for day-to-day maintenance of the SRM environment.
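As an aside, if you do want to enumerate the privileges behind each role without the tedium of clicking through the GUI, PowerCLI can dump them. A sketch only, assuming a session already connected to the vCenter holding the roles:

# Print each SRM-related role followed by the IDs of its privileges
foreach ($role in Get-VIRole | Where-Object {$_.Name -match "Protection|Recovery"}) {
  $role.Name
  Get-VIPrivilege -Role $role | Select-Object -ExpandProperty Id
}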

The configuration will allow these individuals to manage a unidirectional or active/passive SRM configuration. In other words, they will be limited to merely creating and executing Recovery Plans at the Recovery Site. In part, what I am reproducing in this guide is the example of permissions and rights outlined in the official administration guide to SRM 1.0 from VMware – rather curiously, this example no longer appears in the SRM 4.0 administration guide. Below is a table which summarizes the permissions required to achieve this configuration.

At the Protection Site

Role                                        Location in vCenter                      Propagate?
Read-only                                   vCenter Hosts & Clusters                 NO
Read-only                                   Datacenters                              NO
Protection Virtual Machine Administrator    vCenter host level (1)                   YES
Protection SRM Administrator                Site Recovery Root                       NO
Protection Groups Administrator             SRM Protection Groups                    YES

At the Recovery Site

Role                                        Location in vCenter                      Propagate?
Recovery Inventory Administrator            vCenter Hosts & Clusters                 NO
Recovery DataCenter Administrator           Datacenters                              NO
Recovery Host Administrator                 vCenter host level                       NO
Recovery Virtual Machine Administrator      Resource pools and vCenter folders (2)   YES
Recovery SRM Administrator                  Site Recovery Root                       NO
Recovery Plans Administrator                SRM Recovery Plans level                 YES

1. Any object containing ESX hosts, such as a cluster or folder. Use this method rather than setting the permission on a per-ESX host basis.
2. Much depends on how you structure your resource pools and folders. Do you create resource pools within resource pools? Do you have a top-level folder within which all other folders are created? Are you using resource pools with DRS? If so, perhaps you could set this privilege on the cluster (aka the "root resource pool").

As you can see, there are a significant number of roles to assign, in many different locations – some assignments requiring inheritance or "propagation" (4 of them), but most (7) not. As you might gather, one of my main feature requests for SRM is a "delegation wizard" that would set these privileges for us!

Warning:
As you can see, SRM user rights by themselves are NOT enough. If you only have rights to the SRM part of vCenter, you will not even be able to logon via the vSphere Client. You will need to grant your users and groups at least "Read Only" rights to some part of the vCenter inventory to allow the login process to be successful.
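Incidentally, each of the assignments in the table can also be made from PowerCLI, which takes some of the sting out of applying them consistently at both sites. A sketch of just the datacenter-level assignment – the datacenter name is hypothetical, and the group is the one created earlier:

# Grant Read-only at a specific level WITHOUT propagation, mirroring the second row of the table
New-VIPermission -Entity (Get-Datacenter "NYC") -Principal "CORP\SRM Administrators" -Role "ReadOnly" -Propagate:$false

Note that "ReadOnly" is the internal name of the built-in Read-only role.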

Configuring a SRM Administrator Group (Protection Site)

1. Switch to the Protected Site’s vCenter 2. Select the vCenter node, and click the Permissions Tab

3. Right-click underneath Administrators, and choose Add Permissions 4. Next click the Add button to add users or groups 5. Next select the role Read Only 6. IMPORTANT: Remove the tick next to Propagate to child objects

Page 304: Administering VMware SRM 4.0

304

7. Next select your datacenter(s), allocate the role of Read Only 8. IMPORTANT: Remove the tick next to Propagate to child

Note: If you had many datacenters you might want to put them into folders, so you could control the permissions more affectively

9. Next select your DRS/HA cluster(s), allocate the role of Protection Virtual

Machine Administrator 10. CAUTION: In this case leave the option to Propagate to child objects

selected

Note: In the absence of DRS/HA clusters you can use folders to group ESX hosts, to avoid

Page 305: Administering VMware SRM 4.0

305

setting this permission on a per-ESX host basis

11. Next select the SRM view, select the Site Recovery node and select the role Protection SRM Administrator

12. IMPORTANT: Remove the tick next to Propagate to child

13. And finally, within the Protected Site. Select the Protection Groups node and

allocate the role of Protection Groups Administrator 14. CAUTION: Leave Propagate to child objects selected

VERY IMPORTANT: Phew, that was a lot of work! I hope you set the right roles, in the right locations, with the right inheritance options! Unfortunately, you're not done yet – remember, the people who work at the Protection Site also need rights at the Recovery Site to create and test their Recovery Plans.

Configuring a SRM Administrator Group (Recovery Site)

1. Switch to the Recovery Site's vCenter Server
2. Select the vCenter node and allocate the role of Recovery Inventory Administrator
3. IMPORTANT: Remove the tick next to Propagate to child objects
4. Select the datacenter(s) and allocate the role of Recovery Datacenter Administrator
5. IMPORTANT: Remove the tick next to Propagate to child objects
6. Select the cluster(s) and allocate the role of Recovery Host Administrator
7. IMPORTANT: Remove the tick next to Propagate to child objects
8. Select the resource pool(s) and folders and allocate the role of Recovery Virtual Machine Administrator
9. CAUTION: Leave Propagate to child objects selected
10. Next, switch to the SRM view, select the Site Recovery node and allocate the role Recovery SRM Administrator
11. IMPORTANT: Remove the tick next to Propagate to child objects
12. Finally, select the Recovery Plans node and allocate the role of Recovery Plans Administrator
13. CAUTION: Leave Propagate to child objects selected

Note: That's it – you're good to go! By now you're probably wishing for some kind of delegation wizard. I would agree with you!

Testing your Permissions It’s quite one thing to set permissions, and another to see them in action. Personally since I started in IT in the early 90’s I’ve always created test accounts to login with to test my permissions. Just to be 100% sure I’ve got them right – and to ensure there aren’t any nasty surprises. In this case I want to flag up what you would not be able to do. If you were setting the permissions as set out above you would find the following at the Protected Site:

• No ability to create virtual machines

• No ability to create Recovery Plans

At the Recovery Site:

Page 308: Administering VMware SRM 4.0

308

• Views restricted just to recovery virtual machines

• Unable to create Protection Groups
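To make the check less of a chore, you can script the login-and-look-around part of it. The following is a small, hedged pyVmomi sketch (the vCenter name and test account are examples) that connects as one of the test users and lists what that login can actually see – with the roles above, the Recovery Site listing should be restricted to the recovery virtual machines:

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host='vcnj.corp.com', user='CORP\\adam',
                  pwd='password', sslContext=ctx)
content = si.content

print('Logged in as:', content.sessionManager.currentSession.userName)

# Everything the permissions allow this account to see
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)
for vm in view.view:
    print(' can see:', vm.name)
Disconnect(si)

This does not replace clicking around as the test user – actions such as attempting to create a VM still need trying by hand – but it quickly confirms the view restrictions are in force.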

Some Permission Limitations – Test & Run Plans

SRM 4.0 introduced the ability to separate the right to test Recovery Plans from the right to run them. In fact, the default privileges of the Recovery Plans Administrator role allow only the testing of Recovery Plans. This effectively means that until you modify the role's configuration, only a full vCenter administrator has the rights to both test and run a plan. You can see this if you edit the properties of the Recovery Plans Administrator role and then try to run a Recovery Plan with the default rights.


I think you have a number of options here:

• You could change the default settings on the Recovery Plans Administrator role

• You could create two new groups in Active Directory – one called SRM Admins – Test Only, and another called SRM Admins – Run & Test – and assign the users to those groups. Next, copy the existing Recovery Plans Administrator role to a new role which includes the "run" privilege. Finally, add both groups to the Recovery Plans node at the Recovery Site (a sketch of the role-copying step follows this list)

• You could create groups for each Recovery Plan, based on its function (Citrix, DB, File Servers, Mail and View). The same approach could be applied to the Recovery Plans Administrator role – to allow different teams to test the recovery of their own systems
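The role-copying step in the second option can be scripted. Here's a hedged pyVmomi sketch: it clones the Recovery Plans Administrator role into a new role that also carries the run privilege. Be aware that the privilege ID shown is a placeholder of my own – check the role editor for the real string in your SRM release – and the vCenter details are examples:

import ssl
from pyVim.connect import SmartConnect, Disconnect

ctx = ssl._create_unverified_context()
si = SmartConnect(host='vcnj.corp.com', user='administrator',
                  pwd='password', sslContext=ctx)
auth_mgr = si.content.authorizationManager

# Find the existing SRM role and copy its privilege list
source = next(r for r in auth_mgr.roleList
              if r.name == 'Recovery Plans Administrator')

RUN_PRIVILEGE = 'VcDr.RecoveryProfile.Run'  # placeholder - verify the real ID

new_role_id = auth_mgr.AddAuthorizationRole(
    name='Recovery Plans Administrator - Run and Test',
    privIds=list(source.privilege) + [RUN_PRIVILEGE])
print('Created role with ID', new_role_id)
Disconnect(si)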

VMware SRM Log Files

As with all software, VMware SRM has internal log files of its own. These are located in the following directory path:

C:\Documents and Settings\All Users\Application Data\VMware\VMware Site Recovery Manager\Logs

These logs are intended not for day-to-day usage but for VMware Product Support. Should you have a serious problem with SRM that you cannot resolve, looking at these logs can sometimes be useful. The sample below shows what happens when two VMs are to be protected but fail because of invalid inventory mappings. The log file does not show friendly vCenter names, but rather unfriendly MoRefs (Managed Object Reference numbers), which are expressed in this format: vm-725, network-288 and resgroup-895.

[2008-09-30 17:36:04.464 'DrInventoryMapper: site-28' 2820 verbose] Recommendation for VM 'vm-725': (dr.primary.MappingRecommendation) {
[#3] dynamicType = <unset>,
[#3] vm = 'vim.VirtualMachine:vm-725',
[#3] folder = <unset>,
[#3] networkRecommendations = (dr.primary.MappingRecommendation.NetworkRecommendation) [
[#3] (dr.primary.MappingRecommendation.NetworkRecommendation) {
[#3] dynamicType = <unset>,
[#3] primaryNetwork = 'vim.Network:network-288',
[#3] secondaryNetwork = 'vim.Network:network-215',
[#3] }
[#3] ],
[#3] resourcePool = 'vim.ResourcePool:resgroup-895',
[#3] conflict = false,
[#3] }
[2008-09-30 17:36:04.464 'DrInventoryMapper: site-28' 2820 verbose] Recommendation for VM 'vm-727': (dr.primary.MappingRecommendation) {
[#3] dynamicType = <unset>,
[#3] vm = 'vim.VirtualMachine:vm-727',
[#3] folder = <unset>,
[#3] networkRecommendations = (dr.primary.MappingRecommendation.NetworkRecommendation) [
[#3] (dr.primary.MappingRecommendation.NetworkRecommendation) {
[#3] dynamicType = <unset>,
[#3] primaryNetwork = 'vim.Network:network-289',
[#3] secondaryNetwork = 'vim.Network:network-214',
[#3] }
[#3] ],
[#3] resourcePool = 'vim.ResourcePool:resgroup-895',
[#3] conflict = false,
[#3] }
[2008-09-30 17:36:04.464 'DrInventoryMapper: site-28' 2820 verbose] Made recommendations for 2 VMs in 0 seconds

The specific error here was that the two VMs were in a folder that had not been mapped appropriately; the result was a yellow exclamation mark on the Protection Group, and a failure to create the placeholder files in the Recovery Site. You should also know that the storage vendors' SRAs have specific log files of their own – the paths vary, so consult the vendor documentation if you are looking for them. These are well worth analysing if you get serious and cryptic errors in the main SRM product. As with all technologies, knowing how they work and using them regularly will help improve your troubleshooting skills. I hope you have noticed that this book documents my errors as much as my successes.
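When cross-referencing MoRefs from the logs against the vCenter inventory, a little scripting can save clicking. Here is a hedged pyVmomi sketch (hostname and credentials are my examples) that turns an ID such as vm-725 back into its friendly name by building a managed object reference by hand:

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host='vcnyc.corp.com', user='administrator',
                  pwd='password', sslContext=ctx)

# Construct the reference from the raw MoRef ID and attach the live session
vm = vim.VirtualMachine('vm-725')
vm._stub = si._stub

print(vm.name)  # prints the friendly name, or errors if the MoRef is stale
Disconnect(si)

The same trick works for the other MoRef types in the log – vim.Network('network-288') or vim.ResourcePool('resgroup-895').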

Conclusions

As you can see, SRM significantly extends and adds to vCenter's alarms, reporting and access control features. And whilst SRM alarms may not have the configurable options you see in the main vCenter product, such as the Triggers and Reporting tabs, the sheer number of alarm conditions we can trap on is a very welcome addition to what was, until vSphere 4 was released, an under-developed aspect of the core vCenter product. Again, simply the ability to run reports in SRM is a great addition – once again, it's a feature we don't usually see in the core vCenter product. In one respect VMware's investment in vCenter is paying dividends, in allowing its own developers to extend its functionality with plug-ins. Similarly, the recent additions to the VMware stable of applications, such as VMware View, need to join in the party too. In this respect I think the VMware SRM engineers have lit a torch for others to follow.

This more or less concludes this particular configuration type. So far this book has adopted a scenario where your organization has a dedicated site purely for recovery purposes; I now want to change this scenario to one where two datacenters have spare CPU, memory and disk capacity – so that they can reciprocate recovery for each other. A situation where New Jersey is the Recovery Site for New York, and New York is the Recovery Site for New Jersey; or where Reading is the Recovery Site for London, and London is the Recovery Site for Reading. For the large corporate this offers the chance to save money – especially on those all-important and precious VMware licenses. In the next chapter we will also take a look at the multi-site features, new to SRM 4.0, which allow for so-called "spoke-and-hub" configurations where one Recovery Site offers protection to many Protection Sites.


Chapter 11: Bi-Directional and Multi-Site Configurations


So far this book has focused on a situation where the Recovery Site is dedicated to the purpose of recovery – and this could easily be hired rack space provisioned by a third-party company. This is very popular in smaller organizations that perhaps only have one datacenter, or whose datacenters are small and do not have the resources to be both a Production Site and a Recovery Site at the same time. As with conventional redundancy, this "dedicated" Recovery Site model is not especially efficient, as you are "wasting" valuable financial resources to protect yourself from an event that might never happen. Like all insurance policies – your home insurance and your car insurance – it is a waste of money. Until, that is, you are unlucky enough to find that one day your house is broken into, and a thief steals your car and burns it out.

Due to the licensing and other associated costs it is much more efficient for two or more datacenters to be paired together, offering DR resources to each other. Such a configuration is referred to as a bi-directional configuration in the official VMware SRM documentation. I've left this type of configuration to the end of the book – not because I thought most people wouldn't be interested – but for three main reasons. Firstly, I wanted to make it 100% crystal clear which tasks are carried out at the Protected Site (site pairing, Array Managers, and inventory mappings) and which tasks are carried out at the Recovery Site (Recovery Plans). Secondly, permissions are simpler to explain and test in a conventional Protected Site and dedicated Recovery Site configuration. Lastly, at this stage my hope is that you now have a very good understanding of how SRM works – and so a bi-directional configuration shouldn't be difficult to add to an existing unidirectional one.

At this stage I did make some major storage changes. Previously, the Recovery Site (New Jersey) just had access to replicated volumes from the Protected Site (New York). For a bi-directional configuration to work you clearly need replication happening in the opposite direction as well. When this happens the clear distinction between the Protected and Recovery Site breaks down – each is both a Recovery and a Protection Site for the other. If it helps, what I'm changing from is an Active/Passive DR model to an Active/Active model, where both sites reciprocate – each running a production load, whilst at the same time using its spare capacity to offer DR resources to the other.

Additionally, I created a new resource pool and folder structure on both the New Jersey and New York sites. This was to allow for inventory mappings to be created, for the first time, from New Jersey to New York. Personally, I feel that for simplicity the various sites should be almost a complete "mirror" of each other – in other words, the resource pools and folder structures are identical wherever possible. In reality this "mirroring" may not be practical or realistic, as no two sites are ever identical in terms of their infrastructure or operational capabilities. So at the New Jersey site I created a series of resource pools which mirror the resource pools in New York – assuming that both sites are very similar and run the same applications and services. As with New York, I created a resource pool called NJ_DR with child resource pools – which would serve as the containers for the placeholder .vmx files of New Jersey. Similarly, I created a VM folder structure in the New Jersey location which is an exact mirror of the New York site. You can see the mirroring in the two screen grabs below.
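Incidentally, if you do adopt the mirror approach, the folder part of it is easy to script. Below is a hedged pyVmomi sketch (the vCenter hostnames are examples; it creates VM folders only, leaving VMs and resource pools alone) that recreates New York's folder tree at New Jersey:

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def mirror_folders(src_folder, dst_folder):
    # Recursively recreate src_folder's child folders under dst_folder
    for child in src_folder.childEntity:
        if isinstance(child, vim.Folder):
            target = next((f for f in dst_folder.childEntity
                           if isinstance(f, vim.Folder)
                           and f.name == child.name), None)
            if target is None:
                target = dst_folder.CreateFolder(child.name)
            mirror_folders(child, target)

ctx = ssl._create_unverified_context()
ny = SmartConnect(host='vcnyc.corp.com', user='administrator',
                  pwd='password', sslContext=ctx)
nj = SmartConnect(host='vcnj.corp.com', user='administrator',
                  pwd='password', sslContext=ctx)

# vmFolder is the root of "VMs and Templates" in each site's first datacenter
src_dc = ny.content.rootFolder.childEntity[0]
dst_dc = nj.content.rootFolder.childEntity[0]
mirror_folders(src_dc.vmFolder, dst_dc.vmFolder)

Disconnect(ny)
Disconnect(nj)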


Configuring the Array Manager

Note: Unlike the very first time we set up SRM, the two locations are already paired – there is no need to pair the two sites together again. What we must do is configure the Array Manager so the SRM and SRA at New Jersey are aware of which volumes are available and which are replicated.

1. Login as Administrator at the New Jersey Site (née the Recovery Site)
2. Click the SRM icon
3. Next to Array Managers, click the Configure button

Note: Notice how there is no need to pair the sites, as that was done earlier in this book. Additionally, you can see that the Array Manager has a setting of "Not configured", indicating this site has never been set up to have its VMs protected.


4. In the Protection Side Array Managers dialog box, click the Add button
5. In the Add Array Manager dialog box, type in a friendly name for this manager, such as Array Manager for New Jersey Site
6. Select from the Manager Type pull-down list the SRA appropriate for your storage array provider

Note: For more detailed information on this step consult Chapter 4 – Protected Site Configuration.

7. Click the Connect button
8. Click OK and click Next
9. Then click the Add button to add in the connection to the Recovery Site array – in my case this is the storage array at the New York location
10. Click Next and Finish – the Finish dialog box should show that the SRM/SRA has discovered the replicated volume

Configuring the Inventory Mappings

As with the first pairing we did in Chapter 4, the next stage is to configure your inventory mappings. I'm not going to repeat myself here, as it would be quite tedious – and what you choose to map where will vary from implementation to implementation. Below is a screen capture of me mapping New Jersey's networks, resource pools and folders to the vCenter objects in New York.


Creating the Protection Group

Again, creating a Protection Group does not differ substantially in a bi-directional configuration.

Creating the Recovery Plan

Again, Recovery Plans do not differ substantially in a bi-directional configuration – in this case I must switch to the New York vCenter server to create a Recovery Plan for the New Jersey virtual machines.


Shared Site Configurations

Warning: The phrase "shared site" is not to be said with badly fitting false teeth or after large amounts of alcohol.

A shared site configuration exists in SRM when more than one production location receives its DR resources from a single Recovery Site managed by SRM. It allows a so-called "spoke and hub" configuration to be created, where one Recovery Site offers DR resources to many production or protection sites. Such a configuration might be used by a large company, or by a Service Provider who offers SRM as an outsourced solution to companies which do not have the resources to manage their own DR location. Clearly, in this commercial scenario permissions MUST be correctly assigned, so one customer does not start managing the Recovery Plans of a totally different business! It's also necessary to prevent duplicate name errors – which could happen if two different companies had VMs with the same VM name. Although this separation sounds complete, even with the correct permissions one customer can see the events and tasks of a different customer. Another use case for the shared site feature is to circumvent some of the scalability limitations that currently exist in the product – so if you have thousands and thousands of VMs to protect, the shared site feature is for you.

At the moment, running the SRM installer with command-line switches triggers the shared site configuration. In time this may become integrated into the install wizard – or better still, into the wizard used to pair sites together. The command-line switches allow you to generate what VMware calls an SRM "extension" ID. In terms of the product, the configuration of SRM remains the same (Array Managers, Protection Groups, Inventory Mappings and Recovery Plans). It's still the case that you need a vCenter and an SRM server at the Protection Site; but at the Recovery Site you need just one vCenter – and then an SRM host for each site or customer you need to protect. From a DR provider's perspective this means each customer has their own dedicated SRM host – but they share the compute resources of the entire DR location. Some customers might not be happy with this, and may want their own dedicated ESX hosts to be used to bring VMs online. Of course, they will pay a premium for that level and quality of service. The additional VMware SRM host does not incur a licensing penalty from a VMware perspective – but it is another Windows instance.


When you login to SRM you login to a particular "extension", which relates to the Recovery Site that you are managing. Rather than running the installation normally, you run the SRM installer with the switch:

/v"CUSTOM_SETUP=1"

This adds two additional steps to the installation wizard – first a mandatory step to enable the custom plug-in identifier, followed by the settings for the custom SRM extension. The extension has three parameters:

• The SRM ID: a piece of text of no more than 29 characters. Although you can use characters such as the underscore, hyphen and period, I would recommend you stick to purely alphanumeric characters, as the special characters can cause problems if they are used at the beginning or end of the SRM ID. The SRM ID should be the same on both SRM hosts in the Protection and Recovery Sites (a small validation sketch follows this list)

• Organization: a friendly 50-character name – no restrictions apply to this field with respect to special characters

• Description: a friendly 50-character description – again, no restrictions apply with respect to special characters
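Since a badly formed SRM ID only bites you at install time, it may be worth pre-checking your chosen string. The tiny Python sketch below encodes my reading of the constraints above – it is not an official validator:

import re

def check_srm_id(srm_id):
    # At most 29 characters, drawn from alphanumerics plus _ - .
    if not 1 <= len(srm_id) <= 29:
        return False
    if not re.fullmatch(r'[A-Za-z0-9._-]+', srm_id):
        return False
    # Recommendation only: special characters at the edges cause problems
    return srm_id[0].isalnum() and srm_id[-1].isalnum()

for candidate in ('WashingtonDC_DR', 'washington-dc.', 'x' * 30):
    print(candidate, check_srm_id(candidate))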

Once the install has completed you continue to pair up the sites as you would normally. Once the pairing process has completed, during the logon process you will be able to see ALL the custom extension IDs together with their descriptions. Whilst this information is not commercially sensitive, it is not possible to hide these references – they need to be visible so folks can select which site contains their recovery configuration. In my current configuration, my Protected Site and Recovery Site are already paired together, and this configuration cannot be changed without uninstalling the SRM product. So it's perhaps worth thinking about how important the shared site feature is to you before you embark on a rollout. That might sound like bad news, but it is NOT the end of the world. There is nothing stopping me creating a new site representing, say, Washington DC, and then adding a new SRM server at the Recovery Site to be paired with the new site that needs protection. As you can see from the image below, I created a new vCenter instance for the Washington DC datacenter, together with an SRM server for that location called srm4wdc. To allow for the shared site configuration I created another SRM server at the Recovery Site called srm4wdc-rs.


Installing VMware SRM with Custom Options to the New Site (Washington DC)

1. Login to the new Protected Site SRM server, in my case srm4wdc
2. Open a command prompt and run the SRM installer with the switch: /v"CUSTOM_SETUP=1"
3. Complete the setup routine as normal. After the VMware vCenter Site Recovery Manager Extension part of the wizard, you should be confronted by the Plug-in Identifier window
4. Select the option called Custom SRM Plug-in Identifier
5. The next part of the wizard allows you to set the SRM ID, Organization and Description

Installing VMware SRM Server with Custom Options to the Recovery Site

Now the Protection Site is installed, it's time to run the same routine on the new SRM server at the Recovery Site (which I called srm4wdc-rs). During the install it's important to remember that when prompted for the vCenter details, it's the vCenter at the Recovery Site you need to specify (in my case srm4nj). The Site Name must be unique, as you cannot pair two sites together that have identical names. At the Protected Site I used: Washington DC Site. At the Recovery Site I used: Washington DC Recovery Site.


Warning: A common mistake is specifying the existing or some other SRM host in the local host field. You could leave the local host field set to the default "localhost", but personally I just don't like doing that – it feels too much like hard-coding an IP address. This must be set correctly, as it cannot be changed – even if you "repair" the installation, as I found out to my chagrin whilst writing this section! It's perhaps worth saying that the "Organization" name in the installation could reflect the business which offers SRM as a service.

Additionally, during the configuration of the Custom SRM Plug-in you MUST type in the same SRM ID as you did during the install of the Protected Site SRM. Once this install has completed, reload your vSphere Client using the FQDN of the vCenter at the new Protection Site. After the login you should find you have the plug-in menu and the new custom SRM plug-in for the new locale:


Note: Even after installing this plug-in, it still appears on the list. This is a "feature" of the plug-in manager when used in shared site mode. If you already have the SRM plug-in installed you will be asked if it is OK to upgrade the existing plug-in – choose Yes! Ironically the install is an odd one, as it will stall because the vSphere Client (yes, the very thing that triggered the install!) is still open. I found that if I closed the plug-in dialog box I could then close the vSphere Client – and let the install complete. Once the install has completed you should be able to reload the vSphere Client and click the SRM icon in the Solutions and Applications tab. Notice how the Local Site details show the SRM site details, such as Site ID and Organization.

After this process you then use the Connection and Configuration options to pair the two sites together. This is exactly the same as in the standard installation – and you can proceed to configure the Array Manager, Protection Groups and Recovery Plans. The screen grab below shows my Washington DC Site and Washington DC Recovery Site paired together – although they have different site names, they are linked by virtue of the SRM ID.


When you come to switch from the Protection Site to the Recovery Site, you will be confronted with a dialog box asking you to select your SRM ID, together with the friendly descriptions. The reference in the screen grab to <default> is the initial site pairing between my New York and New Jersey sites, which I created in Chapter 4:

Finally, if you need to switch between one Recovery Site and another you can click the logout option at the Recovery Site, and then reconnect to the SRM Service:

The rest of the configuration remains unchanged. So next, in the new site (Washington DC), I would configure the Array Manager and create Protection Groups – and then, after that, create the required Recovery Plans.


Decommissioning a Site

Let's assume that, for whatever reason, the Washington DC site is to be decommissioned. The removal process from an SRM perspective would be a reversal of the normal workflow – in effect, SRM would be used to move from Washington DC to another location. You could set up a new Recovery Site (the new location) and failover to it as if there had been a disaster. Once the virtual machines formerly at the old site (Washington DC) were up and running at the new location, the old site could be removed. In this case you would:

1. Remove Recovery Plans from the old Recovery Site (New Jersey, in my case)
2. Remove Protection Groups from the old Protected Site (Washington DC, in my case)
3. Remove the Array Manager configuration from the old Protected Site (Washington DC, in my case)
4. "Break" the pairing of the Protected Site to the Recovery Site

Note: Strictly speaking there is no requirement to "break" the site relationship – I just like to do it. The real reason for the break option is to undo the pairing process in cases where the administrator makes an error, such as pairing two sites together accidentally.

5. From this point you could re-run the vCenter Linked Mode wizard, to isolate the old site into a stand-alone vCenter instance
6. Uninstall the SRM product from the Protected Site SRM server and from its previously paired Recovery Site SRM server
7. After this you may wish to revisit your licensing and remove any stale sites that may be listed – there is a right-click "Remove Asset" option to complete this process

Conclusions

Once you understand the principles and concepts behind SRM, a bi-directional or shared site configuration is really an extension of those very same principles covered in the earlier chapters. The only complexity is getting your mind round the relationships. Perhaps occasionally you stopped in this chapter to clarify to yourself what the relationships were between the two locations, both in SRM and in the storage array. You are not alone – I did the very same thing. I got so wrapped up in the Protected Site/Recovery Site view of the world that it took me some time to adjust my thinking and accept that each location can have a dual functionality. Of course I always knew it could – but adjusting to that switch, once you have fixed in your mind that SiteA is Protected and SiteB is Recovery, just takes a little time. What I would really love to see is a special icon just for placeholder/shadow virtual machines – at the moment they use exactly the same icon as real virtual machines, so it's not immediately apparent which is which. In a bi-directional configuration, if your Protection Site (New York) and Recovery Site (New Jersey) are configured very similarly, it's sometimes tricky to keep the relationships clear in your head – and that's with just two sites! Sometimes it is very clear – after all, the real virtual machines are the ones powered on, and your placeholder virtual machines are the ones powered off. That distinction becomes less clear once you've triggered your DR plan for real, especially as some storage vendors require you to power off virtual machines during the failback process.

You will find this becomes even more apparent when we deal with failover and failback – especially failback. I really had to concentrate when I was doing my first failback and writing it up, mainly because failback is a very manual process, one which requires you to interact with the storage layer in an even more direct way than we have done already. In the next release of SRM we're going to see a much more automated failback process. Anyway, I digress – that is the subject of our next chapter: failover and failback, running our Recovery Plans for real, or what some folks call "hitting the big red button".


Chapter 12: Failover and Failback


The one thing that we have yet to discuss or cover is what SRM is all about: a disaster occurs, and you must trigger your Recovery Plan for real. It's sometimes called "hitting the big red button". The reason I have left this so late is that it's a major decision – one which permanently changes the configuration of your SRM environment and your virtual machines – and so it is not to be undertaken lightly. My second reason for covering this issue now is that before this chapter was started I wanted to completely change the viewpoint of the book to cover bi-directional configurations. The previous chapter was a precursor to covering a failover and failback situation; I didn't want to trigger a real run of the DR plan before making sure a bi-directional configuration was in place and understood. My last reason for leaving this topic so late in the book is that in this release there is no automated "failback" after triggering the Recovery Plan – so we are rapidly reaching the point of no return. In this release of SRM, failback is a manual process both at the storage layer and at the vCenter layer. I think it is highly likely that the next release of SRM will include some kind of automated failback. The useful thing is that by reading this chapter you will understand, to some degree, the manual process that will one day be automated. I'm a great believer in understanding automated processes: it helps you troubleshoot them when they go wrong – and gives you the plan B of being able to do things by hand to boot!

A real execution of your Recovery Plan is just like a test, except in this case the Step 1 phase of the plan is actually executed. In other words, if possible, SRM will power off VMs at the Protection Site (New York), if it's available. But it won't execute the final part of the plan, which resets all the recovery VMs – it leaves them up and running. In the real world, clicking the big red button is going to require senior management approval, usually at the C-class level (CEO, CTO, CIO) – unless those guys are in the building that was demolished by the disaster itself, and someone further down the management structure is delegated to make the decision. You could regard this issue as part and parcel of the DR/BC plan: if we have lost senior decision makers, either temporarily or permanently, someone else will have to take on their roles and responsibilities.

Additionally, there will be subtle and important changes at the storage array. The storage vendor's SRA will automatically stop the normal cycle of replication between the Protected Site and the Recovery Site – and will usually change the status of the LUNs/volumes from being Secondary/Slave/Remote/Replica (whatever the storage vendor's terminology) to being a Primary or Master LUN/volume. All this is done without you having to bug those guys in the storage team. For example, if you were using the Lefthand Networks VSA you would see the volumes that are normally marked as "Remote" switched to being "Primary".


Triggering the plan from SRM is very easy to do – some might say too easy. You hit the Run button; read a warning; shift a radio button to confirm you understand the consequences – and click OK. Then you watch the sparks fly, or the shit hit the fan, depending on its success and whether you sought higher approval! The only real change in the SRM 4.0 product is that, by default, no one except the full administrator has the rights to run a Recovery Plan – even if they are a member of all the default SRM roles. Of course, a run of a disaster Recovery Plan need not be executed merely for an out-and-out loss of a site. If you have planned both the datastores and Protection Groups properly, there should be no reason why you can't failover and failback at an application level.

Considerations before Failover and Failback

There are some larger issues to consider before hitting the big red button. Indeed, we could and should argue these issues need addressing before we even think about implementing SRM. Firstly, depending on how you have licensed SRM, you may need to arrange the transfer of the SRM license between the Protected and Recovery Sites to remain covered by the EULA agreed with VMware. VMware issues SRM licenses on the basis of trust: you need a license assigned to the site that has protected VMs located in its vCenter. As a failback is essentially a reversal of the day-to-day configuration, the Recovery Site will need licensing, albeit temporarily, to facilitate the failback process. VMware assumes you don't abuse this – using a single-site SRM license to protect two sites would be regarded as a breach of the license agreement – but they do allow failback to occur without additional licenses needing to be purchased. Of course, if you have a bi-directional configuration then this license concern does not apply, as both sites are licensed – both sites possess Protection Groups. Secondly, if you are changing the IP addresses of the virtual machines, then your DNS systems will need to hold the correct corresponding IP addresses and hostnames. Ideally this will be achieved, in the main, by your DNS server's dynamic DNS name registration feature – but watch out for any static records in DNS, and the caching of those DNS records on other systems.
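On that second point, a quick scripted spot-check after a re-IP can flush out stale or static records. This is a hedged sketch – the hostnames and "new" addresses are invented for illustration:

import socket

expected = {
    'mail01.corp.com': '192.168.4.10',   # example post-failover addresses
    'ctx01.corp.com': '192.168.4.11',
}

for host, new_ip in expected.items():
    try:
        resolved = socket.gethostbyname(host)
    except socket.gaierror:
        print(host, ': does not resolve at all')
        continue
    if resolved == new_ip:
        print(host, ': OK')
    else:
        print(host, ': STALE - got', resolved)

Remember that a clean answer from your own DNS server doesn't rule out cached records on other systems.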

Planned Failover – Protected Site is available

The most obvious difference when running the Recovery Plan while the Protected Site is available is that the virtual machines in the Protected Site get powered off, based on the order specified in the plan. However, a more subtle change is effected as well – the suspension of the replication or snapshot between the Protected Site and the Recovery Site. The diagram below illustrates the suspension of the normal replication cycle. This must happen to prevent replication conflicts and data loss; after all, it's the virtual machines at the Recovery Site that users will be connecting to and making data changes on. For all intents and purposes they are the primary virtual machines after a failover has occurred.


As you can see, the X indicates that replication of data has been suspended, and the LUNs that were marked as R/O in our tests are marked as R/W in a real execution of the Recovery Plan. In manual DR this is normally a task triggered at the storage array by a human operator using the vendor's failover/failback options – but as the SRA has administrative rights to the storage array, this can be automated by SRM. Once the plan has successfully completed you should be able to see this change in the status pages of your given storage system. For example, in the Lefthand Networks VSA you will see that our volume, which was once a remote scheduled copy, is now a primary volume again.

Here you can see that the volume called replica_of_virtualmachine is now marked as "Primary" – where it used to say "Remote". Additionally, you can see the snapshot numbers are out of synch. Some time has passed since I ran the plan for real, and whilst the ProtectedManagementGroup has carried on its schedule of local snapshots, these snapshots have not been transmitted across the wire to the RecoveryManagementGroup – the SRM/SRA stopped this scheduled remote copy automatically. This is the default behaviour whenever a Recovery Plan is run. In this example I'm going to assume I have lost access to the New York site for some hours or days.

STOP! Before running the plan, you might want to carry out a test first – to make sure all VMs recover properly. VMs that display errors can be left merely as placeholders in the Recovery Site.

1. In this case I need to login to the Recovery Site vCenter – in my case, New Jersey
2. Select the Recovery Plan, and click the Run button
3. Read the confirmation text, select the acknowledgement option – and click Run Recovery Plan

Note: I have a live recording of me running this very plan on the RTFM website. If you want to see what happened when this plan was executed you can watch it here: http://www.rtfm-ed.co.uk/srm.html

If everything goes to plan (forgive the pun) you won't see much difference between a true run of the plan and a test of the plan. What you will see are power-off events at the Protected Site. Additionally, if you have multiple Protection Groups with multiple datastores associated with them, you should see in the Recent Tasks pane references to "Prepare Storage for Recovery".

EMC Celerra and Running Plans

During the test of a Recovery Plan you should see that a new VMFS volume is mounted and resignatured by the Celerra SRA. In my case this was a LUN with the ID of 128 at the ESX host:

This can be seen from the Celerra Manager console:


In the same window, under the Targets tab, the Properties of the Target and the LUN Mask tab, you can see that LUN 128 has been presented to the ESX hosts in the Recovery Site:

EMC Clariion and Running Plans

During the execution of the Recovery Plan the affected VMs are powered down, and then the LUN at the Protected Site is marked as read-only. The SRA does not change any LUN masking settings, so the LUNs still appear in the Storage Adapters view; but as the LUN is marked as read-only, the VMFS volume is prevented from being mounted. VMware SRM should force a refresh of storage at the Protected Site – but if that fails to complete, a manual rescan of the datastores will result in the VMs being marked as (inaccessible).


This happens because the relationship between the Protected Site (New York) and the Recovery Site (New Jersey) has been inverted by the EMC Clariion SRA. So the Protected Site is now seeing the "Secondary Image", which is read-only, and the Recovery Site is now seeing a "Primary Image", which is read-write. You can see this inversion of the relationships in the NaviSphere management window. For example, in the screen grab below you can see that for the Remote Mirror "Replica of LUN61 – DB VMs" the primary image is now being presented by the array with the identity ending in …478, which is the Clariion array for New Jersey; whereas the secondary image is now being presented by the array with the identity ending in …479, which is the Clariion array for New York.

You can also see that a couple of the Remote Mirrors are marked with a red F, indicating they have become "fractured"; they are also showing as [Waiting On Admin; Out of Sync]. You can correct this fracture and re-establish the synchronisation by right-clicking the fractured secondary image and selecting the Synchronize option.


This fracturing can happen in MirrorView if you don't use "Consistency Groups". For this reason, I would recommend that you use them.

HP Lefthand and Running Plans

When you run a Recovery Plan with HP Lefthand storage, changes take place in your existing configuration. The volumes at the Recovery Site cease to be "Remote" – the format used merely for receiving updates – and become "Primary" volumes. It is impossible for a "Primary" volume to receive updates, so this effectively stops the current scheduled remote copy process.

You might see a benign error if the scheduled remote copy is not paused during this time – as the Protected Site VSA tries to send updates to what is now another primary volume at the Recovery Site.

NetApp and Running Plans

When you run a Recovery Plan with NetApp-based storage, a number of changes take place in your existing configuration. The main changes you will see are on the Recovery Site array. Firstly, the SnapMirror relationship with the Protected Site array will be marked as "broken-off".


Additionally, you should see that the affected volumes are no longer marked as being "restricted".

At the Protected Site location you will see that, because the SnapMirror relationship has been "broken-off", there is no management that can be done there – all you can do is view the SnapMirror status. As with all relationships, if it has been untimely broken off there isn't much you can do about it. I don't think the current state of my NetApp Filers says much about the relationship that New Jersey enjoys with New York. [That's another one of my jokes, by the way.]


Planned Failback – Protected Site is available

As a reminder, let me restate that SRM was never designed to automate failback to the primary or Protected Site. That said, SRM can easily be configured to assist in the process. In its current state the Recovery Site – New Jersey in this case – is effectively the owner of New York's virtual machines: they are running and being connected to by end users. As such it's possible to create a temporary Protection Group at the Recovery Site, and a Recovery Plan at the Protected Site – and thus invert the process I triggered earlier. Of course, care must be taken to make sure any changes generated in the short time we were running at the Recovery Site (for me, about a day) are replicated back to the Protected Site to prevent data loss. That process will vary from one storage array vendor to another.

Logically this procedure sounds very simple – just turn around or reverse the configuration. You might think it's a bit like finding you have taken a wrong turn in your car, and all you need to do is perform a U-turn and head off in the opposite direction. If you want to extend that analogy further, it's more like you took that wrong turn some hours ago, and now the only way to get back on track is to make a massive detour and retrace some of the journey you have already done. Oh, and by the way, your kids need the bathroom and you're running low on gas. Coupled to that, the littlest one has just asked "Daddy, are we there yet?" With that said, many of the storage vendors have developed their own failback utilities to assist in this process, and in subsequent releases of SRM I feel it is almost inevitable that there will be some kind of failback feature in the product. VMware has taken a lot of heat on this issue, with some customers using it (wrongly, I believe) as a way of blackballing an SRM project. The process is actually quite easy once you have gone through it a couple of times – it's just that the current lack of automation makes it a convoluted process.

From a storage perspective it means inverting your normal path of replication from the Protected Site to the Recovery Site. This is a manual task, carried out with great care, following the storage array vendor's documentation on how to complete the reconfiguration. If the array at the Protected Site has not been destroyed in the disaster, the data held on it will be out of synch with the Recovery Site. How much depends entirely on how long you were at the Recovery Site, and the rate of data change that has taken place. If this is small, then there is the possibility of merely replicating the differences (the deltas) that have accrued over this time. Alternatively, if the storage array was destroyed or is massively out of synch, you might find yourself actually bringing a new array to the Recovery Site and doing the replication locally.


Setting these storage considerations to one side, the failback process is further complicated by the fact that we have to manually “clean up” the original SRM configuration we had before the failover occurred. Additionally, once the failback has completed we have to clean up the very same configuration that facilitated the failback in the first place – and repair our original recovery process. Let’s deal with this clean-up process first before we worry about inverting the storage replication/snapshot process.

Step 1: Clean-out all the old Placeholder Files at the Recovery Site (New Jersey)

During the configuration of SRM, placeholder/shadow VMX files were placed on storage at an ESX host in the Recovery Site, in my case New Jersey. These now need to be removed manually. We no longer need them: they were created when the Protection Group was created at the Protected Site, and now that failover has happened the production VMs live at the Recovery Site. During the writing of this book I've made many changes, such as creating and destroying virtual machines (ctx03, ctx04 and so on). This created quite a lot of garbage in the location I chose to store my placeholder VMX files – which makes me glad that I have a dedicated location for storing them, keeping my placeholder files totally separate from any "real" VMX files. It would be rather foolish not to do this, as otherwise it becomes very difficult to separate placeholder VMs from real VMs. If you are managing a shared site model, then you might like to have separate placeholder LUNs for each site supported.


Using [Shift]+click in the right-hand pane I can delete the placeholder VMX files very quickly. Notice how I'm not including the .dvsdata folder – this is an internal system folder that is used to store your Distributed vSwitch configuration, if you are using one. A scripted alternative follows the note below.

Note: 99% of what I've learned in life has been from mistakes; this doesn't mean, however, that I make mistakes 99% of the time!

Step 2: Delete the Protection Group(s) at the Protection Site (New York)

The next stage is to delete the Protection Group at the Protection Site – in my case New York – which was used to trigger the failover.

1. Login to the Protected Site vCenter (New York)
2. Click the Site Recovery icon
3. Expand + Protection Groups and select your Protection Group(s)


4. Click the Remove Protection Group button

Step 3: Remove from the Inventory all the "old" Protected Virtual Machines (New York)

The next step is to remove the old and out-of-date virtual machines that were once running at the Protection Site (New York). This is where a good folder and resource pool structure is handy – keeping the protected virtual machines apart from the local or unprotected virtual machines. This clean-up happens so that we don't get conflicts over the names of virtual machines: if we tried to register, say, "mail01" during the failback process we would receive an error, because "mail01" is already registered with vCenter.

1. Login to the Protected Site vCenter (New York)
2. Select all the out-of-date virtual machines
3. Right-click and choose Remove from Inventory

WARNING: You need to be very careful here. There are other reasons, not associated with SRM, why a VM might be marked as inaccessible or orphaned. You must ensure you don't destroy the wrong type of VM. This removal process includes any templates that were included in your Recovery Plan.
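For those who prefer to script this step, the key is to unregister – the API equivalent of Remove from Inventory – rather than delete from disk. A cautious, hedged pyVmomi sketch (the VM names and vCenter details are examples) might look like this:

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

STALE = {'mail01', 'ctx01', 'db01'}   # review this list very carefully!

ctx = ssl._create_unverified_context()
si = SmartConnect(host='vcnyc.corp.com', user='administrator',
                  pwd='password', sslContext=ctx)
content = si.content

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)
for vm in view.view:
    # Only touch named, powered-off machines - never a running VM
    if vm.name in STALE and vm.runtime.powerState == 'poweredOff':
        print('Unregistering', vm.name)
        vm.UnregisterVM()   # Remove from Inventory, NOT delete from disk
Disconnect(si)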


Step 4: Power off the protected virtual machines at the Recovery Site (New Jersey)

The next step will require a maintenance window. We need to gracefully shut down all the virtual machines at the Recovery Site that were failed over by the execution of the Recovery Plan. This will ensure any disk I/O is quiesced before inverting the replication path.

1. Login to the Recovery Site vCenter (New Jersey)
2. Use "Shutdown Guest Operating System" to bring down the virtual machines. You may need to do this in a graceful, ordered manner for services to terminate correctly (a scripted sketch of this step follows)
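Here is a hedged pyVmomi sketch of step 2: it asks VMware Tools to shut down the guests held in a named folder, then waits for them to power off. The folder name ('NYC_DR') and vCenter details are my examples – scope the view to the folder holding the failed-over VMs, not the whole inventory, or in a bi-directional setup you will take down local production VMs too:

import ssl
import time
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host='vcnj.corp.com', user='administrator',
                  pwd='password', sslContext=ctx)
content = si.content

# Find the folder that holds the failed-over VMs (example name)
folders = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.Folder], True)
folder = next(f for f in folders.view if f.name == 'NYC_DR')

vms = content.viewManager.CreateContainerView(
    folder, [vim.VirtualMachine], True).view
running = [vm for vm in vms if vm.runtime.powerState == 'poweredOn']

for vm in running:
    print('Shutting down', vm.name)
    vm.ShutdownGuest()   # graceful; requires VMware Tools in the guest

# Crude wait loop - enough to illustrate quiescing before the next step
while any(vm.runtime.powerState == 'poweredOn' for vm in running):
    time.sleep(10)
Disconnect(si)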

Step 5: Reverse Replication Path (Storage Array)

Many vendors use different terms for this. Some call it a "personality swap" – which makes it sound like some weird psychotropic event – while other vendors call it failover and failback. What it means is that where the replication path used to be Protected Site >> Recovery Site, we now need to send the data back using Recovery Site >> Protected Site. If we didn't do this we would suffer data loss. In some cases you will need to stop or pause any local snapshot process currently taking place on the original volume before triggering the personality swap or failback. Once the inversion has been completed we may have to resume any replication processes we used to have in place.

…with EMC Celerra

In EMC Celerra there is literally a "reverse" option, which is used to invert the existing path of replication. When you run a Recovery Plan, the replication between the source (New York) and destination (New Jersey) is stopped by the SRA; its status is changed on the Recovery Site Celerra to "failed over".


1. Login to the Recovery Site Celerra (New Jersey)
2. Select the Replications node
3. In the Replications tab, select each replication
4. Click the Start button

Note: On the next page you can see the replication is about to be reversed (sending changes from the Recovery Site (New Jersey) to the Protected Site (New York)).

After clicking OK, you should receive a second page warning you that the replication process will be reversed:


…with EMC Clariion

EMC uses the term "promotion" to describe the process by which a Secondary Image is made into a Primary Image in the MirrorView technology. In a planned DR scenario this inversion happens automatically. As we saw earlier in this chapter, in the section called "EMC Clariion and Running Plans", you may see the (F) icon appear, indicating that the Remote Mirror has become "fractured" – this does not happen if you use "Consistency Groups" within Navisphere. You may need to right-click the Remote Mirror and manually force a synchronization to occur. Once this has been done you can progress to Step 6.

…with HP Lefthand

In terms of the HP Lefthand VSA, the process would be as follows:

1. Login to the Lefthand Networks CMC as administrator
2. Select the original volume and click the Schedules tab
3. Right-click the snapshot and choose Pause Snapshot Schedule

Note: Repeat this for all your affected VMFS volumes, and also any RDM volumes. Now that our usual cycle of replication has been paused, we can replicate back – just once, not on a recurring schedule – the changes created whilst we were operating from the Recovery Site. If you fail to pause the schedule, this will cause errors when you come to test your failback to the Protected Site, like so:


4. In the NJ_Group, right-click the replicated volume, and choose New Schedule Remote to Remote Snapshot

Note: Although we only want to do this replication/snapshot once, we must still use the Schedule Remote Snapshot option for SRM to recognise this replicated LUN/volume.

5. Change the radio button to Never Recur, to make this a one-off replication. Under Remote Snapshot Setup, change the management group to be the Protected Site location (in my case NYC_Group) and enable the option to Include primary volumes. Finally, select the destination as the original volume we have at the Protection Site.

WARNING: GET THIS RIGHT. I'm one of those guys who has an overactive brain and can't locate the off switch for my frontal lobe. I was once up at 3.22 in the morning playing with this, didn't engage brain, and got this process wrong. It was part of the development for this book, so it wasn't the end of the world. My RDM had three 1KB files called newfile.txt, afterfailover.txt and beforefailback.txt, created at various points in the process. Anyway, I didn't engage the frontal lobes, and lost my beforefailback.txt file. This is also a warning about the dangers of working alone, without rest, late into the night or morning, on anything that involves data manipulation.


6. Click the Edit button next to Start At: and click OK. This will set the start time of the snapshot to now.

Note: This is effectively going to overwrite the original volume. During this replication/snapshot, only the differences since we ran the Recovery Plan will be copied back to the Protected Site, in my case New York. The time this takes depends heavily on the amount of change that has taken place since the Recovery Plan was executed. You are likely to see a warning appear, as this process is effectively modifying our source LUN/volume. The message effectively informs the administrator that the volume at the Protected Site is going to be demoted from a "Primary" volume to a "Remote" volume – the correct format to receive updates.

Once you click OK you should see the replication happen immediately. In the Lefthand Networks VSA you see animated graphics which show the replication now happening in the opposite direction.


Note: Repeat this process for the other affected volumes, including any RDMs

…with NetApp

In our original replication, the source (New York) was SnapMirrored to the destination (New Jersey). For failback, a new SnapMirror relationship needs to be configured to replicate any changes at the Recovery Site (New Jersey) back to the Protected Site (New York). New York now becomes the destination of replication updates – so the first step is connecting to the Protection Site and configuring a new SnapMirror relationship.

1. Login to the Protected Site (New York) FilerView
2. In FilerView, under Volumes and Manage, select the volume, and click the Restrict button


3. Under SnapMirror, click the Add option
4. In the Destination Location page, select from the pull-down list the volume that requires updates


5. Next, in the Source Location page, type the name of the source NetApp Filer, and the volume that contains the changes you need to replicate back to the Protected Site

Note: Next your way through the dialog boxes, adjusting the settings as your bandwidth and needs allow. This should create an uninitialized SnapMirror relationship – the volume that will receive updates having already been "restricted" in step 2.

6. Finally, select your new SnapMirror relationship under SnapMirror and Manage. Click the Advanced link, and select the Initialize link.


Note: The synchronization process should not take too long; it depends on the volume of changes since the failover. As both the source and destination have been initialized before (albeit in the opposite direction), all that needs to be synchronized are the changes to the volume since the failover event. Refresh the status page until you see the status change from "transferring" to "snapmirrored".

Step 6: Refresh Storage at the Recovery Site (New Jersey)

As we configured our system in the previous chapter to allow for bi-directional DR, the Recovery Site (New Jersey) is already configured to communicate with the two arrays that make up our configuration. All we need to do is a refresh and a rescan, to make sure that the system can see the new LUNs/volumes that have been replicated. If you remember, back in Chapter 6 I went through a very similar process when I showed what it was like to update your Protected Site in a scenario where new virtual machines had been created on a new VMFS volume. I've decided to repeat those instructions here, modifying the graphics to reflect our new configuration, and – for those of you with a unidirectional configuration – I will repeat the instructions on how to enable the Array Manager for the failback scenario.

1. At the Recovery Site SRM (New Jersey)
2. Click the Configure link next to the Array Manager
3. Select the entry for the Protection Site Array Manager and choose Add
4. Type in the username and password used to authenticate to the Storage Array, click Connect, and then after the wheel-of-hell has completed the connection click Next
Note: Repeat this for the Recovery Array Manager
5. In the last dialog box click the Refresh Array button – this should refresh the storage system and show the new LUN/Volume


If you were carrying this out on a NetApp Filer you would expect to see something like this:

Notice how the replication path is from the Recovery Site's "vol1_replica_of_virtualmachines" to the Protected Site's "vol1_virtualmachines". If you were carrying this out with Lefthand Networks VSA you would expect to see something like this:

Note: Once in my tests I noticed that, despite the successful replication of my LUN back to the Protected Site (New York), it was not appearing in this list. This was somewhat worrying, and at the time I had a strong suspicion that I would have to manually resolve problems with the LUN when I carried out the failback, just as I did when I had to failover. As I progressed this did indeed become the case. After writing this chapter I went through the procedure all over again – this time simulating a hard failure at the Protected Site (New York). On my second attempt this dialog box gave me information that was much more reassuring. So the moral of the story is – if, when you are doing the Array Manager procedure, you have missing LUNs/volumes in this list – then worry. Don't proceed until you have resolved the issue, because when you engage failback, if the storage simply isn't there or is incomplete, then your virtual machines will be incomplete too. Either certain virtual machines will be missing their files, or they will only partially complete.


Step 7: Configure the Inventory Mappings at the Recovery Site (New Jersey)

As with all Recovery Plans – I need to configure inventory mappings and Protection Groups. So from the Recovery Site (New Jersey), the administrator will need to tell SRM how to handle New York’s network, resource pools and folder structure. If you have configured a bi-directional setup then you may have very little work to do – as these settings may already be populated.

1. Login to the Recovery Site vCenter (New Jersey)
2. Click the SRM Icon
3. Select the + Protection Groups node, and click the Inventory Mappings tab
4. Configure your mappings appropriately

Warning: If you have not configured bi-directional DR you will need to map your network resources as well

Step 8: Create Failback Protection Group(s) (New Jersey)

Now we have the inventory mappings in place to tell SRM where to put our protected (New York) virtual machines, we can configure a Protection Group for them on the Recovery Site (New Jersey) SRM server. This will create "placeholder" files on the Protected Site (New York). If you have not done so already, you may like to create a placeholder LUN at the original Protection Site (New York).

1. Login to the Recovery Site vCenter (New Jersey)
2. Select the +Protection Group node, and click the Create Protection Group button
3. In the Protection Group dialog box, type a name such as Citrix Protection Group


4. Select the datastore which holds your protected virtual machines

Note: Notice how the VMFS volumes still show the volume name generated by the SRA’s triggering of the resignature process. I haven’t bothered to rename them, because I like the way the name reminds me that these VMFS volumes are coming from the Recovery Site and the failover I did earlier.

5. Select a temporary location for holding the “placeholder” or “shadow” VMX files on the Protected Site

Warning: It's at this stage that, if replication has not been correctly configured, you will receive warnings. In earlier tests, as I suspected, I had problems with one of my LUNs, and indeed I did receive errors on the virtual machine that had the RDM attached to it. The message that appeared was "Protect Virtual Machine – ctx01 – A virtual machine has one or more devices which don't have file backings on replicated LUNs". This caused the "ctx1" placeholder not to be created, and it was not listed at all at the New York location. To work around this problem at the time, I decided to right-click the london-ctx-1 virtual machine, remove its bad LUN reference, and click the "Configure Protection" button. The real solution was to fix the underlying storage replication problem. It later transpired I had actually run out of space for the snapshots I was creating. The problem was not caused by SRM or the Storage Array – but by my failure to monitor my actual storage utilization.

6. Repeat this process for the NFS/VMFS datastores which contain replicated VMs that you need to failback:

Step 9: Create a Failback Recovery Plan (New York)

We are now in a position to create a Recovery Plan at the Protected Site (New York) to failback New York's virtual machines. Clearly, it is wise here to run a test to see if the failback stands a chance of succeeding – and the Recovery Plan will need to be as sophisticated as the custom Recovery Plan we covered in Chapter 6. I don't intend to repeat that here – just take it as read that you may need to use Low, Normal and High Priorities, Start-Up orders, Scripts and Messages – to automate the process in the desired way.

1. Login at the Protected Site vCenter (New York)
2. Click the SRM icon
3. Select +Recovery Plans and click the Create Recovery Plan button
4. In the Recovery Plan dialog box type in a friendly name such as Failback to New York Recovery Plan and click Next
5. Select the Protection Group that contains the virtual machines you wish to failback to the protection site


Note: Complete the Recovery Plan as we have done previously, remembering to suspend virtual machines that are not required at the protected location

Step 10: Test Recovery Plan

1. Finally, test your Recovery Plan to see if failback will be successful. Use the Recovery Steps to monitor the progress of the Recovery Plan:

IMPORTANT: Do not proceed with the next step unless you’re 100% happy that the Recovery Plan will be successful.

Step 11: Execute the Recovery Plan for Real (New York)

Once I had resolved my LUN issues I was able to proceed and execute the failback plan as I would a failover plan. I have nothing to add here beyond what I have said already about this process.


Clean-Up of the Planned Failback

Wait! We're not done yet! Now we have the virtual machines back where we started, we have to clean up after this process. Critically, we need to make sure our returned virtual machines are being replicated back to the Recovery Site by the storage array – and also ensure they are properly protected by a Recovery Plan.

Step 1: Shutdown VMs at Protected Site (New York)

As with failover, most storage vendors recommend powering off the virtual machines to ensure they are quiesced before re-establishing the regular cycle of replication between the Protected Site (New York) and the Recovery Site (New Jersey). Please consult your array vendor's specific documentation before beginning this process. Before doing my power down, I made an obvious change to all my virtual machines – so that when I have re-established the normal cycle of replication from the Protected Site to the Recovery Site, I can confirm when I run a test of my Recovery Plan that the locations have the same virtual machines. In my case I changed my desktop colour from red to black.
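If you have many virtual machines to power down, PowerCLI (covered at length in the next chapter) can save some clicking. A hedged sketch only – it assumes the wildcard matches your replicated datastores and that every VM has VMware Tools installed so a guest shutdown is possible:

# gracefully shut down every powered-on VM on the replicated datastores
Get-Datastore *virtualmachines* | Get-VM | Where-Object { $_.PowerState -eq "PoweredOn" } | Shutdown-VMGuest -Confirm:$false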

Step 2: Clean-up the Placeholder files created during the Failback (New York)

During the failback process a Protection Group was temporarily made at the Recovery Site (New Jersey) to facilitate the failback process. This created a whole series of placeholder files at the Protected Site (New York), which we no longer need.

Here I am being careful not to delete the placeholder files from the Recovery Site (New Jersey) which are still valid for my bi-directional configuration.

Step 3: Delete the Failback Protection Group (New Jersey)

Next, at the Recovery Site (New Jersey), where we created a Protection Group to facilitate failback to the Protected Site (New York), we delete this Protection Group.


Step 4: Remove from the Recovery Site the References to the Old Recovered Virtual Machines

From the Recovery Site (New Jersey) we must remove the old references to the Protected Site's (New York) virtual machines. We will be re-establishing protection for these virtual machines now they are back at the Protection Site (New York), and if we left them in place, conflicts would occur between them and the newly created placeholder files.

Step 5: Delete the “Failback” Recovery Plan from the Protected Site (New York)

Now that failback has been achieved we don’t need this Failback Recovery Plan.

Step 6: Re-Establish the Normal Pattern of Replication

Now that the Protected Site (New York) owns the virtual machines, we must make sure they are being replicated/snapshot to the Recovery Site (New Jersey). This process will vary from storage vendor to storage vendor.

…with EMC Celerra

…with EMC Clariion

With the EMC Clariion, as we saw when failover occurred, the Primary and Secondary Images inverted their roles – the same thing occurs with failback. So with a planned failover there should be no work to do. However, it is good practice to check the Remote Mirror has not become "fractured" during the failback process – and confirm that no manual administrative trigger of synchronization is required. I like to run through the Array Managers configuration to confirm that LUNs are being replicated correctly. In the screen grab below you can see the resignatured LUNs are being displayed. Occasionally, I've run through the Array Managers wizard and found that the Protected Site and Recovery Site are not yet mirrored. It does seem to take some time for the inversion of replication roles to complete. It does complete and it is very reliable – but it does take some time.

…with HP Lefthand Networks

In Lefthand Networks VSA we must first change the volume that the Recovery Site (New Jersey) was using from being a primary volume back to being a remote volume. After doing that we can clean up the old remote snapshot that was taken of this volume, which was used to synchronize the data just before the failback. Lastly, we can resume the remote copy schedule so the Protection Site replicates its changes to the Recovery Site. In my case I have one volume that needs marking as "remote". It's very easy to see if a volume is Primary or Remote by the colour of the icon next to the volume in the Management interface:

Primary Volumes

Remote Volumes


1. Login to the Lefthand Networks VSA as Administrator
2. Expand +NJ_Group (New Jersey), +NJ_Cluster and +Volumes
3. Right-click the replica_of_virtualmachines and from the menu choose Edit Volume
4. Then click the Advanced Tab and change the Type to be Remote

Click OK to the two warning dialog boxes


Note: This should re-trigger the regular pattern of replication/snapshot between the Protected (New York) and Recovery (New Jersey) sites. After clicking OK the snapshot process should start immediately. Once this initial “sync” has taken place we can then re-enable the regular schedule for replication at the New York Group

5. Select the Primary Volumes held at the NYC_Group, in my case these are the volumes called virtualmachines

6. Select the Schedules Tab

Note: Notice how the schedule for my virtualmachines LUN/Volume is paused. Also notice how the rdm_ctx1 volume is still greyed out – this is because it is still regarded as a remote volume. It needs to be marked as a primary volume before its schedule is resumed

7. Right-click the Paused schedule, and choose Resume Snapshot Schedule

…with NetApp

At this point the state of the NetApp Filers will be very similar to the state when we carried out our planned failover. The SnapMirror that replicated changes from the Recovery Site (New Jersey) back to the Protected Site (New York) will now be broken-off.


When you're looking at this view be very careful. If you have been following this book to the letter, both NetApp Filers will each have a broken-off relationship – one from the failback and the other from the failover. The screen grab above shows the broken-off relationship caused by the failback to the New York Filer, and the screen grab below is the broken-off relationship caused by the failover.

These stale "broken-off" relationships can be left in place for future failback/failover requirements. However, if they offend your eye they can be deleted – but if you ever try a failover and failback again, you will only find yourself re-inputting the SnapMirror settings all over again. The choice is yours. With that said, if you log in to the NetApp Filer directly you will receive repeated warnings about these broken-off relationships. I imagine these will be logged within the system, and they end up being regarded as a problem, when they are actually just part of the process. These will be reported at the NetApp Filer console – for example, the one below comes from me creating a SnapMirror before the failback process.

[replication.dst.err:error]: SnapMirror: destination transfer from new-jersey-filer1.corp.com:vol1_replica_of_virtualmachines to vol1_virtualmachines : destination is not in snapmirrored state.

We can re-establish the original SnapMirror relationship by returning to the Recovery Site (New Jersey) broken-off relationship and re-syncing it again. See it a bit like taking your partner out to a flash restaurant to make up for all the time you spend at work.

1. Login to the Recovery Site (New Jersey) FilerView
2. Select in the menu the SnapMirror and Manage node
3. In the SnapMirror Manage page, locate the broken-off relationship, and click the Advanced link


4. In the SnapMirror Properties page, click the Resync link

Note: The observant amongst you will notice I did not "restrict" the destination volume. This is because the "restrict" option is only required during the first initialization of a SnapMirror. After that point it is not a hard requirement – so when you come to resync a broken-off relationship, there is no need to restrict the volume. The restrict option is used to guarantee the SnapMirror process exclusive access to the volume.
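The same resync can also be driven from the 7-mode console of the Recovery Site filer (New Jersey), since a resync is always run at the destination. A sketch only – the New York filer name here is my own invention, so adjust the filer and volume names to your environment:

snapmirror resync -S new-york-filer1.corp.com:vol1_virtualmachines vol1_replica_of_virtualmachines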

Step 7: Recreate Protection Group at the Protection Site (New York)

During the failback process we deleted the Protection Group that covered our virtual machines in New York. Now that we have cleaned up the system, we are in a position to reinstate the protection.

1. Logon with the vSphere Client to the Protected Site's vCenter (New York)
2. Click the Site Recovery icon
3. In the Summary Tab, in the Protection Setup pane, click the Create link next to the Protection Groups option
4. In the Create Protection Group – Name and Description dialog box, type in a friendly name and description for your Protection Group. In my case I'm creating a Protection Group called File Servers Protection Group
5. When you click next, the Protection Group wizard will show you the datastores discovered by the Array Manager


6. Next select a datastore “placeholder” for your VMs.

Step 8: Re-Enable the Protection Group in the Recovery Plan (New Jersey)

As we deleted the Protection Group in the earlier stage, simply re-creating the Protection Group doesn’t automagically reconnect it to our old Recovery Plans. We will need to edit each of them in turn, and enable the Protection Group for them. Unfortunately, one of the big side effects of deleting the Protection Group and recreating it (which is part and parcel of the current failback procedure) is that when the virtual machines are added back into the Recovery Plan (by virtue of their membership of the new Protection Group) all your priority settings are lost, and you will once again have to manually order your virtual machine start-up. This is not nice. Not at all nice.

1. Logon with the vSphere Client to the Recovery Site's vCenter (New Jersey)
2. Click the Site Recovery icon
3. Select the Recovery Plan(s) – in my case I have many
4. Click Edit Recovery Plan and Next to the dialog box


5. Next, re-enable the Protection Group for the Recovery Plan
6. Click Next, and in the VM Response Times dialog box, select a timeout value you think is appropriate for the power on of your recovery VMs
7. Next, in the Edit Recovery Plan – Configure Test Networks dialog box, set the options to handle networking when you run a test
8. Finally, you can suspend VMs at the Recovery Site to free up CPU and Memory resources in the Create Recovery Plan – Suspend Local Virtual Machines dialog box. In my case I called for my Test & Dev VM to be suspended

9. Click Finish
10. IMPORTANT: REVIEW AND RESET ALL YOUR VIRTUAL MACHINE PRIORITY AND ORDER CONFIGURATION SETTINGS CONTAINED WITHIN THE RECOVERY PLAN. THE NEW PROTECTION GROUPS HAVE CREATED NEW PLACEHOLDER VMs; THIS MEANS ALL YOUR VMs WILL BE LOCATED IN THE NORMAL PRIORITY FOR SHUTDOWN AND POWER ONS.

Note: Well, we are now back where we started before failover and failback. You might wish to test your plan(s) again to guarantee they work correctly

Step 9: Manage Volumes and Restructure Datastore Folders

Once the failback process has been successfully completed you may find your VMFS volumes still show snap-NNNNNNN-originalname. Additionally, if you use datastore folders, these "new" LUNs/Volumes may not be in the correct folder location. It is simply a matter of relabeling them, and then dragging and dropping them to the correct location:
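If there are many volumes to relabel, the renaming can be scripted rather than clicked. A hedged PowerCLI sketch, assuming every datastore that still carries the snap- prefix should simply revert to its original name:

# strip the snap-XXXXXXXX- prefix left behind by the resignature process
Get-Datastore | Where-Object { $_.Name -like "snap-*" } | ForEach-Object {
    Set-Datastore -Datastore $_ -Name ($_.Name -replace '^snap-[0-9a-f]+-', '')
}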


Alternatively, if you are working with NetApp and NFS, you might find that the NFS mounting points used during the failover process are still mounted after the failback process. These aren't dangerous, as the volumes will now be marked as read-only and will be receiving updates from the NetApp Filer in the Protection Site. However, they can cause unexpected behavior if they are left mounted. For example, if you try to carry out a test of your Recovery Plan you may see serialization of the datastore names occurring like so:

The volume called netapp-virtualmachines is the stale NFS mount used during the failback, and the netapp-virtualmachines(1) is the snapshot mounted during a test of a Recovery Plan. I prefer to unmount the failover volumes once the failback has been verified.

Unplanned Failover - Protected Site is DEAD

Since my planned test I've put my configuration back in place, and I even went as far as testing my Recovery Plan for my protected virtual machines in New York – to make sure they work properly. Now what I want to do is document the same process of failover and failback – based on a total loss of the Protection Site (New York). To emulate this I did a hard power off of all my ESX hosts in the Protected Site using my ILO cards. If you recall from the beginning, I'm running everything on two poor old ESX hosts. This includes the Protection Site's SQL, vCenter, SRM and Lefthand Networks VSA. This emulates a total catastrophic failure – there is nothing running at the Protection Site (New York) any more. I did a hard power off to emulate a totally unexpected and dirty loss of the system. My main reason for doing this is so I can document how it feels to manage SRM when this situation happens. You might never get to try this until the fateful day arrives. Additionally, on my fibre-channel, iSCSI and NAS systems, I either took those arrays out of the zone configuration, or used IP conflicts (created by VMs) to emulate total loss of communications between the storage arrays. The first thing you'll notice when you arrive at the Recovery Site – apart from a lot of worried looking faces, that is – is that your storage management tools will not be able to communicate with the Protected Site. For example, in the HP Lefthand VSA you will find that your VSAs at the Protected Site are simply removed from the management windows


With something like EMC's Clariion Navisphere, once you have run the Recovery Plan you will see that references to the Primary/Secondary images are removed, and you are left with just primary images at the Recovery Site. Additionally, until the network is restored there will be no access to the storage array at the Protected Site.

The main difference when the Protected Site (New York) is unavailable is that as you login to the Recovery Site (New Jersey), the vSphere Client will ask you to login to the vCenter in the Protected Site (New York) – and this will fail because, remember, the vCenter at the Protected Site is dead! If you have the vSphere Client open at the time of the site outage, you will see that the vCenter server becomes unavailable and so will the SRM Server.


Additionally, you will see in the bottom right-hand corner of the vSphere Client that connectivity to the Protected Site vCenter has been lost

If you close and re-open the vSphere Client, and then switch to SRM view you will see that vCenter and SRM host are unavailable. The exact messages do vary dependent on whether you have enabled “linked mode” between the respective vCenter servers.

Of course, in this scenario you would want to run your Recovery Plan. It is still possible in this state to run a test of the Recovery Plan if you so wish. When you do run the Recovery Plan, the SRM service will attempt to power off VMs in the Protected Site. Despite the fact that the Protected Site is unavailable, you will see a "success" message for a task that logically cannot complete. You cannot power off VMs at the Protected Site if it's a smoking crater. The screen grab below shows this erroneous "success" message.


Planned Failback – Protected Site is BACK

Of course, the failback process can only proceed, by definition, if the Protected Site (New York) is available again. In this respect it shouldn't be that different from the failback process covered earlier in this chapter. Nonetheless, for completeness I wanted to cover this. Now, I don't intend to cut and paste the entire previous section. For brevity I only want to cover what made this failback look and feel different. To emulate this I powered back on my ESX hosts – I allowed the system to come up in any damn order – to generate a whole series of errors and failures. I wanted to make this as difficult as possible, and so made sure my storage array, SQL, vCenter and SRM server were all back online again – but all started in the wrong order. I thought I would repeat this process again to see if there were any unexpected gotchas I could warn you about. Firstly, when I brought back up the Protected Site I had some rather minor service errors caused by the Protected SRM coming up sooner than either my SQL or vCenter system. A quick "net start vmware-dr" on the SRM server fixed that. I had a couple of connectivity issues to resolve and had to re-pair the sites together again. This will be similar to the stuff I covered in Chapter 5: Failure to Connect to the SRM Server.


From this point onwards things very much went to plan. I was able to remove from the inventory the "old" protected virtual machines at the Protected Site (New York), and I was able to shut down the virtual machines at the Recovery Site (New Jersey) prior to pausing the replication cycles and inverting the direction of replication temporarily from the Recovery Site (New Jersey) back to the Protected Site (New York). In the unplanned scenario it will be the case that some work will need to be done to manage the replication paths. With a planned failover the SRA is designed to manage the replication thread in a graceful manner. Of course, with an unplanned outage of the Protected Site this cannot happen. In general the steps required to resolve these issues are pretty easy to handle, but they do vary from vendor to vendor. Additionally, it's perhaps worth thinking about the realities of a true disaster. If the disaster destroyed your storage layer at the Protected Site, and new storage has to be provisioned, it will need configuring for replication. That volume of data may be beyond what is reasonable to synchronize across your site-to-site links. It's highly likely in this case that the new array for the Protected Site will need to be brought to the DR location so it can be synchronized locally first, before being shipped to the Protected Site locale.

…with EMC Celerra

With EMC Celerra the steps to carry out a controlled failback do not differ in an unplanned failover scenario. I would refer to the section about the planned failback for further details.

…with EMC Clariion

With the EMC Clariion we saw that when a planned failover occurred, the Primary and Secondary Images inverted their roles. In an unplanned failover the primary at the Protected Site will remain a primary, and the MirrorView relationship will be "fractured". What needs to happen is that the MirrorView relationships need cleaning up and then re-establishing.


1. In Navisphere, navigate to the Remote Mirrors on the Protected Site (New York)
2. Open the Remote Mirror object
3. Right-click the Secondary Image, and choose Fracture

Note: Up until this stage the Remote Mirror is "SysFractured", but by doing this task the Secondary Mirror's status will change to "AdminFractured". By carrying out this task you are formally telling the system that the previous MirrorView relationship is now invalid, and the system will stop trying to synchronize data.

4. The next step is to destroy the Remote Mirror configuration on the Protected Site, and then rebuild it. Right-click the Remote Mirror, and select Properties. Then click the Primary Image tab, and click the Force Destroy button


Note: Confirm the warning box. Remember, all we are destroying here is the MirrorView relationship at the Protected Site. The LUNs are still present, and we have not destroyed the Remote Mirror at the Recovery Site (New Jersey).

Important: We're now in a position to rebuild the MirrorView relationships. The Remote Mirror at the Recovery Site is missing a Primary Image. However, we cannot simply add the Secondary Image back in, as it is in use by the Storage Group in the Protected Site. So before we begin we must first temporarily remove it from the Storage Group at the Protected Site. This is enforced by Navisphere to prevent you accidentally replicating the contents of the wrong LUN, which may in fact be in use by the hosts.

5. In the Protected Site (New York), right-click your Storage Group
6. Choose Select LUNs
7. In the Storage Group dialog box, locate the LUN you wish to replicate changes to, and select and remove it. In my case, as I am fixing the Remote Mirror that controls the replication of the Citrix XenApp environment, I'm removing LUN60 from the Storage Group

Note: Before you click remove you may want to take a note of the Host ID value. In my simple environment LUN60 is presented to the ESX hosts with the Host ID of 60. This may not necessarily be the case in your environment – where the Clariion might know this LUN with the ID of 120 but the Host ID is LUN 10. Next we will switch to the Recovery Site Array (New Jersey), and add back a Secondary Image to the existing Remote Mirror


8. Right-click the Remote Mirror, and select Add Secondary Image
9. In the Add Secondary Image dialog box, locate the LUN on either SPA or SPB, and select it as the target. Optionally, you can disable the "Initial Sync Required" option and set the Synchronization Rate to High. As these LUNs have replicated to each other before, an initial sync may not be required – unless the array at the Protected Site was damaged beyond repair, and this is a new replacement array
10. Once the Primary and Secondary LUNs are synchronised, the LUN we removed from the Storage Group at the Protected Site can be added back to allow the ESX hosts access. Additionally, you could lower the synchronization rate down to a more acceptable level appropriate for your normal operations. The above process would be repeated for each volume affected by the hard failover process.

IMPORTANT: It is very easy to forget to add the LUNs you removed from the Storage Group back into the system. Remember to confirm that the LUNs are back in the right storage groups once you have re-established the MirrorView relationship.

…with HP Lefthand

With the HP Lefthand arrays the SRA does not change the status of the Primary volume(s) at the Protected Site for either a planned or unplanned failover. The steps required to failback from the Recovery Site to the Protected Site are the same as in the planned failback scenario. I won’t repeat those steps here because they do not differ.

…with NetApp

One of the stages involved in the failback scenario is cleaning out volumes and virtual machines. For this to happen the storage will need to be available at the Protected Site location. This clean-up process is only required if your storage array survived the disaster. In the case of NFS storage it is possible that the ESX hosts may come up before your storage array is available. This will result in those NFS shares not being mounted at boot time. As such they will be marked as "inactive" in the datastores view within the vSphere Client like so:

Once these volumes are available again, ESX will default to retrying the mounting process. This can take some time, so to avoid reboots you might wish to use the console utilities and the esxcfg-nas command to force a re-mount:

esxcfg-nas -r

As with the planned failover and failback, you would need to replicate any changes that occurred at the Recovery Site back to the Protected Site. This would either require a new SnapMirror relationship to be enabled or a previous relationship to be re-initialized. Much depends on whether you hard-tested your failover and failback beforehand, and whether you cleaned up the SnapMirror relationships during this hard-test or "DR Rehearsal". If you left them in place, it's simply a case of initializing the SnapMirror from the Recovery Site to the Protected Site, and then resyncing the volumes again to re-establish the normal pattern of replication.
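If you would rather not trawl the GUI looking for inactive datastores across many hosts, a rough PowerCLI sketch such as this one will list every datastore that vCenter currently marks as inaccessible:

# list the datastores vCenter currently regards as inaccessible
Get-View -ViewType Datastore | Where-Object { -not $_.Summary.Accessible } | ForEach-Object { $_.Name }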

Conclusions

As you have seen, the actual process of running a plan does not differ from running a test. The implications of executing a Recovery Plan are so immense I can hardly find the words to describe them. Clearly, a planned failover and failback is much easier to manage than one caused by a true failure. I've spent some time on this topic because this is chiefly why you buy the product – and perhaps, if you are lucky, you will never have to do this for real. As with all insurance against disaster, SRM is a waste of money until you have to claim on that policy.


Despite this, if you look back through the chapter, most of the volume of what I have written is about failback, not failover. This is with good reason – because despite being able to use the features of SRM to speed up failback, it is essentially a manual process. Of course, the get-out clause is that SRM was never intended, never designed, to automate failback. But this could be a limitation in the adoption of this first release of SRM. I know of certain banks, financial institutions and big pharmaceuticals who very rigorously test their DR strategies – some to the degree of invoking them for real once a quarter, despite not actually experiencing a real disaster. The idea behind this is two-fold. Firstly, the only way to know if your DR plan will work is if you use it – see it like a UPS system; there's nothing like pulling the power supply to see if a UPS actually works. Secondly, it means the IT staff are constantly preparing and testing the strategy – and improving and updating it as the Protected Site changes. For large organizations the lack of an automated failback process may be a significant pain point in the SRM product, and something I expect and hope will come in future releases as the product matures. The building blocks in terms of SRM and the SRA are already in place to allow this to happen. Perhaps this will be a good opportunity to move on to another chapter. I'm a firm believer in having a Plan B in case Plan A doesn't work out. At the very least you could abandon SRM and do everything we have done so far manually. Perhaps this next chapter will finally give you the perspective to understand the benefits of the SRM product.


Chapter 13: Scripted Site Recovery


A Very Special Acknowledgement

I would like to give a special acknowledgement to the individuals who directly helped with this section – with specific reference to the PowerShell aspect. I would like in particular to thank Carter Shanklin of VMware, who was happy to answer my direct emails and is the Product Manager for VMware PowerShell. Additionally, I would like to thank Hal Rottenberg, whom I first met via the VMware Community Forums. Hal is the author of a new book called "Managing VMware Infrastructure with PowerShell". If you wish to learn more about the power of PowerShell then I would certainly recommend watching and joining the VMware VMTN community forum – and purchasing Hal's book. I would like to thank Luc Dekens from the PowerShell forum, who was especially helpful in explaining how to create a virtual switch with PowerShell. I would also like to thank Al Renouf of virt-al.com for his assistance with this chapter. Finally, I would like to thank Dave Medvitz, who supplied the "Automating SRM" section of this chapter, which he wrote completely of his own accord. I first met Dave via the VMware Communities forum where we were having a discussion about automating SRM. It was my mistaken assumption that the process within SRM couldn't be scripted or automated – fortunately Dave set me right on that one! http://communities.vmware.com/thread/253736 It's precisely this kind of "VMware Communities" contribution I would like to see more of in the future. I personally believe the days of one guy (like me) having the skills and abilities to do all this kind of stuff on his own are drawing to a gradual close. So this chapter is split into two parts. The first is Dave's work on how to automate tasks in SRM. Going forward I'm sure that PowerCLI will have this kind of functionality, but right now, this is where it is at. The second part concerns carrying out SRM-like tasks without SRM – and covers command-line tools such as the esxcfg- commands and the vCLI, together with examples from the PowerCLI.

Part 1: Introduction - Automating VMware SRM

Written by Dave Medvitz; bold and italics by Mike Laverick!

VMware SRM includes a set of APIs that allow interaction with existing Recovery Plans. With this API you can:

• Identify Recovery Plans
• Run Recovery Plans in Test or Recovery modes
• Pause, Resume, or Cancel active Recovery Plans
• Retrieve the log of a Recovery Plan

How the SRM API works and Requirements

The SRM API is presented as a set of web services. In PowerShell, we need to produce a .NET assembly to consume these web services. For PowerShell v2, that is performed with the New-WebServiceProxy cmdlet:

New-WebServiceProxy -Uri http://srm4nj.corp.com:9008/srm.wsdl -Namespace SRM


This command needs to be run prior to accessing the API, and is only valid for the current PowerShell session. Once the proxy is set up, create the necessary objects and make your connection to the recovery server.

$mof = New-Object SRM.ManagedObjectFramework
$mof.type = "SrmServiceInstance"
$mof.value = "SrmServiceInstance"
$srm = New-Object SRM.SrmBinding
$srm.url = "https://srm4nj.corp.com:9007"
$srm.CookieContainer = New-Object System.Net.CookieContainer
$context = $srm.GetContent($mof).srmApi

All of the API methods are methods of the SrmBinding instance, $srm. The first parameter will always be $context.
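Once connected, a quick sanity check is to log in and ask for the API version, using the SrmLogin, GetApiVersion and SrmLogout methods summarized in the next section (the credentials here are just my lab values):

$srm.SrmLogin($context, "corp\administrator", "vmware")
$srm.GetApiVersion($context)
$srm.SrmLogout($context)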

API Overview

The SRM.SrmBinding object is our interface to the SRM API. All of the API functions are methods of this object. These methods are summarized below:

Authentication to Site Recovery Manager

SrmLogin
Parameters: context: Managed Object (API context); userName: string; password: string

SrmLogout
Parameters: context: Managed Object

Gathering Information

GetApiVersion
Parameters: context: Managed Object
Returns: version: string
Notes: 1.0 is the current version, and is valid for SRM 1.0, 4.0 and 4.0U1

GetFinalStatus
Parameters: context: Managed Object; recoveryPlan: string (recovery plan to get status on); firstLine: integer (first line to return); lineCount: integer (number of lines to return)
Returns: partialStatus: string[] (lines from the XML log from the last run of the recovery plan)
Notes: This may need to be called multiple times to get the full response

RecoveryPlanStatus
Parameters: context: Managed Object; recoveryPlan: string
Returns: a status object whose Status property is one of: Uninitialized, Running, Prompting, Paused, Completed or Error

Recovery Plan Actions

RecoveryPlanStart
Parameters: context: Managed Object; recoveryPlan: string; planType: string (valid values are 'Test' and 'Recovery')

RecoveryPlanPause
Parameters: context: Managed Object; recoveryPlan: string

RecoveryPlanResume
Parameters: context: Managed Object; recoveryPlan: string

RecoveryPlanCancel
Parameters: context: Managed Object; recoveryPlan: string

RecoveryPlanAnswerPrompt
Parameters: context: Managed Object; recoveryPlan: string
Notes: When a running recovery plan gets to a Message step, the status goes to 'prompting'. The recovery plan will wait there until there is an acknowledgement of the message, either via the plug-in or through this API call. There is no way within the API to get the message text.
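As a worked example of the paging behaviour noted for GetFinalStatus, the sketch below pulls back the XML log of a plan's last run 50 lines at a time. The plan name is just an illustration, and I am assuming the line index starts at 0 – experiment in your own environment:

$first = 0
do {
    $chunk = $srm.GetFinalStatus($context, "Failback to New York Recovery Plan", $first, 50)
    $chunk | ForEach-Object { $_ }
    $first += 50
} while ($chunk -and $chunk.Count -eq 50)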

Example 1: Automate the Test of a Recovery Plan

Now that we have all of the pieces, let's put them together to do something useful. The goal here is to log in to our recovery server and run a Recovery Plan test. To do this, we'll create the script Run-SrmTest.ps1. We'll need to pass three pieces of information to the script: the recovery server, the recovery plan name, and a set of credentials.

param(
    [string]$server,
    [string]$recoveryPlan,
    [System.Management.Automation.PSCredential]$credential
)

1. Setting up and Logging in

The first thing we will need to do is make sure we have access to the SRM web service:

New-WebServiceProxy -Uri "http://${server}:9008/srm.wsdl" -Namespace SRM

Once we have our web service proxy, we need to set it up and log in. First, we need to create a Managed Object Framework object for SRM.

$mof = New-Object SRM.ManagedObjectFramework
$mof.type = "SrmServiceInstance"
$mof.value = "SrmServiceInstance"

And then bind to the SRM instance on the recovery server.

$srm = New-Object SRM.SrmBinding
$srm.url = "https://${server}:9007"
$srm.CookieContainer = New-Object System.Net.CookieContainer
$context = $srm.GetContent($mof).srmApi
$srm.SrmLogin($context, $credential.UserName, $credential.GetNetworkCredential().Password)

2. Running the Plan

Now that we're logged in, let's start the recovery plan in test mode:

$srm.RecoveryPlanStart($context, $recoveryPlan, "Test")

3. Monitoring the Plan's Progress

By default, the Recovery Plan will have a message separating the test from the cleanup; there may be other prompts added into the plan as well. To handle this, we'll check the status every few minutes, and answer any prompts that come up. Unfortunately, the actual message is not obtainable through the API, so we can't do anything with it.

$go = $true
While ($go) {
    Start-Sleep 120
    $status = $srm.RecoveryPlanStatus($context, $recoveryPlan).state
    If ($status -ieq "prompting") {
        $srm.RecoveryPlanAnswerPrompt($context, $recoveryPlan)
    } ElseIf ($status -ieq "running") {
        $go = $true
    } Else {
        $go = $false
    }
}

And we're done. The script will terminate after the Recovery Plan Test has completed (including clean up). The log for this run can be retrieved through the GetFinalStatus API call or the SRM plug-in.

Example 2: Automate Recovery

The biggest hurdle to automating recovery is getting approval to embed credentials in the script. PowerShell and the .NET framework allow for the use of encrypted strings through the SecureString class, as well as the ConvertTo-SecureString and ConvertFrom-SecureString cmdlets. In most cases, secure strings can only be decrypted by the user who encrypted them. In PowerShell, a secure string can be generated by:

$str = Read-Host -AsSecureString

The object obtained by this command now needs to be in a form that we can put into a script. To do this, we run:

ConvertFrom-SecureString $str

Now we have a string that we can embed into a script. Assuming that we have a user name in variable $user, and the exported password as $pass, we create a credential object:

$cred = New-Object System.Management.Automation.PSCredential $user, (ConvertTo-SecureString $pass)

The SecureString password can now be obtained from $cred.Password. The clear text password can be obtained from $cred.GetNetworkCredential().Password

Validation Scripts

Running validation scripts, especially when doing a test recovery, can be tricky. There is no place in the API to get a list of the machines included in the currently running recovery plan (you can get the systems from the last run via GetFinalStatus).


If run frequently (daily) this may not be an issue, but if run weekly or less often, this may mean that it is weeks before an issue can be discovered. If test scripts are desired, all recovered VMs should exist within some well-known structures from which the VMs can be determined (folder, datastore, etc.). Once the VMs are determined, test scripts can be run utilizing the Invoke-VMScript cmdlet. This cmdlet requires credentials for the ESX hosts, credentials for the Guests, and the script text to be run. The sample below will get all services for the given VM set to Auto which are not running:

Invoke-VMScript -VM vmname -HostCredentials hostCreds -GuestCredentials guestCreds -ScriptText "gwmi win32_service -Filter \"State!='Running' and StartType='Auto'\""
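To run that check against every recovered VM, a loop such as the one below works, assuming the recovered VMs land in a well-known folder (the NYC_DR folder name is my invention) and that $hostCreds and $guestCreds already hold credential objects; the parameter names follow Dave's sample above:

# repeat the service check for every VM in the hypothetical NYC_DR folder
foreach ($vm in (Get-Folder NYC_DR | Get-VM)) {
    Invoke-VMScript -VM $vm -HostCredentials $hostCreds -GuestCredentials $guestCreds -ScriptText "gwmi win32_service -Filter ""State!='Running' and StartType='Auto'"""
}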

Compiling a WebProxy DLL

It may be desirable to compile a DLL with the web proxy code instead of generating it dynamically; in fact, for PowerShell v1 it is necessary. To do this, download and install the .NET 3.5 Framework and the .NET 3.5 SDK. Locate the path to wsdl.exe and csc.exe. The following will create the DLL necessary for connecting to the API:

Path-to-wsdl.exe /n:SRM /out:SrmBinding.cs http://srm4nj.corp.com:9008/srm.wsdl
Path-to-csc.exe /t:library /out:SrmBinding.dll SrmBinding.cs

These will create the DLL SrmBinding.dll. To use the DLL, it needs to be imported:

[Reflection.Assembly]::LoadFrom(fullpathtoDLL)

Loading the DLL as described above is equivalent to creating it on the fly via the New-WebServiceProxy cmdlet.

Part Two: Introduction – Recovery without Site Recovery Manager

One of the interesting ironies or paradoxes that came to mind when writing this book was – what if, at the point of requiring my DR Plan, VMware's Site Recovery Manager failed or was unavailable? Put another way, what's our Recovery Plan for SRM?! Joking apart, it's a serious issue worth considering: there's little point in using any technology without a Plan B if Plan A doesn't pan out as we expected. Given that the key to any Recovery Plan is replication of data to a Recovery Site, at the heart of the issue the most important element is taken care of by your storage array, not by VMware's SRM. Remember, all that VMware SRM is doing is automating a very manual process. So there are really two main agendas behind this chapter – the first is how to do everything VMware SRM does manually in case our Plan A doesn't work out; the second is to show how incredibly useful SRM is in automating this process. Hopefully you will see in this chapter how hard life is without VMware's Site Recovery Manager. Like any automation or scripted process, you don't really see the advantages until you know what the manual process is like. With that in mind I could have started with this chapter as chapter 1 or 2, but I figured you would want to get stuck into SRM, which is the topic of this book, and save this content for the end. It will also give you an idea of what SRM is doing in the background – which is making your life much easier. The big advantage of SRM to me is that it grows and reacts to changes in your Protected Site – something a manual process would need much higher levels of maintenance to achieve. As part of the preparation for this chapter, I decided to delete the Protection Groups and Recovery Plans associated with our bi-directional configuration from my Recovery Site (New Jersey).


The manual recovery of virtual machines will require some storage management – stopping the current cycle of replication, and making the "remote volume" a primary volume that is read-writable. Whilst SRM does this automatically for you using the SRA from your storage vendor, in a manual recovery you will have to do this yourself. This is assuming you still have access to the array at the Protection Site, as with a planned execution of your DR plan. Additionally, once the replication has been stopped we will have to grant the ESX hosts in the Recovery Site access to the last good snapshot that was taken. On the ESX host side, once granted access to the volumes, they will have to be manually rescanned – to make sure the VMFS volume is displayed. Based on our requirements and the visibility of the LUN we will have the option either not to resignature or to force a resignature of the VMFS volume. After we have handled the storage side of things we will have to edit the VMX file of each virtual machine and map it to the correct network. After doing that we would then be able to start adding each individual virtual machine into the Recovery Site – each time having to tell the vSphere Client which cluster, folder and resource pool to use. In the ideal world some of this virtual machine management could be scripted using VMware's various Software Development Kits such as the Perl Scripting ToolKit, the PowerCLI or the vCenter SDK with some language of your choice – VB, C# and so on. I intend to use the PowerCLI for VMware as an example. As you will see, even with scripting the process is laborious and deeply tedious. Critically, it's quite a slow process as well, and this would impact on the rapidity of your recovery process. Think of all those RTOs and RPOs.

For an Unplanned Recovery

For an unplanned recovery we must shut down the virtual machines that are currently running in production before we manage the storage. If you remember, the first step of any planned recovery is to power down the virtual machines in the Protected Site.

If you are doing a manual failover for testing purposes this is not necessary.

Manage the Storage

Without an SRA, we will have to engage more with the vendor's tools for controlling snapshots and replication. This area is very vendor-specific, so I would refer you to their documentation. In terms of Lefthand Networks' VSA it would be the following steps. Currently, the Recovery Management Group (New Jersey) is holding the primary copy and the Protected Management Group (New York) is holding a remote copy maintained by a scheduled remote snapshot. I will need to give the ESX hosts access to the latest replicated or snapshot volume, by adding it to an existing volume list.

1. Login to the Lefthand Networks VSA as administrator
2. In the NJ_Group, right-click a Snapshot from the list
3. From the menu select Assign and Unassign Servers


4. Select the ESX hosts, ensuring the permissions are Read/Write

Note: In the example above I’m giving my ESX hosts in New York access to the last snapshot of my virtualmachines volume (804)

VMware PowerCLI

The following sections discuss the manual process of getting virtual machines added into vCenter and ready for power on. I've also decided to show you how to do the same tasks with PowerShell. What follows is almost a getting-started guide for getting set up with PowerShell. First of all, download and install Windows PowerShell. Currently, Windows PowerShell is on version 2. http://support.microsoft.com/kb/968929 Next you will need to download and install the VMware PowerCLI, which adds hundreds of additional cmdlets to the Microsoft PowerShell library specifically designed for use with vSphere4. http://communities.vmware.com/community/vmtn/vsphere/automationtools/powercli To connect and login to the vCenter you use the connect-viserver command like so:

connect-viserver vc4nyc.corp.com -User corp\administrator -Password vmware

TIP:


I do appreciate how difficult it is to type all this kind of code by hand. So, I have taken all these PowerShell examples and put them into a text file. You can download this text file from the RTFM website so you can cut and paste, and just change the variables such as your virtual machines names and resource pool names. Remember this book also exists as a PDF so you should also be able to cut and paste directly from there. http://www.rtfm-ed.co.uk/srm.html

Rescan the HBAs of each ESX host

You should be more than aware of how to do a rescan of an ESX host either from the GUI or CLI; what you are looking for is a new VMFS volume to appear. This rescan has to be done once per affected ESX host, and it would be quite laborious to do this via the vSphere Client (and again, if you made a mistake with the storage layer). Of course you could login with PuTTy and use esxcfg-rescan. Personally, I would be more tempted to use the vCLI, which can be installed on the same management system as the PowerCLI for VMware. With the vCLI for Windows we could use the esxcfg-rescan.pl script:

esxcfg-rescan.pl --server esx1 --username root --password vmware vmhba32

Whilst the vCLI is a nifty tool which will come into its own once ESX4i (the skinny-latte version of VMware's hypervisor) takes a grip and the Service Console (the full-fat version of the hypervisor) is deprecated or removed altogether, it's not really geared up for bulk administration. The following piece of PowerCLI rescans all the ESX hosts in vCenter, which is much more efficient from a scripting perspective:

get-vmhost | get-vmhoststorage -RescanAllHba

Note: The syntax of the above piece of PowerShell is relatively easy to explain. Get-vmhost retrieves the names of all the ESX hosts in the vCenter that you authenticated to, and this is piped to the command get-vmhoststorage to then rescan all the ESX hosts. Get-vmhoststorage supports a -RescanAllHba switch, which does exactly what you think it does. The next thing we need to do is to mount the VMFS volume to the ESX hosts. As we saw in earlier chapters we can use the "Add Storage" wizard to resignature VMFS volumes, and both the Service Console and vCLI contain esxcfg-volume and vicfg-volume commands which allow you to list LUNs/Volumes that contain snapshots or replicated data.


From the vCLI you can use the following command to list LUNs/Volumes that are snapshots or replicas like so:

vicfg-volume.pl -l --server=vc4nj.corp.com --username=corp\administrator --password=vmware --vihost=esx3.corp.com

This will produce an output like so:

The same command at the Service Console produces a very similar output:

With the esxcfg-volume and vicfg-volume commands you have the choice of either mounting the VMFS volume under its original name, or resignaturing it. To mount the volume without a resignature you would use either:

vicfg-volume.pl -m 4b6d8fc2-339977f3-d2cb-001560aa6f7c lefthand-virtualmachines --server=vc4nj.corp.com --username=corp\administrator --password=vmware --vihost=esx3.corp.com

OR:

esxcfg-volume -m 4b6d8fc2-339977f3-d2cb-001560aa6f7c lefthand-virtualmachines

If you would prefer to carry out a resignature of the VMFS volume, using the vCLI or Service Console you would use:

vicfg-volume.pl -r 4b6d8fc2-339977f3-d2cb-001560aa6f7c lefthand-virtualmachines --server=vc4nj.corp.com --username=corp\administrator --password=vmware --vihost=esx3.corp.com

OR:


esxcfg-volume -r 4b6d8fc2-339977f3-d2cb-001560aa6f7c lefthand-virtualmachines

From a PowerCLI perspective there currently exist no cmdlets to carry out a resignature of a VMFS volume. However, it is possible to dive into the SDK. This piece of PowerCLI returns a list of LUNs/Volumes that are replicated:

connect-viserver vc4nj.corp.com -User corp\administrator -Password vmware
$_this = Get-View -Id 'HostStorageSystem-storageSystem-25'
$_this.QueryUnresolvedVmfsVolume()

If you want to carry out the resignature of the VMFS volume, you need the NAA identity of the LUN/volume in question:

connect-viserver vc4nj.corp.com -User corp\administrator -Password vmware
$resolutionSpec = New-Object VMware.Vim.HostUnresolvedVmfsResignatureSpec
$resolutionSpec.extentDevicePath = New-Object System.String[] (1)
$resolutionSpec.extentDevicePath[0] = "/vmfs/devices/disks/naa.6000eb3420a6130d0000000000003497:1"
$_this = Get-View -Id 'HostDatastoreSystem-datastoreSystem-25'
$_this.ResignatureUnresolvedVmfsVolume_Task($resolutionSpec)

If you want to rename the VMFS volume as SRM does after a resignature, you can use this piece of PowerShell. In this case the PowerCLI searches for my resignatured volume called "snap-22e9da7a-lefthand-virtualmachines", and renames it to "lefthand-virtualmachines-copy":

set-datastore -datastore (get-datastore *lefthand-virtualmachines) -name lefthand-virtualmachines-copy

Create an Internal Network for Test

It’s part of my standard configuration on all my ESX hosts that I create a port group called “internal” which is on a dedicated switch with no physical NIC uplinked to it

However, you might wish to more closely emulate the way SRM does its tests of Recovery Plans which create a “testbubble” network.


Creating virtual switches with PowerShell has become much easier since VI3.5; there are new cmdlets in the vSphere4 PowerCLI that facilitate this:

connect-viserver vc4nj.corp.com -User corp\administrator -Password vmware
Foreach ($vmhost in (get-vmhost)) {
    $vs = New-VirtualSwitch -VMHost $vmHost -Name "testBubble-1 vswitch"
    $internal = New-VirtualPortGroup -VirtualSwitch $vs -Name "testBubble-1 group"
}

Add Virtual Machines to the Inventory

1. On one of the ESX hosts
2. Browse the datastore that contains the relevant virtual machines
3. Right-click the VMX file and choose Add to Inventory
4. In the subsequent dialog box select a DataCenter and Folder for holding your virtual machine
5. Select the ESX Host or Cluster
6. Next, select a resource pool for the virtual machine

Note: You should now be able to power on the virtual machine. You will have to manually change its IP address in the guest operating system. It's possible to automate adding a virtual machine to an ESX host (not a cluster) using the command-line ESX host tool called vmware-cmd. Unfortunately, it cannot handle the metadata of vCenter – such as folder location and resource pool. Remember, you would have to repeat these steps for each and every virtual machine that needs to be recovered. Perhaps a better way is to use some PowerCLI.


Once we have the path we can think about registering a VM. There is a New-VM cmdlet which we can use to handle the full registration process, including the ESX host, folder, and resource pool location in the vCenter inventory, like so:

New-VM -VMHost esx3.corp.com -VMFilePath "[lefthand-virtualmachines-copy] ss03/ss03.vmx" -ResourcePool NYC_DR -Location NYC_DR

Remember, this registration process would have to be repeated for each and every VMFS volume, and for each and every VM needing recovery. I'm assuming that every VM in a VMFS volume should be registered with the ESX host.
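If you combine the New-VM cmdlet with the path-finding sketch from earlier, a hedged loop to register everything on the volume might look like this (the host, resource pool and folder names are from my lab):

# Register every .vmx found on the recovered datastore in one pass.
foreach ($vmx in $vmxPaths)
{
    New-VM -VMHost esx3.corp.com -VMFilePath $vmx.DatastoreFullPath -ResourcePool NYC_DR -Location NYC_DR
}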

Fix VMX Files for Network

One of the critical tasks SRM automates is the mapping of the VM to the correct port group on the correct vSwitch. With the introduction of Distributed vSwitches in vSphere4, the scripted process now differs depending on whether you are using Standard vSwitches or the new Distributed vSwitch. If you are using Standard vSwitches you can use nano or vi at the Service Console to edit the .vmx file of each virtual machine and fix the port group used for communication, changing:

ethernet0.networkName = "vlan17"

to be:

ethernet0.networkName = "testBubble-1 group"

This, of course, is administratively burdensome. If you add your virtual machine into vCenter first (our next task) you can then automate the property change (as I mentioned previously) with PowerShell for VMware:

Get-VM | Get-NetworkAdapter | Sort-Object -Property "NetworkName" | Where {'vlan21' -contains $_.NetworkName} | Set-NetworkAdapter -NetworkName "testBubble-1 group" -Confirm:$false


Note: The graphic here shows the script running; with -Confirm:$false there are no confirmation questions, and the script just executes without stalling.

If you are working with the Distributed vSwitch it is a little bit trickier. Currently there is a gap between the quality of the cmdlets for Standard vSwitches and DvSwitches; the above piece of PowerCLI simply won't work for a manually recovered VM that was configured for a Distributed vSwitch in a different vCenter. Here's why: a VM that is configured for a Distributed vSwitch holds unique identifiers for the DvSwitch at the Protected Site, like so:
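By way of illustration, the relevant .vmx entries look something like this (an illustrative reconstruction; the values shown are the lab-specific ones from the PowerCLI example below, so treat them as placeholders):

ethernet0.dvs.switchId = "9e 4d 0f 50 eb 3a 81 af-5f e5 0b 6b 7e 14 64 54"
ethernet0.dvs.portgroupId = "dvportgroup-291"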

When this VM is manually recovered without SRM, because no inventory mapping process is in place the VM will lose its connection to the DvSwitch, as it now resides under a different vCenter with a different switch configuration. Essentially, the VM becomes orphaned from its network configuration. This shows itself as an "invalid backing" for the network adapter.

Using PowerCLI it is possible to reconfigure VMs configured for DvSwitches like so:

$spec = New-Object VMware.Vim.VirtualMachineConfigSpec
$spec.changeVersion = "2010-02-14T20:49:27.603958Z"
$spec.deviceChange = New-Object VMware.Vim.VirtualDeviceConfigSpec[] (2)
$spec.deviceChange[0] = New-Object VMware.Vim.VirtualDeviceConfigSpec
$spec.deviceChange[0].operation = "edit"
$spec.deviceChange[0].device = New-Object VMware.Vim.VirtualVmxnet3
$spec.deviceChange[0].device.key = 4000
$spec.deviceChange[0].device.deviceInfo = New-Object VMware.Vim.Description
$spec.deviceChange[0].device.deviceInfo.label = "Network adapter 1"
$spec.deviceChange[0].device.deviceInfo.summary = "vm.device.VirtualVmxnet3.DistributedVirtualPortBackingInfo.summary"
$spec.deviceChange[0].device.backing = New-Object VMware.Vim.VirtualEthernetCardDistributedVirtualPortBackingInfo
$spec.deviceChange[0].device.backing.port = New-Object VMware.Vim.DistributedVirtualSwitchPortConnection
$spec.deviceChange[0].device.backing.port.switchUuid = "9e 4d 0f 50 eb 3a 81 af-5f e5 0b 6b 7e 14 64 54"
$spec.deviceChange[0].device.backing.port.portgroupKey = "dvportgroup-291"
$spec.deviceChange[0].device.connectable = New-Object VMware.Vim.VirtualDeviceConnectInfo
$spec.deviceChange[0].device.connectable.startConnected = $true
$spec.deviceChange[0].device.connectable.allowGuestControl = $true
$spec.deviceChange[0].device.connectable.connected = $false
$spec.deviceChange[0].device.connectable.status = "untried"
$spec.deviceChange[0].device.controllerKey = 100
$spec.deviceChange[0].device.unitNumber = 7
$spec.deviceChange[0].device.addressType = "assigned"
$spec.deviceChange[0].device.macAddress = "00:50:56:97:6c:37"
$spec.deviceChange[0].device.wakeOnLanEnabled = $true
$spec.deviceChange[1] = New-Object VMware.Vim.VirtualDeviceConfigSpec
$spec.deviceChange[1].operation = "remove"
$spec.deviceChange[1].device = New-Object VMware.Vim.VirtualLsiLogicController
$spec.deviceChange[1].device.key = 1000
$spec.deviceChange[1].device.deviceInfo = New-Object VMware.Vim.Description
$spec.deviceChange[1].device.deviceInfo.label = "SCSI controller 0"
$spec.deviceChange[1].device.deviceInfo.summary = "LSI Logic"
$spec.deviceChange[1].device.controllerKey = 100
$spec.deviceChange[1].device.unitNumber = 3
$spec.deviceChange[1].device.busNumber = 0
$spec.deviceChange[1].device.hotAddRemove = $true
$spec.deviceChange[1].device.sharedBus = "noSharing"
$spec.deviceChange[1].device.scsiCtlrUnitNumber = 7
$_this = Get-View -Id 'VirtualMachine-vm-3662'
$_this.ReconfigVM_Task($spec)

However, this is hardly a friendly way of approaching the problem. You might be interested to know that Luc Dekens of http://lucd.info/ fame has a whole series of articles on handling Distributed vSwitches using PowerCLI. He's even gone so far as to write his own functions (which behave just like regular cmdlets) to address this functionality gap in the PowerCLI. I wouldn't be surprised if there are new cmdlets in the next release of the PowerCLI to address this limitation.

For the moment, unfortunately, it's perhaps simpler to have one vCenter for both sites. In this case the ESX hosts in the DR site would share the same switch configuration as the ESX hosts in the primary site. However, a word of warning: such a configuration runs entirely counter to the structure of VMware SRM, which demands two different vCenters, one each for the Protected and Recovery Sites. So if, after cooking up your manual scripted solution, you later decided to adopt SRM, you would currently have a great deal of pruning and grafting to do to meet the SRM requirements.

I've given Luc remote access to my lab environment (because he lacks the setup required to do this kind of work), and it's my hope that Luc will work on his own functions in the interim to help us handle this kind of scenario. So keep your eyes peeled on my website and Luc's, just in case we (really he) come up with the goods.

Specifically, Luc has been working on DvSwitch equivalents of the Standard vSwitch (SvSwitch) cmdlets Get-NetworkAdapter and Set-NetworkAdapter. Luc's Get-dvSwNetworkAdapter and Set-dvSwNetworkAdapter functions are much easier to use. To use his functions, create or open your preferred PowerShell profile. If you don't know what PowerShell profiles are or how to create them, this Microsoft webpage is a good starting point:

http://msdn.microsoft.com/en-us/library/bb613488%28VS.85%29.aspx

Next, visit Luc's website at the location below, and then copy and paste his functions into the profile.

http://lucd.info/?p=1871

Using these functions you can run commands like the example below, which sets every VM whose name begins with ss0 to use VLAN55:

Get-dvSwNetworkAdapter (Get-VM ss0*) | Set-dvSwNetworkAdapter -NetworkName "vlan55" -StartConnected:$true

I’m sure Luc will carry on improving and extending the features of his functions – and I would heartily recommend his series of articles on PowerCLI and DvSwitches.

Conclusions

As you can see, the manual process is very labour intensive, which is to be expected given the use of the word "manual". You might have got the impression that this issue can be fixed with some whizz-bang PowerShell scripts. You might even have thought: why do I need SRM at all if I have these PowerShell scripts? However, it's not as simple as that, for two main reasons. Firstly, there is no real support for this home-brewed DR. Secondly, you can work all you like testing your scripts, but then your environment will change, and those scripts will go out of date and need endless re-engineering and re-testing.

In fact, the real reason I wanted to write this chapter was to show how painful the manual process is, and to give you a real feel for the true benefits of SRM. I know there are some big corporates who have decided to go down this route, for a number of reasons. Firstly, they have the time, the people and the resources to manage it, and to do it well. Secondly, they were probably doing this manually even before SRM came on the scene. At first glance their manual process probably looked more sophisticated than the SRM 1.0 product, and they may feel the same way about SRM 4.0. Personally, I think that as SRM evolves and improves it will become increasingly harder to justify a home-brewed configuration. I suspect that tipping point will come with SRM 5.0.


The End – Final Conclusions

Well, this is the end of the book, and I would like to use this last part to make some final conclusions and observations about VMware Site Recovery Manager and VMware generally.

I first started working with VMware products in late 2003. In fact, it wasn't until 2004 that I became seriously involved with VMware ESX and VirtualCenter. So I see that we are all on a huge learning curve, because even our so-called experts, gurus and evangelists are relatively new to virtualization. But, as ever in our industry, there are some extremely sharp people out in the field who reacted brilliantly to the seismic shift I saw coming when I watched my first VMotion demonstration.

There's been a lot of talk about how hypervisors are becoming, or will become, a commodity. I still think we are a little way from that, as VMware licensing shows: there is still a premium to be charged on the virtualization layer. But things are changing, and VMware's competitors will catch up, albeit not as quickly as people sometimes think. The other virtualization vendors have to; otherwise there would be only one player in the market. That would be bad for all concerned, including VMware. Companies thrive when they have a market to either create or defend. Since writing this, the ESX3i product has become free, so you could now argue the hypervisor is a commodity. However, that means management is where the money has moved to, and SRM is firmly in that camp.

But I see another, equally seismic shift, and that is a revolution in our management tools, because quite simply the old management tools don't cut the mustard. They aren't VM aware. That's a bit of a pun I've been making in my training classes for some time. It's often a criticism ("you cannot use this tool because it isn't virtual machine aware") and, as you can see, it's also a bit of a pun on the company name. It's a shame vmaware.com is already registered; it might have been the new name for RTFM!

VMware is creating these VM-aware products (Lab Manager, Stage Manager, Site Recovery Manager, LifeCycle Manager, and View) now, not in some far-flung and distant roadmap that could change 180 days out from a product release. So if you are a VMware shop, don't wait around: get on and play with these technologies now, as I have done, because they are "the next big thing" you have always been looking out for in your career.

As the years roll by, expect to see the R in SRM removed; with the advent of cloud computing, VMware Site Manager will be as much a cloud management tool as it is a disaster recovery tool. In the future I can imagine the technology that currently sits behind SRM being used to move virtual machines from an internal private cloud to an external public cloud, and perhaps even from one cloud provider to another.

[root@rtfm]# THE END


INDEX

Architecture, 128
Array Manager, 45, 58, 90, 125, 129, 140, 157, 169, 170, 171, 172, 173, 174, 176, 178, 179, 181, 182, 183, 184, 185, 186, 187, 188, 189, 191, 192, 201, 207, 214, 234, 269, 270, 271, 272, 275, 280, 281, 284, 286, 287, 314, 315, 316, 318, 323, 325, 349, 350, 356, 361
Caution, 76, 152
Celerra, 10, 23, 24, 25, 26, 27, 28, 30, 31, 33, 34, 35, 36, 39, 40, 41, 42, 44, 45, 48, 170, 171, 172, 173, 175, 229, 277, 332, 341, 342, 356, 368
Clariion, 10, 17, 27, 47, 48, 49, 50, 52, 53, 54, 55, 56, 57, 58, 169, 175, 176, 178, 179, 180, 181, 182, 199, 216, 217, 225, 230, 270, 272, 278, 283, 300, 333, 334, 343, 356, 365, 368, 370
Configuring
    Array Managers, 169
    Lefthand Networks VSA, 73
    Priority Orders, 242
    Shutdown of VMs, 239
    Software iSCSI, 80
    SRM Administrators, 303
Creating
    Alarm - Script, 294
    Alarms - Email, 296
    Alarms - SNMP, 295
    Basic Recovery Plan, 210
    New Networks, 270
    New Virtual Machines, 270
    Protection Groups, 199
Diagram, 19, 20, 22, 128, 215, 330, 338
Distributed vSwitches, 198, 266, 268, 387
Error, 141, 224, 225, 227, 228, 229, 267
Failback, 134
    Clean-up, 355
Failover
    Planned Failover, 329
    Unplanned, 364
Failures, 160, 294, 367
File Level Consistency, 16
Gotchas, 140
Important, 76, 175, 187, 193, 213, 259, 370
Installing, 16, 114, 127, 133, 152, 157, 159, 165, 170, 176, 320, 321
Inventory Mappings, 161, 193, 194, 196, 197, 205, 206, 260, 261, 262, 266, 269, 270, 273, 314, 316, 318, 351
Licensing, 72, 142
NetApp, 1, 8, 10, 14, 16, 19, 22, 24, 93, 94, 95, 98, 99, 101, 102, 103, 107, 108, 109, 110, 111, 114, 115, 116, 117, 118, 119, 120, 122, 124, 125, 135, 169, 175, 188, 189, 190, 191, 192, 200, 215, 231, 232, 277, 335, 336, 346, 348, 350, 359, 360, 364, 372
Parallel Host Start-Up Order, 244
Placeholder, 218, 229, 256, 293, 338, 355
PowerCLI, 216, 247, 248, 250, 268, 269, 381, 382, 383, 385, 386, 388
Powershell, 246, 248
    Add Virtual Machines, 386
    Fix VM VLan Configuration, 387
    Rescan HBAs, 383
    Virtual Switches, 385
RDMs, 45, 58, 91, 120, 125, 142, 175, 187, 193, 199, 238, 279, 280, 281, 343, 344, 352
Recovery Plans, 12, 56, 85, 141, 204, 210, 214, 221, 222, 223, 229, 236, 237, 241, 243, 266, 269, 274, 277, 283, 285, 286, 287, 293, 294, 299, 301, 303, 307, 308, 309, 314, 318, 323, 325, 353
Renaming
    DataCenters (Protection Site), 266
    Datacenters (Recovery), 268
    Resource Pools (Protection), 266
    Resources Pools (Recovery), 238, 268
    Virtual Machines, 266
    Virtual Switches (Protection), 266
    Virtual Switches (Recovery), 268
    VirtualCenter Objects, 265
Repair Array Manager Button, 286, 287
Shared Site, 318, 323
SRM
    Adding Commands, 246
    Alarm - Script, 294
    Alarms - Email, 296
    Bidirectional Configuration, 313
    Changes at the Protection Site, 262, 268
    Changes at the Recovery Site, 268
    Creating Protection Groups, 199
    Custom Messages, 244
    Customized VM Mappings, 260
    Database, 143
    Failback Clean-up, 355
    Failback, after unplanned failover, 367
    Failure to Protect VM, 205
    Hardware Requirements, 134
    Installation, 152
    IP Address Reconfiguration, 251
    Licensing, 142
    Log Files, 310
    Multiple Protection Groups and Recovery Plans, 283
    Pairing, 164
    Permissions and Access Control, 301
    Permissions Limitations, 308
    Planned Failover, 329
    Plug-in, 159
    RDMs, 279
    Recovery Plan Events, 216
    Recovery Plan History, 300
    Release Notes, 140
    Repair Array Managers, 286
    Reports, 298
    Service Failure, 160, 367
    Site Recovery Adapter, 157
    SNMP, 295
    Software Requirements, 130
    Unplanned Failover, 364
Storage
    Multiple VMFS Volumes, 277
    Principles & Caveats, 17
    Replication Scenarios, 232
    Storage VMotion, 275
    Vendor Guides, 22
URL, 22, 23, 239
URLs, 10, 13, 22, 23, 24, 26, 61, 94, 114, 116, 131, 133, 206, 215, 239, 247, 248, 265, 296, 331, 382, 383
vmware-dr.xml, 139, 168, 218, 224, 234, 256, 258, 260, 293
Warning, 151, 155, 166, 167, 171, 178, 183, 188, 204, 219, 224, 250, 252, 260, 296, 298, 303, 318, 322, 351, 352