
  • EMC / CLARiiON Troubleshooting Guide

    2nd Edition

    EMC Global Services - Problem Resolution & Escalation Management - CLARiiON


Description

This is the second edition of the CLARiiON Troubleshooting Manual, first introduced in January 2004. The original manual had an accompanying training course that still has relevant material. This document introduces new and updated information on topics related to the CLARiiON disk storage product. Please note that not all information will be accessible or available to all readers of this document.

Authors

Wayne Brain - Consulting Engineer [email protected]
David Davis - Technical Support Engineer [email protected]
Joseph Primm - Consulting Engineer [email protected]
Roy Carter - Corporate Systems Engineer [email protected]
Other various engineering sources - our thanks to everyone's input in putting this document together.

Intended Audience

EMC and Vendor Technical Support Professionals
CLARiiON trained CS Specialists [i.e., RxS or LRS]
Other field personnel with management approval

Objectives

Build a solid understanding of specific topics related to CLARiiON.

Prerequisites

Good knowledge of Fibre Channel and an understanding of basic CLARiiON operations and functionality. The following are recommended prior to use of this manual:
CLARiiON Core Curriculum (e-Learning)
CLARiiON Core Curriculum (workshop)
Field experience providing knowledge of the theory of operation of the CLARiiON CX Series hardware, and implementation of a CLARiiON using Navisphere 6.x

Content

The course will cover the topic areas noted below.
Section 1 - Layered Applications
Section 2 - NDU Basic Operations and Troubleshooting
Section 3 - Backend Architecture
Section 4 - Troubleshooting & Tools
Section 5 - General FLARE

Revision History

Date      Approved By     Rev   Description
01/15/04  Joseph Primm    A02   CL_Troubleshooting_1stEdition (original document)
02/06/07  Joseph Primm    B00   CL_Troubleshooting_2ndEdition (initial draft)
02/07/07  Joseph Primm    B01   Formatting and statement corrections
02/28/07  Joseph Primm    B02   Corrections, added bookmarks, major changes to section 5
08/30/07  Joseph Primm    B03   Added CX3 port numbering, page 218

Copyright 2007 EMC Corporation. All rights reserved. Revision B03. EMC Confidential - Internal Use Only.


Section 1 - Layered Applications

    General Terms
        SnapView Snapshot Terms
        SnapView Clone Terms
        MirrorView/S and MirrorView/A Terms
        SAN Copy Terms
    SnapView Snapshots
        Source LUN
        Snapshot LUN
        Reserved LU
        Step-by-step snapshots overview - all platforms
    SnapView Clones
        Source LUN
        Clone LUN (Fractured)
        Clone LUN (Unfractured)
        CPL
        Step-by-step clone overview - all platforms
        Reverse synchronization - all platforms
    MirrorView/S
        Primary LUN
        Secondary LUN
        WIL
        How MirrorView/S handles failures
            Access to the SP fails
            Primary image fails
            Promoting a secondary image to a primary image
            Running MirrorView/S on a VMware ESX Server
            Recovering by promoting a secondary image
            Restoring the original mirror configuration after recovery of a failed primary image
            Recovering without promoting a secondary image
            Failure of the secondary image
            Promoting a secondary image when there is no failure
        Summary of MirrorView/S failures
        Recovering from serious errors
        How consistency groups handle failures
            Access to the SP fails
            Primary storage system fails
            Recovering by promoting a secondary consistency group
            Normal promotion
            Force promote
            Local only promote
            Recovery policy after promoting
    MirrorView/A
        Primary LUN
        Secondary LUN
        Reserved LU (Primary)
        Reserved LU (Secondary)
        How MirrorView/A handles failures
            Access to the primary SP fails
            Primary image fails
            Promoting a secondary image to a primary image
            Running MirrorView/A on a VMware ESX Server
            Recovering by promoting a secondary image
            Restoring the original mirror configuration after recovery of a failed primary image
            Recovering without promoting a secondary image
            Failure of the secondary image
            Promoting a secondary image when there is no failure
        Summary of MirrorView/A failures
        Recovering from serious errors
        How consistency groups handle failures
            Access to the SP fails
            Primary storage system fails
            Recovering by promoting a secondary consistency group
            Normal promotion
            Force promote
            Local only promote
        Failure of the secondary consistency group
    SAN Copy
        Destination LUN (Full Copy)
        Source LUN (Incremental Copy)


        Destination LUN (Incremental Copy)
        Reserved LU
    SAN Copy (ISC)
        Creating an Incremental SAN Copy Session
        Marking/Unmarking the Incremental SAN Copy Session
        Starting the Incremental SAN Copy Session
        Viewing/Modifying an Incremental SAN Copy Session
        Destroying an Incremental SAN Copy Session
        Error Cases
            Incremental SAN Copy Session Failure
            Incremental SAN Copy Session Destination Failure
            Out of SnapCache for Incremental SAN Copy Session
            SnapCache failure
        Restrictions
        Issues
    Case Studies
        MirrorView/A - Target array upgraded from CX500 to CX700, MV/A has stopped working
        MirrorView/S - SPB MirrorView initiator is missing after switch cable change
        SAN Copy - SAN Copy failure
        SAN Copy - Host I/O failed with MV and host I/O running with 200ms and SPA rebooted
        SAN Copy - LUN 23 corrupted
        SnapView - Snap session failure during a trespass
        SnapView - Unable to delete LUNs that were part of a mirror
        SnapView - SP Bugcheck 0x000000d1, 0x00000002, 0x00000000, 0x00000000
        SnapView - Bugcheck 0xe111805f (0x81ff6c48, 0x00000000, 0x00000000, 0x000003cd)

Section 2 - NDU Basic Operations and Troubleshooting

    General Theory
        NDU Process
    Sample Cases
        Dependency Check Failed
        PSM Access Failed
        Cache Disable Failed
        Check Script Failed
        Setup Script Failed
        Quiesce Failed
        Deactivate Hang
        Panic During Activate
        Reboot Failed
        Registry Flush Failed
        Commit Failed
        Post Conversion Bundle Inconsistency in Release 14
        R12/R13 to R16/R17 stack size problem
        Initial Cleanup Failed
        iSCSIPortx IP Configuration Restoration and Device Discovery
        QLogic r4/r3 issue
        One or both SPs in reboot cycle
    Tips and Tricks
        SPCollects
        Event Logs
        Ktrace
        NDU Output Files
        Force degraded mode

Section 3 - Backend Architecture

    General Theory
    CLARiiON Backend Arbitrated Loop
    Backend data flow
    How does this relate to the backend of a CLARiiON Storage System?
    Data flow through each enclosure type
        FC-series data flow
        CX data flow
        CX-series data flow with DAE2
    ATA (Advanced Technology Attachment) Disk Enclosures
        ATA Disk Ownership
    Ultrapoint (Stiletto) Disk Array Enclosure DAE2P/DAE3P
        Fibre Channel Data Path
        How to troubleshoot an Ultrapoint backend bus using the counters


        Descriptions of the registers returned in the lccgetstats output
        How to interpret the output of the Ultrapoint counters
    Other options for backend isolation
        SP Event Logs
        RLS Monitor Logs

Section 4 - Troubleshooting & Tools

    CAP
    DRU
    TRiiAGE
    FLARE Centric Log Error Reporting Information
    SP State
    Advanced LUstat
    Ktcons lustat
    Ktcons Vpstat
    FCOScan
    Displaying Coherency Error Count
    RAID Group Error Summary Information
    DISKS SENSE DATA from SP*_System.evt files
    FBI Error Information
    YUKON Log Analysis
    SPCollect Information
    SPQ

Section 5 - General Troubleshooting and Information

    Private Space Reference
    SP Will Not Boot
    First Steps To Try
    CX Boot Failure Modes / Establishing PPP Connection to SP
    LAN Service Port CX3 / EMCRemote Password R24 / SP Fault LED Blink Rates
    Summary of Boot Process
    CX200/400/600 Powerup
    CX300/500/700 Powerup
    CX3-20/CX3-40/CX3-80 Powerup
    Data Sector Protection
    How do these bytes work?
    What can cause uncorrectable sectors?
    Power Loss Scenario
    Pro Active Data Integrity
    Dual Active Storage Processors
    Stripe Access Management
    How do we check the integrity of the (4) 2-byte sectors?
    How to approach & resolve uncorrectable sector issues
        CLARiiON stand alone storage environment
        New tool BRT
        CELERRA storage environment
        CDL storage environment
    General Array and Host Attach Related Information
    Binding / Assignment / Initial Assignment / Auto-Assignment
    Failover Feature (relative to Auto-Assign and not trespassing) / Trespass / Auto-Trespass
    Storage Groups / Setting Up Storage Groups (SGs) / Special (predefined) Storage Groups
    Default Storage Group / Defining Initiators / Heterogeneous Hosts
    Initiatortype
    Arraycommpath / Failovermode
    Logical Unit Serial Number Reporting
    emc99467 - Parameter settings

APPENDIX

    Flare Revision Decoder
    CX/CX3 Bus numbering charts
    CX3-Series Array Port Numbering


Section 1 - Layered Applications

General Terms

Source LUN - This LUN is often considered the production LUN. Replicas are taken off of source LUNs.

SP - Most CLARiiON storage systems have two Storage Processors (SPs) for high availability. LUNs are owned by one SP. I/O is performed by the SP that owns a LUN (including replication I/O).

Trespass - The owning SP of a LUN can be changed to the peer SP via a trespass event. Trespass events are initiated via server path failover software or through a Navisphere administrative command.

PSM - On the first five drives of every CLARiiON lives a storage system database maintained by the persistent storage manager (PSM) component. This database contains the replication features' configuration information (among other things). The PSM database is maintained on a triple mirror; therefore, this document does not need to cover failure cases where the PSM is totally inaccessible, as the storage system will not be able to function in that case (catastrophic failure). It is possible that transient I/O failures from PSM can occur; however, due to the infrequent nature of these failures and the complexity of the error handling, these PSM failures are left outside the scope of this document.

LCC/BCC - There are two link controller cards for each DAE (disk array enclosure). LCCs are used for FC enclosures and BCCs are used for ATA enclosures. LCCs and BCCs can fail due to the cards being pulled, the cables connecting the cards being pulled, or failures of the HW or SW running on the cards.

Disk Failure - Failure of two disks in a RAID 1, 1/0, 3, or 5 RAID group, or one disk in a RAID 0 RAID group, will cause any LUs on the RAID group to be inaccessible. A RAID group can fail due to manually pulled disks or physical disk failures.

Cache Dirty - A LU can be marked cache dirty if modified data that was maintained in both SPs' memory could not be flushed out to the physical drives on which the LU lives. Cache dirty LUs are inaccessible until a procedure is invoked to clear the cache dirty state.

Bad block - Every CLARiiON storage system maintains block level checksums on disk. When a block is read, the checksum is recalculated and compared with the saved checksum. If the checksums do not match, a read failure occurs. Overwriting the block will repair the bad block.

SnapView Snapshot Terms

Snap Session - A point in time virtual representation of a source LUN. A source LUN can have up to 8 snap sessions associated with it. Snap sessions incur copy on first write processing in order to maintain the data point in time of the source LUN at the time the snap session was started. When a snap session is stopped, the data point in time is lost and the resources associated with the snap session are freed back to the system for use by new sessions as needed.

Snap Source LUN - A source LUN that has one or more snap sessions started on it.

Snapshot LUN - One of up to 8 virtual LUNs associated with a snap source LUN that can have a snap session activated upon it. The snapshot LUN immediately appears to contain the point in time data of a snap session the instant a snap session is activated upon it via Navisphere or an admsnap command.

Reserved LU - A private LU used to store the copy on first write data and associated map pointers in order to preserve up to 8 points in time for up to 8 snap sessions on a snap source LUN. A reserved LU is assigned to a source LUN the first time a session is started on the LUN. More reserved LUs will be associated with the snap source LUN as needed by the storage system. In addition to maintaining point in time data, reserved LUs also maintain tracking and transfer information for incremental SAN Copy and MirrorView/A.
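The copy on first write processing referenced above can be illustrated with a short sketch. The following Python fragment is a conceptual model only, assuming fixed-size chunks and an in-memory map (the names are invented for the example, not CLARiiON internals); it shows why a server write to the source LUN triggers a read of the original data and a write to a reserved LU before the new data is accepted.

    # Conceptual sketch of SnapView copy on first write (CoFW).
    # CHUNK, source, reserved_lu and cofw_map are illustrative names only.
    CHUNK = 64 * 1024  # assumed chunk granularity for the example

    source = {}       # chunk index -> current data on the snap source LUN
    reserved_lu = {}  # chunk index -> original (point in time) data
    cofw_map = set()  # chunk indexes already preserved for the session

    def server_write(chunk, data):
        """Write from the host to the snap source LUN while a session is active."""
        if chunk not in cofw_map:
            # First write to this chunk since the session started:
            # read the original data and save it to the reserved LU.
            reserved_lu[chunk] = source.get(chunk, b"\x00" * CHUNK)
            cofw_map.add(chunk)
        source[chunk] = data  # the server write then proceeds normally

    def snapshot_read(chunk):
        """Read of the point in time image via the snapshot LUN."""
        # Preserved chunks come from the reserved LU; untouched chunks
        # are still read from the source LUN itself.
        if chunk in cofw_map:
            return reserved_lu[chunk]
        return source.get(chunk, b"\x00" * CHUNK)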


SnapView Clone Terms

Clone Group - A clone group is a construct for associating clones with a source LUN.

Clone Source LUN - A source LUN that has a clone group associated with it. It can have zero or more SnapView clones associated with it.

Clone LUN - One of up to 8 LUNs associated with a clone source LUN. Each clone LUN is the exact size of the associated clone source LUN. Clone LUNs are added to and removed from a clone group.

Clone image condition - The condition of a clone LUN provides information about the status of updates for the clone.

Clone image state - The clone image states reflect the contents of the data contained in the clone with respect to the clone source LUN.

Fractured and Unfractured Clone LUN - A clone can be either fractured or unfractured, and it can be available or unavailable for I/O. A clone LUN that is unfractured is never available for I/O. A fractured clone LUN is only available for I/O if the clone was not in the synchronizing or reverse synchronizing state when the administrative fracture occurred. A clone can be fractured via a Navisphere administrative command or under certain failure scenarios.

Protected and Unprotected Clone reverse sync - A clone can optionally be protected or unprotected during a reverse synchronization. If protected, the clone will remain fractured to allow for subsequent reverse synchronizations. If unprotected, the clone chosen for a reverse synchronization will remain mirrored with the clone source LUN. When the clone reverse synchronization is complete, the unprotected clone will be consistent with the clone source LUN.

CPL - The clone private LU (CPL) contains bitmaps which describe changed regions to provide incremental synchronizations for clones. There is one CPL for each SP in the storage system.

MirrorView/S and MirrorView/A Terms

Primary LUN - A source LUN whose data contents are replicated on a remote storage system for the purpose of disaster recovery. Each primary LUN can have one or more secondary LUNs associated with it (MirrorView/S supports two secondary LUNs per primary and MirrorView/A supports one).

Secondary LUN - A LUN that contains a data mirror (replica) of the primary LUN. This LUN must reside on a different CLARiiON storage system than the primary LUN.

WIL - The Write Intent Log (WIL) contains bitmaps which describe changed regions to provide incremental synchronizations for MirrorView/S. There is one WIL for each SP in the storage system.

Secondary image condition - The condition of a secondary LUN provides additional information about the status of mirror updates to the secondary.

Secondary image state - The secondary image states reflect the contents of the data contained in the secondary LUN with respect to the primary LUN.

Consistency group - A set of mirrors that are managed as a single entity and whose secondary images remain in a write order consistent and recoverable state (except when synchronizing) with respect to their primary images and each other.

Fracture - A condition in which I/O is not mirrored to the secondary image (mirroring also stops when the secondary image condition is in the waiting on administrative action state). A fracture can be caused via administrative command or under certain failure scenarios (administratively fractured), or when the system determines that the secondary image is unreachable (system fractured).

Auto recovery - A property of a mirror which causes the storage system to automatically start a synchronization operation as soon as a system fractured secondary image is determined to be reachable.

Manual recovery - A property of a mirror which causes the storage system to wait for a synchronization request from an administrator when a system fractured secondary image is determined to be reachable (the opposite of auto recovery).

Promote - The operation by which the administrator changes an image's or group's role from secondary to primary. As part of this operation, the previous primary image becomes a secondary image.
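The WIL and CPL both implement the same basic idea: a bitmap of changed regions that limits a later synchronization to only the regions written since the images diverged. The sketch below is a conceptual Python model (the region size, class name and structures are assumptions for illustration, not FLARE internals) showing why a fractured mirror or clone can be resynchronized incrementally rather than with a full copy.

    # Conceptual sketch of a change-tracking bitmap (WIL / CPL style).
    # REGION_SIZE and the class name are illustrative assumptions.
    REGION_SIZE = 128 * 1024

    class ChangeLog:
        def __init__(self, lun_size):
            regions = (lun_size + REGION_SIZE - 1) // REGION_SIZE
            self.dirty = [False] * regions   # one flag per region

        def record_write(self, offset, length):
            # Called for every write while the images are fractured/diverged.
            first = offset // REGION_SIZE
            last = (offset + length - 1) // REGION_SIZE
            for r in range(first, last + 1):
                self.dirty[r] = True

        def regions_to_copy(self):
            # An incremental sync copies only the dirty regions.
            return [r for r, d in enumerate(self.dirty) if d]

    # Usage: a 1 GB LUN with two small writes needs only two regions copied.
    log = ChangeLog(1 << 30)
    log.record_write(offset=4096, length=8192)
    log.record_write(offset=300 * 1024 * 1024, length=512)
    print(log.regions_to_copy())   # -> [0, 2400]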


SAN Copy Terms

SAN Copy Session - A SAN Copy session describes the copy operation. The session contains information about the source LUN and all destination LUNs (SAN Copy can copy a source LUN to multiple destination LUNs in one session). A session can be for a full copy or for an incremental copy. Incremental SAN Copy sessions can be in the marked or unmarked state. Marked sessions protect the point in time of the data for copying as of when the mark Navisphere command was issued. Incremental SAN Copy sessions require reserved LUs.

SAN Copy Storage System - This is the storage system where the SAN Copy session resides. The SAN Copy processing occurs on the SAN Copy storage system. The SAN Copy storage system can contain a source LUN and/or one or more destination LUN(s) for any given SAN Copy session.

Target Storage System - This is the storage system where the SAN Copy session does not reside; it can contain a source LUN or one or more destination LUN(s) for any given SAN Copy session. The SAN Copy processing does not occur on the target storage system.

Destination LUN - A destination LUN is the recipient LUN of a data transfer. Source LUNs are copied to destination LUNs. All destination LUNs must be the exact same size as or larger than the source LUN. SAN Copy can copy a source LUN to multiple destination LUNs.

SnapView Snapshots

There are three user visible storage system objects that are used by the SnapView snapshot capability: snap source LUN(s), snapshot LUN(s), and reserved LU(s). There is a table for each object and a number of events that pertain to each object. The Result column describes the outcome when the event occurs while the action is in progress. For the purposes of this document, only persistent snap session behavior is described (non-persistent sessions will terminate in all events described).

Source LUN

Snap Source LUN

Action: Server write to a snap source LUN. The storage system will need to perform a copy on first write to preserve the point in time data of an existing snap session on the snap source LUN. This entails a read from the snap source LUN and writes to reserved LU(s) before the server write to the source LUN can proceed.

Event: The storage system generated read from the snap source LUN fails due to a bad block, LCC/BCC failure, cache dirty LUN, etc.

Result: The server write request succeeds. All snap sessions for which a copy on first write was required in order to maintain the point in time data for that write will stop. If the last session associated with the snap source LUN is stopped, the associated reserved LUs will be freed back to the pool.


Action: Server write to a snap source LUN. The storage system will need to perform a copy on first write to preserve the point in time data of an existing snap session on the snap source LUN. This entails a read from the snap source LUN and writes to reserved LU(s) before the server write to the source LUN can proceed.

Event: After the copy on first write processing is completed, the write to the snap source LUN fails due to a LCC/BCC failure or some storage system software problem that happens after the copy on first write processing (if required), and while processing the write to the snap source LUN.

Result: The server write request fails, which may trigger server-based path failover software to trespass the snap source LUN (see the description of the trespass action below). All snap sessions associated with the snap source LUN are maintained.

Action: Server read from a snap source LUN.

Event: The read from the snap source LUN fails due to a bad block.

Result: The server read request fails. All snap sessions associated with the snap source LUN are maintained.

Action: The SP that owns the snap source LUN is shut down.

Event: Active I/O to the snap source LUN. The SP can be shut down due to a Navisphere command to reboot (includes NDU), the SP panics due to a SW or HW malfunction, or the SP is physically pulled.

Result: All snap sessions remain intact. If the snap source LUN is trespassed, I/O can resume on the peer SP.

Action: Snap source LUN is trespassed.

Event: Active I/O to the snap source LUN. Trespass of the snap source LUN can be triggered as a result of an NDU, a Navisphere trespass command, or failover software explicit or auto trespass when a path from the server to the snap source LUN is determined to be bad.

Result: All snap sessions remain intact. I/O can resume on the peer SP.

Snapshot LUN

Action: Server I/O to a snapshot LUN. Snapshot LUNs are virtual LUNs. A single snap session may be activated on a snapshot LUN at any point in time. The point in time data represented by an activated snap session on a snapshot LUN is made up using data from the snap source LUN and the reserved LU(s) associated with the snap source LUN.

Event: The read from the snap source LUN fails due to a bad block, LCC/BCC failure, cache dirty LUN, etc., or a write to a reserved LU fails (all writes to snapshot LUNs are always written to associated reserved LUs).

Result: The server I/O to the snapshot LUN fails. All snap sessions that require reading from the snap source LUN in order to maintain the point in time data will stop. If the last session associated with the snap source LUN is stopped, the associated reserved LUs will be freed back to the pool.

Action: The SP that owns the snapshot LUN is shut down (note that the snapshot LUN SP owner will always be the same as the snap source LUN owner).

Event: Active I/O to the snapshot LUN. The SP can be shut down due to a Navisphere command to reboot (includes NDU), the SP panics due to a SW or HW malfunction, or the SP is physically pulled while active.

Result: All snap sessions remain intact. If the snap source LUN is trespassed (which will trespass all associated snapshot LUNs), I/O can resume on the peer SP.

Action: Snapshot LUN is trespassed.

Event: Active I/O to the snap source LUN and snapshot LUN. Trespass of the snapshot LUN can happen due to failover software explicit or auto trespass when a path from the server to the snapshot LUN is determined to be bad. Snapshot LUNs cannot be explicitly trespassed via Navisphere.

Result: All snap sessions remain intact. I/O can resume on the peer SP. The snap source LUN associated with the snapshot LUN (and all other snapshot LUNs associated with the snap source LUN) will be trespassed. It is possible for a trespass storm (or trespass ping pong) to occur if a path to the snapshot LUN is bad to one SP and the path for the associated snap source LUN, or another snapshot LUN also associated with the same snap source LUN, is bad to the peer SP. Server path failover software on one or more servers may try to trespass the LUN only to have another server's path failover software trespass the LUN back to where it was before, causing the LUN ownership to go back and forth and resulting in severely degraded performance.
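The trespass ping pong described above is easy to see in a toy simulation. The following Python sketch is purely illustrative (the host names, path states, and the "trespass on every failed I/O" policy are assumptions, not actual failover software behavior): it models two servers that each have a failed path to a different SP and blindly trespass the LUN toward the SP they can still reach.

    # Toy simulation of a trespass ping pong between two servers.
    lun_owner = "SPA"

    # Each server has lost one path: server1 can only reach SPA,
    # server2 can only reach SPB.
    reachable = {"server1": {"SPA"}, "server2": {"SPB"}}

    def do_io(server):
        """If the owning SP is unreachable from this server, trespass the LUN."""
        global lun_owner
        if lun_owner not in reachable[server]:
            lun_owner = "SPB" if lun_owner == "SPA" else "SPA"
            print(f"{server}: trespassed LUN to {lun_owner}")

    # Both servers keep issuing I/O; ownership bounces back and forth.
    for _ in range(3):
        do_io("server1")   # wants the LUN on SPA
        do_io("server2")   # wants the LUN on SPB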

Reserved LU

Action: Server write request to a snap source LUN. The storage system may need to perform a copy on first write to preserve the point in time data of an existing snap session on the snap source LUN. This entails I/Os to reserved LU(s) before the server write to a snap source LUN or snapshot LUN can proceed.

Event: An I/O to a reserved LU fails due to a LCC/BCC failure, cache dirty LUN, etc. This includes a read failure from the reserved LU due to a bad block. This also includes running out of reserved LU space (no space left in any assigned reserved LUs and no more free reserved LUs in the SP pool).

Result: The server write request succeeds. All snap sessions associated with the snap source LUN are stopped. Allocated reserved LUs are freed back to the reserved LU pool.

Action: Server I/O to a snapshot LUN. The array, in processing a server read or write to a snapshot LUN, will entail I/Os to associated reserved LU(s). A read from a snapshot LUN will never cause a write to any associated reserved LU, but will cause one or more reads.

Event: A read or write to a reserved LU fails due to a LCC/BCC failure, cache dirty LUN, etc. This includes a read failure from a reserved LU due to a bad block. This also includes running out of reserved LU space (no space left in any assigned reserved LUs and no more free reserved LUs in the SP pool).

Result: The server I/O request fails. All snap sessions that require I/O to the reserved LU that failed (includes running out of space) will stop. If the last session associated with the snap source LUN is stopped, the associated reserved LUs will be freed back to the pool.


Action: A rollback operation has started. The rollback process entails reads from associated reserved LU(s) and writes to the snap source LUN. A server I/O will cause a region of the snap source to be rolled back on demand in order to complete the server request.

Event: An I/O to a reserved LU fails due to a LCC/BCC failure, cache dirty LUN, etc. This includes a read failure from a reserved LU due to a bad block. Server I/O may be happening while the rollback is processing.

Result: Any server I/O request to the snap source LUN proceeds. If the server request was a read which required the data to be returned from the reserved LU (not the source LUN) and the region to be read failed due to a bad block, the server read request fails. The rollback process continues. Blocks that were bad in any associated reserved LU(s) will have the appropriate blocks marked bad on the snap source LUN (even though the disk region on the snap source LUN is good) to ensure the integrity of the rolled back data.

Action: The SP that owns the snap source LUN or snapshot LUN is shut down (note that all of the reserved LUs' SP owners will always be the same as the snap source LUN owner).

Event: Active I/O to a snap source LUN or a snapshot LUN which generates I/Os to reserved LUs associated with the source LUN. The SP can be shut down due to a Navisphere command to reboot, the SP panics due to a SW or HW malfunction, or the SP is physically pulled.

Result: All snap sessions remain intact. If the snap source LUN and all associated snapshot LUNs are trespassed, any active server I/Os to the snap source LUN or associated snapshot LUN(s) and any associated I/Os to the reserved LUs can resume on the peer SP. Any rollback operations that were in progress are automatically continued on the peer SP.

Action: Snap source LUN or snapshot LUN is trespassed (which will cause associated reserved LUs to trespass).

Event: Active I/O to a snap source LUN or a snapshot LUN which generates I/Os to reserved LUs associated with the source LUN. Trespass of the snap source LUN or any associated snapshot LUN can happen due to an NDU, a Navisphere trespass command, or failover software explicit or auto trespass when a path from the server to the snap source LUN is determined bad.

Result: All snap sessions remain intact. Active server I/O and associated reserved LU I/O can resume on the peer SP. The snap source LUN associated with the snapshot LUN (and all other snapshot LUNs also associated with the snap source LUN) will be trespassed along with any associated reserved LUs. Any rollback operations that were in progress are automatically continued on the peer SP.
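The rollback behavior in the table above (continue the rollback, but propagate reserved-LU bad blocks onto the source) can be summarized in a short conceptual sketch. This Python fragment is an illustration only, not array code; the chunk granularity and the bad-block bookkeeping are invented for the example.

    # Conceptual sketch of rollback with bad-block propagation.
    # reserved_lu maps chunk -> preserved point-in-time data, or the
    # sentinel BAD_BLOCK if that chunk of the reserved LU is unreadable.
    BAD_BLOCK = object()

    def rollback(reserved_lu, source, bad_map):
        """Copy preserved chunks back onto the snap source LUN."""
        for chunk, data in reserved_lu.items():
            if data is BAD_BLOCK:
                # The point-in-time copy is unreadable: mark the matching
                # source chunk bad so stale data is never returned, even
                # though the physical region on the source is healthy.
                bad_map.add(chunk)
            else:
                source[chunk] = data
                bad_map.discard(chunk)   # overwriting repairs a bad block

    # Usage: chunk 7 of the reserved LU is unreadable, so chunk 7 of the
    # source is marked bad while the rest of the rollback completes.
    src, bad = {5: b"new", 7: b"new"}, set()
    rollback({5: b"old", 7: BAD_BLOCK}, src, bad)
    print(src, bad)   # {5: b'old', 7: b'new'} {7}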

Step-by-step snapshots overview - all platforms

This section contains examples, from setting up snapshots (with Navisphere CLI) to using them (with admsnap and Navisphere CLI). Some examples show the main steps in general terms; other examples are specific to a particular platform. In the following procedures, you will use the SnapView snapshot CLI commands in addition to the admsnap snapshot commands to set up (from the production server) and use snapshots (from the secondary server).

1. Choose the LUNs for which you want a snapshot. The size of these LUNs will help you determine an approximate reserved LUN pool size. The LUN(s) in the reserved LUN pool store the original data when that data is first modified on the source LUN(s). To manually estimate a suitable LUN pool size, refer to Managing Storage Systems > Configuring and Monitoring the Reserved LUN Pool in the Table of Contents for the Navisphere Manager online help and select the Estimating the Reserved LUN Pool Size topic, or see the chapter on the reserved LUN pool in the latest revision of the EMC Navisphere Manager Administrator's Guide.

2. Configure the reserved LUN pool. You must configure the reserved LUN pool before you start a SnapView session. Use Navisphere Manager to configure the reserved LUN pool (refer to the online help topic Managing Storage Systems > Configuring and Monitoring the Reserved LUN Pool or the chapter on the reserved LUN pool in the latest revision of the EMC Navisphere Manager Administrator's Guide).

3. Stop I/O and make sure all data cached on the production server is flushed to the source LUN(s) before issuing the admsnap start command.

For a Windows server, you can use the admsnap flush command to flush the data.

For Solaris, HP-UX, AIX, and Linux servers, unmount the file system by issuing the umount command. If unable to unmount the file system, you can issue the admsnap flush command.

For an IRIX server, unmount the file system by issuing the umount command. If you cannot unmount the file system, you can use the sync command to flush cached data. The sync command reduces the number of times you need to issue the fsck command on the secondary server's file system. Refer to your system's man pages for sync command usage.

For a Novell NetWare server, use the dismount command on the volume to dismount the file system.

Neither the flush command nor the sync command is a substitute for unmounting the file system. Both commands only complement unmounting the file system.

4. On the production server, log in as admin or root and issue an admsnap start command for the desired data object (drive letter, device name, or file system) and session name. The admsnap start command starts the session. You must start a session for each snapshot of a specific LUN(s) you want to access simultaneously. You start a session from the production server based on the source LUN(s). You will mount the snapshot on a different server (the secondary server). You can also mount additional snapshots on other servers. You can start up to eight sessions per source LUN. This limit includes any reserved sessions that are used for another application such as SAN Copy and MirrorView/Asynchronous. However, only one SnapView session can be active on a secondary server at a time. If you want to access more than one snapshot simultaneously on a secondary server (for example, 2:00 p.m. and 3:00 p.m. snapshots of the same LUN(s), to use for rolling backups), you can create multiple snapshots, activate each one on a different SnapView session, and add the snapshots to different storage groups. Or you can activate and deactivate snapshots on a single server.

For an IRIX fabric connection only, the device name includes the worldwide port name. It has the form /dev/rdsk/ZZZ/lunVsW/cXpYYY where:
ZZZ - worldwide node name
V - LUN number
W - slice/partition number
X - controller number
YYY - port number

The SnapView driver will use this moment as the beginning of the session and will make a snapshot of this data available. Sample start commands follow.

IBM AIX Server (UNIX)
admsnap start -s session1 -o /dev/hdisk21 (for a device name)
admsnap start -s session1 -o /database (for a file system)

HP-UX Server (UNIX)
admsnap start -s session1 -o /dev/rdsk/c0t0d0 (for a device name)
admsnap start -s session1 -o /database (for a file system)

Veritas volume examples:
Example of a Veritas volume name: scratch
Example of a fully qualified pathname to a Veritas volume:
admsnap start -s session1 -o /dev/vx/dsk/scratchdg/scratch
Example of a fully qualified pathname to a raw Veritas device name:
admsnap start -s session1 -o /dev/vx/rdmp/c1t0d0


IRIX Server (UNIX)
admsnap start -s session1 -o /dev/rdsk/dks1d0l9 (for a device name)
admsnap start -s session1 -o /database (for a file system)

Linux Server (UNIX)
admsnap start -s session1 -o /dev/sdc (for a device name)
admsnap start -s session1 -o /database (for a file system)

Veritas volume examples:
Example of a Veritas volume name: scratch
Example of a fully qualified pathname to a Veritas volume:
admsnap start -s session1 -o /dev/vx/dsk/scratchdg/scratch
Example of a fully qualified pathname to a raw Veritas device name:
admsnap start -s session1 -o /dev/vx/rdmp/sdc6

NetWare Server
load sys:\emc\admsnap\admsnap start -s session1 -o V596-A2-D0:2 (for a device name) (V596 is the vendor number.)

Sun Solaris Server (UNIX)
admsnap start -s session1 -o /dev/rdsk/c0t0d0s7 (for a device name)
admsnap start -s session1 -o /database (for a file system)

Veritas volume examples:
Example of a Solaris Veritas volume name: scratch
Example of a fully qualified pathname to a Veritas volume:
admsnap start -s session1 -o /dev/vx/dsk/scratchdg/scratch
Example of a fully qualified pathname to a raw Veritas device name:
admsnap start -s session1 -o /dev/vx/rdmp/c1t0d0s2

Windows Server
admsnap start -s session1 -o \\.\PhysicalDrive1 (for a physical drive name)
admsnap start -s session1 -o H: (for a drive letter)

5. Using Navisphere CLI, create a snapshot of the source LUN(s) for the storage system that holds the source LUN(s), as follows. You must create a snapshot for each session you want to access simultaneously. Use the naviseccli or navicli snapview command with -createsnapshot to create each snapshot.
naviseccli -h hostname snapview -createsnapshot

6. If you do not have a VMware ESX Server, use the storagegroup command to assign each snapshot to a storage group on the secondary server. If you have a VMware ESX Server, skip to step 7 to activate the snapshot.

7. On the secondary server, use an admsnap activate command to make the new session available for use. A sample admsnap activate command is:
admsnap activate -s session1

    On a Windows server, the admsnap activate command finishes rescanning the system and assigns drive letters to newly discovered snapshot devices. You can use this drive immediately.

On an AIX server, you need to import the snap volume (LUN) by issuing the chdev and importvg commands as follows:
chdev -l hdiskn -a pv=yes (this command is needed only once for any LUN)
importvg -y volume-group-name hdiskn
where n is the number of the hdisk that contains a LUN in the volume group and volume-group-name is the volume group name.

On a UNIX server, after a delay, the admsnap activate command returns the snapshot device name. You will need to run fsck on this device only if it contains a file system and you did not unmount the source LUN(s). Then, if the source LUN(s) contains a file system, mount the file system on the secondary server using the snapshot device name to make the file system available for use. If you failed to flush the file system buffers before starting the session, the snapshot may not be usable. Depending on your operating system platform, you may need to perform an additional step before admsnap activate to rescan the I/O bus. For more information, see the product release notes.

For UNIX, run fsck on the device name returned by the admsnap command, but when you mount that device using the mount command, use the device name beginning with /dev/dsk instead of the /dev/rdsk device name returned by the admsnap command.

On a NetWare server, issue a list devices or Scan All LUNs command from the server console. After a delay, the system returns the snapshot device name. You can then mount the volume associated with this device name to make a file system available for use. You may need to perform an additional step to rescan the I/O bus. For more information, see the product release notes.


    8. If you have a VMware ESX Server, do the following:

a. Use the storagegroup command to add the snapshot to the storage group connected to the ESX Server that will access the snapshot.
b. Rescan the bus at the ESX Server level.
c. If a Virtual Machine (VM) is already running, power off the VM and use the Service Console of the ESX Server to assign the snapshot to the VM. If a VM is not running, create a VM on the ESX Server and assign the snapshot to the VM.
d. Power on the VM and scan the bus at the VM level. For VMs running Windows, use the admsnap activate command to rescan the bus.

9. On the secondary server, you can access data on the snapshot(s) for backup, data analysis, modeling, or other use.

10. On the secondary server, when you finish with the snapshot data, release each active snapshot from the operating system:

On a Windows server, release each snapshot device you activated, using the admsnap deactivate command.

On an AIX server, export the snap volume (LUN) by issuing the varyoffvg and exportvg commands as follows:
varyoffvg volume-group-name
exportvg volume-group-name
Then release each snapshot device you activated, using the admsnap deactivate command.

On a UNIX server, unmount any file systems that were mounted from the snapshot device by issuing the umount command. Then release each snapshot device you activated, using the admsnap deactivate command.

On a NetWare server, use the dismount command on the volume to dismount the file system.

A deactivate command is required for each active snapshot. If you do not deactivate a snapshot, the secondary server cannot activate another session using the pertinent source LUN(s). When you issue the admsnap deactivate command, any writes made to the snapshot are destroyed.

11. On the production server, stop the session using the admsnap stop command. This frees the reserved LUN and SP memory used by the session, making them available for use by other sessions. Sample admsnap stop commands are identical to the start commands shown in step 4; substitute stop for start.

12. If you will not need the snapshot of the source LUN(s) again soon, use the CLI snapview -rmsnapshot command to remove it. If you remove the snapshot, then for a future snapshot you must execute all previous steps. If you do not remove the snapshot, then for a future snapshot you can skip steps 3 and 5.

HP-UX - admsnap snapshot script example

This example shows how to use admsnap with scripts for copying and accessing data on an HP-UX secondary server.

1. From the production server, create the following script:

Script 1

a. Quiesce I/O on the source server.
b. Unmount the file system by issuing the umount command. If you are unable to unmount the file system, issue the admsnap flush command. The flush command flushes all cached data. The flush command is not a substitute for unmounting the file system; the command only complements the unmount operation.
c. Start the session by issuing the following command:
/usr/admsnap/admsnap start -s snapsession_name -o device_name or filesystem_name
d. Invoke Script 2 on the secondary server using the remsh command.
e. Stop the session by issuing the following command:
/usr/admsnap/admsnap stop -s snapsession_name -o device_name or filesystem_name

2. From the secondary server, create the following script:

Script 2

a. Perform any necessary application tasks in preparation for the snap activation (for example, shut down the database).
b. Activate the snapshot by issuing the following command:
/usr/admsnap/admsnap activate -s snapsession_name
c. Create a new volume group directory, by using the following form:
mkdir /dev/volumegroup_name
mknod /dev/volumegroup_name/group c 64 0xX0000
d. Issue the vgimport command, using the following form:
vgimport volumegroup_name /dev/dsk/cNtNdN
e. Activate the volume group for this LUN by issuing the following command:
vgchange -a y volumegroup_name


f. Run fsck on the volume group, by doing the following:
fsck -F filesystem_type /dev/volumegroup_name/logicalvolume_name
This step is not necessary if the secondary server has a different HP-UX O/S revision than the production server.
g. Mount the file system using the following command:
mount /dev/volumegroup_name/logicalvolume_name /filesystem_name
h. Perform the desired tasks with the mounted data (i.e., copy the contents of the mounted file system to another location on the secondary server).
i. Unmount the file system mounted in step g using the following command:
umount /dev/volumegroup_name/logicalvolume_name
j. Deactivate and export the volume group for this LUN, by issuing the following commands:
vgchange -a n volumegroup_name
vgexport volumegroup_name
k. Unmount the file system by issuing the umount command. If you are unable to unmount the file system, issue the admsnap flush command. The flush command flushes all cached data. If this is not done, the next admsnap session may yield stale data.
l. Deactivate the snapshot by using the following command:
/usr/admsnap/admsnap deactivate -s snapsession_name
m. Perform any necessary application tasks in preparation for using the data captured in step 6 (i.e., start up the database).
n. Exit this script, and return to Script 1.

UNIX - admsnap single session example

The following commands start, activate, and stop a SnapView session. This example shows UNIX device names.

On the production server, make sure all cached data is flushed to the source LUN by unmounting the file system:
umount /dev/dsk/c1t2d0s4

If unable to unmount the file system on a Solaris, HP-UX, AIX, or Linux server, issue the admsnap flush command:
admsnap flush -o /dev/rdsk/c1t2d0s4

On an IRIX server, the admsnap flush command is not supported. Use the sync command to flush all cached data. The sync command reduces the number of times you need to issue the fsck command on the secondary server's file system. Refer to your system's man pages for sync command usage. A typical example would be:
sync /dev/dsk/c1t2d0s4

Neither the flush command nor the sync command is a substitute for unmounting the file system. Both commands only complement unmounting the file system.

1. Start the session:
admsnap start -s friday -o /dev/rdsk/c1t2d0s4
Attempting to start session friday on device /dev/rdsk/c1t2d0s4
Attempting to start the session on the entire LUN.
Started session friday.

The start command starts a session named friday with the source named /dev/rdsk/c1t2d0s4.

2. On the secondary server, activate the session:
admsnap activate -s friday
Session friday activated on /dev/rdsk/c1t2d0s4.

On the secondary server, the activate command makes the snapshot image accessible.

3. On a UNIX secondary server, if the source LUN has a file system, mount the snapshot:
mount /dev/dsk/c5t3d2s1 /mnt

4. On the secondary server, the backup or other software accesses the snapshot as if it were a standard LUN.


5. When the desired operations are complete, from the secondary server, unmount the snapshot. With UNIX, you can use admsnap deactivate to do this:
admsnap deactivate -s friday -o /dev/dsk/c5t3d2s1

6. From the production server, stop the session:
admsnap stop -s friday -o /dev/dsk/c1t2d0s4
Stopped session friday on object /dev/rdsk/c1t2d0s4.
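For repeated backups, the single-session workflow above lends itself to scripting. The following Python sketch only illustrates the sequence of commands documented in this example: it shells out to the same admsnap, mount, and umount commands shown above, run on the appropriate host. The host-splitting, device names, and error handling are assumptions for illustration, not a supported EMC utility.

    # Illustrative wrapper around the admsnap single-session workflow.
    # Production-server steps and secondary-server steps are shown in one
    # script for brevity; replace the device names with real ones.
    import subprocess

    SESSION = "friday"
    SOURCE_RAW = "/dev/rdsk/c1t2d0s4"      # source LUN on the production server
    SNAP_BLOCK = "/dev/dsk/c5t3d2s1"       # snapshot device on the secondary
    MOUNT_POINT = "/mnt"

    def run(*cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Production server: flush and start the session.
    run("admsnap", "flush", "-o", SOURCE_RAW)
    run("admsnap", "start", "-s", SESSION, "-o", SOURCE_RAW)

    # Secondary server: activate, mount, back up, then release.
    run("admsnap", "activate", "-s", SESSION)
    run("mount", SNAP_BLOCK, MOUNT_POINT)
    # ... backup or analysis of MOUNT_POINT happens here ...
    run("umount", MOUNT_POINT)
    run("admsnap", "deactivate", "-s", SESSION, "-o", SNAP_BLOCK)

    # Production server: stop the session and free the reserved LUN space.
    run("admsnap", "stop", "-s", SESSION, "-o", SOURCE_RAW)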

The stop command terminates session friday, freeing the reserved LUN used by the session and making the snapshot inaccessible.

Windows - admsnap multiple session example

The following example shows three SnapView sessions, started and activated sequentially, using Windows device names. The example shows how each snapshot maintains the data at the time the snapshot was started; here, the data is a listing of files in a directory. The activity shown here is the only activity on this LUN during the sessions.

Procedural overview

1. Make sure the directory that holds admsnap is on your path.
2. Start sessions snap1, snap2, and snap3 on the production server in sequence and activate each session in turn on the secondary server. All sessions run on the same LUN.
3. When session snap1 starts, four files exist on the LUN. Before starting snap2, create four more files in the same directory. On the secondary server, deactivate snap1. Deactivate is needed because only one session can be active per server at one time.
4. On the production server start snap2, and on the secondary server activate snap2. After activating snap2, list files, displaying the files created between session starts.
5. Create three more files on the source LUN and start session snap3. After deactivating snap2 and activating snap3, verify that you see the files created between the start of sessions snap2 and snap3.

The filenames are self-explanatory.

Detailed procedures with output examples

Session Snap1

1. On the production server, list files in the test directory.
F:\> cd test
F:\Test> dir
Directory of F:\Test
01/21/2002 09:21a 0 FilesBeforeSession1-a.txt
01/21/2002 09:21a 0 FilesBeforeSession1-b.txt
01/21/2002 09:21a 0 FilesBeforeSession1-c.txt
01/21/2002 09:21a 0 FilesBeforeSession1-d.txt

2. On the production server, flush data on the source LUN, and then start the first session, snap1.
F:\Test> admsnap flush -o f:
F:\Test> admsnap start -s snap1 -o f:
Attempting to start session snap1 on device \\.\PhysicalDrive1.
Attempting to start session on the entire LUN.
Started session snap1
F:\Test>

3. On the secondary server, activate the first session, snap1.
C:\> prompt $t $p
14:57:10.79 C:\> admsnap activate -s snap1
Scanning for new devices.
Activated session snap1 on device F:

4. On the secondary server, list files to show production files that existed at session 1 start.
14:57:13.09 C:\> dir f:\test
Directory of F:\Test
01/21/02 09:21a 0 FilesBeforeSession1-a.txt
01/21/02 09:21a 0 FilesBeforeSession1-b.txt
01/21/02 09:21a 0 FilesBeforeSession1-c.txt
01/21/02 09:21a 0 FilesBeforeSession1-d.txt


Session Snap2

1. On the production server, list files in the test directory. The listing shows the files created before session 1 started plus the four additional files created since.
F:\Test> dir
Directory of F:\Test
01/21/2002 09:21a 0 FilesAfterS1BeforeS2-a.txt
01/21/2002 09:21a 0 FilesAfterS1BeforeS2-b.txt
01/21/2002 09:21a 0 FilesAfterS1BeforeS2-c.txt
01/21/2002 09:21a 0 FilesAfterS1BeforeS2-d.txt
01/21/2002 09:21a 0 FilesBeforeSession1-a.txt
01/21/2002 09:21a 0 FilesBeforeSession1-b.txt
01/21/2002 09:21a 0 FilesBeforeSession1-c.txt
01/21/2002 09:21a 0 FilesBeforeSession1-d.txt

2. On the production server, start the second session, snap2.
F:\Test> admsnap flush -o f:
F:\Test> admsnap start -s snap2 -o f:
Attempting to start session snap2 on device \\.\PhysicalDrive1.
Attempting to start the session on the entire LUN.
Started session snap2.
F:\

3. On the secondary server, deactivate the session snap1, and activate the second session, snap2.
15:10:10.52 C:\> admsnap deactivate -s snap1
Deactivated session snap1 on device F:.
15:10:23.89 C:\> admsnap activate -s snap2
Activated session snap2 on device F:

4. On the secondary server, list files to show source LUN files that existed at session 2 start.
15:10:48.04 C:\> dir f:\test
Directory of F:\Test
01/21/02 09:21a 0 FilesAfterS1BeforeS2-a.txt
01/21/02 09:21a 0 FilesAfterS1BeforeS2-b.txt
01/21/02 09:21a 0 FilesAfterS1BeforeS2-c.txt
01/21/02 09:21a 0 FilesAfterS1BeforeS2-d.txt
01/21/02 09:21a 0 FilesBeforeSession1-a.txt
01/21/02 09:21a 0 FilesBeforeSession1-b.txt
01/21/02 09:21a 0 FilesBeforeSession1-c.txt
01/21/02 09:21a 0 FilesBeforeSession1-d.txt

Session Snap3

1. On the production server, list files in the test directory. The listing shows the files created between the start of sessions 2 and 3.
F:\Test> dir
Directory of F:\Test
01/21/2002 09:21a 0 FilesAfterS1BeforeS2-a.txt
01/21/2002 09:21a 0 FilesAfterS1BeforeS2-b.txt
01/21/2002 09:21a 0 FilesAfterS1BeforeS2-c.txt
01/21/2002 09:21a 0 FilesAfterS1BeforeS2-d.txt
01/21/2002 09:21a 0 FilesAfterS2BeforeS3-a.txt
01/21/2002 09:21a 0 FilesAfterS2BeforeS3-b.txt
01/21/2002 09:21a 0 FilesAfterS2BeforeS3-c.txt
01/21/2002 09:21a 0 FilesBeforeSession1-a.txt
01/21/2002 09:21a 0 FilesBeforeSession1-b.txt
01/21/2002 09:21a 0 FilesBeforeSession1-c.txt
01/21/2002 09:21a 0 FilesBeforeSession1-d.txt


2. On the production server, flush buffers and start the third session, snap3.
F:\Test> admsnap flush -o f:
F:\Test> admsnap start -s snap3 -o f:
Attempting to start session snap3 on device \\.\PhysicalDrive1.
Attempting to start the session on the entire LUN.
Started session snap3.
F:\Test>

3. On the secondary server, flush buffers, deactivate session snap2, and activate the third session, snap3.
15:28:06.96 C:\> admsnap flush -o f:
Flushed f:.
15:28:13.32 C:\> admsnap deactivate -s snap2
Deactivated session snap2 on device F:.
15:28:20.26 C:\> admsnap activate -s snap3
Scanning for new devices.
Activated session snap3 on device F:.

4. On the secondary server, list files to show production server files that existed at session 3 start.
15:28:39.96 C:\> dir f:\test
Directory of F:\Test
01/21/02 09:21a 0 FilesAfterS1BeforeS2-a.txt
01/21/02 09:21a 0 FilesAfterS1BeforeS2-b.txt
01/21/02 09:21a 0 FilesAfterS1BeforeS2-c.txt
01/21/02 09:21a 0 FilesAfterS1BeforeS2-d.txt
01/21/02 09:21a 0 FilesAfterS2BeforeS3-a.txt
01/21/02 09:21a 0 FilesAfterS2BeforeS3-b.txt
01/21/02 09:21a 0 FilesAfterS2BeforeS3-c.txt
01/21/02 09:21a 0 FilesBeforeSession1-a.txt
01/21/02 09:21a 0 FilesBeforeSession1-b.txt
01/21/02 09:21a 0 FilesBeforeSession1-c.txt
01/21/02 09:21a 0 FilesBeforeSession1-d.txt

5. On the secondary server, deactivate the last session.
15:28:45.04 C:\> admsnap deactivate -s snap3

6. On the production server, stop all sessions.
F:\Test> admsnap stop -s snap1 -o f:
F:\Test> admsnap stop -s snap2 -o f:
F:\Test> admsnap stop -s snap3 -o f:
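The pattern in this Windows example (keep several point-in-time sessions on one LUN and activate only one at a time on the secondary server) is the basis for rolling backups. The Python sketch below is a hedged illustration of that rotation: the admsnap commands, the f: object, and the limit of eight sessions per source LUN come from this document, while the rotation logic itself is invented for the example and is not an EMC-supplied utility.

    # Illustrative rolling-snapshot rotation built on the admsnap commands
    # shown in the Windows example above.
    import subprocess
    from collections import deque

    SOURCE = "f:"               # production drive letter (illustrative)
    MAX_SESSIONS = 8            # SnapView limit per source LUN

    active = deque()            # oldest session name first

    def sh(*cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def take_snapshot(name):
        """Start a new session on the production server, retiring the oldest
        session if the per-LUN limit would otherwise be exceeded."""
        if len(active) >= MAX_SESSIONS:
            oldest = active.popleft()
            sh("admsnap", "stop", "-s", oldest, "-o", SOURCE)
        sh("admsnap", "flush", "-o", SOURCE)
        sh("admsnap", "start", "-s", name, "-o", SOURCE)
        active.append(name)

    # Example rotation mirroring the snap1/snap2/snap3 walk-through.
    for session in ("snap1", "snap2", "snap3"):
        take_snapshot(session)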


SnapView Clones

There are three user visible storage system objects that are used by the SnapView clone capability: clone source LUN(s), clone LUN(s), and CPL(s). Since the fractured and unfractured state of a clone affects the actions greatly, a fractured clone LUN and an unfractured clone LUN are treated as two different objects in order to reduce the complexity. There is a table for each object and a number of events that pertain to each object. The Result column describes the outcome when the event occurs while the action is in progress.

Source LUN

Clone Source LUN

Action: Server I/O to a clone source LUN.

Event: I/O to the clone source LUN fails due to a LCC/BCC failure, cache dirty LUN, etc.

Result: The server I/O request fails. Server-based path failover software may trespass the clone source LUN. If the I/O error condition was due to a problem related to the owner SP of the clone source LUN, the I/O will be able to continue on the peer SP (see the trespass action for a clone source LUN below). The administrator may trespass the clone source LUN or repair access to the LUN in an attempt to restore availability to the clone LUN (an unfractured clone LUN cannot be trespassed directly). When repaired, the clone LUN will require a manual restart of the synchronization. If trespassed, see the clone source LUN trespass action below. All fractured clones associated with the clone source LUN are unaffected. All unfractured clones' image condition will be set to administratively fractured with a clone property that indicates a media failure, and the image state will be changed to consistent if the clone was not synchronizing. If the clone was synchronizing, the image state will be set to out of sync (or reverse out of sync) until the clone source LUN is repaired. The repair may happen due to a trespass of the clone source LUN if the peer SP has access to the clone source LUN (see the trespass action below).

Action: Server read from a clone source LUN.

Event: The read from the clone source LUN fails due to a bad block.

Result: The server read request fails. No effect on any clones associated with the clone source LUN.


    Action: A storage system generated read from the clone source LUN as part of a clone synchronization.
    Event: The read from the clone source LUN fails due to a bad block.
    Result: All unfractured clones will be marked with bad block(s) at the same corresponding logical offset(s) that were bad in the clone source LUN. If more than 32 KB of consecutive bad blocks from the clone source LUN are encountered as part of the synchronization operation, the clone image will be set to administratively fractured with a clone property that indicates a media failure, and the image state will be changed to out of sync until the clone source LUN is repaired. All other fractured clones associated with the same clone source LUN are unaffected.

    Action: A storage system generated read from the clone source LUN as part of a clone synchronization.
    Event: The read from the clone source LUN fails due to an LCC/BCC failure, a cache dirty LUN, etc.
    Result: The clone synchronization is aborted. The clone image will be set to administratively fractured with a clone property that indicates a media failure, and the image state will be changed to out of sync until the clone source LUN is repaired. All other fractured clones associated with the same clone source LUN are unaffected.

    Action: Storage system generated write to a clone source LUN. The storage system does this write as part of a reverse synchronization (read from the clone LUN and a write to the clone source LUN).
    Event: Write to the clone source LUN fails due to an LCC/BCC failure, a cache dirty LUN, etc. Whether the clone reverse synchronization is protected or unprotected does not matter for this scenario.
    Result: Until access to the clone source LUN is restored, the clone source LUN and the unfractured clone will be unusable; the image condition will be set to administratively fractured with a clone property that indicates a media failure, and the image state will be set to reverse out of sync. All other fractured clones associated with the same clone source LUN are unaffected.

    Action: SP that owns the clone source LUN is shut down.
    Event: Active I/O to the clone source LUN. The SP is shut down due to a Navisphere command to reboot, an NDU, an SP panic caused by a SW or HW malfunction, or the SP being physically pulled.
    Result: The clone source LUN is trespassed by the storage system; active server I/O to the clone source LUN can resume on the peer SP. See the clone source LUN trespass action below.


    Action: Clone source LUN is trespassed.
    Event: Active server I/O to the clone source LUN. Trespass of the clone source LUN can happen due to a Navisphere trespass request or via failover software when a path from the server to the clone source LUN has failed.
    Result: Active server I/O to the clone source LUN can resume on the peer SP. All fractured clones associated with the clone source LUN are unaffected and will not trespass with the clone source LUN. All unfractured clones whose image condition is normal will trespass with the clone source LUN. Any clone synchronizations (including reverse) that were in progress when the clone source LUN is trespassed will be queued to be started on the peer SP and will automatically start.
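    Where the scenarios above call for a manual trespass of the clone source LUN, it can be issued from the Navisphere CLI against the peer SP. The following is a minimal sketch only: the storage-system name ss_spb and LUN number 20 are illustrative values borrowed from the Windows example later in this section, and the exact trespass syntax should be verified against the Navisphere CLI Reference for the FLARE revision in use.

        rem Ask SP B to take ownership of source LUN 20 (illustrative values;
        rem confirm the trespass command syntax in the CLI Reference before use).
        naviseccli -h ss_spb trespass lun 20

    As noted in the table, unfractured clones whose image condition is normal follow the source LUN to the peer SP, and any in-progress synchronizations are queued and restarted there automatically.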

    Clone LUN (Fractured)

    Action: Server write to a fractured clone LUN.
    Event: I/O to the clone LUN fails due to an LCC/BCC failure, a cache dirty LUN, etc.
    Result: The server write request fails. Server-based path failover software may trespass the clone LUN. If the I/O error condition was due to a problem with the owner SP of the clone LUN, the I/O will be able to continue on the peer SP. The clone source LUN is unaffected. All other clones associated with the same clone source LUN are unaffected.

    Action: A server read from a fractured clone LUN.
    Event: Read from the clone LUN fails due to a bad block.
    Result: The server read request fails. All other fractured and unfractured clones associated with the same clone source LUN are unaffected.

    Action: SP that owns the clone LUN is shut down.
    Event: Active server I/O to the clone source LUN. The SP can be shut down due to a Navisphere command to reboot, an NDU, an SP panic caused by a SW or HW malfunction, or the SP being physically pulled.
    Result: Server-based path failover software may trespass the clone LUN; active server I/O can resume on the peer SP. See the clone LUN trespass action below. The clone source LUN is unaffected (provided it is owned by the peer SP). All other clones associated with the same clone source LUN are unaffected.

    Action: Clone LUN is trespassed.
    Event: Active server I/O to the clone LUN. Trespass of the clone LUN can happen due to a trespass command, or failover software explicit or auto trespass when a path from the server to the clone LUN has failed.
    Result: Active server I/O to the clone LUN can resume on the peer SP. The clone source LUN associated with this fractured clone LUN and all other clones associated with the same clone source LUN are unaffected.


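    Most of the events above are encountered while a fractured clone is presented to a backup (secondary) server. As a reminder, the admsnap sequence that brackets that use is sketched below; the drive letter h: is illustrative, and the full procedure is given step by step later in this section.

        rem On the secondary server, after the clone has been fractured and the bus rescanned:
        admsnap clone_activate
        rem ... use the fractured clone for backup, verification, etc. ...
        rem Release the clone cleanly before any resynchronization:
        admsnap clone_deactivate -o h: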

    Clone LUN (Unfractured)

    Action: Storage system generated I/O to an unfractured clone LUN. The storage system does a write in order to replicate data written to the clone source LUN or as a result of a synchronization (not a reverse synchronization). The storage system does a read of the unfractured clone LUN as part of a reverse synchronization operation.
    Event: I/O to the clone LUN fails due to an LCC/BCC failure, a cache dirty LUN, etc.
    Result: The clone LUN will be fractured with the media failure property set, indicating the inability to access the clone LUN, and the image condition will be set to administratively fractured with a clone property that indicates a media failure. The clone will be unusable and the image state will be set to out of sync. The administrator may trespass the clone source LUN or repair access to the LUN in an attempt to restore availability to the clone LUN (an unfractured clone LUN cannot be trespassed directly). When repaired, the clone LUN will require a manual restart of the synchronization. If trespassed, see the clone source LUN trespass action above. All other fractured clones associated with the same clone source LUN are unaffected.

    Action: Storage system generated read from a clone LUN. The storage system does this read as part of a reverse synchronization (read from the clone LUN and a write to the clone source LUN).
    Event: The read from the clone LUN fails due to a bad block. Whether the clone reverse synchronization is protected or unprotected does not matter for this scenario.
    Result: The clone source LUN will be marked with bad block(s) at the same corresponding logical offset(s) that were bad in the clone LUN. If more than 32 KB of consecutive bad blocks from the clone LUN are encountered as part of the reverse synchronization operation, the clone image will be set to administratively fractured with a clone property that indicates a media failure, and the image state will be changed to reverse out of sync until the clone LUN is repaired. All other fractured clones associated with the same clone source LUN are unaffected.

    Action: Storage system generated read from a clone LUN. The storage system does this read as part of a reverse synchronization (read from the clone LUN and a write to the clone source LUN).
    Event: The read from the clone LUN fails due to an LCC/BCC failure, a cache dirty LUN, etc. Whether the clone reverse synchronization is protected or unprotected does not matter for this scenario.
    Result: The clone reverse synchronization operation is aborted. The clone source LUN is made inaccessible. The clone image will be set to administratively fractured with a clone property that indicates a media failure, and the image state will be changed to out of sync until the clone LUN is repaired. All other fractured clones associated with the same clone source LUN are unaffected.



    Action: SP that owns the clone LUN is shut down (this is the same SP that owns the clone source LUN).
    Event: Active I/O to the clone source LUN. The SP can be shut down due to a Navisphere command to reboot or during an NDU.
    Result: All unfractured clone LUNs will follow the SP owner of the clone source LUN. See the event description under the clone source LUN SP shutdown action above. Any clone synchronizations that were in progress will be queued to be started on the peer SP if the recovery policy is set to automatic; otherwise the clone image condition will be set to administratively fractured with an image state of out of sync or reverse out of sync until the SP that failed reboots.

    Action: SP that owns the clone LUN fails (this is the same SP that owns the clone source LUN).
    Event: Active I/O to the clone source LUN. The SP can fail due to a SW or HW malfunction, or the SP is physically pulled.
    Result: All unfractured clone LUNs will follow the SP owner of the clone source LUN. See the event description under the clone source LUN SP shutdown action above. Any clone synchronizations that were in progress will be queued to be started on the peer SP.
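    Several of the events above leave a clone administratively fractured and require the synchronization to be restarted manually once the fault is repaired. A hedged sketch of that restart follows; the group name lun20_clone, the clone ID, and the credentials are illustrative values from the Windows example later in this section, and it is assumed here that -syncclone takes the same -name and -cloneid arguments as the -fractureclone and -reversesyncclone commands shown there.

        rem Restart an interrupted synchronization after the clone LUN is repaired
        rem (illustrative name and clone ID; argument form assumed, verify in the CLI Reference).
        naviseccli -User GlobalAdmin -Password password -Scope 0 -Address ss_spa snapview -syncclone -name lun20_clone -cloneid 0100000000000000 -o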

    CPL

    Action: Storage system generated write to the CPL. The storage system does this write in order to mark regions that were changed by a write to a clone source LUN or a write to a fractured clone LUN, which is what makes incremental synchronizations possible. Writes are also done by the storage system to clear marked regions as part of a synchronization operation.
    Event: Write to the CPL fails due to an LCC/BCC failure, a cache dirty LUN, etc. Whether the clone reverse synchronization is protected or unprotected does not matter for this scenario.
    Result: If the CPL write was to mark a region for the clone source LUN or a fractured clone LUN, the operation proceeds without error because the marked regions will be maintained in SP memory. The CPL can be reassigned to a newly bound LUN while the system is running to repair it. If the CPL-owning SP reboots before the reassignment can take place, all clone LUNs will require a full synchronization. If the CPL write was due to mirroring data or a synchronization (including reverse synchronization), all unfractured clone LUN image states are set to out of sync or reverse out of sync until the CPL is repaired. All clone LUNs that were synchronizing will have their image condition set to administratively fractured with a clone property that indicates a media failure. The CPL can be repaired while the system is running (as described above). After the CPL is repaired, all clone synchronizations must be manually restarted via an administrative command.



    Action: Storage system generated read from a CPL. The storage system does this read as part of a synchronization (including reverse synchronization) to determine which blocks need to be copied from the clone LUN to the clone source LUN.
    Event: The read from the CPL fails due to a bad block, an LCC/BCC failure, a cache dirty LUN, etc.
    Result: Since the regions represented by the block(s) that cannot be read from the CPL are unknown, the storage system will cause full synchronizations (including reverse synchronizations). If the operation was a protected reverse synchronization, the writes that were performed to the clone source LUN during the reverse synchronization operation will not be retained (the clone source LUN will have the data of the protected clone).
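    Because the CPL can be reassigned to a newly bound LUN while the system is running, a repair typically amounts to binding replacement private LUNs and re-running the allocation. The sketch below is illustrative only: LUN numbers 102 and 103 and RAID group 100 are assumed values, and whether the existing clone private LUNs must first be deallocated (snapview -deallocatecpl) before the new allocation is accepted should be confirmed in the SnapView administrator documentation for your revision.

        rem Bind two replacement clone private LUNs (illustrative LUN and RAID group numbers).
        naviseccli -h ss_spa bind r5 102 -rg 100 -sp A -sq mb -cp 200
        naviseccli -h ss_spa bind r5 103 -rg 100 -sp B -sq mb -cp 200
        rem Reassign the clone private LUNs to the replacement LUNs
        rem (may require snapview -deallocatecpl first; verify before use).
        naviseccli -User GlobalAdmin -Password password -Scope 0 -Address ss_spa snapview -allocatecpl -spA 102 -spB 103 -o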

    Step-by-step clone overview - all platforms

    Clones use asynchronous writes until they are in sync; once in sync, writes are synchronous (each write to the source is also written to the clone). When the clone is using synchronous writes, you may see a performance impact. Clones spend most of their existence fractured. This section contains examples, from setting up clones (with Navisphere CLI) to using them (with admsnap and Navisphere CLI). Some examples outline the main steps; other examples are specific to a particular platform. In the following example, you will use the SnapView clone CLI commands in addition to the admsnap clone commands to set up a clone (from the production server) and use it (from the secondary server).

    1. On the storage system, bind a LUN for each SP to serve as a clone private LUN. The clone private LUNs (one for each SP) are shared by all clone groups on a storage system. They store temporary system information, called fracture logs, that is used to speed up synchronization of the source LUN and its clone. The clone private LUN can be any public LUN that is not part of any storage group. The minimum and standard size for each clone private LUN is 250000 blocks; there is no benefit, in performance or otherwise, to using clone private LUNs larger than 250000 blocks.

    2. On the storage system, bind a LUN to serve as the clone. Each clone should be the same size as the source LUN. The source and clone LUNs can be on the same SP or different SPs.

    3. If the source LUN does not exist (for example, because you are creating a new database), you can bind it at the same time as the clone. Then you can add the new source LUN to a storage group.

    4. Assign the LUN you plan to use as your clone to a storage group. You must assign the clone LUN to a storage group other than the storage group that holds the source LUN. Use the Navisphere CLI command storagegroup as described in the EMC Navisphere Command Line Interface (CLI) Reference.

    5. On the storage system, allocate the clone private LUNs. Use the CLI command -allocatecpl for this.

    6. On the storage system, create the clone group. Use the CLI command -createclonegroup for this.

    7. If the LUN you chose as your clone is mounted on a secondary server, deactivate the LUN from the server it is mounted on by issuing the appropriate command for your operating system:
       On a Windows server, use the following admsnap command:
          admsnap clone_deactivate -o clone_drive_letter
       On a UNIX server, unmount the file system on the LUN you want to use as a clone by issuing the umount command.
       On a Novell NetWare server, use the dismount command on the volume to dismount the file system.


    8. On the storage system, add the LUN you bound as your clone in step 2 to the clone group. Use the CLI command -addclone for this. By default, when you use the -addclone command, the software starts synchronizing the clone (copying source LUN data to the clone). If the source LUN has meaningful data on it, then synchronization is necessary. Depending on the size of the source LUN, a synchronization may take several hours. If you do not want the default synchronization to occur when you add the clone to the clone group, you can tell the CLI that synchronization is not required; to do this, use the -issyncrequired option in the -addclone command. An initial synchronization is not required if your source LUN does not contain any data. If you specify an initial sync with an empty source LUN, resources are needlessly used to synchronize the source LUN to the clone LUN.

    9. After the clone is synchronized, you can fracture it and use it independently by performing the following steps:

       a. Quiesce I/O to the source LUN.
       b. Flush all cached data to the source LUN by issuing the appropriate command for your operating system.
          For a Windows server, use the admsnap flush command to flush all server buffers.
             admsnap flush -o E:
          For Solaris, HP-UX, AIX, and Linux servers, unmount the file system by issuing the umount command. If you are unable to unmount the file system, you can issue the admsnap flush command.
             admsnap flush -o /dev/rdsk/c1t0d2s2
          For an IRIX server, the admsnap flush command is not supported. Unmount the file system by issuing the umount command. If you cannot unmount the file system, use the sync command to flush cached data. The sync command reduces the number of times you need to issue the fsck command on the secondary server's file system. Refer to your system's man pages for sync command usage.
          For a Novell NetWare server, use the dismount command on the volume to dismount the file system.
          Neither the flush command nor the sync command is a substitute for unmounting the file system; both commands only complement unmounting the file system. With some operating systems, additional steps may be required on the secondary server in order to flush all data and clear all buffers. For more information, see the product release notes.
       c. Wait for the clone to transition to the synchronized state.
       d. Fracture the clone using the CLI fracture command.
       e. Resume I/O to the source LUN.
       f. For Windows, use the admsnap clone_activate command to make the newly fractured clone available to the operating system.

    After a delay, the admsnap clone_activate command finishes rescanning the system and assigns drive letters to newly discovered clone devices.

    g. Important: If the secondary server is running Windows NT and the clone was already mounted on a secondary server, a reboot is required after you activate the fractured clone. If the secondary server is running Windows 2000, a reboot is recommended but not required. For UNIX servers, for all platforms except Linux, clone_activate tells the operating system to scan for new LUNs. For Linux, you must either reboot the server or unload and load the HBA driver. On a NetWare server, run the command list devices or use the command scan all LUNs on the console.

    10. If you have a VMware ESX Server, do the following:
       a. Rescan the bus at the ESX Server level.
       b. If a Virtual Machine (VM) is already running, power off the VM and use the Service Console of the ESX Server to assign the clone to the VM. If a VM is not running, create a VM on the ESX Server and assign the clone to the VM.
       c. Power on the VM and scan the bus at the VM level. For VMs running Windows, you can use the admsnap activate command to rescan the bus.

    11. Use the fractured clone as you wish: for backup, reverse synchronization, or other use.

    12. If you want to synchronize the clone LUN, perform the following steps to deactivate the clone:

       a. For Windows, use the admsnap clone_deactivate command, which flushes all server buffers, dismounts the file system, and removes the drive letter assigned by clone_activate. For multi-partitioned clone devices (those with more than one drive letter mounted on the same physical clone device), all other drive letters associated with that device will also be flushed, dismounted, and removed.
          admsnap clone_deactivate -o E:
       b. For UNIX, unmount the file system by issuing the umount command. If you cannot unmount the file system, you can use the sync command to flush buffers. The sync command is not a substitute for unmounting the file system, but you can use it to reduce the number of incidents of having to fsck the file system on your backup server. Refer to your system's man pages for sync command usage.
       c. For NetWare, use the dismount command on the clone volume to dismount the file system.
       d. Start synchronizing the clone. Use the CLI command -syncclone for this.

    13. If you have finished with this clone, you can remove the clone from its clone group. You can also do the following (see the consolidated sketch below):
       Destroy the clone group by using the CLI command -destroyclonegroup.
       Remove the clone LUN by using the CLI command -removeclone.
       Deallocate the clone private LUNs by using the CLI command -deallocatecpl.
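    A consolidated sketch of that cleanup, using the illustrative group name, clone ID, and credentials from the Windows example that follows, might look like the sequence below (remove the clone first, then destroy the empty group, then deallocate the private LUNs once no clone groups remain); the exact options accepted by -deallocatecpl should be confirmed in the CLI Reference.

        rem Remove the clone LUN from the clone group (illustrative name and clone ID).
        naviseccli -User GlobalAdmin -Password password -Scope 0 -Address ss_spa snapview -removeclone -name lun20_clone -cloneid 0100000000000000 -o
        rem Destroy the now-empty clone group.
        naviseccli -User GlobalAdmin -Password password -Scope 0 -Address ss_spa snapview -destroyclonegroup -name lun20_clone -o
        rem Deallocate the clone private LUNs once no clone groups remain on the array
        rem (options for -deallocatecpl assumed; verify in the CLI Reference).
        naviseccli -User GlobalAdmin -Password password -Scope 0 -Address ss_spa snapview -deallocatecpl -o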


    For future clone operations, if you have not removed any required clone components as in step 13, then synchronize if needed and return to step 9.

    Windows - clone example

    The following example shows all the naviseccli or navicli and admsnap commands needed to set up and use a clone on a Windows platform. It includes binding and unbinding the LUNs and RAID Groups.

    1. Create the source and clone RAID Groups and bind the LUNs.

        naviseccli -h ss_spA createrg 10 1_0 1_1 1_2 1_3 1_4
        naviseccli -h ss_spA createrg 11 1_5 1_6 1_7 1_8 1_9
        naviseccli -h ss_spA bind r5 20 -rg 10 -sp A
        naviseccli -h ss_spA bind r5 21 -rg 11 -sp A

        To use these commands with navicli, replace naviseccli with navicli.

    2. Create the clone private LUNs, each at least 250000 blocks long.

        naviseccli -h ss_spA createrg 100 2_1 2_2 2_3 2_4 2_5
        naviseccli -h ss_spA bind r5 100 -rg 100 -sp A -sq mb -cp 200
        naviseccli -h ss_spa bind r5 101 -rg 100 -sp A -sq mb -cp 200

        To use these commands with navicli, replace naviseccli with navicli.

    3. Wait for all the LUNs to complete binding. Then set up the storage groups.

        naviseccli -h ss_spa storagegroup -create -gname Production
        naviseccli -h ss_spa storagegroup -create -gname Backup
        naviseccli -h ss_spa storagegroup -connecthost -o -host ServerABC -gname Production
        naviseccli -h ss_spa storagegroup -connecthost -o -host ServerXYZ -gname Backup
        naviseccli -h ss_spa storagegroup -addhlu -gname Production -hlu 20 -alu 20
        naviseccli -h ss_spa storagegroup -addhlu -gname Backup -hlu 21 -alu 21

        To use these commands with navicli, replace naviseccli with navicli.

    4. On both servers, rescan or reboot to let the operating systems see the new LUNs.

    5. Allocate the clone private LUNs.

        naviseccli -User GlobalAdmin -Password mypassw -Scope 0 -Address ss_spa snapview -allocatecpl -spA 100 -spB 101 -o

        To use this command with navicli.jar, replace naviseccli with java -jar navicli.jar.

    6. Create the clone group and add the clone.

        naviseccli -user GlobalAdmin -password mypassw -scope 0 -address ss_spa snapview -createclonegroup -name lun20_clone -luns 20 -description Creatinglun20_clone -o
        naviseccli -user GlobalAdmin -password password -scope 0 -address ss_spa snapview -addclone -name lun20_clone -luns 21

        To use these commands with navicli.jar, replace naviseccli with java -jar navicli.jar.

    7. Run Disk Management on the production server and create an NTFS file system on the source LUN. Copy files to the drive letter assigned to the source LUN on the production server. This example uses g: as the drive letter for the source LUN.

    8. On the production server, run admsnap to flush the buffers.

        admsnap flush -o g:

        The clone transitions to the synchronized state.

    9. Fracture the clone.

        naviseccli -User GlobalAdmin -Password password -Scope 0 -Address ss_spa snapview -fractureclone -name lun20_clone -cloneid 0100000000000000 -o

        To use this command with navicli.jar, replace naviseccli with java -jar navicli.jar.

    10. On the secondary server, run admsnap to activate the clone.

        admsnap clone_activate


        The admsnap software returns a drive letter for the drive assigned to the clone that was just fractured. This example uses h: as the drive letter for the clone LUN.

    11. Verify that the files that were copied to the source LUN also appear on the clone LUN.

    12. If you have a VMware ESX Server, do the following:
       a. Rescan the bus at the ESX Server level.
       b. If a Virtual Machine (VM) is already running, power off the VM and use the Service Console of the ESX Server to assign the clone to the VM. If a VM is not running, create a VM on the ESX Server and assign the clone to the VM.
       c. Power on the VM and scan the bus at the VM level. For VMs running Windows, you can use the admsnap activate command to rescan the bus.

    13. On the secondary server, delete the existing files and copy different files to the clone (to h:).

    14. On the secondary server, run admsnap to deactivate the clone.

        admsnap clone_deactivate -o h:

    15. On the production server, run admsnap to deactivate the source.

        admsnap clone_deactivate -o g:

    16. Reverse synchronize to copy the data written to the clone back to the source.

        naviseccli -User GlobalAdmin -Password password -Scope 0 -Address ss_spa snapview -reversesyncclone -name lun20_clone -cloneid 0100000000000000 -o

        To use this command with navicli.jar, replace naviseccli with java -jar navicli.jar.

    17. On the production server, run admsnap to activate the source.

        admsnap clone_activate

        Wait for the reverse-sync operation to complete and the clone to transition to the synchronized state.

    18. Fracture the clone again to make the source independent.

        naviseccli -User GlobalAdmin -Password password -Scope 0 -Address ss_spa snapview -fractureclone -name lun20_clone -cloneid 0100000000000000 -o

        To use this command with navicli.jar, replace naviseccli with java -jar navicli.jar.

    19. On the production server, verify that the source LUN (g:) contains the files that were written to the clone on the secondary server. It should no longer contain the files that were deleted from the clone.

    20. On the production server, use admsnap to deactivate the source.

        admsnap clone_deactivate -o g:

    21. Clean up the storage system by removing and destroying the clone group.

        naviseccli -User GlobalAdmin -Password password -Scope 0 -Address ss_spa snapview -removeclone -name lun20_clone -cloneid 0100000000000000 -o
        naviseccli -User GlobalAdmin -Password password -Scope 0 -Address ss_spa snapview -destroyclonegroup -name lun20_clone -o

        To use these commands with navicli.jar, replace naviseccli with java -jar navicli.jar.
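    To confirm that the clone group has been removed, or earlier in the procedure to watch the synchronization and fracture state of the clone, the clone configuration can be listed from the SP. The command below is a sketch only; -listclonegroup is assumed to be available at this FLARE revision, and its exact name and options should be verified in the Navisphere CLI Reference.

        rem List clone groups and their image states (assumed command; verify before use).
        naviseccli -User GlobalAdmin -Password password -Scope 0 -Address ss_spa snapview -listclonegroup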


    Reverse synchronization - all platforms

    The following example illustrates the admsnap and Navisphere CLI commands required to reverse synchronize a fractured clone.

    1. From the production server, stop I/O to the source LUN.

    2. Using admsnap, do the following:
       a. From the production server, deactivate the source LUN by issuing the appropriate command for your operating system.
          On a Windows server, use the following admsnap command:
             admsnap clone_deactivate -o source-drive-letter
          On a UNIX server, unmount the file system by issuing the umount command. If you cannot unmount the file system, use the sync command to flush buffers. Although the sync command is not a substitute for unmounting the file system, you can use it to reduce the number of times you need to issue the fsck command on the secondary server's file system. Refer to your system's man pages for sync command usage.
          On a NetWare server, use the dismount command on the volume to dismount the file system.
       b. If the clone is mounted on a secondary server, flush all cached data to the clone LUN by issuing the appropriate command for your operating system.
          For a Windows server, use the admsnap flush command.
          For Solaris, HP-UX, AIX, and Linux servers, unmount the file system by issuing the umount command. If you are unable to unmount the file system, issue the admsnap flush command. The flush command flushes all data and clears all buffers.
          For an IRIX server, the admsnap flush command is not supported. Unmount the file system by issuing the umount command. If you cannot unmount the file system, use the sync command to flush cached data. The sync command reduces the number of times you need to issue the fsck command on the secondary server's file system. Refer to your system's man pages for sync command usage.
          On a Novell NetWare server, use the dismount command on the volume to dismount the file system.
       c. Neither the flush command nor the sync command is a substitute for unmounting the file system; both commands only complement unmounting the file system. With some operating systems, additional steps may be required on the secondary server in order to flush all data and clear all buffers. For more information, see the product release notes.

    3. Using Navisphere CLI, issue the following command from the SP that owns the source LUN:

        snapview -reversesyncclone -name name|-clonegroupUid uid -cloneid id [-UseProtectedRestore 0|1]

        Before you can use the protected restore feature, you must globally enable it by issuing the snapview -changeclonefeature [-AllowProtectedRestore 1] command.

        Important: When the reverse synchronization begins, the software automatically fractures all clones in the clone group. Depending on whether or not you enabled the Protected Restore feature,