Processes undertaken in large data migration projects
Simon McCabe, Chief Technology Officer, IMGS
Discussion Context
High level case study of processes undertaken in large electricity and water data migrations run by IMGS over the last 5 years
High level data volumes:
Project 1:
60+ primary features
100+ secondary features
Approx. 28 million records
Project 2:
40+ primary features
Approx. 1.2 million records
Discussion Context
Topics covered:
Understanding the legacy system and its data
Designing a migration process
Tools to use (FME…)
Reporting and Reconciliation processes
Data Cleansing
Cutover/Deployment considerations
How FME played its part
Presentation will cover the full migration lifecycle
Analysis → Design → Build → Test → Deploy
Step 1: Understanding legacy system
Analysis Phase
Step 1: Understanding legacy system
Forensic review of current data sets (what needs migration)
Review/define all source features (objects)
What is their makeup?
database tables, CSV, customised file formats, etc..
what are the field characteristics (definition, sparsely populated etc.)
how do we link graphics to attributes?
during migration, will a feature's attributes or graphical data distinguish the feature (or a combination of both)?
What are they related to in the system?
other features/objects
how are they related to each other
should we create relationships during migration
Collate statistics
Before you leave the analysis phase, audit the data in terms of volume.
Step 1: Understanding legacy system
Methodology: Workshops and Documentation!
Data Audit
What data is used by the legacy system and the users in the customer's department?
External Interfaces?
What legacy system data is utilised by others (outside the customer's department)?
Step 1: Understanding legacy system
• FME assists in the data audit (volumetrics/statistics)
• FME helps you discuss the data in terms of what is in the system vs what the user thinks is in there
• Use FME to migrate (a simple migration) graphical data to staging tables
• Migrate style information, level information, size information (widths and heights) etc. – building up a tabular picture of the data
• Utilize Microsoft Excel/Word to provide a human-readable view of the above to allow dialogue/discussion
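As an illustration of building that tabular picture, here is a minimal sketch (not the project's actual scripts) that summarises a staging table of graphical metadata into an Excel workbook with pandas. GRAPHICS_STAGING and its columns are hypothetical names, and SQLite stands in for the real staging database.

```python
# Minimal sketch: summarise a graphics staging table into an Excel workbook
# for discussion with users. GRAPHICS_STAGING and its columns (FEATURE_NAME,
# STYLE, LEVEL_NO, WIDTH, HEIGHT) are hypothetical names, not from the project.
import sqlite3          # stand-in for the real staging database connection
import pandas as pd     # pandas + openpyxl produce the human-readable report

conn = sqlite3.connect("staging.db")

df = pd.read_sql_query(
    "SELECT FEATURE_NAME, STYLE, LEVEL_NO, WIDTH, HEIGHT FROM GRAPHICS_STAGING",
    conn,
)

# One row per feature/style/level combination with a record count,
# building up the tabular picture of what is actually in the system.
summary = (
    df.groupby(["FEATURE_NAME", "STYLE", "LEVEL_NO"])
      .size()
      .reset_index(name="RECORD_COUNT")
)

with pd.ExcelWriter("data_audit_summary.xlsx") as writer:
    summary.to_excel(writer, sheet_name="Style and level counts", index=False)
    df.describe(include="all").to_excel(writer, sheet_name="Field statistics")

conn.close()
```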
Step 2: Design the migration process
Design Phase
Step 2: Design the migration process
What data is missing or unusable? E.g. do all records have a geometry?
What data is stored in a non-standard format? E.g. are dates in a column all recorded in the same format?
What data values give conflicting information?
What data is missing important relationship linkages?
What data is incorrect or out of date?
What data records are duplicated?
Common Design considerations [Data Quality]
Step 2: Design the migration process
If geometries are missing – should they be fixed in the source system or via migration (can depend on time and source system processes)
If geometries exist with no attributes, should they be migrated or abandoned?
How do we report on inconsistencies in data, e.g. date fields might have a combination of DD-MM-YY or YY-MM-DD formats etc.? (a date-format check sketch follows below)
Common Design considerations [Data Cleansing]
If fixed during migration, will this affect the source system?
External Interfaces (systems get out of sync…)
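A minimal sketch of the kind of date-format check referred to above; the accepted formats and sample values are illustrative only, not taken from the projects.

```python
# Minimal sketch: flag date values that do not match an agreed format so they
# can be reported rather than silently migrated. Formats and sample values
# are illustrative.
from datetime import datetime

ACCEPTED_FORMATS = ["%d-%m-%y", "%y-%m-%d"]   # DD-MM-YY and YY-MM-DD

def classify_date(value):
    """Return the first format a value parses with, or None if it matches none."""
    for fmt in ACCEPTED_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return fmt
        except ValueError:
            continue
    return None

issues = []
for record_id, raw_date in [("100", "12-05-21"), ("101", "2021/05/12")]:
    if classify_date(raw_date) is None:
        # Issue event: record is flagged for review rather than silently migrated.
        issues.append((record_id, raw_date, "unrecognised date format"))

print(issues)   # feeds the periodic issue report
```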
Step 2: Design the migration process
How you address/fix the points in the last two slides can depend on the overall team size.
Fix in the source system
Fix during migration
Don’t fix and migrate as is (if possible)
Or abandon the migration of the problematic feature instance
A key requirement of the migration, however, will be to report on all issues and allow dialogue on their resolution.
Common Design considerations [Team Size]
Step 2: Design the migration process
Walk before you run
Design/sketch migration workflows before starting the migration process (Microsoft Visio etc.)
Ensure all logical scenarios are covered
Document what is an abandon event vs what is an issue event
Abandon event = feature instance is not migrated
Issue event = an issue was encountered and needs to be reviewed (missing geometry, invalid date format etc.)
Design logging processes for developers to follow (legacy system references, event type and error description) – a logging sketch follows below
Common Design considerations [Workflows]
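A minimal sketch of such a logging convention, assuming an ERROR_LOG table like the one described on the next slide; the table layout, column names and helper function are illustrative, and SQLite stands in for the real migration database.

```python
# Minimal sketch of the developer logging convention: every abandon or issue
# event records the legacy system reference, an event type and a description.
# ERROR_LOG and its columns are illustrative; in practice the table lives in
# the migration database and may be written via an FME writer instead.
import sqlite3
from datetime import datetime

conn = sqlite3.connect("migration.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS ERROR_LOG (
           logged_at     TEXT,
           legacy_ref    TEXT,
           event_type    TEXT,   -- 'ABANDON' or 'ISSUE'
           description   TEXT
       )"""
)

def log_event(legacy_ref, event_type, description):
    """Record an abandon or issue event against its legacy system reference."""
    conn.execute(
        "INSERT INTO ERROR_LOG VALUES (?, ?, ?, ?)",
        (datetime.now().isoformat(), legacy_ref, event_type, description),
    )
    conn.commit()

# Abandon event: the feature instance is not migrated.
log_event("PIPE-000123", "ABANDON", "no geometry and no attributes")
# Issue event: migrated, but flagged for review.
log_event("VALVE-000456", "ISSUE", "installation date in unrecognised format")
```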
Step 2: Design the migration process
Start-up and shut-down scripts (TCL or Python)
Log start time, end time, features read/written and run status to the MIGRATION_LOG table (a shutdown-script sketch follows the diagram below)
Assists in flagging long-running scripts (and thus targeted performance updates)
Write errors/issues to the ERROR_LOG table
Excel reports are then designed to read the ERROR_LOG table and provide periodic reports
FME transformers to validate geometries, dates etc.
OGC Validator
Geometry Validator
Duplicate Remover (if required)
Periodic reviews of reports feed back into FME processes and validation/logging requirements
Common Design considerations [Logging]
[Diagram: logging design – a Python/TCL start-up script, validation transformers (OGC Validator, GeometryValidator, Duplicate Remover), and the MIGRATION_LOG and ERROR_LOG tables]
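A minimal sketch of a Python shutdown script along these lines, assuming the MIGRATION_LOG table above and using SQLite as a stand-in for the real migration database. The fme module attributes used here (status, elapsedTime, featuresRead, featuresWritten, macroValues) are those documented for FME shutdown scripts; verify them against the FME version in use.

```python
# Minimal sketch of a Python shutdown script for an FME workspace. Values such
# as fme.status, fme.elapsedTime, fme.featuresRead and fme.featuresWritten are
# populated by FME when the script runs as a shutdown script.
import sqlite3
import fme   # only available when executed by FME

conn = sqlite3.connect("migration.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS MIGRATION_LOG (
           workspace        TEXT,
           elapsed_seconds  REAL,
           features_read    INTEGER,
           features_written INTEGER,
           run_status       TEXT
       )"""
)

conn.execute(
    "INSERT INTO MIGRATION_LOG VALUES (?, ?, ?, ?, ?)",
    (
        fme.macroValues.get("FME_MF_NAME", "unknown"),   # workspace name
        fme.elapsedTime,                                 # run time in seconds
        sum(fme.featuresRead.values()),                  # total features read
        sum(fme.featuresWritten.values()),               # total features written
        "SUCCESS" if fme.status else "FAILED",
    ),
)
conn.commit()
conn.close()
```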
Step 2: Design the migration process
It is impossible to manually test all feature instances, all attributes, all relationships etc. after they migrate to the destination system
If possible (preferable and advised):
Have a separate team (separate from the migration team) develop reconciliation reports
Feature Reconciliation: list source features and their counts alongside the migrated destination features and their counts; report and explain any discrepancies
Attribute Reconciliation: per source feature, list the source attributes alongside their destination attributes – all counts must be the same
Automate, Automate, Automate….
Reconciliation reports should easily identify/flag all issues and provide input into further dialogue and issue management.
Common Design considerations [Reconciliation]
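A minimal sketch of an automated feature reconciliation check; in practice the counts would be read from the source system and the destination database, and the feature names and figures below are placeholders.

```python
# Minimal sketch: compare source and destination counts per feature and flag
# every discrepancy for reporting and dialogue. Counts shown are placeholders.
source_counts = {"PIPE": 1_204_331, "VALVE": 88_410, "METER": 910_552}
dest_counts   = {"PIPE": 1_204_331, "VALVE": 88_395, "METER": 910_552}

discrepancies = []
for feature in sorted(set(source_counts) | set(dest_counts)):
    src = source_counts.get(feature, 0)
    dst = dest_counts.get(feature, 0)
    if src != dst:
        # Every discrepancy must be reported and explained (e.g. abandon events).
        discrepancies.append((feature, src, dst, dst - src))

for feature, src, dst, diff in discrepancies:
    print(f"{feature}: source={src} destination={dst} difference={diff}")
```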
Step 2: Design the migration process
Reconciliation Reports: flag long-running scripts, report on issues
Feature Reconciliation: analyse and report on source and destination feature counts side by side
Attribute Reconciliation: analyse and report on source and destination attribute counts side by side
Reconciliation Reports form an integral part of quality control, assurance and testing
Step 2: Design the migration process
• Decide on FME Desktop Vs FME Server
• Prototype designs where applicable
• Design the process to be run with minimal intervention, i.e. one play button to go from start to finish
• Ensure logging and reporting are designed into the solution and embedded within the FME scripts as they are built
Step 3 & 4: Build and Test
Build & Test Phase
Step 3 & 4: Build and Test
• FME is very powerful, and provides many ways to migrate data – however developers need to regularly review/critique their workflows
• Items to keep an eye on:
• Excessive use of SQLExecutor (can add hours to a migration script) – in previous projects, reworking examples of this reduced script run times by 50%
• Review all database calls, i.e. any necessary SQLExecutor calls or queries used to read data – ensure indexes are applied to datasets (FME can only read/link the data if it is set up correctly in the database)
• If you have multiple destination (writer) connections – the first will write almost immediately, the others will be cached (and slow the migration down)
• Try to have only one writer per workspace
• In previous projects this has reduced script run times by 10 hours
• Use the power of the database
• Pre process data where possible
• Cache sequences (hours can be saved here)
Step 3 & 4: Build and Test
• Prototype where possible
• Design workflows provide the logical interpretation of the migration requirement BUT should not be used as the blueprint for the FME workspace design
• Use the principles of the workflow
• Might be faster to write database functions/procedures to pre-process some of the data
• Utilizing the power of the database
• Removes overly complex custom transformers from FME
• Simplifying the workspace
• Reducing overall time to process the data
• An example of the above can be seen with inexperienced developers or newcomers to FME
• One script in the past read a large dataset (1.5 million records); it had 2 or 3 SQLExecutors in the workspace, which made approximately 4.6 million database calls during the migration (of one feature!)
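A minimal sketch of the alternative pattern: load the lookup data once and resolve values inside a PythonCaller, rather than issuing a SQLExecutor call per feature. The table, attribute names and SQLite connection are illustrative; the class follows the standard PythonCaller interface (input/close with self.pyoutput) and only runs inside FME.

```python
# Minimal sketch: one database call for the whole run instead of one per
# feature. MATERIALS, material_code and material_description are illustrative
# names; sqlite3 stands in for the real source database.
import sqlite3


class MaterialLookup(object):
    def __init__(self):
        # Load the lookup table once, when the translation starts.
        conn = sqlite3.connect("source_system.db")
        rows = conn.execute("SELECT material_code, description FROM MATERIALS")
        self.materials = {code: desc for code, desc in rows}
        conn.close()

    def input(self, feature):
        # Per-feature work is now a dictionary lookup, not a database query.
        code = feature.getAttribute("material_code")
        feature.setAttribute(
            "material_description",
            self.materials.get(code, "UNKNOWN"),
        )
        self.pyoutput(feature)

    def close(self):
        pass
```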
Step 3 & 4: Build and Test
• Use the reconciliation process to quality stamp/approve and test your builds.
• At the start of the build, expect data quality issues (bugs and also source data cleansing activities)
• By the end of build and test, data quality issues should disappear (bug fixing, source system remedies)
• Report analysis is critical moving from build to test to cutover.
• Issues should reduce
• All team members should be aware of open issues
Step 3 & 4: Build and Test
• The reconciliation report timings will feed into the cutover/deployment window planning
• Should be reviewed regularly
• Additional reading on the subject of performance improvement could include:
• FME Performance and Profiling
• Turbocharging FME: How to Improve the Performance of Your FME Workspaces
• Performance Tuning FME
Step 3 & 4: Build and Test
Useful transformers used in the migration process
• WorkspaceRunner
• SchemaMapper
• AttributeCreator
• Tester
• TestFilter
• Aggregator
• SQLCreator
• SQLExecutor
• Counter
• GeometryRemover
• NullAttributeMapper
• Sampler
• TimeStamper
• StatisticsCalculator
• PointOnAreaOverlayer
• VertexCreator
• FeatureMerger
Step 5: Deploy
Deployment Phase
Step 5: Deploy
• At this stage everything is fully tested and OK to deploy.
• Cutover window
• Timings of scripts can vary from the development server to the test server to the production server (depending on the specs of those machines)
• Run dress rehearsals into the final destination environment before cutover to get a view of expected timings.
• Once the final migration has run:
• Review the feature reconciliation reports from this migration against the last – all should be the same (or better)
• Review the attribute reconciliation reports from this migration against the last – all should be the same (or better)
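A minimal sketch of comparing the final run against the previous dress rehearsal using the MIGRATION_LOG table from the logging design; column names follow the earlier sketch and are assumptions.

```python
# Minimal sketch: compare the final migration run against the previous dress
# rehearsal so any change in feature counts or run time is visible before
# sign-off. Column names follow the earlier logging sketch; SQLite stand-in.
import sqlite3

conn = sqlite3.connect("migration.db")
rows = conn.execute(
    """SELECT workspace, features_written, elapsed_seconds
         FROM MIGRATION_LOG
        ORDER BY rowid"""
).fetchall()
conn.close()

# Group runs by workspace; the last two entries per workspace are taken as the
# rehearsal and the final run.
history = {}
for workspace, written, elapsed in rows:
    history.setdefault(workspace, []).append((written, elapsed))

for workspace, runs in history.items():
    if len(runs) < 2:
        continue
    (prev_written, prev_elapsed), (last_written, last_elapsed) = runs[-2:]
    if last_written < prev_written:
        print(f"{workspace}: wrote fewer features than the rehearsal "
              f"({last_written} vs {prev_written}) - investigate")
    if last_elapsed > prev_elapsed * 1.2:
        print(f"{workspace}: ran noticeably slower than the rehearsal "
              f"({last_elapsed:.0f}s vs {prev_elapsed:.0f}s)")
```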