Presented By: Dave Larson
Data Analysis Structured vs. Unstructured Data
Texas Digital Government Summit
Speaker Bio
Dave Larson – Solutions Architect with Freeit Data Solutions
• In the IT industry for over 20 years
• Specializing in Data and Storage Technologies
• Worked with IT Manager, SAN technology, ERP Applications, Database Admin, UNIX Admin, Enterprise Architecture, Data Warehousing
Data & Information
What is Data? Raw, unorganized facts that need to be processed.
What is Information? Processed, organized, structured data that is useful.
Data is plain facts; once processed, organized, structured, or presented, it becomes useful information.
Facts about Data
• Data is growing at an incredible rate
• Gartner and IDC state that data is doubling every 18 months
• Current estimate is that there is over 4 zettabytes of data in the world
• If the trend continues, by 2020 data will be over 40 zettabytes
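The doubling claim can be turned into a quick back-of-the-envelope calculation. The sketch below assumes smooth exponential growth with the 18-month doubling period quoted above; the function names are illustrative, and the 4 and 40 zettabyte figures are the ones from the slide.

```python
import math

def years_to_reach(current_zb: float, target_zb: float,
                   doubling_months: float = 18.0) -> float:
    """Years needed to grow from current_zb to target_zb at a fixed doubling period."""
    doublings = math.log2(target_zb / current_zb)
    return doublings * doubling_months / 12.0

def projected_size(current_zb: float, years: float,
                   doubling_months: float = 18.0) -> float:
    """Size after the given number of years, assuming the same doubling period."""
    return current_zb * 2 ** (years * 12.0 / doubling_months)

years = years_to_reach(4.0, 40.0)
print(f"4 ZB reaches 40 ZB after about {years:.1f} years")
```

Under these assumptions a tenfold increase takes a little over three doublings, or roughly five years.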
What is a Zettabyte?
• 1 zettabyte = 1 billion terabytes
• 1,000,000,000,000,000,000,000 bytes
• 4 zettabytes is equivalent to:
  – 2 Quintillion JPG images
  – 456 Billion hours of digitally recorded music
  – 1 Trillion HD Digital Movies
  – 166 Billion 32GB iPads
4 Zettabytes visualized
• 1 Billion 4TB Hard Drives
• 250 Billion DVDs stacked on top of one another would reach the moon 3 times
• All data printed on 8 x 10 paper and laid end to end is 210 Trillion Miles, or 35.8 light years
• All data printed would require 16.4 Trillion Trees
  – NASA estimates there are 400 Billion trees on Earth
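Comparisons like these are simple unit arithmetic. The sketch below assumes decimal (SI) units, where 1 TB is 10^12 bytes and 1 ZB is 10^21 bytes; the helper name is illustrative.

```python
# Decimal (SI) storage units, in bytes.
TB = 10**12
ZB = 10**21

def units_needed(total_bytes: int, unit_bytes: int) -> int:
    """Whole number of units required to hold total_bytes (rounds up)."""
    return -(-total_bytes // unit_bytes)

terabytes_per_zettabyte = ZB // TB
drives_for_4zb = units_needed(4 * ZB, 4 * TB)

print(f"1 ZB = {terabytes_per_zettabyte:,} TB")
print(f"4 ZB fits on {drives_for_4zb:,} drives of 4 TB each")
```

Counts of images, songs, or movies work the same way, but they depend on an assumed size per item, which the slides do not state.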
What is causing Data Explosion?
• Internet
  – Connecting everything to everyone
  – Billions of people to Billions of devices
  – Online Shopping (Amazon, Wal-Mart, eBay, Best Buy)
  – File Sharing (Dropbox, Google Drive, iCloud, SkyDrive)
• Social Media
  – Facebook
  – Google+
  – Twitter
  – YouTube
• Store everything, delete nothing, multiple copies of it all
Structured vs. Unstructured
Structured – information with a degree of organization that is readily searchable and quickly consolidated into facts. Examples: RDBMS, spreadsheet
Unstructured – information with a lack of structure that is time- and energy-consuming to search, find, and consolidate into facts. Examples: email, documents, images, reports
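The difference shows up as soon as you try to answer a question from each kind of data. The sketch below is a made-up illustration: the table, the sample emails, and the regular expression are all assumptions, not anything from the presentation. It answers "how much did Acme pay?" once from a database table and once from free-text emails.

```python
import re
import sqlite3

# Structured: rows with named columns can be queried directly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Acme", 120.0), ("Acme", 80.0), ("Globex", 45.5)],
)
structured_total = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE customer = ?", ("Acme",)
).fetchone()[0]

# Unstructured: the same facts buried in free text must be found and parsed.
emails = [
    "Hi, Acme just paid $120.00 for the March order.",
    "FYI Globex sent 45.50 dollars yesterday.",
    "Acme's second invoice ($80.00) cleared this morning.",
]
dollar_amount = re.compile(r"\$(\d+(?:\.\d+)?)")
unstructured_total = sum(
    float(match.group(1))
    for text in emails
    if "Acme" in text
    for match in dollar_amount.finditer(text)
)

print(structured_total, unstructured_total)
```

The query is one line and does not care how the data was entered. The text search only works because the two Acme emails happen to write amounts with a dollar sign; the Globex email uses a different phrasing that this pattern would miss.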
Expansion of data?
• Structured Data (databases)
  – Production DB, Test DB, Dev DB, Reporting DB
  – Multiple backups of data
  – Generations of DB backups
  – Replicated copies of DB
  – Every Production database has between 3 and 12 copies
• Unstructured Data (Files, media, images)
  – Desktop, Network share, email, mobile device, Cloud
  – Copies sent to other people
  – Backup copies
How to control data growth?
• Change data management policies
• Create data retention procedures
• Store data more efficiently
• Purge data that is no longer needed
• Back up data less often
• Archive data
• Develop more efficient backup policies
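A retention procedure usually boils down to a rule that maps the age of a piece of data to an action. The sketch below is one minimal way to express that; the one-year and seven-year thresholds are hypothetical placeholders, since real values come from an organization's own policy.

```python
from datetime import date, timedelta

# Hypothetical policy thresholds (assumptions, not from the presentation).
ARCHIVE_AFTER = timedelta(days=365)
PURGE_AFTER = timedelta(days=7 * 365)

def retention_action(last_modified: date, today: date) -> str:
    """Return 'keep', 'archive', or 'purge' based on the age of the data."""
    age = today - last_modified
    if age >= PURGE_AFTER:
        return "purge"
    if age >= ARCHIVE_AFTER:
        return "archive"
    return "keep"

today = date(2024, 6, 1)
for modified in (date(2024, 1, 1), date(2022, 1, 1), date(2015, 1, 1)):
    print(modified, retention_action(modified, today))
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the rule easy to test and audit.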
Analyzing Structured Data (RDBMS)
• Challenges
  – DB growth impacts data analysis
  – Too much data to analyze
  – Analyze only relevant data (current)
• Improvements
  – Purge data that is no longer relevant
  – Historical data should be summarized
  – Compress data to store less on disk
  – Improve DB performance with caching technologies and flash storage
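Summarizing and purging often go together: roll the old detail rows up into totals, then delete the detail. The sketch below shows the idea with an in-memory SQLite database; the table names, sample rows, and cutoff date are all illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sold_on TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2012-01-05", 10.0), ("2012-01-20", 15.0),
     ("2012-02-03", 7.5), ("2014-06-01", 30.0)],
)
conn.execute(
    "CREATE TABLE sales_monthly (month TEXT PRIMARY KEY, total REAL, row_count INTEGER)"
)

CUTOFF = "2014-01-01"  # hypothetical boundary between historical and current data

# Summarize and purge in one transaction so no rows are lost in between.
with conn:
    conn.execute(
        "INSERT INTO sales_monthly "
        "SELECT substr(sold_on, 1, 7), SUM(amount), COUNT(*) "
        "FROM sales WHERE sold_on < ? GROUP BY substr(sold_on, 1, 7)",
        (CUTOFF,),
    )
    conn.execute("DELETE FROM sales WHERE sold_on < ?", (CUTOFF,))

summary = conn.execute(
    "SELECT month, total, row_count FROM sales_monthly ORDER BY month"
).fetchall()
remaining = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(summary, remaining)
```

Analysis of current activity then runs against the small `sales` table, while trend reports read the compact monthly summary.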
Improved Analysis of Structured Data
• Normalize databases to minimize redundancy & dependency
• Divide large tables into smaller tables
• Partition data
• Move data into third normal form (3NF), generally used in a data warehouse
• Utilize and leverage Business Intelligence applications on normalized data
• Remove source data once normalized
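As a small illustration of dividing a large table into smaller ones, the sketch below splits a flat orders table, where customer details repeat on every row, into separate customers and orders tables, then drops the source. The schema and sample data are invented for the example, and it assumes customer names are unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Denormalized source: customer name and city repeat on every order row.
conn.execute(
    "CREATE TABLE orders_flat "
    "(order_id INTEGER, customer_name TEXT, customer_city TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders_flat VALUES (?, ?, ?, ?)",
    [(1, "Acme", "Austin", 120.0),
     (2, "Acme", "Austin", 80.0),
     (3, "Globex", "Dallas", 45.5)],
)

# Normalized targets: each customer fact is stored exactly once.
conn.execute(
    "CREATE TABLE customers "
    "(customer_id INTEGER PRIMARY KEY, name TEXT UNIQUE, city TEXT)"
)
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(customer_id), amount REAL)"
)

with conn:
    conn.execute(
        "INSERT INTO customers (name, city) "
        "SELECT DISTINCT customer_name, customer_city FROM orders_flat"
    )
    conn.execute(
        "INSERT INTO orders "
        "SELECT f.order_id, c.customer_id, f.amount "
        "FROM orders_flat f JOIN customers c ON c.name = f.customer_name"
    )
    conn.execute("DROP TABLE orders_flat")  # remove source data once normalized

customer_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
acme_total = conn.execute(
    "SELECT SUM(o.amount) FROM orders o "
    "JOIN customers c ON c.customer_id = o.customer_id WHERE c.name = 'Acme'"
).fetchone()[0]
print(customer_count, acme_total)
```

After the split, a change to a customer's city touches one row instead of every order that customer ever placed.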
Trends in Structured Data
• Structured data is getting too big for traditional RDBMS, requiring BIG DATA solutions
• Big Data is handled with applications like Hadoop
• Big Data is leveraging new technologies such as
  – MongoDB
  – CouchDB
  – Oracle NoSQL Database
  – Apache Cassandra
• New systems are sometimes referred to as document-oriented database systems or distributed key-value databases
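To give a feel for what "distributed key-value" means, here is a toy in-memory sketch that spreads documents across several "nodes" by hashing the key. It is only an illustration of the general idea: the class and method names are made up, and it does not reflect how MongoDB, CouchDB, Oracle NoSQL Database, or Cassandra are actually implemented.

```python
import hashlib
from typing import Any, Dict, List, Optional

class ShardedKeyValueStore:
    """Toy key-value store that spreads keys across in-memory 'nodes'."""

    def __init__(self, node_count: int) -> None:
        # Each dict stands in for a separate server.
        self.nodes: List[Dict[str, Any]] = [{} for _ in range(node_count)]

    def _node_for(self, key: str) -> Dict[str, Any]:
        # A stable hash of the key decides which node owns it.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return self.nodes[index]

    def put(self, key: str, document: Any) -> None:
        self._node_for(key)[key] = document

    def get(self, key: str) -> Optional[Any]:
        return self._node_for(key).get(key)

store = ShardedKeyValueStore(node_count=3)
store.put("user:1", {"name": "Ada", "tags": ["admin"]})
store.put("user:2", {"name": "Grace"})
print(store.get("user:1"))
print([len(node) for node in store.nodes])
```

Note that the two documents have different fields. There is no fixed table schema, which is the "flat schema" trade-off these systems make in exchange for easy distribution.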
What is Big Data?
Traditional Data | BIG DATA
Gigabytes to Terabytes | Petabytes to Exabytes
Centralized | Distributed
Structured | Semi-Structured and Unstructured
Stable data model | Flat schemas
Known complex interrelationships | Few complex interrelationships
• Real-time – transactional, online, low latency data
• Analytical – aggregated data from real-time feeds or other sources
• Search – supporting data, both external and internal, used for locating desired information and/or objects
Technology for Structured Data
• SSD / Flash Technology
  – All-Flash arrays
  – Hybrid Storage arrays
• SSD / Flash is getting cheaper, more reliable, and larger in capacity
• Incredible performance: 10s to 100s of thousands of IOPS
• Inline Compression and/or Deduplication
  – Store more data in less space
• Snapshots = reduced RTO/RPOs and less
• Cloning = less data consumed for Development and Test
• Energy efficient
  – SSD uses less than ¼ the power of hard drives
  – SSD requires less cooling
• Hard Drives: how much longer until we remember them as fondly as floppy drives, dot-matrix printers, Betamax and 8-track?
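The "store more data in less space" point about compression is easy to try out. The sketch below uses Python's standard `zlib` module on some made-up, highly repetitive log-style data; the sample record is an assumption, and real-world ratios depend heavily on the data and on the array's own algorithm.

```python
import zlib

def compression_ratio(data: bytes, level: int = 6) -> float:
    """Original size divided by compressed size (higher means better savings)."""
    compressed = zlib.compress(data, level)
    return len(data) / len(compressed)

# Made-up sample: the same log-style record repeated many times.
repetitive = b"2014-01-01,OK,sensor-17,21.5\n" * 10_000

ratio = compression_ratio(repetitive)
print(f"{len(repetitive):,} bytes compress by a factor of about {ratio:.0f}")

# Compression here is lossless: the original bytes come back unchanged.
restored = zlib.decompress(zlib.compress(repetitive))
```

Already-compressed content such as JPEG images or video gains very little from a second pass, which is one reason ratios vary so much between workloads.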
Unstructured Data
• Challenges
  – How do you store Billions of Files?
  – How do you store 100s of TBs or PBs of data?
  – How long does it take to migrate 100s of TBs of data every 3-5 years?
  – No structure to data
  – Legacy File System approach to file organization
  – Resource limitations
  – Data has lots of duplication
  – How do you find data that isn't organized or searchable?
  – Lack of retention policies adds to massive data explosion
  – Data is getting too big to back up
How do you back up PBs of unstructured data?
Unstructured Data
• Current Improvements
  – External search engines (MS Enterprise Search or Google Search Appliance)
  – Archive data into cheaper solutions
  – Back up data less frequently
  – Implement deduplication technologies
  – Purge data using retention policies
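Deduplication, one of the improvements listed above, can be sketched with content hashing: identical content is stored once and every file name simply points at it. This is a simplified whole-file illustration with made-up class and file names; commercial products typically deduplicate at the block or chunk level.

```python
import hashlib
from typing import Dict

class DedupStore:
    """Stores each distinct piece of content once, keyed by its SHA-256 digest."""

    def __init__(self) -> None:
        self.blocks: Dict[str, bytes] = {}  # digest -> content
        self.files: Dict[str, str] = {}     # file name -> digest

    def put(self, name: str, content: bytes) -> None:
        digest = hashlib.sha256(content).hexdigest()
        self.blocks.setdefault(digest, content)  # only the first copy is kept
        self.files[name] = digest

    def get(self, name: str) -> bytes:
        return self.blocks[self.files[name]]

    def logical_bytes(self) -> int:
        """Size as users see it: every file counted in full."""
        return sum(len(self.blocks[d]) for d in self.files.values())

    def physical_bytes(self) -> int:
        """Size actually stored: each distinct content counted once."""
        return sum(len(b) for b in self.blocks.values())

store = DedupStore()
report = b"Quarterly report " * 100
store.put("alice/report.doc", report)
store.put("bob/report-copy.doc", report)  # same content, emailed to a colleague
store.put("bob/notes.txt", b"meeting notes")
print(store.logical_bytes(), store.physical_bytes())
```

The gap between logical and physical bytes is the saving, and it grows with every extra copy that gets emailed around or backed up.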
Trends in Unstructured Data
• Object Storage
  – Treating files as Objects
  – Creating data describing unstructured data
    • Metadata – data about data
    • Creation date, owner, subject, retention period, importance, …
  – Leverage commodity hardware to create clusters to store data
  – Store replicas of objects for data protection
  – Store replicas between multiple sites for DR / BC
  – Store revisions of data
  – Retention can allow for automatic purging of old data
  – Back up data less frequently, if at all
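The role of metadata is easier to see in a small model. The sketch below is a toy in-memory object store, not any vendor's API: every class, field, and method name is hypothetical. Each object carries the kind of metadata listed above, which makes it searchable and lets a retention period drive automatic purging.

```python
import uuid
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List

@dataclass
class StoredObject:
    data: bytes
    owner: str
    subject: str
    created: date
    retention: timedelta  # how long the object must be kept

class ObjectStore:
    """Toy object store: a flat namespace of IDs, each with data plus metadata."""

    def __init__(self) -> None:
        self._objects: Dict[str, StoredObject] = {}

    def put(self, obj: StoredObject) -> str:
        object_id = uuid.uuid4().hex  # no folders, just a unique ID
        self._objects[object_id] = obj
        return object_id

    def find(self, **criteria: object) -> List[str]:
        """Return IDs of objects whose metadata matches every criterion."""
        return [
            oid for oid, obj in self._objects.items()
            if all(getattr(obj, key) == value for key, value in criteria.items())
        ]

    def purge_expired(self, today: date) -> int:
        """Delete objects whose retention period has ended; return how many."""
        expired = [
            oid for oid, obj in self._objects.items()
            if obj.created + obj.retention <= today
        ]
        for oid in expired:
            del self._objects[oid]
        return len(expired)

store = ObjectStore()
store.put(StoredObject(b"...", "dave", "budget", date(2010, 1, 1), timedelta(days=365)))
kept = store.put(StoredObject(b"...", "dave", "contract", date(2014, 1, 1), timedelta(days=3650)))
purged = store.purge_expired(today=date(2015, 1, 1))
print(purged, store.find(owner="dave"))
```

Replication across nodes and sites is left out here to keep the sketch short; the point is that the metadata, not a directory path, is what makes the data findable and manageable.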
Structure to Unstructured Data
• Object storage has data to describe the data
• Object storage is searchable
• Object storage is shareable
• Object storage can be stored once
• Object storage doesn't need to be migrated
• Object storage doesn't need to be backed up
What can you do?
• Data isn't going away; growth is inevitable
• Implement energy efficient storage that utilizes data reduction technology (compression & deduplication)
• Summarize data into useful information
• Implement ways to reduce data clutter
• Implement more efficient methods of storing data
• Bring structure to unstructured data
• Archive and purge data over time