© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Chris & Greg Tinker – HP Master Technologist

Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

HP Master Technologists, Chris & Greg Tinker, presentation deck from HP Discover 2012 Las Vegas “Lassoing Big data”

Page 1: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Chris & Greg Tinker – HP Master Technologist

Page 2: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

BIG Data and IT Solutions

Lassoing Big data

Chris & Greg Tinker, HP Master Technologists
June 2012

Page 3: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Lassoing Big Data

Agenda

• Defining Big Data
• Challenges
• Solution design
• Scenarios
• Take away & closing statements

Page 4: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Defining Big Data

Page 5: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

“Big Data” originating with analytics – business intelligence (BI)

Defining

• Traversing enormous, diverse data types to spot patterns
  − 10s - 100s of terabytes (TB), petabytes (PB), and yes, even exabytes (EB)

• Businesses need faster, “real time” (seconds - minutes vs. hours to days) analytic results
  − Combining data from silos
  − Analyzing diverse data types
  − Connecting data from various business units (cross-analyze, access, & reference)

• Growing at an exponential rate
  − Structured data – data stored in databases
  − Unstructured – all other data including emails, social media, blogs, free-form feedback, documents, transactions, multimedia (images, videos, etc.)
  − 90% of enterprise information is unstructured

Page 6: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Big Data: growing at a massive scale

Defining

Today’s “Big Data” will not be considered the same in 5 years. By 2020, there will be 4 billion people online creating 50 trillion gigabytes of data*

Data and its management is not just a concern for IT departments.

• ~4 trillion SMSes a month (~4 PB per month worldwide)

• ~30 billion pieces of content shared on Facebook every month

• ~48 hours of video uploaded onto YouTube every minute

In sixty seconds:
• 1,820 TB of data is created; that’s enough data to fill up 2.6 million CDs**
• 1.1 million conversations take place via instant messenger**

*http://www.hpl.hp.com/research/intelligent_infrastructure.html

**http://www.go-gulf.com/60scs_v2.jpg

          Structured   Unstructured
Amount    10%          90%
Growth    22%          62%
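The sixty-second figure quoted on this slide (1,820 TB filling 2.6 million CDs) can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, assuming decimal units (1 TB = 10^12 bytes) and a 700 MB CD; neither assumption is stated on the slide, and the function name is illustrative.

```python
# Sanity check: how many CDs does a given volume of data fill?
# Assumptions (not from the slide): decimal units and a 700 MB CD.
TB = 10**12
MB = 10**6


def cds_needed(data_tb, cd_capacity_mb=700):
    """Return the number of CDs needed to hold data_tb terabytes."""
    return (data_tb * TB) / (cd_capacity_mb * MB)


if __name__ == "__main__":
    # Print the CD count for the 1,820 TB quoted on the slide.
    print(f"{cds_needed(1820):,.0f} CDs")
```

Under those assumptions the quoted data volume and CD count are consistent with each other.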

Page 7: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Big Data: Landscape

Defining

[Diagram: the Big Data landscape]

• Actionable intelligence
  − HPCC + software
  − Programming with more math and statistics
• Unstructured data (e.g. social media)
  − Benchmarks
  − Trends
• Silo data (e.g. business units)
  − Counts
  − Sums
• Structured data (e.g. databases)
  − Averages
  − Rates
• Cloud compute and storage platforms
• Big Data

Page 8: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Challenges

Page 9: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Scale

Challenges

Big Clouds – compute and storage platforms to transform data into actionable intelligence

• Fluctuating asset valuations
  − Convergence & virtualization
  − Identifying untapped resources – utilization factors
• Cross-access, cross-analyze, and cross-reference
  − Reconcile data silos
  − Massive data
• HPCC solutions
  − Hyperscale cluster solutions

Page 10: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Governance

Challenges

Compliance

• SOX
• Privacy directives – data access
  − US Federal (HIPAA, FCRA, GLBA, DPPA, DOT, etc.) -- don’t forget the state addenda
  − UK (DPA, …)
• Data retention and archives
• Purging of data after expiration of legal retention
• Ability to prove compliance upon request and to prove data has not been manipulated, changed, or deleted
• Restrictions and permutations of data models
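One common way to show that retained records have not been manipulated is a hash chain, where each entry's digest covers the record and the previous digest. The sketch below is an illustration under that assumption, not a method named in the deck; `chain_records` and `verify_chain` are hypothetical helper names.

```python
import hashlib
import json

GENESIS = "0" * 64  # fixed starting value for the first link (an assumption)


def _digest(prev, record):
    """SHA-256 over the previous digest plus a canonical JSON form of the record."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev.encode("ascii") + payload).hexdigest()


def chain_records(records):
    """Return entries that each carry a digest linked to the previous entry."""
    entries = []
    prev = GENESIS
    for record in records:
        prev = _digest(prev, record)
        entries.append({"record": record, "digest": prev})
    return entries


def verify_chain(entries):
    """Recompute every digest. False means a record was altered, reordered,
    or dropped from the middle of the chain. Truncation at the end is only
    detectable if the final digest is also stored somewhere else."""
    prev = GENESIS
    for entry in entries:
        prev = _digest(prev, entry["record"])
        if entry["digest"] != prev:
            return False
    return True
```

Storing the final digest in a separate, write-once location is what lets an auditor confirm the whole chain on request.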

Page 11: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Architecture

Challenges

• Large, single name space file system(s)
  − Parallel access file system
• Clustered file systems
  − Proprietary cluster volumes
• Distance between data sources
• Protocol(s)
  − iSCSI, iFCP, FCP, …
  − CIFS, NFS, …
• Metadata management
• Backups

Page 12: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Analytics – software

Challenges

HP software
• HP Autonomy & Intelligent Data Operating Layer (IDOL) 10
  − Natural language processing
  − Unstructured
• HP Vertica
  − Structured

Other software (examples)
• Hadoop (both a file system and a map/reduce engine)
  − Hadoop map/reduce on HP IBRIX parallel single namespace file system
  − Data processing (no built-in natural-language processing)
• Apache, Cassandra, Cloudera, Lucene/Solr and many others
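To make the map/reduce idea concrete, here is a minimal single-process sketch of the pattern in plain Python. It does not use the Hadoop API; the function names are illustrative, and a real engine would distribute the map and reduce phases across many nodes.

```python
from collections import defaultdict


def map_phase(documents):
    """Emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run the three phases in sequence on a list of text documents."""
    return reduce_phase(shuffle(map_phase(documents)))
```

Because each document is mapped independently and each key is reduced independently, both phases can run in parallel close to where the data is stored.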

Page 13: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Solution designs

Page 14: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Capacity

Solution design

Big Data solutions must deliver optimal utilization of assets while remaining agile enough to support rapid scaling

• Hierarchical storage management methods
  − Recent data must be readily available for real-time analytics
  − Performance
  − Reliability
• Disaster recovery
• Archival / backup management
• Leverage open standards – prevent lock-in
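Hierarchical storage management boils down to a placement policy: recent data stays on fast media and older data migrates to cheaper tiers. The sketch below illustrates that idea only; the tier names and age thresholds are hypothetical and would be set by the workload in practice.

```python
from datetime import datetime, timedelta

# Hypothetical tiers and age thresholds, ordered from fastest to slowest.
TIERS = [
    (timedelta(days=7), "ssd"),    # recent data for real-time analytics
    (timedelta(days=90), "disk"),  # warm data
]
ARCHIVE_TIER = "archive"           # everything older than the last threshold


def pick_tier(last_access, now):
    """Return the tier name for data last accessed at last_access."""
    age = now - last_access
    for max_age, tier in TIERS:
        if age <= max_age:
            return tier
    return ARCHIVE_TIER
```

A migration job could call `pick_tier` for each file and move only those whose current tier differs from the result.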

Page 15: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Near real-time

Solution design

Historically, analytics were derived from archived or aged data; today’s analytics require cloud compute and storage platforms to achieve “nearly real-time” results

• Limit data movement
• Hyper-scale solutions: high performance compute clusters (HPCC)
• Cloud and virtualization
• Capacity scalability -- just-in-time scalability
• Parallel work streams

Get the analytics closer to the data…

Page 16: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Performance

Solution design

Need fast and scalable access to data

How tightly coupled the data is to the applications
• Application bottlenecks
  − Message passing interfaces, network stacks, IO subsystem, data layout
• Parallelism – aging applications which do not make use of threading
• Data set size, quantity of objects, access patterns
  − How random is random?
  − File system(s)
  − Storage subsystem
• Network
• Processing

Page 17: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Performance requirements influence scale constraints

Solution design

Service time ~1ms per I/O, throughput ~8,000MB/sec
Transactions ~600,000 per day/hour/minute? (end to end?)

Tempering IT solutions with business realities
• Determine speed at which consumption and indexing of data types needs to take place
• Close to real-time, seconds, minutes, hours, days
• Utilize enterprise solutions
  − Compress enormous volumes of data (via compression or de-duplication)
• Volume of data available encumbers analysis – SCOPE of data set
• Capture of data -- real-time/low latency
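The service time, throughput, and transaction figures on this slide interact: a fixed per-I/O service time caps what one stream can do, so the throughput target dictates how many I/Os must be in flight. A minimal sketch using Little's law follows; the 64 KB I/O size and the 1 MB = 1,024 KB convention are assumptions, not figures from the slide.

```python
SECONDS = {"day": 86_400, "hour": 3_600, "minute": 60}


def required_concurrency(throughput_mb_s, io_size_kb, service_time_ms):
    """Little's law: outstanding I/Os = I/Os per second x service time."""
    iops = throughput_mb_s * 1024 / io_size_kb
    return iops * (service_time_ms / 1000.0)


def per_second(transactions, window):
    """Convert a transaction count per day, hour, or minute to a per-second rate."""
    return transactions / SECONDS[window]


if __name__ == "__main__":
    # 8,000 MB/sec at ~1 ms per I/O, with an assumed 64 KB I/O size.
    print(required_concurrency(8000, 64, 1.0))
    # The same 600,000 transactions mean very different loads per window.
    for window in ("day", "hour", "minute"):
        print(window, per_second(600_000, window))
```

The spread between the per-day and per-minute rates is why the slide asks which window the requirement actually refers to.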

Page 18: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Scenarios

Page 19: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Power

Scenario 1

Production jobs had been executing for 4+ days when jobs began to fail… 1,000+ users light up the phone bank.

Though “near real-time” is a great sound bite, most complex analytics of large-scale research projects take hours if not days, during which data at rest is expected to remain at rest.
• 500TB
• 2,000,000 directories
• 60,000,000 files
• Single file system
• Storage subsystem experiences multiple component failures (PDU and UPS failure)

[Diagram: linear space representation of the file system]

Page 20: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Power – Big Data File system complications

Scenario 1

Extremely High Aggregate Performance from a Single Directory (and Single File)

[Diagram: Dir containing files F1, F2, F3 … Fn and a Subdir; segments 1, 2, 3, 4 … 100; segment servers S1, S2, S3 … Sn]

US Patent # 6,782,389

Page 21: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Power

Scenario 1

File system metadata corruption
• Part of disk subsystem failed while file system remained operational
• Application IO errors combined with file system and SCSI IO errors
• Disk subsystem was restored
• No offline file system check was performed to fix metadata

Solution/mitigation
• Production offline required to perform full check
• Restore individual files which were marked for deletion and placed in lost+found
• Replication

Page 22: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Corruption

Scenario 2

A case that recently reached our desk – corruption of a 700TB Oracle database

Challenge
• Production down
  − Database down (corruption... would not start)
  − Application pointed to disk subsystem
  − 32 node farm
  − 50,000 LUNPATHS (we are seeing systems in excess of 200,000 LUNPATHS)
• Restoration
  − Exactly what area is corrupt – data or temp/redo space?
• Why/How?

Page 23: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Corruption: layers

Scenario 2

[Diagram: Oracle storage layers, with and without ASM]

Tables
Tablespaces
Without ASM: Files; File Systems (VxFS, …, CFS); Logical Volumes, Volume Groups, Physical Volumes ((S)LVM, VxVM, CVM)
With ASM: Files and Disk Groups managed by ASM, displayable in Oracle views
011100000100…

Page 24: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Corruption: layers

Scenario 2

[Diagram: I/O stack layers]

User space
  Applications
  GNU C lib
Kernel space
  System Call Interface
  VFS (ext3, NTFS, VxFS, etc.)
  Buffer Cache / RAW
  LVM, VxVM, Oracle ASM
  MPIO – device mapper
  Blkdev: SCSI, IDE, etc.
  SCSI upper layer: SD, ST, SR, SG
  SCSI mid layer: glue
  SCSI lower layer: FC, iSCSI, SAS, etc.

Page 25: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

User corruption

Scenario 2

Persistent naming is achieved by taking the unique WWID reported by scsi_id -g -u -s and placing it into the multipaths {} section of the multipath.conf file

multipaths {
       multipath {
         wwid       360060e8005709a000000709a000000c4
         alias      Oracle_vote1
       }
}

Example:

#> multipath -ll
Oracle_vote1 (360060e8005709a000000709a000000c4) dm-11 HP,OPEN-V
[size=513M][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=0][active]
 \_ 0:0:0:4 sde        8:64  [active][ready]
 \_ 1:0:0:4 sdk        8:160 [active][ready]
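With many LUNs, writing these stanzas by hand invites typos, so the section is often generated from a list of WWID and alias pairs. The following is a minimal sketch of such a generator; `render_multipaths` is a hypothetical helper, and the output only mirrors the stanza layout shown on this slide.

```python
def render_multipaths(bindings):
    """Render a multipaths {} section from (wwid, alias) pairs."""
    lines = ["multipaths {"]
    for wwid, alias in bindings:
        lines += [
            "    multipath {",
            f"        wwid   {wwid}",
            f"        alias  {alias}",
            "    }",
        ]
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # The WWID and alias below are the ones shown on the slide.
    print(render_multipaths([
        ("360060e8005709a000000709a000000c4", "Oracle_vote1"),
    ]))
```

The generated text would still need to be reviewed and merged into multipath.conf by an administrator.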

Page 26: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Corruption – result of human error

Scenario 2

Administrators manually modified /var/lib/multipath/bindings. This file is a cache file created by multipath for persistent mapping of device files.
• /var was its own filesystem
• /var not mounted at boot time
• / filesystem had its own /var/ which was covered up later in bootstrap
• Identification
• Mitigation (establish definitions within multipath.conf)
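One way to identify this kind of problem is to compare the bindings copy hidden on the root filesystem with the copy on the mounted /var and flag aliases that point at different WWIDs. The sketch below assumes each non-comment line of a bindings file has the form "alias wwid"; the function names are hypothetical, and the file contents are passed in as strings so the comparison can be run offline.

```python
def parse_bindings(text):
    """Parse 'alias wwid' lines, skipping blank lines and '#' comments."""
    mapping = {}
    for line in text.splitlines():
        parts = line.strip().split()
        if len(parts) < 2 or parts[0].startswith("#"):
            continue
        alias, wwid = parts[0], parts[1]
        mapping[alias] = wwid
    return mapping


def conflicting_aliases(root_copy, mounted_copy):
    """Return {alias: (root_wwid, mounted_wwid)} for aliases that disagree."""
    root = parse_bindings(root_copy)
    mounted = parse_bindings(mounted_copy)
    return {
        alias: (root[alias], mounted[alias])
        for alias in root.keys() & mounted.keys()
        if root[alias] != mounted[alias]
    }
```

Any alias reported here could resolve to a different device depending on whether /var was mounted when multipath started, which is the failure described above.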

Page 27: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Scenario 3

10Gb High Availability Network connections not meeting expectations

Previous solution was a 1GE port channel environment on aging infrastructure

Upgraded servers to
• Single blade w/ 64 cores (4 x 16-core processors)
• 256GB memory
• 4 x 10Gb flex fabric interface ports
• FCoE & iSCSI

Application performance was expected to be nearly 8X faster

Performance

Page 28: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Performance: applications

Scenario 3

Aging applications are becoming bottlenecks

Few older applications make use of parallel work streams leveraging the overall bandwidth capacity of today’s servers
• IT infrastructure at the time of application design
• Home grown application scaled to unforeseen and unpredicted use
  − Production use throttled development initiatives
  − Closed source application vendor no longer exists

Page 29: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Performance: the stack

Scenario 3

[Diagram: the performance stack. Layers shown: Application Layer, Integration Layer, Data Layer, Infrastructure Layer. Labels shown: program design, process model, data access, structured/unstructured data, mapping, bus, message queues, OS, server, storage, clustering, networking, stability, scalability.]

Page 30: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Scenario 3

Throughput is achieved by optimizing parallel streams

TCP

# of streams  TCP RTT (ms)  KB/IO  MTU    TCP Segments/IO  SCSI RTT  KB/sec        Mbit/sec   Calculated MB/sec  RTT (s)
1             0.009         1.46   1,500  1                0.01      162,222.22    1,267.36   158                0.000009
2             0.009         1.46   1,500  1                0.01      324,444.44    2,534.72   317                0.000009
3             0.009         1.46   1,500  1                0.01      486,666.67    3,802.08   475                0.000009
4             0.009         1.46   1,500  1                0.01      648,888.89    5,069.44   634                0.000009
5             0.009         1.46   1,500  1                0.01      811,111.11    6,336.81   792                0.000009
6             0.009         1.46   1,500  1                0.01      973,333.33    7,604.17   951                0.000009
7             0.009         1.46   1,500  1                0.01      1,135,555.56  8,871.53   1,109              0.000009
8             0.009         1.46   1,500  1                0.01      1,297,777.78  10,138.89  1,267              0.000009

iSCSI

# of streams  TCP RTT (ms)  KB/IO  MTU    TCP Segments/IO  SCSI RTT  KB/sec        Mbit/sec   Calculated MB/sec  SCSI SVC (s)
1             0.100         8.00   1,500  6                0.60      13,333.33     104.17     13                 0.000600
2             0.100         8.00   1,500  6                0.60      26,666.67     208.33     26                 0.000600
3             0.100         8.00   1,500  6                0.60      40,000.00     312.50     39                 0.000600
4             0.100         8.00   1,500  6                0.60      53,333.33     416.67     52                 0.000600
5             0.100         8.00   1,500  6                0.60      66,666.67     520.83     65                 0.000600
6             0.100         8.00   1,500  6                0.60      80,000.00     625.00     78                 0.000600
7             0.100         8.00   1,500  6                0.60      93,333.33     729.17     91                 0.000600
8             0.100         8.00   1,500  6                0.60      106,666.67    833.33     104                0.000600
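The figures in the TCP and iSCSI tables appear to follow a simple model: each I/O takes (segments per I/O x TCP round-trip time), and every stream completes one I/O per service time. The sketch below reproduces the rows under that assumed model, with 1,024 KB per MB for the unit conversions; the function name is illustrative and the model is inferred from the numbers rather than stated in the deck.

```python
def stream_throughput(streams, kb_per_io, tcp_rtt_ms, segments_per_io):
    """Estimate aggregate throughput for synchronous, one-I/O-at-a-time streams."""
    # Time for one I/O: every segment costs one round trip.
    service_time_s = segments_per_io * tcp_rtt_ms / 1000.0
    # Each stream moves kb_per_io once per service time.
    kb_per_sec = streams * kb_per_io / service_time_s
    return {
        "service_time_s": service_time_s,
        "kb_per_sec": kb_per_sec,
        "mbit_per_sec": kb_per_sec * 8 / 1024,
        "mb_per_sec": kb_per_sec / 1024,
    }


if __name__ == "__main__":
    # Print both sets of rows for 1 to 8 streams.
    for label, kb, rtt, segs in (("TCP", 1.46, 0.009, 1), ("iSCSI", 8.0, 0.1, 6)):
        for n in range(1, 9):
            row = stream_throughput(n, kb, rtt, segs)
            print(f"{label} {n} {row['kb_per_sec']:,.2f} KB/sec "
                  f"{row['mbit_per_sec']:,.2f} Mbit/sec")
```

Under this model throughput scales linearly with the number of streams, which is why a single-threaded application cannot use the added bandwidth no matter how fast the link is.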

Performance

Page 31: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Performance: the solution

Scenario 3

Problem identified with the single-threaded nature of the application.

To achieve the desired performance, the latency between the data and the analytics had to be reduced, because application rework was not an option (source lost).

• Critical business application and data placed on local storage
  − Latency maintained at or below ~0.1 msec where data set size allowed for such latency
• HP PCI-based Smart Array battery-backed disk controllers with SSD disks
• Tiered storage model adopted
• Utilized capacity of local server resources for spatial locality of application to reduce network latency
  − More cores and memory allow for application and OS virtualization on the same physical machine
  − NOTE: a virtual switch allows network communication to not even leave the adapter when talking between guests on the same vswitch

Page 32: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Take away

Page 33: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Storage Design considerations

Take away

Storage tech: NAS | NAS, FC | FC, NAS

Work load
  NAS: Mixed work loads
  NAS, FC: Mixed random and high sequential throughput
  FC, NAS: Very high sequential bandwidth access to a single file

Scale
  NAS: Depends on change rate
  NAS, FC: Depends on change rate
  FC, NAS: Depends on change rate

File types
  NAS: Many (millions) smaller to medium sized files
  NAS, FC: Some large files (most <100 GByte) and some smaller files
  FC, NAS: Very large files (most over 100 GByte); structured databases (data warehouses)

Aggregate throughput requirement
  NAS: < 5 GBytes/sec
  NAS, FC: 5 to 10 GBytes/sec
  FC, NAS: 10s – 100s of GBytes/sec required

Protocols
  NAS: CIFS, NFS
  NAS, FC: CIFS, NFS, FTP, HTTP, WebDAV, iSCSI/block access, FC
  FC, NAS: FC, NAS w/ IB for low latency and throughput

Page 34: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Scalability Factors

Take away

• Next generation data centers
  − Power/heat
  − Scalable storage and compute power – cloud platforms
• Solution designs
  − Availability
  − Scalability
  − Recovery
  − Performance

Page 35: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Always on support from HP

Take away

Who does your IT staff call?
• Several levels
  − Foundation Care
  − Proactive Care
  − Datacenter Care
  − Lifecycle Event Services

Complex Solution Team
• Multi-vendor
• Multi-solution

Page 36: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists


Page 37: Lassoing Big data - Chris & Greg Tinker, HP Master Technologists

Thank you