WRG BigData Presentation DonMoody Nov2012

  • 8/13/2019 WRG BigData Presentation DonMoody Nov2012

    1/24

    "How 'Big' Do I Have To Be To Have 'Big Data' Issues?"

    Don C. Moody, J.D., M.S.

    Don C. Moody, 2012

    1.0: Overview/Agenda

    1.1: Short preso (15 mins!)

    Follow-up at lunch or offline

    1.2 'Set the table'

    Framing the Issue (Perspectives on BD)

    How 'big' is data getting? (fun exercise)

    A few BD tools/technologies

    Is BD a real biz issue or just hype?

    BD's current popularity wave

    Real legal concerns or just Chicken Little?

    1.3: Handouts

    2.0: Perspectives on Big Data

    2.1 This is old hat!! (See innumerable "beer and diapers" examples from the 1990s, even before Target & pregnancy)

    2.2 Framing the threshold question "Do I have to worry about Big Data legal concerns?" from two points of view:

    Small-to-Medium Enterprises (SMEs): (or lawyers representing them)

    Large Enterprises/Fortune 1000: "I know I'm a big company, I know data, and (if applicable) I know I'm in a heavily privacy-regulated area like health/HIPAA, financial/GLBA, education/FERPA, video history (VPPA), kids (COPPA), but when do I have to worry about separate BD legal issues?"

    BD legal issues typically center around privacy but also can include:

    False/deceptive/unfair business practices

    e-Discovery

    IP

    2.3 "Big data" is a relative term (& somewhat misleading) due to the 3 V's:

    Scalability of inexpensive technologies (volume)

    Availability of many unstructured sources (variety)

    Rapid proliferation (velocity)

    2.4 Better: "Lotta Data"

    Not limited to large enterprises or large individual file sizes (e.g. trillions of small text entries)

    2.5 Better Still: "Lotta Messy Data"

    Lack of structure a huge concern (e.g. 80% of data worldwide is unstructured)

    Imposing order on chaos (e.g. pattern recognition) is a key goal of 'big' data analytics

    2.6 Perspectives

    Small-to-Medium Enterprises (SMEs): (or lawyers representing them)

    Cheap IT and clustering = big (volume & velocity)

    Prevalence of Social Media = (variety)

    Large Enterprises/Fortune 1000:

    When data is big enough or detailed enough for:

    Temptation to de-anonymize

    Likelihood of unintended pattern recognition (exceeds reasonable consumer expectations and/or what Priv Policy says) (FTC)

    3.0: How Big Is Data Getting?

    3.1 Measurement scales (with pragmatic examples and some fun facts/tidbits for each)

    Can be 'big' (if unstructured or high volume):

    Kilobytes (Docs, spreadsheets, GIF/JPEGs)

    Megabytes (MP3, higher res images)

    Gigabytes (PC hard drives, HD video)

    Getting 'bigger': ('big' but still not expensive!)

    Terabytes (enterprise servers, large HDDs)

    U.S. Library of Congress had over 235 terabytes of data in 2011

    100 terabytes uploaded to Facebook/day

    3 Terabyte Seagate HDD available on Amazon for $120 (as of 11/01/2012)

    AT&T claims to have largest single, unique database (1.9 trillion rows) @ 312 terabytes

    Definitely 'big':

    Petabytes (supercomputers, large virtual "drives")

    The total file size of the movie Avatar (incl. encoding for 3-D, IMAX, HD, etc.) constituted over 1 petabyte of data, roughly equivalent to a 32-year-long MP3 song.

    In 2008, eBay, Walmart and BofA were considered data storage leaders with 4 PB, 2.5 PB and 1.5 PB respectively

    Now, however, Facebook reportedly has over 30 petabytes of data in a massive Hadoop cluster

    IBM put together a 120 Petabyte (120 million gigabyte) data cluster (virtual drive) using over 200,000 smaller HDDs, equaling 1 trillion files or a 2-billion-hour-long MP3

    Exabytes (largest individual data sets globally)

    Zettabytes (total data currently on Earth projected to be 2.7 ZB)
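The scale examples above are easy to sanity-check with a few lines of arithmetic. The sketch below assumes decimal (SI) units, where each step is a factor of 1,000, and an MP3 bitrate of 128 kbps; neither assumption is stated on the slides, and the helper names `convert` and `mp3_playback_hours` are made up for this illustration.

```python
# Decimal (SI) storage units: each step up is a factor of 1,000.
# ASSUMPTION: the slides use decimal units; binary units (KiB, MiB, ...) differ slightly.
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
BYTES_PER = {name: 10 ** (3 * (i + 1)) for i, name in enumerate(UNITS)}


def convert(value, from_unit, to_unit):
    """Convert a quantity between two decimal storage units."""
    return value * BYTES_PER[from_unit] / BYTES_PER[to_unit]


def mp3_playback_hours(num_bytes, kbps=128):
    """Hours of audio that fit in num_bytes at an ASSUMED constant bitrate."""
    bytes_per_second = kbps * 1000 / 8
    return num_bytes / bytes_per_second / 3600


ibm_gb = convert(120, "PB", "GB")
ibm_hours = mp3_playback_hours(120 * BYTES_PER["PB"])
print(f"120 PB = {ibm_gb:,.0f} GB")
print(f"120 PB of 128 kbps MP3 = {ibm_hours / 1e9:.2f} billion hours")
```

Under these assumptions the IBM figures line up: 120 PB is 120 million GB and roughly 2 billion hours of audio. The same bitrate gives on the order of 2,000 years of audio for 1 PB, so the 32-year comparison for Avatar presumably rests on a much higher bitrate; the slide does not say which.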

    Mostly theoretical (for now):

    Yottabytes

    Bored geeks at play:

    Reverse alphabet proposal (X,Y,Z)

    Brontobytes

    Excellent video demonstration from Univ. of Utah Prof. Chris Johnson @ TEDx Salt Lake City 2011: http://www.youtube.com/watch?v=5UxC9Le1eOY

    4.0 A Few Big Data Tools and Technologies

    Massively parallel processing v. single supercomputer

    Open source BD projects/technologies

    Apache Hadoop (application framework facilitating MPP for BD purposes; integrates MapReduce)

    Apache Cassandra (distributed DBMS for BD)

    Helpful Intro Video on Hadoop: http://www-01.ibm.com/software/data/infosphere/hadoop/

    Proprietary platforms (storage management, analytics, DBMS)

    Greenplum (acquired by EMC in 2010)

    IBM's Big Data Platform

    SAP's HANA

    MapR's Drill (Hadoop re-done as proprietary platform with value-adds)

    Google Dremel
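For readers who have not seen MapReduce, the idea that Hadoop integrates can be sketched in plain Python: a map step emits key/value pairs from each input split, a shuffle step groups them by key, and a reduce step combines each group. This is a single-process toy, not Hadoop's API; the function names and the sample documents are invented for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    # On a real cluster each map_phase call could run on a different node.
    mapped = chain.from_iterable(map_phase(d) for d in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# HYPOTHETICAL input splits.
docs = ["big data big hype", "big data real issues"]
counts = word_count(docs)
print(counts)
```

Because each map call touches only its own split and each reduce call touches only one key, the work can be spread across many inexpensive machines, which is the "massively parallel processing v. single supercomputer" contrast above.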

    3.0: How Big Is Data Getting?

    3.1 Measurement scales (with pragmatic examples and some fun facts/tidbits for each)

    Getting 'bigger': ('big' but still not expensive!)

    Terabytes (enterprise servers, large HDDs)

    Definitely 'big':

    Petabytes (supercomputers, large virtual "drives")

    Exabytes (largest individual data sets globally)

    Zettabytes (total data currently on Earth = 2.7 ZB)

    5.0 Nailing Down The Real 'Big Data' Business/Legal Issues

    5.1 Business: All just hype and marketing spin?

    Yes! CRM/ERP in 1990s = lots of hype/promise + lots of (expensive) flameouts

    No!

    Internet closing in on 20+ yrs of mass adoption; meaningful patterns in online histories now emerging

    So much info available now that companies do not have to guess or make assumptions. Now (like Jeopardy) the only hurdle to an answer is knowing which questions to ask.

    Kinda/sorta: Data is still data, and many traditional data mining techniques still apply to 'big' data (e.g. market basket analysis), just thought of in new ways
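Market basket analysis, mentioned above, comes down to counting how often items appear together. A minimal sketch follows, using the standard support, confidence and lift measures; the five baskets are hypothetical data invented to echo the "beer and diapers" example, and the function names are not from any particular library.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the baskets."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)


def lift(baskets, antecedent, consequent):
    """Ratio above 1 means the items co-occur more often than chance."""
    return confidence(baskets, antecedent, consequent) / support(baskets, consequent)


# HYPOTHETICAL transactions, for illustration only.
baskets = [
    {"beer", "diapers", "chips"},
    {"diapers", "wipes"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "chips"},
]

print(f"support(beer, diapers)  = {support(baskets, ['beer', 'diapers']):.2f}")
print(f"confidence(diapers->beer) = {confidence(baskets, ['diapers'], ['beer']):.2f}")
print(f"lift(diapers->beer)       = {lift(baskets, ['diapers'], ['beer']):.2f}")
```

The technique is the same at any scale; what changes with 'big' data is the number of baskets and the variety of sources feeding them.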

    5.2 BD industry/subsector now getting a lot of press coverage (and money thrown that way):

    IDC Estimates: Market for BD-related services projected to grow from $3.2B in 2010 to $16.9B in 2015

    Time Magazine write-up on extensive use of data profiling analytics by Obama Camp (Released yesterday)

    Obama Administration projected to be spending over $200M on Big Data-related projects

    Harvard Business Review October 2012 Expose

    BD Legal Practice Groups Being Formed: Law Technology News October 2012
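To put the IDC projection above in annual terms, the implied compound annual growth rate can be worked out directly. The sketch assumes the two figures are endpoints of a five-year span (2010 to 2015) with smooth compounding, which the slide does not state; `cagr` is a helper written for this note.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures in billions of USD, taken from the IDC estimate on the slide.
rate = cagr(3.2, 16.9, 2015 - 2010)
print(f"Implied compound annual growth: {rate:.1%}")
```

That works out to a little under 40% per year.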

    5.3 Legal:

    De-anonymization temptation (e.g. "Database of Ruin")

    (Unintended) pattern recognition (e.g. Target example)

    Risk v. reward analysis (likelihood of occurring v. severity of harm)
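The de-anonymization concern can be made concrete with a toy linkage attack: records stripped of names are joined to a second data set on a few shared attributes (quasi-identifiers). Everything below is hypothetical, including the records, the choice of ZIP code, birth year and gender as quasi-identifiers, and the function names; it is a sketch of the general idea, not of any specific incident.

```python
# HYPOTHETICAL "anonymized" usage log: no names, but quasi-identifiers remain.
anonymized_logs = [
    {"zip": "30301", "birth_year": 1975, "gender": "F", "query": "oncology clinics"},
    {"zip": "30301", "birth_year": 1982, "gender": "M", "query": "bankruptcy lawyer"},
    {"zip": "30305", "birth_year": 1975, "gender": "F", "query": "used cars"},
]

# HYPOTHETICAL second source that does carry names (e.g. a public directory).
public_directory = [
    {"name": "Person A", "zip": "30301", "birth_year": 1975, "gender": "F"},
    {"name": "Person B", "zip": "30301", "birth_year": 1982, "gender": "M"},
    {"name": "Person C", "zip": "30301", "birth_year": 1982, "gender": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")


def key(record):
    """Tuple of quasi-identifier values used to link the two data sets."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentify(logs, directory):
    """Return (name, query) pairs where exactly one person matches a log row."""
    names_by_key = {}
    for person in directory:
        names_by_key.setdefault(key(person), []).append(person["name"])
    matches = []
    for row in logs:
        names = names_by_key.get(key(row), [])
        if len(names) == 1:  # a unique match means the row is re-identified
            matches.append((names[0], row["query"]))
    return matches


print(reidentify(anonymized_logs, public_directory))
```

In this toy data only one log row is re-identified; the second row matches two people and the third matches nobody. The more detailed the data, the more rows become unique, which is why detail matters as much as volume.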

    Data source type:

    search queries

    IP addresses (not anonymous!)

    log/use data

    Data subject matter:

    Financial (GLBA)

    Health care (HIPAA)

    Video/library histories (VPPA)

    Education (FERPA)

    CONCLUSION