26
Data Wrangling John Meehan Jeff Rasley

DataWrangling$ - cs.brown.edu

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: DataWrangling$ - cs.brown.edu

Data  Wrangling  

John  Meehan  Jeff  Rasley  

Page 2: DataWrangling$ - cs.brown.edu

Working  with  raw  data  sucks.  

•  Data  comes  in  all  shapes  and  sizes  – CSV  files,  PDFs,  stone  tablets,  .jpg…  

•  Different  files  have  different  formaIng  – Spaces  instead  of  NULLs,  extra  rows  

•  “Dirty”  data  – Unwanted  anomalies  – Duplicates  

Page 3: DataWrangling$ - cs.brown.edu

Current  Tools  

•  Focus  on  specific  problems  – Resolving  enQQes  – Removing  duplicates  – Schema  matching  

•  Most  systems  are  non-­‐interacQve  –  Inaccessible  to  general  audience  

•  A  lot  of  people  just  use  Excel  or  regular  expressions…  

Page 4: DataWrangling$ - cs.brown.edu

Data  Wrangling  

•  Goal:  extract  and  standardize  the  raw  data  – Combine  mulQple  data  sources  – Clean  data  anomalies  

•  Combine  automaQon  with  interacQve  visualizaQons  to  aid  in  cleaning  

•  Improve  efficiency  and  scale  of  data  imporQng  •  Lower  the  threshold  for  broader  audiences  

Page 5: DataWrangling$ - cs.brown.edu

Three  Data  Sources:  Database,  PDF,  CSV  

Missing  headers,  MulQple  date  formats,  Merged  columns     Success!      ………but  

immediately  geIng  error:  “Empty  cells  in  column  3”  

Ugh,  screw  it…..  Back  to  imporQng.  

Data  reimported,  More  cleaning  

Analysis  possible,  but  data  quality  sQll  sucks  

SUCCESS!!!!  

Repeat  data  loading  less  painful,  but  sQll  annoying  

Page 6: DataWrangling$ - cs.brown.edu

Types  of  Data  Problems  

•  Missing  data  •  Incorrect  data  •  Inconsistent  representaQons  of  the  same  data  •  About  75%  of  data  problems  require  human  intervenQon  

•  Cleaning  data  vs  overly-­‐saniQzing  data  

Page 7: DataWrangling$ - cs.brown.edu

Diagnosing  data  problems  

•  VisualizaQons  can  convey  “raw”  data  •  Different  visual  representaQons  highlight  different  types  of  data  issues  – Outliers  oaen  stand  out  in  a  plot  – Missing  data  will  cause  gap  or  zero  value  

•  Becomes  increasingly  difficult  as  data  gets  larger  – Visual  design  coupled  with  interacQon  – Sampling  

Page 8: DataWrangling$ - cs.brown.edu

Node-­‐Link  Diagram  

Page 9: DataWrangling$ - cs.brown.edu

Matrix  View  

Page 10: DataWrangling$ - cs.brown.edu

Sorted  Matrix  View  

Page 11: DataWrangling$ - cs.brown.edu
Page 12: DataWrangling$ - cs.brown.edu

Visualizing  Missing  Data  

•  Set  values  to  zero?  •  Interpolate  based  on  exisQng  data?  •  Omit  missing  data?  

Page 13: DataWrangling$ - cs.brown.edu

Visualizing  Uncertain  Data  

•  Can  arise  from:  – Measurement  errors  – Missing  data  – Sampling  

•  VisualizaQon  must  – Consider  all  components  of  uncertainty  – Depict  mulQple  kinds  of  uncertainty  –  Interact  with  uncertainty  depicQons  

Page 14: DataWrangling$ - cs.brown.edu

Transforming  Data  

•  SpliIng  columns,  converQng  into  meaningful  records  

•  Typical  methods:  regular  expressions,  programming  by  demonstraQon  – Prone  to  errors,  tedious  

•  InteracQve  tools  simplify  the  process  – Guide  user  through  seIng  automated  constraints  – Generates  scripts  for  the  user  

Page 15: DataWrangling$ - cs.brown.edu

Transforming  Data  (cont)  

•  Data  formaIng,  extracQon,  and  conversion  •  CorrecQng  erroneous  values  •  IntegraQng  mulQple  data  sets  

Page 16: DataWrangling$ - cs.brown.edu

EdiQng  and  AudiQng  TransformaQons  

•  Data  Provenance  – Maintaining  the  data  history  – Track  the  lineage  of  a  specific  item’s  origins  

•  Used  for  the  modificaQon,  reuse,  and  understanding  of  a  transformaQon  

•  What  transformaQon  language  should  be  used?  – Extend  exisQng  languages?  

Page 17: DataWrangling$ - cs.brown.edu

Wrangling  in  the  Cloud  

•  Allows  the  sharing  of  data  transformaQons  •  Mining  records  of  wrangling  – Befer  automaQc  suggesQons  

•  User-­‐defined  data  types  •  Feedback  from  downstream  analysts  – Crowdsourcing  the  final  result  – Allow  users  to  annotate  or  correct  the  data  

Page 18: DataWrangling$ - cs.brown.edu

Checkpoint-­‐Conclusion  

•  Data  wrangling  is  oaen  a  second-­‐class  ciQzen  •  Common  problems,  very  Qme-­‐consuming  – Manual  approach  is  no  longer  a  viable  opQon  

•  Future  work:  extend  visual  approaches  into  the  data  wrangling  phase  

•  Plenty  of  research  direcQons  

Page 19: DataWrangling$ - cs.brown.edu

Secure  Data  AnalyQcs  

•  Private/sensiQve  data  – SSN,  Medical,  Classified,  etc.  

•  Cloud-­‐base  analysis  currently  doesn’t  work  •  Local  analysis  is  key  •  Research  area?  

Page 20: DataWrangling$ - cs.brown.edu

Pofer's  Wheel:  An  InteracQve  Data  Cleaning  System  

•  Vijayshankar  Raman  and  Joseph  Hellerstein  (VLDB  ‘01)  

•  Provide  a  graphical  interacQve  tool  to  support  various  data  transformaQons  with  suggesQons  

Page 21: DataWrangling$ - cs.brown.edu

TransformaQons  

Page 22: DataWrangling$ - cs.brown.edu

Download!  

•  Pofer's  Wheel  A-­‐B-­‐C:  An  InteracQve  Tool  for  Data  Analysis,  Cleansing,  and  TransformaQon  – hfp://control.cs.berkeley.edu/abc/  

•  (But  you  probably  don’t  want  to,  last  release  was  Oct  10,  2000)  

Page 23: DataWrangling$ - cs.brown.edu

Data  Wrangler  

•  Wrangler:  InteracQve  Visual  SpecificaQon  of  Data  TransformaQon  Scripts  (CHI  ‘11)  – Sean  Kandel,  Andreas  Paepcke,  Joseph  Hellerstein,  Jeffrey  Heer  (Stanford  Vis  Group  +  Berkeley)  

$4.3M  in  Series  A  funding!  (1/12)  

D3  

Page 24: DataWrangling$ - cs.brown.edu

What’s  new?  

•  Similar  goals  as  Pofer’s  Wheel  •  Improved  UI  for  the  web  •  Python  &  JavaScript  libraries  •  AddiQonal  transformaQons  such  as  fill  

Page 25: DataWrangling$ - cs.brown.edu

Demo    (~10min)  

Page 26: DataWrangling$ - cs.brown.edu

7  Command-­‐Line  Tools  for  Data  Science  

•  hfp://jeroenjanssens.com/2013/09/19/seven-­‐command-­‐line-­‐tools-­‐for-­‐data-­‐science.html