Mark Chamness
Data Scientist
EMC Corporation
Summer 2014

© Copyright 2014 EMC Corporation. All rights reserved.

Title: Replication
Author: mark.chamness@emc.com
Created Date: 12/4/2014 5:49:16 AM


EMC Corporation

• Fortune “100” Corporation

• 61,000 employees in 86 countries

• Products:
  – Hardware: data storage systems (petabyte scale)
  – Software/Security: “data protection”

• $24 Billion revenue in 2013


My Background

Academic:

• B.S. Physics, California Institute of Technology

• Ph.D. Candidate, Physics, Brown University

• M.S. Statistics, San Jose State (expected 2015)


Finding the Position

• A previous manager contacted me

• After ~1 year, you create your own role

• Be proactive, not reactive, about career goals


Why Study Replication?

• Most common customer query: “Replication”

• Long time to resolve


Research Goals

• Define, validate, & document the data set
  – Mountain of undocumented data

• Algorithm to indicate the probability and/or severity of replication lag


Challenges: Big Bad Data

• Data origin: computers all over the world

• Hundreds of variables

• Billions of observations

• Not a controlled experiment

• Significant Data Scrubbing


Time series analysis


Some forecasts less useful


SPC - Statistical Process Control

• Define rules for “Out-of-Control” conditions

• Test for rule match
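The two steps above (define rules, then test for a match) can be sketched in a few lines of Python. The slide does not say which rules were used, so the two below are assumptions borrowed from common control-chart practice: a single point more than three standard deviations from the center line, and a run of eight consecutive points on the same side of it. The function name `out_of_control`, the `baseline` argument, and the sample numbers are all hypothetical.

```python
from statistics import mean, stdev


def out_of_control(values, baseline, run_length=8):
    """Return sorted indices in `values` that match either rule.

    The center line and sigma are estimated from `baseline`, a sample
    assumed to represent normal ("in-control") behavior.
    """
    center = mean(baseline)
    sigma = stdev(baseline)
    flagged = set()

    # Rule A (assumed): one point more than 3 sigma from the center line.
    for i, v in enumerate(values):
        if abs(v - center) > 3 * sigma:
            flagged.add(i)

    # Rule B (assumed): `run_length` consecutive points on the same side
    # of the center line. The index where the run completes is flagged,
    # as is every later point that extends the run.
    run, side = 0, 0
    for i, v in enumerate(values):
        s = (v > center) - (v < center)  # +1 above, -1 below, 0 on the line
        if s != 0 and s == side:
            run += 1
        else:
            side, run = s, (1 if s != 0 else 0)
        if run >= run_length:
            flagged.add(i)

    return sorted(flagged)


# Hypothetical replication-lag readings, for illustration only.
baseline = [10, 12, 11, 9, 10, 11, 9, 10, 12, 8]
spike = out_of_control([10, 11, 30, 9], baseline)
drift = out_of_control([11] * 8, baseline)
```

Here `spike` picks up the single extreme reading, while `drift` picks up a sustained shift that never crosses the 3-sigma limit. Estimating the limits from a separate baseline keeps an outlier from inflating its own control limits.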


p-values are useful

Almost exceeds 95% confidence interval (~2)
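As a rough illustration of the annotation above: if the "~2" is read as roughly two standard deviations (an assumption, since the unit did not survive on the slide) and the statistic is approximately normal, then the 95% interval sits at about ±1.96 standard deviations, and the p-value says how close a point is to that boundary. The helper name `two_sided_p_value` is hypothetical.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value for a z-score under a standard normal model."""
    return math.erfc(abs(z) / math.sqrt(2))


# A z-score of 1.96 sits on the 95% boundary (p is about 0.05);
# a z-score of 2.0 falls just outside it.
p_boundary = two_sided_p_value(1.96)
p_two_sigma = two_sided_p_value(2.0)
```

Reporting the p-value rather than a yes/no flag distinguishes a point that barely stays inside the interval from one that is nowhere near it.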


Replication has hourly correlation

Visualization Helps!
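One way to check for this kind of pattern numerically, alongside a plot, is the sample autocorrelation at different lags. The sketch below assumes one reading per hour and uses a synthetic sine wave with a 24-hour period in place of real replication data; the function name `autocorrelation` and the series are hypothetical, and the slide does not say how the correlation was actually measured.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a non-negative integer `lag`."""
    n = len(series)
    if lag >= n:
        return 0.0
    m = sum(series) / n
    denom = sum((x - m) ** 2 for x in series)
    if denom == 0:
        return 0.0  # constant series: correlation is undefined, report 0
    num = sum((series[i] - m) * (series[i + lag] - m) for i in range(n - lag))
    return num / denom


# Synthetic hourly series: one week of a pure 24-hour cycle.
hourly = [math.sin(2 * math.pi * h / 24) for h in range(24 * 7)]
same_hour_next_day = autocorrelation(hourly, 24)   # strongly positive
opposite_time_of_day = autocorrelation(hourly, 12)  # strongly negative
```

A strong positive value at lag 24 and a strong negative value at lag 12 is the numeric counterpart of a daily cycle that is easy to see once the series is plotted by hour.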


Software/Statistical Methods used

• R/RStudio
  – ggplot
  – Time series analysis

• Database
  – SQL (PostgreSQL database)
  – pgAdmin
  – MADlib: “Big Data SQL for Data Scientists”

• Documentation via Excel/PowerPoint/Word


Recommendations for Finding an Internship

• LinkedIn
  – Educate yourself: find profiles in your target role
  – Find the relevance of your background/experience
  – Keep your profile up-to-date & accurate

• Salary.com
  – Gather data on your worth

• Job sites: Dice/BrassRing/etc.

• Apply to companies you like


Q&A