Making Data a First Class Citizen
38 Degrees: An AOL Gov Conference Series

Dr. Brand Niemann
Director and Senior Enterprise Architect – Data Scientist

Semantic Community
http://semanticommunity.info/

AOL Government Blogger
http://gov.aol.com/bloggers/brand-niemann/

September 18-19, 2012



Overview

• September 18th Tutorial (90 minutes): Making Data a First Class Citizen
  – Digital Agenda for Europe and Building a Digital Government US Examples:
    • See: http://cms.aol.com/809/content/posts/edit/20264973/
    • See: http://gov.aol.com/2012/06/06/health-datapalooza-a-model-of-innovation/
  – Recommended APIs and Data Sets (in process):
    • http://semanticommunity.info/AOL_Government/Data_Services_for_Developers
  – Results of Competition with Recommended APIs and Data Sets
    • TBA
• September 19th Presentation (15 minutes): Making Data a First Class Citizen:
  – Summary of Three Topics Above


Outline

• Data
• Data Scientist
• Data Science Products
• Data Science Teams
• Tutorials


Data

• Table: Rows and Columns
• Relational Database: Key Field for Multiple Tables
• Unstructured: Linked Data, NoSQL, & RDF Graphs
• Big: Volume, Velocity, Variety, and Value/Veracity
• Architecture: Business and Science, Frameworks, & Infrastructure
• Major Developments: Google Bigtable and Amazon Dynamo
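To make the "Key Field for Multiple Tables" bullet concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and rows are hypothetical and chosen only for illustration; the point is that a shared key field (agency_id here) lets one query combine rows from two tables.

```python
import sqlite3

# Hypothetical example data: two tables that share the key field agency_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE agency (agency_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dataset (
        dataset_id INTEGER PRIMARY KEY,
        agency_id  INTEGER REFERENCES agency(agency_id),
        title      TEXT
    );
    INSERT INTO agency VALUES (1, 'Agency A'), (2, 'Agency B');
    INSERT INTO dataset VALUES
        (10, 1, 'Hospital quality'),
        (11, 1, 'Drug labels'),
        (12, 2, 'Air quality');
""")

# The key field lets a single query combine rows from both tables.
rows = conn.execute("""
    SELECT agency.name, dataset.title
    FROM dataset
    JOIN agency ON dataset.agency_id = agency.agency_id
    ORDER BY dataset.dataset_id
""").fetchall()

for name, title in rows:
    print(name, "-", title)
```

A single flat table would have to repeat the agency name on every dataset row; splitting the data and joining on the key avoids that duplication.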


Data Scientist

• A data scientist is a job title for an employee or business intelligence (BI) consultant who excels at analyzing data, particularly large amounts of data, to help a business gain a competitive edge.
• The title data scientist is sometimes disparaged because it lacks specificity and can be perceived as an aggrandized synonym for data analyst. Regardless, the position is gaining acceptance with large enterprises who are interested in deriving meaning from big data, the voluminous amount of structured, unstructured and semi-structured data that a large enterprise produces.
• A data scientist possesses a combination of analytic, machine learning, data mining and statistical skills as well as experience with algorithms and coding. Perhaps the most important skill a data scientist possesses, however, is the ability to explain the significance of data in a way that can be easily understood by others.

Source: http://searchbusinessanalytics.techtarget.com/definition/Data-scientist


Tim O’Reilly: The World’s 7 Most Powerful Data Scientists

• Tim O'Reilly is the founder of O'Reilly Media
  – "The success of companies like Google, Facebook, Amazon, and Netflix, not to mention Wall Street firms and industries from manufacturing to retail and healthcare, is increasingly driven by better tools for extracting meaning from very large quantities of data. 'Data Scientist' is now the hottest job title in Silicon Valley."
    • Source: http://www.forbes.com/pictures/lmm45emkh/tim-oreilly-is-the-founder-of-oreily-media/#gallerycontent


#1 Larry Page, CEO, Google

• Google, more than any other company, has pushed the boundaries of what is possible with big data. Along with Sergey Brin, Page built the search engine that tamed the web and solved the problem posed by John Wanamaker a century ago ("Half the money I spend on advertising is wasted; the trouble is I don't know which half."). And in his quest to provide access to all the world’s information, he has accumulated the largest database on the planet.


#2 Jeff Hammerbacher, Chief Scientist, Cloudera and DJ Patil, Entrepreneur-in-Residence, Greylock Ventures

• Hammerbacher and Patil coined the term "data scientist." Now it’s Silicon Valley's hottest job title. These two built the first formal data science teams at Facebook and LinkedIn, respectively. Now at Cloudera, Hammerbacher has been key to driving the success of Hadoop as a standard tool for processing large, unstructured data sets with a network of commodity computers. As Data Scientist in Residence at Greylock, Patil is seeking out the next generation of hot data-driven startups.


#3 Sebastian Thrun, Professor, Stanford University and Peter Norvig, Data Scientist, Google

• When Thrun and Norvig decided to teach their Stanford course, Introduction to Artificial Intelligence, over the internet, they managed to sign up over 140,000 students and proved that AI is no longer just an academic subject. Norvig is Google's chief scientist. Thrun is leading Google’s efforts to build a self-driving car that relies on AI algorithms and the memory of hundreds of thousands of miles driven by Google’s street view vehicles, recording and measuring everything they saw.


#4 Elizabeth Warren, Candidate, U.S. Senate (Massachusetts)

• The banking system excesses that led to the economic crash of 2008 are an example of big data gone wrong. As the provisional head of the Consumer Financial Protection Bureau, Elizabeth Warren began the job of building the algorithmic checks and balances needed to counter the sorcerer's apprentices of Wall Street. In her campaign for the US Senate, she promises to continue that fight.


#5 Todd Park, CTO, Department of Health and Human Services

• Park is leading the charge to transform American healthcare into a data driven business. From medical diagnostics to insurance reimbursement to community health statistics, he is finding ways to use data to make healthcare more effective and affordable.


#6 Alex "Sandy" Pentland, Professor, MIT

• Sandy is not only a wide-ranging polymath, he's providing the intellectual leadership on how sensors, the internet of things, geolocation and promiscuous connectivity can be used to uncover insights regarding human behavior. Sandy is also looking at privacy - an important adjunct to the data space - and helping develop the conversation regarding the trade-offs between privacy and the value of personal data.


#7 Hod Lipson and Michael Schmidt, Computer Scientists, Cornell University

• Cornell computer scientists Hod Lipson and Michael Schmidt created an AI program that could distill the laws of motion merely by observing data from the swings of a pendulum. In the process, they kicked off the field of robotic science in which AIs try to derive meaning from datasets too large or complex for humans to study.


Data Science Products

• Introduction to Data Science (Spring 2012):
  – Course Information:
    • Organizations use their data for decision support and to build data-intensive products and services. The collection of skills required by organizations to support these functions has been grouped under the term “Data Science”. This course will attempt to articulate the expected output of Data Scientists and then equip the students with the ability to deliver against these expectations. The assignments will involve web programming, statistics, and the ability to manipulate data sets with code.
  – Instructors:
    • Jeff Hammerbacher and Mike Franklin and Guest Speakers
  – Components:
    • Data preparation
    • Data presentation
    • Data products
    • Observation
    • Experimentation
    • Final Project
  – Resources (Fabulous!):
    • http://datascienc.es/resources/

Source: http://datascienc.es/


Data Science Products

• My Process Model (Jeff Hammerbacher):
  – 1. Identify problem
  – 2. Instrument data sources
  – 3. Collect data
  – 4. Prepare data (integrate, transform, clean, impute, filter, aggregate)
  – 5. Build model
  – 6. Evaluate model
  – 7. Communicate results
• Jim Gray (“River of Data”):
  – 1. Capture
  – 2. Curate
  – 3. Communicate
• Data Preparation:
  – HTML tables
  – File downloads
  – REST APIs
• Exercises:
  – 2012 Presidential Campaign Finance website
  – http://elections.nytimes.com/2012/campaign-finance
  – MINE: dataset was used for a competition hosted by Kaggle
  – UNZIP TAR: votes made by users of a social news website TAR
  – http://datascienc.es/2011final-project/

Source: http://datascienc.es/
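The "Prepare data" step and the "HTML tables" source above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the HTML snippet, the column names, the mean-imputation rule, and the "drop negative amounts" filter are all hypothetical and are not taken from the course materials. The integrate step (combining several sources) is omitted for brevity.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical HTML table standing in for a scraped page.
HTML = """
<table>
  <tr><th>candidate</th><th>state</th><th>amount</th></tr>
  <tr><td>Candidate A</td><td>VA</td><td>1,200</td></tr>
  <tr><td>Candidate A</td><td>MD</td><td></td></tr>
  <tr><td>Candidate B</td><td>VA</td><td>300</td></tr>
  <tr><td>Candidate B</td><td>DC</td><td>-60</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def prepare(html):
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    records = [dict(zip(header, row)) for row in body]

    # Transform/clean: turn "1,200" into 1200.0; empty cells become None.
    for rec in records:
        text = rec["amount"].replace(",", "")
        rec["amount"] = float(text) if text else None

    # Impute: replace missing amounts with the mean of the known ones.
    known = [r["amount"] for r in records if r["amount"] is not None]
    mean = sum(known) / len(known)
    for rec in records:
        if rec["amount"] is None:
            rec["amount"] = mean

    # Filter: drop negative amounts (an assumed rule for this sketch).
    records = [r for r in records if r["amount"] >= 0]

    # Aggregate: total amount per candidate.
    totals = defaultdict(float)
    for rec in records:
        totals[rec["candidate"]] += rec["amount"]
    return dict(totals)


totals = prepare(HTML)
print(totals)
```

The order of the sub-steps follows the slide (impute before filter), so the imputed mean here still includes the negative row; swapping the two steps would give a different result, which is the kind of choice a real preparation pipeline has to document.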


Data Science Products

• Chief Data Officer for a Day:
  – Your team has been tasked with enabling your organization to “compete on analytics”
  – 1. Define the top three priorities of the organization
  – 2. Determine the data sources you’d like to collect
  – 3. Highlight the largest data integration challenges you’ll face
  – 4. Determine the most important data to present to your organization
  – 5. What data products could you build?
  – 6. What studies could you run to answer the most pressing questions for the organization?
  – 7. Suggest some experiments to run to help guide the organization towards their goals


Data Science Products

• AOL Government (Wyatt Kash, Editorial Director):
  – Clear Compelling Headline
  – Original Graphic
  – Contextual Introduction Sentence(s)
  – Descriptive Paragraph
  – Chart Itself
  – Individual Static Graphics
  – Spotfire Interactive Visualizations
  – Caption
  – Source
  – Rate this chart
• BBC (Andrew Leimdorfer, BBC News Interactive and Graphics and Olivier Thereaux, BBC R&D)
  – Six Tabs: The story, The figures, Explore the data (including download the full data), Analysis: 1, Analysis: 2. Methodology, and Your comments.

NOTE: Examples of these are provided in the actual tutorial.


Data Science Products

1. Identify who keeps the data and how it is kept
2. Download and prepare the data
3. Create a database
4. Double-checking and analysis

Source: http://datajournalismhandbook.org/1.0/en/
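Steps 2 to 4 of this workflow can be sketched with Python's csv and sqlite3 modules. Everything specific here is an assumption made for illustration: the CSV content, the column names, and the choice to set aside rows with a missing value. The handbook itself does not prescribe these tools.

```python
import csv
import io
import sqlite3

# Hypothetical CSV standing in for a downloaded file.
RAW = """state,year,spending
VA,2011,100
VA,2012,120
MD,2011,80
MD,2012,
"""

# 2. Prepare: parse the file and set aside rows with a missing value.
reader = csv.DictReader(io.StringIO(RAW))
clean, rejected = [], []
for row in reader:
    (clean if row["spending"] else rejected).append(row)

# 3. Create a database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spending (state TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO spending VALUES (?, ?, ?)",
    [(r["state"], int(r["year"]), float(r["spending"])) for r in clean],
)

# 4. Double-check: row count and total in the database should match the source.
count, total = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM spending"
).fetchone()
assert count == len(clean)
assert total == sum(float(r["spending"]) for r in clean)

# Analysis: a simple per-state total.
by_state = dict(
    conn.execute(
        "SELECT state, SUM(amount) FROM spending GROUP BY state ORDER BY state"
    )
)
print(by_state, "rejected rows:", len(rejected))
```

Keeping the rejected rows in a separate list, rather than silently dropping them, makes it possible to go back to the data holder (step 1) and ask about the gaps.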


Data Science Products

• Data Journalism Handbook Excerpts:
  – The data journalism project brought a lot of people into the room who do not normally meet at the ABC. In lay terms — the hacks and the hackers. Many of us did not speak the same language or even appreciate what the other does. Data journalism is disruptive!
  – The practical things:
    • Co-location of the team is vital. Our developer and designer were off-site and came in for meetings. This is definitely not optimal! Place in the same room as the journalists.
    • Our consultant EP was also on another level of the building. We needed to be much closer, just for the drop-by factor
    • Choose a story that is solely data driven.


Data Science Teams

• Building Data Science Teams
• Figure 1. The rise in demand for data science talents
• Being Data Driven
• The Roles of a Data Scientist
  – Decision sciences and business intelligence
  – Product and marketing analytics
  – Fraud, abuse, risk and security
  – Data services and operations
  – Data engineering and infrastructure
  – Organizational and reporting alignment
• What Makes a Data Scientist?
• Hiring and talent
  – Would we be willing to do a startup with you?
  – Can you “knock the socks off” of the company in 90 days?
  – In four to six years, will you be doing something amazing?
• Building the LinkedIn Data Science Team
• Reinvention
• About the Author

http://semanticommunity.info/AOL_Government/Data_Science_for_the_Government_Community/Building_Data_Science_Teams


Tutorial: Introduction to Open Government Data

• Understanding the Foundations of Open Data
  – What makes data open
  – Why countries share data
  – Why people want open data
• Making Data Open, Accessible, and Discoverable
  – Policies
  – Processes
  – Change Management
• Selecting and Managing Open Data Technologies
  – Commercial solutions
  – Open source platforms
  – Semantic web and linked data
• Creating an Open Data Ecosystem
  – Sustaining data publishing
  – Engaging developers, citizens, and politicians: from communities to hackdays to challenges
  – Ensuring use and economic benefits
• Measuring the Benefits
  – Creating Your Own Open Data Roadmap
  – Sustaining and Communicating Your Success

Source: http://semanticommunity.info/AOL_Government/Invitation_to_International_Open_Government_Data_Conference#Agenda

Monday, July 9, 2012
Time: 12 noon - 4:30 p.m.
Where: World Bank Headquarters, 1818 H Street, NW, Washington, DC 20433
Participants: Data stewards, open data managers, chief information officers, open data advocates, and developers
Workshop Leaders: Jim Hendler, Tetherless World Constellation Professor, Rensselaer Polytechnic Institute and Jeanne Holm, Evangelist, Data.gov


Tutorial: Introduction to Open Government Data

• Highlights based on July 9th presentation:
  – Understanding the Foundations of Open Data - Having some mandate or directive to do so
  – Making Data Open, Accessible, and Discoverable - Getting people to release their data
  – Creating an Open Data Architecture - Having a platform to access and discover data and build apps
  – Creating an Open Data Ecosystem - Dealing with change management (policies, culture, compliance)
  – Measuring the Benefits - Very difficult to do
  – Summary and Next Steps - Go out and build your own Data.gov


Tutorial: Making Data a First Class Citizen

– Digital Agenda for Europe and Building a Digital Government US Examples
  • See:
    – http://semanticommunity.info/AOL_Government/Digital_Agenda_for_Europe
    – http://cms.aol.com/809/content/posts/edit/20264973/
  • See:
    – http://semanticommunity.info/HealthData.gov
    – http://gov.aol.com/2012/06/06/health-datapalooza-a-model-of-innovation/
– Recommended APIs and Data Sets (in process)
  • http://semanticommunity.info/AOL_Government/Data_Services_for_Developers

NOTE: More slides to be added for actual tutorial.


Postscript

• Presentation to the Federal Big Data Senior Steering Group, September 27, 2012:
  – A team composed of NLM (Tom Rindflesch), Noblis (Victor Pollara), Cray (Steve Reinhardt), and Semantic Community (Brand Niemann) is working to make Semantic Medline, which Dr. George Strawn refers to as “the killer semantic web application for government”, better known and more functional for medical research by putting the Semantic Medline RDF database into the new Cray Graph Computer and demonstrating its usefulness.
  – The background for this project is at:
    • http://semanticommunity.info/A_NITRD_Dashboard/Semantic_Medline
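For readers unfamiliar with RDF, the sketch below shows the general idea of a graph stored as subject-predicate-object triples and queried by pattern matching. It is plain Python rather than an RDF library, and the triples, predicate names, and the match helper are hypothetical; none of it is taken from the actual Semantic Medline database or the Cray system.

```python
# Hypothetical triples; not taken from the actual Semantic Medline data.
TRIPLES = [
    ("DrugA", "TREATS", "DiseaseX"),
    ("DrugB", "TREATS", "DiseaseX"),
    ("DrugA", "INTERACTS_WITH", "DrugC"),
    ("DiseaseX", "AFFECTS", "OrganY"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# One-hop question: which subjects treat DiseaseX?
treatments = [s for s, _, _ in match(TRIPLES, p="TREATS", o="DiseaseX")]

# Two-hop question: which organs are affected by diseases that DrugA treats?
organs = [
    organ
    for _, _, disease in match(TRIPLES, s="DrugA", p="TREATS")
    for _, _, organ in match(TRIPLES, s=disease, p="AFFECTS")
]

print(treatments, organs)
```

Multi-hop questions like the second one are where graph-oriented storage is usually expected to help, since each extra hop in a linear scan like this one re-reads the whole triple list.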