An invited talk to 40+ directors of national libraries worldwide at the annual ExLibris member meeting at IFLA (Helsinki, Finland) on August 15th, 2012.
The Challenge of Big Data: Implications for Big Libraries
at the 2012 National Library / ExLibris Meeting
Lee Dirks
Director, Portfolio Strategy
MSR Connections
Agenda
• About Microsoft Research (and me)
• What is Big Data?
• What is the opportunity for libraries?
• What resources are available?
• Who is going to do the work?
• Some recommendations + next steps…
Some Background
• Redmond, Washington, Sept 1991
• Cambridge, United Kingdom, July 1997
• Beijing, China, Nov 1998
• Silicon Valley, California, July 2001
• Bangalore, India, Jan 2005
• Cambridge, Massachusetts, July 2008
• New York City, NY, May 2012
Microsoft Research Worldwide Presence
[Map: Redmond; Silicon Valley, California; MSR New England; MSR Cambridge, UK; MSR Asia (Beijing); MSR India]
Microsoft Research
• Expand the state of the art in each of the areas in which we do research
• Rapidly transfer innovative technologies into Microsoft products
• Ensure that Microsoft products have a future
http://research.microsoft.com/
[Diagram: time horizons (Present, 2-4 years, 5-10+ years) spanning Product Groups, Microsoft Labs, and Microsoft Research]
Microsoft Research areas: Graphics & Multimedia; Human-Computer Interaction; Machine Learning; Hardware & Devices; Computer Systems & Devices; Communication & Collaboration; Computational Linguistics; Computational Sciences; Information Retrieval & Management; Security & Privacy
Microsoft Labs: Rich Media Labs; E&D Labs; India Labs; Israel Labs; Live Labs; Mobile Labs; Office Labs; Search Labs; Startup Business Accelerator; Startup Labs
Microsoft Research | Connections
Outreach. Collaboration. Innovation.
• Division within Microsoft Research focused on partnerships between academia, industry and government to advance computer science, education, and research in fields that rely heavily upon advanced computing
• Supporting groundbreaking research to help advance human potential and the wellbeing of our planet
• Developing advanced technologies and services to support every stage of the research process
• Microsoft Research Connections is committed to interoperability and to providing open access, open tools, and open technology
http://research.microsoft.com/collaboration/
Engagement and Collaboration Focus
Core Computer Science
Natural User Interface
Earth, Energy & Environment
Health & Wellbeing
Education & Scholarly Communication
What is the challenge before us?
Data Tidal Wave
…Thus far we seem to be worse off than before—for we can enormously extend the record; yet even in its present bulk we can hardly consult it. This is a much larger matter than merely the extraction of data for the purposes of scientific research; it involves the entire process by which man profits by his inheritance of acquired knowledge. The prime action of use is selection, and here we are halting indeed. There may be millions of fine thoughts, and the account of the experience on which they are based, all encased within stone walls of acceptable architectural form; but if the scholar can get at only one a week by diligent search, his syntheses are not likely to keep up with the current scene…
As We May Think by Vannevar Bush
The Atlantic, July 1945
http://www.theatlantic.com/doc/194507/bush
According to a study called How Much Information by the University of California at San Diego,
“…consumption totaled 3.6 zettabytes and 10,845 trillion words, corresponding to 100,500 words and 34 gigabytes for an average person on an average day. A zettabyte is 10 to the 21st power bytes, a million million gigabytes. These estimates are from an analysis of more than 20 different sources of information, from very old (newspapers and books) to very new (portable computer games, satellite radio, and Internet video)."
[Note: Information at work is not included!]
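The figures in the quote can be cross-checked with a little arithmetic. The sketch below assumes that the 3.6 zettabyte and 10,845 trillion word totals are annual figures (the quote does not say so) and asks how many "average persons" each total implies. The variable names are illustrative.

```python
# Cross-check of the "How Much Information" figures quoted above.
# Assumption (not stated in the quote): the totals cover one year.

ZETTABYTE = 10**21   # bytes, as defined in the quote
GIGABYTE = 10**9     # bytes
DAYS_PER_YEAR = 365

total_bytes = 3.6 * ZETTABYTE
total_words = 10_845 * 10**12          # 10,845 trillion words
bytes_per_person_day = 34 * GIGABYTE
words_per_person_day = 100_500

# "A zettabyte is ... a million million gigabytes"
gigabytes_per_zettabyte = ZETTABYTE // GIGABYTE

# Number of "average persons" implied by each total
people_from_bytes = total_bytes / (bytes_per_person_day * DAYS_PER_YEAR)
people_from_words = total_words / (words_per_person_day * DAYS_PER_YEAR)

print(f"1 ZB = {gigabytes_per_zettabyte:.0e} GB")
print(f"Implied population (bytes): {people_from_bytes / 1e6:.0f} million")
print(f"Implied population (words): {people_from_words / 1e6:.0f} million")
```

Under the annual assumption, both totals imply a population of just under 300 million people, so the per-person and aggregate figures are consistent with each other.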
“It’s not information overload. It’s filter failure.”
Clay Shirky at Web 2.0 Expo 2008
What is Big Data?
http://en.wikipedia.org/wiki/Big_data
• In information technology, big data is a collection of data sets so large and complex that it becomes awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analysis, and visualization.
• Though a moving target, as of 2008 limits were on the order of petabytes to exabytes of data. Scientists regularly encounter limitations due to large data sets in many areas, including meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research. The limitations also affect Internet search, finance and business informatics.
• Data sets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification readers, and wireless sensor networks.
• The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, every day 2.5 quintillion (2.5×10^18) bytes of data were created.
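The growth figures in the last bullet translate into more familiar units with simple arithmetic. A minimal sketch, using a hypothetical helper name, that converts a 40-month doubling time into an annual growth rate and the daily creation figure into exabytes:

```python
def annual_growth_rate(doubling_months: float) -> float:
    """Annual growth rate implied by a doubling time given in months."""
    return 2 ** (12 / doubling_months) - 1

EXABYTE = 10**18  # bytes

bytes_per_day = 2.5e18                      # 2.5 quintillion bytes per day
exabytes_per_day = bytes_per_day / EXABYTE
exabytes_per_year = exabytes_per_day * 365

rate = annual_growth_rate(40)
print(f"Doubling every 40 months = {rate:.1%} growth per year")
print(f"{exabytes_per_day:.1f} EB/day = {exabytes_per_year:.1f} EB/year")
```

Capacity that doubles every 40 months grows by about 23% a year, and 2.5 quintillion bytes a day comes to roughly 0.9 zettabytes a year.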
So, what about libraries?
The Future: an Explosion of Data
[Figure labels: Experiments, Simulations, Instruments, Archives, Literature; Petabytes]
The Challenge: Enable Discovery
Deliver the capability to mine, search and analyze this data in near real-time.
Sharing Data
What resources are available?
DataUp: Data Curation Add-in for Microsoft Excel
• Microsoft Research, in partnership with:
– California Digital Library’s Curation Center
– Collaboration with Trisha Cruse & John Kunze
– Part of DataONE (an NSF DataNet Project)
– The Gordon & Betty Moore Foundation
Available in Sept 2012
DataUp Functionality (http://dcxl.cdlib.org/)
1. Check data file for .csv compatibility & create a .csv version of the data file
2. Generate metadata that is linked to the data file
3. Generate a citation for the data file
4. Authenticate user to set up designated repository
5. Link an identifier to the data file
6. Ensure data file is ready for deposition into a repository
7. Submit the data file for deposition to the designated repository
8. Ensure compatibility for Excel users without the add-in
FigShare (a DigitalScience project)
• Figshare is the first online repository for storing and sharing all of your preliminary findings in the form of individual figures, datasets, media or filesets. Post preprint figures on Figshare to claim priority and receive feedback on your findings prior to formal publication.
• Figshare allows researchers to publish all of their research outputs in seconds in an easily citable, sharable and discoverable manner. All file formats can be published, including videos and datasets that are often demoted to the supplemental materials section in current publishing models.
• Figshare uses Creative Commons licensing to allow frictionless sharing of research data while allowing users to maintain their ownership. Figshare gives users unlimited public space and 1GB of private storage space for free.
http://figshare.com/
http://www.digital-science.com/
DataCite (“DOIs for data”)
• An emerging protocol for citing data, based on DOIs.
• Founded jointly by TIB (Germany) and the British Library
• Goals:
– Establish easier access to research data on the Internet
– Increase acceptance of research data as legitimate, citable contributions to the scholarly record
– Support data archiving that will permit results to be verified and re-purposed for future study
• Currently:
– 15 members
– Over 1,000,000 DOIs [Digital Object Identifier] registered
– Metadata specifications released
– Shared technical infrastructure established
http://datacite.org/
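A DOI-based data citation can be assembled from a handful of metadata fields. The sketch below assumes a "Creator (Year): Title. Publisher. Identifier" pattern and resolves the DOI through the doi.org proxy. The dataset, the DOI (the 10.5555 prefix is used purely as a placeholder), the simplified syntax check, and the function names are hypothetical and not taken from the talk.

```python
import re
from urllib.parse import quote

# Simplified syntax check: "10.", a registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def is_valid_doi(doi: str) -> bool:
    """Rough structural check; it does not confirm the DOI is registered."""
    return bool(DOI_PATTERN.match(doi))

def doi_url(doi: str) -> str:
    """Resolver URL for a DOI via the doi.org proxy."""
    if not is_valid_doi(doi):
        raise ValueError(f"not a DOI: {doi!r}")
    return "https://doi.org/" + quote(doi, safe="/")

def cite_dataset(creator: str, year: int, title: str,
                 publisher: str, doi: str) -> str:
    """Format a data citation; the pattern is an illustrative assumption."""
    return f"{creator} ({year}): {title}. {publisher}. {doi_url(doi)}"

# Hypothetical dataset and placeholder DOI
print(cite_dataset("Doe, J.", 2012, "Stream temperatures",
                   "Example Data Center", "10.5555/example.2012.001"))
```

Because the identifier is embedded as a resolvable link, the citation keeps working even if the repository later moves the data file to a new location.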
Registry of research data repositories (hosted by Purdue University)
http://databib.org/
Harvard’s “DataVerse” Project
Developed by the Institute for Quantitative Social Science (IQSS) at Harvard University
Leveraging web application software, data citation standards, and statistical methods, the Dataverse Network project increases scholarly recognition and distributed control for authors, journals, archives, teachers, and others who produce or organize data; facilitates data access and analysis for researchers and students; and ensures long-term preservation whether or not the data are in the public domain.
http://thedata.org/
Dataverse (cont’d)
DataFlow (University of Oxford)
DataFlow is creating a two-stage data management infrastructure that makes it easy for you and your research group to work with, annotate, publish, and permanently store your research data. You manage this locally using your own instance of DataStage, while allowing your institution to deploy DataBank easily to preserve and publish your most valuable datasets.
• DataBank is a scalable data repository designed for institutional deployment.
• DataStage is a secure personalized 'local' file management environment for use at the research group level, appearing as a mapped drive on the end-user's computer.
http://www.dataflow.ox.ac.uk/
Preservation
DuraCloud (a DuraSpace service)
DuraCloud is a hosted service and open technology developed by DuraSpace that makes it easy for organizations and end users to use cloud services. It is a cloud-based service that leverages existing cloud infrastructure to enable durability and access to digital content.
DuraCloud builds on pure storage from expert storage providers by overlaying the access functionality and preservation support tools that are essential to ensuring long-term access and ease of use. DuraCloud offers cloud storage and replication of content across multiple providers, via one web-accessible interface.
V1.0 launched in Nov 2011 with 11 pilot partners, including Columbia University, Northwestern University, Rice University, and many others.
http://duracloud.org/
WorldWideScience.org is a global science gateway connecting you to national and international scientific databases and portals. WorldWideScience.org accelerates scientific discovery and progress by providing one-stop searching of global science sources. The WorldWideScience Alliance, a multilateral partnership, consists of participating member countries and provides the governance structure for WorldWideScience.org.
WorldWideScience.org was developed and is maintained by the Office of Scientific and Technical Information (OSTI), an element of the Office of Science within the U.S. Department of Energy. Please contact [email protected] if you represent a national or international science database or portal and would like your source searched by WorldWideScience.org.
In 3+ years since launch, the site has grown to 65+ countries, 400+ million pages – 96.5% of which is *not* available via commercial search engines – and can now be translated into multiple world languages (on demand).
Who does the work?
NAS/BRDI Study on “Future Career Opportunities and Educational Requirements for Digital Curation”
For the purposes of this study, Digital Curation is defined as the active management and enhancement of digital information assets for current and future use. The study will be performed pursuant to the following statement of task:
1. Identify the various practices and spectrum of skill sets that comprise digital curation, looking in particular at human versus automated tasks, both now and in the foreseeable future.
2. Examine the possible career path demands and options for professionals working in digital curation activities, and analyze the economic and social importance of these employment opportunities for the nation over time. In particular, identify and analyze the evolving roles of digital curation functions in research organizations, and their effects on employment opportunities and requirements.
3. Identify and assess the existing and future models for education and training in digital curation skill sets and career paths in various domains.
4. Produce a consensus report with findings and recommendations, taking into consideration the various stakeholder groups in the digital curation community, that address items 1-3 above.
http://sites.nationalacademies.org/PGA/brdi/PGA_069853/
© 2012 IBM Corporation
Symposium on Digital Curation
The Future Workforce
Steven Miller
IBM
Enterprise Governance Architect
Define & manage the quality, consistency, usability, security, & availability of information
Ensure compliance with local, state, federal, and international regulations
Define and protect key organizational information assets
Define and manage processes to ensure data quality and remediate data errors
Define and manage appropriate levels of security at many levels
Define processes to protect against security issues such as identity and data theft
Define processes to ensure appropriate testing occurs before implementing
Enterprise Architect – Data Governance
Perform baseline logical reviews on key system, content, data, and process assets.
Create and maintain a comprehensive governance architecture for the Enterprise Conceptual Information Model, Content Assets, and Data Assets.
Ensure governed assets adhere to architectural principles and “Golden Rules”
Work with Domain Architects as key interaction point for communication, evangelism, governance and feedback into central architecture
Work with Business / Product Strategy in order to stay up to date with business / product direction in order to anticipate long-lead-time technology needs.
Work with peers within other enterprise information management pillars to develop and maintain business strategy, policies, standards and guidelines pertaining to global enterprise information
Ensure information model, information assets, governance architecture and program are aligned to the business goals across the company and the various business units
Data Curator & Analyst
Develop and maintain tools/codes for day-to-day extraction, curation and management of phenotypic, genomic, breeding process, and logistical data
Be instrumental in extracting and providing clean data to the statistical analysis team, IT team, breeders and management upon request
Develop metrics to measure and track the quality improvements of phenotypic and genomic data
Proactively increase awareness of value of the data quality among breeders and researchers across the company
Work closely with corporate IT groups and statistical teams to identify and implement methods to automate tracking of breeding pipeline and increase quality of pipeline data in order to reduce time in structuring, characterizing and cleaning data.
Create and present summary statistics and reports to researchers and management
Senior Data Steward
Support, build, and sustain relationships with analysts, leads, supervisors and managers within the TFS Business organization on designated projects
Lead all activities (planning, analysis, testing & reconciliation) in support of the delivery of small, medium, and large data and reporting projects and initiatives
Serve as liaison between Business Intelligence and TFS Business units to support requests for data and analysis
Analyze, monitor, profile and administer the metadata, quality and reconciliation of data within assigned areas on designated projects
Prepare and execute detailed data assessments and corrective action plans
Co-develop and execute the process of training business users on how to fully leverage and use TFS' business intelligence tools/reports/applications
Serve as a subject matter expert on TFS data for specific subject areas
“Data Services for the Sciences: A Needs Assessment”
Study by Brian Westra (University of Oregon, July 2010)
Westra, B. “Data Services for the Sciences: A Needs Assessment” (30-July-2010) Ariadne Issue 64 [URL: http://www.ariadne.ac.uk/issue64/westra/]
1. Data storage and backup
2. Making this data findable by others
3. Connecting data acquisition to data storage
4. Allowing or controlling access to this data by others
5. Documenting and tracking updates to the asset
6. Data analysis/manipulation
7. Finding and accessing related data from others
8. Connecting data storage to data analysis
9. Linking this data to publications or other assets
10. Ensuring data is secure/trustworthy
11. Other
How do we prepare for this new world of work?
Workforce Demand and Career Opportunities in University and Research Libraries
NAS Symposium on Digital Curation
Anne R. Kenney
July 19, 2012
7 NEW ROLES FOR LIBRARIANS*
1. Acquisitions and Rights Advisors
2. Instructional Partners in Learning Spaces
3. Observers/anthropologists of Information Users and Producers
4. Systems Builders
5. Content Producers and Disseminators
6. Organizational Designers
7. Collaborative Network Creators and Participants
* Walters and Skinner, New Roles for New Times: Digital Curation for Preservation, ARL, Mar 2011
From Youngseek Kim, et al, “Education for eScience Professionals”, IJDC 6:1 (2011) http://www.ijdc.net/index.php/ijdc/article/view/168
RATINGS OF IMPORTANCE AND FREQUENCY OF ESCIENCE INTERNSHIP TASKS
MOST SIGNIFICANT SKILLS GAPS IN SUPPORTING EVOLVING RESEARCHERS’ INFORMATION NEEDS
1. Ability to advise on preserving research outputs
2. Knowledge to advise on data management and curation, including ingest, discovery, access, dissemination, preservation, and portability
3. Knowledge to support researchers in complying with the various mandates of funders, including open access requirements
4. Knowledge to advise on potential data manipulation tools used in the discipline/subject
© Information School / University of Sheffield 2012
Mary Auckland, “Re-skilling for Research,” RLUK, January 2012
MOST SIGNIFICANT SKILLS GAPS (CONTINUED)
5. Knowledge to advise on data mining
6. Knowledge to advocate, and advise on, the use of metadata
7. Ability to advise on the preservation of project records, e.g. correspondence
8. Knowledge of sources of research funding to assist researchers to identify potential funders
9. Skills to develop metadata schema, and advise on discipline/subject standards and practices, for individual research projects
Mary Auckland, “Re-skilling for Research,” RLUK, January 2012
REQUISITE EXPERTISE FOR DIGITAL HUMANITIES AND SOCIAL SCIENCES
Requisite Expertise
Domain/subject expertise
Analytical expertise
Data expertise
Project management expertise
Williford and Henry, “One Culture: Computationally Intensive Research in the Humanities and Social Sciences,” CLIR, 2012
39 schools worldwide and growing
http://www.ischools.org/
The iSchools organization was founded in 2005 by a collective of Information Schools dedicated to advancing the information field in the 21st Century. These schools, colleges, and departments have been newly created or are evolving from programs formerly focused on specific tracks such as information technology, library science, informatics, and information science. While each individual iSchool has its own strengths and specializations, together they share a fundamental interest in the relationships between information, people, and technology.
Digital Curation as a Core Competency
Dean Elizabeth D. Liddy
iSchool, Syracuse University
Symposium on Digital Curation in the Era of Big Data: Career Opportunities & Educational Requirements
July 19, 2012
5 Stages of the Data Life Cycle
• Data Archiving / Preservation
• Data Collection
• Data Management
• Data Analytics
• Data Presentation / Visualization
Three Additional Vital Competencies
DigCCurr Matrix
Competencies for Curators
http://www.ils.unc.edu/digccurr/digccurr-matrix.html/
Preserving Access to Our Digital Future: Building an International Digital Curation Curriculum
Digital Curation Resource Guide
by Charles W. Bailey, Jr.
Publisher, Digital Scholarship
http://digital-scholarship.org/dcrg/dcrg.htm
“This resource guide presents over 200 selected English-language websites and documents that are useful in understanding and conducting digital curation. It covers academic programs, discussion lists and groups, glossaries, file formats and guidelines, metadata standards and vocabularies, models, organizations, policies, research data management, serials and blogs, services and vendor software, software and tools, and training.” Available under a Creative Commons Attribution-NonCommercial 3.0 Unported License.
What will this future – our future – look like?
Evolved Infrastructure
• Easily shareable and findable data sets.
• Data + publications that are preserved for posterity.
Evolved Tools & Resources
• Quickly consumable and re-useable data sets.
• Powerful tools for visualization and analysis across data sets (perhaps even across disciplines).
Evolved Librarians
• Technical professionals proactively engaged with data producers.
• Servers are the new shelves.
The Opportunity Before Us
• Seek out and initiate data projects
– Cross-domain partnerships
– Enhance broad availability
• Pursue value-added services
– Data storage and backup services
– Enhancing data mark-up and findability
– Securing/controlling access to data
– Maintaining provenance
– Developing analytical and visualization tools
– Seeking related data/research
– Hosting and linking data to publications/assets
– Ensuring that data is preserved for the long-term
• Grow your people
– Invest in training your existing staff
– Change the technical profile of who you hire
– Support the evolution of how we educate the field
“People change when the pain of the status quo becomes greater than the fear of making the change.”
– Helen Harkness
http://www.amazon.com/Career-Chase-Creative-Control-Chaotic/dp/0891060987
“If you don't like change, you're going to like irrelevance even less.”
– General Eric Shinseki
Retired United States Army four-star general,
currently US Secretary of Veterans Affairs
Paintings by Xiaoze Xie
Xiaoze Xie immigrated from the People’s Republic of China in 1992, where he was born and studied art and architecture. He has MFA degrees from Beijing and North Texas University, and taught at Bucknell University before assuming his current post at Stanford. His works are in the collections of the Museum of Fine Arts, Houston, the Scottsdale Museum of Contemporary Art and distinguished private collections. Xie’s oil paintings bring together serene qualities of traditional still-life painting and photography. From the long tradition of still-life painting he employs a rich and selected palette to represent the books, which take on a nearly symbolic role.
Webpage: http://art.stanford.edu/profile/Xiaoze+Xie/ Email: [email protected]
“We’d now like to open the floor to shorter speeches disguised as questions.”
Published in The New Yorker, 10/18/2010, by Steve Macone