Scientific Data Cloud Infrastructure and Services in the Chinese Academy of Sciences
Jianhui Li (lijh@cnic.cn)
Yuanke Wei (weiyuanke@cnic.cn)
Yuanchun Zhou (zyc@cnic.cn)
Computer Network Information Center
Chinese Academy of Sciences
Outline
• About us
– CAS (Chinese Academy of Sciences)
– CNIC (Computer Network Information Center), CAS
– SDC (Scientific Data Center), CNIC, CAS
• About Scientific Data Cloud of CAS
– Data Challenge
– Architecture
– Infrastructure Service
– Middleware Service
– Data Service
• Conclusion
• CAS is a leading academic institution and comprehensive research and development center in natural science, technological science and high-tech innovation in China.
• It was founded in Beijing on 1 November 1949 on the basis of the former Academia Sinica (Central Academy of Sciences) and Peiping Academy of Sciences.
• CNIC is a public support institution for the consistent construction, operation and servicing of the information infrastructure of CAS.
• It is a pioneer, promoter and participant in the informatization of domestic scientific research and scientific research management.
Operation and Services in CNIC
Provided by 7 business departments respectively:
• Scientific Research Network Environment
• Scientific Data Environment
• Supercomputing Environment
• Informatization of Research Management
• Internet-based Science Popularization and Education
• Internet Fundamental Resource Services
Scientific Data Center
• The Scientific Data Center (SDC) is the support facility in charge of the construction, management, operation and maintenance of the CAS Informatization Data Application Environment, and has led the implementation of the CAS Scientific Database Project for more than 20 years.
• SDC provides storage services, data services and related application technology services for the entire CAS.
• SDC hosts the Secretariat of the Committee on Data for Science and Technology (CODATA) and the CAS Secretariat for the World Wide Web Consortium (W3C).
• SDC's vision is to become an important facilitator of the exchange and application of scientific data resources, a key technology supplier across the scientific data lifecycle, and a leader in transforming scientific data into knowledge services.
Hotter and hotter in data research
• Mar. 29, 2012: the Obama Administration announced the “Big Data Research and Development Initiative” ($200 million), aimed at improving our ability to extract knowledge and insights from large and complex collections of digital data.
• Feb. 11, 2011: Science issued a special online collection, “Dealing with Data”.
• Sep. 2009: Nature issued “Data’s shameful neglect”: Research cannot flourish if data are not preserved and made accessible. All concerned must act accordingly.
• The Second International Symposium on Dataology & Data Science was held 3 days ago in China.
• Data today: difficult to discover, difficult to access, being lost.
Data Driven Scientific Discovery
• Data is regarded as the most valuable thing.
• “The impact of Jim Gray’s thinking is continuing to get people to think in a new way about how data and software are redefining what it means to do science.” (Bill Gates)
• Scientific discovery based on data-intensive computing is now considered the “fourth paradigm”, after theoretical, experimental, and computational science.
Data Growth Beyond Moore’s Law
• IDC: data volume doubles in less than 18 months
• Huge volume
• Rapid increase
• Various types and formats
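As a rough illustration of what an 18-month doubling period implies, the sketch below computes the cumulative growth factor over a few time spans. The function name `growth_factor` and the 18-month default are assumptions made for this example; the slide gives no formula, only the doubling claim attributed to IDC.

```python
def growth_factor(years: float, doubling_months: float = 18.0) -> float:
    """Return how many times a data volume multiplies over `years`,
    assuming it doubles once every `doubling_months` months."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    doublings = years * 12.0 / doubling_months  # number of doubling periods elapsed
    return 2.0 ** doublings


if __name__ == "__main__":
    # Illustrative time spans only; not figures from the slides.
    for years in (1.5, 3, 6, 9):
        print(f"after {years} years: x{growth_factor(years):.0f}")
```

A shorter doubling period (as "less than 18 months" suggests) can be explored by passing a smaller `doubling_months`.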
Data Challenge
• Scientists are being overwhelmed by exploding volumes of scientific data.
• Much scientific research needs data that is distributed across different locations.
• There is a growing gap between the capability of modern scientific instruments to produce data and the ability of scientists to handle it.
• Viewing, manipulating, storing, moving, sharing, and interpreting massive data has become a great challenge.
Scientific Data Deluge in CAS
• Large scientific facilities produce huge volumes of data
– 20+ in operation
– 20+ under construction
• Long-term field observation stations
– 100+ stations covering ecology, environment, space, etc.
• Long-term research data need to be archived and shared
– 100+ institutes
Figures: large scientific facilities; field observation stations
Advanced CI for Data Lifecycle in CAS
• Data sources: 1. field observation stations; 2. large scientific facilities; 3. others
• High-speed network: CSTNET, CSTNET-CNGI, GLORIAD
• Data centers: storage and preservation; curation; sharing and service
• Supercomputing grid: computing; analysis; mining; visualization
• Data-intensive e-Science activities and applications
• Data lifecycle stages (data and information stream): generation and collection; transmission; computing and analysis; storage and curation; application
Cloud Computing
• Cloud computing is a mixed evolution of grid computing, distributed computing, parallel computing, utility computing, network storage technologies, virtualization, etc.
• Its characteristics (large scale, virtualization, high reliability, generality, expandability, on-demand service and very low cost) have made it a popular computing paradigm.
• It can bridge scientists and massive data.
• The Chinese Academy of Sciences Scientific Data Cloud (CASSDC) uses cloud technology to give scientists convenient ways to use powerful information infrastructure, massive scientific data and rich scientific software.
Services of CASSDC
• Diagram layers: Infrastructure, Middleware, Scientific Data, Integrated Service
• Diagram services: Infrastructure Service, Data Service
• Diagram components: Network, Job Scheduler, Data Publisher, MetaData Manager, Data Transport
Scientific Data Infrastructure
• Middleware (scientific data grid middleware, internet-based storage service middleware…)
• Scientific databases
• Massive storage system
• Data-intensive computing facility
• High-speed network
• Application-enabled environments and typical e-Science practice
• Software and toolkits (scientific data collection, curation, and publishing; data analysis and visualization…)
Data Centers Distribution of CASSDC
• Scientific data: ~1 PB; more than 60 institutions; multiple disciplines
• Storage capacity: ~22 PB (50 PB)
• Centers: 1 major center, 1 archive center, 12 middle-size centers
• Computing capacity: ~5000 (10000) CPU cores; dedicated design for data-intensive computing (DIC)
System Architecture of the Major Center
Enabling Technology: Infrastructure
Global File System of Cloud Storage
Enabling Technology: Infrastructure
On-the-fly provisioning of a computing cluster
• Components in the diagram: Cluster Manager, DHCP server, TFTP server, storage holding node images, and a pool of computing nodes (CPU and memory)
• Steps (1) to (3) in the diagram carry the labels IP, kernel and WOL
• Step (4): each node switches to its root file system from an image on storage
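The diagram labels one provisioning step "WOL". Assuming this refers to Wake-on-LAN, a cluster manager could power on idle nodes from the pool by broadcasting the standard magic packet: 6 bytes of 0xFF followed by the target MAC address repeated 16 times. The sketch below is a minimal illustration of that packet format, not the CASSDC implementation; the function names, the example MAC address, and the choice of UDP port 9 are assumptions.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet for a MAC like '00:11:22:33:44:55'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected 12 hex digits in MAC, got {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    # Standard layout: 6 x 0xFF, then the 6-byte MAC repeated 16 times.
    return b"\xff" * 6 + mac_bytes * 16


def wake_node(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP (port 9 is a common convention)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))


if __name__ == "__main__":
    # Hypothetical MAC address; only the packet is built here, nothing is sent.
    packet = build_magic_packet("00:11:22:33:44:55")
    print(len(packet), "bytes")
```

After waking, a node in such a setup would typically obtain its address from the DHCP server and fetch a kernel from the TFTP server before switching to its root file system, which matches the components named in the diagram.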
Scientific Databases (SDB)
• A long-term mission started in 1986 and funded by CAS
– many institutes involved
– long-term, large-scale collaboration
– data from research, for research
• Collecting multi-discipline research data and promoting data sharing
– More than 350 research databases and 500 datasets from 61 institutes
– Over 200 TB of data available for open access and download: http://www.csdb.cn
Scientific Databases (cont.)
• Focusing on data integration and on upgrading research databases into resource databases and even reference databases
• Diagram tiers: research databases, resource database, reference database, application-oriented database
Scientific Databases (cont.)
• 8 resource databases
– Geo-Science
– Biodiversity
– Chemistry
– Astronomy
– Space Science
– Microbiology and virus
– Material science
– Environment
• 2 reference databases
– China Species
– Compound
• 4 application-oriented databases
– High Energy (ITER)
– Western Environment Research
– Ecology Research
– Qinghai Lake Research
Scientific Databases (cont.)
• 37 research databases
– Physics & Chemistry, Geosciences, Biosciences, Atmospheric & Ocean Science, Energy Science, Material Science, Astronomy & Space Science
• Distribution by discipline (pie chart): GeoScience 43%, BioScience 18%, Chemistry 9%, ICT 6%, Physics 6%, Ocean 5%, Material 5%, Space 4%, Energy 3%, Astronomy 1%
CAS Scientific Data Grid
• SDG is
– built upon the Scientific Databases, supporting uniform, convenient and secure discovery of and access to large-scale, distributed and heterogeneous scientific data
• Building scientific data application grids according to domain requirements
– Integrating distributed data, analysis tools, and storage and computing facilities, and providing a uniform data service interface
– 4 pilot grids
• Bioscience grid
• Geoscience grid
• Chemistry grid
• Astronomy and space science grid
Scientific Data Grid: Architecture
Organization Architecture of SDG
SDG: Platform and Middleware
• Platform
– SDGIM: Information Management
– SDGOM: Operation Management
– SDGSA: Storage Service
– SDGMS: Monitoring and Statistics
• Middleware
– SDGDD: Data Publishing
– SDGDT: Data Transfer Toolkit
– SDGDC: Data Compression Toolkit
– SDGMM: Metadata Management
– SDGJS: Job Scheduler
Tools for data management and service
An Integrated Case on Geography Supported by CASSDC
• Project: High Precision Display of Earth Surface
• Data and computing resources are both distributed
• The model comes from a CAS scientist
• Adopted middleware:
– Data search
– Data transport
– On-the-fly computing provisioning
– Job scheduler
• It handles massive data computation that some commercial geometric software cannot
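The middleware chain in this case (data search, data transport, on-the-fly computing provisioning, job scheduling) can be pictured as a simple pipeline. The sketch below is purely illustrative: the catalog entries, site names, the 50 GB-per-node sizing rule and every function name are hypothetical, and none of them come from the CASSDC middleware itself.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Dataset:
    name: str
    site: str       # where the data currently lives
    size_gb: float


# Hypothetical catalog of distributed datasets (not real CASSDC holdings).
CATALOG = [
    Dataset("dem_tile_north", "site_a", 120.0),
    Dataset("dem_tile_south", "site_b", 80.0),
    Dataset("species_records", "site_b", 5.0),
]


def search(catalog: List[Dataset], keyword: str) -> List[Dataset]:
    """Data search: find datasets whose name contains the keyword."""
    return [d for d in catalog if keyword in d.name]


def transport(datasets: List[Dataset], target_site: str) -> List[Dataset]:
    """Data transport: model moving each dataset to the computing site."""
    return [Dataset(d.name, target_site, d.size_gb) for d in datasets]


def provision(total_gb: float, gb_per_node: float = 50.0) -> int:
    """On-the-fly provisioning: size the cluster from the data volume."""
    return max(1, math.ceil(total_gb / gb_per_node))


def schedule(datasets: List[Dataset], nodes: int) -> Dict[int, List[str]]:
    """Job scheduler: assign datasets to nodes round-robin."""
    assignments: Dict[int, List[str]] = {n: [] for n in range(nodes)}
    for i, d in enumerate(datasets):
        assignments[i % nodes].append(d.name)
    return assignments


def run_case(keyword: str, compute_site: str) -> dict:
    """Chain the four middleware stages for one request."""
    found = search(CATALOG, keyword)
    staged = transport(found, compute_site)
    nodes = provision(sum(d.size_gb for d in staged))
    return {"nodes": nodes, "assignments": schedule(staged, nodes)}


if __name__ == "__main__":
    print(run_case("dem", "compute_site"))
```

Keeping each stage a separate function mirrors the slide's point that data and computing are distributed: each stage could run against a different site without the scientist seeing the infrastructure underneath.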
An Integrated Case on Biology Supported by CASSDC
• Gene Alignment Project
• Data:
– Microbiology Institute
– World Data Center for Microorganisms
– Wuhan Virus Institute
• Computing:
– CNIC
– Microbiology Institute
• Adopted middleware:
– Data search
– Data transport
– Job scheduler
– User authentication
An Integrated Case on Biology Supported by CASSDC (cont.)
Cooperation
• International Organization Membership
Cooperation with Europe
CSTNET provides network support for data transmission between Europe and China:
• ITER
• Global Earth Observation System of Systems
• CERN LHC: ATLAS & CMS
• ARGO-Yangbajing
Challenges
• On-demand linking of multi-disciplinary data based on semantics
• Big Data processing
– Highly scalable, low cost, high throughput
– On-demand, flexible data processing
• Integrating data, storage, computing, analysis models, etc. into a whole system driven by one specific scientific problem
– Making infrastructure invisible to scientists
Conclusion
• Scientific discovery has become increasingly data intensive, and it calls for reliable and easily accessible scientific data infrastructure
• CAS continues to promote the building of scientific data infrastructure and data-intensive e-Science practices
• We are seeking potential cooperation in data-intensive e-Science and data cloud
Thank you!