Tennessee Technological University 1
The Scientific Importance of Big Data
Xia Li
Tennessee Technological University
The Scientific Importance of Big Data
  Financial benefits are the major motivation of big data research
  The technical challenges brought by big data
  The object of "data science"
  The common question behind data -- relationship network
  Causality and relationship
  Big data in social science
  Complexity in data processing
  Changes in the way of thinking
Financial benefits
  According to IDC (International Data Corporation), more than 1.8 zettabytes (1 ZB = 10^21 bytes) of data were created and copied in 2011
  75% of it came from individuals (mainly pictures, videos, and music), more than all printed data, which amounts to 200 petabytes (1 PB = 10^15 bytes)
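To put the two figures on one scale, the arithmetic can be sketched as follows. The variable names are illustrative, and decimal (SI) byte units are assumed because the slide quotes 10^21 and 10^15.

```python
# Decimal (SI) byte units, matching the powers of ten quoted on the slide.
ZETTABYTE = 10**21
PETABYTE = 10**15

total_2011 = 1.8 * ZETTABYTE        # data created and copied in 2011 (IDC figure)
individual_share = 0.75             # fraction attributed to individuals
printed_data = 200 * PETABYTE       # all printed data, per the slide

individual_bytes = total_2011 * individual_share
ratio = individual_bytes / printed_data

print(f"Data from individuals: {individual_bytes / PETABYTE:,.0f} PB")
print(f"Ratio to all printed data: about {ratio:,.0f}x")
```

Under these figures, the individual share alone is several thousand times larger than all printed data.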
Financial benefits
  Google uses very large-scale computing clusters and MapReduce software to process 400 PB of data in one month
  On Facebook, registered users upload more than 1 billion photos; the log files generated each day exceed 300 TB
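As a rough sense of scale, these volumes can be converted into sustained throughput. This is only a back-of-the-envelope sketch: the 30-day month, the decimal units, and the function name are assumptions, not figures from the slide.

```python
PETABYTE = 10**15
TERABYTE = 10**12
GIGABYTE = 10**9
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate(total_bytes, days):
    """Average bytes per second needed to handle total_bytes within `days` days."""
    return total_bytes / (days * SECONDS_PER_DAY)


# 400 PB processed in one month (assumed to be 30 days).
google_rate = sustained_rate(400 * PETABYTE, days=30)
# 300 TB of log files generated per day.
facebook_log_rate = sustained_rate(300 * TERABYTE, days=1)

print(f"Google processing: {google_rate / GIGABYTE:.1f} GB/s")
print(f"Facebook logs:     {facebook_log_rate / GIGABYTE:.2f} GB/s")
```

With these assumptions, the monthly figure corresponds to roughly 154 GB/s around the clock, and the daily log volume to about 3.5 GB/s.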
The technical challenges
  Six departments of the US government started big data research projects to "form a unique branch of learning including mathematics, statistics, computer algorithm"
  Most of the research projects focus on data engineering rather than data science
  The focus includes analysis algorithms and system efficiency
The technical challenges
  Multiscale anomaly detection
  Threat plan in network
  Machine reading
  Real-time analysis of streaming data
  Non-linear random data compression
  Extensible statistical analysis techniques
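The slide only names these challenges. As one small illustration of streaming analysis combined with anomaly detection, the sketch below keeps a running mean and variance in constant memory (Welford's online algorithm) and flags values far from the mean. The class and function names, the threshold, and the warm-up length are all assumptions chosen for the example, not part of the research projects listed.

```python
import math


class RunningStats:
    """Welford's online algorithm: one pass over the data, O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Sample standard deviation; 0.0 until at least two values are seen.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def flag_anomalies(stream, threshold=3.0, warmup=5):
    """Yield values more than `threshold` standard deviations from the running mean.

    Flagged values are not folded into the statistics, so a single outlier
    does not shift the baseline.
    """
    stats = RunningStats()
    for x in stream:
        is_outlier = (
            stats.n >= warmup
            and stats.std > 0
            and abs(x - stats.mean) > threshold * stats.std
        )
        if is_outlier:
            yield x
        else:
            stats.update(x)


readings = [10, 11, 9, 10, 12, 10, 11, 100, 10, 9]  # hypothetical sensor values
print(list(flag_anomalies(readings)))
```

Each value is inspected once and then discarded, which is the property that matters when the stream is too large to store.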
The technical challenges
  New data representation methods
    If the data representation is not suitable, the analysis result is more prone to bias
  Data combination
    Data from different locations need to be combined before they can be processed
  De-redundancy and highly efficient, low-cost data storage
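One common way to approach de-redundancy when combining data from several locations is content-addressed storage: identical payloads are stored once and referenced by their hash. The sketch below is a minimal in-memory illustration under that assumption; `DedupStore` and the record names are hypothetical, and the slide does not prescribe this technique.

```python
import hashlib


class DedupStore:
    """Stores each distinct payload once, keyed by its SHA-256 digest."""

    def __init__(self):
        self._blobs = {}  # digest -> payload bytes (one copy per distinct payload)
        self._index = {}  # record name -> digest

    def put(self, name, payload):
        digest = hashlib.sha256(payload).hexdigest()
        self._blobs.setdefault(digest, payload)  # keep the first copy only
        self._index[name] = digest
        return digest

    def get(self, name):
        return self._blobs[self._index[name]]

    def stored_bytes(self):
        return sum(len(blob) for blob in self._blobs.values())


store = DedupStore()
# The same photo arriving from two hypothetical sites, plus one distinct file.
store.put("site_a/photo.jpg", b"same image bytes")
store.put("site_b/photo_copy.jpg", b"same image bytes")
store.put("site_b/report.txt", b"different content")
print(store.stored_bytes())  # counts the duplicated payload only once
```

Every record stays retrievable by name, while the storage cost grows only with the amount of distinct content.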
The object of "data science"
  Big data research is about how to find new knowledge; the data itself is not the research object
  As a research methodology, it is closely related to artificial intelligence techniques such as data mining, statistical analysis, and information retrieval
The object of "data science"
  The complexity of traditional algorithms grows exponentially as the size and dimension of the problem grow
  For big data at the PB level, new methods are needed
    Traditional AI algorithms can accept O(N log N) or even O(N^3) complexity
    For big data problems, even O(N log N) can hardly be accepted
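A quick estimate shows why complexity that is tolerable on small inputs stops being tolerable at PB scale. The sketch assumes one core executing 10^9 basic operations per second and treats 10^15 items as a stand-in for PB-level data; both numbers are illustrative assumptions.

```python
import math

OPS_PER_SECOND = 1e9  # assumed throughput of a single core (hypothetical)

# Operation counts for the two complexity classes named on the slide.
costs = {
    "N log N": lambda n: n * math.log2(n),
    "N^3": lambda n: n ** 3,
}


def seconds_needed(n, cost):
    """Estimated running time for input size n under the assumed throughput."""
    return cost(n) / OPS_PER_SECOND


for n in (10**3, 10**15):
    for name, cost in costs.items():
        print(f"N = {n:.0e}  {name:8s} {seconds_needed(n, cost):.3g} s")
```

With N = 10^3, the cubic algorithm needs about one second. With N = 10^15, even N log N needs roughly 5 x 10^7 seconds on one such core, which is more than a year.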
The common question behind data -- relationship network
  Big data is composed of individual data items and scattered connections
  Once the connections are combined, it becomes a network
    Gene data becomes a gene network
    World Wide Web data becomes a social network
  Big data exists in an intricately connected data network
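The step from scattered connections to a network can be sketched with an adjacency structure. The function names and the sample links are hypothetical; any pairwise relation (gene interactions, hyperlinks, friendships) could take the place of the tuples.

```python
from collections import defaultdict


def build_network(connections):
    """Combine individual (a, b) connections into an undirected adjacency map."""
    graph = defaultdict(set)
    for a, b in connections:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def connected_component(graph, start):
    """Return every node reachable from `start` by following connections."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


# Hypothetical scattered links between web pages.
links = [("page1", "page2"), ("page2", "page3"), ("page4", "page5")]
network = build_network(links)
print(sorted(connected_component(network, "page1")))
```

Individually, each tuple says little; once combined, questions such as "which items are reachable from this one" become answerable.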
The common question behind data -- relationship network
  The distribution of the World Wide Web
  A scale-free network can be obtained
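The slide does not say how the scale-free structure arises. One standard generative model is preferential attachment (the Barabasi-Albert model), in which new nodes prefer to link to nodes that already have many links. The sketch below assumes that model purely for illustration; the function names and parameters are hypothetical.

```python
import random
from collections import Counter


def preferential_attachment(n_nodes, m=2, seed=0):
    """Grow a network where each new node links to m existing nodes,
    chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    # Start from a small fully connected core of m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each node appears here once per incident edge, so a uniform draw
    # from this list is a degree-proportional draw over nodes.
    endpoints = [node for edge in edges for node in edge]
    for new in range(m + 1, n_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for target in targets:
            edges.append((new, target))
            endpoints.extend((new, target))
    return edges


def degree_counts(edges):
    """Map each degree value to the number of nodes having that degree."""
    degree = Counter(node for edge in edges for node in edge)
    return Counter(degree.values())


edges = preferential_attachment(1000, m=2)
histogram = degree_counts(edges)
for k in sorted(histogram)[:5]:
    print(f"degree {k}: {histogram[k]} nodes")
print("largest degree:", max(histogram))
```

In such a network most nodes keep a small degree while a few hubs collect many links, which is the heavy-tailed pattern usually meant by "scale-free".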
Causality and relationship
  Correlation analysis finds the mutual relationships hidden in data
  Correlation measures: support, confidence, and interest
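For a rule "A implies B" over a set of transactions, the three measures can be computed as below. The sketch takes "interest" to mean the ratio of confidence to the overall frequency of B (often called lift), which is an assumption about the slide's terminology; the basket data is made up.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, interest) for the rule antecedent -> consequent."""
    a, b = set(antecedent), set(consequent)
    n = len(transactions)
    count_a = sum(1 for t in transactions if a <= set(t))
    count_b = sum(1 for t in transactions if b <= set(t))
    count_ab = sum(1 for t in transactions if (a | b) <= set(t))
    if n == 0 or count_a == 0 or count_b == 0:
        raise ValueError("rule is undefined on these transactions")
    support = count_ab / n              # how often A and B occur together
    confidence = count_ab / count_a     # how often B occurs when A does
    interest = confidence / (count_b / n)  # above 1: B is more likely given A
    return support, confidence, interest


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
support, confidence, interest = rule_metrics(baskets, {"bread"}, {"milk"})
print(f"support={support:.2f} confidence={confidence:.2f} interest={interest:.4f}")
```

Here the rule has high support and confidence, yet its interest is slightly below 1: milk is so common overall that buying bread does not make it more likely.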
Causality and relationship
  A and B are related
    The values of A and B are mutually dependent
    We cannot say that A causes B
    We cannot say that B causes A
  Strictly speaking, statistics cannot prove logical causality
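A small simulation illustrates the point. In the sketch, a hidden variable C drives both A and B, so A and B are strongly correlated although neither causes the other. The setup (Gaussian noise, the seed, the sample size) is entirely hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)
# C is a hidden common cause; A and B each depend on C plus independent noise.
c = [rng.gauss(0, 1) for _ in range(5000)]
a = [ci + rng.gauss(0, 0.5) for ci in c]
b = [ci + rng.gauss(0, 0.5) for ci in c]

corr_ab = pearson(a, b)
# Once C is removed, only the independent noise terms remain.
corr_resid = pearson(
    [ai - ci for ai, ci in zip(a, c)],
    [bi - ci for bi, ci in zip(b, c)],
)
print(f"corr(A, B) = {corr_ab:.2f}")
print(f"corr(A - C, B - C) = {corr_resid:.2f}")
```

The correlation between A and B is strong, but it largely disappears once the common cause is accounted for, so the data alone cannot tell us that A causes B or the reverse.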
Big data in social science
  On Facebook, data is generated randomly
  Researchers need to find valuable information in these data
  Big data in social science has some unique characteristics: multi-source and heterogeneous, interactive, socialized, bursty, and highly noisy
Big data in social science
  The future task is not to get more and more data
  It is to mine useful knowledge from the data
  When a child learns to distinguish animals from cars, tens of sample pictures are enough
  How to eliminate unnecessary data sampling becomes a problem
Complexity in data processing
  Original theory
    Time complexity: the time used by an algorithm
    Space complexity: the memory used by an algorithm
  Data size complexity
    Some problems can only be solved once the data size reaches a certain level
    The relationship between prediction confidence probability and data size
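The slide does not give a formula for the relationship between confidence and data size. One standard way to make it concrete is Hoeffding's inequality for estimating a proportion from independent samples: to be within epsilon of the true value with probability at least 1 - delta, it suffices to have n >= ln(2 / delta) / (2 * epsilon^2) samples. The sketch below assumes that setting; the function name is hypothetical.

```python
import math


def samples_needed(epsilon, delta):
    """Sample size sufficient, by Hoeffding's inequality, to estimate a
    proportion within +/- epsilon with probability at least 1 - delta.
    Assumes independent, identically distributed samples bounded in [0, 1]."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))


for epsilon in (0.1, 0.01):
    n = samples_needed(epsilon, delta=0.05)
    print(f"error <= {epsilon}, confidence 95%: n >= {n}")
```

Under this bound, tightening the error by a factor of 10 raises the required data size by a factor of about 100, while raising the confidence level costs only logarithmically more data.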
Changes in the way of thinking
  The fourth paradigm
    Data-intensive research
    All models are wrong, and increasingly you can succeed without them
  Data at the PB level can help us analyze without models and hypotheses
  When data is correlated, statistical algorithms will find new patterns unknown to previous methods
Thank you