Overview of data sources
• http://www.knuggets.com/datasets/index.html
Machine learning data
• UCI Machine Learning Repository: archive.ics.uci.edu
Data Shop: the world’s largest repository of learning interaction data
• https://pslcdatashop.web.cmu.edu
Getting data is not the problem: there is a very large variety of data sources
06.09.17 Frank Kienle 3
• Formally, a "database" refers to a set of related data and the way it is organized.
• A database manages data efficiently and allows users to perform multiple tasks with ease. Efficient access to the data is usually provided by a "database management system" (DBMS).
• A database management system stores, organizes and manages a large amount of information within a single software application.
• Use of this system increases the efficiency of business operations and reduces overall costs.
• Different database systems exist, designed with respect to:
  • the data to be stored in the database
  • the relationships between the different data elements, i.e. dependencies within the data which can be modeled by mathematical relations
  • the logical structure imposed on the data on the basis of these relationships. The goal is to arrange the data into a logical structure which can then be mapped onto the storage objects
Database
Scale up: using more and more main memory
Scale out: using more and more computers
Definition (m: complexity order): for N data items, an algorithm scales with N^m, e.g. polynomial complexity
Parallelize it (k nodes): the algorithm scales with N^m/k
Goal: find algorithms with complexity N log(N), which relates e.g. to trees (one touch)
Scalability in big data
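As a rough illustration of the scaling argument, the sketch below compares abstract operation counts (not measured runtimes). The function names and the choice of N, m and k are assumptions made for this example, and the division by k assumes perfectly even parallelization.

```python
import math

def polynomial_cost(n: int, m: int, k: int = 1) -> float:
    """Abstract operation count N^m, split evenly over k nodes (idealized)."""
    return n ** m / k

def nlogn_cost(n: int) -> float:
    """Abstract operation count N * log2(N)."""
    return n * math.log2(n)

n = 1_000_000
single_node = polynomial_cost(n, m=2)            # N^2 on one node
hundred_nodes = polynomial_cost(n, m=2, k=100)   # N^2 spread over 100 nodes
tree_like = nlogn_cost(n)                        # N log N on one node

print(f"N^2, 1 node:    {single_node:.3g}")
print(f"N^2, 100 nodes: {hundred_nodes:.3g}")
print(f"N log N:        {tree_like:.3g}")
```

Under these assumptions, adding 100 nodes still leaves the quadratic algorithm far behind a single-node N log(N) algorithm, which is why the choice of algorithm matters more than scaling out.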
CAP theorem
C: consistency (do all applications see the same data?)
Any data written to the database must be valid according to all defined rules
A: availability (can I interact with the system in the presence of failures?)
P: partitioning
If two sections of your system cannot talk to each other, can they make forward progress on their own?
- If not, you sacrifice availability
- If so, you might have to sacrifice consistency
Dynamo Riak Voldemort Cassandra CouchDB
Bigtable Hbase Hypertable Megastore Spanner Accumulo
RDBMS
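The trade-off under a partition can be made concrete with a toy model. The sketch below is not how any of the listed systems work; the class names, the `mode` flag and the `partitioned` switch are hypothetical and exist only to show the two choices.

```python
class Replica:
    """One node holding a local copy of the data."""
    def __init__(self):
        self.data = {}

class TwoNodeStore:
    """Toy store with two replicas; mode is 'CP' or 'AP' (hypothetical names)."""
    def __init__(self, mode: str):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, replica: Replica, key, value) -> None:
        if self.partitioned:
            if self.mode == "CP":
                # Refuse the write: consistency is kept, availability is lost.
                raise RuntimeError("unavailable during partition")
            # AP: accept the write locally; the other replica does not see it.
            replica.data[key] = value
            return
        # No partition: both replicas receive the write.
        self.a.data[key] = value
        self.b.data[key] = value

cp = TwoNodeStore("CP")
cp.write(cp.a, "x", 1)
cp.partitioned = True
try:
    cp.write(cp.a, "x", 2)
    cp_accepted = True
except RuntimeError:
    cp_accepted = False          # write rejected, replicas still agree

ap = TwoNodeStore("AP")
ap.write(ap.a, "x", 1)
ap.partitioned = True
ap.write(ap.a, "x", 2)           # write accepted, replicas now disagree
print(cp_accepted, cp.a.data, cp.b.data)
print(ap.a.data, ap.b.data)
```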
Relational databases, key idea:
§ Storage and retrieval of large quantities of related data.
§ When creating a database you should think about which tables are needed and what relationships exist between the data in your tables.
§ Relational algebra
§ Physical/logical data independence
Think about the design in advance
Relational Data Bases
A database is created for the storage and retrieval of data. We want to be able to INSERT data into the database and we want to be able to SELECT data from the database. A database query language called the Structured Query Language (SQL) was invented for these tasks.
Structured query language (SQL)
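A minimal sketch of INSERT and SELECT, using Python's built-in `sqlite3` module with an in-memory database. The table name `measurements` and its columns are assumptions made for this example.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (ts INTEGER PRIMARY KEY, value REAL)")

# INSERT: store three rows, using placeholders instead of string formatting.
conn.executemany(
    "INSERT INTO measurements (ts, value) VALUES (?, ?)",
    [(1, 30.0), (2, 25.0), (5, 12.0)],
)
conn.commit()

# SELECT: retrieve only the rows whose value exceeds a threshold.
rows = conn.execute(
    "SELECT ts, value FROM measurements WHERE value > ? ORDER BY ts",
    (20,),
).fetchall()
print(rows)  # [(1, 30.0), (2, 25.0)]
conn.close()
```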
When you can do JOINs, it is good for analytics. When a database does not provide joins, all of that work is left to the users (i.e. on the client side).
Fundamentals of data exploration (joins)
Outer Relational Join (on time stamp)
Input table (room):
Timestamp[s]   ValueRoom[Wa2]
1              30
2              25
5              12

Input table (home):
Timestamp[s]   ValueHome[Wa2]
1              100
2              78
3              99
4              70

Result of the outer join:
Timestamp[s]   ValueRoom[Wa2]   ValueHome[Wa2]
1              30               100
2              25               78
3              NaN              99
4              NaN              70
5              12               NaN
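The outer join on the time stamp can be sketched in plain Python as follows. The dictionaries hold the room and home values from the slide, the function name `outer_join` is hypothetical, and `None` stands in for NaN.

```python
# Time stamp -> value, taken from the two input tables.
room = {1: 30, 2: 25, 5: 12}
home = {1: 100, 2: 78, 3: 99, 4: 70}

def outer_join(left: dict, right: dict) -> list:
    """Keep every time stamp that occurs in either table."""
    keys = sorted(left.keys() | right.keys())
    # dict.get returns None where a table has no entry for that time stamp.
    return [(k, left.get(k), right.get(k)) for k in keys]

joined = outer_join(room, home)
for row in joined:
    print(row)
```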
Left Join (on time stamp)
Input table (room, left side):
Timestamp[s]   ValueRoom[Wa2]
1              30
2              25
5              12

Input table (home, right side):
Timestamp[s]   ValueHome[Wa2]
1              100
2              78
3              99
4              70

Result of the left join:
Timestamp[s]   ValueRoom[Wa2]   ValueHome[Wa2]
1              30               100
2              25               78
5              12               NaN
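The same left join can be expressed in SQL. The sketch below uses Python's built-in `sqlite3` module; the table names `room` and `home` and the column names `ts` and `value` are assumptions made for this example, and SQL NULL (Python `None`) plays the role of NaN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE room (ts INTEGER, value INTEGER)")
conn.execute("CREATE TABLE home (ts INTEGER, value INTEGER)")
conn.executemany("INSERT INTO room VALUES (?, ?)", [(1, 30), (2, 25), (5, 12)])
conn.executemany("INSERT INTO home VALUES (?, ?)", [(1, 100), (2, 78), (3, 99), (4, 70)])

# LEFT JOIN keeps every row of the left table (room) and fills
# missing matches from the right table (home) with NULL.
left_rows = conn.execute(
    """
    SELECT room.ts, room.value, home.value
    FROM room
    LEFT JOIN home ON room.ts = home.ts
    ORDER BY room.ts
    """
).fetchall()
print(left_rows)  # [(1, 30, 100), (2, 25, 78), (5, 12, None)]
conn.close()
```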
Storing data efficiently is all about the application
schema-less vs. schema
write-centric vs. read-centric
transactional vs. analytics
batch vs. stream
Key-value object
• A set of key-value pairs
Extensible record (XML or JSON)
• Families of attributes have a schema
• New attributes may be added
• Many predictive analytics tasks will require a kind of record
• Many REST APIs will deliver JSON (YAML, XML) structures
• Example: Twitter feeds
Key-value stores (document stores might be a subset)
• No schema, no exposed nesting
• Often raw data (scalable to petabytes)
• Simple analytics tasks on top
Different data structure
key      value
45777    Frank Kienle, Germany
Ux_78    Please learn
321-87   Random data
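To contrast the two structures, the sketch below stores JSON records as opaque byte values in a plain Python dictionary that stands in for a key-value store. The helper names `put` and `get`, the keys and the record fields are all hypothetical.

```python
import json

kv_store = {}  # stand-in for a key-value store: keys map to opaque bytes

def put(key: str, value: bytes) -> None:
    kv_store[key] = value

def get(key: str) -> bytes:
    return kv_store[key]

# An extensible record: attributes follow a loose schema.
record = {"user": "example_user", "text": "hello"}
put("msg:1", json.dumps(record).encode("utf-8"))

# A new attribute can be added later without changing any schema.
record["lang"] = "en"
put("msg:2", json.dumps(record).encode("utf-8"))

# The store never looks inside the value; the client decodes the record.
first = json.loads(get("msg:1").decode("utf-8"))
second = json.loads(get("msg:2").decode("utf-8"))
print(first)
print(second)
```

The store only offers lookup by key, so any filtering on attributes such as `lang` has to happen on the client after decoding.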
The ability to replicate and partition data over many servers
• Sharding: horizontal partitioning of the data set
No query language: a simple API is defined
Ability to scale operations over many servers
• Throughput increase
• Due to the missing (language) query layer, each operation has to be designed against the API
Operations often have restrictions regarding data locality
New features can be added dynamically to data records (no fixed schema)
Consistency model often weak (no modeling of transactions)
(typical) NoSQL data base features
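Sharding with a simple put/get API can be sketched as below. This is a single-process illustration in which each shard is just a dictionary; the class name `ShardedStore` and the hash-modulo placement are assumptions for this example, and real systems often use other placement schemes (for instance consistent hashing).

```python
import hashlib

class ShardedStore:
    """Toy key-value store that partitions keys horizontally over shards."""

    def __init__(self, num_shards: int):
        # Each dict stands in for one server.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key: str) -> int:
        # Stable hash, so the same key always maps to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key: str, value) -> None:
        self.shards[self._shard_for(key)][key] = value

    def get(self, key: str):
        return self.shards[self._shard_for(key)].get(key)

store = ShardedStore(num_shards=4)
for i in range(100):
    store.put(f"user:{i}", {"id": i})

print(store.get("user:42"))
print([len(shard) for shard in store.shards])
```

There is no query language here: every operation goes through `put` and `get`, and anything that spans several keys has to be assembled by the caller.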
In-memory database
• primarily relies on main memory for data storage
• main purpose is faster analytics on data
• relational or unstructured data structure
• memory-optimized data structures
Main memory database system (MMDB)
Advantage column-oriented:
• Reading efficiency: more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data
  select col_1, col_2 from table where col_2 > 5 and col_2 < 45;
• Writing efficiency: more efficient when new values of a column are supplied for all rows at once
Advantage row-oriented:
• Reading efficiency: more efficient when many columns of a single row are required at the same time, and when row size is relatively small
• Writing efficiency: more efficient when writing a new row if all of the row data is supplied at the same time, as the entire row can be written with a single disk seek.
Row vs. Column data stores
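The difference between the two layouts can be sketched with in-memory Python structures. The sample values and the extra column `col_3` are made up for this example, and the code only shows which data each layout has to touch; it does not model disk access.

```python
# Row-oriented layout: one complete record per entry.
rows = [
    {"col_1": "a", "col_2": 3,  "col_3": 1.5},
    {"col_1": "b", "col_2": 10, "col_3": 2.5},
    {"col_1": "c", "col_2": 44, "col_3": 3.5},
    {"col_1": "d", "col_2": 50, "col_3": 4.5},
]

# Column-oriented layout: one list per column, built from the same data.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# select col_1, col_2 ... where col_2 > 5 and col_2 < 45
# Row layout: every full record is visited, including the unused col_3.
row_result = [(r["col_1"], r["col_2"]) for r in rows if 5 < r["col_2"] < 45]

# Column layout: only the col_1 and col_2 lists are read.
col_result = [
    (c1, c2)
    for c1, c2 in zip(columns["col_1"], columns["col_2"])
    if 5 < c2 < 45
]

print(row_result)
print(col_result)
```

Appending one new record is a single `rows.append(...)` in the row layout, whereas the column layout needs one append per column, which mirrors the writing trade-off described above.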
Processing types
OLTP: On-line Transaction Processing, e.g. business transactions (insert, update, delete)
OLAP: On-line Analytical Processing, e.g. complex analytics (aggregation of historical data)
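The two processing types can be contrasted on a single small table. The sketch uses Python's built-in `sqlite3` module; the `orders` table and its contents are hypothetical, and in practice OLTP and OLAP workloads often run on separate, differently optimized systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# OLTP style: short transactions that insert, update or delete single rows.
# "with conn" commits on success and rolls back if an exception is raised.
with conn:
    conn.execute("INSERT INTO orders (customer, amount) VALUES ('alice', 10.0)")
    conn.execute("INSERT INTO orders (customer, amount) VALUES ('bob', 20.0)")
    conn.execute("INSERT INTO orders (customer, amount) VALUES ('alice', 5.0)")
with conn:
    conn.execute("UPDATE orders SET amount = 25.0 WHERE id = 2")
    conn.execute("DELETE FROM orders WHERE id = 3")

# OLAP style: one read-only query that aggregates over the stored history.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)  # [('alice', 10.0), ('bob', 25.0)]
conn.close()
```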
Spanner idea: planet-scale database system
"… we believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions …"
Loose consistency for predictive analytics is horrible
Loose consistency is a no-go for prescriptive analytics (e.g. dynamic pricing)
Systems should always be designed for usability
Many trends in databases are going back to data consistency