… Is Not A Query Language!
It's For Data Transformation on Wide-Column Stores
M. Sc. Johannes Schildgen, 2015-07-08
2
"A DBA walks into a NoSQL bar, but turns and leaves because he couldn't find a table"
3
Column Families
RowId info children
4
RowId   info                                  children
Peter   born: 1965, cmpny: IBM, salary: 70k   Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa    born: 1997, school: BSIT
Column Families
[Graph view: nodes "Peter, 1965, IBM, 70k" and "Lisa, 1997, BSIT"; edge labels €10, €5, €0, €7]
5
HBase API
put 'pers', 'Carl', 'info:born', '1982'
put 'pers', 'Carl', 'info:school', 'BSIT'
put 'pers', 'Carl', 'info:school', 'BUIT'
get 'pers', 'Carl'
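The effect of these shell commands can be illustrated with a small in-memory model. This is only a sketch: the `WideColumnTable` class and its dict layout are hypothetical stand-ins, not the HBase client API, and version history is ignored so that a later put to the same cell simply replaces the visible value.

```python
from collections import defaultdict


class WideColumnTable:
    """Toy stand-in for one table (hypothetical, not the HBase client API)."""

    def __init__(self):
        # row id -> "family:qualifier" -> value
        self.rows = defaultdict(dict)

    def put(self, row_id, column, value):
        # A later put to the same cell replaces the value that get returns.
        self.rows[row_id][column] = value

    def get(self, row_id):
        # Return a copy of all cells of the row (empty if the row is unknown).
        return dict(self.rows.get(row_id, {}))


pers = WideColumnTable()
pers.put("Carl", "info:born", "1982")
pers.put("Carl", "info:school", "BSIT")
pers.put("Carl", "info:school", "BUIT")
carl = pers.get("Carl")
print(carl)
```

Each row carries its own set of columns, which is why no table schema has to be declared before the puts.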
6
Jaspersoft HBase QL
{
  "tableName": "pers",
  "deserializerClass": "com.jaspersoft…DefaultDeserializer",
  "filter": {
    "SingleColumnValueFilter": {
      "family": "info",
      "qualifier": "school",
      "compareOp": "EQUAL",
      "comparator": {
        "SubstringComparator": {
          "substr": "BSIT"
        }
      }
    }
  }
}
$\sigma_{school = 'BSIT'}(pers)$
http://community.jaspersoft.com/wiki/jaspersoft-hbase-query-language
7
Phoenix
SELECT * FROM pers WHERE school = 'BSIT'
$\sigma_{school = 'BSIT'}(pers)$
"Parent of each person?"
https://github.com/forcedotcom/phoenix
8
9
Input Table Output Table
10
Input Cell: RowID, Column, Value
Output Cell: RowID, Column, Value
11
Input Cell: _r (row id), _c (column), _v (value)
Output Cell: _r (row id), _c (column), _v (value)
12
Input Cell: _r = Lisa, _c = born, _v = 1997
Output Cell: _r = Lisa, _c = born, _v = 1997
OUT._r <- IN._r,
OUT.born <- IN.born;
$\pi_{born}(pers)$
13
Input Cell: _r = Lisa, _c = born, _v = 1997
Output Cell: _r = Lisa, _c = born, _v = 1997
OUT._r <- IN._r,
OUT.born <- IN.born,
OUT.school <- IN.school;
$\pi_{born, school}(pers)$
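The projection script can be read as a per-row rule: keep the row id and copy only the named columns. A minimal Python sketch of that reading follows. The `project` helper and the dict-of-dicts table layout are illustrative assumptions, not part of NotaQL, and the sketch assumes that a row with none of the requested columns produces no output.

```python
def project(table, columns):
    """Copy the row id and only the listed columns (OUT.born <- IN.born, ...)."""
    out = {}
    for row_id, row in table.items():
        kept = {c: row[c] for c in columns if c in row}
        if kept:  # assumption: no matching column means no output cells
            out[row_id] = kept
    return out


pers = {
    "Peter": {"born": "1965", "cmpny": "IBM", "salary": "70k"},
    "Lisa": {"born": "1997", "school": "BSIT"},
}
result = project(pers, ["born", "school"])
print(result)
```

Peter has no school column, so his output row simply lacks it; no NULL placeholder is needed in a wide-column table.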
14
Input Cell: _r = Lisa, _c = born, _v = 1997
Output Cell: _r = Lisa, _c = born, _v = 1997
OUT._r <- IN._r,
OUT.$(IN._c) <- IN._v;
$\sigma_{school = 'BSIT'}(pers)$
15
Input Cell: _r = Lisa, _c = born, _v = 1997
Output Cell: _r = Lisa, _c = born, _v = 1997
IN-FILTER: school = 'BSIT',
OUT._r <- IN._r,
OUT.$(IN._c) <- IN._v;
Row predicate: school = 'BSIT'
$\sigma_{school = 'BSIT'}(pers)$
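One way to picture the selection is a loop that first checks the row predicate and then copies every cell of a passing row, with the output column name taken from the input column name. The sketch below assumes a dict-of-dicts table; `select_rows` is a hypothetical helper, not NotaQL's implementation.

```python
def select_rows(table, row_predicate):
    """IN-FILTER plus OUT.$(IN._c) <- IN._v: copy all cells of passing rows."""
    out = {}
    for row_id, row in table.items():
        if not row_predicate(row):  # row violates the row predicate
            continue
        for column, value in row.items():  # IN._c and IN._v of each cell
            out.setdefault(row_id, {})[column] = value
    return out


pers = {
    "Peter": {"born": "1965", "cmpny": "IBM", "salary": "70k"},
    "Lisa": {"born": "1997", "school": "BSIT"},
}
bsit = select_rows(pers, lambda row: row.get("school") == "BSIT")
print(bsit)
```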
16
That was: Selection and Projection
17
Now: Grouping
18
Input Cell: _r = Peter, _c = cmpny, _v = IBM
Output Cell: _r = IBM, _c = salsum, _v = 645k
Salary sum of each company.
OUT._r <- IN.cmpny,
OUT.salsum <- SUM(IN.salary);
19
OUT._r <- IN.cmpny,
OUT.salsum <- SUM(IN.salary);
Input table:
RowId   info
Eve     born: 1965, cmpny: IBM, salary: 70k
Carl    born: 1966, cmpny: IBM, job: intern
Julia   born: 1967, cmpny: IBM, salary: 80k
Lisa    born: 1997, school: BSIT, salary: 1k
Intermediate output cells: (IBM, salsum) = 70k and (IBM, salsum) = 80k
Output table:
RowId   salsum
IBM     150k
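The grouping behaviour can be sketched in a few lines of Python. Salaries are written in thousands, and the assignment of values to the rows Eve, Carl, Julia and Lisa follows the order on the slide, which is an assumption. The sketch also assumes that a row lacking either cmpny or salary contributes nothing, which matches the 150k result shown. `salary_sum_per_company` is an illustrative name.

```python
from collections import defaultdict


def salary_sum_per_company(table):
    """OUT._r <- IN.cmpny, OUT.salsum <- SUM(IN.salary)."""
    sums = defaultdict(int)
    for row in table.values():
        # assumption: rows without cmpny or without salary emit no cell
        if "cmpny" in row and "salary" in row:
            sums[row["cmpny"]] += row["salary"]
    return {company: {"salsum": total} for company, total in sums.items()}


pers = {
    "Eve": {"born": 1965, "cmpny": "IBM", "salary": 70},
    "Carl": {"born": 1966, "cmpny": "IBM", "job": "intern"},
    "Julia": {"born": 1967, "cmpny": "IBM", "salary": 80},
    "Lisa": {"born": 1997, "school": "BSIT", "salary": 1},
}
salsum = salary_sum_per_company(pers)
print(salsum)
```

Because several input rows map to the same output cell (IBM, salsum), the aggregate function decides how the colliding values are combined.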
20
Advanced Transformations: More Filters
21
RowId   info                                  children
Peter   born: 1965, cmpny: IBM, salary: 70k   Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa    born: 1997, school: BSIT
[Graph view: nodes "Peter, 1965, IBM, 70k" and "Lisa, 1997, BSIT"; edge labels €10, €5, €0, €7]
22
[Table pers: rows Peter and Lisa, column families info and children]
OUT._r <- IN._r,
OUT.$(IN._c) <- IN._v;
23
[Table pers: rows Peter and Lisa, column families info and children]
IN-FILTER: COL_COUNT(children)>0,
OUT._r <- IN._r,
OUT.$(IN._c) <- IN._v;
24
[Table pers: rows Peter and Lisa, column families info and children]
IN-FILTER: COL_COUNT(children)>0,
OUT._r <- IN._r,
OUT.$(IN.children._c) <- IN._v;
25
[Table pers: rows Peter and Lisa, column families info and children]
IN-FILTER: COL_COUNT(children)>0,
OUT._r <- IN._r,
OUT.$(IN.children._c?(@>5)) <- IN._v;
Cell predicate: ?(@>5)
26
[Table pers: rows Peter and Lisa, column families info and children]
IN-FILTER: COL_COUNT(children)>0,
OUT._r <- IN._r,
OUT.$(IN.children._c?(!Carl)) <- IN._v;
Cell predicate: ?(!Carl)
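The difference between the row predicate and the cell predicates can be sketched as two nested checks. The code assumes that all four children cells belong to Peter and that Lisa has none, which is one reading of the slide's table. Amounts are plain integers in euros, and `filter_children` is a hypothetical helper.

```python
def filter_children(table, cell_predicate):
    """Keep children cells of rows that have children and pass the cell predicate."""
    out = {}
    for row_id, row in table.items():
        children = row.get("children", {})
        if len(children) == 0:  # IN-FILTER: COL_COUNT(children)>0
            continue
        kept = {name: amount for name, amount in children.items()
                if cell_predicate(name, amount)}
        if kept:
            out[row_id] = kept
    return out


pers = {
    "Peter": {
        "info": {"born": 1965, "cmpny": "IBM", "salary": "70k"},
        "children": {"Lisa": 5, "Carl": 0, "Susi": 10, "Toni": 7},  # assumed assignment
    },
    "Lisa": {"info": {"born": 1997, "school": "BSIT"}, "children": {}},
}
more_than_5 = filter_children(pers, lambda name, amount: amount > 5)   # ?(@>5)
not_carl = filter_children(pers, lambda name, amount: name != "Carl")  # ?(!Carl)
print(more_than_5, not_carl)
```

The first predicate looks at the cell value (the `@` in the script), the second at the column name.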
27
NotaQL Transformation Platform: MapReduce
28
map(rowId, row):
  row violates row pred.? yes: Stop
  has more columns? no: Stop
    cell violates cell pred.? yes: continue with the next column
    map IN.{_r,_c,_v}, fetched columns and constants to r, c and v
    emit((r, c), v)
Example:
  input row Peter: info (born: 1965, cmpny: IBM, salary: 70k)
  output cell: row IBM, salsum: 70k
  emitted pair: ((IBM, salsum), 70k)
29
reduce((r,c), {v})
put(r, c, aggregateAll(v))
Stop
((IBM, salsum), {70k, 80k, 10k})
((IBM, salsum), 160k)
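The two phases can be simulated locally to see how the emitted pairs are grouped and aggregated. This is a sketch under stated assumptions, not the platform's code: it hard-wires the salary-sum script, uses SUM as the aggregate, and the row ids `a`, `b` and `c` are hypothetical placeholders for three IBM employees earning 70k, 80k and 10k.

```python
from collections import defaultdict


def map_row(row_id, row):
    """Emit ((r, c), v) for OUT._r <- IN.cmpny, OUT.salsum <- SUM(IN.salary)."""
    if "cmpny" in row and "salary" in row:
        yield (row["cmpny"], "salsum"), row["salary"]


def reduce_cell(key, values):
    """Aggregate all values that were emitted for one output cell (r, c)."""
    return key, sum(values)


def run(table):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for row_id, row in table.items():
        for key, value in map_row(row_id, row):
            grouped[key].append(value)
    return dict(reduce_cell(key, values) for key, values in grouped.items())


table = {
    "a": {"cmpny": "IBM", "salary": 70},  # hypothetical row ids
    "b": {"cmpny": "IBM", "salary": 80},
    "c": {"cmpny": "IBM", "salary": 10},
}
output = run(table)
print(output)
```

Using the pair (r, c) as the key means that every output cell is reduced independently of the others.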
30
Advanced Transformations: Graph Algorithms
31
RowId   info                                  children
Peter   born: 1965, cmpny: IBM, salary: 70k   Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa    born: 1997, school: BSIT
[Graph view: nodes "Peter, 1965, IBM, 70k" and "Lisa, 1997, BSIT"; edge labels €10, €5, €0, €7]
32
Input Cell: _r = Peter, _c = Lisa, _v = €5
Output Cell: _r = Lisa, _c = Peter, _v = €5
"Parent of each person?"
OUT._r <- IN.children._c,
OUT.$(IN._r) <- IN._v;
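Swapping row id and column name is all that is needed to turn a children table into a parents table. A minimal sketch, assuming a dict-of-dicts layout: the `parents` helper is illustrative, only the Peter to Lisa cell is confirmed by the slide, and the remaining children cells are an assumed reading of the example table.

```python
def parents(table):
    """OUT._r <- IN.children._c, OUT.$(IN._r) <- IN._v."""
    out = {}
    for parent, row in table.items():
        for child, value in row.get("children", {}).items():
            # the child becomes the row id, the parent becomes the column name
            out.setdefault(child, {})[parent] = value
    return out


pers = {
    "Peter": {"children": {"Lisa": "€5", "Carl": "€0", "Susi": "€10", "Toni": "€7"}},
    "Lisa": {"children": {}},
}
parent_of = parents(pers)
print(parent_of)
```

The same swap applied to a links column family yields the incoming edges of a web graph.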
33
[Table: three web pages, first row Wikipedia; column family info with crawled (17:35, 17:36, 17:36) and pr (0.333 each); column family links with columns such as Twitter, Google and Wikipedia, each with value -]
OUT._r <- IN.links._c,
OUT.incoming.$(IN._r) <- IN._v;
34
[Table: three web pages, first row Wikipedia; column families info (crawled, pr), links and incoming; incoming holds one column per linking page, for example Twitter and Wikipedia, each with value -]
OUT._r <- IN.links._c,
OUT.incoming.$(IN._r) <- IN._v;
Reversing the graph
35
[Table: three web pages, first row Wikipedia; column families info (crawled, pr), links and incoming; pr is 0.333 in every row]
OUT._r <- IN.links._c,
OUT.info.pr <- SUM(IN.pr/COL_COUNT(links));
36
[Table: three web pages, first row Wikipedia; column families info (crawled, pr), links and incoming; after the transformation the pr values are 0.333, 0.167 and 0.5]
OUT._r <- IN.links._c,
OUT.info.pr <- SUM(IN.pr/COL_COUNT(links));
PageRank
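One iteration of this script can be traced with a short Python sketch. The link structure below is an assumption, chosen because it reproduces the values 0.333, 0.167 and 0.5 shown on the slide; which page ends up with which value is likewise assumed. As in the script, there is no damping factor, and `pagerank_step` is a hypothetical helper.

```python
from collections import defaultdict


def pagerank_step(pr, links):
    """OUT._r <- IN.links._c, OUT.info.pr <- SUM(IN.pr/COL_COUNT(links))."""
    new_pr = defaultdict(float)
    for page, targets in links.items():
        if not targets:
            continue  # a page without links passes nothing on
        share = pr[page] / len(targets)  # IN.pr / COL_COUNT(links)
        for target in targets:           # one output row per linked page
            new_pr[target] += share      # SUM over all incoming shares
    return dict(new_pr)


# Assumed link structure (not fully recoverable from the slide)
links = {
    "Wikipedia": ["Twitter", "Google"],
    "Twitter": ["Google"],
    "Google": ["Wikipedia"],
}
pr = {page: 1 / 3 for page in links}
new_pr = pagerank_step(pr, links)
print(new_pr)
```

Running the step repeatedly on its own output gives the usual iterative PageRank computation.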
37
Advanced Transformations: Text Processing
38
RowId         info
Wikipedia     crawled: 17:35, pr: 0.333, body: all information can be found here
(second row)  crawled: 17:36, pr: 0.167, body: click here for more information
OUT._r <- IN._r,
OUT.words <- COUNT(IN.body.split(' '));
39
RowId         info
Wikipedia     crawled: 17:35, pr: 0.333, body: all information can be found here, words: 6
(second row)  crawled: 17:36, pr: 0.167, body: click here for more information, words: 5
OUT._r <- IN._r,
OUT.words <- COUNT(IN.body.split(' '));
Word Count
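The word count keeps the row id and aggregates over the pieces of the split body. A minimal sketch, assuming a dict-of-dicts table: `count_words` is an illustrative helper, and the row id `page2` is a hypothetical placeholder for the second row, whose id is not shown.

```python
def count_words(table):
    """OUT._r <- IN._r, OUT.words <- COUNT(IN.body.split(' '))."""
    return {
        row_id: {"words": len(row["body"].split(" "))}
        for row_id, row in table.items()
        if "body" in row
    }


pages = {
    "Wikipedia": {"body": "all information can be found here"},
    "page2": {"body": "click here for more information"},  # hypothetical row id
}
words = count_words(pages)
print(words)
```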
40
RowId         info
Wikipedia     crawled: 17:35, pr: 0.333, body: all information can be found here, words: 6
(second row)  crawled: 17:36, pr: 0.167, body: click here for more information, words: 5
OUT._r <- IN.body.split(' '),
OUT.$(IN._r) <- COUNT(*);
Output table:
RowId   info
here    Wikipedia: 1, Twitter: 1
…
Term Index
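For the term index, every word becomes an output row id and every page becomes a column holding the number of occurrences. The sketch below assumes the second page's row id is Twitter, as the output table suggests; `term_index` is a hypothetical helper.

```python
from collections import Counter, defaultdict


def term_index(table):
    """OUT._r <- IN.body.split(' '), OUT.$(IN._r) <- COUNT(*)."""
    index = defaultdict(Counter)
    for row_id, row in table.items():
        for word in row.get("body", "").split(" "):
            if word:  # skip empty strings from missing bodies or double spaces
                index[word][row_id] += 1
    return {word: dict(counts) for word, counts in index.items()}


pages = {
    "Wikipedia": {"body": "all information can be found here"},
    "Twitter": {"body": "click here for more information"},  # assumed row id
}
index = term_index(pages)
print(index["here"])
```

Here data (the words of the body) turns into metadata (row ids), and metadata (the page's row id) turns into column names.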
41
Conclusion
Selection, Projection
Grouping, Aggregation
Schema-Flexible
Horizontal Aggregation
Metadata, Data
Graph Processing
Text Processing
SQL
42
Thank you!