The Stringex Viewpoint (over-the-network)
[Diagram. Traditional Client path: Data, Indexer, Index, Network. Stringex Client path: Data, Indexer, Index (Read, Write).]
01 myself+0 "A New Practical Design for Browsable Over-the-Network Indexing" ISEEE (2014)
M.Zhanikeev -- [email protected] -- A Method for Dynamic Packing of Data Blocks for Over-the-Network Indexing -- http://bit.do/150928
The Old Stringex
The Stringex Engine
[Diagram labels: Stringex Index, Stringex Client, Sync Engine, Optimization, Local Cache; numbered steps 1 and 2: Check, Use.]
01 myself+0 "A New Practical Design for Browsable Over-the-Network Indexing" ISEEE (2014)
Stringex v1 : Design
JSON { name : value1, age : value2, …}
[Diagram labels, local storage: per JSON key, hashing into a Hash table (000, 001, …) with a Bit mask and Doc # lists (#1, #2, …); a table of Doc # (a123d … 53ffe3) against JSON data { name: value1, age: value2, …}.]
[Diagram labels, cloud storage (Cloud Drive API, App Space), Realtime Sync of blocks: name.block1, name.block2, …, age.block1, age.block2, …, docs.block1, docs.block2, …]
• load balancing, basically
• fixed-size blocks per meta key and docs
• the engine is in charge of minimizing traffic
Stringex v1 : Design (2)
INPUT JSON { name : value1, age : value2, …}
Files (Index)
Key: name
  name.imap { ‘bk’: { ‘ik’: ‘start,end’, … next ‘ik’ }, … next bk }
  name.vmap { ‘value’: ‘bk’, … next value }
  name.bk1, name.bk2, …
Key: age
  same structure
…
Docs (Data)
  docs.imap { ‘bk’: { ‘docid’: ‘start,end’, … next ‘docid’ }, … next bk }
  docs.bk1, docs.bk2, …
  no .vmap
• blocks aggregated by prefixes of md5 hashes
• some JSON structure in .imap files with positions for partial file reading
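The two bullets above can be illustrated with a minimal sketch. It assumes that a block is chosen by the leading hex digits of the md5 hash of a value, and that an .imap entry stores 'start,end' byte positions with an exclusive end. The function names (`block_id`, `pack_block`, `read_item`) and the default prefix length are hypothetical and are not taken from Stringex itself.

```python
import hashlib


def block_id(value: str, prefix_len: int = 2) -> str:
    """Pick a block by the first hex digits of the md5 hash of the value."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()[:prefix_len]


def pack_block(records: dict) -> tuple:
    """Concatenate records into one block.

    Returns (block bytes, positions), where positions maps each item key
    to a 'start,end' string in the style of an .imap entry.
    The end offset is exclusive (an assumption of this sketch).
    """
    blob = bytearray()
    positions = {}
    for item_key, payload in records.items():
        data = payload.encode("utf-8")
        start = len(blob)
        blob += data
        positions[item_key] = f"{start},{len(blob)}"
    return bytes(blob), positions


def read_item(block: bytes, positions: dict, item_key: str) -> str:
    """Read one item from a block using only its recorded byte range."""
    start, end = (int(x) for x in positions[item_key].split(","))
    return block[start:end].decode("utf-8")


block, pos = pack_block({"d1": '{"name": "a"}', "d2": '{"name": "bb"}'})
item = read_item(block, pos, "d2")
```

With byte ranges stored separately from the block, a client that supports ranged reads can fetch one item without downloading the whole block file.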
Stringex v1 vs Lucene
[Figure: Throughput (log of bytes/doc) against Index Size (log) for Lucene and Stringex. Annotations: "Normal operation" and "Need to improve this part".]
The New Stringex
Stringex v2 : Visual Idea
• variable-size blocks
• possibly, variable depth as well -- hierarchical layering
• unfortunately, not in this version -- could not figure out how to make it work in practice
• ... but a distant goal, anyway
Stringex v2 : Variable Blocks
global config (example values)
prefixmin   prefixmax   keyorder
1           3           authors, title, pages
[Diagram labels: Cloud storage holding meta.a.a.a, meta.a.af.a, …, meta.z.z.z (three keys) and docs.a.a.a, …, docs.z.z.z; Stringex Client; Update; Background (lazy) processing.]
• the biggest change : variable prefix length = variable block size
• all metadata is together now, but order is important -- encoded in filenames
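One way to read the filename scheme is sketched below. It assumes that each dot-separated part of a name such as meta.a.af.a is a hash prefix for one key, taken in keyorder, and that each prefix length lies between prefixmin and prefixmax. The helper names, the use of md5 here, and the sample document are assumptions for illustration; md5 hex prefixes use 0-9 and a-f, whereas the names on the slide are only illustrative.

```python
import hashlib

# Example values from the global config on the slide.
CONFIG = {"prefixmin": 1, "prefixmax": 3, "keyorder": ["authors", "title", "pages"]}


def md5_prefix(value, length: int) -> str:
    """First `length` hex digits of the md5 hash of the value."""
    return hashlib.md5(str(value).encode("utf-8")).hexdigest()[:length]


def block_filename(kind: str, doc: dict, prefix_lens: dict, config: dict = CONFIG) -> str:
    """Build a name like 'meta.<p1>.<p2>.<p3>' with one prefix per key.

    prefix_lens gives the current prefix length per key (hypothetical);
    keys not listed fall back to prefixmin. A longer prefix means that
    fewer values share the block, so the block is smaller.
    """
    parts = [kind]
    for key in config["keyorder"]:  # the order is fixed by the config
        n = prefix_lens.get(key, config["prefixmin"])
        if not config["prefixmin"] <= n <= config["prefixmax"]:
            raise ValueError(f"prefix length {n} out of range for {key}")
        parts.append(md5_prefix(doc[key], n))
    return ".".join(parts)


# Hypothetical document; only the key names come from the slide.
doc = {"authors": "M.Zhanikeev", "title": "Stringex", "pages": 22}
name = block_filename("meta", doc, {"title": 2})
```

Because the position of each prefix in the name identifies its key, the client can locate a block from the filename alone, without a separate lookup file.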
Stringex v2 : Logic
The idea is...
...the same -- to minimize traffic between client and cloud
[Flowchart labels: Stringex Client; JSON { name : value1, age : value2, …}; caching; hashing; Sync Engine; Index; Fill gaps ‘0’ prefix. Decisions: "My own recent?", "Timeout passed?", "Small block still there?". Actions: Get cache, Get small block (try), Get large block (if failed). Branch labels: no, no, yes.]
• since metadata order is in filename, zero prefix is important -- gap filling
• local cache can help to avoid syncing recent files
• longer prefix at cloud side means smaller files
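The three points above can be sketched as follows. This is one possible reading of the logic, not the actual Stringex client: it serves a block from the local cache while the entry is younger than a timeout, otherwise tries the smallest block (longest prefix) first and falls back to larger blocks (shorter prefixes). The `BlockFetcher` class, the `fetch` callback, the timeout value, and the `fill_gaps` helper are all hypothetical.

```python
import time


def fill_gaps(prefixes) -> str:
    """Join per-key prefixes, writing '0' where a key has no value,
    so that the position of every key in the filename stays fixed."""
    return ".".join(p if p else "0" for p in prefixes)


class BlockFetcher:
    def __init__(self, fetch, timeout: float = 60.0, clock=time.monotonic):
        self.fetch = fetch      # hypothetical: returns bytes, or None if absent in the cloud
        self.timeout = timeout  # seconds during which a cached block is trusted
        self.clock = clock
        self.cache = {}         # prefix -> (timestamp, data)

    def put_local(self, prefix: str, data: bytes) -> None:
        """Remember a block this client wrote itself, to skip re-syncing it."""
        self.cache[prefix] = (self.clock(), data)

    def get(self, prefix: str, prefixmin: int = 1):
        entry = self.cache.get(prefix)
        if entry is not None and self.clock() - entry[0] < self.timeout:
            return entry[1]     # recent enough: no network traffic
        # Longest prefix first (smallest file), then progressively larger blocks.
        for n in range(len(prefix), prefixmin - 1, -1):
            data = self.fetch(prefix[:n])
            if data is not None:
                self.cache[prefix] = (self.clock(), data)
                return data
        return None


# Usage with an in-memory stand-in for cloud storage and a fake clock.
cloud = {"a": b"large", "abc": b"small"}
calls = []
now = [0.0]


def fake_fetch(p):
    calls.append(p)
    return cloud.get(p)


fetcher = BlockFetcher(fake_fetch, timeout=60.0, clock=lambda: now[0])
first = fetcher.get("abc")    # fetched from the cloud
second = fetcher.get("abc")   # served from the local cache
now[0] = 120.0                # timeout has passed
del cloud["abc"]              # the small block is no longer there
third = fetcher.get("abc")    # falls back to the large block
```

Trying the longest prefix first keeps the common case cheap, since a longer prefix at the cloud side corresponds to a smaller file.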
Analysis
Analysis : Components
• real life tests using the new client
• hotspot distribution defines access frequency
• parameters: file count, cache ratio, hotspot class
Analysis : Hotspot Trace
Hotspot distribution...
...consists of normal, popular, and hot/flash sets
[Figure: log(value) against items in decreasing order (0 to 100), one curve each for Class A, Class B, Class C, Class D, Class E.]
• common in CDN today
• top 5% of content is hot/flash
• top 20% of content is popular
• the rest are normal
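The split above can be turned into a small access-trace generator. The sketch assumes the sets are nested (the hot 5% sits inside the top 20%), and the per-set weights are made-up numbers for illustration; the shapes of classes A to E are not reproduced here.

```python
import random


def popularity_set(rank: int, total: int) -> str:
    """Classify a file by its 0-based rank in decreasing popularity order."""
    if rank * 20 < total:  # top 5% of content
        return "hot"
    if rank * 5 < total:   # top 20% of content (assumed to include the hot set)
        return "popular"
    return "normal"


# Hypothetical relative access weights per set, not taken from the slides.
WEIGHTS = {"hot": 100.0, "popular": 10.0, "normal": 1.0}


def access_trace(total: int, length: int, seed: int = 0) -> list:
    """Draw `length` file accesses, biased toward hot and popular files."""
    rng = random.Random(seed)
    weights = [WEIGHTS[popularity_set(r, total)] for r in range(total)]
    return rng.choices(range(total), weights=weights, k=length)


sets = [popularity_set(r, 100) for r in range(100)]
trace = access_trace(total=100, length=1000, seed=1)
```

A trace like this can drive the client in a test run, with the file count and the set weights as the parameters to vary.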
Analysis : Raw Trace
• logs from an actual run
• regularly take snapshots of filesystem state, keep track of access count in the background
• can be used to visualize the details of operation
Analysis : Visualization
time#725 class#E files#100 topn#0.1 dynamics(depth 4 >> 3@5 2@25 1@100)
• visualized snapshot
• 100 files, 10% in cache (cloud side), 4-stage dynamics
Analysis : Visualization
time#532 class#B files#200 topn#0.5 dynamics(depth 4 >> 3@5 2@25 1@100)
• 200 files, 50% in cache at cloud side
• same dynamics as before
Analysis : Visualization
time#23 class#E files#200 topn#0.5 dynamics(depth 4 >> 1@5)
• yet another set, early in process (time 23)
Analysis : Visualization
time#845 class#B files#500 topn#0.25 dynamics(depth 4 >> 3@5 2@25 1@100)
• very deep time, mostly settled
Analysis : Visualization
time#451 class#D files#500 topn#0.25 dynamics(depth 4 >> 3@5 2@25 1@100)
• also deep time, also settled, but this time into 2 islands
Goal: Yet More Flexibility?
Blocksize
Client
Index
Index > Client
?
That’s all, thank you ...