EFFICIENT RETRIEVAL OF THE TOP-K
MOST RELEVANT SPATIAL WEB OBJECTS
GAO CONG CHRISTIAN S. JENSEN DINGMING WU
Manzoni Paolo, Menchicchi Fabio (Speaker)
(GROUP 1)
Motivations
• About one fifth of web search queries are textual and have local intent.
• This implies that the user would like to find:
– something that satisfies their needs
– something near them
– … and to see only the best results …
I'd like to find an Italian restaurant near Springfield where I can drink Duff beer.
Scenario
• The conventional internet is acquiring a geo-spatial dimension:
– Many points of interest are being associated with descriptive text.
– Web documents are increasingly being geo-tagged.
• Commercial search engines have started to provide location-based services, such as map services, local search, and local advertisements.
• For example, Google Maps and Yellow Pages support location-aware text retrieval queries.
Example
Problems to face
1. How to evaluate both spatial proximity and text similarity
   Location-aware top-k Text retrieval (LkT) query
2. How to join two different worlds:
   1. Inverted files for text searches
   2. R-tree for geo-searches
   Ad hoc data structures (IR-tree, DIR-tree)
Object Evaluation
An object consists of a text document and coordinates.
D = spatial database
O = object in D, O = (O.loc, O.doc)
Q = query, Q = (Q.loc, Q.keywords)

$$D_{st}(Q, O) = \alpha \, \frac{D_{\varepsilon}(Q.loc, O.loc)}{maxD} + (1 - \alpha)\left(1 - \frac{P(Q.keywords \mid O.doc)}{maxP}\right)$$

• D_ε(Q.loc, O.loc) / maxD is the normalized Euclidean distance between the query and O.
• P(Q.keywords | O.doc) / maxP is the normalized relevance of the query keywords to the document of O.
• Lower values of Dst indicate objects that are more interesting to the user.
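As a minimal sketch of the scoring function above, the following Python snippet combines the two normalized terms. The text relevance P(Q.keywords | O.doc) is passed in as a precomputed number `p`, since the slides do not spell out how it is computed; the function and parameter names are illustrative assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def d_st(q_loc, o_loc, p, alpha, max_d, max_p):
    """Combined spatial/textual score; lower is better.

    p     : text relevance P(Q.keywords | O.doc), assumed precomputed
    max_d : largest distance used for normalization
    max_p : largest relevance used for normalization
    """
    spatial = euclidean(q_loc, o_loc) / max_d
    textual = 1.0 - p / max_p
    return alpha * spatial + (1.0 - alpha) * textual

# Hypothetical values: an object 5 units away with half the maximum relevance.
score = d_st(q_loc=(0.0, 0.0), o_loc=(3.0, 4.0), p=0.5,
             alpha=0.5, max_d=10.0, max_p=1.0)
print(score)
```

Setting alpha closer to 1 makes distance dominate the ranking; closer to 0, text relevance dominates.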
IR-tree
• Ad hoc data structure created for the LkT query
• Basically an R-tree, with the extra information needed to evaluate text relevance in each node ...
[Figure: R-tree with root R7, internal nodes R5 and R6, leaf nodes R1 to R4, and objects O1 to O9]
We must be able to evaluate, FOR EACH NODE:
1. The SPATIAL DISTANCE from a query (the R-tree helps us)
2. The TEXT RELEVANCE to a query
How can we evaluate the TEXT RELEVANCE of NON-LEAF NODES?
• PSEUDO-DOCUMENT: represents all the documents in the entries of a child node.
• INVERTED FILES
[Figure: spatial layout of objects O1 to O9 and bounding rectangles R1 to R7]
Example
Q = (A; 'IRISH', 'PUB')

        O1  O2  O3  O4  O5  O6  O7  O8  O9
PUB      5   2   5   0   2   3   4   0   0
IRISH    5   4   0   5   0   3   2   4   3
PIZZA    1   0   5   6   5   1   1   5   6
[Figure: the IR-tree of the example. Root R7 has children R5 and R6; R5 has children R3 and R4; R6 has children R1 and R2. Each node has its own inverted file.]
INVERTED FILE 7 (node R7)
PUB:   [R5, 4] [R6, 5]
IRISH: [R5, 4] [R6, 5]
PIZZA: [R5, 6] [R6, 6]

INVERTED FILE 5 (node R5)
PUB:   [R3, 4]
IRISH: [R3, 3] [R4, 4]
PIZZA: [R3, 1] [R4, 6]

INVERTED FILE 6 (node R6)
PUB:   [R2, 5] [R1, 5]
IRISH: [R2, 5] [R1, 5]
PIZZA: [R2, 1] [R1, 6]

INVERTED FILE 3 (node R3)
PUB:   [O7, 4] [O6, 3]
IRISH: [O7, 2] [O6, 3]
PIZZA: [O7, 1] [O6, 1]

INVERTED FILE 4 (node R4)
PUB:   (no entries)
IRISH: [O8, 4] [O9, 3]
PIZZA: [O8, 5] [O9, 6]

INVERTED FILE 2 (node R2)
PUB:   [O1, 5] [O2, 2]
IRISH: [O1, 5] [O2, 4]
PIZZA: [O1, 1]

INVERTED FILE 1 (node R1)
PUB:   [O3, 5] [O5, 2]
IRISH: [O4, 5]
PIZZA: [O3, 5] [O4, 6] [O5, 5]
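A node's inverted file can be sketched as a mapping from each term to the list of (entry, weight) pairs with non-zero weight. The snippet below is an illustrative assumption of how such a file could be built for leaf node R1, using the term weights of O3, O4, and O5 from the example table; the function name and data layout are hypothetical.

```python
from collections import defaultdict

def build_inverted_file(entries):
    """Build term -> [(entry_id, weight), ...] for one node.

    entries maps an entry id to its {term: weight} document;
    terms with weight 0 are not posted.
    """
    inverted = defaultdict(list)
    for entry_id, doc in entries.items():
        for term, weight in doc.items():
            if weight > 0:
                inverted[term].append((entry_id, weight))
    return dict(inverted)

# Documents of the objects stored in leaf node R1 (from the example table).
r1_entries = {
    "O3": {"PUB": 5, "IRISH": 0, "PIZZA": 5},
    "O4": {"PUB": 0, "IRISH": 5, "PIZZA": 6},
    "O5": {"PUB": 2, "IRISH": 0, "PIZZA": 5},
}
inverted_file_1 = build_inverted_file(r1_entries)
for term, postings in inverted_file_1.items():
    print(term, postings)
```

For a non-leaf node the same function applies, with the child nodes as entries and their pseudo-documents as documents.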
Pseudo-documents of the nodes:
R3: PUB 4, IRISH 3, PIZZA 1
R4: IRISH 4, PIZZA 6
R2: PUB 5, IRISH 5, PIZZA 1
R1: PUB 5, IRISH 5, PIZZA 6
R5: PUB 4, IRISH 4, PIZZA 6
R6: PUB 5, IRISH 5, PIZZA 6
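The pseudo-document values in the example are consistent with taking, for each term, the maximum weight among the documents below the node. The following sketch assumes exactly that rule (the function name is hypothetical) and recomputes the pseudo-documents of R3, R4, and R5 from the example table.

```python
def pseudo_document(docs):
    """Term-wise maximum over a list of {term: weight} documents."""
    pseudo = {}
    for doc in docs:
        for term, weight in doc.items():
            if weight > pseudo.get(term, 0):
                pseudo[term] = weight
    return pseudo

# Object documents from the example table (zero weights omitted).
o6 = {"PUB": 3, "IRISH": 3, "PIZZA": 1}
o7 = {"PUB": 4, "IRISH": 2, "PIZZA": 1}
o8 = {"IRISH": 4, "PIZZA": 5}
o9 = {"IRISH": 3, "PIZZA": 6}

r3 = pseudo_document([o6, o7])   # leaf node R3
r4 = pseudo_document([o8, o9])   # leaf node R4
r5 = pseudo_document([r3, r4])   # parent node R5, built from its children
print(r3, r4, r5)
```

Because a parent is built from the pseudo-documents of its children, the weights can only grow on the way up to the root.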
Searching in the tree
We can now evaluate each node of the tree in terms of both text and distance, and explore the tree in a 'clever way'.

For a leaf entry (an object O), the evaluation is:
$$D_{st}(Q, O) = \alpha \, \frac{D_{\varepsilon}(Q.loc, O.loc)}{maxD} + (1 - \alpha)\left(1 - \frac{P(Q.keywords \mid O.doc)}{maxP}\right)$$

For a non-leaf node N, the evaluation is:
$$MIND_{st}(Q, N) = \alpha \, \frac{D_{\varepsilon}(Q.loc, N.rect)}{maxD} + (1 - \alpha)\left(1 - \frac{P(Q.keywords \mid N.doc)}{maxP}\right)$$
Because of the nature of pseudo-documents, given a query point Q and a node N whose rectangle encloses a set of objects SO, MINDst(Q, N) ≤ Dst(Q, O) for every object O in SO.
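The lower-bound property can be illustrated with a small sketch. It assumes axis-aligned rectangles given as (xmin, ymin, xmax, ymax), and it takes the relevance values as precomputed inputs; all names and numbers are hypothetical and chosen only to show that the node score does not exceed the score of an object inside the node.

```python
import math

def min_dist_to_rect(point, rect):
    """Smallest Euclidean distance from a point to a rectangle (0 if inside)."""
    x, y = point
    xmin, ymin, xmax, ymax = rect
    dx = max(xmin - x, 0.0, x - xmax)
    dy = max(ymin - y, 0.0, y - ymax)
    return math.hypot(dx, dy)

def combine(distance, relevance, alpha, max_d, max_p):
    """Shared form of Dst and MINDst: weighted distance plus text penalty."""
    return alpha * distance / max_d + (1 - alpha) * (1 - relevance / max_p)

q = (0.0, 0.0)
node_rect = (3.0, 4.0, 6.0, 8.0)
obj_loc = (6.0, 8.0)             # an object lying inside node_rect
p_node, p_obj = 0.8, 0.4         # the pseudo-document is at least as relevant

node_bound = combine(min_dist_to_rect(q, node_rect), p_node, 0.5, 20.0, 1.0)
obj_score = combine(math.hypot(obj_loc[0] - q[0], obj_loc[1] - q[1]),
                    p_obj, 0.5, 20.0, 1.0)
print(node_bound, obj_score)
```

Both terms of the node score are optimistic: the rectangle is never farther than the object, and the pseudo-document is never less relevant than the object's document.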
LkT Query
Q = (A; 'IRISH', 'PUB')
[Figure: spatial layout and IR-tree of the example]
Priority queue, ordered by increasing values of MINDst (nodes) and Dst (objects):
1) R7: 0
2) R6: 0.1, R5: 0.16
3) R1: 0.15, R5: 0.16, R2: 0.19
LkT Query
Q = (A; 'IRISH', 'PUB')
[Figure: spatial layout and IR-tree of the example]
Priority queue, ordered by increasing values of MINDst (nodes) and Dst (objects):
4) R5: 0.16, R2: 0.19, O3: 0.46, O4: 0.58, O5: 0.66
5) R2: 0.19, R3: 0.25, O3: 0.46, R4: 0.48, O4: 0.58, O5: 0.66
6) O1: 0.23, R3: 0.25, O3: 0.46, R4: 0.48, O4: 0.58, O5: 0.66, O2: 0.7
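The six queue states above can be reproduced with a best-first traversal driven by a min-heap. The sketch below hard-codes the scores shown in the example instead of computing them; the scores of O6 to O9 do not appear in the example and are hypothetical placeholders (chosen so that no object scores below its parent node). The tree layout is the one implied by the inverted files.

```python
import heapq

# Tree structure of the example IR-tree.
CHILDREN = {
    "R7": ["R5", "R6"],
    "R5": ["R3", "R4"],
    "R6": ["R1", "R2"],
    "R1": ["O3", "O4", "O5"],
    "R2": ["O1", "O2"],
    "R3": ["O6", "O7"],
    "R4": ["O8", "O9"],
}

# MINDst for nodes and Dst for objects, as shown in the queue states.
SCORE = {
    "R7": 0.0, "R6": 0.10, "R5": 0.16, "R1": 0.15, "R2": 0.19,
    "R3": 0.25, "R4": 0.48,
    "O1": 0.23, "O2": 0.70, "O3": 0.46, "O4": 0.58, "O5": 0.66,
    # Hypothetical values, not taken from the example:
    "O6": 0.50, "O7": 0.30, "O8": 0.60, "O9": 0.90,
}

def lkt_query(root, k):
    """Return the k best objects and the list of nodes that were expanded."""
    results, expanded = [], []
    queue = [(SCORE[root], root)]
    while queue and len(results) < k:
        score, entry = heapq.heappop(queue)
        if entry in CHILDREN:            # a node: expand it
            expanded.append(entry)
            for child in CHILDREN[entry]:
                heapq.heappush(queue, (SCORE[child], child))
        else:                            # an object: it is the next best result
            results.append((entry, score))
    return results, expanded

top1, expanded = lkt_query("R7", k=1)
print(top1, expanded)
```

An object popped from the queue is safe to report because every remaining entry has a score, or a lower bound on its objects' scores, that is at least as large.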
Considerations
Q = (A; 'IRISH', 'PUB')
In the previous example we explored nodes R5 and R1, even though they do not contain any really interesting objects ...
The culprit is the pseudo-document: it inherits the best of all the documents inside its subtree, and those documents may be very different from one another.
To avoid this, we would like each node to contain:
• SIMILAR DOCUMENTS
• NEARBY OBJECTS
DIR-tree
Unlike the IR-tree, the DIR-tree aims to take both location and text information into account during tree construction, by optimizing for a combination of minimizing the areas of the enclosing rectangles and maximizing the text similarities between the documents of the enclosing rectangles.
We need a heuristic function to decide into which leaf node an item should be inserted during tree construction:
$$SimAreaCost(E_k, O) = (1 - \beta)\,\frac{AreaCost(E_k)}{maxArea} + \beta\,\bigl(1 - cosine(E_k.vector, O.vector)\bigr)$$
[Figure: the current node with entries E1, E2, …, Ep]
E_k = generic entry of a non-leaf node
O = object to insert in the tree
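[Figure: the rectangle of E1 is enlarged to E'1 so that it also encloses the new object O]
$$AreaCost(E_k) = area(E'_k.rectangle) - area(E_k.rectangle)$$
where E'_k.rectangle is the rectangle of E_k enlarged to enclose O.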
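A minimal sketch of AreaCost, assuming axis-aligned rectangles stored as (xmin, ymin, xmax, ymax) and a point object; the helper names are hypothetical.

```python
def area(rect):
    """Area of an axis-aligned rectangle (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = rect
    return (xmax - xmin) * (ymax - ymin)

def enlarge(rect, point):
    """Smallest rectangle that contains both rect and point."""
    xmin, ymin, xmax, ymax = rect
    x, y = point
    return (min(xmin, x), min(ymin, y), max(xmax, x), max(ymax, y))

def area_cost(rect, point):
    """Extra area needed for rect to also enclose point."""
    return area(enlarge(rect, point)) - area(rect)

e1_rect = (0.0, 0.0, 2.0, 2.0)
cost_inside = area_cost(e1_rect, (1.0, 1.0))    # point already covered
cost_outside = area_cost(e1_rect, (4.0, 2.0))   # rectangle grows to 4 x 2
print(cost_inside, cost_outside)
```

An object that already lies inside the rectangle costs nothing, so spatially the entry is an ideal target for it.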
O.vector = term frequency vector of the document of O
E_k.vector = term frequency vector of the pseudo-document related to E_k

Example (terms: Pub, Pizza):
• Documents under E1: (Pub 2, Pizza 0) with vector [2,0], and (Pub 1, Pizza 0) with vector [1,0]
• Pseudo-document of E1: (Pub 2, Pizza 0), so E1.vector = [2,0]
• New object O: (Pub 0, Pizza 1), so O.vector = [0,1]
• E1.vector and O.vector are orthogonal (θ = 90°): SIMILARITY = 0, so the text term takes its maximum penalty (1).
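The following sketch evaluates SimAreaCost for the Pub/Pizza vectors of the example. The area cost, maxArea, and β values are hypothetical inputs chosen for illustration, and the function names are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two term-frequency vectors (0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def sim_area_cost(area_cost, max_area, e_vector, o_vector, beta):
    """Weighted sum of normalized area enlargement and text dissimilarity."""
    return ((1 - beta) * area_cost / max_area
            + beta * (1 - cosine(e_vector, o_vector)))

e1_vector = [2, 0]      # pseudo-document of E1: Pub 2, Pizza 0
pizza_doc = [0, 1]      # new object about pizza only
pub_doc = [1, 0]        # new object about pubs only

# Same spatial cost for both objects (hypothetical: 4 out of a maximum of 16).
cost_pizza = sim_area_cost(4.0, 16.0, e1_vector, pizza_doc, beta=0.5)
cost_pub = sim_area_cost(4.0, 16.0, e1_vector, pub_doc, beta=0.5)
print(cost_pizza, cost_pub)
```

With equal spatial cost, the pub document is much cheaper to insert under E1 than the pizza document, which is the behaviour the DIR-tree construction is after.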
[Figure: the example tree (root R7) with a new object O10 to be inserted]
Here are the benefits
[Figure: the same objects O1 to O9 grouped by an IR-tree (left) and by a DIR-tree (right)]
How to improve performance ...
• Clustering: spatial web objects often belong to different categories, for example accommodations, restaurants, and tourist attractions.
The idea is to cluster objects into groups according to their corresponding documents.
But what does this involve in terms of:
• Efficiency?
• Changes to the structure of the trees (IR or DIR)?
[Figure: the example tree with each object labelled by its cluster: CLUSTER 1 = Art, CLUSTER 2 = Sport]
ADVANTAGES
Bounds estimated using the clusters in a node are tighter than those estimated for the whole node.
The problem of finding a clustering solution that minimizes ScanTime is NP-hard.
STRUCTURE MODIFICATION
In each entry of each node we need to store one pseudo-document for each cluster represented in its subtree. This implies additional space usage.
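To see why per-cluster pseudo-documents give tighter bounds, consider the sketch below. The documents, terms, and cluster labels are hypothetical, the pseudo-document is assumed to be a term-wise maximum, and relevance is approximated by simply summing the weights of the query terms, which is a simplification used only for illustration.

```python
def pseudo_document(docs):
    """Term-wise maximum over a list of {term: weight} documents."""
    pseudo = {}
    for doc in docs:
        for term, weight in doc.items():
            pseudo[term] = max(pseudo.get(term, 0), weight)
    return pseudo

def cluster_pseudo_documents(labelled_docs):
    """One pseudo-document per cluster label found in the node."""
    by_cluster = {}
    for label, doc in labelled_docs:
        by_cluster.setdefault(label, []).append(doc)
    return {label: pseudo_document(docs) for label, docs in by_cluster.items()}

def relevance(doc, terms):
    """Simplified relevance: sum of the weights of the query terms."""
    return sum(doc.get(t, 0) for t in terms)

# Hypothetical node content: two Art documents and one Sport document.
node_docs = [
    ("Art", {"museum": 5, "gallery": 3}),
    ("Art", {"museum": 2, "gallery": 4}),
    ("Sport", {"stadium": 6, "museum": 1}),
]
query = ["museum", "stadium"]

whole = pseudo_document([doc for _, doc in node_docs])
per_cluster = cluster_pseudo_documents(node_docs)

whole_bound = relevance(whole, query)
cluster_bound = max(relevance(pd, query) for pd in per_cluster.values())
best_actual = max(relevance(doc, query) for _, doc in node_docs)
print(whole_bound, cluster_bound, best_actual)
```

The whole-node pseudo-document mixes the best museum weight from Art with the best stadium weight from Sport, promising a relevance that no single object reaches; the per-cluster bound stays closer to the truth at the price of storing one pseudo-document per cluster.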
Performance results
Property                                      Dataset
Total number of objects                       131,461
Average number of unique words per object     112
Total number of unique words in dataset       30,616
Total number of words in dataset              14,809,845
[Figure: performance charts comparing Baseline, IR-tree, CIR-tree, DIR-tree, and CDIR-tree]