Do Not Crawl In The DUST: Different URLs Similar Text
Uri Schonfeld, Department of Electrical Engineering, Technion
Joint work with Dr. Ziv Bar-Yossef and Dr. Idit Keidar
Problem statement and motivation
Related work
Our contribution
The DustBuster algorithm
Experimental results
Concluding remarks
Talk Outline
DUST – Different URLs Similar Text. Examples:
Standard canonization: "http://domain.name/index.html" → "http://domain.name"
Domain names and virtual hosts: "http://news.google.com" → "http://google.com/news"
Aliases and symbolic links: "http://domain.name/~shuri" → "http://domain.name/people/shuri"
Parameters with little effect on content: Print=1
URL transformations: "http://domain.name/story_" → "http://domain.name/story?id="
Even the WWW Gets Dusty
DUST rule: transforms one URL to another. Example: "index.html" → ""
Valid DUST rule:
r is a valid DUST rule w.r.t. site S if, for every URL u ∈ S:
r(u) is a valid URL, and r(u) and u have "similar" contents
Why similar and not identical? Comments, news, text ads, counters
DUST Rules!
Expensive to crawl: the same document is accessed via multiple URLs
Forces us to shingle: an expensive technique used to discover similar documents
Ranking algorithms suffer: references to a document are split among its aliases
Multiple identical results: the same document is returned several times in the search results
Any algorithm based on URLs suffers
DUST is Bad
Given: a list of URLs from a site S (crawl log, web server log)
Want: to find valid DUST rules w.r.t. S, as many as possible, including site-specific ones, while minimizing the number of fetches
Applications: site-specific canonization, more efficient crawling
We Want To
Domain name aliases, standard extensions
Default file names: index.html, default.htm
File path canonizations: "dirname/../" → "", "//" → "/"
Escape sequences: "%7E" → "~"
How do we Fight DUST Today? (1) Standard Canonization
Site-specific DUST:
"story_" → "story?id="
"news.google.com" → "google.com/news"
"labs" → "laboratories"
This DUST is harder to find
Standard Canonization is not Enough
Shingles are document sketches [Broder, Glassman, Manasse 97]
Used to compare documents for similarity: Pr(shingles are equal) = document similarity
Compare documents by comparing their shingles
Calculating a shingle: take all m-word sequences, hash them with h_i, and choose the minimum; that's your shingle (a sketch follows below)
How do we Fight DUST Today? (2) Shingles
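A minimal sketch of computing one min-hash shingle, assuming m-word shingles and a seeded hash standing in for the hash family h_i (all names here are illustrative):

import hashlib

def shingle(text, m=4, seed=0):
    # Break the document text into overlapping m-word sequences.
    words = text.split()
    grams = [" ".join(words[i:i + m]) for i in range(len(words) - m + 1)]
    if not grams:
        return None
    # Hash each m-gram with a seeded hash h_i and keep the minimum value.
    def h(g):
        return int.from_bytes(hashlib.sha1(f"{seed}:{g}".encode()).digest()[:8], "big")
    return min(h(g) for g in grams)

# With several independent seeds, Pr[shingle_i(doc1) == shingle_i(doc2)]
# approximates the resemblance of the two documents.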
Shingles are expensive: they require a fetch, parsing, and hashing
Shingles do not find rules; therefore, they are not applicable to new pages
Shingles are Not Perfect
Mirror detection [Bharat, Broder 99], [Bharat, Broder, Dean, Henzinger 00], [Cho, Shivakumar, Garcia-Molina 00], [Liang 01]
Identifying plagiarized documents [Hoad, Zobel 03]
Finding near-replicas [Shivakumar, Garcia-Molina 98], [Di Iorio, Diligenti, Gori, Maggini, Pucci 03]
Copy detection [Brin, Davis, Garcia-Molina 95], [Garcia-Molina, Gravano, Shivakumar 96], [Shivakumar, Garcia-Molina 96]
More Related Work
An algorithm that finds site-specific valid DUST rules while requiring a minimal number of fetches
Convincing results in experiments; benefits to crawling
Our contributions
Alias DUST: simple substring substitutions
"story_1259" → "story?id=1259", "news.google.com" → "google.com/news", "/index.html" → ""
Parameter DUST: standard URL structure:
protocol://domain.name/path/name?para=val&pa=va
Some parameters do not affect content: they can be removed, or changed to a default value (see the sketch below)
Types of DUST
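A minimal sketch of applying the two rule types, assuming alias rules are plain substring substitutions and parameter rules act on the query string; the URLs and parameter names below are illustrative:

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def apply_alias_rule(url, alpha, beta):
    # Alias DUST: substitute the substring alpha with beta, e.g. "story_" -> "story?id=".
    return url.replace(alpha, beta, 1) if alpha in url else None

def drop_parameter(url, param):
    # Parameter DUST: remove a parameter that has little effect on content, e.g. "Print".
    parts = urlsplit(url)
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query) if k != param])
    return urlunsplit(parts._replace(query=query))

print(apply_alias_rule("http://www.sitename.com/story_2659", "story_", "story?id="))
# -> http://www.sitename.com/story?id=2659
print(drop_parameter("http://domain.name/story?id=2659&Print=1", "Print"))
# -> http://domain.name/story?id=2659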
Input: URL list
Detect likely DUST rules
Eliminate redundant rules
Validate DUST rules using samples: eliminate DUST rules that are "wrong", and further eliminate duplicate DUST rules (a skeleton follows below)
No Fetch Required
Our Basic Framework
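A minimal skeleton of this framework, assuming the helper functions are implemented roughly as sketched on later slides (all names here are illustrative):

def dustbuster(url_list):
    # Phase 1: detect likely DUST rules from the URL list alone (no fetches).
    likely_rules = detect_likely_rules(url_list)
    # Phase 2: drop rules that are refined by another rule with similar support (no fetches).
    surviving = eliminate_redundant_rules(likely_rules)
    # Phase 3: validate the surviving rules by fetching a small sample of page pairs.
    return [rule for rule in surviving if validate_rule(rule, url_list)]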
Large support principle: Likely DUST rules have lots of “evidence” supporting them
Small buckets principle: ignore evidence that supports many different rules
How to detect likely DUST rules?
Large Support Principle
A pair of URLs (u,v) is an instance of rule r if r(u) = v
Support(r) = all instances (u,v) of r
Large Support Principle
The support of a valid DUST rule is large
Rule Support: An Equivalent View
α: a string. Ex: α = "story_"
u: a URL that contains α as a substring. Ex: u = "http://www.sitename.com/story_2659"
Envelope of α in u: a pair of strings (p,s), where p is the prefix of u preceding α and s is the suffix of u succeeding α. Example: p = "http://www.sitename.com/", s = "2659"
E(α): all envelopes of α in URLs that appear in the input URL list
Envelopes Example
Rule Support: An Equivalent View
α → β: an alias DUST rule. Ex: α = "story_", β = "story?id="
Lemma: |Support(α → β)| = |E(α) ∩ E(β)|
Proof:
bucket(p,s) = { α | (p,s) ∈ E(α) }
Observation: (u,v) is an instance of α → β if and only if u = pαs and v = pβs for some pair (p,s)
Hence, (u,v) is an instance of α → β iff (p,s) ∈ E(α) ∩ E(β)
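A minimal sketch of building buckets of envelopes from a URL list, under the simplifying assumption that candidate substrings are short character spans rather than token sequences (names are illustrative):

from collections import defaultdict

def buckets_from_urls(urls, max_substring_len=12):
    # bucket(p, s) = the set of substrings alpha such that (p, s) is an envelope
    # of alpha in some URL of the list; empty substrings are omitted in this sketch.
    buckets = defaultdict(set)
    for u in urls:
        for i in range(len(u)):
            for j in range(i + 1, min(len(u), i + max_substring_len) + 1):
                alpha = u[i:j]                        # candidate substring
                buckets[(u[:i], u[j:])].add(alpha)    # (prefix, suffix) envelope
    return buckets

# Substrings alpha and beta that share many envelopes give a likely rule
# alpha -> beta, since |Support(alpha -> beta)| = |E(alpha) ∩ E(beta)|.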
Large Buckets
Often there is a large set of substrings that are interchangeable within a given URL while not being DUST:
page=1, page=2, …; lecture-1.pdf, lecture-2.pdf, …
This gives rise to large buckets:
Big buckets: a popular (prefix, suffix) pair. They often do not contain similar content, and are expensive to process
I am a DUCK not a DUST
Small Buckets Principle
Most of the support of valid alias DUST rules is likely to belong to small buckets
Scan the log and form buckets
Ignore big buckets
For each small bucket: for every two substrings α, β in the bucket, output the pair (α, β)
Sort the output by (α, β)
For every pair (α, β): count its occurrences; if (count > threshold), output the likely rule α → β
Algorithm – Detecting Likely DUST Rules. No fetch here!
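A minimal sketch of this detection pass, reusing the hypothetical buckets_from_urls helper from the previous sketch; the bucket-size cap and support threshold are illustrative, not the paper's values:

from collections import Counter
from itertools import combinations

def detect_likely_rules(urls, max_bucket_size=10, min_support=3):
    buckets = buckets_from_urls(urls)      # hypothetical helper sketched earlier
    pair_support = Counter()
    for bucket in buckets.values():
        # Small-buckets principle: ignore evidence from big buckets.
        if len(bucket) > max_bucket_size:
            continue
        # Every pair of substrings sharing this envelope is one piece of evidence.
        for alpha, beta in combinations(sorted(bucket), 2):
            pair_support[(alpha, beta)] += 1
    # Large-support principle: keep only pairs with enough supporting instances.
    return [pair for pair, count in pair_support.items() if count >= min_support]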
Size and Comments
Consider only instances of rules whose document sizes "match"
Use ranges of sizes
Running time: O(L log L), where L is the size of the URL list
Process only short substrings; tokenize URLs
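A minimal sketch of size-range matching, assuming the log records a document size per URL and that coarse logarithmic ranges are an acceptable stand-in for the size ranges used here:

import math

def size_range(size_bytes, base=1.5):
    # Map a document size to a coarse logarithmic range, so that pages whose
    # sizes differ only slightly (counters, text ads) still "match".
    return int(math.log(max(size_bytes, 1), base))

def sizes_match(size_u, size_v):
    return size_range(size_u) == size_range(size_v)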
Input: URL list
Detect likely DUST rules
Eliminate redundant rules
Validate DUST rules using samples: eliminate DUST rules that are "wrong", and further eliminate duplicate DUST rules
No Fetch Required
Our Basic Framework
Eliminating Redundant Rules
Example: "/vlsi/" → "/labs/vlsi/" refines "/vlsi" → "/labs/vlsi"
Lemma: a substitution rule α' → β' refines rule α → β if and only if there exists an envelope (γ,δ) such that α' = γ◦α◦δ and β' = γ◦β◦δ
The lemma helps us identify refinements easily: does φ refine ψ? If so, remove ψ if their supports match
Rule φ refines rule ψ if SUPPORT(φ) ⊆ SUPPORT(ψ)
No Fetch here!
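A minimal sketch of the refinement test implied by the lemma, assuming rules are represented as (alpha, beta) string pairs:

def refines(rule_a, rule_b):
    # rule_a = (alpha', beta') refines rule_b = (alpha, beta) iff there is an
    # envelope (gamma, delta) with alpha' = gamma+alpha+delta and beta' = gamma+beta+delta.
    (alpha2, beta2), (alpha1, beta1) = rule_a, rule_b
    for i in range(len(alpha2) - len(alpha1) + 1):
        if alpha2[i:i + len(alpha1)] == alpha1:
            gamma, delta = alpha2[:i], alpha2[i + len(alpha1):]
            if beta2 == gamma + beta1 + delta:
                return True
    return False

# Example: ("/vlsi/", "/labs/vlsi/") refines ("/vlsi", "/labs/vlsi"),
# with gamma = "" and delta = "/".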
Validating Likely Rules
For each likely rule r, in both directions:
Find sample URLs from the list to which r is applicable
For each URL u in the sample: v = r(u); fetch u and v; check whether content(u) is similar to content(v)
If the fraction of similar pairs > threshold: declare rule r valid (a sketch follows below)
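A minimal sketch of this validation loop, assuming a hypothetical fetch() helper, the shingle() sketch from earlier, and illustrative values for the sample size and similarity threshold:

def validate_rule(rule, urls, sample_size=50, threshold=0.9):
    # Sample URLs to which the rule applies, apply it, fetch both pages,
    # and require that a large enough fraction of the pairs have similar content.
    alpha, beta = rule
    applicable = [u for u in urls if alpha in u][:sample_size]
    if not applicable:
        return False
    similar = 0
    for u in applicable:
        v = u.replace(alpha, beta, 1)
        doc_u, doc_v = fetch(u), fetch(v)   # fetch() is a hypothetical page fetcher
        if doc_u is not None and doc_v is not None and shingle(doc_u) == shingle(doc_v):
            similar += 1
    return similar / len(applicable) >= threshold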
Assumption: if a rule validates beyond the threshold on a sample of 100 URLs, it will validate the same way on any larger sample
Why isn't the threshold 100%? A 95%-valid rule may still be worth it, and dynamic pages change often
Comments About Validation
We experiment on logs of two web sites: a dynamic forum and an academic site
Rules were detected from a log of about 20,000 unique URLs
On each site we used four logs from different time periods
Experimental Setup
Precision at k
Precision vs. Validation
How much of the DUST do we find? What other duplicates are there?
Soft errors
True copies: last semester's course pages, the same paper under all of its authors
Frames, image galleries
Recall
In a crawl we examined, the crawl was reduced by 18%.
DUST Distribution
47.1% DUST
25.7% Images
7.6% Soft Errors
17.9% Exact Copy
1.8% misc
DustBuster is an efficient algorithm: it finds DUST rules, can reduce a crawl, and can benefit ranking algorithms
Conclusions
THE END
α → "" rules: all rules with β = ""
Fix the drawing: URLs crossing α; not all p and all s
Things to fix
So far, rules are non-directional
Prefer shrinking rules and lexicographically lowering rules; check those directions first
For each parameter name and its possible values, candidate rules are:
Remove the parameter
Substitute one value with another
Substitute all values with a single value
These rules are validated the same way the alias rules are (a sketch follows below)
Will not be discussed further
Parametric DUST
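A minimal sketch of generating candidate parameter rules from the query strings in the URL list; the rule encoding and the choice of default value are illustrative assumptions:

from urllib.parse import urlsplit, parse_qsl

def candidate_parameter_rules(urls):
    # Collect each parameter's observed values across the URL list.
    observed = {}
    for u in urls:
        for name, value in parse_qsl(urlsplit(u).query):
            observed.setdefault(name, set()).add(value)
    rules = []
    for name, values in observed.items():
        # Candidate: remove the parameter entirely.
        rules.append(("remove", name, None))
        # Candidate: substitute every observed value with a single default value.
        rules.append(("set-default", name, sorted(values)[0]))
    return rules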
Unfortunately, we see a lot of "wrong" rules
Simply wrong substitutions: substituting "1" with "2", or one domain name with another that runs similar software
Examples of false rules: /YoninaEldar/ ≠ /DavidMalah/, /labs/vlsi/oldsite ≠ /labs/vlsi, "-2." ≠ "-3."
False Rules
Filtering out False Rules
Getting rid of the big buckets; using the size field:
False DUST rules may give valid URLs, but the content is not similar and the size is probably different, so size ranges are used
Tokenization helps
DustBuster – cleaning up the rules
Go over the list with a window; if rule a refines rule b and their support sizes are close, leave only rule a
DustBuster – Validation
Validation per rule: get sample URLs to which the rule can be applied; apply the rule (URL => applied URL); fetch the content of both; compare using shingles
DustBuster - Validation
Stop fetching when: #failures > 100 * (1 - threshold)
A page that doesn't exist is not similar to anything else
Why use a threshold < 100%? Shingles are not perfect, and dynamic pages may change quickly
Detect Alias DUST – take 2
Tokenize, of course; form buckets; ignore big buckets; count support only if sizes match; don't count long substrings. The results are cleaner.
Eliminate Redundancies

EliminateRedundancies(pairs_list R):
  for i = 1 to |R| do
    if (R[i] already eliminated) continue
    to_eliminate_current := false
    /* Go over a window */
    for j = 1 to min(MW, |R| - i) do
      /* Support not close? Stop checking */
      if (R[i].size - R[i+j].size > max(MRD * R[i].size, MAD)) break
      /* a refines b? remove b */
      if (R[i] refines R[i+j])
        eliminate R[i+j]
      else if (R[i+j] refines R[i]) then
        to_eliminate_current := true
        break
    if (to_eliminate_current)
      eliminate R[i]
  return R
No Fetch here!
Validate a Single Rule

ValidateRule(R, L):
  positive := 0
  negative := 0
  /* Stop when you are sure you either succeeded or failed */
  while (positive < (1 - ε)N AND negative < εN) do
    u := a random URL from L to which R is applicable
    v := outcome of applying R to u
    fetch u and v
    if (fetch of u failed) continue
    /* Something went wrong: negative sample */
    if (fetch of v failed) OR (shingling(u) ≠ shingling(v))
      negative := negative + 1
    /* Another positive sample */
    else
      positive := positive + 1
  if (negative ≥ εN)
    return FALSE
  return TRUE
Validate Rules

Validate(rules_list R, test_log L):
  create an empty list of rules LR
  for i = 1 to |R| do
    /* Go over rules that survived so far = valid rules */
    for j = 1 to i - 1 do
      if (R[j] was not eliminated AND R[i] refines R[j])
        eliminate R[i] from the list
        break
    if (R[i] was eliminated) continue
    /* Test one direction */
    if (ValidateRule(R[i].alpha → R[i].beta, L))
      add R[i].alpha → R[i].beta to LR
    /* Test the other direction only if the first direction failed */
    else if (ValidateRule(R[i].beta → R[i].alpha, L))
      add R[i].beta → R[i].alpha to LR
    else
      eliminate R[i] from the list
  return LR