Finding More Bilingual Webpages with High Credibility via Link Analysis Chengzhi Zhang , Nanjing University of Science and Technology Xuchen Yao , Johns Hopkins University Chunyu Kit , City University of Hong Kong 8 August 2013 BUCC2013, Sofia, Bulgaria

Page 1: Finding More Bilingual Webpages  with High Credibility via Link Analysis

Finding More Bilingual Webpages

with High Credibility

via Link Analysis

Chengzhi Zhang , Nanjing University of Science and Technology

Xuchen Yao , Johns Hopkins University

Chunyu Kit , City University of Hong Kong

8 August 2013

BUCC2013, Sofia, Bulgaria

Page 2: Finding More Bilingual Webpages  with High Credibility via Link Analysis

3 ideas

• Bilingual URL Pattern Detection

• Deep Webpage Recovery

• Incremental Bilingual Website Exploration

Page 3: Finding More Bilingual Webpages  with High Credibility via Link Analysis

Bilingual URL Pattern Detection

• a URL pattern: <en, zh> (Kit and Ng, 2007)
  – www.legco.gov.hk/yr99-00/en/fc/esc/e0.htm
  – www.legco.gov.hk/yr99-00/zh/fc/esc/e0.htm

• Improvement:
  – pairing cost drops from O(|U|^2) to O(|U|)
    • U is the set of all URLs within a website
    • approach: inverted index for URLs
  – token-based pair -> char-based pair
    • weak pairs: <1e, 1c>, <2e, 2c>, ...
      – http://.../1e/i.html <-> http://.../1c/i.html
    • enhanced: <e, c>
  – supports multiple languages
    • better mining of multilingual websites such as EU and UN
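The inverted-index idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a candidate pattern such as <en, zh> is already known, works on whole path tokens rather than characters, and the function name `pair_urls` is hypothetical. Each URL is hashed under a key in which the language token is masked, so matching pairs land in the same bucket in a single pass over U.

```python
from collections import defaultdict


def pair_urls(urls, pattern):
    """Pair URLs that differ only by the two path tokens in `pattern`.

    Hypothetical helper: one pass over the URLs, one dictionary lookup
    per path token, instead of comparing every URL with every other URL.
    """
    src, tgt = pattern
    # Inverted index: masked URL -> {language token: original URL}
    buckets = defaultdict(dict)
    for url in urls:
        parts = url.split("/")
        for i, part in enumerate(parts):
            if part in (src, tgt):
                # Mask the language token so both versions share one key.
                key = "/".join(parts[:i] + ["*"] + parts[i + 1:])
                buckets[key][part] = url
    return sorted(
        (found[src], found[tgt])
        for found in buckets.values()
        if src in found and tgt in found
    )
```

Calling `pair_urls` on the two legco.gov.hk URLs above with the pattern `("en", "zh")` would return them as one pair; URLs without a counterpart stay unpaired.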

Page 7: Finding More Bilingual Webpages  with High Credibility via Link Analysis

Top 20 Keys

Page 8: Finding More Bilingual Webpages  with High Credibility via Link Analysis

Number of matched URLs (top 10)

Page 9: Finding More Bilingual Webpages  with High Credibility via Link Analysis

Keys in domain “gov.hk” (rescue locally weak keys if they are globally strong)

Page 10: Finding More Bilingual Webpages  with High Credibility via Link Analysis

3 ideas

• Bilingual URL Pattern Detection

• Deep Webpage Recovery

• Incremental Bilingual Website Exploration

Page 11: Finding More Bilingual Webpages  with High Credibility via Link Analysis

Deep Webpage Recovery

• deep webpage: a page that is not linked from any static page (and so is not searchable) until it is created dynamically
  – mostly triggered by JavaScript or Flash actions

• http://www.fehd.gov.hk/tc_chi/cagenda 20070904.htm
  – we have discovered patterns <tc_chi, english>, <tc_chi, en>, <tc_chi, eng>, ..., then try:
  – wget http://www.fehd.gov.hk/english/cagenda 20070904.htm
  – wget http://www.fehd.gov.hk/en/cagenda 20070904.htm
  – wget http://www.fehd.gov.hk/eng/cagenda 20070904.htm
  – ...
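The candidate generation step can be sketched as below. This is only an illustration under stated assumptions: the function name `candidate_urls` is hypothetical, patterns are matched on whole path tokens, and the actual fetching (done with wget on the slide) is left out, so the sketch only lists the URLs that would be tried.

```python
def candidate_urls(url, patterns):
    """List the URLs to probe for a possible translation of `url`.

    Hypothetical helper: for every discovered pattern <src, tgt>,
    swap the path token `src` for `tgt` and keep the result.
    """
    parts = url.split("/")
    candidates = []
    for src, tgt in patterns:
        for i, part in enumerate(parts):
            if part == src:
                candidates.append("/".join(parts[:i] + [tgt] + parts[i + 1:]))
    return candidates
```

Each candidate would then be requested in turn; a successful response suggests a deep page that no static link exposed.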

Page 12: Finding More Bilingual Webpages  with High Credibility via Link Analysis

various structures of websites with deep pages

Page 16: Finding More Bilingual Webpages  with High Credibility via Link Analysis

3 ideas

• Bilingual URL Pattern Detection

• Deep Webpage Recovery

• Incremental Bilingual Website Exploration

Page 17: Finding More Bilingual Webpages  with High Credibility via Link Analysis

Incremental Bilingual Website Exploration

• Intuition: bilingual websites tend to link to other bilingual websites.

• Measures:
  – Linkout(w): total number of outgoing links from website w
  – PageRank(w): (Brin and Page, 1998)
  – WeightedPageRank(w): weighted by "how bilingual" w is (the more bilingual URLs, the more "bilingual" w is)
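The slides do not give the formula for WeightedPageRank, so the following is one plausible reading rather than the authors' definition: a standard PageRank power iteration in which the teleport distribution is proportional to each site's number of bilingual URLs. The function name, the `weight` mapping, and the damping value 0.85 are all assumptions made for this sketch.

```python
def weighted_pagerank(links, weight, damping=0.85, iterations=50):
    """PageRank over websites with a teleport step biased by `weight`.

    links:  site -> list of sites it links to
    weight: site -> number of bilingual URLs found on it (assumed input)
    """
    sites = sorted(set(links) | {t for ts in links.values() for t in ts})
    total = sum(weight.get(s, 0) for s in sites)
    if total > 0:
        teleport = {s: weight.get(s, 0) / total for s in sites}
    else:
        # No bilingual evidence at all: fall back to plain PageRank.
        teleport = {s: 1 / len(sites) for s in sites}
    rank = dict(teleport)
    for _ in range(iterations):
        new = {s: (1 - damping) * teleport[s] for s in sites}
        for s in sites:
            targets = links.get(s, [])
            if targets:
                share = damping * rank[s] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling site: spread its rank by the teleport distribution.
                for t in sites:
                    new[t] += damping * rank[s] * teleport[t]
        rank = new
    return rank
```

Under this reading, two sites with identical incoming links are separated by how many bilingual URLs each one holds, which matches the intuition that "more bilingual" sites should be explored first.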

Page 21: Finding More Bilingual Webpages  with High Credibility via Link Analysis

Discovering related websites from seed websites (select the top K most related websites)

[Linkout, PageRank, WeightedPageRank]

Page 22: Finding More Bilingual Webpages  with High Credibility via Link Analysis

Evaluation on number of URL pairs found and precision

total websites: 12,800

Page 23: Finding More Bilingual Webpages  with High Credibility via Link Analysis

Conclusion

• Unsupervised bilingual pair detection (no heuristics)
  – http://mega.ctl.cityu.edu.hk/~czhang22/pupsniffer-eval/Data/Pattern_Credibility_LargeThan100.txt

• A large collection of English-Chinese webpages