Mapping Domain Names to Categories
Maya Rotmensch, Sorcha Gilroy, Corina Gurau
Academic Mentor: Cristina Garcia-Cardona
Industry Sponsor: Oversee.net (Kryztof Urban)
Institute for Pure and Applied Mathematics
Research in Industrial Projects
August 15, 2013
Institute for Pure & Applied Mathematics
University of California, Los Angeles
(Institute of Pure and Applied Mathematics) Mapping Domain Names to Categories August 15, 2013 1 / 41
Outline
1 Oversee.net
2 Problem Statement
    Why so complicated?
    ESA - Explicit Semantic Analysis
    How Oversee.net Does It
3 Our Project
    Our Focus
    Methodology
    Results
4 Concluding Remarks
Oversee.net’s Business Model
[Diagram: Person looking for games → Domain → A gaming website]
Direct Navigation: when users navigate to a website by using the address bar instead of a search engine.
Example: a person looking for a gaming website navigates to 'addictinggamas.com'
Oversee.net’s Business Model
Domain parking + traffic matching → Oversee.net
[Diagram: Person → Domain → Category → Website]
Monetized Domain Parking
  - The registration of internet domain names without placing any content on the domain.
  - Owners monetize traffic by displaying links and advertisements
Oversee.net’s Business Model
Advertisers
  - Partners of Oversee.net
  - Choose the types of traffic they want from Oversee.net’s category tree
Oversee.net’s Business Model
Parked domains do not have any content
Mapping Domains to Categories is extremely difficult
  - Oversee.net uses keywords to describe domains and categories
[Diagram: Domain → Keywords, Keywords ← Category]
Not enough, as we are not guaranteed use of the same language!
So what’s the big deal?
Reasoning about concepts
Scarcity of input information
  - Example 1 - Spelling error: cheapvacatins.com
  - Example 2 - Ambiguous meaning: bigbearhuts.com (animals? huts? it’s supposed to be winter sports)
Text Categorization
Our problem can be thought of as a problem of categorization. We need to assign a domain to one or more classes or categories
  - A natural choice is topic modeling
  - However, unlike most text categorization problems, we don’t actually have documents to classify, as we are dealing with undeveloped domains
Topic Modeling
This method analyzes the relationships between documents in a corpus by isolating a set of topics from the documents
For meaningful results, one must work with a set of large texts
  - Our data set consists of keywords, as our domains are undeveloped
This method results in organic generation of topics
  - The categories we are attempting to map into are pre-defined
ESA - Explicit Semantic Analysis
Building a Semantic Interpreter
Using a Vector Space Model + an exogenous knowledge base → represent the meaning of text [1]
# of articles ∼ 3.5 million
# of terms ∼ 45 million
[1] Evgeniy Gabrilovich and Shaul Markovitch. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), 2007.
ESA - Explicit Semantic Analysis
            Government  Finance  Toys  Children  Bank  School  ...
Law         0.2         0.3      0.8   0.9       0.2   0.7     ...
Article2    0.8         0.9      0.1   0.3       0.7   0.5     ...
Article3    0.5         0.2      0.3   0.6       0.4   0.8     ...
Article4    0.1         0.2      0.1   0.3       0.4   0.2     ...
...         ...         ...      ...   ...       ...   ...
Term frequency inverse document frequency:
$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$$
Logarithmically scaled term frequency:
$$\mathrm{tf}(t, d) = \log\bigl(f(t, d) + 1\bigr)$$
Inverse document frequency:
$$\mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$$
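The three TF-IDF definitions on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the toy corpus, the function names, and the choice to return 0 for a term that appears in no document are assumptions made for the example.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Logarithmically scaled term frequency: log(f(t, d) + 1)."""
    return math.log(Counter(doc_tokens)[term] + 1)


def idf(term, corpus):
    """Inverse document frequency: log(|D| / |{d in D : t in d}|)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: a term seen in no document gets weight 0
    return math.log(len(corpus) / containing)


def tfidf(term, doc_tokens, corpus):
    """tfidf(t, d, D) = tf(t, d) * idf(t, D)."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    ["bank", "finance", "bank"],
    ["toys", "children"],
    ["bank", "school"],
]

score = tfidf("bank", corpus[0], corpus)  # log(2 + 1) * log(3 / 2)
```

A term that does not occur in a document gets tf = log(1) = 0, so its TF-IDF weight for that document is 0 regardless of its idf.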
ESA - Explicit Semantic Analysis
Using a Semantic Interpreter
Cosine similarity measure
$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
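As a minimal sketch of the cosine measure, the function below compares two weight vectors given as plain Python lists. The two example vectors are the "Government" and "Finance" columns of the illustrative matrix on the earlier slide; treating a zero vector as having similarity 0 is an assumption of this sketch.

```python
import math


def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| ||B||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: similarity with an all-zero vector is 0
    return dot / (norm_a * norm_b)


# Weights of two terms across Law, Article2, Article3, Article4.
government = [0.2, 0.8, 0.5, 0.1]
finance = [0.3, 0.9, 0.2, 0.2]

similarity = cosine_similarity(government, finance)
```

Because TF-IDF weights are non-negative, the similarity of two such vectors always lies between 0 and 1.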
How Oversee.net Does It
Instead of comparing two texts - compare two small sets of words!
Use keywords to describe domains and categories
Represent these keywords in terms of DBpedia articles
  - A keyword is significantly related to an article if the TF-IDF is above a certain threshold
  - The set of articles associated to a domain/category is the union of the sets of articles associated to its keywords
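The thresholding and union steps described above can be sketched as follows. The nested dictionary `tfidf_scores` (keyword to article to weight), the article names, and the threshold value are all hypothetical stand-ins for the DBpedia-based data; none of them come from Oversee.net's system.

```python
def articles_for_keyword(keyword, tfidf_scores, threshold):
    """Articles whose TF-IDF weight for this keyword is above the threshold."""
    weights = tfidf_scores.get(keyword, {})
    return {article for article, w in weights.items() if w > threshold}


def articles_for_keywords(keywords, tfidf_scores, threshold):
    """Union of the article sets of all keywords of a domain or category."""
    result = set()
    for keyword in keywords:
        result |= articles_for_keyword(keyword, tfidf_scores, threshold)
    return result


# Hypothetical keyword -> {article: TF-IDF weight} table.
tfidf_scores = {
    "bank": {"Bank": 0.9, "River": 0.2, "Finance": 0.6},
    "loan": {"Finance": 0.8, "Mortgage": 0.7},
}

domain_articles = articles_for_keywords(["bank", "loan"], tfidf_scores, threshold=0.5)
```

With the threshold set to 0.5, the weakly related article "River" is dropped and the remaining sets are merged.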
How Oversee.net Does It
Compare the two sets of articles (A - domains, B - categories) using the Jaccard Index:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
Categories with highest scores using this index are matched to a domain
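A minimal sketch of this matching step: compute the Jaccard Index between a domain's article set and each category's article set, then keep the highest-scoring categories. The category names and article sets are hypothetical, and `top_n` is an assumed parameter; the slides do not say how many categories are kept.

```python
def jaccard_index(a, b):
    """J(A, B) = |A intersect B| / |A union B|; defined as 0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def rank_categories(domain_articles, category_articles, top_n=3):
    """Return the top_n (category, score) pairs, best score first."""
    scored = [
        (jaccard_index(domain_articles, articles), name)
        for name, articles in category_articles.items()
    ]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top_n]]


# Hypothetical article sets.
domain_articles = {"Bank", "Finance", "Mortgage"}
category_articles = {
    "banking": {"Bank", "Finance", "Savings account"},
    "toys": {"Toy", "Children"},
}

ranking = rank_categories(domain_articles, category_articles, top_n=2)
```

Here "banking" shares two of four distinct articles with the domain (score 0.5), while "toys" shares none (score 0).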
Our Focus
[Diagram: Domain → Keywords, Keywords ← Category]
Critical link: domains to keywords
Improve quality of keywords
  - Click Through Rate
  - String Similarity
  - Semantic Analysis
Keyword               CTR  String Similarity  Semantic Similarity
industrial            20   80                 0
industriel            20   89                 0
industrie             20   100                0
china manufacturer    20   0                  88
industries            20   80                 98
industrial companies  20   0                  86
Domain Keywords
Focusing on developing the link between domains and keywords, the two main questions we posed for our research were:
Could we use ESA to extend the number of meaningful keywords per domain?
Could we use the keywords obtained through Oversee.net in-house statistics as the basis of the new keywords?
Methodology
Extending the set of keywords:
When generating new keywords:
Only take top 3 articles
Only take top 2 terms
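One way to read the two cut-offs above is: for each existing keyword, keep its 3 highest-weighted articles, and from each of those articles keep the 2 highest-weighted terms as new keywords. The sketch below follows that reading. It is an interpretation, not the project's code, and the two lookup tables and their contents are hypothetical.

```python
def expand_keywords(keywords, keyword_to_articles, article_to_terms,
                    top_articles=3, top_terms=2):
    """Generate new keywords from the best terms of the best articles."""
    new_keywords = []
    for keyword in keywords:
        articles = keyword_to_articles.get(keyword, {})
        # Articles most strongly associated with this keyword.
        best_articles = sorted(articles, key=articles.get, reverse=True)[:top_articles]
        for article in best_articles:
            terms = article_to_terms.get(article, {})
            # Highest-weighted terms of that article.
            best_terms = sorted(terms, key=terms.get, reverse=True)[:top_terms]
            for term in best_terms:
                if term not in keywords and term not in new_keywords:
                    new_keywords.append(term)
    return new_keywords


# Hypothetical weights (keyword -> article and article -> term).
keyword_to_articles = {
    "bank": {"Bank": 0.9, "River": 0.4, "Finance": 0.8, "Piggy bank": 0.3},
}
article_to_terms = {
    "Bank": {"bank": 0.9, "loan": 0.7, "deposit": 0.6},
    "Finance": {"finance": 0.9, "investment": 0.8, "bank": 0.5},
    "River": {"river": 0.9, "water": 0.7},
    "Piggy bank": {"coin": 0.9},
}

new_keywords = expand_keywords(["bank"], keyword_to_articles, article_to_terms)
```

In this toy run the fourth article ("Piggy bank") is cut off by the top-3 limit, and the unrelated sense "River" still contributes "river" and "water", which illustrates how expansion can also add noise.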
Methodology
Method 2 for extending the set of keywords:
Breaking up and correcting the domain name
Example: domain = ’chaselogon.com’
Candidate substrings and splits of the name:
  haselogon, aselogon
  cha selogon, chas elogon, chase logon, chasel ogon, chaselo gon
  chaselog, chaselogo
If entire string matches a word in reference file then stop
If both parts of broken string are exact words then stop
If substring is an exact word then correct other part using edit distances
  - Corrections used: deletions, transpositions, replacements, insertions
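The three stopping rules and the four correction types can be sketched as below. This is a simplified illustration under stated assumptions: the reference file is replaced by a tiny hypothetical word set, only two-way splits are tried, corrections are limited to edit distance 1, and the first alphabetical candidate is taken. The substring variants that drop leading or trailing characters are not covered.

```python
import string


def edits1(word):
    """All strings one edit away: deletions, transpositions, replacements, insertions."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def parse_domain(name, vocabulary):
    """Break a domain name into words, correcting one part if needed."""
    # Rule 1: the entire string matches a word in the reference file.
    if name in vocabulary:
        return [name]
    splits = [(name[:i], name[i:]) for i in range(1, len(name))]
    # Rule 2: both parts of a split are exact words.
    for left, right in splits:
        if left in vocabulary and right in vocabulary:
            return [left, right]
    # Rule 3: one part is an exact word, so correct the other part.
    for left, right in splits:
        if left in vocabulary:
            fixes = sorted(edits1(right) & vocabulary)
            if fixes:
                return [left, fixes[0]]
        if right in vocabulary:
            fixes = sorted(edits1(left) & vocabulary)
            if fixes:
                return [fixes[0], right]
    return []


# Hypothetical stand-in for the reference file.
vocabulary = {"chase", "logon", "cheap", "vacations"}

exact_split = parse_domain("chaselogon", vocabulary)
corrected = parse_domain("cheapvacatins", vocabulary)
```

With this vocabulary, `chaselogon` stops at rule 2 (both parts are words), while `cheapvacatins` reaches rule 3, where a single insertion repairs the second part.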
Methodology
Method 2 for extending the set of keywords:
The reference file is made up of collections of text, to which we have added more information:
  - Company names
  - Popular websites
  - Brand and store names
  - Countries and major cities
Initial Keywords    Keywords after parsing
chameloeon          chas
                    chase
                    elson
                    login
Methodology
Generating new keywords and mapping to categories
Example: bankfianancial.com
Keywords: nco, financial, ban, bankfinancial
New keywords: financial institutions, financial centre, lobsters, official personal, societies chairman, ...
Matched categories:
  - finance (Jaccard Index = 0.240492): retirement pension, debit card, tenant credit check, ...
  - credit cards (Jaccard Index = 0.348147): debit card, credit applications, rewards program, ...
  - banking (Jaccard Index = 0.219457): savings banking, checks, community bank, ...
Results: Comparing Their Keywords to Semantic
We were given a sample of 300 domains that had been matched by hand to a total of 500 categories
                      CTR & String Similarity   CTR, String Similarity & Semantic Analysis
Number of matches     25                        309
Percentage of match   5%                        61.8%
Results: Generating New Keywords
Using Method 1:
                      CTR & String Similarity   Method 1   CTR & String Similarity & 7 Random
Number of matches     25                        21         24
Percentage of match   5%                        4.2%       4.8%
Most of the time, the different methods yielded the same results
Cases where the new keywords improved the system:
  - thhetrainline.com
Cases where the base case did better:
  - inindustries.com
Results
thhetrainline.com, base case
Keywords: thetrainline
  - Jaccard Index = 0.0001: microcars & city cars
  - Jaccard Index = 0.0002: property management
thhetrainline.com, with new keywords
Keywords: thetrainline, strafe train, moving departing, train station, telecommunications, georgia, rain shine, ...
  - Jaccard Index = 0.1348: bus & rail
  - Jaccard Index = 0.2255: libraries & museums
Results
inindustries.com, base case
Keywords: industrial, industrias, industriel, ...
  - Jaccard Index = 0.0786: manufacturing
inindustries.com, with new keywords
Keywords: industrial, industrias, industriel, ..., ministry, quarterly garden/outdoor, filipino footballer, ...
  - Jaccard Index = 0.099: tourist destinations
  - Jaccard Index = 0.1326: real estate
Results: Parsing the Domains
Using Method 1 & 2:
                      CTR & String Similarity   Method 1 & 2   CTR & String Similarity & 15 Random
Number of matches     25                        93             23
Percentage of match   5%                        18.6%          4.6%
Results - Parsing the Domains
chaselogon.com, base case
Keywords: chameloeon
  - No category matched
chaselogon.com, after parsing and adding new keywords
Keywords: chameloeon, chas, chase, elson, login, password, journalists cyber, logins expensive, beatles, ...
  - Jaccard Index = 0.4637: credit cards
  - Jaccard Index = 0.4637: banking
Results: Parsing the Domains
Using Method 2:
                      CTR & String Sim.   Method 1 & 2   Method 2
Number of matches     25                  97             77 out of 356
Percentage of match   5%                  19.4%          ∼ 21.6%
Initial results show that overall, just using parsing might be more beneficial → depends on the amount of noise.
Example with a lot of noise:
  - mobilestorage.ca
Example with minimal noise:
  - addictinggamas.com
Results - Amplification of noise
mobilestorage.ca, before amplification
Keywords: gfilestorage, mobileshop, mobilestorage, age, investor, vilest, ...
  - Jaccard Index = 0.1011: mobile & wireless
  - Jaccard Index = 0.0959: music & audio
mobilestorage.ca, after amplification
Keywords: gfilestorage, mobileshop, mobilestorage, age, investor, vilest, ..., legal age, taylor, phone companies, mobil, ...
  - Jaccard Index = 0.0942: music & audio
  - Jaccard Index = 0.0887: education
Results - Minimal noise
addictinggamas.com, before amplification
Keywords: addictinggams, addictivegames, adictigegames, ..., addict, addictinggames, ingram, ...
  - Jaccard Index = 0.0153: software
addictinggamas.com, after amplification
Keywords: addictinggams, addictivegames, adictigegames, ..., addict, addictinggames, ingram, ..., gameplay requires, game, impulsedriven flash, add ons, ...
  - Jaccard Index = 0.2019: computer & video games
  - Jaccard Index = 0.1975: games
Results: Extended Matches
Using Extended Matches:
We extended possible matches to parent and root nodes of the category tree.
  - We checked in how many cases the parent or root node of a category we obtained matched the manual matching.
                                 CTR & String Sim.   Method 1   Method 1 & 2   Method 2
Number of matches                25                  21         97             77 out of 356
Percentage of match              5%                  4.2%       19.4%          ∼ 21.6%
Number of extended matches       32                  29         128            102 out of 356
Percentage of extended matches   6.4%                5.8%       25.6%          ∼ 28.7%
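The extended-match check can be sketched as follows: a predicted category counts as an extended match if the category itself, its parent, or its root node appears in the manual matching. The `parent_of` mapping and the small category tree are hypothetical; the actual Oversee.net category tree is not shown in these slides.

```python
def ancestors(category, parent_of):
    """Chain of ancestors from the direct parent up to the root."""
    chain = []
    seen = {category}
    node = category
    while parent_of.get(node) is not None:
        node = parent_of[node]
        if node in seen:  # guard against a malformed (cyclic) tree
            break
        seen.add(node)
        chain.append(node)
    return chain


def is_extended_match(predicted, manual_categories, parent_of):
    """True if the predicted category, its parent, or its root was matched by hand."""
    chain = ancestors(predicted, parent_of)
    candidates = {predicted}
    if chain:
        candidates.add(chain[0])   # parent node
        candidates.add(chain[-1])  # root node
    return bool(candidates & set(manual_categories))


# Hypothetical category tree: child -> parent (None marks a root).
parent_of = {
    "computer & video games": "games",
    "games": "entertainment",
    "entertainment": None,
    "banking": "finance",
    "finance": None,
}

via_parent = is_extended_match("computer & video games", {"games"}, parent_of)
via_root = is_extended_match("computer & video games", {"entertainment"}, parent_of)
no_match = is_extended_match("banking", {"games"}, parent_of)
```

Counting a match at the parent or root level is a looser criterion than an exact match, which is consistent with the extended-match counts in the table being higher than the exact-match counts for every method.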
Conclusion
Implemented a program to match domains with categories
Created an ESA based method to amplify existing keywords
Adapted a domain name parsing and spell correcting method
Revisiting our research questions:
Could we use ESA to extend the number of meaningful keywords per domain? → Yes
Could we use the keywords obtained through Oversee.net in-house statistics as the basis of the new keywords? → No. Or at least further processing must be done.
Getting better & more keywords → first requires getting a few good keywords
Future Directions
Find out how many good initial keywords are required to use our method successfully
Explore a better way of ranking keywords and determine which are the most descriptive ones
  - Click through rate and string similarity comparisons are not sufficiently descriptive; a better scoring method is needed
Have a reference of the most popular websites, so that the domains given could be compared to these
  - Analyze content in websites to amplify domain to category mapping
Thank you!
Academic Mentor: Cristina Garcia-Cardona
Industry Sponsor: Kryztof Urban and Oversee.net
RIPS Director: Dr. Michael Raugh
Director of IPAM: Dr. Russ Caflisch
IPAM Staff: Dimi, Stacey, Stacy, Roland, Stephanie, and everyone who made RIPS possible
Questions?
Thank you for listening!