Indexing Approximations and Optimizations in Search Systems
by
Guillermo Antonio Rodriguez, B.Sc., M.S.
Dissertation
In
Systems and Engineering Management
Submitted to the Graduate Faculty
of Texas Tech University in
Partial Fulfillment of
the Requirements for
the Degree of
DOCTOR OF PHILOSOPHY
Approved
Dr. Mario Beruvides, Ph.D., P.E.
Chair of Committee
Dr. Susan Mengel, Ph.D.
Dr. Patrick Patterson, Ph.D., P.E., C.P.E.
Dr. Jennifer Cross, Ph.D.
Dr. Mark Sheridan, Ph.D.
Dean of the Graduate School
May 2017
Copyright 2017, Guillermo Antonio Rodriguez
Texas Tech University, Guillermo Antonio Rodriguez, May 2017
TABLE of CONTENTS
CHAPTER I
INTRODUCTION
1.1 Problem Statement
1.2 Research Question
1.2.1 Search Index Regression Approximations
1.2.2 Index Matrix Optimizations
1.3 General Hypothesis
1.3.1 System Attribute Formulas in Search Engine Index Approximations
1.3.2 Indexing Optimization
1.4 Assumption
1.5 Research Benefit
1.6 Research Outputs and Outcomes
1.7 Research Outline
CHAPTER II
LITERATURE REVIEW
2.1 Introduction
2.2 Content Clustering
2.2.1 Proximity Clustering
2.2.2 Query Logs
2.2.3 Page Attributes
2.3 Conclusion
CHAPTER III
SYSTEM ATTRIBUTE FORMULAS IN SEARCH ENGINE INDEX APPROXIMATIONS
3.1 Introduction
3.2 Literature Review
3.3 Page Attributes
3.3.1 Title
3.3.2 Copy
3.3.3 URL
3.3.4 Meta Tags
3.3.5 Keyword Proximity
3.3.6 Keyword Prominence
3.3.7 Anchor Text
3.3.8 Domain Age
3.3.9 Back Links
3.3.10 Click Through Counter
3.3.11 Lexical Context
3.3.12 Attribute Summary
3.4 Premise Introspection
3.5 Google
3.5.1 Title Tag
3.5.2 Meta Tags
3.5.3 URL
3.5.4 Anchor Text
3.5.5 Image Alternate Tags
3.5.6 Header Tags
3.5.7 Google Attribute Summary
3.6 Yahoo
3.7 Bing
3.8 Algorithm
3.9 Search Engine Approximation Model
3.10 Analysis Underpinnings
3.11 Bing Formula
3.12 Yahoo Formula
3.13 Data Collection Challenges
3.14 The Systems Paradigm
3.15 Summary
3.16 Bibliography
CHAPTER IV
PAGE INDEXING OPTIMIZATION PROPOSITIONS
4.1 Introduction
4.2 Big Data
4.3 System Variables
4.4 Literature Review
4.5 Theoretical Formulation
4.6 Underpinnings
4.7 Bing Formula
4.8 Yahoo Formula
4.9 Summary
4.10 Bibliography
CHAPTER V
CONCLUSION
BIBLIOGRAPHY
APPENDIX A
APPENDIX B
APPENDIX C
APPENDIX D
APPENDIX E
DEFINITIONS
API Application Program Interface
CSS Cascading Style Sheet
EM Expectation Maximization
HTML Hypertext Markup Language
SEO Search Engine Optimization
SOAP Simple Object Access Protocol
TABLES
2.1 Data Clustering Algorithm
2.2 Beeferman & Berger Click Through Results
2.3 Input Parameters
3.1 Attribute Summary
3.2 Google Attribute Summary
3.3 Attribute Mapping
3.4 Bing Attribute Statistics
3.5 Yahoo Attribute Statistics
4.1 Bing Quality Metric
4.2 Yahoo Quality Metric
5.1 System Attributes Summary
FIGURES
2.1 Venn Diagram – Content Clustering
2.2 Link Sample
3.1 Pair Relationship Plots in R – Bing Data
3.2 Pair Relationship Plots in R – Yahoo Data
3.3 Receiver Operating Characteristic (ROC) Curve – Bing Data
3.4 Receiver Operating Characteristic (ROC) Curve – Yahoo Data
4.1 Receiver Operating Characteristic (ROC) Curve – With Quality Component – Bing Data
4.2 Receiver Operating Characteristic (ROC) Curve – With Quality Component – Yahoo Data
CHAPTER I
INTRODUCTION
The notion of the Internet is credited to J.C.R. Licklider of MIT, who had the original
idea of using networking concepts to link a group of memos together [Jones (2002)].
Licklider's concepts were later extended and evolved into a project by Lawrence G. Roberts that
manifested itself as the concept known as ARPANET. It was through this idea of networked
communication that the Internet came to be and became such an indispensable tool
for so many. This linking of documents across a dispersed network structure led to further
developments such as email and the cloud, all concepts for which the idea of linking like data was
paramount. This idea of bridging the gap between dispersed object sets is what is of interest in
this paper, and its core concepts date back to the early 1960s and a gentleman who
simply wanted to link a series of memos together.
In order to facilitate the bridging of like documents across a network, a markup
language was needed: a format, if you will, for the exchange of information. This language is
called Hypertext Markup Language, or HTML as it is most widely known. HTML is closely
related to the Extensible Markup Language (XML); both are composed of start and end tags of
the form:
<TAG>Some Content</TAG>
It should also be noted that in XML the end tag is not always required: the start tag may have a
forward slash prior to the greater-than symbol, designating a self-closing tag in which the start
node and end node are one. The 'TAG' portion refers to a specific designated element in HTML,
but could be arbitrary text in the more generic form of XML. Documents are linked on the
Internet through the use of a specific type of tag: the anchor tag. The anchor tag adheres to a
specific format defined by the World Wide Web Consortium (W3C). The W3C
(1999) has specified the anchor tag to have the following definition:
<a name="A" href="B" hreflang="C" type="D" rel="E" rev="F" charset="G"></a>
Where:
A = Name of the current anchor tag
B = Web resource location
C = Base language
D = Content type hint
E = Relationship from current document
F = Reverse link description
G = Character encoding
Documents are linked across the Internet by way of the href (hyperlink reference) attribute of the
anchor tag, which designates one document or object as relevant to another document or object.
The search engine movement was advanced by two individuals, Larry Page and Sergey
Brin. Larry Page and Sergey Brin met at Stanford University and collaborated on a project – a
search engine called BackRub. Larry and Sergey placed their search engine on Stanford
University servers and ultimately took their project to the world as Google.com [White (2007)].
While the concepts of networking led to the manifestation of the monster that today is
Google.com, it only became so through the direct application of placing order on chaos. Today
there are other search engine providers, such as Microsoft with Bing and Yahoo with its
eponymous search engine, each of which is perceived as a free service; in reality
they are viable businesses making pennies per click on advertisement space sold on the
real estate that the search engine results have become. While the bulk of the motivation lies on the
corporate side for giants such as Google and Microsoft, there are also secondary markets that
have risen out of the forest that is the Internet. Electronic commerce in the United States
accounted for $1.8 trillion according to the U.S. Department of Commerce (2014). The
ability for customers to find a specific business online has become a necessity, and the argument
may even be made that it could mean the survival of a business venture in current times. For a
business, knowing how to structure the dynamics of a website is paramount in order to be
ranked highly by the search engines.
1.1 Problem Statement
While the major search engine Google (http://www.google.com) provides a general
outline for search engine optimization, others such as Bing (http://www.bing.com)
and Yahoo (http://www.yahoo.com) do not provide such guidance. Even though Google does
provide a general outline of what constitutes good content, it does not disclose its specific
algorithm or the relative weight of the attributes it has explicitly stated warrant interest
in its search indexing. Evidence of this may be seen in the secrecy surrounding the PageRank
algorithm and the scarcity of its published details.
The process of indexing content quality is also an evolving process that is by no means
stagnant and thus represents a moving target for the website owner. Take for example one quality
attribute utilized by the search engines to gauge search significance: the keyword index. The
keyword index is the number of sought keywords found divided by the total number of words
in the document; or stated differently:
KI = [Keywords Sought] / [Total Document Words]
The total document words is the physical count of all the words the document contains, void of
all HTML markup. In the early days of the Internet there were sites that flooded their content
with specific keywords so that the keyword index value would be high, but hid the excess
keywords by formatting the foreground text in the same color as the background of
the page, making the content invisible to the viewer of the page but completely visible to
the indexing algorithm of the search engines, a practice known as spamdexing. Once the search
engine providers discovered the illusion they began to penalize the offending website owners,
changing the rules of the game and once again revising the algorithm to account for the transgression.
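To make the keyword index concrete, the sketch below computes KI from a page's visible text in Python. It is a minimal illustration, not the dissertation's actual algorithm; the class name, sample page, and whitespace tokenization rule are assumptions made here.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text of a page, stripping all HTML markup."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def keyword_index(html, keywords):
    """KI = [keywords sought found] / [total document words]."""
    parser = TextExtractor()
    parser.feed(html)
    words = " ".join(parser.chunks).lower().split()
    if not words:
        return 0.0
    sought = {k.lower() for k in keywords}
    hits = sum(1 for w in words if w in sought)
    return hits / len(words)

page = "<html><body><h1>Cheap flights</h1><p>Book cheap flights today</p></body></html>"
print(keyword_index(page, ["cheap", "flights"]))  # 4 keyword hits out of 6 words
```

A spamdexed page that repeats hidden keywords would drive this ratio toward 1.0, which is precisely the signal the providers learned to penalize.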
The art of optimizing a website for high rankings on the search engines is referred to as SEO,
or Search Engine Optimization, and represents big business. Moving an e-commerce website
from the bowels of obscurity to the top of the search engine results, i.e., the first ten slots for a
given search, would entail a windfall for any business. The search engines have even come to
understand that changes made to a website cannot be percolated through the search engine filter
immediately, as that would signal to the astute listener whether a change was positive or negative;
as such, companies like Google have explicitly stated that it can take up to six months (King: 2008)
for website changes to fully take effect in their indexing. By delaying the feedback
mechanism, the search engine conceals from the affected party whether a given change
added value.
The problem faced by any individual hoping to achieve positive results on the major search
engines is how the document or documents can be structured so as to facilitate
positive results for a given search. While the current state of search engine
optimization weighs more heavily on the side of art than science, can this paradigm be changed
to one that systematically follows key index factors while designing a web page, so as to ensure a
positive search result? Of all the attributes deemed relevant in designing content for the web,
which are actually relevant and which are secondary or irrelevant? It is this final
question that brings forth the true problem statement of this discourse and the underlying
justification for its study. The problem this endeavor seeks to answer is whether, given a series
of system attributes that may be derived directly from a document or system, a model can be
generated that yields an approximation to the search engine index value.
1.2 Research Question
In order to build the regression model there must exist a series of system attributes that
drive search results, such that the search attributes are system attributes of the document or its
environment, including external linking; the following is explicitly what is sought.

Let:
I = Search Index
A = Document / Environment Attributes

Seek {X | X → ΔI}
The research questions that must be answered are as follows.
1. What are the system attributes for each of the search engine providers studied – Bing and Yahoo?
2. Can the system attributes be combined into a regression model to predict search results?
3. Can the big data paradigm be investigated from a systems perspective to help define system homeostasis?
4. What is the optimized classification formula that may be derived using systems theory?
1.2.1 Search Index Regression Approximations
A solution to the primary research question will begin to take shape with the determination of the
system attributes of the page. What is sought from the research literature is
what constitutes search worth. Once this basic premise is answered, a mechanism must be
created by which data may be tied to the model. Data will be obtained through a
search algorithm created for this sole purpose.
The data mining algorithm will be written in the Python programming language, given its
rich syntax and its access to libraries that facilitate the searching of XML content. Python
contains a library called HTMLParser (https://docs.python.org/2/library/htmlparser.html) that
facilitates the manipulation of XML content and thus the parsing of the sought criteria.
The algorithm developed will contain modules for each search engine type, and its sole purpose
will be to take the derived system attributes and quantify their value against some search criteria. The
result of executing the data mining algorithm on the search engine data feed will be a key-value
pair mapping yielding the inputs to the regression model for further analysis and the
approximation of some index value that may be compared directly to the search engine
results as a comparative metric to assess the merit of the results derived.
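A minimal sketch of the kind of parser module described above, written against Python 3's html.parser (the successor to the Python 2 HTMLParser module cited). The choice of attributes extracted here, title text and anchor href values, is illustrative rather than the dissertation's actual attribute set.

```python
from html.parser import HTMLParser

class AttributeMiner(HTMLParser):
    """Extracts candidate system attributes from raw HTML:
    the page title and the href targets of anchor tags."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.hrefs = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

miner = AttributeMiner()
miner.feed('<html><head><title>Demo</title></head>'
           '<body><a href="http://example.com">link</a></body></html>')
print(miner.title, miner.hrefs)  # Demo ['http://example.com']
```

Each search engine module would feed crawled result pages through such a parser and emit the key-value pairs the regression model consumes.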
Given the lexical context of language and the search providers' substitution of lexically
equivalent word(s) into search results, it is paramount that the algorithm account for this
dynamic. In an effort to mimic this pattern of classification, a repository will be used.
WordNet (https://wordnet.princeton.edu/) is an online utility that allows for the derivation of
lexically equivalent word(s) or phrase(s). WordNet currently exposes this content by way of a
utility called the 'Natural Language Toolkit' for the Python programming language; the library
may be found at http://www.nltk.org/. This library will allow the query term to be mapped to an
array of lexically equivalent word(s) and classified accordingly, which puts the algorithm in line
with the pattern utilized by the search providers.
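The query-expansion step described above can be sketched as follows. To keep the example self-contained, a hypothetical in-memory dictionary stands in for the WordNet lookup; with NLTK installed, the synonym set would instead come from `wordnet.synsets`, as noted in the comment.

```python
def expand_query(term, synonym_lookup):
    """Map a query term to itself plus its lexically equivalent words,
    mirroring the replacement-word behavior of the search providers."""
    return sorted({term, *synonym_lookup.get(term, set())})

# Stand-in for WordNet; with NLTK this set would come from
# {l for s in wordnet.synsets(term) for l in s.lemma_names()}.
synonyms = {"car": {"automobile", "auto"}}
print(expand_query("car", synonyms))  # ['auto', 'automobile', 'car']
```

Classifying against the expanded set rather than the literal term lets the algorithm credit pages that match a lexical equivalent of the query.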
The previous progressive input from the underlying body of work will be utilized in this
next section of the paper to build the predictive model to approximate the index for each of the
search engines – Bing and Yahoo. The system attributes derived for each of the search engines
and their corresponding weight will be used as follows in the regression model:
Given:
S = Search Engine Index
Xi = ith System Attribute of the Model

Where:
S = ∏ biXi + u

Or, in a regression context:
S = b1X1 * … * bnXn + u

Where:
S = Approximated Search Engine Index
Xn = System Attribute 'n'
bn = Slope of Component Xn
u = Regression Correction
The regression model that will be built will approximate the actual search engine model and will
be validated by how closely it predicts the actual results found. This approximation model will
be twofold, one per search engine provider, and each will ultimately yield an index
value that approximates actual results for the content crawled.
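One common way to estimate such attribute weights from crawled data is ordinary least squares. The sketch below fits an additive linear variant on synthetic data; the additive form, the attribute count, and the weights are illustrative assumptions rather than the dissertation's fitted model (a product-form model can be linearized by taking logarithms when the attributes are positive).

```python
import numpy as np

# Synthetic system attributes with hypothetical weights, for illustration only:
# S = 2.0 * X1 + 3.0 * X2 + u, with u = 0 here so the fit is exact.
rng = np.random.default_rng(0)
X = rng.random((50, 2))
S = 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Ordinary least squares: choose b to minimize ||X b - S||^2.
b, *_ = np.linalg.lstsq(X, S, rcond=None)
print(b)  # approximately [2.0, 3.0]
```

With real crawled data the residual u would be nonzero, and the recovered weights would indicate which system attributes carry the most index worth for a given provider.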
1.2.2 Index Matrix Optimizations
It is the contention here that, given an approximation matrix for some system, there exists an
enhancement to this matrix based upon the system view of link structures. One of the
major components identified by the search engine providers as having merit when
indexing is the set of back links tied to a page. What is not discussed by the search engine
providers is the merit of each back link. System homeostasis must place a bounding constraint on
back link worth. If back links are open to a simple Boolean interpretation, then it is possible
for any individual to create an online directory and skew the results toward some desired domain.
This is something I have directly experienced in my professional career, where I have seen this
simple ruse used to invoice thousands of dollars for the service.
In the second part of this body of work, the big data system that is the World Wide Web
will be studied from a systems perspective, and its link structure will be investigated in
depth and a solution to the framed problem will be produced, in which the feedback mechanism
of link structure is investigated and indexed separately for each search result entry. It is the
contention here that through the use of systems theory an optimized solution to the search
indexing paradigm exists, thus creating a general framework that may be used going forward to
build search systems.
1.3 General Hypothesis
The current search indexing paradigm of the major search engines is based on key
attributes. These key attributes are combined into a composite model and evaluated to derive an
index value signifying value or worth for some search performed. This research will build an
approximation to this search engine index by following a systematic process of identifying the
system components and then, through the use of an algorithm, deriving metrics for search results
from the major search engine providers – Bing and Yahoo. The determined metrics will be
utilized to derive a regression model for each provider. This paradigm will allow the building of
a deterministic model for search engine optimization for each of the major search engine
providers described previously. Such a model will allow an individual to follow a deterministic
guideline for building a website that may be ranked highly by the search engine providers,
without the need to resort to general rules of thumb or fall victim to unwarranted speculation.
It is hypothesized that there exists a series of attributes surrounding each site that may be
used to measure value for some search. Secondly, it is hypothesized that these attributes
may be combined into a predictive regression model to predict the current system behavior of
the major search engines. The third hypothesis of this paper is that the optimum system state
may best be predicted through the qualifying and then quantifying of the link structure, for the
individual components that provide the link equity. The analysis will be done for each individual
linking element by analyzing the referring page content for value, thus removing the simple
summation of total links as the underlying metric.
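The third hypothesis, replacing a raw back-link count with a per-link value assessment, can be sketched as a quality-weighted sum. The quality function and the scores below are hypothetical placeholders for a real referring-page content analysis.

```python
def link_equity(backlinks, quality):
    """Quality-weighted link score: each referring page contributes its
    assessed content quality rather than a flat count of 1."""
    return sum(quality(page) for page in backlinks)

# Hypothetical quality scores standing in for a real content analysis.
scores = {"blog.example": 0.9, "directory.example": 0.1, "spam.example": 0.0}
links = ["blog.example", "directory.example", "spam.example"]
print(len(links), link_equity(links, scores.get))  # 3 links, but equity near 1.0
```

Under this metric a manufactured directory of worthless pages adds links but almost no equity, which is exactly the Boolean-interpretation loophole the hypothesis seeks to close.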
1.3.1 System Attribute Formulas in Search Engine Index Approximations
Each of the major search engine providers relies on a series of attributes to determine
worth, or likelihood of match, for some search. The first phase of the research in this
dissertation will identify each of the system attributes deemed significant to each of
the major search engine providers – Bing and Yahoo. The identification of the system attributes
that contribute to the overall worth of a search will allow for the derivation of a collective that
will serve as the inputs to the regression model built to evaluate each of these attributes for the
specific search engine provider. While an exact derivation of the model used by each of the
major search engine providers is not feasible, what is feasible is the identification of a set of
variables that do affect search engine worth; while this set is a subset of the total, it will yield an
approximation of the superset.
The first phase of this two-phase dissertation will derive an algorithm in the Python
programming language that takes as input the search results of a given search engine –
Bing or Yahoo. The algorithm will then assess the value of each of the
attributes derived in this phase. If, for example, a system variable is deemed to be 'Key Words
Found in URL', then the algorithm will parse the URL, determine the quantity of key words
found in it, and assign a value to this parameter or index. The collective of all these
parameters, as determined in the first phase of this dissertation, will then be used as input to
determine the parameter values of the regression model to be built upon the
identified system attributes.
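The 'Key Words Found in URL' attribute used as an example above might be quantified as follows; the separator set used to tokenize the URL is an assumption made for this sketch.

```python
import re

def keywords_in_url(url, keywords):
    """Count how many sought keywords appear as tokens in the URL,
    splitting on common URL separators (/ - _ . ? = & :)."""
    tokens = [t for t in re.split(r"[/\-_.?=&:]+", url.lower()) if t]
    sought = {k.lower() for k in keywords}
    return sum(1 for t in tokens if t in sought)

print(keywords_in_url("http://example.com/cheap-flights/deals",
                      ["cheap", "flights"]))  # 2
```

Each such attribute function yields one entry in the key-value pair mapping that feeds the regression model.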
The set of variables determined, along with their corresponding values, will be combined
into a regression model to derive an approximation model for each of the major search engine
providers – Bing and Yahoo. It is the contention of this phase that a regression model may be
built using system attributes and their given weights to predict search result worth.
1.3.2 Indexing Optimization
The second research phase of this dissertation will look at the problem domain as an
optimization challenge to the model derived in the first section of this body of work. To this
endeavor, a paradigm will be built to enhance the framework developed previously by applying
an indexing construct to the link structure, thereby creating a value definition for the link
structure elements. This diverges from simply deeming linking content as present or not, and
helps to drive the argument forward by applying an index to each element of the paradigm so as
to create a mathematical proposition wherever possible.
1.4 Assumption
The assumptions taken in this body of work are that there exists some set of system
attributes that collectively describe some system state, i.e. some index value that defines worth
for a given search. It is also assumed that an algorithm may be used to derive system attribute
component values that may later serve as inputs to a regression model able to predict behavior,
or a search index approximation. It is further assumed that the systems perspective may be
extended to the software development domain with tangible benefits, and that the big data
paradigm of volume, velocity, variety, and veracity may be simplified using system attributes.
1.5 Research Benefit
The research benefits of this body of work are diverse and may be stated as follows:
1. The derivation of a systematic model for designing search engine friendly content
2. A predictive model to assess search engine index state
3. A comparison model to assess the modeling behavior of each of the two major search engine providers
4. Given the regression model paradigm, a system may be created to systematically create searchable content
5. The proof that a systems perspective may be extended to the software development domain with tangible benefits
6. The extension of big data theory to embody decision science by way of systems theory
7. The explanation of an indexing system which is sensitive to initial conditions, shows itself to be topologically transitive, and thus parallels a chaotic system
The derivation of a systematic model will allow for an algorithm that may later be used in
open source development frameworks that utilize a search component, thus
yielding a concrete benefit to the open source community. The study of a big data system – the
World Wide Web – and the deciphering of a predictive model have a tremendous impact on
research in decision science going forward. The current perspective on a non-normalized system
is to classify the domain under investigation as chaotic, but what if chaos is the result of not
properly identifying system attributes, so that the utilized attributes are misaligned into an
amalgamation that shows a non-predictive path? The discovery of truth is no simple task given
the quantity of data available to some problem frames and the multitude of truths that may be
derived from the data points, e.g. an HTML document. The major contribution of this body of
work comes from the second phase of the research: its underlying premise, if proven true, will
be significant, as it will show a direct correlation between systems theory and an optimization
proposition. If the underlying premise is proven true, then there are complex systems that fail
classification, or are classified sub-optimally, because their predictive model is in an
inconsistent state with the system boundary or constraints; this sets forth the precedence for the
investigative study.
1.6 Research Outputs and Outcomes
The research outputs and outcomes of this investigation, as they relate to the systems
engineering domain, are significant: the investigation will provide a concrete case of how
general systems theory may be applied to a computer science domain to yield a predictive
systems model. From an engineering management perspective, the research will demonstrate
that scientific principles, when combined with systems theory, may yield the output that every
manager in an engineering domain seeks – the prediction of system behavior.
In an age where software runs enterprises, the building of predictive models from some
vast amount of data is the current focus of domains such as analytics and “Big Data”. What is
missing in that discussion, however, is the systems or engineering management perspective on
analyzing the data set. This research hopes to bring to light that while a vast amount of data may
be organized and put on a spreadsheet with pretty graphs, this amounts to nothing more than
noise. The power of the predictive modeler lies not in the utilization of whatever technology is
in vogue, but rather in the systematic thinking that can only be brought forth with a systems
understanding, an engineering acumen, and of course an understanding and application of
computer science to do the heavy lifting. It is this combination of cross-disciplinary fields of
study that adds true value to the modeling paradigm; specifically, it is the combination of
computer science with systems and engineering management that gives modeling a completely
new perspective. Modeling intelligently – first identifying the relevant components through a
systems theory perspective and then using computer science to do the heavy lifting – may be the
way to solve those problems that plague society, such as cancer. While this paper does not seek
to address such matters, it is a small step towards a more profound argument: there is much that
can be done if work is done in a smarter fashion, and as all engineering students are taught, the
best design is always the simplest design. Einstein is said to have arrived at the theory of
relativity while riding a streetcar and noticing the flow of the landscape around him. Insight is
often lost when knee-deep in data; the science domain, given the current problems faced, may
be lacking because scientists collectively have not been able to step back from the landscape
and see the pattern in the noise – of pedestrians, or data points, or cells in a body – and the
discussion could go on ad infinitum.
The first phase of the research focuses on creating a predictive model for search
indexing, and while this proposition holds a true benefit for the research community, the largest
benefit of this body of work lies in the second paper – if the underlying premise is proven
factual, of course. System modeling must be bound to a boundary constraint that extends past
numerical algebraic operation; whether the discourse concerns physical or metaphysical
systems, if that boundary constraint is present, it changes the dialog profoundly. The secondary
outcome of this research is the bridging of big data with decision science, and the illustration
that discovery within big data systems still lies at the doorstep of decision science through the
correct classification of system attributes.
1.7 Research Outline
This research endeavor will follow the two-paper dissertation format. The first paper will
create a body of knowledge that will be utilized by the second paper, thus allowing for an
incremental benefit across the body of work.
The first paper of the body of work will utilize the Python programming language to
build a data mining algorithm for each of the major search engines – Bing and Yahoo. This data
mining algorithm will be used to mine the content output of a given search on each of the search
engines and, for each result derived, calculate the system attribute scalar values to be used in the
regression model to follow. The system attributes will be derived from the existing literature on
the subject matter. This paper will measure each of the components found to build a record of
value for each index attribute. The first paper will then progress to build the regression model
from the system attributes identified.
The second paper will take the inputs of the first paper and build an enhanced modelling
paradigm. It will take the big data problem domain that is the World Wide Web and study the
problem from a decision science perspective by identifying those attributes that tie back to the
system under study. The specific attribute investigated here is the back link equity component.
In this part of the study, back link equity will be evaluated from the perspective of value or
quality, replacing the simple existence argument. By studying the quality aspect of link
referrers, it is the contention of this body of work that a better regression model may be created
– one that better models the current system behavior of the search engine providers.
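The shift from mere link existence to link quality can be sketched as follows; the field names and weights are illustrative assumptions only, not the model the second paper will derive.

```python
def link_equity(backlinks):
    # Score back links by quality rather than mere presence.
    # Each back link is a dict with a referrer 'authority' in [0, 1]
    # and a flag for whether its anchor text matches the target topic.
    # The 1.5 topical boost is an invented, illustrative weight.
    score = 0.0
    for link in backlinks:
        weight = link["authority"]
        if link["anchor_matches_topic"]:
            weight *= 1.5  # topical anchors count for more
        score += weight
    return score

links = [{"authority": 0.9, "anchor_matches_topic": True},
         {"authority": 0.2, "anchor_matches_topic": False}]
```

Under a presence-only scheme both pages above would count equally; a quality-weighted score separates them.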
CHAPTER II
LITERATURE REVIEW
The reasons to study the search optimization domain include factors such as providing a
better result set for search engine users, i.e. facilitating research through the identification of
relevant content. Another research benefit could be the development of a new indexing
algorithm, i.e. an improvement over current search trees. While these factors place the burden
of inquiry on the provider of the service, for example Google or Bing, there is also another
party just as interested in understanding the underpinnings of a search algorithm – the
consumer of the search. Consumers of search data include e-commerce site owners, or for that
matter any party selling content online, for whom a high search engine ranking can translate to
a direct monetary benefit. Yet another benefit of this research is the identification of the system
attributes that define a good search engine. These underpinnings could then be used to build an
open source search engine for all to use, such as universities and nonprofits. By identifying the
framework of a good search engine such as Bing or Google, that framework could then evolve
into a tangible benefit for all.
The discussion that follows highlights current theory and the current understanding of
the search indexing domain. The literature review will place an added focus on identifying the
factors that would yield a positive attribute in a predictive modeling algorithm; stated
differently, what is sought is the series of attributes that determine search relevance for a query.
2.1 Introduction
The available literature on the subject of search engine indexing essentially falls
into two distinct camps: the research that deals with content clustering in an effort to facilitate
searching, and the research that deals with the identification of the attribute pairs that define
page relevance. In all of the literature identified, not one article was found that produced a
predictive model for web page indexing. Some of the methodologies that have been utilized to
cluster content include metadata, syllogism pairing, and query logs. Metadata clustering would
include the data returned for a search on local restaurants given a
user's GPS coordinates retrieved from their mobile device where the search was performed.
Query logs represent existing searches made on the device. If a search previously made by the
user on some device is stored locally by the search tool, the browser could retrieve the archived
information as the primary option to display for the interested party. There is already evidence of
this performance enhancement in existing search tools, as may be validated by the search hints
provided at the initialization of a query in a browser such as Google Chrome. Searching over
some domain in such browsers caches the search query; each subsequent query that matches
the original results in the query hints displaying that entry first.
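The caching behavior described above can be imitated with a small client-side query history; this sketch is illustrative and is not Chrome's actual mechanism.

```python
class QueryHistory:
    # Client-side cache of past queries; earlier searches that match
    # the typed prefix are suggested first (most recent first).
    def __init__(self):
        self._queries = []

    def record(self, query):
        # Re-recording an existing query moves it to the front.
        if query in self._queries:
            self._queries.remove(query)
        self._queries.insert(0, query)

    def hints(self, prefix, limit=5):
        p = prefix.lower()
        return [q for q in self._queries if q.lower().startswith(p)][:limit]

h = QueryHistory()
h.record("tennis classes in lubbock texas")
h.record("tennis rackets")
```

Typing "tennis" now surfaces both stored queries, most recent first, before the user finishes typing.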
The search engine algorithm used by providers such as Google or Yahoo is a much
coveted formula, as such a model would allow a website owner to automatically generate a
windfall of profit. The algorithm utilized by the search providers is a closely guarded secret
that, while it may not be completely predictable, could be approximated through the
identification of the attributes that define its behavior.
Modern search engine providers utilize a combination of content clustering and page
attributes to determine the best possible search result for the user. While the search algorithms
have evolved over the years and will undoubtedly continue to evolve, for example in the current
mobile space, there can be a focal analysis of the state of the system at some point in time. This
paper is just that endeavor, but should in no regard be taken as a blueprint for analysis after the
ink is dry, as the future of dedicated prediction is unknown, especially given the state of mobile
devices and their utility in the real world of the user experience. By identifying the system
attributes that contribute to search relevance, however, it could be argued that some evolution
of the current systems in place would rely on the current state of the system. Parents, after all,
do look somewhat like their children – at times.
2.2 Content Clustering
Content clustering is the utilization of extra data surrounding the search undertaken. This
extra data is important as it raises the probability of providing a positive search experience for
the end user. Take, for example, a user during a lunch hour looking for a Taiwanese restaurant
to visit. This could entail a search performed on a handheld device such as a mobile phone. This
hypothetical user would be interested in seeing results for restaurants in the
surrounding vicinity, and probably even more interested in seeing those restaurants that are
ranked highest on some rating scale by other diners. It would be futile to display a search result
matrix showing restaurants in a different city than the hungry patron, and equally unenticing to
show restaurants with poor reviews. Prior to the mobile revolution, searches had no interest in
considering the location of the user, but now, given the mechanism providing the inquiry, a
new dynamic has emerged. Another example of content clustering from extra data could be
geotags on images. Images on a restaurant website for some entrée carrying geotags alongside
the verbiage could deem that body of content more relevant than others, i.e. for the hungry
patron again and the proximity of the dish to them.
The second type of content clustering algorithm identified is the query log. Query logs
are typically found on the server and identify queries that have been performed by users. Query
history may also be found on the client side by way of cookies placed on the device by the
website performing the search. If you use a Chrome browser, point the browser to Google.com,
and proceed to perform a search such as 'Tennis Classes in Lubbock Texas', you will find that
the search quick hints – the drop-down window that displays – show content similar to what
you type. After you press the search button you will be provided with a
series of results. The resulting query has been added by the provider to your browser history. If
you close your browser and return to Google.com once again and begin to type the same search
term as before you will find that the first entry in the list will narrow down to your original
search prior to completing the typing effort. Google has stored your search and placed it first in
the list to facilitate your search effort.
The third type of mechanism that may be relied upon to determine worth is the specific
attributes that may be attached to a document. These system attributes may be directly retrieved
from the pages searched and indexed according to some formula that may be defined by the
search engine provider internally. The work of Jerkovic specifically alluded to this paradigm
imposing the most significant constraint on the search boundary. To reinforce this idea, one
need only look at the SEO guide by Google to understand that page or system attributes play a
significant role in the determination of page worth. While the resolution of the problem frame
is not as simple as plugging the page attributes into a formula and evaluating through some sort
of regression modelling, the general consensus does have to be that search engine results are
based upon system attributes to a large extent. The only question that needs to be answered is
how do the system attributes interact with the global boundary constraints to define behavior?
This last question is what the remainder of this dissertation will seek to answer through a
modeling paradigm.
2.2.1 Proximity Clustering
Proximity clustering refers to the grouping of content that is deemed to be correlated
across search boundaries. In proximity clustering the principal idea is to return the data set in
which one search result yields documents like those of another search result. Take, for example,
a user searching for 'Texas Tech University Academic Programs' – a potential student of the
university. This user might also be interested in searches relating to 'Apartments'. Search results
that contain information about 'Texas Tech University' and about 'Apartments near Texas Tech
University' would represent documents that are approximate solutions to the query. Proximity
clustering takes those elements that share some common attributes and groups them together as
a search result.
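A minimal sketch of proximity clustering by shared terms follows; the similarity measure (Jaccard overlap), the threshold, and the sample documents are illustrative assumptions.

```python
def jaccard(a, b):
    # Overlap between two documents' term sets.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def proximity_cluster(docs, threshold=0.25):
    # Greedily group documents whose term sets overlap enough.
    # docs maps a document id to its set of terms; the threshold
    # value is an illustrative assumption.
    clusters = []
    for doc_id, terms in docs.items():
        for cluster in clusters:
            if any(jaccard(terms, docs[d]) >= threshold for d in cluster):
                cluster.append(doc_id)
                break
        else:
            clusters.append([doc_id])
    return clusters

docs = {
    "ttu": {"texas", "tech", "university", "programs"},
    "apts": {"apartments", "near", "texas", "tech", "university"},
    "zoo": {"lions", "tigers"},
}
```

Here the university page and the nearby-apartments page share enough terms to fall into one cluster, while the unrelated page stands alone.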
Landrin-Schweitzer, Collet, Lutton, and Prost (2003) introduce the notion of lateral thinking in
the search domain. The authors formulate the hypothesis that rather than using a blind thesauri
expansion to retrieve like documents simply because the words are similar or share a similar
meaning, the search engine should retrieve a data set based upon a query processing phase. The
work of Landrin-Schweitzer et al. was applied to a medical database where an enhanced
document retrieval algorithm was needed, specifically to overcome the hurdles of disbursed
databases and differing fields. While the problem frame is different from the case encountered
with the
Fig. 2.1 - Venn Diagram – Content Clustering (overlapping result sets, Search A and Search B)
major search engines, it does highlight the necessity of data clustering outside a simple
language construct.
The work of Chang and Chiou (2009) brings to light another dynamic: context. The
authors make the argument that search texts such as ‘Big Apple’ and ‘New York’, while
different in language, represent similar search interests. The intelligent search engine must be
able to decipher the context of the language being used. The authors found that an Expectation
Maximization (EM) algorithm outperforms a simple text-based algorithm by 2% to 48%. A
search on Google for ‘the big apple’ returns the Wikipedia page for New York City as the first
result. Google as a search engine is taking into account the context of the search terms and not
simply indexing content on a bit-by-bit basis.
The work of Heuer and Dupke (2007) introduces the idea of a spatial context for website
content. If content can be tagged with a spatial context then the search paradigm will take on a
new dimension, as in the earlier example of searching for a Taiwanese restaurant. While the
work of Heuer and Dupke shows that currently only simple content may be indexed, it does
show the potential for an enhanced search in the near future. The search paradigm with geotags
could be extended to images and not just text.
2.2.2 Query Logs
Query logs typically reside on the server where the search is performed and queries are
used to retrieve data sets for user interest. Each query performed will return a specific record
sequence from which the search provider can use the actual user click events to gauge
effectiveness. Bar-Yossef and Gurevich (2008) report the deterministic metric “ImpressionRank”
as the normalized amount of impressions a page receives from user queries in a certain time
frame. The impressions created by the user base are tracked by the search engine provider to
gauge utility. If you perform a search on Google.com for any point of interest, you will notice
towards the bottom of your screen that once you click a link, the request for content first goes
to Google for tracking and is then transferred to the destination site for processing. The purpose
of this work by the search provider is to create a scoring function (S). Bar-Yossef and Gurevich
(2009) show that:
S → [0, ∞)    (2.1)
where there exists a target distribution π on S such that a target function may be defined as f : S → ℝ, having an integral relative to π of:
Int_π(f) = Σ_{x∈S} f(x) π(x)    (2.2)
With each click event performed, Google.com is enhancing its scoring function to derive utility
for the given search. King (2008) also reports that the Google toolbar will track each link event
– an effort to further the precision of the scoring function. Just as is the case for Google, link
content that is clicked in a Yahoo search result window also goes through a processing
function: a link clicked on the Yahoo result page first goes to the search engine provider for
processing prior to being routed to the intended target URL. The search providers are tracking
user preference among the search results in an effort to incorporate a human element in the
paradigm; after all, who better to validate the search results provided than a user of the system?
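The integral relative to π defined above reduces, for a discrete distribution, to a probability-weighted sum; the pages and scores below are invented toy data.

```python
def int_pi(f, pi):
    # Integral of f relative to a discrete target distribution pi:
    # Int_pi(f) = sum over x of f(x) * pi(x).
    return sum(f(x) * p for x, p in pi.items())

# Toy distribution over three result pages and a click-through score f.
pi = {"page_a": 0.5, "page_b": 0.3, "page_c": 0.2}
f = {"page_a": 10.0, "page_b": 4.0, "page_c": 1.0}.get
value = int_pi(f, pi)  # 10*0.5 + 4*0.3 + 1*0.2 = 6.4
```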
The work of Beeferman and Berger (2000) also demonstrates the effectiveness that may
be achieved through the use of query logs. The authors define an algorithm to cluster like data
pairs, given below.
Input Click through data C in the form of (query, URL) pairs
Output Bipartite graph G
Step 1 Collect a set of Q of unique queries from C
Step 2 Collect a set U of unique URLs from C
Step 3 For each of the n unique queries create a “white” vertex in G
Step 4 For each of the m unique URLs create a “black” vertex in G
Step 5 If query q appeared with URL u then place an edge in G between the corresponding white and black vertices
Table 2.1 - Data Clustering Algorithm
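The construction in Table 2.1 can be sketched directly in Python; the click-through pairs below are invented sample data, with the query and URL vertex sets standing in for the "white" and "black" vertices.

```python
def build_bipartite(clickthrough):
    # Beeferman & Berger construction: "white" vertices for unique
    # queries, "black" vertices for unique URLs, and an edge wherever
    # a (query, URL) pair appears in the click-through data.
    queries = {q for q, _ in clickthrough}   # Step 1
    urls = {u for _, u in clickthrough}      # Steps 2-4
    edges = set(clickthrough)                # Step 5
    return queries, urls, edges

data = [("texas tech", "ttu.edu"),
        ("ttu programs", "ttu.edu"),
        ("texas tech", "wikipedia.org/Texas_Tech")]
Q, U, E = build_bipartite(data)
```

Queries sharing a URL neighbor (here, both queries point to ttu.edu) are the sibling pairs discussed below.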
The result of the Beeferman & Berger algorithm is the creation of a graph where the
relationship between like nodes is represented with a physical link. The end result is the
grouping of like data sets, thus enhancing the user experience incrementally through the
collection of the query-generated data pairs. The results of the Beeferman & Berger (2000)
paper are displayed below. The results were retrieved by way of the Lycos search engine, and a
total of 500,000 click-through records were analyzed. The largest point of interest in the
findings of Beeferman & Berger lies in the data clustering results. The 'URL Sibling Pairs'
represent those queries which occurred in a click-through record and contained the same URL.
These records represent the common neighbors in a search, or a clustered record set on “like
queries”.
Table 2.2 - Beeferman & Berger Click Through Results

    Click Through Records    500,000
    Unique Queries           243,595
    Unique URLs              361,906
    Query Sibling Pairs      1,895,111
    Query Edge Density       6.38 x 10^-5
    URL Sibling Pairs        476,505
    URL Edge Density         7.27 x 10^-6

Fig. 2.2 - Link Sample (a bipartite sample linking Query 1 and Query 2 to URL 1 and URL 2)
The Beeferman & Berger (2000) results show that a given set of data may be used in
turn as the input to another system – a repository that may be analyzed to score the quality of
the initial data set and consequently enhance the user experience.
Query logs demonstrate an important aspect of the user experience: the utilization of
user interaction with search engine results. Each query, along with the accompanying link
selections, represents the desired ranking that should be produced by the search engine, and
hence shows its value. Another element that may be used with the query logs is the user's
attention span once a link has been followed.
The work of Xu, Zhu, Jian, and Lau addresses the allocation of time spent on a page as
a component of the search engine equation. While the idea, when combined with query logs,
does represent an interesting dynamic, it has a major drawback: the amount of time a page is
left open does not directly represent page interest, as a page may be loaded just prior to an
individual heading off to lunch or home for the night. The other major drawback of attention
span as a system attribute is that data would be needed on every page on the web in order to
assess value. Google has acknowledged the use of over 200 attributes to assess value,
according to Kumar and Saini (2011); could attention span be one of the parameters in the
Google matrix? The answer here would have to be no, at least from the systems perspective,
because by definition attention time is tied to the user and not the page.
An interesting argument made by Zhou (2015) for search results is the argument for
personalization. Search engine results could be returned based upon the profile of the user to
enhance the user experience. Given a technology professional, for example, the search results
sought by this individual would tend to lean towards the technical domain. This notion of
search customization would greatly enhance the user experience, as it would eliminate extreme
points. In a technology landscape where Python is now the name of a programming language
and not just a type of snake, it stands to reason that personalization of search queries would
benefit the user. The issue with the argument made by Zhou, however, is that it does not
account for two or more individuals using a device, such as a family tablet, or for the extra
body of work needed to configure the framework – a burden on the user, and a far from optimal
scenario.
2.2.3 Page Attributes
By far the greatest amount of literature discovered, and the topic of two O'Reilly books,
was related to search engine optimization through page attributes. Page attributes are what
Google and the other major search engines report to track and index as the primary indicator of
content value.
A great many researchers have tried to understand search engine results by way of page
attributes, with varying degrees of success. The work of Sedigh and Roudaki (2003)
incorporated a least squares approach to try to model the dynamic behavior of the Google
search engine. The authors utilized a series of attributes to model the behavior, which included
the following:
1 PageRank
2 Keyword Frequency
3 Keyword Density (web page title)
4 Keyword Density (web page text)
5 Keyword Density (linked text)
6 Keyword Density (ALT tags)
7 Keyword Prominence (relative to top of document)
Table 2.3 - Input Parameters
The authors then proceeded to incorporate a binary attribute mapping to the defined indexes. If
the attribute was found to exist, the authors placed a 1 in the corresponding parameter of the
least squares equation; if the index value did not exist, they assigned the parameter a value of 0.
What the authors discovered in their research was that their intended model was not able to
predict page rank, but what was possible was the approximation of a perceived pattern in the
ranking algorithm. From a systems perspective, what has been determined here is that while the
authors were not able to completely describe the system homeostasis through the identification
of all the system attributes, they were able to identify a shadow, or fuzzy image, of the canvas
sought. It should also be mentioned that while the authors incorporated a least squares approach
to solving the problem domain, the deterministic input values into the equation were binary. A
least squares formula with binary
input parameters may miss its intended purpose if the inputs are not more sophisticated, such as
index values. To this point, Google, the authority on the matter, actually alludes to such a fact
in its SEO documentation.
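The binary attribute mapping described above amounts to a per-attribute presence flag; the attribute names below are illustrative, and, per the critique here, richer index values could be substituted for the 0/1 inputs.

```python
def binary_features(page, attributes):
    # Sedigh & Roudaki-style mapping: 1 if the attribute is present
    # on the page, else 0. Attribute names are invented for illustration.
    return [1 if page.get(attr) else 0 for attr in attributes]

attrs = ["pagerank", "keyword_in_title", "keyword_in_text"]
page = {"pagerank": 3.2, "keyword_in_text": 0.08}
```

Note how the mapping discards magnitude: a PageRank of 3.2 and one of 0.1 both become the same 1, which is precisely the information loss the critique points at.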
One of the factors not brought up by Sedigh and Roudaki (2003) was the link
composition of web pages. Henzinger (2007) utilizes the link structure of the web to define a
web graph (V, E) where, if there exists a hyperlink between two nodes ‘u’ and ‘v’ in the
complete space ‘V’, this results in a directed edge in E. This graph of nodes and edges, together
with an inverted index over page content, is the mechanism used to answer user queries (at
least in part). According to King (2008), search engine optimization may be
classified into two domains – on-site and off-site. King goes further in his explanation of
ranking highly, stating that it is not only the number of links that matters, but also the quality of
the links and the relationship of the link. Links for the sake of links matter less than links
whose content maps to the link description or URL.
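The two constructs discussed, the directed web graph and the inverted index over page content, can be sketched minimally as follows; the sample pages and links are invented.

```python
def web_graph(links):
    # Directed web graph (V, E): an edge u -> v for each hyperlink.
    V = {n for pair in links for n in pair}
    E = set(links)
    return V, E

def inverted_index(pages):
    # Map each term to the set of pages containing it.
    index = {}
    for page, text in pages.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(page)
    return index

pages = {"a.html": "tennis classes lubbock", "b.html": "tennis rackets"}
idx = inverted_index(pages)
```

A query term lookup then reduces to a set retrieval, with the link graph supplying the quality signal discussed next.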
The link graph created through link mapping by the search engines does appear to
weigh heavily in the PageRank algorithm, but does so with a large caveat: the quality of the
link structure. To view the links to a specific primary domain, the following may be entered in
a Google search box (Jerkovic 2010):
link:[URL]
where [URL] is the root domain. The above query will return the external links to the given
domain.
Jerkovic (2010) identifies a series of attributes on a given website that dominate its
indexability by the search engines. Jerkovic identifies the title tag <title> of a document as
holding value with the search engines. Given how search engines such as Google track user
response to a given search, this only furthers Jerkovic's claim of the attribute's importance. The
title tag is displayed in the search engine results; the more attractive a title appears to a user,
the more likely they are to click on the link, and since user clicks are directly proportional to
throughput traffic, as may be found in the query logs, the greater the importance this attribute
takes on. As an exercise, the user may perform a search on the Yahoo search engine
and then place the mouse cursor over the link provided – any will do from the result set. What
will be discovered is that the link route is a Yahoo URL such as:
http://r.search.yahoo.com/_ylt=AwrBT.Unkl9VoVYAmAhXNyoA;_ylu=X3oDMTByOHZyb21
tBGNvbG8DYmYxBHBvcwMxBHZ0aWQDBHNlYwNzcg--
/RV=2/RE=1432355495/RO=10/RU=http%3a%2f%2fwww.yellowpages.com%2fdallas-
tx%2fbook-stores/RK=0/RS=bTrZnTedC7end7WvokjyBWgYwag-
The URL is a tracking mechanism being used by the search provider. Clicking the link
will also route the user through the search engine provider, i.e. the link provided – further
confirmation of the tracking mechanism being employed by the business. The search engine
providers are relying on the community as a whole to tell them what is relevant!
The page copy is the second domain that Jerkovic identifies as holding significance for
the search engine providers. Copy is text outside of the HTML markup that is directly visible
to the viewing audience. The significance of this attribute is readily apparent from viewing
search results. Take, for example, any search result in a list such as that given by Bing: a
substring of the copy that contains the significant words or phrases searched is given in the
results output. The search engines will actually tell you the items found that are of significance,
as those items are displayed in bold! In the very early stages of the web and online selling, this
became an exploitable point for vendors. Online sellers were taking page copy and duplicating
it across the background of the page so that it would not be visible to the user, but would be
completely visible to the search engine crawler. The result was abnormally good ranking for
pages that exploited this loophole. The loophole does not currently exist, and pages that utilize
such a mechanism are penalized by the search engine providers.
The document URL is the third domain that Jerkovic identifies as having significance to the search engines in determining value for a given search. The document URL is displayed in the search results of Yahoo, Google, and Bing. The results also show the keyword(s) found from the search within the link text; these are displayed in bold as well by two of the search engine providers – Google and Bing. Keywords found in the URL designate the entire page as relevant to the search.
Apache, the web server product (http://httpd.apache.org/), even allows URLs to be rewritten to meet the needs of users wanting to optimize their website links. This is done through a specific file known as the .htaccess file, which allows URL mappings so that URLs display relevant content; these mappings are performed through rules written directly into the .htaccess file. The .htaccess file allows for rich URLs such as the following:
http://www.amazon.com/To-Kill-Mockingbird-Harper-Lee/dp/0446310786
The search performed to retrieve the above link (the first in line by the way) was the following:
‘to kill a mockingbird purchase’ on Google.com (2015/05/22 at 4:06 PM Central Time).
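A mapping like the keyword-rich URL above might be produced, for example, with a mod_rewrite rule placed in the .htaccess file. The pattern and target script below are hypothetical illustrations, and mod_rewrite must be enabled on the server:

```apache
# Serve keyword-rich URLs such as /books/to-kill-a-mockingbird
# from a real script, without exposing the query string to users.
RewriteEngine On
RewriteRule ^books/([a-z0-9-]+)$ /product.php?slug=$1 [L]
```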
The fourth page attribute that Jerkovic identifies as being significant to the search engines
is the meta tag <META>. The meta tag may take on different signatures such as the following:
<meta name="description" content="Texas Tech University Dissertation">
<meta name="keywords" content="SEO, Search Engine Optimization">
<meta name="author" content="Guillermo Rodriguez">

The name attribute in the meta tag may take on five distinct values (application-name, author, description, generator, and keywords); as far as the search engines are concerned, only two are significant – description and keywords. The meta tag with the keywords attribute defines all the keywords that should be acknowledged by the search engine for the given page.
King (2008) advises on using a limit of less than 20 keywords in the meta tag for keywords as
applying more would be deemed keyword stuffing. The meta tag with the description attribute
defines a short description for the document that may also be used by the crawlers. Jerkovic
states that the search engine may choose to ignore the meta description tag and opt to incorporate
the description for the site found on Dmoz.org. The description of Dmoz.org follows
(http://www.dmoz.org/docs/en/about.html):
“DMOZ is the largest, most comprehensive human-edited directory of the Web”
Jerkovic highlights the lengths to which some of the search engine providers will go in order to validate their content and provide an honest set of results for their users. For description meta tags, King (2008) advises using at most 250 words to maximize productivity.
The fifth element on a document that Jerkovic identifies as having significance is the set of header tags <HX>, where ‘X’ represents a numeric value starting at 1 and going to 6. The smaller the value of ‘X’, the larger the text that is displayed for the user. The main tenet here is that if a given word or phrase is found in header text, then the larger that text is displayed for the user, the more important it should be considered when ranking the keyword or phrase. It should be pointed out that given an HTML element such as a header tag, it is possible to change the view of the element using cascading style sheets (CSS). Using CSS it is possible to make an <H1> tag display exactly like an <H6> tag. An example follows using CSS notation.
H1 { font-size: 12px; }
H6 { font-size: 12px; }
In both cases the font used for the <H1> and <H6> will be 12 pixels. While the formatting of the document could be viewed by the search engine as proposition Y, it could very easily be viewed as proposition X.
The sixth element that was identified by Jerkovic as having significance to the search
engines is keyword proximity. Keyword proximity refers to the physical distance (in words or
bytes) that separates one keyword from another. Consider two separate documents having the following copy:

Texas Tech University is located in Lubbock Texas

And

The University of Texas at Austin is a tech hub for scientists

While both lines contain the words ‘Texas’ and ‘Tech’, if a user performed a search on ‘Texas Tech’ it would be the first copy that takes precedence. Viewing the copy a little differently highlights the argument:

Texas Tech University is located in Lubbock

And

The University of Texas at Austin is a tech hub for scientists

In the first copy the separation between the keywords was 0 words. In the second copy the distance between the keywords was 4 words. Keyword proximity relates directly to a language construct that is incorporated in the search engines by way of indexing relevant content, and while this could be measured in words, bytes, or ASCII characters, what does matter is the standardization of the metric for a consistent ranking.
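The word-based proximity metric discussed above can be sketched in Python. This is an illustrative implementation only, not any vendor's actual metric; the tokenization rule (letters and apostrophes, case-insensitive) is an assumption of this sketch.

```python
import re

def keyword_proximity(copy, kw1, kw2):
    """Return the number of words separating the closest occurrences of
    kw1 and kw2 in the copy (0 means adjacent), or None if either keyword
    is absent. A word-based metric, per the discussion above."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", copy)]
    pos1 = [i for i, w in enumerate(words) if w == kw1.lower()]
    pos2 = [i for i, w in enumerate(words) if w == kw2.lower()]
    if not pos1 or not pos2:
        return None
    return min(abs(i - j) - 1 for i in pos1 for j in pos2)

print(keyword_proximity(
    "Texas Tech University is located in Lubbock", "Texas", "Tech"))  # 0
print(keyword_proximity(
    "The University of Texas at Austin is a tech hub for scientists",
    "Texas", "Tech"))  # 4
```

Run against the two copies above, the function reproduces the separations of 0 and 4 words.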
The seventh element that is deemed to be of importance by Jerkovic is keyword
prominence. Keyword prominence refers to the position of a keyword with regards to the
physical top of the document. The argument here is that content towards the top of the page is of more importance, and consequently the page should be of more importance to the searcher. It should be noted that, much like header text size, the physical location of copy may be changed through the use of a client-side scripting language such as JavaScript. With JavaScript the physical location of copy could be moved, on loading of the document, to wherever the programmer desired.
The eighth element that Jerkovic identifies as affecting search engine results is the
keywords found in anchor text. This optimization technique is relevant because the search engines display this link text in the search results; if the text is relevant to what is being searched, it will inevitably result in more clicks from a wider audience.
According to Jerkovic (2010) Google bombing refers to a technique of employing deceptive text
in the anchor text to fool the search engine into believing that the destination URL holds
reference to some specific content when in actuality it would not. Jerkovic states that while
Google has changed its algorithm to account for the deceptive practice some sites are still
succeeding in increasing their page rank by employing the technique.
The ninth component that Jerkovic identifies as having significance to the search engines
is the length of time that a domain has been registered. Jerkovic states that multiyear registration
periods are looked at more favorably than single-year registrations. The length of time since a domain was registered suggests a business has been operating for that period, and a multiyear registration period suggests the business owner expects to be in business for longer than one year. If a restaurant owner only signed a one-year lease for a unit, then the property owner would probably be more skeptical of the business owner's confidence in his business. The comparison holds. The analysis and validation of the point made by Jerkovic remains pending.
The tenth and final component that Jerkovic identifies as having significance in the
search engine ranking algorithm is the quantity and quality of referrers to the site. Sun and Wei
(2005) define the PageRank algorithm as:
“… link structure-based algorithm, which gives a rank of importance of all the pages
crawled in the Internet by the Google’s web crawler.”
Jerkovic (2010) states that back links from pages that have a PageRank value of at least 4 yield
the best results. PageRank is defined by the PageRank wiki as follows:

PR(u) = Σ_{v ∈ B_u} PR(v) / L(v)

where B_u is the set of pages that link to page u and L(v) is the number of outbound links from page v.
The PageRank for an individual page is stated as being the sum of the PageRank of each linking
page divided by the outbound links from the same site. It should be pointed out that this
algorithm is not the only mechanism that is used by the popular search engine provider Google.
The PageRank algorithm as defined entails that the more outbound links the linking site has, the lesser the value of the PageRank contribution by that site. In turn it also entails that the greater the quantity of links to the site from external parties, the better the site's PageRank.
The anchor tag contains an attribute called ‘rel’; when this attribute is set to the value of ‘nofollow’ it indicates to the search engines that the accompanying link should not be attributed to the linked domain (Jerkovic 2010). The reason this option exists is to prevent applications from going out through the internet and creating links back to their site for the sole purpose of creating link equity – just another example of the mechanisms being used by the search engine providers to enforce the rules of fair play. King (2010) also makes the argument that adding external links from your site dilutes the PageRank component of the site, i.e. it increases the denominator of the equation, L(v).
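The simplified PageRank formula quoted above can be approximated by power iteration. This sketch follows the formula as stated – no damping factor – on a hypothetical three-page graph; production PageRank adds a damping factor and handles pages with no outbound links, both of which are omitted here.

```python
def pagerank(links, iterations=50):
    """Iteratively apply PR(u) = sum over v in B_u of PR(v) / L(v).
    links maps each page to the list of pages it links out to; every
    page must appear as a key and have at least one outbound link."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for v, outs in links.items():
            for u in outs:
                new[u] += pr[v] / len(outs)  # v contributes PR(v)/L(v) to u
        pr = new
    return pr

# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))
```

On this graph the iteration converges to roughly PR(A) = 0.4, PR(B) = 0.2, PR(C) = 0.4, illustrating the dilution effect: A splits its contribution across two outbound links, so B receives only half of A's score.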
Dahiwale, Raghuwanshi & Malik (2014) used page attributes to predict relevance of
content. They used content found in the head tag <HEAD>, the title tag <TITLE>, the body tag
<BODY>, the meta tag <META>, and the URL to gauge page content worthiness. Their
algorithm downloaded page content from some URL and then parsed the defined tags to
determine the existence of some queried attribute based upon a mathematical expression. The
formula followed by the authors is given below.
t = (Nb*B) + (Nt*T) + (Nm*M) + (Nh*H) + (Nu*U)

Where
Nb = number of occurrences of the search string in the body tag <BODY>
Nt = number of occurrences of the search string in the title tag <TITLE>
Nm = number of occurrences of the search string in the meta tag <META>
Nh = number of occurrences of the search string in the head tag <HEAD>
Nu = number of occurrences of the search string in the URL

The authors further assumed that the weight of content was subjective, as follows:

M = 5, U = 4, T = 3, H = 2, B = 1

The criterion was also established that documents with a total value of ‘t’ > 3 were deemed to be relevant and documents with a value of ‘t’ <= 3 were deemed to be irrelevant. The conclusion derived by the authors was that their algorithm proved to be between 20 and 70% effective. From the work of Jerkovic and King it stands to reason that a more formidable algorithm should have been sought, without the underlying assumptions for adequacy or relevance.
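The Dahiwale et al. scoring rule can be expressed directly. The dictionary-based interface below is an assumption of this sketch, not the authors' code; only the weights and the t > 3 threshold come from the source.

```python
def relevance_score(counts, weights=None):
    """Compute t = Nb*B + Nt*T + Nm*M + Nh*H + Nu*U using the subjective
    weights assumed by Dahiwale et al.: M=5, U=4, T=3, H=2, B=1.
    counts maps each page region to the number of occurrences of the
    search string found there; missing regions count as zero."""
    weights = weights or {"meta": 5, "url": 4, "title": 3, "head": 2, "body": 1}
    return sum(counts.get(region, 0) * w for region, w in weights.items())

# Hypothetical page: one hit in the title, two in the body.
t = relevance_score({"title": 1, "body": 2})
print(t, "relevant" if t > 3 else "irrelevant")  # 5 relevant
```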
Pal, Tomar, and Shrivastava (2009) studied content based upon link structure and found positive results. While the authors' work contained a significant speculative component in terms of a weight table, it does show a positive correlation between content structure and search results. This parallel is consistent with the publications of the large search engine providers such as Google.
Another body of work that reinforces this concept is that of Mukhopadhyay, Biswas, and Kim (2006). The authors studied ranking from the perspective of a weighted attribute correlation paradigm. A significant component of the Mukhopadhyay et al. algorithm was the concept of ‘Authority’ as a weight. Jerkovic (2010) brings this point to light when he notes that a search for ‘Hilltop Algorithm’ brings up the Wikipedia page first on Google even though the content of the page is minimal at best. The degree of authority of a site does play a significant role in the search engine results and is a valid component of the Mukhopadhyay et al. model. The Mukhopadhyay findings, however, were not positive, and allude to a more complex model at hand – at least one more sophisticated than the one created by Mukhopadhyay et al.
The work of Beel and Gipp (2009) looked at the age of an article in search engine results. The authors examined one particular search domain – Google Scholar – and found that there did not exist any correlation between the age of an article and its ranking. Old content may or may not be relevant. This point alludes to the possible cancellation of an attribute component in a possible formula for search engine ranking: it should be void of the age of the site.
Spirin and Han (2014) point to the fact that a link farm affects PageRank in a positive context: each link from a farm contributes directly and positively to PageRank. The argument of Spirin and Han brings to light the complex dynamic of ranking pages. Pages or sites holding a large influence factor, such as Wikipedia, hold more clout than sites with merely significant inbound links; for a second category of sites, i.e. dot-coms, the inbound link flux is paramount and value is directly proportional to inbound links. A preliminary function of value can be defined that is composed of ‘Authority’ and ‘Inbound Links’; it is listed below.
Let
i = Inbound Links
A = Authority Level
µ = Authority Factor
PR = PageRank

Where µ = { X | X ∈ {0, 0.5, 1} }
And A = { Y | Y ∈ {0, 1} }

Such That: PR = ∑ i + µA
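This preliminary value function can be transcribed literally, with the domain constraints on µ and A enforced by assertions. The example inputs are hypothetical.

```python
def preliminary_value(inbound_links, authority, authority_factor):
    """Preliminary value function from above: PR = sum(i) + mu * A,
    where A is in {0, 1} and mu (the authority factor) is in {0, 0.5, 1}."""
    assert authority in (0, 1)
    assert authority_factor in (0, 0.5, 1)
    return sum(inbound_links) + authority_factor * authority

# Hypothetical site: three inbound links each contributing 1,
# full authority (A = 1) weighted at mu = 0.5.
print(preliminary_value([1, 1, 1], authority=1, authority_factor=0.5))  # 3.5
```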
2.3 Conclusion
The indexing of content by the search engine providers does not adhere to a simple paradigm; it is rather composed of input from users, the environment, and the individual sites themselves. The indexing model used by the search engine providers is not stagnant; it has evolved over time and will continue to evolve – take, for example, the relatively new medium of searching through mobile devices. Will it someday be possible to search from a mobile device by taking a picture and searching based upon an image rather than text? Probably.
Research literature has tried to pin down the exact paradigm that is used by the search engine providers without success, so is it reasonable to even try? Some time ago I interviewed for a position with a software firm whose sole purpose is to help online retailers sell content. During the interview process they revealed that they were performing the search indexing effort for JCPenney, and they boasted that due to their effort the company was ranked first on the search engine Google.com for a series of product offerings. Was the formula determined in this case? The answer to the question may not be as perplexing as the research community may think; it may just require thinking along different lines. As a side note, a couple of years later I was watching the news and a story came on about Google. The search engine provider could apparently not understand how it was possible that JCPenney was ranked first for ‘Women's Shoes’ and a few other items. It was the classic case of the hunter becoming the hunted. When the search engines index content they enter the domain through a specific protocol; the protocol is open and visible. Could the protocol be intercepted by a third party library to tailor the content specifically for an audience and then tailor the output to maximize index values?
The work of Gwizdka and Chignell found that on the search engine HotBot (www.hotbot.com) the words in the title were deemed to be more important than the words found in the body of the document. While conclusive in its findings, this work points to the complex dynamic between the search engine providers and the reality that each provider has its own ranking methodology and its own reasons for its formula. There will exist a specific formula for search indexing, proprietary to each vendor, that at best will only be able to be approximated and never absolutely defined, given the input parameters that exist and the ever-changing philosophy surrounding the classification of web content – exclusive, of course, of the evolution of technology, another hurdle in the paradox.
CHAPTER III
SYSTEM ATTRIBUTE FORMULAS IN SEARCH ENGINE INDEX
APPROXIMATIONS
This study will examine a series of attribute pairs for web pages that are defined by the
search engine providers Google, Microsoft, and Yahoo or have been identified through a formal
literature review to contribute to the search engine results. The chapter will begin with a
literature review of existing vendor specific literature and third party literature to identify all
those system attributes that directly contribute to search engine ranking. The chapter will then
proceed by creating an algorithm in the Python programming language to derive the metrics for
each of these system attributes and yield a series of metrics that can be used to quantify the
parameter pairs identified as having relevance by the search engine providers or researchers.
3.1 Introduction
Google provides documentation as to what attributes are utilized when crawling a web
site to determine worth or value in their specific ranking algorithm. Bing and Yahoo do not
provide a formal specification to aid a search engine optimization effort. In this investigative
study the online documentation provided by Google will be explored to identify all those page
and non-page specific attributes that contribute to a positive page rank. The study will also rely
on third party literature to help identify any attribute pairs that may not be disclosed by the major
search engine providers or provider in this case Google. In the case of Yahoo and Bing only third
party research will be utilized in deriving a predictive model for search engine results as these
two entities do not provide a formal specification. The Google documentation will be analyzed simply for completeness, as this study will focus on the indexing approximation for the search engines Bing and Yahoo only.
While the attribute pairs available on each page represent a finite set, the specific combination utilized by each of the vendors is unique and specific to its business. The subjectivity of relevance in the search domain is derived by each of the businesses, and the algorithm to determine value is also an evolving fixture. With this system flux being a constant in the paradigm, it is necessary to state that this study represents a snapshot, as each of the search providers is free to change and alter its search relevance scheme and disclosure to the public.
3.2 Literature Review
The literature review of this section will concentrate on the journal readings that focus on attribute mappings to search engine optimization. A systems perspective entails the mapping of attribute components to a domain for which the said components help predict a behavior pattern
in the system. Given the extensive literature available on the possible attributes that may or may not truly affect system homeostasis, this paper will begin to frame the problem by identifying those attributes of the system that are undeniable: the page attributes.
3.3 Page Attributes
By far the greatest amount of literature that was discovered – and the topic of two O'Reilly books – was related to search engine optimization through page attributes. Page attributes are the primary indicator that the major search engines report to track and index as a component of content value.
There have been a great many researchers who have tried to understand search engine results by way of page attributes, with varying degrees of success. The work of Sedigh & Roudaki (2003) utilized a series of page attributes to determine index priority. The page attributes included components such as:
• Keyword Frequency
• Title Keyword Density
• Text Keyword Density
• ALT Tags Keyword Density
• Keyword Prominence

These components were then combined into a least squares regression formula where the existence of a component was termed a 1, or 0 otherwise. The conclusion derived by Sedigh & Roudaki (2003) proved to be consistent with page rank, but was not able to predict page rank. From the systems perspective, Sedigh & Roudaki were able to approximate system homeostasis, but failed to model it accurately. The work of King (2008) points to a possible explanation: Sedigh & Roudaki may have had the composition formula defined in general, but lacking in detail. King identifies a series of attributes that are relevant to search indexing, and does so to the tune of a much greater set of parameters than were identified by Sedigh & Roudaki. Also, King
points to a possible extension to the regression formula used to evaluate index value: the parameter inputs are not simple binary components, but rather indexes, such as some integer divided by some other integer. Take for example the second entry given above – Title Keyword Density; in the King (2008) model this would be defined as:

Title Keyword Density = ∑ Keywords In Title / ∑ Words In Title

This subtle difference makes for a vast difference in the regression coefficients and is the main argument of this body of work: a regression formula may be derived from the system attributes of a page and its corresponding link structure, and this formula is tied directly to index values derived directly from the system attributes.
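The ratio-style input in the King model can be computed as below. The tokenization and case-folding rules are assumptions of this sketch; the defining ratio itself comes from the formula above.

```python
def title_keyword_density(title, keywords):
    """King-style index input: (# keyword occurrences in the title) divided
    by (# words in the title) -- a ratio in [0, 1] rather than the binary
    0/1 used by Sedigh & Roudaki."""
    words = [w.lower().strip(".,!?") for w in title.split()]
    if not words:
        return 0.0
    keyword_set = {k.lower() for k in keywords}
    hits = sum(1 for w in words if w in keyword_set)
    return hits / len(words)

print(title_keyword_density("Texas Tech University Dissertation",
                            ["Texas", "Tech"]))  # 0.5
```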
One of the factors that was not brought up by Sedigh & Roudaki (2003) was the link
composition of web pages; a fundamental argument that is brought forth by King (2008). King
makes the argument that search engine indexing is composed of two facets – on page factors and
off page factors. King goes as far as to make the argument that links from certain domains such
as those coming from Wikipedia possess more value to the search engines than on page content!
This page link structure may be used to define the structure of the web according to Hezinger
(2007). Hezinger defines the web link structure as a graph composed of elements V and E, where
V identifies a vertex and E identifies an edge. When two nodes ‘a’ and ‘b’ contain a link between them, the result is a data structure that is identified as an inverted index. This inverted index is one of the mechanisms that is evaluated when a user performs a search on a given search engine. While the link structure does play a critical role in assessing value, it carries a large caveat: value is bounded by the quality of the referring vertex node, as identified by King (2008).
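The link-graph inversion described above can be sketched in a few lines. The node names are hypothetical, and the term "inverted index" here follows the usage in the text (mapping each node to the nodes linking into it) rather than the term-to-document sense.

```python
def invert_links(links):
    """Given the graph (V, E) as a mapping from each node to its outbound
    targets, build the inverted structure: each node mapped to the set of
    nodes that link into it."""
    inbound = {}
    for a, outs in links.items():
        inbound.setdefault(a, set())          # ensure every vertex appears
        for b in outs:
            inbound.setdefault(b, set()).add(a)  # edge a -> b inverts to b <- a
    return inbound

# Hypothetical graph: vertices are pages, edges are hyperlinks.
edges = {"a": ["b", "c"], "b": ["c"], "c": []}
print(invert_links(edges))
```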
3.3.1 Title
Jerkovic (2010) in his book SEO Warrior provides the most comprehensive collection of
site attributes that determine page rank. Jerkovic identifies ten components on a page that are
used to determine page worth. The first component that Jerkovic identifies as having value to the
search engines is the title tag <Title></Title>. The title tag is displayed in the web browser and
the contents of which may be found on search engine result pages. The search engine Google
actually makes a concerted effort to track user response to its page results. On September 17, 2005 a search for the term ‘Toronto Ontario’ was performed on Google. The search provided a list of results, and the link provided to click on was not to the website itself, but rather to the search engine provider for tracking purposes. The user is routed to the targeted destination only after the search engine provider accepts the user response. The complete URL that was provided to track user activity is given below.
http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&a
mp;cd=1&cad=rja&uact=8&ved=0CBwQFjAAahUKEwjZip60iv_HAhWFFZIK
HZYTB-8&url=http%3A%2F%2Fwww.toronto.ca%2F&usg=AFQjCNEevtKPgE--
qnlQOiwX5wKT1K6HcA&bvm=bv.102829193,d.aWw
The above URL was prepended with http://www.google.com as the page results are listed as relative paths. The URL is used as the tracking mechanism by which Google assigns user worth in its internal equation. The reader should note that this differs from Bing, which provides the absolute path to the destination URL; one must conclude that Bing is not tracking user response in this manner and places its indexing emphasis on other attributes.
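The destination buried in such a tracking URL can be recovered with the Python standard library, since it travels percent-encoded in the ‘url’ query parameter. The shortened URL below is a hypothetical stand-in, not the full tracking link quoted earlier.

```python
from urllib.parse import urlparse, parse_qs

def tracked_destination(tracking_url):
    """Pull the real destination out of a Google-style redirect link, which
    carries it percent-encoded in the 'url' query parameter."""
    qs = parse_qs(urlparse(tracking_url).query)
    return qs.get("url", [None])[0]  # parse_qs already percent-decodes

# Truncated, illustrative stand-in for the tracking URL shown above.
tracking = ("http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web"
            "&url=http%3A%2F%2Fwww.toronto.ca%2F&usg=AFQjCNE")
print(tracked_destination(tracking))  # http://www.toronto.ca/
```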
3.3.2 Copy
Page copy is the second component that Jerkovic identifies as having value to the search
engine providers. Page copy in the context of web pages refers to the text embedded within text
tags. Text tags are defined as the paragraph tag <p></p> and the header tags <hx></hx>, where
the ‘x’ component in the header tag <hx></hx> is an integer value between 1 and 6. The integer
value of the header tag designates importance; the smaller the integer value the larger the text
size is displayed. Copy text is important in search queries as the search engines will display
query text within the documents found in the search results. A search for the previous query
‘Toronto Ontario’ displays the search result in the first position with the following text for the
reader:
Parking Ticket Lookup. Review the status of your City of Toronto parking tickets anytime, anywhere, from your computer or mobile device.

As may be verified by the reader, the word ‘Toronto’ is highlighted for the reader; a query term
found in the copy text. Inspection of the page content that may be done by right clicking on a
web page and selecting ‘View Source’ in Internet Explorer version 11, yields the complete
HTML content that was used to generate the page layout. The result of this endeavor results in
locating the following:
<p>Read about the latest services, innovations and accomplishments at the City of Toronto.</p>

Inspection of the page attributes will also lead the reader to find searched keywords within div tags of the document. Div tags are displayed as <div></div> and are used as containers for content; do the search engines index these? According to Jerkovic, page copy is restricted to the paragraph and header tags; nonetheless it is an interesting test case to investigate in a possible regression formula for the approximation equation. One possible explanation for ignoring the div
tag in parsing content could be to prevent online resellers from keyword spamming. Keyword
spamming is the physical placement of keyword terms in page content that is hidden from the
viewer, but visible to the search engines. In the past web site owners were placing keywords in
the background and hiding the text from viewers by making the foreground color of the text the
same as the background color; invisible to the viewer, but visible to the search engines.
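Jerkovic's notion of copy – text inside the paragraph and header tags only – can be sketched with Python's built-in html.parser. This is a minimal illustration; real crawlers are far more elaborate, and the sample HTML below is hypothetical.

```python
from html.parser import HTMLParser

class CopyExtractor(HTMLParser):
    """Collect text only from the tags Jerkovic counts as page copy:
    <p> and <h1>..<h6>. Text in other containers such as <div> is ignored."""
    COPY_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many copy tags we are currently nested inside
        self.copy = []

    def handle_starttag(self, tag, attrs):
        if tag in self.COPY_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.COPY_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.copy.append(data.strip())

doc = "<div>hidden filler</div><p>Read about the latest services.</p><h1>Toronto</h1>"
parser = CopyExtractor()
parser.feed(doc)
print(parser.copy)  # ['Read about the latest services.', 'Toronto']
```

Note that the div content is skipped, matching the restriction discussed above.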
3.3.3 URL
The third component that Jerkovic (2010) identifies as having significance to the search
engines is the document URL. The document URL refers to the physical location of the
document in the Internet. The document URL along with the query terms found within the
document URL are displayed in the query results for the three search engines Bing, Google, and
Yahoo. In the case of Google the found search term is displayed in bold. In the case of the previously used search query ‘Toronto Ontario’, the first two results for Google displayed the following URLs in the query results:

1. www.toronto.ca
2. https://en.wikipedia.org/wiki/Toronto
An interesting point to note in the above is that the first result is the website for the city of
Toronto and the second result is the Wikipedia page for the city of Toronto. It should also be
pointed out that the search results for the first two entries were the exact same for Bing and
Yahoo. In the result above you will also note that one of the query terms ‘Toronto’ is also
displayed in bold as Google is directly informing the reader of the direct correlation between the
search query and the located URL. Bing also displayed the search terms in bold for the viewer to
physically see on the results page. Yahoo was the only search engine that did not place emphasis on the displayed URL by highlighting or bolding the search term within it. Does this represent a distinct difference between the search engines? Again, another point to take note of when evaluating the regression model.
3.3.4 Meta Tags
The Meta tag is the fourth component that Jerkovic (2010) identifies as having
significance to the search engine providers. Meta tags physically reside within the header section
of a web document. The reader should note that the phrase web document was used here as there are a vast number of technologies that may be used to render web content; while the mechanism changes, the web browsers will always read a hybrid of XML or HTML content. Web pages may have varying extensions such as:
• .asp – Classic Active Server Pages
• .aspx – .NET Server Pages
• .php – PHP documents
• .js – JavaScript files

Meta tags are atomic and contain varying attribute key values. The signature of a Meta tag is as follows:
follows:
<meta name="" content="">

The key attribute ‘name’ may contain one of three values: ‘description’, ‘keywords’ or ‘author’.
The description designation identifies the Meta tag containing a short description of content that
resides within the page content. The keywords designation identifies a Meta tag containing the
keywords that are pertinent to the document. King advises on using a limit of less than 20
keywords in the Meta tag for keywords as applying more results in the search engines deeming
this act to be keyword stuffing and may result in possible blacklisting of the entire page. The
author designation identifies the Meta tag as holding the value of the author of the current
document. Jerkovic states that the description Meta tag may be completely ignored by the search
engines and instead they may opt to use the description found on Dmoz.org. Dmoz.org is a web
directory for site content. This example highlights the extent to which the search engine providers will go in order to validate web content. King does state, however, that description Meta tags should be used and that the length of the description should be limited to 250 words at most.
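The Meta tag extraction and King's advisory limits can be sketched as follows. The comma-separated keyword convention and the checker interface are assumptions of this sketch; the thresholds (fewer than 20 keywords, at most 250 description words) come from King.

```python
from html.parser import HTMLParser

class MetaReader(HTMLParser):
    """Collect <meta name=... content=...> pairs from a document head."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if "name" in d:
                self.meta[d["name"]] = d.get("content", "")

def check_limits(meta):
    """Apply King's advisory limits: fewer than 20 keywords, and a
    description of at most 250 words."""
    warnings = []
    if len(meta.get("keywords", "").split(",")) >= 20:
        warnings.append("possible keyword stuffing")
    if len(meta.get("description", "").split()) > 250:
        warnings.append("description too long")
    return warnings

reader = MetaReader()
reader.feed('<meta name="description" content="Texas Tech University Dissertation">'
            '<meta name="keywords" content="SEO, Search Engine Optimization">')
print(reader.meta["description"], check_limits(reader.meta))
```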
3.3.5 Keyword Proximity
The fifth component that Jerkovic (2010) identifies as having significance to the search
engine providers is keyword proximity. Keyword proximity refers to the physical distance
between keywords found in page copy. Physical distance may be measured in bytes or words for
all practical purposes. A common practice in optimizing web content is the placement of
keywords in page copy to coincide with search queries. Take for example the query ‘Toronto
Ontario’, a search engine optimization technique for keyword proximity would be to create page
content as follows:
<h1>The largest city in Ontario is Toronto. Ontario is a province located in eastern Canada.</h1>

Bringing the search words together as they appear in the submitted query results in a
direct match to the search query; an optimization over similar page copy such as:

<h1>Ontario is a province located in eastern Canada. The largest city in Ontario is Toronto.</h1>

This is a direct technique that I have applied to web site content in order to optimize search
engine visibility, and I have seen directly positive results for the effort.
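The proximity measure described above, counted in words, may be sketched as follows. The function name and the tokenization rules are illustrative assumptions, not the measurement used by any search engine provider.

```python
def keyword_proximity(copy, first, second):
    """Minimal distance in words between two keywords in page copy.

    A distance of 1 means the keywords are adjacent, as in the
    optimized 'Toronto Ontario' example above. Illustrative sketch only.
    """
    words = [w.strip(".,;:!?").lower() for w in copy.split()]
    positions_a = [i for i, w in enumerate(words) if w == first.lower()]
    positions_b = [i for i, w in enumerate(words) if w == second.lower()]
    if not positions_a or not positions_b:
        return None  # one of the keywords is absent from the copy
    return min(abs(a - b) for a in positions_a for b in positions_b)

optimized = "The largest city in Ontario is Toronto. Ontario is a province."
assert keyword_proximity(optimized, "Toronto", "Ontario") == 1
```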
3.3.6 Keyword Prominence
The sixth component that Jerkovic (2010) identifies as having significance to the search
engine providers is keyword prominence. Keyword prominence refers to the physical location of
keywords with respect to the top of the document and the type of header tag used to display
content. King states that header tags that display larger text are perceived by the search engines
as having more important text, i.e. the text is more prominent on the page. It should be noted
however that this is a search optimization technique that has been exploited by the online
resellers to enhance their presence on the search engines. Through the use of cascading style
sheets (CSS) it is possible to format the physical appearance of page content with regards to
positioning or size. Two examples follow where the physical location or the physical size of
content is changed using CSS.
H1 { font-size: 7px; }

.showBottom { position: absolute; left: 0px; top: 300px; }
In the header tag definition the font size is changed to be smaller than the default; font size for all
header one tags is set to seven pixels. In the second definition for the class ‘showBottom’ the
physical location is changed to be absolute - zero pixels from the left, and three hundred pixels
from the top. An online reseller that is optimizing a page for iPhones for example could display a
series of iPhones in the top of the document and then format the encompassing tag to physically
display the content at the very bottom of the page!
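A crude detector for the CSS exploit described above might flag header selectors whose font size has been shrunk below a threshold. This sketch is an illustration only: it assumes raw CSS text, uses a simple regular expression rather than a real CSS parser, and its heuristic of matching selectors that begin with 'h' is deliberately naive.

```python
import re

def shrunken_headers(css, threshold_px=10):
    """Return header selectors whose declared font size falls below the
    threshold, the inverse of the prominence signal described above."""
    flagged = []
    for selector, body in re.findall(r"([^{}]+)\{([^}]*)\}", css):
        match = re.search(r"font-size:\s*(\d+)px", body)
        # Crude heuristic: treat selectors starting with 'h' as header tags.
        if match and selector.strip().lower().startswith("h") \
                and int(match.group(1)) < threshold_px:
            flagged.append(selector.strip())
    return flagged

css = "H1{ font-size: 7px; } p { font-size: 12px; }"
assert shrunken_headers(css) == ["H1"]
```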
3.3.7 Anchor Text
The seventh component that Jerkovic (2010) identifies as having significance to the
search engines is keywords found in anchor text. Anchor tags are the physical mechanism by
which web pages are linked to one another. Anchor tags are the basis for the creation of the web
by Lawrence G. Roberts; the physical linking of dispersed content across a network. Anchor text
is important in the search paradigm as the search engines will display the search term in the
URL. For the search performed previously ‘Toronto Ontario’, Bing provided the following links
in the search results:
• www.mapquest.com/maps?city=Toronto&country=ca
• www.torontosun.com/news/ontario
The links identified as being significant show the search terms directly in the URLs. When the
crawlers parse page content, any anchor text found that displays the above link
references would then be perceived as being relevant to the words found within the text string.
While the above example is derived directly from the search page of Bing the reader should note
that the links do not point to the root URL such as www.mapquest.com, but rather point to an
internal page within the domain. This type of internal link path would only be found by directly
linking from the main page or some subpage within the main domain or through the inclusion of
the link in a site map.
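The crawler behavior described above, associating anchor text with the target URL, may be sketched with the standard library parser. The class name and data layout are illustrative assumptions.

```python
from html.parser import HTMLParser

class AnchorParser(HTMLParser):
    """Collect (href, anchor text) pairs, the association a crawler would
    use to relate anchor text to a target URL."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href is not None and data.strip():
            self.links.append((self._href, data.strip()))

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

page = '<a href="http://www.torontosun.com/news/ontario">Toronto Ontario news</a>'
parser = AnchorParser()
parser.feed(page)
assert parser.links == [("http://www.torontosun.com/news/ontario", "Toronto Ontario news")]
```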
3.3.8 Domain Age
The eighth component that Jerkovic (2010) identifies as having significance to the search
engines is the duration for which the domain has been in existence. Domains that are newer are
viewed as having less significance than domains that have been in existence for a longer period of
time. Jerkovic also makes the argument that multiyear domain registrations are viewed
more favorably by the search engines than single year registrations.
3.3.9 Back Links
The ninth and final component that Jerkovic (2010) deems to be significant to the search
engine providers is the quality and quantity of back links to the page. The reader should note that
the literature up to now has only identified value of back links to coincide with type of domain
and not some perceived quality metric from the linking page as is already derived for the
individual pages by all the search engine providers. This will be the focus of the next
investigation that will be undertaken. Back links to the page are related to the infamous metric
called PageRank and the basis for the popular ranking algorithm of Google. Sun and Wei (2005)
define the PageRank algorithm as:
"… link structure-based algorithm, which gives a rank of importance of all the pages crawled
in the Internet by the Google's web crawler."

An interesting part of the definition given by Sun and Wei is 'importance'; for all intents and
purposes this metric may be viewed as the popularity or the sum of back links from one site to
another. Each site essentially has a voting share in what it deems to be pertinent content on the
web; this content pertinence is voted on by each site by its linking to some destination site.
Jerkovic makes the argument that back links from pages that have a PageRank value of at least 4
yield the best results. The result of this assertion by Jerkovic (2010) is that linking to some site
from just any site is fruitless or less than optimal unless the source site has a positive PageRank.
PageRank is defined by the PageRank wiki as follows:

PR(u) = ∑ PR(v) / L(v), summed over v ∈ Bu                                            [3.1]

where Bu denotes the set of pages linking to page u and L(v) denotes the number of outbound
links on page v.
The PageRank for an individual page is stated as being the sum of the PageRank of each linking
page divided by the outbound links from the same site. The end result of the paradigm is that
each contributing link adds to a sites presence on the search engines, but does so on an
incremental scale that is directly proportional to the PageRank of the contributing site. A search
engine optimization technique could then be to go through sites that allow an individual to place
comments and then proceed to create links back to some desired site. The HTML markup
contains an attribute of the anchor tag called 'rel' that when set to 'nofollow' informs the
crawlers to not consider the target URL in the PageRank algorithm; once again an example of the
evolving search indexing paradigm and the cat and mouse game that is played between the
search engine providers and the would be high indexed aspirers. King (2008) brings to light an
interesting aspect of the PageRank algorithm, this being that internal links will dilute the
PageRank value, i.e. they elevate the denominator L(v) of the above equation.
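The simplified PageRank formula discussed above, which omits the damping factor of the published algorithm, can be sketched as a power iteration. This is an illustration only; it assumes a small link graph in which every page appears as a key and no page is dangling.

```python
def simple_pagerank(links, iterations=50):
    """Iterate PR(u) = sum over v in B_u of PR(v) / L(v).

    links maps each page to the list of pages it links out to. This is the
    simplified formula quoted above; the published PageRank adds a damping
    factor that is omitted here for illustration.
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        for v, outbound in links.items():
            for u in outbound:
                # Each source v splits its rank across its L(v) outbound links.
                nxt[u] += rank[v] / len(outbound)
        rank = nxt
    return rank

# Two pages linking only to each other keep equal rank.
rank = simple_pagerank({"a": ["b"], "b": ["a"]})
assert abs(rank["a"] - rank["b"]) < 1e-9
```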
3.3.10 Click Through Counter
There has been much investigative work performed on search engine query logs and the
study of the correlation between query term(s) and links clicked from a resultant set presented.
Two of the major search engines currently track clickthrough information as it provides the
search engine service a direct correlation between query and user input and all of this for free for
the search engine providers. The major search engines that track user input are Google and
Yahoo. A simple query to each of the tracking search engines results in a DOM structure as
follows for each (extraneous content removed).
Yahoo:
<a title="..." href="http://r.search.yahoo.com/_ylt=AwrSbnhhJsJWag4Af1RXNyoA;_ylu=X3oDMTByYnR1Zmd1BGNvbG8DZ3ExBHBvcwMyBHZ0aWQDBHNlYwNzcg--/RV=2/RE=1455593186/RO=10/RU=http%3a%2f%2fwww.siliconvalleyrealestateteam.com%2f/RK=0/RS=SG2kugSazxVgi23QkEhL2RQIi_s-" target="_blank"> Silicon Valley Real Estate</a> Google:
<a href="http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0ahUKEwi8tOnIv_rKAhUQwGMKHX2IB-UQFgjXATAA&url=http%3A%2F%2Fwww.tripadvisor.com%2FVacation_Packages-g28930-Florida-Vacations.html&usg=AFQjCNE9oPZpHjWLTlmd-4vUWmO-yLijwg&sig2=JLXlbtAql51vHolN72kblA&bvm=bv.114195076,d.cGc">The Best Florida Vacation Packages 2016 - TripAdvisor</a> In each of the examples the direct link is not routed to the destination, but is first routed to the
search provider. Selection of the link does take you to the desired destination location, but only
after first being routed to the search provider. In the case of Bing, the resultant DOM structure
will generate an output similar to the following for the anchor tag structure (once again
extraneous content removed).
Bing:
<a href="https://nodejs.org/">Node.js - Official Site</a> The tracking of DOM level events for the purpose of clickthroughs is not part of the Bing
paradigm, different from that of the Google and Yahoo search engines. In the current model
developed it is not possible to include the clickthrough rate as part of the overall equation as this
is not a direct measurement that may be derived from the DOM structure or the link structure for
each node in the model, i.e. each of the pages. Carterette and Jones (2008) make note of the issue
with using clickthrough data as a distinguishing characteristic of value as they point to
clickthrough data being skewed. Mathematically speaking the only way to circumvent the
skewness of the data points would be to either collect all possible click events on the node or to
offset the skewness by some compensating factor. In the former case it would not be possible to
have one single search engine to process all clickthrough events on the web. In the latter case this
bias would simply entail an error factor in the equation and move the paradigm to an unusable
state rather than to refinement.
In the case of search results the findings presented to an audience after the query is sent
to the search provider are specific to the DOM elements of the node, not taking into account link
structure at least for now. The title tag is presented, as is the content of the page and the URL
of the page. If any of these result entry domains contain search specific content, that content is
highlighted for the user to view. The higher the degree of matching content present in each domain
presented to the viewer, the more relevant the content is deemed to be by the search engine
and consequently by the user. Given this premise the mathematical context is next presented over
each of the domains in the search view.
Title Bold Content ∝ Amount of Query Terms Found in Text
URL Bold Content ∝ Amount of Query Terms Found in URL
Verbiage Bold Content ∝ Amount of Query Terms Found in Verbiage Text

The degree of bold content in the results output of the Title tag, the URL, and the page verbiage
is directly proportional to the text matching in the particular domain. In this investigative study
the argument is made that given the fallibility with clickthrough rate mapping to search terms it
should be excluded from the paradigm. Jansen and Spink (2006) make the argument for
clickthrough data being relevant, but a counter argument to their premise is that a user identifier
or IP address to be specific, cannot be used to judge the behavior of one specific user. Take for
example the case of the query term ‘China’ from two devices at one household, which to the
outside world could be presented as the same IP address. If the query is performed by one
individual where the intent was for ‘Trips to China’ and for a second individual searching for
dinner ware, the context is lost. The preference between search term and selection is skewed in
this case. Smyth et al. (2004) make note of the paradigm that exists between user query and
language used; the authors note that it is paramount for the search engine to understand
the historical context between query and preferences. What Smyth et al. point to is the evolution
of the web and search indexing where derived search content needs to be specific to the user
and/or profile of the actor in the use case model. Smyth et al. point to this evolutionary process
materializing in the form of a Google search service (labs.google.com/personalized). The
argument for personalization is further enhanced by Smyth et al. when they point to a statistical
data set where 15% of the Excite queries were duplicates; so while specific users recycle past
queries in the general case the relationship between query and search maps to a probability in the
teens!
The main hurdle that exists in using clickthrough data to derive a conclusion for worth is
that the widget is not identical. Each user has their own biases and impediments to performing
the task at hand. Smyth et al. point to the fact that 90% of selections given a data set for some
query are conducted over the top five results; not even the full first page! If a metric is significant over
a data set then this metric needs to be encompassed over the complete data pool not just the top
five results in a typical result set of R (results per page) * P (pages of results), where P is 10 for
Google (10 results per page), 5 for Yahoo (12 results per page), and 5 for Bing (8 results per
page), or stated differently the amount of pages that are available for selection in the result set.
Given the best possible outcome in this scenario, i.e. P = 5 and R = 8 this would entail that the
query log analysis or the clickthrough rate would apply to 5/(8*5) = 5/40 = 12.5% of the search
results presented in 90% of the cases! Please note that the figure calculated previously would
represent a best case scenario for the clickthrough data and its relevance to search or an optimum
case. In the worst case scenario clickthrough data would apply to 5/(10*10) = 5/100 = 5% and
this would be the case for Google the majority share owner of search on the World Wide Web! A
deterministic model cannot be relied upon that has a total scope of 5% of truth to its underlying
argument. It should also be pointed out that the denominator in the equation is derived from the
total pages available for viewing on the first query, this value actually has a higher upper limit as
selection of the last link on the results grid provides for further results for all the search engines;
the 5% estimate from above should be a lot lower if not taking an optimistic point of view over
the data set.
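The coverage calculation above reduces to a one-line function; the figures reproduce the best and worst cases cited. The function name is illustrative.

```python
def clickthrough_coverage(results_per_page, pages, top_selections=5):
    """Fraction of the first-query result set covered by the top
    selections, mirroring the 5/(R*P) calculation above."""
    return top_selections / (results_per_page * pages)

# Best case cited above: Bing with 8 results per page over 5 pages.
assert clickthrough_coverage(8, 5) == 0.125
# Worst case cited above: Google with 10 results per page over 10 pages.
assert clickthrough_coverage(10, 10) == 0.05
```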
Another issue that exists with the use of clickthrough data in analyzing preference is that
it opens up the model to fraud, as alluded to by Smyth et al. (2004). In the current age of data
mining it becomes a very simple exercise to create a tool that would generate a query to the
search engine provider and then select some known entity from the resultant set, i.e. the shoe
store link of the unscrupulous website owner. This paradigm is similar to the paradigm that
existed once on the World Wide Web where website owners were placing keywords into the
background of the page, thus fooling the index engines into believing that the page contained
more relevant content than a competitor. While the argument was made in the research literature
for the tracking of clickthrough data what was not addressed given the context from above is the
justification for the tracking of the data. If the clickthrough data is not used to apply to search
rank what else could it possibly be used for? Well, what comes to mind immediately is keywords
and pay per click relationships. From an academic perspective this cannot be addressed here, but
it is enough to say that another possibility does exist outside of simple search engine indexing of
pages.
Huang et al. (2013) part from the premise that clicked documents directly represent a sink
that has relevance to the query. There is a fallibility in this argument in that it does not account
for errors in judgement, for example the user selecting the wrong link. A counter argument
could be that this represents a small proportion of the population; but then
please define small. Wang et al. (2014), like Huang and his colleagues, make a similar premise in
their case for using clickthroughs as a basis for measuring worth by proxy through search logs
collected from the Yahoo news search engine. The researchers used data from May 2011 to July
2011 and drew a comparison of intent based upon the context of a user group. The authors did
arrive at a favorable model to predict link favorability given a query under a context, which
much like the work of Smyth et al. (2004) points to the evolution of the World Wide Web and its
indexing. What is not addressed in the work of Wang et al. is the issue that cannot be addressed -
the utilization of third party software to tilt the scales in favor of one sink as opposed to another;
once again what is seen is that measurement of a system context needs to be by way of direct
measurement of the system attributes as is defined in Table 3.1 of this body of work. It is the
only solution put forth that treats each widget as being the same and void of a gross speculative
factor, i.e. the human element.
While many have argued for the incorporation of clickthrough data into
an underlying model to predict sink relevance, such a component cannot be
justified from a systems perspective, as such an element would entail that the behavior of a
system could be measured external to the specific attributes of the system at hand, but by
definition this type of attribute represents a proxy to the system at hand. While a proxy element
may mirror an actual system variable it does not entail that it is a fundamental component of the
system. Does a clickthrough rate represent a system attribute as some have argued or does it
represent the manifestation of the combination of link placement combined with highlighted text
in the search results which is a direct result of a query matching the verbiage (or lexical context)
of the text found by the search engine? Well, the indexing parameter that can be derived by way
of the Title tag and the index derived by way of the URL combined with the index derived by
way of the message content is directly measurable from the sink at hand and does not require the
incorporation of dissimilar widgets in the equation; a fundamentally more sound premise from
which to part from would be the argument here and the reason for which it is void in the
regression formula derived. Also worthy of mention is the fact that this metric ‘clickthrough’ data
is not available for measurement given the current state of the system, i.e. current page rank to
clickthrough data and the underlying reason for its dismissal from this body of work.
3.3.11 Lexical Context
One of the attributes that is being considered as part of the paradigm is the lexical
mapping of search content. Take for example the query of ‘Tshirt’ which could yield ‘T-Shirt’ or
‘T Shirt’. When each of the search engines under investigation was queried for the term ‘Tshirt’
the result set did bring up entries that differed in their textual representation, but where the
lexical context was the same. The regression formula utilized in this endeavor will incorporate
the lexical context of a search through the use of WordNet which may be found at
https://wordnet.princeton.edu. There exists an API that may be used with the Python
programming language called the ‘Natural Language Toolkit’ that may be found at
http://www.nltk.org. The algorithm that may be found in the appendix incorporates the Python
library of the Natural Language Toolkit to break down the search term into its lexical synonyms
from which point the desired indexing parameters can be determined. The ability to
incorporate the lexical context in the search endeavor and the indexing work effort will allow for
the modelling of the indexing to fit in line with the physical world and the process that is
followed by the search engine providers.
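The dissertation's pipeline uses WordNet through the Natural Language Toolkit; as a minimal stand-in that avoids the external dependency, the collapsing of surface variants such as 'Tshirt', 'T-Shirt', and 'T Shirt' can be sketched as follows. This normalization is an illustrative simplification and does not perform the synonym expansion that WordNet provides.

```python
import re

def lexical_key(term):
    """Collapse surface variants to one canonical key by stripping
    non-alphanumeric characters and lowercasing. A simplified stand-in
    for the WordNet/NLTK lexical mapping described above."""
    return re.sub(r"[^a-z0-9]", "", term.lower())

variants = ["Tshirt", "T-Shirt", "T Shirt"]
keys = {lexical_key(v) for v in variants}
assert keys == {"tshirt"}  # all three variants share one lexical key
```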
3.3.12 Attribute Summary
Table 3.1 represents the list of attribute pairs that have been defined through the literature
review to have relevance with search indexing and for which no proxy variable is utilized to
gauge value. This is the direct list of attributes that will be drawn upon to build a predictive
model to gauge an approximation to the search index of each of the major search engine
providers – Bing and Yahoo. In section 3.5 the literature, available by way of Google, will be
analyzed to derive those specific page attributes that Google has specifically laid bare to hold
value to its indexing algorithm and is done so here for the sake of completeness. While the list
proves to be a subset of the list below it does highlight those indicators from below that are valid
explicitly by way of Google.
Table 3.1 – Attribute Summary

Index   Attribute
1       Title Tag
2       Copy Text
3       URL
4       Meta Tags
5       Keyword Proximity
6       Keyword Prominence
7       Anchor Text
8       Domain Age
9       Back Links
3.4 Premise Introspection
Studies such as the one being undertaken have been performed in the past, but all those
found did not incorporate all of the page attributes identified in this paper and they also made
assumptions about the paradigm that led to inconsistent results. Dahiwale, Raghuwanshi &
Malik (2014) used a subset of all those attributes identified in this section to gauge search index
relevance. Dahiwale et al. (2014) utilized the tags header, title, body, and meta along with the
URL to build a regression model to gauge search index correlation. The formula derived by
Dahiwale et al (2014) is given below
t = (Nb*B) + (Nt*T) + (Nm*M) + (Nh*H) + (Nu*U)
Where:
Nb = Number of occurrences of search string in body tag <BODY>
Nt = Number of occurrences of search string in title tag <TITLE>
Nm = Number of occurrences of search string in meta tag <META>
Nh = Number of occurrences of search string in head tag <HEAD>
Nu = Number of occurrences of search string in URL

The authors also placed the following system constraints on their model:

M = 5, U = 4, T = 3, H = 2, B = 1

The authors further surmised that page content where the value of t > 3 was deemed significant
and content where the value of t <= 3 was deemed to be irrelevant. The conclusion surmised by
the authors was that their algorithm proved to be between 20% and 70% accurate. From the
work of King and Jerkovic it stands to reason that the underlying assumptions may have proved
to have been the demise of their thesis along with the lack of the additional components that have
been identified in this section of the literature.
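The Dahiwale et al. (2014) score quoted above can be sketched directly from its definition, using their weights M = 5, U = 4, T = 3, H = 2, B = 1. The dictionary layout and function name are illustrative assumptions.

```python
# Weights as constrained by Dahiwale et al. (2014).
WEIGHTS = {"meta": 5, "url": 4, "title": 3, "head": 2, "body": 1}

def dahiwale_score(counts):
    """t = (Nb*B) + (Nt*T) + (Nm*M) + (Nh*H) + (Nu*U).

    counts maps each region to the number of occurrences of the search
    string found there (Nm, Nu, Nt, Nh, Nb respectively)."""
    return sum(counts.get(region, 0) * weight for region, weight in WEIGHTS.items())

counts = {"title": 1, "body": 2}  # hypothetical page
t = dahiwale_score(counts)
assert t == 5
assert t > 3  # deemed significant under their threshold
```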
Pal, Tomar, and Shrivastava (2009) studied search engine results based upon link
structures and found positive results. A point of contention with the research effort of the
contributors must be acknowledged, namely a speculative component in the form of a weight table.
The research does show however that a positive correlation between content structure and search
results can be derived. This research finding by Pal et al (2009) does highlight a parallel between
their body of work and the large search engine providers such as Google.
Another body of work that mirrors the work of Pal et al (2009) is that of Mukhopadhyay,
Biswas, and Kim (2006). The authors studied ranking from the perspective of a weighted
attribute correlation paradigm. A significant component of the Mukhopadhyay et al (2006)
algorithm was the concept of ‘Authority’ as a weight or a load-bearing component on the system.
The work of Mukhopadhyay (2006) highlights the findings of Jerkovic (2010) and specifically
the argument that Jerkovic makes with regards to source of information having a bearing on
rank. Jerkovic states that searching for the term ‘Hilltop Algorithm’ brings up the Wikipedia
page first on Google even though the content of the page, i.e. the header and page tags inner text
are miniscule. The source of information will need to be addressed in any regression model that
is created to approximate the actual findings of the search engines unless the data dictates
differently. To the point of Mukhopadhyay et al (2006) there is another argument that is made by
Spirin and Han (2014) that highlights the complex dynamic of search engine results. Spirin and
Han point to the fact that a link farm affects the PageRank in a positive way; so while
Mukhopadhyay et al (2006) make the argument of source being a factor, Spirin and Han make the
argument that contributing links also affect rank irrespective of source. This divergence in
paradigm between Spirin and Han and Mukhopadhyay et al (2006) highlights the complexity of
the search engine paradigm – using a network nomenclature; sinks are affected by the source and
a weighted paradigm. This paradigm may be stated as follows:
Let:
i = Inbound Links
µ = Authority Factor
PR = PageRank

Where: µ = { X | X ∈ [Real Numbers] }

Such That: PR = ∑ i * µ
The above represents a partial definition of what needs to be created – a regression model
encompassing all of the constraints identified in this section of the paper.
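The partial definition above may be sketched as follows. The data layout, with one (inbound link count, authority factor) pair per contributing source site, is an illustrative assumption.

```python
def weighted_rank(inbound):
    """PR = sum of i * mu over contributing sources, per the partial
    definition above. inbound is a list of (link_count, authority_factor)
    pairs, one per source site."""
    return sum(i * mu for i, mu in inbound)

# Two hypothetical sources: a high-authority site contributing 3 links
# and a low-authority link farm contributing 50 links.
rank = weighted_rank([(3, 2.0), (50, 0.1)])
assert abs(rank - 11.0) < 1e-9
```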
3.5 Google
Google provides its user base with a search engine guide to utilize when formatting page
content; this guide may be found at the link below.
http://static.googleusercontent.com/media/www.google.com/en/us/webmasters/docs/search-
engine-optimization-starter-guide.pdf
The Google guide provides a general reference to follow when optimizing page content
for the web. While the guide identifies some of the attributes that are deemed significant by
Google it does not disclose the complete algorithm. The specific page attributes that Google
deems to be relevant in their evaluation of page content follows next.
3.5.1 Title Tag
The first component that Google identifies as being significant to their crawler is the title
tag <title></title>. The title tag is displayed for users by Google in the search results and words
entered in the query that are found in the title of the document are highlighted for the reader.
Google also states that each page created should have a unique title tag; does this allude to a
possible glimpse into their algorithm? Could it be that Google has placed the title of the page in
some sort of search tree that is traversed and that each duplicate node is flagged?
3.5.2 Meta Tags
The second component that Google identifies as having significance to their algorithm is
the description meta tag. The description meta tag contains the key value pair
name="description". Google also makes note that the description text may be used by Google in
the search engine results. Content that is displayed for the user on the search engine results page
is significant because the bold text shown is more likely to catch the eye of a normal reader and
consequently be clicked by the reader.
3.5.3 URL
The third component that is identified by Google as having significance to their search
engine is the URL. Keywords in the URL will be displayed for the user on the search results
page and be highlighted – again another eye catcher for the reader. Google also makes note that
short URLs are preferred over long ones and that a consistent directory structure should be used
to display content. Take for example the case of an online retailer selling shoes then a directory
structure and consequently a URL structure such as that given below would be preferred.
/mens/shoes/nike/1890.html
/womens/shoes/Adidas/1890.html

The above is the preferred method by Google to structure content as opposed to some URL
structure such as any of the following:
/mens_shoes_nike_1890.html
http://www.somesite.com?id=18902569775

Does the above point to an indicator of what Google is doing internally to evaluate worth? A
string of text separated by some specific character can be split into an array at which point the
specific keywords associated with the URL become clear. To compute an index from this
perspective becomes easier to perform than performing the same operation over the first or
second immediate example from above.
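The split-into-an-array idea described above can be sketched with the standard library; a clean directory style URL decomposes into keyword tokens, while an opaque query string does not. The function name is illustrative.

```python
from urllib.parse import urlparse

def url_keywords(url):
    """Split a URL path on '/' into keyword tokens, the array-splitting
    operation speculated upon above."""
    path = urlparse(url).path
    return [segment for segment in path.split("/") if segment]

assert url_keywords("http://www.somesite.com/mens/shoes/nike/1890.html") == \
    ["mens", "shoes", "nike", "1890.html"]
# The opaque query-string form yields no keyword tokens at all.
assert url_keywords("http://www.somesite.com?id=18902569775") == []
```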
3.5.4 Anchor Text
The fourth component that Google identifies as having value to its search engine is
navigation text or anchor text if you will. Google states that anchor text should be simple text
and as short as possible. Google has directly stated in this case that length does matter. Does
length matter because it entails a smaller byte array to store internally to their system or does it
matter because an index is calculated for each anchor text component?
3.5.5 Image Alternate Tags
The fifth component that Google identifies as having value to their search engine is the
alternate attribute of the image tag. The ‘alt’ component identifies a text string to associate with
each image, Meta text of the image if you will. For the visually impaired the text component of
the image tag has significant meaning and a clear indicator of how Google is tailoring content to
their audience.
3.5.6 Header Tags
The sixth element that Google identifies as having value to their search engine is the
header tags <hx></hx>, where ‘x’ is an integer and a value between one and six. This assertion
by Google is consistent with King and Jerkovic and a clear indicator that header tags need to be
encompassed in the regression model to build.
While the Google specification only alludes to six components driving search results the
work of the researchers discussed in this section clearly indicate a more complex model at work.
The regression model to build must incorporate the six components defined by Google, but it
must also encompass a larger framework and is the focus of the discussion to follow.
3.5.7 Google Attribute Summary
Table 3.2 lists those variables that have been identified by way of the Google
documentation to provide worth to the search indexing paradigm of the search provider. Two
distinct differences from the attribute pairs identified in Table 3.1 are the ‘Image Alternate Tags’
and the ‘Header Tags’. The header tags have been classified in the generic model defined as
‘Copy Text’ as the current styling and formatting paradigm using Cascading Style Sheets makes
presentation flexible and what may be classified as a header one tag may actually be paragraph
content. Also, void from the generic formula is the ‘Image Alternate Tags’ attribute, since these
tags point to images and not to linked web pages directly, i.e. not to a sink that has a mappable
attribute array, and as such they were left off of the generic model.
Table 3.2 – Google Attribute Summary

Index   Attribute
1       Title Tag
2       Meta Tags
3       URL
4       Anchor Text
5       Image Alternate Tags
6       Header Tags
3.6 Yahoo
Yahoo does not provide a search engine guide to optimize page content. While there may
exist some similarity between search engine categorization by Yahoo and that of another provider
such as the Google search engine, there does not exist any documentation provided by
Yahoo to allude to the page attributes the search engine provider finds relevant. For Yahoo this
paper will assume that page indexing follows a similar pattern to that of Google and as such this
paper will map those same page attributes to the search engine provider.
3.7 Bing
Bing, much like Yahoo, does not provide a general guide to optimize search engine page
content. Bing does provide its user base with an online tool to help determine which keywords
are relevant for users in an effort to help the user create page copy that would be of interest to its
user base. Implicitly this implies that page content is deemed significant by the search engine
Table 3.2 – Google Attribute Summary
Texas Tech University, Guillermo Antonio Rodriguez, May 2017
52
provider, but explicitly there is no key indicator as to what methodology may be used to
maximize search engine presence.
3.8 Algorithm
An algorithm was defined to extract content from the search engine providers and categorize
that content for model fitting. The algorithm follows these steps:
1. Retrieve search engine content for a query term, by search engine provider. A repository of available English words was used to create query terms, each selected at random from the list. A partial list of English words may be downloaded from http://www-01.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt; the document contains a total of 109,583 words. The algorithm selects words in sequence from a file called words_chosen.txt, found in the GitHub repository https://github.com/guillermorodriguez/Dissertation. The words_chosen.txt file was created by parsing the downloaded words file and writing a randomly chosen word to words_chosen.txt; this process was repeated 100 times to create 100 distinct query words. The Python algorithm that extracts the 100 distinct words may be found in the file words_set.py. The files referenced here are available in the GitHub repository and in Appendix 'A' of this body of work.
2. Crawl each URL retrieved and determine whether the system attributes contain the query term. The lexical dictionary WordNet is used to index lexical content, mapped with the use of the Natural Language Toolkit (NLTK) library. For each attribute found, the specific indexing value for each system attribute is tabulated.
3. Map the attributes collectively to the mathematical model and write them to file for later processing through the R utility.
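Step 2 above tabulates, for each copy tag, the ratio of keyword occurrences to total words. The sketch below is not the dissertation's crawler; it is a minimal stand-in using only the Python standard library, with hypothetical sample HTML, that computes that per-tag ratio for the copy tags named in Section 3.9:

```python
from html.parser import HTMLParser

# Copy tags whose text contributes to the Page Copy Index (Section 3.9).
COPY_TAGS = {"div", "h1", "h2", "h3", "h4", "h5", "h6", "p", "span"}

class CopyExtractor(HTMLParser):
    """Collect the text found inside each copy tag of interest."""
    def __init__(self):
        super().__init__()
        self.stack = []                        # currently open copy tags
        self.text = {t: [] for t in COPY_TAGS}

    def handle_starttag(self, tag, attrs):
        if tag in COPY_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack:                         # attribute text to innermost tag
            self.text[self.stack[-1]].append(data)

def tag_index(words, keyword):
    """Keywords in tag / total words in tag (0.0 for an empty tag)."""
    if not words:
        return 0.0
    return sum(1 for w in words if w.lower() == keyword.lower()) / len(words)

def page_copy_indexes(html, keyword):
    parser = CopyExtractor()
    parser.feed(html)
    return {t: tag_index(" ".join(parser.text[t]).split(), keyword)
            for t in sorted(COPY_TAGS)}

sample = "<div>adheres here</div><p>nothing relevant in this paragraph</p>"
print(page_copy_indexes(sample, "adheres")["div"])  # 0.5
```

The same per-tag ratios feed the Page Copy Index decomposition developed in Section 3.10.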
The algorithm is evaluated over a series of terms to create a database mapping queries to
results. The results are then used to create an approximation model for each of the major
search engines, Bing and Yahoo. The regression model used by the algorithm is explored in the
next section of this body of work.
3.9 Search Engine Approximation Model
Each of the search engine providers Bing and Yahoo adheres to a paradigm used to rank content.
While it was only Google that provided a definitive guide to its search paradigm, past
experience indicates that the algorithms are similar in the criteria used to rank content;
otherwise a search engine optimization effort would differ vastly between search engines,
something not found in the practical domain. What follows, in the form of a formula, is the
best approximation of the ranking algorithm for each search engine provider that may be
determined from the research literature.
Bing does not disclose the intricacies of its paradigm; the same is true of the Yahoo search
engine, so the base formula for both providers is derived from the research literature. The
base formula for these two search engine providers is identical and is refined by way of the
data obtained from each provider respectively. The data used as inputs to the model formula
defined below come directly from the data mining algorithm given in Appendix A of this
document.
Let:
S = Search Engine Index
Bn = Slope of Component 'n'
D = Meta Tag Description Index → Keywords / Total Words in Description
I = Inbound Links → External Links to Page
K = Meta Tag Keywords Index → Keywords / Total Words in Keywords
O = Outbound Links → Links to External Pages
P = Page Copy Index → Keywords in Copy / Total Words in Copy
T = Title Index → Keywords in Title / Total Words in Title
U = URL Index → Keywords in URL / Total Words in URL
Given the above parameter definitions then the formula for each of the search engine
providers Bing and Yahoo may now be defined as follows:
S = B1D * B2I * B3K * B4O * B5P * B6T * B7U + µ    [3.2]
Missing from the above equation, though identified in the literature as relevant, is the user
click-through index. This value cannot be measured from the system attributes and has therefore
been left out of the equation.
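The multiplicative form of equation 3.2 is straightforward to evaluate once slopes are known. In the sketch below the slope values (betas) and attribute indexes are made up purely for illustration; in practice B1 through B7 and µ come from the fitted regression:

```python
def search_index(attrs, betas, mu=0.0):
    """S = B1*D * B2*I * B3*K * B4*O * B5*P * B6*T * B7*U + mu (equation 3.2)."""
    s = 1.0
    for name, beta in betas.items():
        s *= beta * attrs[name]
    return s + mu

# Hypothetical attribute indexes and slopes, for illustration only.
attrs = {"D": 0.5, "I": 10, "K": 0.2, "O": 3, "P": 0.4, "T": 1.0, "U": 0.25}
betas = {"D": 1.0, "I": 0.1, "K": 2.0, "O": 0.5, "P": 1.0, "T": 1.0, "U": 2.0}
print(round(search_index(attrs, betas), 6))  # 0.06
```

Because the model is a product of terms, a zero in any single attribute drives S to µ, which is one way the multiplicative form differs from a standard additive linear model.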
In the model formula the parameter 'I' may be measured for Yahoo by way of the advanced search
parameters, issuing a query such as 'link:[URL] -site:[BASE_URL]', where URL is the end point
being indexed and BASE_URL is the root URL. The list of available advanced query parameters for
Yahoo may be found at http://www.wikihow.com/Count-Inbound-Links-to-a-Website-With-Yahoo%21. In
the case of Bing the inbound link index may be tabulated by issuing the same query term,
'link:[URL] -site:[BASE_URL]', just as for Yahoo.
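The inbound-link query described above is a simple string template; a minimal sketch (the 'link:' and '-site:' operators are as given in the text, the example URLs are hypothetical):

```python
def inbound_link_query(url, base_url):
    """Build the 'link:[URL] -site:[BASE_URL]' query used for Bing and Yahoo."""
    return "link:{} -site:{}".format(url, base_url)

print(inbound_link_query("http://example.com/page", "example.com"))
# link:http://example.com/page -site:example.com
```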
3.10 Analysis Underpinnings
The algorithms depicted in Appendix A were used to create the data input files that were
later fed into the statistical modeling utility R (https://www.r-project.org/). The algorithms
utilized were created using the Python programming language (https://www.python.org/) version
3.4.3. The version of R that was used was 3.3.1. The complete source code along with the data
files may be obtained through a GitHub repository at the following URL
https://github.com/guillermorodriguez/Dissertation.
Equation 3.2 above contains an attribute termed 'Page Copy Index' that represents the keywords
in the page copy divided by the length of the copy text. Page copy is found in HTML documents
in variants of tag definitions, such as the header tags or paragraph tags. This composition
element may be further decomposed as follows.
Let:
P = Page Copy Index → Keywords in Copy / Total Words in Copy
DI = Division Tag Copy Index → Keywords in Tag / Total Words in Tag
H1 = Header One Index → Keywords in H1 Tag / Total Words in H1 Tag
H2 = Header Two Index → Keywords in H2 Tag / Total Words in H2 Tag
H3 = Header Three Index → Keywords in H3 Tag / Total Words in H3 Tag
H4 = Header Four Index → Keywords in H4 Tag / Total Words in H4 Tag
H5 = Header Five Index → Keywords in H5 Tag / Total Words in H5 Tag
H6 = Header Six Index → Keywords in H6 Tag / Total Words in H6 Tag
PA = Paragraph Index → Keywords in Paragraph Tag / Total Words in Paragraph Tag
SP = Span Index → Keywords in Span Tag / Total Words in Span Tag
Where:
P = DI + H1 + H2 + H3 + H4 + H5 + H6 + PA + SP
Such that:
B5P = B51DI + B52H1 + B53H2 + B54H3 + B55H4 + B56H5 + B57H6 + B58PA + B59SP    [3.3]
The page copy content is essentially the amalgamation of the individual copy tags: the header
tags, division tag, paragraph tag, and span tag text. Substituting equation 3.3 into equation
3.2 yields the following formula.
S = B1D * B2I * B3K * B4O * (B51DI + B52H1 + B53H2 + B54H3 + B55H4 + B56H5 + B57H6 + B58PA + B59SP) * B6T * B7U + µ    [3.4]
Equation 3.4 above is a composite formula that combines the system attributes to assess value.
The equation is a departure from the standard linear regression formula, such as that given by
Dahiwale et al. (2014), and the composition may be supported by plotting the linear correlation
between the dependent and independent variable pairs. This attribute may be investigated in R
with the pairs function, which displays a grid plot of independent versus dependent variables
for a given data matrix. The data matrix in this case was loaded into R through the read.table
function.
Figure 3.1 below shows this plot for the data matrix given in Appendix B. The data in Appendix
B were collected with the BING algorithm for the search term 'adheres'. The plots in figure 3.1
show that no simple linear correlation exists between the dependent variable, page index, and
any of the independent variables. The relationship plots are a clear indicator that system
homeostasis is complex and not bound to a single relationship, which further supports the
argument posed in equation 3.4.
Figure 3.2 below shows the same plot for the data matrix given in Appendix C. The data in
Appendix C were collected with the YAHOO algorithm for the search term 'adheres'. The plots in
figure 3.2 likewise show no simple linear correlation between the dependent variable and any of
the independent variables; much as in the case of the BING search engine, a complex dynamic
affects system homeostasis.
The mapping of system attributes to collection file fields is displayed in Table 3.3. Both data
mining algorithms output a tab-delimited file structure with the fields identified in Table
3.3, mapped to the system variables as listed.
The next step in the process was to evaluate the formula posed for each search engine provider
so that goodness of fit could be determined from the data collected. For each search engine,
the data collected with the algorithm in Appendix A of this body of work provided the input to
the R statistical tool.
The data retrieval process was not uniform, as each search engine provider tailors its output
to prevent bodies of work such as this one from disclosing proprietary information; each data
mining algorithm was therefore executed multiple times for the same search terms in order to
obtain a statistical average of search index position. In the case of Bing the algorithm was
executed 4 times over the 100 randomly chosen query terms, yielding a total of 400 data files;
the base files generated by the algorithm may be found in the GitHub repository under the node
src/BING/data/Source. In the case of the Yahoo data extract the algorithm was likewise executed
4 times for each of the 100 random keywords identified earlier, creating a total of 400 input
files that may be found in the repository under the directory src/YAHOO/data/Source.
In both extract processes the data files were generated in a tab-delimited format, allowing the
data sets to be parsed in a consistent manner. Please note that while the .DATA file extension
was used for the extraction files, this does not designate the files as proprietary to any
system, application, or programming language; it was merely the chosen convention.
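Because the .DATA extracts are plain tab-delimited text with a header row, they can be parsed by any standard reader. A minimal Python sketch (the column names in the sample string are illustrative; Table 3.3 lists the real headers):

```python
import csv
import io

def load_data(fileobj):
    """Parse a tab-delimited extract into a list of float-valued row dicts."""
    reader = csv.DictReader(fileobj, delimiter="\t")
    return [{k: float(v) for k, v in row.items()} for row in reader]

sample = "Index\tUrl\tTitle\n1\t0.25\t0.5\n2\t0.0\t0.16\n"
rows = load_data(io.StringIO(sample))
print(rows[0]["Url"])  # 0.25
```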
File Header (Attribute)                 System Attribute
Index                                   S – Search Engine Index
Url                                     U – URL Index
Description                             D – Meta Tag Description Index
Div, H1, H2, H3, H4, H5, H6, P, Span    P – Page Copy Index
Inbound_Links                           I – Inbound Links
Keywords                                K – Meta Tag Keywords Index
Outbound_Links                          O – Outbound Links
Title                                   T – Title Index
Table 3.3 – Attribute Mapping
Fig. 3.1 – Pair Relationship Plots in R – Bing Data
Fig. 3.2 – Pair Relationship Plots in R – Yahoo Data
3.11 Bing Formula
The sample data provided in Appendix B shows the column headers on the first line of the data
sample file; fields are tab delimited and rows are newline terminated. Given the space
allocations, each line bleeds through to the second line in the sample data example. The
maximum, minimum, and average values for each base system attribute, as tabulated from the
input file _complete_r.dat, are given below in Table 3.4. The algorithm used to create Table
3.4 may be found in Appendix A under the title pySummary.py.
The data contained in the file _complete_r.dat was fed into R through the read.table function,
as in the sample given below. The place holder '[File Including Path]' is the
Attribute Minimum Maximum Average
Description 0.0 1.3 0.0
DIV 0.0 100.0 0.18378059758120993
H1 0.0 10.0 0.20716051797863566
H2 0.0 40.0 0.21060367446109785
H3 0.0 14.0 0.04897304115383433
H4 0.0 7.0 0.012779480243865575
H5 0.0 3.545454545454545 0.0022305186012005532
H6 0.0 2.8137651821862346 0.0013185283482377767
Inbound Links 0.0 582.0 14.889566203572441
Keywords 0.0 2.0 0.04248639370365007
Outbound Links 0.0 1643.0 35.47462432662319
Paragraph 0.0 66.0171397982293 0.23883382748042553
URL 0.0 1.3571428571428572 0.12196129911513241
Span 0.0 93.01220720322118 1.1248293028668193
Title 0.0 2.723809523809524 0.16058082343991514
Table 3.4 – Bing Attribute Statistics
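The per-attribute minimum, maximum, and average of Table 3.4 are simple column statistics; a sketch along the lines of pySummary.py (not the original file) over rows parsed into dictionaries:

```python
def summarize(rows):
    """Per-attribute (minimum, maximum, average), as reported in Table 3.4."""
    summary = {}
    for key in rows[0]:
        vals = [row[key] for row in rows]
        summary[key] = (min(vals), max(vals), sum(vals) / len(vals))
    return summary

# Two hypothetical rows, for illustration only.
rows = [{"Title": 0.0, "URL": 0.1}, {"Title": 2.0, "URL": 0.3}]
print(summarize(rows)["Title"])  # (0.0, 2.0, 1.0)
```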
resource location of the data to provide as input. The place holder '[Field Separator]' is the
escape character sequence that delimits each field, such as the tab character '\t'. The
attribute 'header' designates whether the file contains header fields, where a value of TRUE
implies existence.
lm.data <- read.table("[File Including Path]", sep="[Field Separator]", header=TRUE)    [3.5]
The model, or equation to fit by way of a regression, is specified using the notation given
immediately below in equation 3.6. The sample data file provided in Appendix B has the system
attributes in the header, as defined in Table 3.3. Equation 3.6 together with the variable
mapping pairs of Table 3.3 results in the generalized model shown in equation 3.7.
lm.fit <- lm(y ~ x)    [3.6]
S = D * (Div + H1 + H2 + H3 + H4 + H5 + H6 + P + Span) * I * O * K * U * T    [3.7]
Given equation 3.7 and the format required by R, the model derivation results in the model
defined in equation 3.8.
lm.fit <- lm(1/index^15 ~ inbound_links / (root * outbound_links * title * description * keywords * (div + h1 + h2 + h3 + h4 + h5 + h6 + p + span)))    [3.8]
Equation 3.8 is equation 3.7 with two components modified: the expression of the dependent
variable, and the creation of the ratio. The exponent of the dependent variable is the quantity
of independent variables in the equation set. The ratio portion of the equation was arrived at
by way of the R regression utility; feeding the model into R identified the significant
variable pairs and their correct position within the overall equation. It was this iterative
process that led to the optimal formula identified in equation 3.8. By definition of the Google
Rank paradigm, the position or rank of a page is the probability of the end point being
selected by an end user during a search. The probability of choosing one of a series of options
is x/n, or stated differently the ratio of selected options over the data set. It was under
this premise that the dependent variable was first investigated in the form 1/index. Subsequent
iterations investigated the dependent variable as the series 1/index^n, where n was set to an
integer between 1 and the total number of attributes available, i.e. the left-hand rows of
Table 3.3. Under this primitive discourse the optimum was discovered and found to be 15 for the
BING formula.
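The exponent search described above amounts to a one-dimensional grid search over n in 1/index^n. The sketch below stubs out the fit score with a toy function; in the actual analysis the score was the adjusted R-squared reported by R for the fitted model:

```python
def best_exponent(indexes, score_fn, n_max=15):
    """Try 1/index**n for n = 1..n_max and keep the best-scoring transform."""
    best_n, best_score = None, float("-inf")
    for n in range(1, n_max + 1):
        transformed = [1.0 / idx ** n for idx in indexes]
        score = score_fn(transformed, n)
        if score > best_score:
            best_n, best_score = n, score
    return best_n

# Toy score that peaks at n == 15, mirroring the Bing result; the real
# score would come from refitting the regression at each n.
toy_score = lambda transformed, n: -abs(n - 15)
print(best_exponent([1.0, 2.0, 3.0], toy_score))  # 15
```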
Once equation 3.8 was fed into R, the statistical factors were obtained through the summary
function, applied to the model fit over the file _complete_r.dat. This file contained a total
of 7,055 entries; fitting the model defined in equation 3.8 resulted in an Adjusted R-squared
value of 0.31, which entails that the model proved to be 31% effective. The generalized formula
that results from this analysis, given the R summary data, is provided in equation 3.9.
1/index^15 = 0.001141*I / ( (-0.002473)*U * (-0.000009302)*O * (-0.001788)*T * (-0.02655)*D * (-0.03175)*K * ( (-0.0005829)*DIV - 0.003683*H1 + 0.003760*H2 - 0.0008250*H3 + 0.003665*H4 + 7.854*H5 + 0.004228*H6 - 0.001312*P + 0.00004138*Span ) ) - 0.002291    [3.9]
The formula derived by way of the R statistical tool gave a Multiple R-squared value of 0.3523
and a Residual Standard Error of 0.0721. The analysis shows that a complex dynamic is at work,
one that could only be approximated with the system variables in an all-encompassing model.
While the derived model was 31% effective, it does show that more work is needed to be able to
predict search results. It further needs to be pointed out that the p-value is much less than
0.05 (2.2x10^-16), which implies that the analysis has a very low probability of having been
arrived at by chance, further validating the results herein.
An investigation also took place in which a second model used only the attributes that received
a significance code at the three-star R designation. In this case the model resulted in
essentially the same prediction factor, though slightly higher at 31.05%. While this refinement
yields a prediction factor 0.0005 points better, it removes base system constraints from the
model, so the original premise of equation 3.9 is retained. Removing some of the base system
variables from equation 3.9 would simply entail that overfitting was taking place, moving the
argument from the general to the specific case.
Taking equation 3.8 and performing a logistic regression on the model through R allows an
alternative analysis to be performed. The model fed into the R statistical package is given in
equation 3.10 below.
lm.fit <- glm(1/index^15 ~ inbound_links / (root * outbound_links * title * description * keywords * (div + h1 + h2 + h3 + h4 + h5 + h6 + p + span)), family="binomial")    [3.10]
The data summary from the logistic regression may be obtained from the R statistical package
through the use of the summary function, which provides an alternative hypothesis to equation
3.9. This alternative hypothesis is given in the form of equation 3.11, stated below.
log(p/(1-p)) = 1/index^15 = (1.198e12)*inbound_links / ( (-2.88e12)*root * (-3.866e10)*outbound_links * (-5.963e12)*title * (-1.017e14)*description * (-5.387e14)*keywords * ( (-8.851e12)*div + (-9.563e13)*h1 + (3.953e12)*h2 + (-6.215e13)*h3 + (8.406e13)*h4 + (2.060e18)*h5 + (-2.888e13)*h6 + (-2.637e13)*p + (1.522e13)*span ) ) - 4.069e14    [3.11]
Utilization of the ROC function in R allows a ROC plot to be derived with the plot function;
the plot is shown in Fig. 3.3 below. Calculating the area under the curve, or the probability
of accuracy, yields a total of 0.4242. This analysis shows a dramatic improvement over the
linear regression performed earlier, 11.42 percentage points, and further enforces the argument
for a better predictive model for the Bing search engine. The full analysis performed herein
may be found in the GitHub repository under the directory /BING/R in the file 'commands -
chapter 3 - glm.txt'.
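The area under the ROC curve reported above can be computed from (false positive rate, true positive rate) points with the trapezoidal rule; a minimal sketch (the actual analysis used R's ROC utilities):

```python
def auc(points):
    """Trapezoidal area under a ROC curve given (fpr, tpr) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# The diagonal (a random classifier) has an AUC of exactly 0.5.
print(auc([(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]))  # 0.5
```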
Fig. 3.3 – Receiver Operating Characteristic (ROC) Curve – Bing Data
3.12 Yahoo Formula
The sample data provided in Appendix C represent data captured for the YAHOO search engine; the
column headers are on the first line of the data sample file, fields are tab delimited, and
rows are newline terminated. Given the space allocations, each line bleeds through to the
second line in the sample data, as was the case for the Bing data file. The maximum, minimum,
and average values for each base system attribute, as tabulated from the input file
_complete_r.dat, are given below in Table 3.5. The algorithm used to create Table 3.5 may be
found in Appendix A under the title pySummary.py; this was a common algorithm used for both
Bing and Yahoo data processing.
Attribute Minimum Maximum Average
Description 0.0 1.3 0.0
DIV 0.0 100.0 0.16042360354688315
H1 0.0 10.0 0.24207932924893227
H2 0.0 43.3287784679089 0.33377193794607646
H3 0.0 14.0 0.059488461850749914
H4 0.0 7.0 0.016739919888699308
H5 0.0 3.545454545454545 0.004076729165437885
H6 0.0 2.8137651821862346 0.002874918873375806
Inbound Links 0.0 582.0 11.479391944836232
Keywords 0.0 1.6666666666666667 0.05287994966319452
Outbound Links 0.0 1389.0 36.668077103902206
Paragraph 0.0 124.5787019648197 0.2639037744783405
URL 0.0 1.3571428571428572 0.13570601600935772
Span 0.0 93.01220720322117 1.408042685627342
Title 0.0 2.723809523809524 0.17548509294130696
Table 3.5 – Yahoo Attribute Statistics
The data contained in the Yahoo file _complete_r.dat was fed into the statistical software
package R through the read.table function, as given in equation 3.5 and as done previously for
BING. The generalized model used to derive system value was as shown in equation 3.7.
As in the BING modeling exercise, the process was repeated with the Yahoo data file
_complete_r.dat, found in the code repository under the directory src/YAHOO/data and sampled in
Appendix C. The data were imported into the R statistical tool and the regression model was
evaluated using the same equation as depicted in equation 3.7. An iterative process was
followed to determine the exponent of the index factor, for which an optimum value of 1 was
determined. This resulted in a model with an Adjusted R-squared value of 0.1615, or stated
differently a model accuracy of 16.15%. The generalized formula that results from the R
analysis is provided in equation 3.12 below.
1/index = (-0.05977)*K * (-0.01003)*U * (-0.001494)*T * (-0.00000477)*I * (-0.000007963)*O * (-0.04570)*D * ( (-0.01385)*DIV + 0.002041*H1 + 0.05928*H2 + 0.0003611*H3 + 0.08388*H4 + 0.04422*H5 + 1.951*H6 - 0.01690*P + 0.003334*SPAN ) + 0.04643    [3.12]
The formula parameters derived by way of the R statistical tool gave a Multiple R-squared value
of 0.2353 and a Residual Standard Error of 0.09038. The p-value is 2.2x10^-16, significantly
below 0.05, which entails that the probability of having arrived at the result by chance is
minuscule, thus validating the results found.
As was the case for BING, the Yahoo formula of equation 3.12 was investigated using only the
base system attributes that carried significance codes at the three-star R level, of which
there was one: H2. Using this single element and removing the excess attributes resulted in an
Adjusted R-squared value of 0.01364. This prediction factor leads to the conclusion that the
original premise given in equation 3.12 must stand, for now, as the general solution to the
problem frame for the linear regression.
Taking equation 3.12 and performing a logistic regression on the model through R allows an
alternative analysis to be performed. The model fed into the R statistical package is given in
equation 3.13 below.
lm.fit <- glm(1/index ~ keywords*root*title*inbound_links*outbound_links*description*(div+h1+h2+h3+h4+h5+h6+p+span), family="binomial")    [3.13]
The data summary from the logistic regression may be obtained from the R statistical package
through the use of the summary function, which provides an alternative hypothesis to equation
3.12. This alternative hypothesis is given in the form of equation 3.14, stated below.
log(p/(1-p)) = 1/index = (1.867e15)*keywords * (2.025e15)*root * (1.880e15)*title * (-2.487e12)*inbound_links * (-1.671e12)*outbound_links * (2.877e15)*description * ( (-1.786e14)*div + (7.990e14)*h1 + (3.239e14)*h2 + (1.692e14)*h3 + (-2.830e15)*h4 + (1.001e16)*h5 + (3.773e16)*h6 + (4.123e13)*p + (2.270e12)*span ) - 1.272e15    [3.14]
Utilization of the ROC function in R allows a ROC plot to be derived with the plot function;
the plot is shown in Fig. 3.4 below. Calculating the area under the curve, or the probability
of accuracy, yields a total of 0.5333. This analysis shows a dramatic improvement over the
linear regression performed earlier, 0.3718 points, and further enforces the argument for a
better predictive model for the Yahoo search engine. The full analysis performed herein may be
found in the GitHub repository under the directory /YAHOO/R in the file 'glm - commands -
Chapter 3.txt'.
As was the case with the BING formula, the YAHOO formula shows that there is a strong dynamic
not addressed by the attribute formulas defined in equations 3.12 and 3.14, furthering the
argument that a deeper discussion is needed. Section 4 of this body of work takes a deeper look
at one component of the equations defined in 3.9, 3.11, 3.12, and 3.14: the link equity
component, under various regression constraints.
Fig. 3.4 – Receiver Operating Characteristic (ROC) Curve – Yahoo Data
3.13 Data Collection Challenges
The original premise of this discourse laid the groundwork for the derivation of system
formulas for three search engines: GOOGLE, BING, and YAHOO. What was found during data
collection limited the ability to collect data for the GOOGLE search engine. GOOGLE does a
tremendous job of preventing data from being collected from its website: the provider tracks IP
addresses against query submittals, and once submissions reach some company-defined threshold
it simply blocks the data request. In an effort to overcome this hurdle, proxy software was
used to mask the data requests; the software packages used may be found at the following URLs:
• www.eliteproxyswitcher.com
• www.steganos.net
Even with the use of proxy software, a limit was reached in the data crawling effort; this is
why the GitHub repository contains modules for GOOGLE even though the engine is absent from the
analysis. While the code was written and part of the data collection was achieved, it was
simply not possible to extract the needed data from the GOOGLE search engine for the purposes
of this investigative study.
The second hurdle encountered in the data collection effort was that some of the APIs offered
by the search providers were rendered void during the writing of this body of work. This meant
that the original algorithms written for the data extract had to be retrofitted to make the
data calls directly against the search providers, forgoing an easier API implementation. The
BING search API was deprecated December 15, 2016 and the YAHOO search API was deprecated March
31, 2016.
The third challenge encountered was the nature of the results obtained from the search engine
providers. Some providers, such as GOOGLE, tailor the XML content to the specific browser type
being served, and when the algorithm in use is looking for specific markup definitions this
brings the complete data collection process to a halt.
3.14 The Systems Paradigm
This body of work brings forth the argument that a systems perspective is a necessity when
modeling a given domain; when the alternative is sought it becomes more difficult to find a
direct correlation. Had this paradigm incorporated some of the aspects investigated in the
research literature, the paradigm would have been skewed, and not for the better. Previous
research on the subject incorporated query logs, for example, to try to understand index value,
but that argument fails to respect the boundary constraint of the problem: the page. Query logs
are manifestations of user preferences and are not based on the fundamental attributes of the
base entity of study, the page. This study started from the systems perspective by placing the
boundary around the page and not, for example, around a proxy variable for true worth, as in
the case of the Big Mac index. If worth can be measured, then that measurement must start with,
and be restricted to, the system boundary constraint. Only by understanding the fundamental
attributes of the entity under study may one begin to understand how a delta in behavior can be
explained; that is what makes discovery possible, and it is the fundamental reason this
investigative study was a systems study.
3.15 Summary
Even with all the challenges encountered and the inconsistent results achieved between the
YAHOO and BING search engines, the experience was a positive one: the search results and the
paradigm brought forth have moved the argument forward, and for the first time a body of work
has created a predictive formula for the search engine paradigm.
In section four of this body of work the current state of affairs is examined from the
perspective of backlink quality, and an optimization of the current paradigm is proposed in
which the quality aspect is investigated further. Section 4 builds on the current body of work
in an effort to refine the formulas proposed, move the argument further along, and make the
case for the systems perspective in modeling.
3.16 Bibliography
Ageev, M., Guo, Q., Lagun, D., Agichtein, E. (2011). Find It If You Can: A Game for Modeling
Different Types of Web Search Success Using Interaction Data. Proceedings of the 34th
International ACM SIGIR Conference on Research and Development In Information Retrieval.
ACM
Berry, M., Browne, M. (2005). Understanding Search Engines: Mathematical Modeling and
Text Retrieval. SIAM
Carterette, B., Jones, R. (2008). Evaluating Search Engines by Modeling the Relationship
Between Relevance and Clicks. Advances in Neural Information Processing Systems
Dahiwale, P., Raghuwanshi, M., Malik, L. (2014). PDD Crawler: A Focused Web Crawler Using
Link and Content Analysis for Relevance Prediction. SEAS-2014, Dubai, UAE, International
Conference.
Google. Search Engine Optimization Starter Guide. Retrieved July 15, 2015, from
http://static.googleusercontent.com/media/www.google.com/en/us/webmasters/docs/search-
engine-optimization-starter-guide.pdf
Hassan, A. (2012). A Semi-Supervised Approach to Modeling Web Search Satisfaction.
Proceedings of the 35th International ACM SIGIR Conference on Research and Development in
Information Retrieval. ACM
Henzinger, M. (2007). Combinatorial Algorithms for Web Search Engines: Three Success
Stories. SODA '07: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete
Algorithms. 1022-1026
Huang, P., He, X., Gao, J., Deng, L., Acero, A., Heck, L. (2013). Learning Deep Structured
Semantic Models for Web Search Using Clickthrough Data. Proceedings of the 22nd ACM
International Conference on Information & Knowledge Management. ACM
Jansen, B., Spink, A. (2006). How Are We Searching the World Wide Web? A Comparison of
Nine Search Engine Transaction Logs. Information Processing and Management. Vol 42, 248-
263
Jerkovic, J. (2010). SEO Warrior. Sebastopol, CA. O’Reilly Media Inc.
King, A. (2008). Website Optimization. Sebastopol, CA. O’Reilly Media Inc.
Mukhopadhyay, D., Biswas, P., Kim, Y. (2006). A Syntactic Classification Based Web Page
Ranking Algorithm. Retrieved May 2015 from
http://arxiv.org/ftp/arxiv/papers/1102/1102.0694.pdf
Pal, A., Tomar, D., Shrivastava, S. (2009). Effective Focused Crawling Based on Content and
Link Structure Analysis. International Journal of Computer Science and Information Security
(IJCSIS). Vol. 2, No. 1
Radinsky, K., Svore, K., Dumais, S., Teevan, J., Bocharov, A., Horvitz, E. (2012). Modeling and
Predicting Behavioral Dynamics on the Web. Proceedings of the 21st International Conference
on World Wide Web. ACM
Sedigh, A., Roudaki, M. (2003). Identification of the Dynamics of the Google's Ranking
Algorithm. 13th IFAC Symposium on System Identification
Smyth, B., Balfe, E., Freyne, J., Briggs, P., Coyle, M., Boydell, O. (2004). Exploiting Query
Repetition and Regularity in an Adaptive Community-Based Web Search Engine. User
Modelling and User-Adaptive Interaction. Vol 14, 383-423
Spirin, N., Han, J. Survey on Web Spam Detection: Principles and Algorithms. Retrieved May
2015 from http://www.kdd.org/sites/default/files/issues/13-2-2011-12/V13-02-08-Spirin.pdf
Sun, H., Wei, Y. (2005). A Note On The PageRank Algorithm. Elsevier. Vol. 179, Issue 2, 799-
806
Wang, H., Zhai, C., Liang, F., Dong, A., Chang, Y. (2014). User Modeling in Search Logs Via a
Nonparametric Bayesian Approach. Proceedings of the 7th ACM International Conference on
Web Search and Data Mining. ACM.
Yue, Z., Han, S., He, D. (2014). Modeling Search Processes Using Hidden States in
Collaborative Exploratory Web Search. Proceedings of the 17th ACM Conference on Computer
Supported Cooperative Work & Social Computing. ACM
CHAPTER IV
A PAGE INDEXING OPTIMIZATION PROPOSITION
This study examines an enhancement to the model proposed in section three of this body
of work. The link structure from source to sink is investigated to derive a weighting factor for the
system models for both Bing and Yahoo. This section builds upon the previous body of
knowledge to enhance the current framework. The network structure of link equity from source
to sink is evaluated to determine whether an optimization exists in the system models derived
previously, one which allows for a better solution to arise.
Current practice by the search engine providers, as claimed by King (2008), is to weigh a
source from an authority domain such as Wikipedia or an academic contributor more heavily
than a link from some other domain. The question arises as to whether this fact can be leveraged
by the existing paradigm to allow for a refinement in the models created to date. This
investigative study seeks to address the issue with current practice and set forth a search model
that takes the system attributes as the focal point of the discussion. The derivation of the new
approximation model is a further step in the discussion, as the argument here concerns the
enhancement or refinement of one of the system attributes used in the previous models: link
equity. This link equity component is studied from the perspective of its underlying components
to derive a new metric, one in which the link contribution of each source is weighed across all
contributors, changing the dynamic from a simple link equity contribution to a quality metric.
Once derived, this quality metric is used to change the models already defined for Bing and
Yahoo. The refinement is evaluated, as was done previously, through the statistical software
package R, allowing a direct comparison between past results and the new paradigm and
bringing the discussion to a termination point as to whether value can be added by examining
link structure in page indexing.
4.1 Introduction
Page indexing by the search engine providers contains a large system component in the
overall model: the network of link structures from source to sink. Each link between a
benefactor and beneficiary becomes a contributing factor in the overall page ranking algorithm.
Jerkovic (2010) makes the argument that the link structure plays an integral part in the search
index algorithm in determining overall worth. Up to now, however, this link structure has been
largely ignored and considered only on a sliding scale in terms of quantity. The question in this
investigative study, one that has thus far not been taken into account, is the degree to which this
measure of link structures needs to be quantified rather than serving as a simple linear
contributor in an overall equation.
The general thesis of this paper is that while network link structures should be a
contributing factor, that factor needs to be assessed at a deeper level than is currently done.
Furthermore, once this link structure is exposed to a systems perspective and evaluated for its
individual worth within a larger constraint, it creates a paradigm different from current practice.
This new paradigm is void of the simple constraint now seen and puts a new, enhanced metric in
place: a true weight to the link structure dynamic.
This paper defines a model to assess the worth of link components based upon the system
attributes contained in the linking document. From the summation of these sources an overall
weighted component may be derived to create an enhanced mathematical model that predicts
page indexing by way of link equity value as derived from system attributes. This body of work
relies on the work to date; it takes each of the end nodes identified through the data mining
algorithm and then exposes the linking documents. These linking documents are investigated for
their worth under each element identified as relevant in Table 3.3. The same data mining
algorithm is used to extract index values for each of the page attributes already identified as
relevant from the literature review. The algorithm used to evaluate the index elements may be
found in Appendix A. Once each of the link contributors is evaluated for worth given the
indexing elements, they are used to create an overall qualifier to gauge link worth. While this
argument parallels the argument posed previously, what is accomplished through this body of
work is the creation of an enhanced model, a refinement of the previous proposal that, given the
literature review, falls into greater focus for at least one indexing paradigm: the Google search
engine and the PageRank algorithm. The HITS algorithm developed by Jon Kleinberg rests on
the premise that good sinks are pointed to by good sources, an argument that feeds directly into
the context of this body of work. The contributors to link quality need to be assessed by merit
and not simply
allowed to affect the paradigm merely through their existence. To argue against this dynamic is to
open the door for link farms and the potential skewing of system homeostasis. Another dynamic
broached previously was the PageRank algorithm, developed by Brin and Page, the founders of
Google. PageRank for a given sink can be defined as the sum of the ranks of the sources that
point to it, each divided by that source's number of outbound links. The link structure plays an
important role in the classification of the big data landscape that is the World Wide Web, and to
develop a model void of a quality metric representative of the link structure strips the underlying
argument of its worth. In this section of the body of work an effort is made to enhance the
previous work to account for link quality in the overall model and thus improve the argument
posed; at least, this is the aspiration of the work herein.
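The PageRank recurrence described above can be illustrated with a short power-iteration sketch. The toy graph, damping factor, and iteration count below are illustrative assumptions only; they are not values used by any search engine provider or by this dissertation's models.

```python
# Minimal sketch of the PageRank idea: each source splits its rank evenly
# across its outbound links, and each sink accumulates those shares.
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to (source -> sinks)."""
    pages = set(links)
    for sinks in links.values():
        pages.update(sinks)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # every page receives a small baseline plus damped shares from sources
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for source, sinks in links.items():
            if sinks:
                share = rank[source] / len(sinks)  # rank split over outbound links
                for sink in sinks:
                    new_rank[sink] += damping * share
        rank = new_rank
    return rank

# Hypothetical three-page web: A links to B and C, B to C, C back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

Note that page C, which is pointed to by both A and B, ends with a higher rank than B, which receives only half of A's vote; this is the merit-by-inbound-links dynamic the chapter builds on.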
4.2 Big Data
Big data is a term used to classify a large amount of bytes; the term is credited to Roger
Magoulas of O'Reilly Media according to Ularu, Puican, Apostu, and Velicanu (2012). Magoulas
defined the term to encompass data that is too complex to manage through normal data
management techniques. This definition is time specific by construction, as the tools available to
manage data change with time. A 1 GB data set would have been completely unmanageable in a
16 MB RAM environment of the early 1990s, but today databases a hundredfold larger than that
baseline are commonplace in technology departments and by no means classifiable as 'BIG
DATA' under current conditions, at least with regard to the volume metric. While the definition
provided by Magoulas is a classifier that allows for qualification over some domain, a definition
must be fixed and constant over time, such as π. If definitions are not constant, new jargon must
be learned continuously, which is counterproductive to the academic effort. While the
classification of Magoulas is a good step forward, what the domain needs is a time-independent
classification; what is missing is the symbolic π. Big data is characterized by four distinguishing
attributes, Volume, Velocity, Variety, and Veracity, according to Ularu et al. (2012). These
attributes are defined as follows.
• Volume – The quantity of data
• Velocity – Time in which data can be processed
• Variety – Diversity of the data encompassed
• Veracity – The degree of trust in the data
In the definition of the attributes of big data there are two components that are variable with
time, volume and velocity, and it is with these two attributes that the definition of big data
becomes malleable over time. Advances in CPU power and computational volume via extended
RAM mean that the quantity of data and the processing of said data are a moving target. The
attribute of veracity is purely subjective and consequently, by definition, not a component upon
which an academic investigation may unfold. It is my contention that what makes big data 'BIG'
is variety. The determination of a computational formula, which after all is what is always sought
in decision science, is the specific determination of those attributes of some system that may be
combined under some constraint such that they embody the dependent variable of a formula.
This brings the discussion to modes of variety; data could be presented to an end user for
analysis in a variety of modes: structured, semi-structured, or randomized, chaotic in appearance.
Ularu et al. (2012) point to the statistic that every day 2.5 quintillion bytes of data are
created, which leads to the realization that 90% of the data surrounding us has been created in
the past two years! A discussion point that has been lacking up to now, and a contention point
never identified in the research literature, is why there is no discussion surrounding algorithms
for big data model derivations. Given F = ma as law, this law holds true for one data point
{m = 8 kg, a = 25 m/s²} or for a series of data points [{m = 1 kg, a = 1 m/s²}, {m = 6 kg,
a = 2 m/s²}, {m = 9 kg, a = 65 m/s²}]. The argument for the truth in the data is being lost,
replaced with the notion that under the big data paradigm derivation must be easier if there is
simply more data. Taylor, Schroeder, and Meyer (2014) bring this reality to the forefront when
they quote Professor David Hendry as stating during an interview in 2013:
"…whether the dataset's big or small doesn't actually matter in establishing
change, but if it's big and the system is complex the only way to establish change is to model that
complexity"
Provost and Fawcett (2013) note a specific problem when dealing with large data sets: the
problem of overfitting the data to the problem frame. The problem with
overfitting, the authors argue, is that while the truism holds for the data set, the derived solution
does not fit the general problem frame. Gandomi and Haider (2015) identify this problem of
overfitting under a different context and call it spurious correlation, which they define as
uncorrelated variables being falsely found to be correlated due to the massive size of the data set.
When looking at big data, then, it appears that once the jargon and novelty of the problem
domain are stripped away, what is left is a very similar problem: the correct determination of
those system attributes that affect change or, put differently, the definition of some function
dependent on a series of attributes or components that, when combined, model behavior.
f = U(x1, x2, …, xn) [4.1]
Something all scientists hold near and dear is taking a new problem and using existing
theory to explain the phenomenon. Once the jargon and noise around the fundamental definition
of big data are stripped away, what is left is variety! The volume and velocity components of big
data are resource centric, such that given time the obstacle is removed. It could even be argued
that outside of special cases this obstacle does not exist, as today an individual could take
hardware purchased at a local electronics store, link the machines into a network, and install a
solution such as Hadoop (https://hadoop.apache.org/) to leverage a distributed file system for
computational purposes. The big data component of veracity is purely subjective by definition
and as such has no place in the construction of a predictive model. What this ultimately leaves
the system with is one specific attribute that makes data big data: variety.
The variety component of data is complex and a fundamental reason why model
derivation is a challenge. Under this banner of variety there exist three distinct buckets of
categorization: structured data, semi-structured data, and randomized data. An example of
structured data is data contained within a database table. A database table has a specific schema
that defines its context: a series of columns with container types (string, integer, etc.), constraints
such as a string length of at most 12 characters, and referential integrity, such as column 'Id' of
the employee table being a foreign key into the payroll table's 'EmployeeId' column. Semi-
structured data is what this paper deals with: the content that is
presented to the web browser for interpretation and viewing by an audience. Semi-structured
data contains a flexible boundary constraint, HTML being an example. The HTML standard is
maintained by the W3C (http://www.w3.org/MarkUp/) and represents a series of containers for
displaying content on a web page. The domain is semi-structured because tags contain attributes
that may be defined by an encompassing tool such as AngularJS (https://angularjs.org/) or, for
that matter, by the individual user. The HTML domain is also considered semi-structured
because tag position is completely arbitrary outside of a few tags such as <html>. For
completeness, the definition of randomized data must be addressed: very simply, this is data with
no apparent pattern to it, such as an encrypted password.
Search engine indexing deals with big data due to the fundamental constant attribute of
big data: variety. It could well be argued that page content also changes over time; does the
attribute of variety then not fall under the same constraint as volume and velocity? While time
may yield a difference in the data landscape, it does not entail that difference as a constant or a
given; variety in the data pool simply means a delta between data points. Variety denotes a
difference, and as such the model must deal with this delta to predict a truism independent of
time, making variety the true constant of big data, unlike the other attributes used to categorize it.
This brings the discussion to a lapse in the dialog over big data. While the variety attribute is
specific, it is so at the macro level; at the micro level, variety in data has a connotation with
regard to the differing end points and their amalgamation into a paradigm to predict truth. This
detail is addressed in the next section of the body of work, where this underlying truth in big
data is defined.
4.3 System Variables
A system variable, simply put, is some metric that may be used to gauge a performance
indicator of a model under investigation. Big data systems become difficult to decipher because
the data stream points to a multitude of conclusion points, each of which, if not investigated from
a systems perspective, only clouds the picture and forces the researcher to view, at best, a
shadow of the model at hand. A far worse scenario, and a problem with which current society is
plagued, is the creation of indexes when no such system attributes support
the said system. Take, for example, the much heralded publication The Economist, whose
researchers have come up with a metric to measure inflation between isolated entities (countries)
called the 'Big Mac' index. The rationale of the index is as follows: if a product is the same in a
series of countries, then monetary inflation in those countries may be gauged by comparing the
price of the Big Mac in each country. Problem number one with this rationale is that the product
is not the same in all countries. In India the patty is not beef; does the price differential represent
a difference in the supply chain, and perhaps even legislative factors, or is it simply inflation, as
is the current argument? While the component inflation (unit of measure: dollars) may be similar
in magnitude to the price of Big Macs, it cannot necessarily be tied conclusively to inflation and
proven mathematically, which of course can be the only judge of scientific discourse.
What is missing from the discourse is a fundamental artifact in analyzing big data and
systems: the rules of governance over the factors that dominate the discussion, or the formula if
you will. In order to predict the performance of some metric over some domain, the metric must
come from, and be measurable in, the system itself. The engineering domain takes this artifact as
a rule that is never broken; take for example calculating stress on a beam, calculating fluid flow,
or the turbulence a plane exhibits in flight. Yet some of these variables, such as force, are not
simply first order measurements taken directly from the system, but rather amalgamations;
velocity, for instance, is a unit of distance over time. This leads the discussion to an all too
familiar point: the discussion has come full circle, back to the initial conditions of system
variable constraints. It has become too easy to steer an argument away from the fundamental
metrics, toward the Big Mac index if you will, and lead the argument astray. This body of work
aims at steering the argument back on course, relying on the directly measurable attributes of the
system to derive truth, void of external factors that merely represent proxy variables to the
datum.
4.4 Literature Review
The work of Jerkovic (2010), SEO Warrior, defines a series of attributes that determine
page worth to the major search engine providers. In total, Jerkovic defines nine attributes of a
page that help define its own worth or help in defining the worth of linked pages. The nine
attributes that Jerkovic (2010) defines as aiding in the total page ranking algorithm by the major
search engines are as follows:
1. Title Tag
2. Page Copy
3. Document URL
4. Meta Tag
5. Keyword Proximity
6. Search Term Prominence
7. Anchor Text
8. Domain Registration Length
9. Quality & Quantity of Referral Links
Page content is defined through a variant of the Extensible Markup Language (XML)
called HTML, or Hypertext Markup Language. The HTML definition has a series of attributes
that serve as the protocol for content placement, the first of which is the title tag <title></title>.
The title tag is indexed by the search engines and displayed on search engine result pages for the
reader to view, the goal being that the user will select relevant links by reading the content
displayed. The search 'San Ramon CA' was performed on Google.com on November 17, 2015 at
8:42 PM and yielded a series of results, the first of which (non-advertisement entry) was as
follows:
Welcome to the City of San Ramon
www.ci.san-ramon.ca.us/
San Ramon Public Engagement Join Instagram Public Engagement Follow us Subscribe to
Email Notification Public Engagement. 2226 Camino Ramon San Ramon, CA ...
The result above displays an entry on its first line, 'Welcome to the City of San Ramon'.
Inspection of the contents of the link www.ci.san-ramon.ca.us identifies the following title tag:

<title>Welcome to the City of San Ramon</title>

The title tag was used by the search engine provider to help the user determine relevant content.
It should be pointed out to the reader that when the title tag does not provide value, the search
engine provider Google.com replaces the entry on the first line with its own derived
relevant content. Bing and Yahoo display a different entry for the user, with one subtle
difference appearing in the first line of the output below.

San Ramon, California - Official Site
www.ci.san-ramon.ca.us
Official site CONNECT WITH US. 2226 Camino Ramon San Ramon, CA 94583 Phone: (925)
973-2500 Fax: (925) 866-1436 Monday to Friday except Holidays 8:30 am - 5:00 pm
The first line in the above entry is not the content of the title tag but is derived from a different
source. Could this source be the content URL or some other page attribute that signifies the root
source of the content? Inspection of the second result in the data set shows that it is the
Wikipedia entry for all three search engine providers: Bing, Google, and Yahoo. In this case the
title tag does display on the search results page for all three providers. In summary, in some
instances Google will replace the content of the first entry in the results page with its own text,
text for which keyword prominence is more poignant. In the case of Bing and Yahoo, the
providers display additional text on the first line to help the user narrow down their search; this
text is not derived from the page copy but is determined from some other source, as was verified
by inspection of the content.
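The title-tag extraction described above can be sketched with Python's standard-library HTML parser. This is an illustrative sketch only, not the data mining algorithm of Appendix A; the sample markup echoes the San Ramon example.

```python
# Sketch: pull the <title> content that a search engine would index.
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Accumulates the text found between <title> and </title>."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

page = "<html><head><title>Welcome to the City of San Ramon</title></head></html>"
parser = TitleExtractor()
parser.feed(page)
# parser.title now holds the indexed title text
```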
Page copy is the second attribute that Jerkovic (2010) identifies as having value to the
search engine providers. Page copy is displayed by way of the header tags <hX></hX>,
paragraph tags <p></p>, and, more recently, division tags <div></div> or span tags
<span></span>. In the case of header tags, the 'X' component may take a value between one
and six, inclusive; the higher the value of 'X', the smaller the content displayed. Jerkovic (2010)
and King (2008) both identify page copy as having significance to the search engines and even
identify its subject as making a large contribution to the overall indexing efforts of the search
engine providers. Inspection of the above examples shows that all three providers bold the query
term and/or the query words. A query term is a series of characters separated by spaces, and a
query word is an individual word in a query term. In the example given, all three search engine
providers make a concerted effort to highlight the relevant keywords searched upon by the user
in the results page: a nudge in the right direction, if you will.
The document URL is the third component that Jerkovic identifies as adding value to the
search index built by the search engine providers. In the search examples given above, each of
the search engine providers displayed the document URL for the reader and went as far as to
highlight the query terms found in the URL to show a degree of relevance. This highlighting
points to a disclosure of the paradigm and makes the argument of both King (2008) and Jerkovic
(2010) more apparent: keywords in the URL play an important role in determining worth to the
search engine providers.
The fourth attribute that Jerkovic (2010) points to as adding value to the search engine
providers is the Meta tag. Meta tags represent pseudo content of the displayed text, yet hold
value to the indexing process. Meta tags are found in the head section of a document, provide
information for the search engines, and have the following definition:

<meta name="" content="">

The name attribute of the tag may have three values: 'description', 'keywords', or 'author'. A
description value identifies the content as a short description of the document, which should be
less than 250 words as suggested by King (2008). The keywords value identifies the keywords to
be associated with the document; King (2008) advises this to be no more than 20 words in
length.
The fifth component that Jerkovic (2010) identifies as having value to the search engines
is keyword proximity. Keyword proximity refers to the physical distance between query words.
Jerkovic (2010) makes the argument that placing content on a document so that query terms
align yields better results in a search engine optimization effort. If an optimization is desired on
the query 'Mountain House California', then the following two versions of page copy would be
treated differently.

<p>Mountain House is located in Northeastern California</p>

and

<p>Mountain House, California is located in the Northeastern part of the state</p>
The second page copy would be superior to the first given the distance between the city name
and the state name: a delta of 28 − 2 = 26 characters in favor of the second page copy.
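The character-distance arithmetic behind that delta can be sketched directly. The measure below (characters between the end of one query word and the start of the next) is an illustrative assumption consistent with the 28 and 2 figures above; the providers' actual proximity scoring is not disclosed.

```python
# Sketch of the keyword-proximity measure: characters separating the end
# of one query word from the start of the next in the page copy.
def proximity(copy, word_a, word_b):
    """Return the character gap between word_a and word_b, or None if absent."""
    start_a = copy.find(word_a)
    if start_a == -1:
        return None
    end_a = start_a + len(word_a)
    start_b = copy.find(word_b, end_a)
    if start_b == -1:
        return None
    return start_b - end_a

copy_1 = "Mountain House is located in Northeastern California"
copy_2 = "Mountain House, California is located in the Northeastern part of the state"
d1 = proximity(copy_1, "House", "California")  # the 28-character gap
d2 = proximity(copy_2, "House", "California")  # the 2-character gap
```

The delta d1 − d2 reproduces the 26-character advantage of the second page copy.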
The sixth component that Jerkovic (2010) identifies as having value to the search engine
providers is the prominence of the keywords on the physical page. Keyword prominence refers
to the physical location of the query words or term relative to the top of the document. The
general idea, according to Jerkovic, is that the higher the placement on the document, the more
important the query term.
The seventh component that Jerkovic (2010) identifies as having significance to the
search engines is keywords in anchor text. The anchor tag allows for the linking of content
between documents. In the examples given above, the reader will notice that the query terms
identified in the URL were emphasized in the search results. King makes the argument that
shorter URLs are preferred to longer URLs by the search engine providers, once again an
implicit indication of the algorithm. Could it be that the search engine providers utilize an index
structure keyed on keywords in some phrase and thus create an algorithm on this attribute to
assess value?
The eighth attribute that Jerkovic (2010) identifies as having value to the search engine
providers is the length of time the domain has been in existence, the argument being that the age
of content has a bearing on the worth of the indexing process. In its optimization guide, Google
does not bring this attribute forth as having value to the indexing process, but it is nevertheless a
point that may be worth investigating further.
The final and ninth component that Jerkovic (2010) identifies as having value to the
search engines is the quality and quantity of inbound links. King makes the argument that each
web page has a voting share and that a link is essentially a vote cast from source to sink. The
PageRank algorithm by Google utilizes the link structure as a weight to measure value; Sun and
Wei (2005) define PageRank as the importance of a page on the internet. It is this final
component of the system attribute mapping structure that needs to be investigated further, and it
is the focus of this paper. While Jerkovic makes the argument that it is the quantity and quality of
the inbound links that help determine worth, what is not considered is the systems perspective of
the source. The search engine providers do deem inbound links from
.org or .edu domains as having a special, elevated contribution factor, but this fails to address the
complex dynamic of link structures. For example, linking for the sake of linking can add value,
as identified by King (2008), but should it? The argument of this paper is that link structures
need to be evaluated from a systems perspective and not viewed as a binary measure, even in the
case of the special domains.
Information retrieval is the process of extracting organized data from some source. While
data represents unorganized text, information represents actionable, organized text that may be
disseminated and acted upon. In the realm of information retrieval the concepts of recall and
precision are of paramount interest. Recall, according to Lee (2000), refers to the percentage of
the relevant documents in a repository that are retrieved. Precision, on the other hand, refers to
the percentage of retrieved documents that are relevant. Mehlitz et al. (2007) take the definition a
step further and provide equations for the two dependent variables, as follows.
P = Gr / r [4.2]

and

R = Gr / g [4.3]

where
P = Precision
R = Recall
Gr = Relevant Documents Among Returned Documents
r = Documents Retrieved
g = Relevant Documents in a Collection Set

Mehlitz et al. go a step further from the book definitions of recall and precision and define an
effectiveness measurement for the information retrieval system. This effectiveness measurement
is defined as follows.
E = 1 – [(β² + 1) * P * R] / [β² * P + R] [4.4]

where
E = Degree of Effectiveness
β = Relative Importance between Precision and Recall

Given a document retrieval system such as a search engine, the total effectiveness of the
search paradigm can be explicitly evaluated to a tangible quantifiable metric. The optimization
model proposed in this paper specifically addresses the precision component of the equation.
What is not being contended is the recall: the set of documents retrieved is fixed and definable at
some specific point in time, t. The next section of the discourse addresses the proposed
optimization mechanism and the analysis for its evaluation.
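Equations 4.2 through 4.4 map directly onto code. The short sketch below (the function names and sample counts are illustrative, not taken from the study data) computes precision, recall, and the effectiveness measure for a hypothetical result set:

```python
def precision(relevant_returned, retrieved):
    # P = Gr / r (equation 4.2): relevant documents among those returned.
    return relevant_returned / retrieved

def recall(relevant_returned, relevant_in_collection):
    # R = Gr / g (equation 4.3): relevant documents found out of all that exist.
    return relevant_returned / relevant_in_collection

def effectiveness(p, r, beta=1.0):
    # E = 1 - [(beta^2 + 1) * P * R] / [beta^2 * P + R] (equation 4.4).
    return 1.0 - ((beta ** 2 + 1) * p * r) / (beta ** 2 * p + r)

# A query returns 5 documents, 3 of them relevant, from a collection
# holding 6 relevant documents in total.
p = precision(3, 5)      # 0.6
r = recall(3, 6)         # 0.5
e = effectiveness(p, r)  # beta = 1 weighs precision and recall equally
```

With β = 1, E reduces to one minus the familiar F-measure, so lower E indicates a more effective retrieval system.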
4.5 Theoretical Formulation
While the page attributes defined by researchers are finite, they represent a complex
dynamic, one that has been in constant flux since its inception. The search engine paradigm is
much sought after, as its full disclosure would signify a windfall for online retailers. The
collective of the attributes identified through the research literature creates a system definition
that may be modelled as follows.
Let:
A = Anchor Text
C = Page Copy
D = Domain Registration Length
I = Search Index
M = Meta Tags
P = Keyword Proximity
Q = Quality & Quantity of Referrals
S = Search Term Prominence
T = Title Tag
U = Document URL

Where:
I = f(A, C, D, M, P, Q, S, T, U)  [4.5]
The optimization mechanism proposed addresses the Q component of the relationship given
above. King makes the argument that while referrals from pages with a higher equity rank are
more desirable, the link structures from low-ranking link equity sources still contribute to overall
rank. This fact completely exposes the search algorithm to a bias towards link farms or search
engine optimizers that understand how to create link spam. The environment that currently exists
must be adjusted for in order to keep the system in homeostasis and remove this bias.
In section three of this body of work an approximation model was created for the Bing
and Yahoo search engines. This paradigm took into account the referrers to the sink as a volume
metric, but the model did not address the quality of the referrers. It is this quality metric that is
addressed here. In an effort to define this quality metric, the same system variables that were
investigated earlier will be indexed across the group of nodes that point to some sink. It is this
collective under study that will be scrutinized to create a weighted quality metric component that
will be incorporated into the same equations derived previously for both the Bing and Yahoo
search engines to create a possible refinement of the model. What is being proposed here is to
actually define the β component of equation 4.4, quantifying the relative importance between
precision and recall. To calculate the source index, or β, the linking pages will be evaluated to a
corresponding index defined as follows.
Let:
c = Copy Index = [∑ Query Terms Found in Copy] / [∑ Words in Copy]
d = Description Index = [∑ Query Terms in Description] / [∑ Words in Description]
k = Keyword Index = [∑ Query Terms in Keywords] / [∑ Words in Keywords]
t = Title Index = [∑ Query Terms in Title] / [∑ Words in Title]
u = URL Index = [∑ Query Terms Found in URL] / [∑ Words in URL]
W = Weighted Index

Where:
Wi = ∑ci + ∑di + ∑ki + ∑ti + ∑ui  [4.6]

The evaluation of equation 4.6 for each of the pages linking to the sink provides a total index W;
the average is then taken to derive the mean index for the sink, defined as follows.

Let:
n = Number of Sources
SST = Source to Sink Total

Where:
SST = [∑c + ∑d + ∑k + ∑t + ∑u] / n  [4.7]

In the previous section of this dissertation an approximation formula was defined for each
of the search engine providers. The proposal made here is an adjustment to the general formula
defined in equation 3.2 of the previous section and reproduced below.
Let:
S = Search Engine Index
Bn = Slope of Component n
D = Meta Tag Description Index → Keywords / Total Words in Description
I = Inbound Links → External Links to Page
K = Meta Tag Keywords Index → Keywords / Total Words in Keywords
O = Outbound Links → Links to External Pages
P = Page Copy Index → Keywords in Copy / Total Words in Copy
T = Title Index → Keywords in Title / Total Words in Title
U = URL Index → Keywords in URL / Total Words in URL

Then:
S = B1D * B2I * B3K * B4O * B5P * B6T * B7U + µ  [3.2]
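Read numerically, equation 3.2 is a product of slope-weighted attribute terms plus an error term µ. The sketch below evaluates such a product; the coefficient and attribute values are placeholders for illustration, not the fitted slopes, and only three of the seven components are shown for brevity:

```python
from math import prod

def approximation_index(slopes, attributes, mu):
    # S = B1*D * B2*I * ... + mu: the product of slope-weighted
    # attribute terms plus an error term (the form of equation 3.2).
    return prod(b * a for b, a in zip(slopes, attributes)) + mu

# Placeholder values only, not the fitted slopes.
slopes = [0.5, 2.0, 1.0]        # B1, B2, B3
attributes = [0.2, 0.1, 0.4]    # e.g. the D, I, K indexes for one page
s = approximation_index(slopes, attributes, mu=0.01)  # (0.1 * 0.2 * 0.4) + 0.01
```

Because the terms multiply rather than add, a single near-zero attribute drives the whole index toward the error term, which is one reason the fitted coefficients in later sections vary so widely in magnitude.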
The evaluation of the approximation formulas derived in the previous section was void
of a quality metric with regards to inbound links (I). Inbound links were deemed to be relevant if
they existed, but they were not scrutinized for value. This value metric is of great importance, as
it filters out a bias such as may be found with links from link farms or link spam. The utilization
of a quality metric as applied to equation 3.2 leads to a new paradigm as given below.

S = B1D * B2I * B8SST * B3K * B4O * B5P * B6T * B7U + µ  [4.8]

Equation 4.8 carries a simple but fundamental change: the quality evaluation of the
link equity from a series of links or sources. It is this paradigm that is investigated further in this
section to seek an optimization of the previously defined formula (3.2) and to derive an
optimized approximation to the search algorithms of the two search engine providers.
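The per-source indexes of equation 4.6 and the source-to-sink aggregate of equation 4.7 can be sketched as follows. This is a simplified illustration: the field names and exact word matching are assumptions of the sketch, whereas the study's actual pipeline resolved lexical context through the Natural Language Toolkit.

```python
def field_index(query_terms, field_words):
    # Fraction of the field's words that are query terms, e.g. the
    # Copy Index c = [query terms found in copy] / [words in copy].
    if not field_words:
        return 0.0
    terms = {t.lower() for t in query_terms}
    return sum(1 for w in field_words if w.lower() in terms) / len(field_words)

def weighted_index(query_terms, page):
    # W = c + d + k + t + u over the page's five fields (equation 4.6).
    fields = ("copy", "description", "keywords", "title", "url")
    return sum(field_index(query_terms, page[f]) for f in fields)

def source_to_sink_total(query_terms, linking_pages):
    # SST = mean weighted index over the n sources linking to a sink (equation 4.7).
    return sum(weighted_index(query_terms, p) for p in linking_pages) / len(linking_pages)

# A single hypothetical linking page, already tokenized into word lists.
page = {
    "copy": ["search", "engine", "index"],
    "description": ["search"],
    "keywords": ["engine"],
    "title": ["search", "engine"],
    "url": ["example"],
}
w = weighted_index(["search", "engine"], page)    # 2/3 + 1 + 1 + 1 + 0
sst = source_to_sink_total(["search", "engine"], [page, page])
```

A page stuffed with query terms across every field maximizes W, which is precisely why the aggregate SST is averaged over all sources rather than taken from any single referrer.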
4.6 Underpinnings
The algorithms depicted in Appendix A were used to create the data input files that were
later fed into the statistical modeling utility R (https://www.r-project.org/). The algorithms
utilized were created using the Python programming language (https://www.python.org/) version
3.4.3. The version of R that was used was 3.3.1. The complete source code along with the data
files may be obtained through a GitHub repository at the following URL
https://github.com/guillermorodriguez/Dissertation.
As in the previous section, the Natural Language Toolkit for Python was used to search
for the lexical context of a given query term. The query terms were simply carried over from the
searches performed in the previous section. The GitHub repository contains the needed information to
install the Natural Language Toolkit for Python and these instructions may be found in the
Python file named wordnet_install.py.
4.7 Bing Formula
The Bing data file _historical_complete_r.dat contains the quantified quality attribute that
is summarized in Table 4.1 and given below. The quality metric table shows the mean, maximum,
and minimum values for the quality attribute that was used in equation 4.9. The data was
extracted through the use of the algorithm pySummary.py as given in Appendix A.

The data needed to perform the calculations was fed into the R statistical package through
the use of the table read function as defined in equation 3.5. The data file that was used may be
found in the directory src/BING/data/_historical_complete_r.dat. The directory noted may be
found in the GitHub repository for downloading; an extract of the top five lines is included in
Appendix D. The Bing formula derived in section three is given below.
1/index^15 = 0.001141*I / ( -0.002473*U * -0.000009302*O * -0.001788*T * -0.02655*D * -0.03175*K * ( -0.0005829*DIV - 0.003683*H1 + 0.003760*H2 - 0.0008250*H3 + 0.003665*H4 + 7.854*H5 + 0.004228*H6 - 0.001312*P + 0.00004138*Span ) ) - 0.002291  [3.9]

The equation given above is used in this body of work as the basis for extension
to incorporate the quality metric identified as showing significance for further study. Using the
_historical_complete_r.dat file as input into the R modeling software and providing as input the
modeling equation 4.9 provides further insight into the true system homeostasis for the Bing
search engine.

1/index^15 ~ inbound_links*quality/(root*outbound_links*title*description*keywords*(div+h1+h2+h3+h4+h5+h6+p+span))  [4.9]

Minimum Maximum Average
0.0     9146.0  73.767

Table 4.1 – Bing Quality Metric
Execution of the summary function in R over the model definition of equation 4.9 leads to
summary statistics of an Adjusted R-Squared value of 0.102, a Multiple R-Squared value of
0.1962, a Residual Standard Error of 0.06385, and a p-value of 2.2*10^-16. The results are
significant, as the system has conveyed a shift in homeostasis to the negative. The conclusion
here is that the Bing search engine algorithm, given the modeling paradigm used, probably does
not incorporate the quality metric of link referrers. While the true nature of the complete
modeling paradigm incorporated by the Bing search engine is a mystery, what can be claimed
given the research presented here is that the addition of the quality metric has shifted the model
towards chaos and away from order. The best model available for the Bing search engine to date
is void of a quality metric and is centralized around page attributes only, along with the quantity
of page referrers.
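The gap between the Multiple R-Squared (0.1962) and Adjusted R-Squared (0.102) values reported above reflects the standard degrees-of-freedom penalty for model terms. A minimal sketch of that correction follows; the sample and predictor counts are illustrative placeholders, since the actual design dimensions are not restated here:

```python
def adjusted_r_squared(r_squared, n_samples, n_predictors):
    # R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1): the multiple
    # R-squared discounted for model terms relative to the sample size.
    return 1.0 - (1.0 - r_squared) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Illustrative counts only -- not the study's actual sample or predictor sizes.
adj = adjusted_r_squared(r_squared=0.1962, n_samples=200, n_predictors=20)
```

The wider the gap between the two values, the more of the raw fit is attributable to the sheer number of interaction terms rather than to genuine explanatory power.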
As an alternative analysis, a logistic regression was performed through the R statistical
software package. The regression model was defined as given below in equation 4.9.

lm.fit <- glm(1/index^15~inbound_links*quality/(root*outbound_links*title*description*keywords*(div+h1+h2+h3+h4+h5+h6+p+span)), family="binomial")  [4.9]

Taking equation 4.9 and submitting it to the R statistical package yields a regression
formula as defined in equation 4.10 and given below.

log(p/(1-p)) = 1/index^15 = (1.816e12)*inbound_links * (-2.183e11)*quality / ( (8.299e10)*root * (-6.009e7)*outbound_links * (-5.643e10)*title * (-8.159e11)*description * (4.966e12)*keywords * ( (1.036e11)*div + (-5.951e11)*h1 + (-8.875e11)*h2 + (6.512e9)*h3 + (2.781e12)*h4 + (2.281e14)*h5 + (1.132e12)*h6 + (-7.327e9)*p + (-1.695e10)*span ) ) - 4.052e15  [4.10]

Utilization of the ROC function in R allows for the derivation of a ROC plot through the
use of the plot function in R; this plot is shown in Fig. 4.1 below. Calculating the area under the
curve, or the probability of accuracy, yields a total of 0.375. This analysis shows a degradation
over the previous analysis, which was void of the quality component of the model. The
degradation in model accuracy amounts to 0.0492, or, stated differently, approximately a 5%
difference. In the case of Bing the approximation model and the optimization model are one and
the same. As far as the modeling paradigm here is concerned, Bing does not account for link
quality in its model derivation. The full analysis performed herein may be found in the GitHub
repository under the directory /BING/R and in the file ‘commands - Chapter 4 - glm.txt’.
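The area-under-the-curve figures compared above can be computed directly from scores and labels through the Mann-Whitney rank identity. The sketch below uses made-up labels and scores standing in for the model outputs; roc_auc is an illustrative helper, not the R function used in the study:

```python
def roc_auc(labels, scores):
    # Empirical AUC via the Mann-Whitney identity: the probability that a
    # randomly chosen positive case outscores a randomly chosen negative
    # case, counting ties as half a win.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and scores standing in for the model outputs:
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

An AUC below 0.5, such as the 0.375 reported here, indicates a ranking that performs worse than chance under this measure.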
4.8 Yahoo Formula
The Yahoo data file _historical_complete_r.dat contains the quantified quality attribute
that is summarized in Table 4.2 and given below. The quality metric table shows the mean,
maximum, and minimum values for the quality attribute that was used in equation 4.11. The data
was extracted through the use of the algorithm pySummary.py as given in Appendix A.

The data needed to perform the calculations was fed into the R statistical package
through the use of the table read function as defined in equation 3.5. The data file that was used
may be found in the directory src/YAHOO/data/_historical_complete_r.dat. The directory noted
may be found in the GitHub repository for downloading; an extract of the top five lines is
included in Appendix E. The Yahoo formula derived in section three is given below.

1/index = -0.05977*K * (-0.01003)*U * (-0.001494)*T * (-0.00000477)*I * (-0.000007963)*O * (-0.04570)*D * ( -0.01385*DIV + 0.002041*H1 + 0.05928*H2 + 0.0003611*H3 + 0.08388*H4 + 0.04422*H5 + 1.951*H6 - 0.01690*P + 0.003334*SPAN ) + 0.04643  [3.10]

Given the quality metric that is under investigation here, the equation given above needs
to be modified in order to account for the quality attribute of the model. The generalized
regression model that was used in R is given in equation 4.11 below.

Minimum Maximum Average
0.0     3165.5  71.832

Table 4.2 – Yahoo Quality Metric
Fig. 4.1 – Receiver Operating Characteristic (ROC) Curve – With Quality Component – Bing Data
1/index ~ keywords*root*title*(inbound_links*quality*outbound_links)*description*(div+h1+h2+h3+h4+h5+h6+p+span)  [4.11]

Execution of the summary function in R over the model definition given in the form of equation
4.11 leads to summary statistics of a Residual Standard Error of 0.09163, a Multiple R-Squared
value of 0.5702, an Adjusted R-Squared value of 0.2798, and a p-value of 2.2*10^-16. The
findings here are significant, as they show that in the case of the Yahoo search engine the quality
metric does help to improve the prediction of the search indexes. The derived R model for the
Yahoo classification index is given in the form of equation 4.12 below. As may be validated
through visual inspection, the newly derived formula is the previous formula as given in
equation 3.12 with one additional component, the quality metric shown as variable ‘Q’ in equation
4.12.

1/index = -5.39*K * 0.004452*U * 0.07541*T * 0.00003367*I * (-0.000009344)*Q * 0.00003294*O * 0.1243*D * ( 0.02064*DIV + 0.03605*H1 + 0.08839*H2 + (-0.08222)*H3 + (-7551)*H4 + 10130*H5 + 62230*H6 - 0.05492*P + (-0.006268)*SPAN ) + 0.04244  [4.12]

Equation 4.12 above provides a better indexing approximation, or an optimization if you
will, to the original modeling paradigm as provided in section three for the linear regression. The
argument may even be made that one of the reasons the indexing approximation derived in
section three was poor is that it was missing the quality component.

As an alternative analysis, as was done previously, a logistic regression was performed
for the Yahoo data using the R statistical package. The regression model was defined as given
below in equation 4.13.

lm.fit <- glm(1/index ~ keywords*root*title*(inbound_links*quality*outbound_links)*description*(div+h1+h2+h3+h4+h5+h6+p+span), family="binomial")  [4.13]
The R statistical package provides a summary function that allows for the derivation of
the component values. This tabulation allows for the derivation of equation 4.14, given below.

1/index = (-3.762e25)*K * (-7.973e14)*U * (6.461e15)*T * (2.028e12)*I * (4.150e12)*Q * (9.992e12)*O * (-1.662e16)*D * ( (1.892e15)*DIV + (2.101e15)*H1 + (-4.845e14)*H2 + (-3.738e15)*H3 + (-6.486e19)*H4 + (-8.662e19)*H5 + (-2.983e20)*H6 + (7.302e14)*P + (4.086e14)*SPAN ) - 2.641e15  [4.14]
Utilization of the ROC function in R allows for the derivation of a ROC plot through the
use of the plot function in R; this plot is shown in Fig. 4.2 below. Calculating the area under the
curve, or the probability of accuracy, yields a total of 0.7778. This analysis shows a refinement
over the previous analysis, which was void of the quality component of the model. The
refinement amounts to an improvement of 0.3536 points. This level of accuracy is significant, as
it not only puts the result over the 50% threshold but also makes a strong argument for the
modeling paradigm adopted in this body of work. The full analysis performed herein may be
found in the GitHub repository under the directory /YAHOO/R and in the file ‘glm - commands -
Chapter 4.txt’.
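The binomial glm used above models the response through the logit link. The sketch below shows the link and its inverse, which is how a fitted linear predictor is mapped back to a probability; the functions are generic, not tied to the study's fitted model:

```python
import math

def logit(p):
    # Link function of the binomial glm: the log-odds log(p / (1 - p)).
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    # Inverse link: map a linear predictor eta back to a probability.
    return 1.0 / (1.0 + math.exp(-eta))

# The two functions are inverses of one another:
p = 0.75
eta = logit(p)            # log(3), about 1.0986
restored = inv_logit(eta)
```

A linear predictor of zero corresponds to even odds (p = 0.5), so the sign of the fitted predictor indicates on which side of the classification threshold a page falls.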
4.9 Summary
The creation of an approximation to some datum is complex given the nature of the
current environment. There exists a plethora of semi-structured data around all of us that is
difficult to leverage, and to complicate matters further, the tools available to solve these
problems are still evolving. One of the questions that I believe I was able to answer in this body
of work is that, while the landscape may be complex and while the paradigm may appear distant
and fuzzy at best from the current vantage point, it is the systems perspective that can shed light
on the problem frame and at least help us understand the world a little better. At the least, it lets
us wrap our hands around the problem frame, begin to understand the dynamics at work, and in
some measure chip away at the stone for others to put the face to the bust.
Fig. 4.2 – Receiver Operating Characteristic (ROC) Curve – With Quality Component – Yahoo Data
4.10 Bibliography
Acceleration. (n.d.). Retrieved March 16, 2016, from http://www.merriam-
webster.com/dictionary/acceleration
Agarwal, R., Dhar, V. (2014). Big Data, Data Science, and Analytics: The Opportunity and
Challenge for IS Research. Institute of Operations Research and the Management Science. Vol.
25, No. 3, 443-448
Ageev, M., Guo, Q., Lagun, D., Agichtein, E. (2011). Find It If You Can: A Game for Modeling
Different Types of Web Search Success Using Interaction Data. Proceedings of the 34th
International ACM SIGIR Conference on Research and Development In Information Retrieval.
ACM
Gandomi, A., Haider, M. (2015). Beyond the Hype: Big Data Concepts, Methods, and Analytics.
International Journal of Information Management. Vol. 35, 137-144
Gwizdka, J., Chignell, M. (1999). Towards Information Retrieval Measures for Evaluation of
Web Search Engines. Retrieved May 2015 from
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.15.3212&rep=rep1&type=pdf
Hersch, W., Elliot, D., Hickam, D., Wolf, S., Molnar, A., Leichtenstien, C. (1995). Towards New
Measures of Information Retrieval Evaluation. SIGIR ’95 Proceedings of the 18th Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval.
164-170
Imafouo, A., Tannier, X. (2005). Retrieval Status Values in Information Retrieval Evaluation.
12th International Conference – String Processing and Information Retrieval
Jacobs, A. (2009). The Pathologies of Big Data. Communications of the ACM. Vol. 52, No. 9
Jerkovic, J. (2010). SEO Warrior. Sebastopol, CA. O’Reilly Media Inc.
Kandefer, M., Shapiro, S. (2008). An F-Measure for Context Based Information Retrieval.
Retrieved October 2015 from
http://commonsensereasoning.org/2009/papers/commonsense2009paper13.pdf
King, A. (2008). Website Optimization. Sebastopol, CA. O’Reilly Media Inc.
Laurila, J., Perez, D., Aad, I., Blom, J., Bornet, O., Do, T., Dousse, O., Eberle, J., Miettinen, M.
(2012). The Mobile Data Challenge: Big Data for Mobile Computing Research. Pervasive
Computing, Newcastle.
Lavalle, S., Lesser, E., Shockley, R., Hopkins, M., Kruschwitz, N. (2011). Big Data, Analytics
and the Path from Insights to Value. MIT Sloan Review. Vol. 52, No. 2
Lazer, D., Kennedy, R., King, G., Vespignani, A. (2014). The Parable of Google Flu: Traps in
Big Data Analysis. Science. Vol. 343
Losee, R. (2000). When Information Retrieval Measures Agree About the Relative Quality of
Document Rankings. Journal of the American Society for Information Science. Vol. 51, 834-840
Mehlitz, M., Kunegis, J., Bauckhage, C., Albayrak, S. (2007). A New Evaluation Measure for
Information Retrieval Systems. IEEE International Conference on Systems, Man and
Cybernetics.
Provost, F., Fawcett, T. (2013). Data Science and Its Relationship to Big Data and Data-Driven
Decision Making. Mary Ann Liebert, Inc. Vol. 1, No. 1
Snijders, C., Matzat, U., Reips, U. (2012). “Big Data”: Big Gaps of Knowledge in the Field of
Internet Science. International Journal of Internet Science. Vol. 7, 1-5
Sun, H., Wei, Y. (2005). A Note On The PageRank Algorithm. Elsevier. Vol. 179, Issue 2, 799-
806
Taylor, L., Schroeder, R., Meyer, E. (2014). Emerging Practices And Perspectives on Big Data
Analysis in Economics: Bigger and Better or More of the Same. Big Data and Society. July-
December 2014, 1-10
Ularu, E., Puican, F., Apostu, A., Velicanu, M. (2012). Perspectives on Big Data and Big Data
Analytics. Database Systems Journal. Vol. III, No. 4
Variable. (n.d.). Retrieved March 16, 2016, from http://www.merriam-
webster.com/dictionary/variable
Zhou, B., Yao, Y. (2010). Evaluating Information Retrieval System Performance Based on User
Preference. Journal of Intelligent Information Systems. Vol. 34, Issue 3, 227-248
CHAPTER V
CONCLUSION
The research questions at the onset of this body of work were defined as follows.
1. What are the system attributes for each of the search engine providers studied – Bing and Yahoo?
2. Can the system attributes be combined into a regression model to predict search results?
3. Can the big data paradigm be investigated from a systems perspective to help define system homeostasis?
4. What is the optimized classification formula that may be derived using systems theory?
Each of the research questions was answered in this collective body of work. It was discovered
that while the two major search engines had similarities in their derived approximation
formulas, they did differ. The identified system attributes proved to be as defined in Table 5.1
below.

Table 5.1 – System Attributes Summary
URL, Description, Copy, Inbound Links, Keywords, Outbound Links, Title

The second research question that was answered was whether these base attributes could
be combined into a predictive model; once again the answer was yes, to an accuracy of 77.78%
for Yahoo and 42.42% for Bing. The third question answered in this body of work was whether
the big data paradigm could be studied from a systems perspective to help determine
homeostasis. This question was answered positively because of the modeling methodology that
was followed. First the system was set to a boundary constraint, i.e. the page, and then the page
elements were derived to create a finite set of attributes over the problem frame. So while the
World Wide Web may contain a plethora of nodes, these nodes have order, since they are
indexed by the providers, and as such this order must stem from some base properties. It was this
assumption that was followed and for which positive results were obtained.
The fourth research question posed was to determine the optimum derivable
classification formula. In the case of Bing this formula was determined to be equation 3.11, and
in the case of Yahoo this formula was determined to be equation 4.14.

This body of work has been a first step in the process of understanding search engine
formulas from the systems perspective and as such has laid a fundamental stepping stone for
what may come. One of the attributes that was not investigated in this body of work was the
location element. If the current search paradigm is geographically centric, as was hinted at in the
research literature, then another measurement can be made here: the physical distance between
query node and sink node. This notion of distance may very well be incorporated directly into
the derived formulas 3.11 and 4.14. For that matter, additional attributes identified as having
merit in the future could be incorporated directly into the formulas given and thus create
refinements to the formulas presented here.
BIBLIOGRAPHY
Acceleration. (n.d.). Retrieved March 16, 2016, from http://www.merriam-
webster.com/dictionary/acceleration
Agarwal, R., Dhar, V. (2014). Big Data, Data Science, and Analytics: The Opportunity and
Challenge for IS Research. Institute of Operations Research and the Management Science. Vol.
25, No. 3, 443-448
Ageev, M., Guo, Q., Lagun, D., Agichtein, E. (2011). Find It If You Can: A Game for Modeling
Different Types of Web Search Success Using Interaction Data. Proceedings of the 34th
International ACM SIGIR Conference on Research and Development In Information Retrieval.
ACM
Al-Maolegi, M., Arkok, B. (2014). An Improved Apriori Algorithm for Association Rules.
International Journal on Natural Language Computing (IJNLC). Vol. 3, No. 1
Alla, H. (2008). A Novel Efficient Classification Algorithm for Search Engines. 8th WSEAS
International Conference on Applied Informatics and Communications. 20-22
Bar-Yosser, Z., Gurevich, M. (2008). Mining Search Engine Query Logs via Suggestion
Sampling. ACM. Vol. 1, Issue 1, 54-65
Barbay, J., Kenyon, C. (2003). Deterministic Algorithm for the t-Threshold Set Problem.
Retrieved May 2015 from http://users.dcc.uchile.cl/~jbarbay/Publications/2003-ISAAC-
DeterministicAlgorithmForTheTThresholdProblem-BarbayKenyon.pdf
Beeferman, D., Berger, A. (2000). Agglomerative Clustering of a Search Engine Query Log.
KDD '00 Proceedings of the sixth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining. 407-416
Beel, J., Gipp, B. (2009). Google Scholar’s Ranking Algorithm: The Impact of Article’s Age
(An Empirical Study). Information Technology: New Generations, 2009, ITNG ’09. Sixth
International Conference. 160-164
Berry, M., Browne, M. (2005). Understanding Search Engines: Mathematical Modeling and
Text Retrieval. Siam
Bijral, S., Mukhopadhyay, D. Efficient Fuzzy Search Engine with B-Tree Search Mechanism.
Retrieved May 2015 from http://arxiv.org/abs/1411.6773
Carterette, B., Jones, R. (2008). Evaluating Search Engines by Modeling the Relationship
Between Relevance and Clicks. Advances in Neural Information Processing Systems
Chang, J., Chiou, S. (2009). An EM Algorithm for Context-Based Searching and Disambiguation
with Application to Synonym Term Alignment. 23rd Pacific Asia Conference on Language,
Information and Computing. 630-637.
Choudhary, L., Burdak, B. Role of Ranking Algorithms for Information Retrieval. Retrieved
May 2015 from http://arxiv.org/abs/1208.1926
Chuklin, A., Rijke, M. (2014). The Anatomy of Relevance. Retrieved May 2015 from
http://arxiv.org/abs/1501.06412
Cohen, A., Vitanyi, P. Web Similarity. Retrieved May 2015 from
http://arxiv.org/abs/1502.05957
Dahiwale, P., Raghuwanshi, M., Malik, L. (2014). PDD Crawler: A Focused Web Crawler Using
Link and Content Analysis for Relevance Prediction. SEAS-2014, Dubai, UAE, International
Conference.
Erdani, Y. (2012). Developing Backward Chaining Algorithm of Inference Engine in Ternary
Grid Expert System. International Journal of Advanced Computer Science and Applications
(IJACSA). Vol. 3, No. 9
Frees, A., Gamble, J., Rudinger, K., Bach, E., Friesen, M., Joynt, R., Coppersmith, S. Power
Law Scaling for the Adiabatic Algorithm for Search Engine Ranking. Retrieved May 2015 from
http://arxiv.org/abs/1211.2248
Gandomi, A., Haider, M. (2015). Beyond the Hype: Big Data Concepts, Methods, and Analytics.
International Journal of Information Management. Vol. 35, 137-144
Garnerone, S., Zanardi, P., Lidar, D. Adiabatic Quantum Algorithm for Search Engine Ranking.
Retrieved May 2015 from http://arxiv.org/abs/1109.6546
Gwizdka, J., Chignell, M. (1999). Towards Information Retrieval Measures for Evaluation of
Web Search Engines. Retrieved May 2015 from
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.15.3212&rep=rep1&type=pdf
Google. Search Engine Optimization Starter Guide. Retrieved July 15, 2015, from
http://static.googleusercontent.com/media/www.google.com/en/us/webmasters/docs/search-
engine-optimization-starter-guide.pdf
Hassan, A. (2012). A Semi-Supervised Approach to Modeling Web Search Satisfaction.
Proceedings of the 35th International ACM SIGIR Conference on Research and Development in
Information Retrieval. ACM
Henzinger, M. (2007). Combinatorial Algorithms for Web Search Engines – Three Success
Stories. SODA '07 Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete
Algorithms. 1022-1026
Hersch, W., Elliot, D., Hickam, D., Wolf, S., Molnar, A., Leichtenstien, C. (1995). Towards New
Measures of Information Retrieval Evaluation. SIGIR ’95 Proceedings of the 18th Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval.
164-170
Heuer, J., Dupke, S. Towards a Spatial Search Engine Using Geotags. Retrieved May 2015 from
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.463.8323&rep=rep1&type=pdf
Huang, P., He, X., Gao, J., Deng, L., Acero, A., Heck, L. (2013). Learning Deep Structured
Semantic Models for Web Search Using Clickthrough Data. Proceedings of the 22nd ACM
International Conference on Information & Knowledge Management. ACM
Imafouo, A., Tannier, X. (2005). Retrieval Status Values in Information Retrieval Evaluation.
12th International Conference – String Processing and Information Retrieval
Ishii, H., Tempo, R. (2010). Distributed Randomized Algorithms for the PageRank Computation.
Automatic Control, IEEE Transactions. Vol. 55, Issue 9, 1987-2002
Jacobs, A. (2009). The Pathologies of Big Data. Communications of the ACM. Vol. 52, No. 9
Jansen, B., Spink, A. (2006). How Are We Searching the World Wide Web? A Comparison of
Nine Search Engine Transaction Logs. Information Processing and Management. Vol 40, 248-
263
Jerkovic, J. (2010). SEO Warrior. Sebastopol, CA. O’Reilly Media Inc.
Jones, S. (2002). Encyclopedia of New Media: An Essential Reference to Communication and
Technology. SAGE Publications, Inc.
Kandefer, M., Shapiro, S. (2008). An F-Measure for Context Based Information Retrieval.
Retrieved October 2015 from
http://commonsensereasoning.org/2009/papers/commonsense2009paper13.pdf
King, A. (2008). Website Optimization. Sebastopol, CA. O’Reilly Media Inc.
Koorangi, M., Zamanifar, K. (2007). A Distributed Agent Based Web Search Using Genetic
Algorithm. International Journal of Computer Science and Network Security. Vol. 7, No. 1
Kumar, R., Saini, S. (2011). A Study on SEO Monitoring System Based on Corporate Website
Development. International Journal on Computer Science, Engineering and Information
Technology (IJCSEIT). Vol. 1, No. 2
Lardin-Schweitzer, Y., Collet, P., Lutton, E., Prost, T. (2003). Introducing Lateral Thinking in
Search Engines with Interactive Evolutionary Algorithms. SAC '03 Proceedings of the 2003
ACM Symposium on Applied Computing, 214-219
Laurila, J., Perez, D., Aad, I., Blom, J., Bornet, O., Do, T., Dousse, O., Eberle, J., Miettinen, M.
(2012). The Mobile Data Challenge: Big Data for Mobile Computing Research. Pervasive
Computing, Newcastle.
Lavalle, S., Lesser, E., Shockley, R., Hopkins, M., Kruschwitz, N. (2011). Big Data, Analytics
and the Path from Insights to Value. MIT Sloan Review. Vol. 52, No. 2
Lazer, D., Kennedy, R., King, G., Vespignani, A. (2014). The Parable of Google Flu: Traps in
Big Data Analysis. Science. Vol. 343
Li, C., Hong, M., Cogill, R., Garcia, A. An Adaptive Online Ad Auction Scoring Algorithm for
Revenue Maximization. Retrieved May 2015 from http://arxiv.org/abs/1207.4701
Li, H., Wang, Y., Zhang, D., Zhang, M., Chang, E. PFP: Parallel FP-Growth for Query
Recommendation. Retrieved May 2015 from
http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/34668.pdf
Losee, R. (2000). When Information Retrieval Measures Agree About the Relative Quality of
Document Rankings. Journal of the American Society for Information Science. Vol. 51, 834-840
Mehlitz, M., Kunegis, J., Bauckhage, C., Albayrak, S. (2007). A New Evaluation Measure for
Information Retrieval Systems. IEEE International Conference on Systems, Man and
Cybernetics.
Mukhopadhyay, D., Biswas, P., Kim, Y. (2006). A Syntactic Classification Based Web Page
Ranking Algorithm. Retrieved May 2015 from
http://arxiv.org/ftp/arxiv/papers/1102/1102.0694.pdf
PageRank. Retrieved May 25, 2015 from http://en.wikipedia.org/wiki/PageRank
Pal, A., Tomar, D., Shrivastava, S. (2009). Effective Focused Crawling Based on Content and
Link Structure Analysis. International Journal of Computer Science and Information Security
(IJCSIS). Vol. 2, No. 1
Provost, F., Fawcett, T. (2013). Data Science and Its Relationship to Big Data and Data-Driven
Decision Making. Mary Ann Liebert, Inc. Vol. 1, No. 1
Radinsky, K., Svore, K., Dumais, S., Teevan, J., Bocharov, A., Horvitz, E. (2012). Modeling and
Predicting Behavioral Dynamics on the Web. Proceedings of the 21st International Conference
on World Wide Web. ACM
Rani, M., Parashar, A., Chaturvedi, J., Malviya, A. Search Space Engine Optimize Search Using
FCC_STF Algorithm in Fuzzy Co-Clustering. Retrieved May 2015 from
http://arxiv.org/abs/1407.6952
Rojas, M. A Semantic Association Page Rank Algorithm for Web Search Engines. Retrieved
May 2015 from http://arxiv.org/abs/1211.6159
Sedigh, A., Roudaki, M. (2003). Identification of the Dynamics of the Google's Ranking
Algorithm. 13th IFAC Symposium on System Identification
Sheng, C., Zhang, N., Tao, Y., Jin, X. (2012). Optimal Algorithms for Crawling a Hidden
Database in the Web. Proceedings of the VLDB Endowment. Vol. 5, No. 11
Smyth, B., Balfe, E., Freyne, J., Briggs, P., Coyle, M., Boydell, O. (2004). Exploiting Query
Repetition and Regularity in an Adaptive Community-Based Web Search Engine. User
Modelling and User-Adaptive Interaction. Vol 14, 383-423
Snijders, C., Matzat, U., Reips, U. (2012). “Big Data”: Big Gaps of Knowledge in the Field of
Internet Science. International Journal of Internet Science. Vol. 7, 1-5
Spamdexing. Retrieved May 25, 2015 from http://en.wikipedia.org/wiki/Spamdexing
Spirin, N., Han, J. Survey on Web Spam Detection: Principles and Algorithms. Retrieved May
2015 from http://www.kdd.org/sites/default/files/issues/13-2-2011-12/V13-02-08-Spirin.pdf
Sun, H., Wei, Y. (2005). A Note On The PageRank Algorithm. Elsevier. Vol. 179, Issue 2, 799-
806
Suri, P., Taneja, H. (2012). An Integrated Ranking Algorithm for Efficient Information
Computing In Social Networks. International Journal on Web Service Computing (IJWSC). Vol.
3, No. 1
Taylor, L., Schroeder, R., Meyer, E. (2014). Emerging Practices And Perspectives on Big Data
Analysis in Economics: Bigger and Better or More of the Same. Big Data and Society. July-
December 2014, 1-10
Turney, P. (2008). The Latent Relation Mapping Engine: Algorithm and Experiments. Journal of
Artificial Intelligence Research. Vol. 33, 615-655
U.S. Department of Commerce. E-Stats. May 22, 2014. Web. April 9, 2015
Ularu, E., Puican, F., Apostu, A., Velicanu, M. (2012). Perspectives on Big Data and Big Data
Analytics. Database Systems Journal. Vol. III, No. 4
Variable. (n.d.). Retrieved March 16, 2016, from http://www.merriam-
webster.com/dictionary/variable
Vogel, D., Bickel, S., Haider, P., Schimpfky, R., Siemen, P., Bridges, S., Scheffer, T. (2005).
Classifying Search Engine Queries Using the Web as Background Knowledge. ACM SIGKDD
Explorations Newsletter. Vol. 7, Issue 2, 117-122.
W3C. HTML 4.01 Specification. World Wide Web Consortium. December 24, 1999. Web. April 9, 2015
Wang, F., Du, Y., Dong, Q. (2008). A Search Quality Evaluation Based On Objective-Subjective
Method. Journal of Convergence Information Technology. Vol. 3, No. 2, 50-56
Wang, H., Zhai, C., Liang, F., Dong, A., Chang, Y. (2014). User Modeling in Search Logs Via a
Nonparametric Bayesian Approach. Proceedings of the 7th ACM International Conference on
Web Search and Data Mining. ACM.
Wei, Z., Zhao, P., Zhang, L. (2014). Design and Implementation of Image Search Algorithm.
American Journal of Software Engineering and Applications. Vol. 3, No. 6, 90-94
White, C. (2007). Sergey Brin and Larry Page: the founders of Google. The Rosen Publishing
Group, Inc. New York, NY.
Xu, S., Zhu, Y., Jiang, H., Lau, F. A User-Oriented Webpage Ranking Algorithm Based On user
Attention Time. Retrieved May 2015 from https://www.aaai.org/Papers/AAAI/2008/AAAI08-
199.pdf
Younes, A., Rowe, J., Miller, J. (2008). A Hybrid Quantum Search Engine: A Fast Quantum
Algorithm for Multiple Matches. Retrieved from http://arxiv.org/abs/quant-ph/0311171
Yue, Z., Han, S., He, D. (2014). Modeling Search Processes Using Hidden States in
Collaborative Exploratory Web Search. Proceedings of the 17th ACM Conference on Computer
Supported Cooperative Work & Social Computing. ACM
Zhou, B., Yao, Y. (2010). Evaluating Information Retrieval System Performance Based on User
Preference. Journal of Intelligent Information Systems. Vol. 34, Issue 3, 227-248
Zhou, L., Personalized Web Search. Retrieved May 2015 from http://arxiv.org/abs/1502.01057
APPENDIX A
The data mining algorithm was written in the Python programming language
(https://www.python.org). Python is a structured programming language with a vast array of
libraries that facilitate software development. It was chosen for its ability to parse HTML
through the HTMLParser library and for the libraries available to query search engines such as
Google, Bing, and Yahoo. Listed below are the Python programs created to query the search
engines and index each of the system attributes identified in Chapter III of this dissertation.
The programs are displayed by file name, along with any additional non-Python files used, such
as XML configuration files.
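All of the programs below follow the same pattern: subclass Python's standard HTMLParser and override its tag callbacks. As a minimal illustration of that pattern (this sketch is not part of the dissertation code; the tag names and class name are chosen only for the example), a parser that tallies occurrences of tags of interest looks like this:

```python
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Count how often each tag of interest appears in a page."""
    def __init__(self):
        super().__init__()
        self.counts = {'title': 0, 'h1': 0, 'p': 0}

    # Called once per opening tag while feed() walks the document
    def handle_starttag(self, tag, attrs):
        if tag in self.counts:
            self.counts[tag] += 1

parser = TagCounter()
parser.feed('<html><title>Search</title><h1>Results</h1><p>one</p><p>two</p></html>')
print(parser.counts)  # {'title': 1, 'h1': 1, 'p': 2}
```

The listings that follow extend this idea: handle_starttag and handle_data inspect each tag's attributes and text to extract result links, keyword densities, and backlinks.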
config.py
from xml.dom import minidom
import os

class config:
    bing_settings = { }    # 'url': '', 'externallinks': ''
    yahoo_settings = { }   # 'url': '', 'externallinks': ''
    google_settings = { }  # 'url': '', 'externallinks': ''

    def __init__(self):
        print('\n==================================================================')
        print('%s Initialized' % self.__class__.__name__)
        print('==================================================================')
        self.file = os.path.dirname(os.path.abspath(__file__)) + '\\' + 'config.xml'
        self.xmldoc = minidom.parse(self.file)

    def bing(self):
        for child in self.xmldoc.getElementsByTagName("bing")[0].childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                self.bing_settings[child.nodeName] = child.firstChild.nodeValue

    def google(self):
        for child in self.xmldoc.getElementsByTagName("google")[0].childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                self.google_settings[child.nodeName] = child.firstChild.nodeValue

    def yahoo(self):
        for child in self.xmldoc.getElementsByTagName("yahoo")[0].childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                self.yahoo_settings[child.nodeName] = child.firstChild.nodeValue
config.xml
<?xml version="1.0" encoding="windows-1252"?>
<root>
    <!-- Bing Account Settings // -->
    <bing>
        <url>http://www.bing.com/search?q=</url>
        <externallinks>http://bing.com/search?q=link: [URL] -site:[BASE_URL]</externallinks>
    </bing>
    <!-- Google Account Settings // -->
    <google>
        <url>http://www.google.com/search?q=</url>
        <externallinks>http://www.google.com/search?q=link:[URL]+-site:[BASE_URL]&amp;num=1000</externallinks>
    </google>
    <!-- Yahoo Account Settings // -->
    <yahoo>
        <url>http://search.yahoo.com/search?p=</url>
        <externallinks>http://search.yahoo.com/search?p=link: [URL] -site:[BASE_URL]</externallinks>
    </yahoo>
</root>
extract.py
import urllib.request
from htmlHelper import *
import sys
import random
import os

class Extract:
    _proxies = []
    _agents = []

    def __init__(self):
        print('\n==================================================================')
        print('%s Initialized' % self.__class__.__name__)
        print('==================================================================')
        _proxies = []
        _source = os.path.dirname(os.path.realpath(__file__)) + '\\proxies.txt'
        with open(_source, 'r') as _file:
            for _line in _file:
                _proxies.append(_line.strip())

    # Retrieve series of links from search engine query
    def extract_links(self, url, engine, use_proxy = False):
        _html = HTMLhelper()
        _html.search_engine(engine, 'URLS')
        try:
            if len(self._agents) == 0:
                _source = os.path.dirname(os.path.abspath(__file__)) + '\\' + 'agents.txt'
                _file = open(_source, 'r')
                _agent = _file.readline().strip()
                while _agent:
                    self._agents.append(_agent)
                    _agent = _file.readline().strip()
                _file.close()
            _request = None
            url = url.replace(' ', '%20')
            if use_proxy:
                if len(self._proxies) == 0:
                    _source = os.path.dirname(os.path.realpath(__file__)) + '\\proxies.txt'
                    with open(_source, 'r') as _file:
                        for _line in _file:
                            self._proxies.append(_line.strip())
                _prox = random.choice(self._proxies)
                print("Using Proxy: %s" % _prox)
                proxies = {'http': 'http://'+_prox}
                opener = urllib.request.FancyURLopener(proxies)
                with opener.open(url) as f:
                    print(f.read().decode('utf-8'))
            else:
                _request = urllib.request.Request(url)
                if engine.upper() == 'YAHOO' or engine.upper() == 'GOOGLE':
                    _request.add_header('User-agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36')
                else:
                    _request.add_header('User-Agent', random.choice(self._agents))
                _response = urllib.request.urlopen(_request)
                _html.feed(_response.read().decode('utf-8'))
        except ValueError as v:
            print("Error: %s" % v)
            exit()
        return _html

    # Extract indexes from end point URL given keywords array
    def extract_indexes(self, url, keywords):
        _html = HTMLhelper()
        try:
            _request = urllib.request.Request(url.replace(' ', '%20'))
            _request.add_header('User-agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36')
            _response = urllib.request.urlopen(_request, timeout=180)
            _html.search_indexes(url, 'KEYS', keywords)
            _html.feed(_response.read().decode('utf-8'))
        except:
            print(sys.exc_info()[1])
            return {}
        return _html.indexes

    # Extract embedded links from search engine
    def external_links(self, url, engine):
        _html = HTMLhelper()
        try:
            url = url.replace(' ', '%20')
            print("Back Links: %s" % url)
            _request = urllib.request.Request(url)
            _request.add_header('User-agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36')
            _response = urllib.request.urlopen(_request, timeout=180)
            _html.get_backlinks(url, engine)
            _html.feed(_response.read().decode('utf-8'))
        except:
            print(sys.exc_info()[1])
            return {}
        return _html
htmlHelper.py
from html.parser import HTMLParser

class HTMLhelper(HTMLParser):
    """
    type = BING | GOOGLE | YAHOO
    operation = URLS | KEYS | INDEX
    URLS = Links on search engine
    KEYS = Primary indexing attributes on result link for each individual link extracted from main query to search engine
    INDEX = Backlinks to main URL
    """
    indexes = { 'description': 0, 'div': 0, 'outbound_links': 0, 'h1': 0, 'h2': 0, 'h3': 0, 'h4': 0,
                'h5': 0, 'h6': 0, 'inbound_links': 0, 'keywords': 0, 'p': 0, 'span': 0, 'title': 0, 'root': 0 }
    links = []
    operation = ''
    next = ''
    root_url = ''
    tag = ''
    type = ''
    url = ''
    words = {}
    backlinks = []
    need_proxy = False

    # Extract links from search engine query. Base search to retrieve series of URLs
    # that match a given query to search engine.
    # operation = URLS
    def search_engine(self, type, operation):
        self.links = []
        self.operation = operation
        self.next = ''
        self.type = type
        self.need_proxy = False
        self.previous_tag = ''

    # Retrieve individual indexes from keywords for a specific end point, i.e. URL
    # operation = KEYS
    def search_indexes(self, url, operation, words):
        self.indexes = { 'description': 0, 'div': 0, 'outbound_links': 0, 'h1': 0, 'h2': 0, 'h3': 0, 'h4': 0,
                         'h5': 0, 'h6': 0, 'inbound_links': 0, 'keywords': 0, 'p': 0, 'span': 0, 'title': 0, 'root': 0 }
        self.operation = operation
        self.url = url
        self.root_url = self.url.lower().replace('https', '').replace('http', '').replace(':', '').replace('//', '').replace('www.', '')
        self.words = words
        self.need_proxy = False
        for _word in self.words:
            if len(self.root_url) > 0:
                self.indexes['root'] += ( self.root_url.lower().count(_word.lower()) * len(_word) ) / len(self.root_url)

    # Get backlinks to main URL found from the invocation of the search_engine().
    def get_backlinks(self, url, type):
        self.url = url
        self.next = ''
        self.type = type
        self.backlinks = []
        self.operation = 'INDEX'
        self.need_proxy = False

    def handle_starttag(self, tag, attrs):
        self.tag = tag
        self.attrs = attrs
        if self.operation.upper() == 'URLS':
            if self.tag == 'a' and self.type.upper() == 'GOOGLE':
                if len(attrs) == 2:
                    _hasMouseDown = False
                    _hasHref = False
                    _link = ''
                    _iterate = True
                    for items in attrs:
                        for key in items:
                            if _iterate:
                                if 'href' in key and items[1].strip() != '#' and ('google' not in items[1].strip() and items[1][0] != '/'):
                                    _hasHref = True
                                    _link = items[1]
                                elif 'onmousedown' in key and 'return' in items[1] and 'rwt(this' in items[1]:
                                    _hasMouseDown = True
                                elif ('class' == key.strip() and 'fl' == items[1].strip()) or ('data-' in key.strip()):
                                    _hasMouseDown = False
                                    _hasHref = False
                                    _iterate = False
                    if _hasHref and _hasMouseDown:
                        self.links.append({'url': _link, 'count': len(self.links) + 1})
                else:
                    _hasClass = False
                    _hasHref = False
                    _hasId = False
                    _link = ''
                    for items in attrs:
                        for key in items:
                            if 'class' in key and 'pn' in items[1]:
                                _hasClass = True
                            elif 'href' in key:
                                _hasHref = True
                                _link = items[1]
                            elif 'id' in key and 'pnnext' in items[1]:
                                _hasId = True
                    if _hasClass and _hasHref and _hasId and _link != '':
                        self.next = 'https://www.google.com' + _link
            elif self.tag == 'a' and self.type.upper() == 'BING':
                _hasHref = False
                _hasH = False
                _hasTitle = False
                _hasClass = False
                _link = ''
                for items in attrs:
                    for key in items:
                        if 'href' == key and ('bing.com' not in items[1]) and ('go.microsoft.com' not in items[1]) and items[1][0] != '/' and 'javascript:' not in items[1] and '#' != items[1][0]:
                            _hasHref = True
                            _link = items[1]
                        elif 'h' == key and 'Ads' not in items[1].strip():
                            _hasH = True
                        elif 'title' == key and 'next page' == items[1].strip().lower():
                            _hasTitle = True
                        elif 'class' == key and 'sb_pagn' == items[1].strip().lower():
                            _hasClass = True
                        elif 'href' == key and '/search?q=' == items[1][:10].strip().lower() and '&first=' in items[1] and 'FORM=PORE' in items[1]:
                            _hasHref = True
                            _link = 'http://www.bing.com' + items[1]
                if _hasHref and _hasH and (not _hasTitle) and (not _hasClass):
                    _exists = False
                    for _element in self.links:
                        if 'url' in _element and ( _element.get('url').strip().lower().replace('https', '').replace('http', '') == _link.strip().lower().replace('https', '').replace('http', '') or _link.strip().lower().replace('https', '').replace('http', '') in _element.get('url').strip().lower().replace('https', '').replace('http', '') or _element.get('url').strip().lower().replace('https', '').replace('http', '') in _link.strip().lower().replace('https', '').replace('http', '') ):
                            _exists = True
                            break
                    if not _exists:
                        self.links.append({'url': _link, 'count': len(self.links) + 1})
                elif _hasHref and _hasH and _hasTitle and _hasClass:
                    self.next = _link
            elif self.tag == 'a' and self.type.upper() == 'YAHOO':
                _hasClass = False
                _hasLink = False
                _hasTarget = False
                _hasData = False
                _hasReferrerPolicy = False
                _link = ''
                for items in attrs:
                    for key in items:
                        if 'class' == key:
                            _hasClass = True
                        elif 'href' == key and 'javascript' not in items[1].strip().lower() and '#' not in items[1].strip().lower() and 'search' not in items[1].strip().lower() and 'yahoo.com' not in items[1].strip().lower():
                            _hasLink = True
                            _link = items[1]
                        elif 'referrerpolicy' == key and 'origin' == items[1].strip().lower():
                            _hasReferrerPolicy = True
                        elif 'target' == key:
                            _hasTarget = True
                        elif 'data' in key and 'beacon' in items[1].strip().lower():
                            _hasData = True
                if _hasClass and _hasLink and _hasTarget and _hasData and _hasReferrerPolicy:
                    self.links.append({'url': _link, 'count': len(self.links) + 1})
        elif self.operation.upper() == 'KEYS':
            # Meta Tag Indexes - description and keywords
            if self.tag.lower().strip() == 'meta':
                _attrs = dict(attrs)
                if 'name' in _attrs and 'content' in _attrs and len(_attrs) == 2:
                    # Meta Tags
                    for _word in self.words:
                        if _attrs['name'].lower().strip() == 'description' and len(_attrs['content'].strip()) > 0:
                            self.indexes['description'] += ( _attrs['content'].lower().count(_word.lower()) * len(_word) ) / len(_attrs['content'].strip())
                        elif _attrs['name'].lower().strip() == 'keywords' and len(_attrs['content'].strip()) > 0:
                            self.indexes['keywords'] += ( _attrs['content'].lower().count(_word.lower()) * len(_word) ) / len(_attrs['content'].strip())
            elif self.tag.lower().strip() == 'a':
                # Outbound Links
                _nofollow = False
                _validLink = False
                for items in attrs:
                    for key in items:
                        _baseURL = self.root_url
                        if self.root_url.find('/') != -1:
                            _baseURL = self.root_url[:self.root_url.find('/')]
                        if 'href' == key and _baseURL not in items[1] and '#' != items[1][:1] and '/' != items[1][:1] and 'javascript' not in items[1]:
                            _validLink = True
                        elif 'rel' == key and items[1].strip().lower() == '':
                            _nofollow = True
                if not _nofollow and _validLink:
                    self.indexes['outbound_links'] += 1
        elif self.operation.upper() == 'INDEX':
            if self.tag == 'a' and self.type.upper() == 'BING':
                _hasHref = False
                _hasH = False
                _hasTitle = False
                _hasClass = False
                _link = ''
                for items in attrs:
                    for key in items:
                        if 'href' == key.lower() and ('bing.com' not in items[1]) and ('go.microsoft.com' not in items[1]) and items[1][0] != '/' and 'javascript:' not in items[1] and '#' != items[1][0]:
                            _hasHref = True
                            _link = items[1]
                        elif 'h' == key.lower() and 'ID=SERP' in items[1].strip():
                            _hasH = True
                        elif 'class' == key.lower() and 'sb_pagn' == items[1].strip().lower():
                            _hasClass = True
                        elif 'title' == key.lower() and 'next page' == items[1].strip().lower():
                            _hasTitle = True
                        elif 'href' == key.lower() and '/search?q=' in items[1].strip().lower() and ( 'FORM=' not in items[1] ) and ('bing.com' not in items[1]):
                            _hasHref = True
                            self.next = items[1]
                if _hasHref and _hasH:
                    _exists = False
                    if ( not _hasTitle ) and ( not _hasClass ):
                        for _element in self.backlinks:
                            if ( _element.strip().lower().replace('https', '').replace('http', '') == _link.strip().lower().replace('https', '').replace('http', '') or _link.strip().lower().replace('https', '').replace('http', '') in _element.strip().lower().replace('https', '').replace('http', '') or _element.strip().lower().replace('https', '').replace('http', '') in _link.strip().lower().replace('https', '').replace('http', '') ):
                                _exists = True
                                break
                        if not _exists:
                            self.backlinks.append(_link)
            elif self.tag == 'a' and self.type.upper() == 'GOOGLE':
                # Extract Links for Backlink Analysis
                if len(attrs) == 2:
                    _hasMouseDown = False
                    _hasHref = False
                    _link = ''
                    _iterate = True
                    for items in attrs:
                        for key in items:
                            if _iterate:
                                if 'href' in key and items[1].strip() != '#' and ('google' not in items[1].strip() and items[1][0] != '/'):
                                    _hasHref = True
                                    _link = items[1]
                                elif 'onmousedown' in key and 'return' in items[1] and 'rwt(this' in items[1]:
                                    _hasMouseDown = True
                                elif ('class' == key.strip() and 'fl' == items[1].strip()) or ('data-' in key.strip()):
                                    _hasMouseDown = False
                                    _hasHref = False
                                    _iterate = False
                    if _hasHref and _hasMouseDown:
                        self.backlinks.append(_link)
                else:
                    _hasClass = False
                    _hasHref = False
                    _hasId = False
                    _link = ''
                    for items in attrs:
                        for key in items:
                            if 'class' in key and 'pn' in items[1]:
                                _hasClass = True
                            elif 'href' in key:
                                _hasHref = True
                                _link = items[1]
                            elif 'id' in key and 'pnnext' in items[1]:
                                _hasId = True
                    if _hasClass and _hasHref and _hasId and _link != '':
                        self.next = 'https://www.google.com' + _link
            elif self.tag == 'a' and self.type.upper() == 'YAHOO':
                _hasClass = False
                _hasLink = False
                _hasTarget = False
                _hasData = False
                _hasReferrerPolicy = False
                _link = ''
                for items in attrs:
                    for key in items:
                        if 'class' == key:
                            _hasClass = True
                        elif 'href' == key and 'javascript' not in items[1].strip().lower() and '#' not in items[1].strip().lower() and 'search' not in items[1].strip().lower() and 'yahoo.com' not in items[1].strip().lower():
                            _hasLink = True
                            _link = items[1]
                        elif 'referrerpolicy' == key and 'origin' == items[1].strip().lower():
                            _hasReferrerPolicy = True
                        elif 'target' == key:
                            _hasTarget = True
                        elif 'data' in key and 'beacon' in items[1].strip().lower():
                            _hasData = True
                if _hasClass and _hasLink and _hasTarget and _hasData and _hasReferrerPolicy:
                    self.backlinks.append(_link)

    def handle_endtag(self, tag):
        self.tag = ''

    def handle_data(self, data):
        # Search Engine Prevents Response
        if self.type.upper() == 'BING':
            pass
        elif self.type.upper() == 'GOOGLE':
            pass
        elif self.type.upper() == 'YAHOO':
            if 'error 999' in data:
                print("Proxy Required")
                self.need_proxy = True
        if self.operation.upper() == 'URLS':
            # Extract Search Engine URLS
            if self.tag == 'd:url' and self.type.upper() == 'BING':
                pass
            elif self.tag == 'a' and self.type.upper() == 'GOOGLE':
                pass
            elif self.tag == 'a' and self.type.upper() == 'YAHOO':
                if data == 'Next':
                    _hasLink = False
                    for items in self.attrs:
                        for key in items:
                            if key == 'href':
                                self.next = items[1]
        elif self.operation.upper() == 'KEYS':
            # Keywords in Text Tag Elements - div, h1, h2, h3, h4, h5, h6, p, span, title
            for _key in self.indexes:
                if _key == self.tag:
                    for _word in self.words:
                        if len(data.strip()) > 0:
                            self.indexes[_key] += ( data.lower().count(_word.lower()) * len(_word) ) / len(data.strip())
        elif self.operation.upper() == 'INDEX':
            # Backlinks - Next Link on Page
            if self.type.upper() == 'BING':
                # Found in Anchor Tag
                pass
            elif self.type.upper() == 'GOOGLE':
                pass
            elif self.type.upper() == 'YAHOO':
                if data == 'Next':
                    _hasLink = False
                    for items in self.attrs:
                        for key in items:
                            if key == 'href':
                                self.next = items[1]
index.py
import os
import argparse
from extract import *
from nltk.corpus import wordnet

print ('Started ....')

# --------------------------------------------------------------------------
# Global Configuration
# --------------------------------------------------------------------------
_MODE = "DEBUG"
# --------------------------------------------------------------------------

parser = argparse.ArgumentParser(prog='index.py')
parser.add_argument('-engine', help='Search Engine [BING | GOOGLE | YAHOO]')
parser.add_argument('-operation', help='Operation [index | consolidate]')
parse = parser.parse_args()

if parse.engine and parse.operation:
    if parse.operation.lower() == 'index':
        _path = os.getcwd()+'\\'+parse.engine+'\\data\\'
        _type = '.data'
        _destination = '_indexed.dat'
        if os.path.exists(_path+_destination):
            os.remove(_path+_destination)
        with open(_path+_destination, 'w') as _target:
            _target.write('index\tkey\turl\tdescription\tdiv\th1\th2\th3\th4\th5\th6\tinbound_links\tkeywords\toutbound_links\tp\troot\tspan\ttitle\n')
        for _file in os.listdir(_path):
            if _type in _file:
                _name_pairs = _file.split('_')
                _keywords = []
                for _synonyms in wordnet.synsets(_name_pairs[0]):
                    for _s in _synonyms.lemmas():
                        _s = _s.name().replace('_', ' ')
                        if _s not in _keywords:
                            _keywords.append(_s)
                if len(_keywords) == 0:
                    _keywords.append(_name_pairs[0])
                with open(_path+_file) as _from:
                    for _line in _from:
                        _data = _line.split('\t')
                        if _data[0] != 'index':
                            print("Indexing: %s" % _data[1])
                            _extract = Extract()
                            _indexes = sorted(_extract.extract_indexes(_data[1], _keywords).items())
                            if _indexes:
                                with open(_path+_destination, 'a') as _target:
                                    _target.write(_data[0]+'\t'+_name_pairs[0]+'\t'+_data[1])
                                    for _index in _indexes:
                                        _target.write('\t'+str(_index[1]))
                                    _target.write('\n')
    elif parse.operation.lower() == 'consolidate':
        # Determines the average index for each node found
        print("Consolidating....")
        _path = os.getcwd()+'\\'+parse.engine+'\\data\\'
        _source = '_indexed.dat'
        _destination = '_clean.dat'
        # Get Distinct URLs
        _sites = []
        with open(_path+_source, 'r') as _from:
            for _line in _from:
                _data = _line.split('\t')
                if _data[0] != 'index':
                    if len(_sites) == 0:
                        _sites.append({'index': _data[0], 'key': _data[1], 'url': _data[2], 'indexes': _data[3:]})
                    _exists = False
                    for _site in _sites:
                        if _data[1] == _site['key'] and _site['url'] == _data[2]:
                            _exists = True
                            break
                    if not _exists:
                        _sites.append({'index': _data[0], 'key': _data[1], 'url': _data[2], 'indexes': _data[3:]})
                        print("Added: %s" % _data[2])
        # Get Average Index
        _cleaned = []
        for _site in _sites:
            _compositeIndex = float(_site['index'])
            _count = 1.0
            with open(_path+_source, 'r') as _from:
                for _line in _from:
                    _data = _line.split('\t')
                    if _data[0] != 'index' and _data[1] == _site['key'] and _data[2] == _site['url']:
                        _count += 1
                        _compositeIndex += float(_data[0])
            _cleaned.append({'index': _compositeIndex/_count, 'key': _site['key'], 'url': _site['url'], 'indexes': _site['indexes']})
        if os.path.exists(_path+_destination):
            os.remove(_path+_destination)
        with open(_path+_destination, 'w') as _target:
            _target.write('index\tkey\turl\tdescription\tdiv\th1\th2\th3\th4\th5\th6\tinbound_links\tkeywords\toutbound_links\tp\troot\tspan\ttitle\n')
        for _clean in _cleaned:
            with open(_path+_destination, 'a') as _file:
                _file.write(str(_clean['index']) + '\t' + _clean['key'] + '\t' + _clean['url'])
                for _entry in _clean['indexes']:
                    _file.write('\t' + str(_entry).strip())
                _file.write('\n')
else:
    parser.print_help()

print ('Ended ....')
merge.py
import os
import argparse

print ('Started ....')

# --------------------------------------------------------------------------
# Global Configuration
# --------------------------------------------------------------------------
_MODE = "DEBUG"
# --------------------------------------------------------------------------

parser = argparse.ArgumentParser(prog='merge.py')
parser.add_argument('-engine', help='Search Engine [BING | GOOGLE | YAHOO]')
parse = parser.parse_args()

if parse.engine:
    path = os.getcwd()+'\\'+parse.engine+'\\data'
    type = '.data'
    destination = path + '\\_compiled.dat'
    if os.path.exists(destination):
        os.remove(destination)
    for _file in os.listdir(path):
        if type in _file:
            with open(path+'\\'+_file) as _source:
                if not os.path.exists(destination):
                    with open(destination, 'w') as _target:
                        for _line in _source:
                            _target.write(_line)
                else:
                    with open(destination, 'a') as _target:
                        for _line in _source:
                            if 'index' in _line:
                                continue
                            else:
                                _target.write(_line)
else:
    parser.print_help()

print ('Ended ....')
pyBacklinks.py
import os
import time
import argparse
from pyBing import *
from pyGoogle import *
from pyYahoo import *

print ('Started ....')

# --------------------------------------------------------------------------
# Global Configuration
# --------------------------------------------------------------------------
_MODE = "DEBUG"
_OPERATION = 'APPEND'
# --------------------------------------------------------------------------

parser = argparse.ArgumentParser(prog='pyBacklinks.py')
parser.add_argument('-engine', help='Search Engine [BING | GOOGLE | YAHOO]')
parse = parser.parse_args()

_out = os.getcwd()+'\\'+parse.engine.upper()+'\\data\\_complete.dat'
if _OPERATION != 'APPEND':
    if os.path.exists(_out):
        os.remove(_out)

_historical = os.getcwd()+'\\'+parse.engine.upper()+'\\data\\_historical.dat'
if _OPERATION != 'APPEND':
    if os.path.exists(_historical):
        os.remove(_historical)
    with open(_historical, 'w') as _file:
        _file.write('key\tsink\tsource\n')

if parse.engine:
    source = os.getcwd()+'\\'+parse.engine.upper()+'\\data\\_clean.dat'
    with open(source) as _source:
        for _line in _source:
            if 'index' in _line and 'url' in _line and 'description' in _line and 'div' in _line and _OPERATION != 'APPEND':
                _file = open(_out, "w")
                _file.write(_line)
                _file.close()
            elif _line[0:1] != "#":
                _data = _line.strip().split('\t')
                _repository = {}
                if parse.engine.upper() == 'BING':
                    _bing = pyBing()
                    _repository = _bing.getBackLinks(_data[2])
                elif parse.engine.upper() == 'GOOGLE':
                    _google = pyGoogle()
                    _repository = _google.getBackLinks(_data[2])
                    while _repository == 'ERROR':
                        _repository = _google.getBackLinks(_data[2])
                elif parse.engine.upper() == 'YAHOO':
                    _yahoo = pyYahoo()
                    _repository = _yahoo.getBackLinks(_data[2])
                if len(_repository) > 0 and len(_repository[0]) > 0:
                    _data[11] = len(_repository)
                else:
                    _data[11] = 0
                _file = open(_out, 'a')
                _entries = ''
                for _entry in _data:
                    if len(_entries) > 0:
                        _entries += '\t'
                    _entries += str(_entry)
                _file.write(_entries+'\n')
                _file.close()
                # Log Total Links to File
                try:
                    for _sink in _repository:
                        with open(_historical, 'a') as _log:
                            _log.write(_data[1] + '\t' + _data[2] + '\t' + _sink + '\n')
                except:
                    for _sink in _repository:
                        with open(_historical, 'a') as _log:
                            _log.write(_data[1] + '\t' + _data[2] + '\t[Unicode Error Write]\n')
                time.sleep(5)
else:
    parser.print_help()

print ('Ended ....')
pyBing.py
import sys
from config import *
import urllib.request as request
from xml.dom import minidom
from htmlHelper import *
from extract import *
import time
import os
import math
from nltk.corpus import wordnet


class pyBing(Extract):

    def __init__(self):
        print('\n==================================================================')
        print('%s Initialized' % self.__class__.__name__)
        print('==================================================================')

    def getBackLinks(self, url):
        _repository = {}
        _page = 1
        _iteration = "&first=[ITERATION_STEP]&FORM=PORE"
        _MAX = 100
        try:
            print("Page %i" % _page)
            _config = config()
            _config.bing()
            _source = url.replace('https', '').replace('http', '').replace(':', '').replace('//', '').replace('www.', '')
            if _source.find('/') != -1:
                _source = _source[:_source.find('/')]
            _url = _config.bing_settings['externallinks'].replace('[URL]', url).replace('[BASE_URL]', _source)
            _base = _url
            _html = self.external_links(_url, 'BING')
            _repository = _html.backlinks
            print("Page %i Contains %i Links" % (_page, len(_html.backlinks)))
            while _html.next != '' and len(_html.backlinks) > 0 and len(_html.backlinks[0]) > 0:
                # Will go maximum 100 pages deep in search ... 1000 links approximately
                if _MAX - _page == 0 or len(_html.backlinks) == 0:
                    break
                time.sleep(3)
                _html = self.external_links(_base + _iteration.replace("[ITERATION_STEP]", str(len(_repository) + 1)), 'BING')
                _page += 1
                print("Page %i Contains %i Links" % (_page, len(_html.backlinks)))
                _new = False
                for _link in _html.backlinks:
                    if _link not in _repository:
                        _repository.append(_link)
                        _new = True
                # Stop if no new links were found
                if not _new:
                    break
            print("Total Links = %i" % len(_repository))
        except request.URLError as e:
            print("Error: %s" % e.reason)
        except ValueError as v:
            print("Non urllib Error: %s" % v)
        return _repository

    # Extract Link Attributes for a Given Search
    def getLinks(self, query):
        # Obtain Configuration Settings
        _config = config()
        _config.bing()
        try:
            # Keywords
            _keywords = []
            for _synonyms in wordnet.synsets(query):
                for _s in _synonyms.lemmas():
                    _s = _s.name().replace('_', ' ')
                    if _s not in _keywords:
                        _keywords.append(_s)
            if len(_keywords) == 0:
                _keywords.append(query)
            # Set Repository Structure
            _name = os.getcwd() + '\\BING\\' + query + '_' + time.strftime("%Y%m%d%H%M%S") + ".data"
            _file = open(_name, "w")
            _file.write('index\turl\tdescription\tdiv\th1\th2\th3\th4\th5\th6\tinbound_links\tkeywords\toutbound_links\tp\troot\tspan\ttitle\n')
            _file.close()
            # Direct Call
            _html = self.extract_links(_config.bing_settings['url'] + query, 'BING')
            _indexValue = 0
            _page = 1
            _maxPages = 6
            while _html.next != '' and _page < _maxPages:
                for _entry in _html.links:
                    _indexValue += 1
                    if '.pdf' not in _entry:
                        print("Searching: %s" % _entry.get('url'))
                        _indexes = sorted(self.extract_indexes(_entry.get('url'), _keywords).items())
                        if _indexes:
                            _file = open(_name, 'a')
                            _file.write(str(_indexValue) + '\t' + _entry.get('url'))
                            for _index in _indexes:
                                _file.write('\t' + str(_index[1]))
                            _file.write('\n')
                            _file.close()
                if _html.next != '':
                    print("Next: %s" % _html.next)
                    print("Page: %i" % _page)
                    _html = self.extract_links(_html.next, 'BING')
                    print("Pausing for 3 seconds")
                    time.sleep(3)
                    _page += 1
        except request.URLError as e:
            print("Error: %s" % e.reason)
        except ValueError as v:
            print("Non urllib Error: %s" % v)
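The keyword step in getLinks above expands the query with WordNet lemma synonyms before extracting page attributes. As a standalone illustration of that step, the following sketch isolates the same logic (the function name expand_keywords and the injectable synsets parameter are ours, added for testability; the scripts inline this loop instead):

```python
def expand_keywords(query, synsets=None):
    """Return the query's WordNet lemma synonyms (underscores spaced),
    falling back to the raw query when WordNet has no entry for it.

    `synsets` defaults to nltk's wordnet.synsets; it is injectable so the
    expansion logic can be exercised without the corpus installed.
    """
    if synsets is None:
        from nltk.corpus import wordnet  # requires nltk.download('wordnet')
        synsets = wordnet.synsets
    keywords = []
    for synset in synsets(query):
        for lemma in synset.lemmas():
            term = lemma.name().replace('_', ' ')
            if term not in keywords:
                keywords.append(term)
    if not keywords:
        keywords.append(query)
    return keywords
```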
pyGoogle.py
import sys
from config import *
from rauth import OAuth2Service
import urllib.request
from xml.dom import minidom
from htmlHelper import *
from extract import *
import time
import os
from nltk.corpus import wordnet


class pyGoogle(Extract):
    """ """

    _proxy = False

    def __init__(self):
        print('\n==================================================================')
        print('%s Initialized' % self.__class__.__name__)
        print('==================================================================')

    def getBackLinks(self, url):
        _repository = []
        try:
            _config = config()
            _config.google()
            _source = url.replace('https', '').replace('http', '').replace(':', '').replace('//', '').replace('www.', '')
            if _source.find('/') != -1:
                _source = _source[:_source.find('/')]
            _url = _config.google_settings['externallinks'].replace('[URL]', url).replace('[BASE_URL]', _source)
            _html = self.external_links(_url, 'GOOGLE')
            try:
                for _link in _html.backlinks:
                    if _link not in _repository:
                        _repository.append(_link)
                print("Total Links = %i" % len(_repository))
            except:
                print("Error Extracting Google Data.")
                _repository = 'ERROR'
        except urllib.request.URLError as e:
            print("Error: %s" % e.reason)
        except ValueError as v:
            print("Non urllib Error: %s" % v)
        return _repository

    # Extract Link Attributes for a Given Search
    def getLinks(self, query, use_proxy=False):
        # Obtain API Settings
        _config = config()
        _config.google()
        self._proxy = use_proxy
        try:
            # Keywords
            _keywords = []
            for _synonyms in wordnet.synsets(query):
                for _s in _synonyms.lemmas():
                    _s = _s.name().replace('_', ' ')
                    if _s not in _keywords:
                        _keywords.append(_s)
            if len(_keywords) == 0:
                _keywords.append(query)
            # Set Repository Structure
            _name = os.getcwd() + '\\GOOGLE\\' + query + '_' + time.strftime("%Y%m%d%H%M%S") + ".data"
            _file = open(_name, "w")
            _file.write('index\turl\tdescription\tdiv\th1\th2\th3\th4\th5\th6\tinbound_links\tkeywords\toutbound_links\tp\troot\tspan\ttitle\n')
            _file.close()
            # Direct Call
            _html = self.extract_links(_config.google_settings['url'] + query, 'GOOGLE', use_proxy)
            _indexValue = 0
            _page = 1
            _maxPages = 11
            while _html.next != '' and _page < _maxPages:
                for _entry in _html.links:
                    _indexValue += 1
                    if '.pdf' not in _entry:
                        print("Searching: %s" % _entry['url'])
                        _indexes = sorted(self.extract_indexes(_entry['url'], _keywords).items())
                        if _indexes:
                            _file = open(_name, 'a')
                            _file.write(str(_indexValue) + '\t' + _entry['url'])
                            for _index in _indexes:
                                _file.write('\t' + str(_index[1]))
                            _file.write('\n')
                            _file.close()
                if _html.next != '':
                    print("Next: %s" % _html.next)
                    _html = self.extract_links(_html.next, 'GOOGLE')
                    print("Pausing for 3 seconds")
                    time.sleep(3)
                    _page += 1
        except urllib.request.URLError as e:
            print("URL Error: %s" % e.reason)
            self._proxy = True
            self.getLinks(query, True)
        except ValueError as v:
            print("Value Error: %s" % v)
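getLinks above falls back to a proxy by calling itself recursively from the URLError handler; because the retry always passes use_proxy=True, a failure under the proxy would recurse again. A bounded version of the same fallback pattern could look like this sketch (fetch_with_fallback and its fetch callable are illustrative, not part of the scripts):

```python
def fetch_with_fallback(fetch, attempts=2):
    """Try fetch(use_proxy=False), then once more via proxy, then give up.

    fetch: a callable taking a use_proxy flag; expected to raise OSError
    (urllib.error.URLError subclasses OSError) on network failure.
    """
    last_error = None
    for use_proxy in (False, True)[:attempts]:
        try:
            return fetch(use_proxy)
        except OSError as error:
            last_error = error
    raise last_error
```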
pyOptimization.py
import os
import argparse
from xml.dom import minidom
from htmlHelper import *
from extract import *
from nltk.corpus import wordnet

print('Started ....')

# --------------------------------------------------------------------------
# Global Configuration
# --------------------------------------------------------------------------
_MODE = "DEBUG"
# --------------------------------------------------------------------------
parser = argparse.ArgumentParser(prog='pyOptimization.py')
parser.add_argument('-engine', help='Search Engine [BING | GOOGLE | YAHOO]')
parser.add_argument('-operation', help='Optimization Type [INDEX | FILTER | QUALITY]')
parse = parser.parse_args()
nodes = []

if parse.engine and parse.operation:
    if parse.operation and parse.operation.upper() == 'INDEX':
        _path = os.getcwd() + '\\' + parse.engine + '\\data\\'
        _source = '_historical.dat'
        _destination = '_historical_indexed.dat'
        if os.path.exists(_path + _destination):
            os.remove(_path + _destination)
        with open(_path + _source, 'r') as _process:
            for _line in _process:
                if 'key' in _line and 'sink' in _line and 'source' in _line:
                    with open(_path + _destination, 'w') as _file:
                        _file.write('sink\tquality\n')
                else:
                    _data = _line.strip().split('\t')
                    if (len(_data) == 3) and (len(_data[2].strip()) > 0) and ('Unicode Error' not in _data[2].strip()):
                        try:
                            print("Searching: %s" % _data[2])
                            _keywords = []
                            for _synonyms in wordnet.synsets(_data[0]):
                                for _s in _synonyms.lemmas():
                                    _s = _s.name().replace('_', ' ')
                                    if _s not in _keywords:
                                        _keywords.append(_s)
                            if len(_keywords) == 0:
                                _keywords.append(_data[0])
                            print(_keywords)
                            _indexes = sorted(Extract().extract_indexes(_data[2], _keywords).items())
                            if _indexes:
                                with open(_path + _destination, 'a') as _file:
                                    _file.write(_data[0] + '\t' + _data[1] + '\t' + _data[2])
                                    for _index in _indexes:
                                        _file.write('\t' + str(_index[1]))
                                    _file.write('\n')
                        except:
                            print("Error Processing URL")
    elif parse.operation and parse.operation.upper() == 'FILTER':
        _path = os.getcwd() + '\\' + parse.engine + '\\data\\'
        _source = '_historical_indexed.dat'
        _destination = '_historical_filtered.dat'
        if os.path.exists(_path + _destination):
            os.remove(_path + _destination)
        with open(_path + _destination, 'w') as _file:
            _file.write('key\tsink\tdescription\tdiv\th1\th2\th3\th4\th5\th6\tinbound_links\tkeywords\toutbound_links\tp\troot\tspan\ttitle\n')
        # Distinct URLs
        _summary = []
        with open(_path + _source, 'r') as _process:
            for _line in _process:
                if ('key' not in _line) and ('sink' not in _line) and ('source' not in _line):
                    _data = _line.strip().split('\t')
                    _found = False
                    for _entry in _summary:
                        if _data[1].strip().lower() == _entry.strip().lower():
                            _found = True
                            break
                    if not _found:
                        print("ADDING: %s" % _data[1])
                        _summary.append(_data[1])
        for _individual in _summary:
            _totals = []
            _found = 0
            with open(_path + _source, 'r') as _process:
                for _line in _process:
                    if ('key' not in _line) and ('sink' not in _line) and ('source' not in _line):
                        _data = _line.strip().split('\t')
                        if _individual.strip().lower() == _data[1].strip().lower():
                            _index = 3
                            while _index < len(_data[3:]):
                                if _found == 0:
                                    if _data[_index].isnumeric():
                                        try:
                                            _totals.append(float(_data[_index]))
                                        except:
                                            _totals.append(float(0))
                                    else:
                                        _totals.append(float(0))
                                else:
                                    if _data[_index].isnumeric():
                                        try:
                                            _totals[_index - 3] += float(_data[_index])
                                        except:
                                            pass
                                    else:
                                        _totals[_index - 3] += 0
                                _index += 1
                            _found += 1
                        else:
                            if _found > 0:
                                break
            _total = 0
            print(_totals)
            for _entry in _totals:
                _total += _entry
            _total = _total / _found
            with open(_path + _destination, 'a') as _file:
                print("Totals %s = %s" % (_individual, str(_total)))
                _file.write(_individual + '\t' + str(_total) + '\n')
    elif parse.operation and parse.operation.upper() == 'QUALITY':
        _path = os.getcwd() + '\\' + parse.engine + '\\data\\'
        _filtered = '_historical_filtered.dat'
        _indexed = '_complete.dat'
        _destination = '_historical_complete.dat'
        if os.path.exists(_path + _destination):
            os.remove(_path + _destination)
        with open(_path + _destination, 'w') as _file:
            _file.write('index\tkey\turl\tdescription\tdiv\th1\th2\th3\th4\th5\th6\tinbound_links\tkeywords\toutbound_links\tp\troot\tspan\ttitle\tquality\n')
        with open(_path + _filtered, 'r') as _filtered_file:
            for _filtered_line in _filtered_file:
                if ('sink' not in _filtered_line) and ('quality' not in _filtered_line):
                    _filtered_data = _filtered_line.strip().split('\t')
                    print("..%s" % _filtered_data[0])
                    with open(_path + _indexed, 'r') as _indexed_file:
                        for _indexed_line in _indexed_file:
                            if ('index' not in _indexed_line) and ('key' not in _indexed_line) and ('url' not in _indexed_line):
                                _indexed_data = _indexed_line.strip().split('\t')
                                if _filtered_data[0].strip().lower() == _indexed_data[2].strip().lower():
                                    print(">>>>>FOUND: %s" % _filtered_data[0])
                                    with open(_path + _destination, 'a') as _file:
                                        for _element in _indexed_data:
                                            _file.write(str(_element) + '\t')
                                        _file.write(str(_filtered_data[1]))
                                        _file.write('\n')
                                    break
else:
    parser.print_help()

print('Ended ....')
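The QUALITY operation above re-reads the indexed file once per filtered URL (a nested scan). The same join can be expressed with a single pass over each file by keying quality scores on the URL first, as in this sketch (merge_quality is a hypothetical helper, not part of pyOptimization.py):

```python
def merge_quality(filtered_rows, indexed_rows):
    """Append each filtered quality score to the matching indexed row.

    filtered_rows: iterable of (url, quality) pairs
    indexed_rows:  iterable of row lists whose third element is the URL
    Matching is case-insensitive on the stripped URL, as in the script.
    """
    quality = {url.strip().lower(): q for url, q in filtered_rows}
    merged = []
    for row in indexed_rows:
        q = quality.get(row[2].strip().lower())
        if q is not None:
            merged.append(row + [q])
    return merged
```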
pySearch.py
import sys
import argparse
from pyBing import *
from pyGoogle import *
from pyYahoo import *
import os
import time
import random

print("Starting Search ......")
parser = argparse.ArgumentParser(prog='pySearch.py')
parser.add_argument('-engine', help='Search Engine [BING | GOOGLE | YAHOO]')
parse = parser.parse_args()

# --------------------------------------------------------------------------
# Global Configuration
# --------------------------------------------------------------------------
_MODE = "DEBUG"
_source = os.getcwd() + '\\words_chosen.txt'
_use_proxy = False
_words = []

# Ensure Default Search Engine Directory Exists
if not os.path.exists(os.getcwd() + '\\' + parse.engine.upper()):
    os.makedirs(os.getcwd() + '\\' + parse.engine.upper())
# --------------------------------------------------------------------------

if parse.engine:
    if _MODE == "DEBUG":
        print("Engine: %s" % parse.engine)
    with open(_source, 'r') as _file:
        for _line in _file:
            _term = _line.strip()
            if _term[0:1] != "#":
                print("Searching: %s" % _term)
                if parse.engine.upper() == 'BING':
                    _bing = pyBing()
                    _bing.getLinks(_term)
                elif parse.engine.upper() == 'GOOGLE':
                    _google = pyGoogle()
                    _google.getLinks(_term, _use_proxy)
                    time.sleep(10)
                elif parse.engine.upper() == 'YAHOO':
                    _yahoo = pyYahoo()
                    _yahoo.getLinks(_term)
else:
    parser.print_help()

print("Ending Search ......")
pySummary.py
import sys
import argparse
import os

print("Starting Summary Calculations ......")
parser = argparse.ArgumentParser(prog='pySummary.py')
parser.add_argument('-engine', help='Search Engine [BING | GOOGLE | YAHOO]')
parser.add_argument('-file', help='Input File Name')
parse = parser.parse_args()

# --------------------------------------------------------------------------
# Global Configuration
# --------------------------------------------------------------------------
_MODE = "DEBUG"

if parse.engine and parse.file:
    _source = os.getcwd() + "\\" + parse.engine.upper() + "\\data\\" + parse.file
    if _MODE == "DEBUG":
        print("Engine: %s" % parse.engine)
        print("File: %s" % _source)
    _max = {'description': 0.0, 'div': 0.0, 'h1': 0.0, 'h2': 0.0, 'h3': 0.0, 'h4': 0.0, 'h5': 0.0,
            'h6': 0.0, 'inbound_links': 0.0, 'keywords': 0.0, 'outbound_links': 0.0, 'p': 0.0,
            'root': 0.0, 'span': 0.0, 'title': 0.0, 'quality': 0.0}
    _min = {'description': 1000.0, 'div': 1000.0, 'h1': 1000.0, 'h2': 1000.0, 'h3': 1000.0, 'h4': 1000.0,
            'h5': 1000.0, 'h6': 1000.0, 'inbound_links': 1000.0, 'keywords': 1000.0, 'outbound_links': 1000.0,
            'p': 1000.0, 'root': 1000.0, 'span': 1000.0, 'title': 1000.0, 'quality': 1000.0}
    _totals = {'description': 0.0, 'div': 0.0, 'h1': 0.0, 'h2': 0.0, 'h3': 0.0, 'h4': 0.0, 'h5': 0.0,
               'h6': 0.0, 'inbound_links': 0.0, 'keywords': 0.0, 'outbound_links': 0.0, 'p': 0.0,
               'root': 0.0, 'span': 0.0, 'title': 0.0, 'quality': 0.0}
    _lines = 0
    _show_quality = False
    if os.path.exists(_source):
        with open(_source, 'r') as _file:
            for _line in _file:
                _data = _line.strip().split('\t')
                if len(_data) == 19:
                    _show_quality = True
                if _data[0].upper() != "INDEX":
                    _lines += 1
                    # Totals
                    _totals['description'] += float(_data[3])
                    _totals['div'] += float(_data[4])
                    _totals['h1'] += float(_data[5])
                    _totals['h2'] += float(_data[6])
                    _totals['h3'] += float(_data[7])
                    _totals['h4'] += float(_data[8])
                    _totals['h5'] += float(_data[9])
                    _totals['h6'] += float(_data[10])
                    _totals['inbound_links'] += float(_data[11])
                    _totals['keywords'] += float(_data[12])
                    _totals['outbound_links'] += float(_data[13])
                    _totals['p'] += float(_data[14])
                    _totals['root'] += float(_data[15])
                    _totals['span'] += float(_data[16])
                    _totals['title'] += float(_data[17])
                    if len(_data) == 19:
                        _totals['quality'] += float(_data[18])
                    # Maximum
                    if float(_data[3]) > _max['description']: _max['description'] = float(_data[3])
                    if float(_data[4]) > _max['div']: _max['div'] = float(_data[4])
                    if float(_data[5]) > _max['h1']: _max['h1'] = float(_data[5])
                    if float(_data[6]) > _max['h2']: _max['h2'] = float(_data[6])
                    if float(_data[7]) > _max['h3']: _max['h3'] = float(_data[7])
                    if float(_data[8]) > _max['h4']: _max['h4'] = float(_data[8])
                    if float(_data[9]) > _max['h5']: _max['h5'] = float(_data[9])
                    if float(_data[10]) > _max['h6']: _max['h6'] = float(_data[10])
                    if float(_data[11]) > _max['inbound_links']: _max['inbound_links'] = float(_data[11])
                    if float(_data[12]) > _max['keywords']: _max['keywords'] = float(_data[12])
                    if float(_data[13]) > _max['outbound_links']: _max['outbound_links'] = float(_data[13])
                    if float(_data[14]) > _max['p']: _max['p'] = float(_data[14])
                    if float(_data[15]) > _max['root']: _max['root'] = float(_data[15])
                    if float(_data[16]) > _max['span']: _max['span'] = float(_data[16])
                    if float(_data[17]) > _max['title']: _max['title'] = float(_data[17])
                    if (len(_data) == 19) and (float(_data[18]) > _max['quality']): _max['quality'] = float(_data[18])
                    # Minimum
                    if float(_data[3]) < _min['description']: _min['description'] = float(_data[3])
                    if float(_data[4]) < _min['div']: _min['div'] = float(_data[4])
                    if float(_data[5]) < _min['h1']: _min['h1'] = float(_data[5])
                    if float(_data[6]) < _min['h2']: _min['h2'] = float(_data[6])
                    if float(_data[7]) < _min['h3']: _min['h3'] = float(_data[7])
                    if float(_data[8]) < _min['h4']: _min['h4'] = float(_data[8])
                    if float(_data[9]) < _min['h5']: _min['h5'] = float(_data[9])
                    if float(_data[10]) < _min['h6']: _min['h6'] = float(_data[10])
                    if float(_data[11]) < _min['inbound_links']: _min['inbound_links'] = float(_data[11])
                    if float(_data[12]) < _min['keywords']: _min['keywords'] = float(_data[12])
                    if float(_data[13]) < _min['outbound_links']: _min['outbound_links'] = float(_data[13])
                    if float(_data[14]) < _min['p']: _min['p'] = float(_data[14])
                    if float(_data[15]) < _min['root']: _min['root'] = float(_data[15])
                    if float(_data[16]) < _min['span']: _min['span'] = float(_data[16])
                    if float(_data[17]) < _min['title']: _min['title'] = float(_data[17])
                    if (len(_data) == 19) and (float(_data[18]) < _min['quality']): _min['quality'] = float(_data[18])
    else:
        print("Invalid Input File Specified")
    if not _show_quality:
        _max.pop('quality', None)
        _min.pop('quality', None)
        _totals.pop('quality', None)
    print("MAXIMUM ---------------------------")
    print(_max)
    print("MINIMUM ---------------------------")
    print(_min)
    print("AVERAGE ---------------------------")
    for key, value in _totals.items():
        _totals[key] = float(value) / _lines
    print(_totals)
else:
    parser.print_help()

print("Ending Summary Calculations ......")
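pySummary.py tracks each attribute's maximum, minimum, and total with one if-statement per column. The same statistics can be computed generically over named columns, as in this sketch (column_stats is illustrative only, not part of the scripts):

```python
def column_stats(rows, columns):
    """Return {column: (min, max, mean)} over the given rows.

    rows:    iterable of dicts mapping column name -> numeric value
    columns: names of the columns to summarize
    """
    rows = list(rows)  # allow a generator to be passed and reused
    stats = {}
    for name in columns:
        values = [float(row[name]) for row in rows]
        stats[name] = (min(values), max(values), sum(values) / len(values))
    return stats
```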
pyYahoo.py
import sys
from config import *
from rauth import OAuth2Service
import urllib.request
from xml.dom import minidom
from htmlHelper import *
from extract import *
import time
import os
from nltk.corpus import wordnet


class pyYahoo(Extract):
    """ """

    def __init__(self):
        print('\n==================================================================')
        print('%s Initialized' % self.__class__.__name__)
        print('==================================================================')

    def getBackLinks(self, url):
        _repository = {}
        _page = 1
        _iteration = "&b=[ITERATION_STEP]&pz=10&bct=0&xargs=0"
        _MAX = 100
        try:
            print("Page %i" % _page)
            _config = config()
            _config.yahoo()
            _source = url.replace('https', '').replace('http', '').replace(':', '').replace('//', '').replace('www.', '')
            if _source.find('/') != -1:
                _source = _source[:_source.find('/')]
            _url = _config.yahoo_settings['externallinks'].replace('[URL]', url).replace('[BASE_URL]', _source)
            _base = _url
            _html = self.external_links(_url, 'YAHOO')
            _repository = _html.backlinks
            print("Page %i Contains %i Links" % (_page, len(_html.backlinks)))
            while _html.next != '':
                # Will go maximum 100 pages deep in search ... 1000 links approximately
                if _MAX - _page == 0 and len(_html.backlinks) > 0:
                    break
                time.sleep(3)
                _html = self.external_links(_base + _iteration.replace("[ITERATION_STEP]", str(len(_repository) + 1)), 'YAHOO')
                _page += 1
                print("Page %i Contains %i Links" % (_page, len(_html.backlinks)))
                _new = False
                for _link in _html.backlinks:
                    if _link not in _repository:
                        _repository.append(_link)
                        _new = True
                # Stop if no new links were found
                if not _new:
                    break
            print("Total Links = %i" % len(_repository))
        except urllib.request.URLError as e:
            print("Error: %s" % e.reason)
        except ValueError as v:
            print("Non urllib Error: %s" % v)
        return _repository

    # Extract Link Attributes for a Given Search
    def getLinks(self, query):
        # Obtain API Settings
        _config = config()
        _config.yahoo()
        try:
            # Keywords
            _keywords = []
            for _synonyms in wordnet.synsets(query):
                for _s in _synonyms.lemmas():
                    _s = _s.name().replace('_', ' ')
                    if _s not in _keywords:
                        _keywords.append(_s)
            if len(_keywords) == 0:
                _keywords.append(query)
            # Set Repository Structure
            _name = os.getcwd() + '\\YAHOO\\' + query + '_' + time.strftime("%Y%m%d%H%M%S") + ".data"
            _file = open(_name, "w")
            _file.write('index\turl\tdescription\tdiv\th1\th2\th3\th4\th5\th6\tinbound_links\tkeywords\toutbound_links\tp\troot\tspan\ttitle\n')
            _file.close()
            # Direct Call
            _html = self.extract_links(_config.yahoo_settings['url'] + query, 'YAHOO')
            _indexValue = 0
            _page = 1
            _maxPages = 6
            while _html.next != '' and _page < _maxPages:
                for _entry in _html.links:
                    _indexValue += 1
                    if '.pdf' not in _entry:
                        print("Searching: %s" % _entry['url'])
                        _indexes = sorted(self.extract_indexes(_entry['url'], _keywords).items())
                        if _indexes:
                            _file = open(_name, 'a')
                            _file.write(str(_indexValue) + '\t' + _entry['url'])
                            for _index in _indexes:
                                _file.write('\t' + str(_index[1]))
                            _file.write('\n')
                            _file.close()
                if _html.next != '':
                    print("Next: %s" % _html.next)
                    _html = self.extract_links(_html.next, 'YAHOO')
                    print("Pausing for 3 seconds")
                    time.sleep(3)
                    _page += 1
        except urllib.request.URLError as e:
            print("Error: %s" % e.reason)
        except ValueError as v:
            print("Non urllib Error: %s" % v)
wordnet_install.py
# First execute at the command line:
#   pip install -U nltk
import nltk

# nltk.download()
from nltk.corpus import brown

brown.words()
words_set.py
import sys
import os
import random

print("Started ....")
_source = os.getcwd() + '\\words.txt'
_words = []
_archive = os.getcwd() + '\\words_chosen.txt'
if not os.path.isfile(_archive):
    # Create Query History File
    _queries = open(_archive, 'w')
    _queries.close()
# Read Dictionary
with open(_source, 'r') as _file:
    for _line in _file:
        if len(_line.strip()) > 1:
            _words.append(_line.strip())
_file.close()
# Select 100 Random Words
for iteration in range(100):
    _term = random.choice(_words)
    with open(_archive, 'a') as _file:
        _file.write(_term)
        _file.write('\n')
    _file.close()
print("Completed ....")
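Because words_set.py draws each term with random.choice, the same word can appear more than once among the 100 selections. When distinct terms are required, random.sample draws without replacement, as in this sketch (choose_terms is illustrative, not part of the script):

```python
import random

def choose_terms(words, count):
    """Pick `count` distinct terms at random; return all words if fewer exist."""
    if count >= len(words):
        return list(words)
    return random.sample(words, count)
```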
APPENDIX B
Sample data for the Bing search engine algorithm, for which the search term was 'adheres'. The sample data may be found in the solution files under the directory 'src/BING/data/Source' in the file adheres_20170124002946.data.
index	url	description	div	h1	h2	h3	h4	h5	h6	inbound_links	keywords	outbound_links	p	root	span	title
1	http://www.oxforddictionaries.com/us/	0.0	0	0.0	0.0	0	0.0	0	0	0	0.0	14	0.0	0.0	0.0	0.0
2	http://www.thefreedictionary.com/adheres	0.8458781362007167	3.0263273893007403	6.0	48.0	0	0	0	0	0	0.6033519553072626	171	0.0	1.2413793103448276	84.79654875000406	1.3333333333333335
3	http://legal-dictionary.thefreedictionary.com/adheres	0.7868852459016394	0.6792452830188679	24.0	0	0	0	0	0	0	0.6033519553072626	266	0.0	0.782608695652174	68.5714285714285	2.0571428571428574
4	http://medical-dictionary.thefreedictionary.com/adheres	1.0778443113772456	0.6545454545454545	6.0	0	0	0	0	0	0	0.6033519553072626	173	0.0	0.75	71.99999999999994	1.3584905660377358
5	http://www.thesaurus.com/browse/adheres	0.2903225806451613	2.7714285714285714	6.0	0.0	0	0	0	0	0	2.4406779661016946	14	0.7982062780269057	1.2857142857142856	23.454978354978348	1.44
APPENDIX C
Sample data for the Yahoo search engine algorithm, for which the search term was 'adheres'. The sample data may be found in the solution files under the directory 'src/YAHOO/data/Source' in the file adheres_20170126024148.data.
index	url	description	div	h1	h2	h3	h4	h5	h6	inbound_links	keywords	outbound_links	p	root	span	title
1	http://www.thefreedictionary.com/adheres	0.8458781362007167	3.0263273893007403	6.0	48.0	0	0	0	0	0	0.6033519553072626	171	0.0	1.2413793103448276	84.79654875000406	1.3333333333333335
2	http://www.thesaurus.com/browse/adheres	0.2903225806451613	2.7714285714285714	6.0	0.0	0	0	0	0	0	2.4406779661016946	14	0.7982062780269057	1.2857142857142856	23.454978354978348	1.44
3	https://en.wiktionary.org/wiki/adheres	0	0.0	5.142857142857142	0.0	0.0	0	0	0	0	0	20	0	1.2	0.0	1.8
4	http://www.macmillandictionary.com/dictionary/british/adhere-to	0.6585365853658537	0.0	0	0	0	0	0	0	0	0.5373134328358209	3	0.0	0.6923076923076924	22.800000000000008	0.5070422535211268
5	http://legal-dictionary.thefreedictionary.com/adhere	0.8044692737430167	0.6923076923076924	24.0	0	0	0	0	0	0	0.6136363636363636	270	0.0	0.7999999999999999	75.42857142857143	2.181818181818182
APPENDIX D
Sample data for the Bing search engine algorithm utilized in section four of this dissertation. The sample data may be found in the solution files under the directory 'src/BING/data' in the file _historical_complete_r.dat.
index	key	url	description	div	h1	h2	h3	h4	h5	h6	inbound_links	keywords	outbound_links	p	root	span	title	quality
1.0	adheres	http://www.oxforddictionaries.com/us/	0.0	0	0.0	0.0	0	0.0	0	0	498	0.0	14	0.0	0.0	0.0	0.0	82.54609929078015
10.333333333333334	adheres	http://www.adherishealth.com/	0	0.0	0.13636363636363635	0	0	0.0	0.0	0	5	0	7	0.05294453973699256	0.0	0.022641509433962263	0.0	26.8
10.0	adheres	https://en.m.wiktionary.org/wiki/adheres	0	0.0	0.8571428571428571	0.0	0	0	0	0	1	0	2	0	0.1875	0.0	0.3	20.0
11.0	adheres	https://en.m.wikipedia.org/wiki/Adhesion	0	0.21739130434782608	0.0	0.0	0	0	0	0	1	0	5	0.18143798379105275	0.0	0.0	0.0	7.0
12.0	adheres	https://en.wikipedia.org/wiki/AdhesionSurface_energy	0	0.0	0.0	0.0	0.0	0	0	0	9	0	56	0.18143798379105275	0.0	0.0	0.0	43.22222222222222
APPENDIX E
Sample data for the Yahoo search engine algorithm utilized in section four of this dissertation. The sample data may be found in the solution files under the directory 'src/YAHOO/data' in the file _historical_complete_r.dat.
index	key	url	description	div	h1	h2	h3	h4	h5	h6	inbound_links	keywords	outbound_links	p	root	span	title	quality
9.571428571428571	adheres	http://www.adherishealth.com/	0	0.0	0.13636363636363635	0	0	0.0	0.0	0	5	0	7	0.05294453973699256	0.0	0.022641509433962263	0.0	26.8
12.666666666666666	adheres	https://en.wikipedia.org/wiki/Adhesion	0	0.0	0.0	0.0	0.0	0	0	0	5	0	56	0.18143798379105275	0.0	0.0	0.0	53.2
14.6	adheres	https://phys.org/chemistry-news/	0	0.0	0.0	0	0	0	0.0	0	1	0	28	0.07361838648826734	0.0	0.0	0.0	101.0
19.25	adheres	http://atherys.com/	0	0	0	0.0	0	0	0	0	62	0	2	0.0	0.0	0	0.0	40.69387755102041
22.0	adheres	https://en.wikipedia.org/wiki/Anders_Behring_Breivik	0	0.0	0.0	0.0	0.0	0	0	0	50	0	446	0.08955223880597014	0.0	0.0	0.0	194.10810810810