
Indexing Approximations and Optimizations in Search Systems

by

Guillermo Antonio Rodriguez, B.Sc., M.S.

Dissertation

In

Systems and Engineering Management

Submitted to the Graduate Faculty

of Texas Tech University in

Partial Fulfillment of

the Requirements for

the Degree of

DOCTOR OF PHILOSOPHY

Approved

Dr. Mario Beruvides, Ph.D., P.E.

Chair of Committee

Dr. Susan Mengel, Ph.D.

Dr. Patrick Patterson, Ph.D., P.E., C.P.E.

Dr. Jennifer Cross, Ph.D.

Dr. Mark Sheridan, Ph.D.

Dean of the Graduate School

May 2017


Copyright 2017, Guillermo Antonio Rodriguez


TABLE OF CONTENTS

CHAPTER I...................................................................................................................................................... 1

INTRODUCTION ......................................................................................................................................... 1

1.1 Problem Statement ......................................................................................................................... 2

1.2 Research Question .......................................................................................................................... 4

1.2.1 Search Index Regression Approximations .................................................................................... 5

1.2.2 Index Matrix Optimizations .......................................................................................... 6

1.3 General Hypothesis ......................................................................................................................... 7

1.3.1 System Attribute Formulas in Search Engine Index Approximations .......................................... 8

1.3.2 Indexing Optimization .................................................................................................................. 8

1.4 Assumption ..................................................................................................................................... 9

1.5 Research Benefit ............................................................................................................................. 9

1.6 Research Outputs and Outcomes ................................................................................................. 10

1.7 Research Outline ........................................................................................................................... 11

CHAPTER II .................................................................................................................................................. 13

LITERATURE REVIEW ............................................................................................................................... 13

2.1 Introduction .................................................................................................................................. 13

2.2 Content Clustering ........................................................................................................................ 14

2.2.1 Proximity Clustering ................................................................................................................... 16

2.2.2 Query Logs ................................................................................................................................. 17

2.2.3 Page Attributes .......................................................................................................................... 21

2.3 Conclusion ..................................................................................................................................... 29

CHAPTER III ................................................................................................................................................. 31

SYSTEM ATTRIBUTE FORMULAS IN SEARCH ENGINE INDEX APPROXIMATIONS ................................. 31

3.1 Introduction .................................................................................................................................. 31

3.2 Literature Review .......................................................................................................................... 32

3.3 Page Attributes ............................................................................................................................. 32

3.3.1 Title ............................................................................................................................................ 33

3.3.2 Copy ........................................................................................................................................... 34

3.3.3 URL ............................................................................................................................................. 35

3.3.4 Meta Tags ................................................................................................................................... 36

3.3.5 Keyword Proximity ..................................................................................................................... 37

3.3.6 Keyword Prominence ................................................................................................................. 37


3.3.7 Anchor Text ................................................................................................................................ 38

3.3.8 Domain Age ................................................................................................................................ 39

3.3.9 Back Links ................................................................................................................................... 39

3.3.10 Click Through Counter ............................................................................................................. 40

3.3.11 Lexical Context ......................................................................................................................... 45

3.3.12 Attribute Summary .................................................................................................................. 45

3.4 Premise Introspection ................................................................................................................... 46

3.5 Google ........................................................................................................................................... 48

3.5.1 Title Tag ...................................................................................................................................... 48

3.5.2 Meta Tags ................................................................................................................................... 49

3.5.3 URL ............................................................................................................................................. 49

3.5.4 Anchor Text ................................................................................................................................ 50

3.5.5 Image Alternate Tags ................................................................................................................. 50

3.5.6 Header Tags ............................................................................................................................... 50

3.5.7 Google Attribute Summary ........................................................................................................ 50

3.6 Yahoo ............................................................................................................................................ 51

3.7 Bing ............................................................................................................................................... 51

3.8 Algorithm ...................................................................................................................................... 52

3.9 Search Engine Approximation Model ........................................................................................... 52

3.10 Analysis Underpinnings ............................................................................................................... 54

3.11 Bing Formula ............................................................................................................................... 60

3.12 Yahoo Formula ............................................................................................................................ 65

3.13 Data Collection Challenges ......................................................................................... 69

3.14 The Systems Paradigm ................................................................................................ 70

3.15 Summary ..................................................................................................................... 70

3.16 Bibliography ................................................................................................................ 71

CHAPTER IV ................................................................................................................................................. 74

PAGE INDEXING OPTIMIZATION PROPOSITIONS ................................................................. 74

4.1 Introduction .................................................................................................................................. 74

4.2 Big Data ......................................................................................................................................... 76

4.3 System Variables ........................................................................................................................... 79

4.4 Literature Review .......................................................................................................................... 80

4.5 Theoretical Formulation................................................................................................................ 86


4.6 Underpinnings ............................................................................................................................... 88

4.7 Bing Formula ................................................................................................................................. 89

4.8 Yahoo Formula .............................................................................................................................. 91

4.9 Summary ....................................................................................................................................... 94

4.10 Bibliography ................................................................................................................................ 96

CHAPTER V .................................................................................................................................................. 99

CONCLUSION ........................................................................................................................................... 99

BIBLIOGRAPHY .......................................................................................................................................... 101

APPENDIX A ............................................................................................................................................... 109

APPENDIX B ............................................................................................................................................... 150

APPENDIX C ............................................................................................................................................... 151

APPENDIX D ............................................................................................................................................... 152

APPENDIX E ............................................................................................................................................... 153


DEFINITIONS

API Application Program Interface

CSS Cascading Style Sheet

EM Expectation Maximization

HTML Hypertext Markup Language

SEO Search Engine Optimization

SOAP Simple Object Access Protocol


TABLES

2.1 Data Clustering Algorithm 18

2.2 Beeferman & Berger Click Through Results 19

2.3 Input Parameters 21

3.1 Attribute Summary 46

3.2 Google Attribute Summary 51

3.3 Attribute Mapping 57

3.4 Bing Attribute Statistics 60

3.5 Yahoo Attribute Statistics 65

4.1 Bing Quality Metric 89

4.2 Yahoo Quality Metric 91

5.1 System Attributes Summary 99


FIGURES

2.1 Venn Diagram – Content Clustering 16

2.2 Link Sample 19

3.1 Pair Relationship Plots in R – Bing Data 58

3.2 Pair Relationship Plots in R – Yahoo Data 59

3.3 Receiver Operating Characteristic (ROC) Curve – Bing Data 64

3.4 Receiver Operating Characteristic (ROC) Curve – Yahoo Data 68

4.1 Receiver Operating Characteristic (ROC) Curve – With Quality Component – Bing Data 92

4.2 Receiver Operating Characteristic (ROC) Curve – With Quality Component – Yahoo Data 95


CHAPTER I

INTRODUCTION

The notion of the Internet is credited to J.C.R. Licklider of MIT, who had the original idea of using networking concepts to link a group of memos together [Jones (2002)]. Licklider's concepts were later extended and evolved into a project by Lawrence G. Roberts that manifested itself as ARPANET. It was through this idea of networking that the Internet came to be and became such an indispensable tool for so many. This linking of documents across a dispersed network structure led to further developments such as email and the cloud, all concepts for which the idea of linking like data was paramount. This idea of bridging the gap between dispersed object sets is what is of interest in this paper; its core concepts date back to the early 1960s and are credited to a gentleman who simply wanted to link a series of memos together.

In order to facilitate the bridging of like documents across a network a type of markup language was needed; a format, if you will, for the exchange of information. This language is called Hypertext Markup Language, or HTML as it is most widely known. HTML is closely related to the Extensible Markup Language (XML); both are tag-based markup languages composed of start and end tags of the form:

<TAG>Some Content</TAG>

It should also be noted that in XML a separate end tag is not required when the start tag is self-closing, that is, when a forward slash precedes the greater-than symbol (<TAG/>), designating an element whose start and end are expressed in a single tag. The 'TAG' portion refers to a specific designated element in HTML, but could be standard text in the more generic form of XML. Documents are linked on the Internet through the use of a specific type of tag – the anchor tag. The anchor tag has a specific format it adheres to, which is defined by the World Wide Web Consortium (W3C). The W3C (1999) has specified the anchor tag to have the following definition:

<a name="A" href="B" hreflang="C" type="D" rel="E" rev="F" charset="G"></a>


Where:

A = Name of the current anchor tag
B = Web resource location
C = Base language
D = Content type hint
E = Relationship from current document
F = Reverse link description
G = Character encoding

Documents are linked across the Internet by way of the href (hyperlink reference) attribute of the anchor tag, which designates one document or object as being relevant to another document or object.

The search engine movement was advanced by two individuals, Larry Page and Sergey Brin. Larry Page and Sergey Brin met at Stanford University and collaborated on a project – a search engine called BackRub. Larry and Sergey placed their search engine on Stanford University servers and ultimately took their project to the world as Google.com [White (2007)]. While the concepts of networking led to the manifestation of the monster that today is Google.com, it only became so through the direct application of placing order on chaos. Today there are other search engine providers, such as Microsoft with Bing and Yahoo with its own eponymous search engine. Each is a perceived free service, but in reality they are viable businesses making pennies per click on advertisement space that is sold on the real estate that is the search engine results. While the bulk of the motivation is on the corporate side for giants such as Google and Microsoft, there are also secondary markets that have risen out of the forest that is the Internet. Electronic commerce in the United States accounted for $1.8 trillion according to the U.S. Department of Commerce (2014). The ability for customers to find a specific business online has become a necessity, and the argument may even be made that it could mean the survival of a business venture in current times. For a business, knowing how to structure the dynamics of a website is paramount in order to be ranked highly by the search engines.

1.1 Problem Statement

While the major search engine Google (http://www.google.com) provides a general outline for search engine optimization, others such as Bing (http://www.bing.com) and Yahoo (http://www.yahoo.com) do not provide such guidance. Even though Google does


provide a general outline of what constitutes good content, it does not disclose its specific algorithm or the relative weight of the attributes it has explicitly stated warrant interest in its search indexing. Evidence of this may be seen in the secrecy surrounding the PageRank algorithm and its details, or rather the lack thereof.

The process of indexing content quality is also an evolving process that is by no means stagnant and thus represents a moving target to the website owner. Take for example one quality attribute utilized by the search engines to gauge search significance – the keyword index. The keyword index represents the ratio of the keywords found to the total number of words found in the document; or stated differently:

KI = [Keywords Sought] / [Total Document Words]

The total document words is the physical count of all the words the document contains, excluding all HTML markup. In the early days of the Internet there were sites that flooded their content with specific keywords so that the keyword index value would be high, but hid the excess keywords by formatting the foreground text with the same color as the background of the page, thus making the content invisible to the viewer of the page but completely visible to the indexing algorithm of the search engines (spamdexing). Once the search engine providers discovered the illusion they began to penalize the offending website owners, changing the rules of the game and once again adjusting the algorithm to account for the transgression.
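As an illustration of the keyword index computation just described, the following is a minimal Python sketch; the sample page and keyword list are hypothetical, and HTML markup is stripped with a simple regular expression rather than a full parser.

import re

def keyword_index(html, keywords):
    """Approximate KI = keywords sought / total document words."""
    text = re.sub(r"<[^>]+>", " ", html)          # strip HTML markup
    words = [w.lower() for w in re.findall(r"[A-Za-z0-9']+", text)]
    if not words:
        return 0.0
    sought = {k.lower() for k in keywords}
    hits = sum(1 for w in words if w in sought)   # keywords sought that were found
    return hits / float(len(words))

# Hypothetical page and search keywords
page = "<html><body><h1>Tennis Classes</h1><p>Tennis lessons in Lubbock, Texas.</p></body></html>"
print(keyword_index(page, ["tennis", "lubbock"]))  # 3 keyword hits over 7 words, roughly 0.43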

The art of optimizing a website for high rankings on the search engines is referred to as SEO, or Search Engine Optimization, and represents big business. Moving an e-commerce website from the bowels of obscurity to the top of the search engine results, i.e. the first ten slots in a given search, would entail a windfall for any business. The search engines have even come to understand that changes made to a website cannot be percolated through the search engine filter immediately, as doing so would signal to the astute listener whether a change was positive or negative; as such, companies like Google have explicitly stated that it can take up to six months (King, 2008) for website changes to fully take effect in their indexing. By pushing out the feedback mechanism, the search engine conceals from the affected party the value added, or lost, by any given change.


The problem faced by any individual hoping to have positive results on the major search engines is how the document or documents can be structured in such a way as to facilitate positive results for a given search. While the current state of affairs in search engine optimization weighs more heavily on the side of art than science, can this paradigm be changed so that key index factors are followed systematically while designing a web page to ensure a positive search result? Of all of the attributes that are deemed to be relevant in designing content for the web, which are actually relevant and which are secondary or irrelevant? It is this final question that brings forth the true problem statement of this discourse and the underlying justification for its study. The problem this endeavor seeks to answer is whether, given a series of system attributes that may be derived directly from a document or system, a model can be generated that yields an approximation of the search engine index value.

1.2 Research Question

In order to build the regression model there must exist a series of system attributes that drive search results, such that the search attributes are system attributes of the document or its environment, including external linking; the following is explicitly what is sought.

Let:

I = Search Index
A = Document / Environment Attributes

Seek {X ⊆ A | X → ΔI}

The research questions that must be answered are as follows.

1. What are the system attributes for each of the search engine providers studied – Bing and Yahoo?

2. Can the system attributes be combined into a regression model to predict search results?

3. Can the big data paradigm be investigated from a systems perspective to help define system homeostasis?

4. What is the optimized classification formula that may be derived using systems theory?


1.2.1 Search Index Regression Approximations

A solution to the primary research question will begin to be framed by determining the system attributes of the page. What is sought from the research literature is what constitutes worth in a search. Once that basic premise is answered, a mechanism must be created by which data may be tied to the model. Data will be obtained through a search algorithm created for this sole purpose.

The data mining algorithm will be written in the Python programming language given its rich syntax and its access to libraries that facilitate the parsing of HTML and XML content. Python contains a library called HTMLParser (https://docs.python.org/2/library/htmlparser.html) that facilitates the parsing of HTML content and thus will facilitate the extraction of the sought criteria. The algorithm developed will contain modules for each search engine type, and its sole purpose will be to take the derived system attributes and quantify their value for some search criteria. The results of executing the data mining algorithm on the search engine data feed will be a key-value pair mapping, yielding the inputs to the regression model for further analysis and the approximation of some index value that may be compared directly to the search engine results as a comparative metric to assess the merit of the results derived.
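A minimal sketch of such a parsing module, based on the HTMLParser interface referenced above (the Python 2 import from the cited documentation is shown; in Python 3 the same class lives in html.parser), might look as follows. The attributes it extracts – title text and anchor hrefs – and the sample markup are illustrative only.

from HTMLParser import HTMLParser  # Python 3: from html.parser import HTMLParser

class AttributeExtractor(HTMLParser):
    """Collects a few illustrative system attributes from a crawled page."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False
        self.title = ""
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)   # outbound link attribute

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data                 # title tag attribute

parser = AttributeExtractor()
parser.feed('<html><head><title>Sample Page</title></head>'
            '<body><a href="http://www.example.com">link</a></body></html>')
print(parser.title, parser.hrefs)              # Sample Page ['http://www.example.com']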

Given the lexical context of language and the substitution, by the search providers, of lexically equivalent word(s) in search results, it is paramount that the algorithm account for this dynamic. In an effort to mimic this pattern of classification a repository will be used to aid in this effort. WordNet (https://wordnet.princeton.edu/) is an online lexical database that allows for the derivation of lexically equivalent word(s) or phrase(s). WordNet content is currently exposed to the Python programming language by way of a utility called the Natural Language Toolkit; the library may be found at the following URL: http://www.nltk.org/. This library will allow the query term to be mapped to an array of lexically equivalent word(s) and thus classified accordingly, which puts the algorithm in line with the pattern utilized by the search providers.
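As an illustrative sketch of that lexical expansion step, assuming NLTK is installed and the WordNet corpus has been downloaded (nltk.download('wordnet')), a query term can be mapped to its set of lexically equivalent words as follows:

from nltk.corpus import wordnet

def lexical_equivalents(term):
    """Return the WordNet lemma names that are lexically equivalent to a query term."""
    equivalents = set()
    for synset in wordnet.synsets(term):
        for lemma in synset.lemmas():
            equivalents.add(lemma.name().replace("_", " "))
    return equivalents

# Hypothetical query term; 'car' expands to synonyms such as 'auto' and 'automobile'
print(lexical_equivalents("car"))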

The progressive input from the preceding body of work will be utilized in this next section of the paper to build the predictive model approximating the index for each of the search engines – Bing and Yahoo. The system attributes derived for each of the search engines and their corresponding weights will be used in the regression model as follows:

Given

S = Search Engine Index
Xi = ith System Attribute of the Model

Where

S = ∏i (Xi + ΔXi)

or, in a regression context,

S = b1X1 · … · bnXn + u

Where

S = Approximated Search Engine Index
Xn = System Attribute 'n'
bn = Slope of Component Xn
u = Regression Correction

The regression model that will be built will approximate the actual search engine model and will be validated by how closely it predicts the actual results found. This approximation model will be twofold, one per search engine provider, and each will ultimately yield an index value that approximates actual results for content crawled.
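To make the fitting step concrete, the following is a minimal sketch of estimating such an attribute-based model by ordinary least squares with numpy; the attribute matrix and index values are fabricated placeholders, and a simple additive linear form is used here purely for illustration in place of the product form shown above.

import numpy as np

# Hypothetical training data: each row holds system attribute values
# (e.g. keyword index, keywords in title, keywords in URL) for one crawled result.
X = np.array([[0.12, 3, 1],
              [0.05, 1, 0],
              [0.30, 4, 2],
              [0.22, 2, 1]], dtype=float)
# Hypothetical observed index values (e.g. derived from result rank positions).
s = np.array([0.9, 0.3, 1.0, 0.7])

# Append a constant column so the fit includes the correction term u.
X1 = np.hstack([X, np.ones((X.shape[0], 1))])
coeffs, residuals, rank, _ = np.linalg.lstsq(X1, s, rcond=None)
b, u = coeffs[:-1], coeffs[-1]

# Predicted index for a new page's attribute vector.
new_page = np.array([0.18, 2, 1])
prediction = new_page @ b + u
print(b, u, prediction)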

1.2.2 Index Matrix Optimizations

It is the contention that, given an approximation matrix for some system, there exists an enhancement to this matrix that is based upon the system view of link structures. One of the major components identified by the search engine providers as having merit when indexing is the back links that are tied to some page. What is not discussed by the search engine providers is the merit of the back link itself. System homeostasis must place a bounding constraint on back link worth. If back links are open to a simple Boolean interpretation then it is possible for any individual to create an online directory and skew the results toward some desired domain. This is something that I have directly experienced in my professional career, having seen this simple trick or ruse used to invoice thousands of dollars for the service.

In the second part of this body of work, the big data system that is the World Wide Web will be studied from a systems perspective: link structure will be investigated in depth and a solution to the problem frame will be produced in which the feedback mechanism on link structure is investigated and indexed separately for each search result entry. It is the contention here that, through the use of systems theory, an optimized solution to the search indexing paradigm exists, thus creating a general framework that may be used going forward to build search systems.

1.3 General Hypothesis

The current search indexing paradigm of the major search engines is based on key attributes. These key attributes are combined into a composite model and evaluated to derive an index value signifying value or worth given some search performed. This research will build an approximation to this search engine index by following a systematic process of identifying the system components, and then, through the use of an algorithm, deriving metrics for search results from the major search engine providers – Bing and Yahoo. The determined metrics will be utilized to derive a regression model for each of the major search engine providers. This paradigm will allow the building of a deterministic model for search engine optimization for each of the major search engine providers described previously. This model will allow an individual to follow a deterministic guideline for building a website that may be ranked highly by the search engine providers without the need to resort to general rules of thumb or fall victim to unwarranted speculation.

It is hypothesized that there exists a series of attributes surrounding each site that may be used to measure value given some search. Secondly, it is hypothesized that these attributes may be combined into a predictive regression model to predict the current system behavior of the major search engines. The third hypothesis of this paper is that the optimum system state may be best predicted through the qualifying, and then quantifying, of the link structure for the individual components that provide the link equity. The analysis will be done for each individual linking element by analyzing referring page content for value, thus removing the simple summation of total links as the underlying metric.


1.3.1 System Attribute Formulas in Search Engine Index Approximations

Each of the major search engine providers relies on a series of attributes to determine worth, or likelihood of a match, given some search. The first phase of the research in this dissertation will identify each of the system attributes that are deemed to be significant to each of the major search engine providers – Bing and Yahoo. The identification of the system attributes that contribute to overall worth of search will allow for the derivation of a collective that will be used as the inputs to the regression model built to evaluate each of these attributes for the specific search engine provider. While an exact derivation of the model used by each of the major search engine providers is not feasible, what is feasible is the identification of a given set of variables that do affect search engine worth; while this set is a subset of the total, it will yield an approximation of the super set.

The first phase of this two-phase dissertation will derive an algorithm in the Python programming language that takes as input the search results of a given search engine – Bing or Yahoo. The algorithm will then take the search results and assess the value of each of the attributes derived in this phase. If, for example, a system variable is deemed to be 'Key Words Found in URL', then the algorithm will parse the URL, determine the quantity of keywords found in the URL, and assign a value to this parameter or index. The collective of all these parameters, as determined in the first phase of this dissertation, will then be used as the input to determine the parameter values of the regression model that will be built upon the identified system attributes.
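As a concrete sketch of the 'Key Words Found in URL' step, the parameter value could be quantified along the following lines; the URL and keyword list are hypothetical.

from urlparse import urlparse  # Python 3: from urllib.parse import urlparse
import re

def keywords_in_url(url, keywords):
    """Count how many of the sought keywords appear in the URL's host and path."""
    parsed = urlparse(url.lower())
    tokens = re.split(r"[^a-z0-9]+", parsed.netloc + " " + parsed.path)
    return sum(1 for k in keywords if k.lower() in tokens)

print(keywords_in_url("http://www.example.com/tennis-classes/lubbock",
                      ["tennis", "lubbock", "golf"]))  # 2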

The set of variables determined, along with their corresponding values, will be combined into a regression model for each of the search engine providers to derive an approximation model for each of the major providers – Bing and Yahoo. It is the contention of this phase that a regression model may be built from system attributes and their given weights into a paradigm that predicts search result worth.

1.3.2 Indexing Optimization

The second research phase of this dissertation will look at the problem domain as an optimization challenge to the model derived in the first section of this body of work. To this end, a paradigm will be built to enhance the framework developed previously by applying an indexing construct to the link structure and consequently creating a value definition for the link structure elements. This diverges from simply deeming linking content as present or not, and helps to drive the argument forward, where indexing is applied to each element of the paradigm to help create a mathematical proposition whenever possible.

1.4 Assumption

The assumptions that are taken in this body of work are that there exists some set of

system attributes that collectively describe some system state, i.e. some index value that defines

worth given some search. It is also the assumption of this body of work that an algorithm may be

used to derive system attribute component values that may later be used as inputs into a

regression model that can correctly predict behavior or a search index approximation. It is also

the assumption of this paper that the systems perspective may be extended to the software

development domain with tangible benefits. An additional assumption of this research is that the

big data paradigm of volume, veracity, variety, and velocity may be simplified using system

attributes.

1.5 Research Benefit

The research benefits of this body of work are diverse and may be stated as follows:

1. The derivation of a systematic model for designing search engine friendly content
2. A predictive model to assess search engine index state
3. A comparison model to assess the modeling behavior of each of the two major search engine providers
4. Given the regression model paradigm, a system may be created to systematically create searchable content
5. The proof that a systems perspective may be extended to the software development domain with tangible benefits
6. The extension of big data theory to embody decision science by way of systems theory
7. The explanation of an indexing system which is sensitive to initial conditions and shows to be topologically transitive and thus shows a parallel to a chaotic system

The derivation of a systematic model will allow for the creation of an algorithm that may later be used in open source development frameworks that utilize a search component, thus


yielding a concrete benefit to the open source community. The study of a big data system – the World Wide Web – and the deciphering of a predictive model has a tremendous impact on research in decision science going forward. The current perspective on a non-normalized system is to classify the domain under investigation as chaotic, but what if chaos is the result of not properly identifying system attributes, thus yielding a misaligned amalgamation of attributes that shows a non-predictive path? The discovery of truth is no simple task given the quantity of data available to some problem frames and the multitude of truths that may be derived from said data points, i.e. an HTML document. The major contribution of this body of work comes from the second phase of the research: its underlying premise, if proven true, will be significant as it will show that there is a direct correlation between systems theory and an optimization proposition. What may then be stated, if the underlying premise is proven true, is that there are complex systems that fail classification, or are classified sub-optimally, because their predictive model is in an inconsistent state with the system boundary or constraints; as such, this sets forth the precedent for the investigative study.

1.6 Research Outputs and Outcomes

The research outputs and outcomes to come from this investigation, as they relate to the systems engineering domain, are significant, as they will provide a concrete case of how general systems theory may be applied to a computer science domain to yield a predictive systems model. From an engineering management perspective, the research will demonstrate that scientific principles, when combined with systems theory, may yield the output that each manager seeks in an engineering domain - the prediction of system behavior.

In an age where software runs enterprises, the building of predictive models from some vast amount of data is the current focus of domains such as analytics and "Big Data". What is missing from that whole discussion, however, is the systems or engineering management perspective of analyzing the data set. This research hopes to bring to light that while a vast amount of data may be organized and put on a spreadsheet with pretty graphs, this amounts to nothing more than noise. The power or strength of the predictive modeler is not in the utilization of technology that is in vogue, but rather in the systematic thinking that can only be brought forth with a systems


understanding, an engineering acumen, and of course an understanding and application of computer science to do the heavy lifting. It is this combination of cross-disciplinary fields of study that adds true value to the modeling paradigm; to be specific, it is this combination of computer science and systems / engineering management that makes modeling take on a completely new perspective. Modeling intelligently, by first identifying the relevant components through a systems theory perspective and then using computer science to do the heavy lifting, may be the way to solve those problems that plague society such as cancer or some other evil. While this paper does not seek to address such matters, it is a small step towards a more profound argument – there is much that can be done if work is done in a smarter fashion, and as all engineering students are taught, the best design is always the simplest design. Einstein reportedly conceived the theory of relativity while riding a streetcar and noticing the flow of pedestrians across the landscape. Insight is often lost when knee deep in data; the science domain, given the current problems faced, may be lacking because the collective of scientists has not been able to view the landscape and see the pattern in the noise of pedestrians, or data points, or cells in a body – the discussion here could go on ad infinitum.

The research focus in the first phase is on creating a predictive model for search indexing, and while this proposition has a true benefit to the research community, the largest benefit of this body of work lies in the second paper, if the underlying premise is proven factual of course. System modeling must be bound to a boundary constraint that extends past numerical algebraic operation; whether the discourse is about physical systems or metaphysical systems, if that boundary constraint is present then it changes the dialog profoundly. The secondary outcome of this research is the bridging of big data with decision science and the illustration that discovery within big data systems still lies at the doorstep of decision science, by way of correctly classifying system attributes.

1.7 Research Outline

This research endeavor will follow the two-paper dissertation outline. The first paper will create a body of knowledge that will be utilized by the second paper, thus allowing for an incremental benefit in the body of work.


The first paper of the body of work will utilize the Python programming language to build a data mining algorithm for each of the major search engines – Bing and Yahoo. This data mining algorithm will be used to mine the content output of a given search on each of the search engines and, for each result derived, calculate the system attribute scalar values to be used in the regression model that follows. The system attributes will be derived from the existing literature on the subject matter. This paper will measure each of the components found to build a record of value for each index attribute. The first paper will then progress to build the regression model from the system attributes identified.

The second paper will take the inputs of the first paper and build an enhanced modelling paradigm. The second paper will take the big data problem domain that is the World Wide Web and study the problem from a decision science perspective by identifying those attributes that tie back to the system under study. The specific attribute that will be investigated here is the back link equity component. In this part of the study the back link equity will be evaluated from the perspective of value or quality, removing the simple existence argument. By studying the quality aspect of link referrers, it is the contention in this specific body of work that a better regression model may be created; a regression model that better models the current system behavior of the search engine providers.


CHAPTER II

LITERATURE REVIEW

The reasons to study the search optimization domain include factors such as providing a better result set for search engine users, i.e. facilitating research through the identification of relevant content. Another research benefit could be the development of a new indexing algorithm, i.e. an improvement over current search trees. While these noted factors place the burden of inquiry on the provider of the service, for example a Google or a Bing, there is also another party that would be just as interested in understanding the underpinnings of a search algorithm – a consumer of the search. Consumers of search data would include e-commerce site owners or, for that matter, any party selling content online. A high search engine ranking can translate to a direct monetary benefit for any party selling content online. Yet another benefit of this research is the identification of the system attributes that define a good search engine. These underpinnings could then be used to build an open source search engine for universities, nonprofits, and others to use. By identifying the framework of a good search engine such as Bing or Google, the framework could then evolve into a tangible benefit for all to use.

The discussion that follows highlights current theory and the current understanding of the search indexing domain. The literature review will place an added focus on the identification of the factors that would yield a positive attribute in a predictive modeling algorithm; stated differently, what is sought is the series of attributes that determine search relevance in a query.

2.1 Introduction

The available literature on the subject matter of search engine indexing essentially falls into two distinct camps: the research that deals with content clustering in an effort to facilitate searching, and the research that deals with the identification of attribute pairs 'A' that define page relevance. In all of the literature identified there was not one single article that produced a predictive model for web page indexing. Some of the methodologies that have been utilized to cluster content include meta data, syllogism pairing, and query logs. Meta data clustering would include the data returned for a search on local restaurants given a


user's GPS coordinates retrieved from their mobile device where the search was performed.

Query logs represent existing searches made on the device. If a search previously made by the

user on some device is stored locally by the search tool, the browser could retrieve the archived

information as the primary option to display for the interested party. There is already evidence of

this performance enhancement in existing search tools that may be validated by the search hints

provided at the initialization of a query in some browser such as Google Chrome. The searching

over some domain in some browsers caches the search query. Each subsequent search query that

matches this original query results in the query hints displaying this entry first.

The search engine algorithm used by providers such as Google or Yahoo is a much coveted formula, as such a model would allow the web site owner to automatically generate a windfall of profit. The algorithm utilized by the search providers is a closely guarded secret that, while it may not be completely predictable, could be approximated through the identification of the attributes that define its behavior.

Modern search engine providers utilize a combination of content clustering and page attributes to determine the best possible search result for the user. While the search algorithms have evolved over the years and will undoubtedly continue to evolve, for example in the current mobile space, there can be a focused analysis of the state of the system at some point in time. This paper is just that endeavor, but it should in no regard be taken as a blueprint for analysis after the ink is dry, as the future of dedicated prediction is unknown, especially given the state of mobile devices and their utility in the real world that is the user experience. By being able to identify the system attributes that contribute to search relevance, however, it could be argued that any evolution of the current systems in place would rely on the current state of the system. Parents, after all, do look somewhat like their children – at times.

2.2 Content Clustering

Content clustering is the utilization of extra data surrounding the search undertaken. This extra data is important as it increases the probability of providing a positive search experience for the end user. Take for example a user during a lunch hour looking for a Taiwanese restaurant to visit. This could entail a search proposition performed on a handheld device such as a mobile phone. This hypothetical user would be interested in seeing results for restaurants in the


surrounding vicinity, and probably even more interested in seeing those restaurants that are ranked highest on some rating scale by other diners. It would be absolutely futile to display a search result matrix showing restaurants in a different city than the hungry patron, and equally unenticing to show those restaurants with poor reviews. Prior to the mobile revolution, searches had no interest in considering the location of the user, but now, given the mechanism providing the inquiry, a new dynamic has emerged. Another example of content clustering around extra data could be geotags on images. Images on a restaurant website for some entrée, having geotags alongside the verbiage, could deem some body of content as being more relevant than another, i.e. the hungry patron again and the proximity of the dish to them.

The second type of content clustering algorithm identified is the query log. Query logs are typically found on the server and identify queries that have been performed by users. Query history may also be found on the client side, by way of cookies placed on the device by the website performing the search. If you currently use a Chrome browser, point the browser to Google.com and proceed to perform some sort of search such as 'Tennis Classes in Lubbock Texas', you will find that the search quick hints – the drop-down window that displays – show content similar to what you type. After you press the search button you will be provided with a series of results. The resulting query has been added by the provider to your browser history. If you close your browser, return to Google.com once again, and begin to type the same search term as before, you will find that the first entry in the list narrows down to your original search prior to completing the typing effort. Google has stored your search and placed it first in the list to facilitate your search effort.

The third type of mechanism that may be relied upon to determine worth is the set of specific attributes that may be attached to a document. These system attributes may be directly retrieved from the pages searched and indexed according to some formula defined internally by the search engine provider. The work of Jerkovic specifically alluded to this paradigm imposing the most significant constraint on the search boundary. To reinforce this idea one need only look at the SEO guide by Google to understand that page or system attributes play a significant role in the determination of page worth. While the resolution of the problem frame is not as simple as plugging the page attributes into a formula and evaluating through some sort of regression modelling, the general consensus has to be that search engine results are


based upon system attributes to a large extent. The only question that needs to be answered is

how do the system attributes interact with the global boundary constraints to define behavior?

This last question is what the remainder of this dissertation will seek to answer through a

modeling paradigm.

2.2.1 Proximity Clustering

Proximity clustering refers to the grouping of content that is deemed to be correlated across search boundaries. In proximity clustering the principal idea is to return the data set where one search result yields documents like those of another search result. Take for example a user searching for 'Texas Tech University Academic Programs' – a potential student of the university. This user might also be interested in searches relating to 'Apartments'. Search results that contain information about 'Texas Tech University' and about 'Apartments near Texas Tech University' would represent documents that are approximate solutions to the query. Proximity clustering takes those elements that share some common attributes and groups them together as a search result.
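A toy illustration of that grouping idea, using two hypothetical result sets and simple set operations in Python (cf. the Venn diagram of Fig. 2.1), might look as follows:

# Hypothetical result sets for two related searches
search_a = {"ttu.edu/academics", "ttu.edu/admissions", "apartments.example.com/near-ttu"}
search_b = {"apartments.example.com/near-ttu", "lubbock.example.com/rentals"}

overlap = search_a & search_b          # documents relevant to both queries
combined = search_a | search_b         # clustered result set returned to the user
print(overlap, combined)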

Landrin-Schweitzer, Collet, Lutton, & Prost (2003) introduce the notion of lateral thinking in the search domain. The authors formulate the hypothesis that rather than using a blind thesaurus expansion to retrieve like documents simply because the words are similar or share a similar meaning, the search engine should retrieve a data set based upon a query processing phase. The work of Landrin-Schweitzer et al. was applied to a medical database where an enhanced document retrieval algorithm was needed specifically to overcome the hurdles of dispersed databases and differing fields. While the problem frame is different from the case encountered with the major search engines, it does highlight the necessity of data clustering outside of a simple language construct.

Fig. 2.1 - Venn Diagram – Content Clustering (overlap of the result sets of Search A and Search B)

The work of Chang and Chiou (2009) brings to light another dynamic: context. The authors make the argument that search texts such as 'Big Apple' and 'New York', while different in language, represent similar search interests. The intelligent search engine must be able to decipher the context of the language being used. The authors found that an Expectation Maximization (EM) algorithm outperforms a simple text-based algorithm by 2% to 48%. A search on Google for 'the big apple' returns the Wikipedia page for New York City as the first result. Google as a search engine is taking into account the context of the search terms and not simply indexing content on a bit by bit basis.

The work of Heuer and Dupke (2007) introduces the idea of a spatial context for web site content. If content can be tagged with a spatial context then the search paradigm takes on a new dimension, as in the earlier example of searching for a Taiwanese restaurant. While the work of Heuer and Dupke shows that currently only simple content may be indexed this way, it does show the potential for an enhanced search in the near future. The search paradigm with geotags could be extended to images and not just text in the future.

2.2.2 Query Logs

Query logs typically reside on the server where the search is performed, and queries are used to retrieve data sets of user interest. Each query performed will return a specific record sequence from which the search provider can use the actual user click events to gauge effectiveness. Bar-Yossef and Gurevich (2008) report the deterministic metric "ImpressionRank" as the normalized number of impressions a page receives from user queries in a certain time frame. The impressions created by the user base are tracked by the search engine provider to gauge utility. If you perform a search on Google.com for any type of interest point, you will notice towards the bottom part of your screen that once you click a link the request for content first goes to Google for tracking and is then transferred to the destination site for processing of the content. The purpose of this work effort by the search provider is to create a scoring function (S). Bar-Yossef and Gurevich (2009) show that:


S → [0, ∞)     (2.1)

Where there exists a target distribution π on S such that a target function may be defined as f : S → ℝ, having an integral relative to π as:

Int π (f) = Σx f(x) π(x)     (2.2)
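As a toy numeric illustration of equation 2.2, in which the pages, the target distribution π, and the impression counts f are entirely hypothetical, the integral reduces to a weighted sum over the discrete sample space:

# Hypothetical discrete sample space of three pages.
pi = {"page_a": 0.5, "page_b": 0.3, "page_c": 0.2}   # target distribution, sums to 1
f  = {"page_a": 10,  "page_b": 4,   "page_c": 1}     # target function, e.g. impressions

integral = sum(f[x] * pi[x] for x in pi)              # Int_pi(f) = sum of f(x) * pi(x)
print(integral)                                       # 6.4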

With each click event that is performed, Google.com is enhancing its scoring function to derive utility for the given search. King (2008) also reports that the Google toolbar will track each link event – an effort to further add to the precision of its scoring function. Just as is the case for Google, link content that is clicked on in a search result window for Yahoo also goes through a processing function. A link that is clicked on in the Yahoo result page first goes to the search engine provider for processing prior to being routed to the intended target URL. The search providers are tracking user preference among the search results in an effort to incorporate a human element into the paradigm; after all, who better to validate the search results provided than a user of the system themselves.

The work of Beeferman and Berger (2000) also demonstrates the effectiveness that may be achieved through the use of query logs. The authors define an algorithm to cluster like data pairs, which is given below.

Input Click through data C in the form of (query, URL) pairs

Output Bipartite graph G

Step 1 Collect a set of Q of unique queries from C

Step 2 Collect a set U of unique URLs from C

Step 3 For each of the n unique queries create a “white” vertex in G

Step 4 For each of the m unique URLs create a “black” vertex in G

Step 5 If query q appeared with URL u then place an edge in G between the corresponding white and black vertices

Table 2.1 - Data Clustering Algorithm
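
The algorithm in Table 2.1 can be sketched in a few lines of Python. The following is a minimal, hypothetical rendering of the construction using plain dictionaries; the click-through pairs are invented for illustration and the code is not taken from Beeferman & Berger.

from collections import defaultdict

def build_bipartite_graph(clickthrough_pairs):
    """clickthrough_pairs: iterable of (query, url) tuples from a click-through log."""
    query_to_urls = defaultdict(set)   # the "white" query vertices and their edges
    url_to_queries = defaultdict(set)  # the "black" URL vertices and their edges
    for query, url in clickthrough_pairs:
        # Step 5: an edge between the corresponding white and black vertices
        query_to_urls[query].add(url)
        url_to_queries[url].add(query)
    return query_to_urls, url_to_queries

# Hypothetical click-through data.
pairs = [("texas tech", "ttu.edu"),
         ("texas tech", "en.wikipedia.org/wiki/Texas_Tech_University"),
         ("lubbock university", "ttu.edu")]
q2u, u2q = build_bipartite_graph(pairs)
# Queries sharing a URL are "siblings" and cluster together.
print(u2q["ttu.edu"])   # {'texas tech', 'lubbock university'}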


The result of the Beeferman & Berger algorithm is a graph in which the relationship between like nodes is represented with a physical link. The end result is the grouping of like data sets, thus enhancing the user experience incrementally through the collection of query-generated data pairs. The results of the Beeferman & Berger (2000) paper are displayed below. The data were retrieved by way of the Lycos search engine, and a total of 500,000 click-through records were analyzed. The largest point of interest in the findings of Beeferman & Berger lies in the data clustering results. The 'URL Sibling Pairs' represent those queries which occurred in a click-through record and contained the same URL. These records represent the common neighbors in a search, or a clustered record set on "like queries".

Table 2.2 - Beeferman & Berger Click Through Results

Click Through Records: 500,000
Unique Queries: 243,595
Unique URLs: 361,906
Query Sibling Pairs: 1,895,111
Query Edge Density: 6.38 x 10^-5
URL Sibling Pairs: 476,505
URL Edge Density: 7.27 x 10^-6

Fig. 2.2 - Link Sample (a bipartite graph fragment relating Query 1 and Query 2 to URL 1 and URL 2)


The Beeferman & Berger (2000) results show that a given data set may be used in turn as the input to another system, a repository that may be analyzed to score the quality of the initial data set and consequently enhance the user experience.

Query logs demonstrate an important aspect of the user experience: the utilization of user interaction with search engine results. Each query, along with the accompanying link selections, represents the ranking that the search engine should have produced and hence indicates its value. Another element that may be used with the query logs is the user attention span once a link has been followed.

The work of Xu, Zhu, Jian, and Lau addresses the allocation of time spent on a page as a component of the search engine equation. While the idea, when combined with query logs, does represent an interesting dynamic, it has a major drawback: the amount of time a page is left open does not directly represent page interest, as a page may be loaded just prior to an individual heading off to lunch or home for the night. The other major drawback with attention span as a system attribute is that data would be needed on every page on the web in order to assess value. According to Kumar and Saini (2011), Google has acknowledged the use of over 200 attributes to assess value; could attention span be one of the parameters in the Google matrix? The answer here would have to be no, at least from the systems perspective, because by definition attention time is tied to the user and not the page.

An interesting argument made by Zhou (2015) for search results is the argument for personalization. Search engine results could be returned based upon the profile of the user to enhance the user experience. For a technology professional, for example, the search results sought would tend to lean towards the technical domain. This notion of search customization would greatly enhance the user experience as it would eliminate extreme points. Given the current state of technology, where to many users Python is the name of a programming language and not a type of snake, it stands to reason that personalization of search queries would benefit the user. The issue with the argument made by Zhou, however, is that it does not account for two or more individuals using a single device, such as a family tablet, or for the extra body of work that would be needed to configure the framework – a burden on the user and a far from optimal scenario.


2.2.3 Page Attributes

By far the greatest amount of literature discovered, and the topic of two O'Reilly books, relates to search engine optimization through page attributes. Page attributes are the primary indicator that Google and the other major search engines report to track and index in order to determine content value.

A great many researchers have tried to understand search engine results by way of page attributes, with varying degrees of success. The work of Sedigh & Roudaki (2003) actually incorporated a least squares approach to try to model the dynamic behavior of the Google search engine. The authors utilized a series of attributes to model the behavior (Table 2.3), which included the following:

1 PageRank

2 Keyword Frequency

3 Keyword Density (web page title)

4 Keyword Density (web page text)

5 Keyword Density (linked text)

6 Keyword Density (ALT tags)

7 Keyword Prominence (relative to top of document)

Table 2.3 - Input Parameters

The authors then proceeded to incorporate a binary attribute mapping to the defined indexes. If the attribute was found to exist then the authors placed a 1 in the parameter of the least squares equation, and if the index value did not exist then the authors incorporated a 0 value for the parameter in their model. What the authors discovered in their research was that their intended model was not able to predict page rank, but what was possible was the approximation of a perceived pattern in the ranking algorithm. From a systems perspective, while the authors were not able to completely describe the system homeostasis through the identification of all the system attributes, what was possible was the identification of a shadow or fuzzy image of the canvas sought. It should also be mentioned that while the authors incorporated a least squares approach to solving the problem domain, the


deterministic input values into the equation were binary. A least squares formula with binary input parameters may miss its intended purpose if the inputs are not more sophisticated, such as index values. To this point, Google, the authority on the matter, actually alludes to such a fact in its SEO documentation.

One of the factors not brought up by Sedigh & Roudaki (2003) was the link composition of web pages. Hezinger (2007) utilizes the link structure of the web to define a web graph (V, E) where, if there exists a hyperlink between two nodes 'u' and 'v' in the complete space 'V', there is a directed edge in E. This composition of nodes and edges results in a data structure referred to as an inverted index, and it is the mechanism that is used to answer user queries (at least in part). According to King (2008), search engine optimization may be classified in two domains – on-site and off-site. King goes further in his explanation of ranking highly and states that not only the number of links matters, but also the quality of the links and the relationship of the link. Links for the sake of links matter less than links that map the link content to the link description or URL of the link.

The inverted graph that is created through link mapping by the search engines does appear to weigh heavily in the PageRank algorithm, but it does so with a large caveat, this being the quality of the link structure. To view the number of links to a specific primary domain, the following may be entered in a Google search box (Jerkovic 2010):

link:[URL]

where [URL] = Root Domain. The above query to Google will return all the external links to the given domain.

Jerkovic (2010) identifies a series of attributes on a given website that dominate its index-ability by the search engines. Jerkovic identifies the title tag <title> of a document as holding value with the search engines. Given how search engines such as Google track user response to searches, this only furthers Jerkovic's claim of the attribute's importance. The title tag is displayed in the search engine results; the more attractive a title appears to a user, the more likely they are to click on the link, and since user clicks are directly proportional to throughput traffic, as may be found in the query logs, the greater the importance the attribute takes on.


As an exercise, the user may perform a search on the Yahoo search engine and then place the mouse cursor over any link in the result set. What will be discovered is that the link route is a Yahoo URL such as:

http://r.search.yahoo.com/_ylt=AwrBT.Unkl9VoVYAmAhXNyoA;_ylu=X3oDMTByOHZyb21

tBGNvbG8DYmYxBHBvcwMxBHZ0aWQDBHNlYwNzcg--

/RV=2/RE=1432355495/RO=10/RU=http%3a%2f%2fwww.yellowpages.com%2fdallas-

tx%2fbook-stores/RK=0/RS=bTrZnTedC7end7WvokjyBWgYwag-

The URL is a tracking mechanism used by the search provider. Clicking the link also takes the user to the search engine provider first; the link provided is further confirmation of the tracking mechanism being employed by the business. The search engine providers are relying on the community as a whole to tell them what is relevant.

The page copy is the second domain that Jerkovic identifies as holding significance with the search engine providers. Copy is text outside of the HTML markup that is directly visible to the viewing audience. The significance of this attribute is readily apparent when viewing search results. Take for example any result in a list such as that given by Bing; a substring of the copy containing the significant words or phrases searched is shown in the results output. The search engines will actually indicate the items found to be of significance, as those items are displayed in bold in the results output. In the very early stages of the web and online selling this became an exploitable point for vendors. Online sellers were taking page copy and duplicating it across the background of the page so that it would not be visible to the user, but would be completely visible to the search engine data crawler. The result was abnormally good ranking for pages that exploited this loophole. This loophole does not currently exist, and pages that utilize such a mechanism are penalized by the search engine providers.

The document URL is the third domain that Jerkovic identifies as having significance to the search engines in determining value for a given search. The document URL is displayed in the search results of Yahoo, Google, and Bing. The results also show the keyword(s) found from the search within the link text; these are displayed in bold for two of the search engine providers – Google and Bing. Keywords found in the URL designate the entire page as relevant to the search.


Apache, the web server product (http://httpd.apache.org/), even allows for the rewriting of URLs to meet the needs of users wanting to optimize their website links; this may be done through a specific file known as the .htaccess file. The .htaccess file allows for URL mapping so that URLs display relevant content; these mappings are performed through rules written directly into the .htaccess file. This allows for rich URLs such as the following:

http://www.amazon.com/To-Kill-Mockingbird-Harper-Lee/dp/0446310786

The search performed to retrieve the above link (the first result, incidentally) was 'to kill a mockingbird purchase' on Google.com (2015/05/22 at 4:06 PM Central Time).

The fourth page attribute that Jerkovic identifies as being significant to the search engines

is the meta tag <META>. The meta tag may take on different signatures such as the following:

<meta name="description" content="Texas Tech University Dissertation">
<meta name="keywords" content="SEO, Search Engine Optimization">
<meta name="author" content="Guillermo Rodriguez">

The name attribute in the meta tag may take on five distinct values (application-name, author, description, generator, and keywords); as far as the search engines are concerned there are only two that are significant – description and keywords. The meta tag with the keywords attribute

defines all the keywords that should be acknowledged by the search engine for the given page. King (2008) advises using no more than 20 keywords in the keywords meta tag, as applying more would be deemed keyword stuffing. The meta tag with the description attribute defines a short description of the document that may also be used by the crawlers. Jerkovic states that the search engine may choose to ignore the meta description tag and opt to incorporate the description for the site found on Dmoz.org. The description of Dmoz.org follows (http://www.dmoz.org/docs/en/about.html):

"DMOZ is the largest, most comprehensive human-edited directory of the Web"

Jerkovic highlights the lengths to which some of the search engine providers will go in order to validate their content and provide an honest set of results for their users. For description meta tags King (2008) advises using at most 250 words to maximize productivity.


The fifth element on a document that Jerkovic identifies as having significance is the header tags <HX>, where 'X' represents a numeric value starting at 1 and going to 6. The smaller the value of 'X', the larger the text that is displayed for the user. The main tenet here is that the larger a given word or phrase is displayed in header text, the more important it should be considered when ranking that keyword or phrase. It should be pointed out that given an HTML element such as a header tag, it is possible to change the view of the element using cascading style sheets (CSS). Using CSS it is possible to make a <H1> tag display exactly like a <H6> tag. An example follows using CSS notation.

H1 { font-size: 12px; }
H6 { font-size: 12px; }

In both cases the font used for the <H1> and <H6> will be 12 pixels. While the formatting of the document could be viewed by the search engine as proposition Y, it could very easily be viewed as proposition X.

The sixth element identified by Jerkovic as having significance to the search engines is keyword proximity. Keyword proximity refers to the physical distance (in words or bytes) by which one keyword is separated from another. Consider two separate documents having the following copy:

Texas Tech University is located in Lubbock Texas

and

The University of Texas at Austin is a tech hub for scientists

While both lines contain the words 'Texas' and 'Tech', if a user performed a search on 'Texas Tech' it would be the first copy that takes precedence. Viewing the copy a little differently will highlight the argument:

Texas Tech University is located in Lubbock

and

The University of Texas at Austin is a tech hub for scientists

In the first copy the separation between keywords was 0. In the second copy the distance between keywords was 4 words. Keyword proximity relates directly to a language construct that


is incorporated in the search engines by way of indexing relevant content, and while this could be measured in words or bytes or ASCII characters, what matters is the standardization of the metric for a consistent ranking.

The seventh element deemed to be of importance by Jerkovic is keyword prominence. Keyword prominence refers to the position of a keyword with regard to the physical top of the document. The argument here is that content towards the top of the page is of more importance, and consequently the page should be of more importance to the searcher. It should be noted that, much like the header text size, the physical location of copy may be changed through the use of a client side scripting language such as JavaScript. With JavaScript the physical location of copy could be moved on loading of the document to wherever the programmer desired.

The eighth element that Jerkovic identifies as affecting search engine results is the

keywords found in anchor text. This optimization technique is relevant as the pages indexed by

the search engines will display this link for users to see in the search results and will inevitably

result in more clicks by a wider audience if the text is relevant to what is being searched.

According to Jerkovic (2010) Google bombing refers to a technique of employing deceptive text

in the anchor text to fool the search engine into believing that the destination URL holds

reference to some specific content when in actuality it would not. Jerkovic states that while

Google has changed its algorithm to account for the deceptive practice some sites are still

succeeding in increasing their page rank by employing the technique.

The ninth component that Jerkovic identifies as having significance to the search engines is the length of time that a domain has been registered. Jerkovic states that multiyear registration periods are looked at more favorably than single year registrations. The age of a domain registration alludes to a business having been in operation since the domain was registered, and a multiyear registration period alludes to a business owner expecting to be in business for longer than one year. If a restaurant owner only signed a one year lease for a unit, then the property owner would probably be more skeptical of the business owner's confidence in his business. The comparison is apt. The analysis and validation of the point made by Jerkovic is, however, still pending.


The tenth and final component that Jerkovic identifies as having significance in the search engine ranking algorithm is the quantity and quality of referrers to the site. Sun and Wei (2005) define the PageRank algorithm as:

"… link structure-based algorithm, which gives a rank of importance of all the pages crawled in the Internet by the Google's web crawler."

Jerkovic (2010) states that back links from pages that have a PageRank value of at least 4 yield the best results. PageRank is defined by the PageRank wiki as follows:

PR(u) = Σ_{v ∈ B_u} PR(v) / L(v)

where B_u is the set of pages that link to page u and L(v) is the number of outbound links on page v.

The PageRank for an individual page is thus the sum of the PageRank of each linking page divided by the number of outbound links on that linking page. It should be pointed out that this algorithm is not the only mechanism used by the popular search engine provider Google. The PageRank algorithm as defined entails that the more outbound links a contributing site has, the smaller its PageRank contribution to the sites it links to. In turn, it also entails that the greater the quantity of links to a site from external parties, the better the site's PageRank.
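
To make the recurrence concrete, the following minimal Python sketch iterates the simplified formula above over a tiny, hypothetical link graph. It omits the damping factor used in practice and is not the actual Google implementation.

def pagerank(outlinks, iterations=20):
    """outlinks: dict mapping each page to the list of pages it links out to."""
    pages = list(outlinks)
    pr = {p: 1.0 / len(pages) for p in pages}          # uniform starting values
    for _ in range(iterations):
        new_pr = {p: 0.0 for p in pages}
        for v, targets in outlinks.items():
            if not targets:
                continue
            share = pr[v] / len(targets)               # PR(v) / L(v)
            for u in targets:
                new_pr[u] += share                     # summed over v in B_u
        pr = new_pr
    return pr

# Hypothetical three-page link graph.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))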

The anchor tag contains an attribute called 'rel'; when this attribute is set to the value 'nofollow' it indicates to the search engines that the accompanying link should not be attributed to the linked domain (Jerkovic 2010). The reason this option exists is to prevent applications from crawling the internet and creating links back to their site for the sole purpose of creating link equity – just another example of the mechanisms being used by the search engine providers to enforce the rules of fair play. King (2010) also makes the argument that adding external links from your site dilutes the PageRank component of the site, i.e. it increases the denominator L(v) of the equation.

Dahiwale, Raghuwanshi & Malik (2014) used page attributes to predict relevance of

content. They used content found in the head tag <HEAD>, the title tag <TITLE>, the body tag

<BODY>, the meta tag <META>, and the URL to gauge page content worthiness. Their

algorithm downloaded page content from some URL and then parsed the defined tags to

determine the existence of some queried attribute based upon a mathematical expression. The

formula followed by the authors is given below.


t = (Nb*B) + (Nt*T) + (Nm*M) + (Nh*H) + (Nu*U)

where
Nb = number of occurrences of the search string in the body tag <BODY>
Nt = number of occurrences of the search string in the title tag <TITLE>
Nm = number of occurrences of the search string in the meta tag <META>
Nh = number of occurrences of the search string in the head tag <HEAD>
Nu = number of occurrences of the search string in the URL

The authors further assumed that the weight of content was subjective and set it as follows: M = 5, U = 4, T = 3, H = 2, B = 1. The criterion was also established that documents with a total value of 't' > 3 were deemed to be relevant and documents with a value of 't' <= 3 were deemed to be irrelevant. The

conclusion derived by the authors was that their algorithm proved to be between 20 and 70%

effective. From the work of Jerkovic and King it stands to reason that a more formidable

algorithm should have been sought without the underlying assumptions for adequacy or

relevance.
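
As a concrete illustration, the scoring rule above can be expressed in a few lines of Python. The tag counts below are invented and the snippet is only a sketch of the published formula, not the authors' code.

# Weights quoted above: meta = 5, URL = 4, title = 3, head = 2, body = 1.
WEIGHTS = {"meta": 5, "url": 4, "title": 3, "head": 2, "body": 1}

def relevance_score(occurrences):
    """occurrences: dict mapping tag name -> count of the search string in that tag."""
    return sum(occurrences.get(tag, 0) * weight for tag, weight in WEIGHTS.items())

# Hypothetical counts for one downloaded page.
counts = {"meta": 1, "url": 0, "title": 1, "head": 0, "body": 2}
t = relevance_score(counts)
print(t, "relevant" if t > 3 else "irrelevant")   # threshold t > 3 from the paper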

Pal, Tomar, and Shrivastava (2009) studied content based upon link structure and found

positive results. While the work of the authors contained a significant speculative component in

terms of a weight table it does show a positive correlation between content structure and search

results. This parallel is consistent with the publication of the large search engine providers such

as Google.

Another body of work that reinforces this concept is that of Mukhopadhyay, Biswas, and

Kim (2006). The authors studied ranking from the perspective of a weighted attribute

correlation paradigm. A significant component of the Mukhopadhyay et al algorithm was the

concept of ‘Authority’ as a weight. Jerkovic (2010) brings this point to light when he notes that

the search for ‘Hilltop Algorithm’ brings up the Wikipedia page first on Google even though the

content of the page is minimal at best. The degree of authority of a site does play a significant

role in the search engine results and is a valid component of the Mukhopadhyay et al model. The findings of Mukhopadhyay et al, however, were not positive and allude to a


complex model at hand, at least one more sophisticated than the one created by Mukhopadhyay et al.

The work of Beel and Gipp (2009) looked at the age of an article in a search engine result. The authors looked at one particular search domain – Google Scholar. They found that there did not exist any correlation between the age of an article and its ranking. Old content may or may not be relevant. This point alludes to the possible cancellation of an attribute component in a possible formula for search engine ranking – such a formula should be void of the age of the site.

Spirin and Han (2014) point to the fact that a link farm affects PageRank in a positive context. Each link from a farm directly contributes to PageRank positively. The argument of Spirin and Han brings to light the complex dynamic of ranking pages. Pages or sites holding a large influence factor, such as Wikipedia, hold more clout than sites with significant inbound links; but for a second category of sites, i.e. dot-coms, the inbound link flux is paramount and the ranking is directly proportional to inbound links. A preliminary function of value can be defined that is composed of 'Authority' and 'Inbound Links', and it is listed below.

Let
i = Inbound Links
A = Authority Level
µ = Authority Factor
PR = PageRank

where µ = { X | X ∈ {1, 0, 0.5} } and A = { Y | Y ∈ {1, 0} }, such that:

PR = Σ i + µA
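
A minimal Python sketch of this preliminary value function follows; the inbound-link count and authority settings are hypothetical inputs chosen only to show the arithmetic.

def preliminary_rank(inbound_links, authority, authority_factor):
    """PR = sum of inbound links + authority_factor * authority (authority is 0 or 1)."""
    return inbound_links + authority_factor * authority

# Hypothetical page: 120 inbound links, treated as an authority with factor 0.5.
print(preliminary_rank(inbound_links=120, authority=1, authority_factor=0.5))   # 120.5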

2.3 Conclusion

The indexing of content by the search engine providers does not adhere to a simple paradigm; it is rather composed of input from users, the environment, and the individual sites themselves. The indexing model used by the search engine providers is not stagnant; it has evolved over time and will continue to evolve – take, for example, the relatively new medium of searching through mobile devices. Will it someday be possible to search from a mobile device by taking a picture and searching based upon an image rather than text? Probably.


Research literature has tried to pin down the exact paradigm used by the search engine providers without success, so is it reasonable to even try? Some time ago I interviewed for a position with a software firm whose sole purpose is to help online retailers sell content. During the interview process they revealed that they were currently performing the search indexing effort for JCPenney. They even boasted that due to their effort the company was ranked first on the search engine Google.com for a series of product offerings. Was the formula determined in this case? The answer to the question may not be as perplexing as the research community may think; it may just require thinking along different lines. As a side note I should also add that a couple of years later I was watching a news story about Google. The search engine provider could apparently not understand how it was possible that JCPenney was ranked first for 'Women's Shoes' and a few other items. It was the classic case of the hunter being the hunted. When the search engines index content they enter the domain through a specific protocol; the protocol is open and visible. Could the protocol be intercepted by a third party library to tailor the content specifically for an audience and then tailor the output to maximize index values?

The work of Gwizdka and Chignell found that on the search engine HotBot (www.hotbot.com) the words in the title were deemed to be more important than the words found in the body of the document. The finding, while conclusive here, does point to the complex dynamic among the search engine providers and the reality that each provider has its own ranking methodology and rationale for its formula. There will exist a specific formula for search indexing that is proprietary to each vendor and that at best can only be approximated, not absolutely defined, given the input parameters that exist and the ever changing philosophy surrounding the classification of web content – to say nothing of the evolution of technology, another hurdle in the paradox.


CHAPTER III

SYSTEM ATTRIBUTE FORMULAS IN SEARCH ENGINE INDEX APPROXIMATIONS

This study will examine a series of attribute pairs for web pages that are defined by the

search engine providers Google, Microsoft, and Yahoo or have been identified through a formal

literature review to contribute to the search engine results. The chapter will begin with a

literature review of existing vendor specific literature and third party literature to identify all

those system attributes that directly contribute to search engine ranking. The chapter will then

proceed by creating an algorithm in the Python programming language to derive metrics for each of these system attributes that can be used to quantify the parameter pairs identified as having relevance by the search engine providers or researchers.

3.1 Introduction

Google provides documentation as to what attributes are utilized when crawling a web

site to determine worth or value in their specific ranking algorithm. Bing and Yahoo do not

provide a formal specification to aid a search engine optimization effort. In this investigative

study the online documentation provided by Google will be explored to identify all those page

and non-page specific attributes that contribute to a positive page rank. The study will also rely

on third party literature to help identify any attribute pairs that may not be disclosed by the major

search engine providers or provider in this case Google. In the case of Yahoo and Bing only third

party research will be utilized in deriving a predictive model for search engine results as these

two entities do not provide a formal specification. The Google documentation will be analyzed

for the simple case of completeness as this study will focus on the indexing approximation for

the search engines of Bing and Yahoo only.

While the attribute pairs available on each page represent a finite set, the specific combination utilized by each of the vendors is unique and specific to their business. The subjectivity of relevance in the search domain is derived by each of the businesses, and the algorithm to determine value is also an evolving fixture. With this system flux being a constant in the paradigm, it must be stated that this study represents a snapshot, as each of the search providers is free to change and alter its search relevance scheme and its disclosure to the public.


3.2 Literature Review

The literature review of this section will focus on the journal readings that focus on

attribute mappings to search engine optimization. A systems perspective entails the mapping of

attribute components to a domain for which the said components help predict a behavior pattern

in the system. Given the extensive literature available on the possible attributes that may or may not truly affect system homeostasis, this chapter will begin to frame the problem by identifying those attributes of the system that are undeniable: the page attributes.

3.3 Page Attributes

By far the greatest amount of literature discovered, and the topic of two O'Reilly books, relates to search engine optimization through page attributes. Page attributes are the primary indicator that the major search engines report to track and index as the primary component of content value.

A great many researchers have tried to understand search engine results by way of page attributes, with varying degrees of success. The work of Sedigh & Roudaki (2003) utilized a series of page attributes to determine index priority. The page attributes included components such as:

• Keyword Frequency

• Title Keyword Density

• Text Keyword Density

• ALT Tags Keyword Density

• Keyword Prominence

These components were then combined into a least squares regression formula where the

existence of a component was coded as a 1, or 0 otherwise. The model derived by Sedigh & Roudaki (2003) proved to be consistent with page rank, but was not able to predict page rank. From the systems perspective, Sedigh & Roudaki were able to approximate system homeostasis but failed to model it accurately. The work of King (2008) points to a possible explanation: Sedigh & Roudaki may have the composition formula defined in general, but lacking in detail. King identifies a series of attributes that are relevant to search indexing, and does so with a much greater set of parameters than were identified by Sedigh & Roudaki. Also, King


points to a possible extension to the regression formula to evaluate index value, this being that the parameter inputs are not simple binary components, but rather indexes such as one integer divided by another. Take for example the second entry given above, Title Keyword Density; in the King (2008) model this would be defined as:

Title Keyword Density = ∑ Keywords in Title / ∑ Words in Title

This subtle difference makes for a vast difference in the regression coefficients and underlies the main argument of this body of work: a regression formula may be derived from the system attributes of a page and its corresponding link structure, and this formula is tied directly to index values derived from those system attributes.
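
The following minimal Python sketch contrasts the two encodings: a binary presence flag versus a King-style index value such as title keyword density, each fed into an ordinary least squares fit. The page titles, keyword, and observed rankings are hypothetical, and numpy's generic least squares routine stands in for whatever fitting procedure a full study would use.

import numpy as np

def title_keyword_density(title, keyword):
    """King-style index value: keyword occurrences in the title divided by title word count."""
    words = title.lower().split()
    return words.count(keyword.lower()) / len(words) if words else 0.0

titles = ["Texas Tech University Dissertation", "Dissertation and thesis archive", "Texas weather today"]
keyword = "dissertation"
observed_rank = np.array([1.0, 2.0, 3.0])       # hypothetical observed result positions

binary = np.array([[1.0 if keyword in t.lower() else 0.0] for t in titles])
density = np.array([[title_keyword_density(t, keyword)] for t in titles])

# Fit rank ~ feature under both encodings; the index-valued column carries more
# information than the 0/1 flag, which is the point of King's refinement.
coef_binary, *_ = np.linalg.lstsq(binary, observed_rank, rcond=None)
coef_density, *_ = np.linalg.lstsq(density, observed_rank, rcond=None)
print(coef_binary, coef_density)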

One of the factors that was not brought up by Sedigh & Roudaki (2003) was the link

composition of web pages; a fundamental argument that is brought forth by King (2008). King

makes the argument that search engine indexing is composed of two facets – on page factors and

off page factors. King goes as far as to make the argument that links from certain domains such

as those coming from Wikipedia possess more value to the search engines than on page content!

This page link structure may be used to define the structure of the web according to Hezinger

(2007). Hezinger defines the web link structure as a graph composed of elements V and E, where

V identifies a vertex and E identifies an edge. When two nodes ‘a’ and ‘b’ contain a link between

them then this results in a data structure that is identified as being an inverted index. This

inverted index is one of the mechanisms that is evaluated when a user performs a search on a

given search engine. While the link structure does play a critical role in assessing value, it carries a large caveat, this being that value is bound by the quality of the referring vertex node, as identified by King (2008).

3.3.1 Title

Jerkovic (2010) in his book SEO Warrior provides the most comprehensive collection of

site attributes that determine page rank. Jerkovic identifies nine components on a page that are

used to determine page worth. The first component that Jerkovic identifies as having value to the

search engines is the title tag <Title></Title>. The title tag is displayed in the web browser and

the contents of which may be found on search engine result pages. The search engine Google


actually makes a concerted effort to track user response to its page results. On September 17, 2005 a search for the term 'Toronto Ontario' was performed on Google. The search returned a list of results, and the link provided to click on was not to the website itself, but rather to the search engine provider for tracking purposes. The user is routed to the targeted destination only after the search engine provider accepts the user response. The complete URL that was provided to track user activity is given below.

http://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&a

mp;cd=1&amp;cad=rja&amp;uact=8&amp;ved=0CBwQFjAAahUKEwjZip60iv_HAhWFFZIK

HZYTB-8&amp;url=http%3A%2F%2Fwww.toronto.ca%2F&amp;usg=AFQjCNEevtKPgE--

qnlQOiwX5wKT1K6HcA&amp;bvm=bv.102829193,d.aWw

The above URL was prepended with http://www.google.com, as the page results are listed as relative paths. The URL is used as the tracking mechanism by which Google assigns user worth within its internal equation. The reader should note that this differs from Bing, which provides the absolute path to the destination URL; one is left to conclude that Bing is not tracking user response in this manner and places its indexing emphasis on other attributes.

3.3.2 Copy

Page copy is the second component that Jerkovic identifies as having value to the search

engine providers. Page copy in the context of web pages refers to the text embedded within text

tags. Text tags are defined as the paragraph tag <p></p> and the header tags <hx></hx>, where

the ‘x’ component in the header tag <hx></hx> is an integer value between 1 and 6. The integer

value of the header tag designates importance; the smaller the integer value the larger the text

size is displayed. Copy text is important in search queries as the search engines will display

query text within the documents found in the search results. A search for the previous query

‘Toronto Ontario’ displays the search result in the first position with the following text for the

reader:

Parking Ticket Lookup. Review the status of your City of Toronto parking tickets anytime, anywhere, from your computer or mobile device.

As may be verified, the word 'Toronto' is highlighted for the reader; a query term found in the copy text. Inspection of the page content, which may be done by right clicking on a


web page and selecting 'View Source' in Internet Explorer version 11, yields the complete HTML content that was used to generate the page layout. This endeavor results in locating the following:

<p>Read about the latest services, innovations and accomplishments at the City of Toronto.</p>

Inspection of the page attributes will also lead the reader to find searched keywords within div

tags of the document. Div tags are displayed as <div></div> and they are used as containers for

content; do the search engines index these? According to Jerkovic, page copy is restricted to the paragraph and header tags; nonetheless it is an interesting test case to investigate as a possible term in the regression approximation equation. One possible explanation for ignoring the div

tag in parsing content could be to prevent online resellers from keyword spamming. Keyword

spamming is the physical placement of keyword terms in page content that is hidden from the

viewer, but visible to the search engines. In the past web site owners were placing keywords in

the background and hiding the text from viewers by making the foreground color of the text the

same as the background color; invisible to the viewer, but visible to the search engines.

3.3.3 URL

The third component that Jerkovic (2010) identifies as having significance to the search

engines is the document URL. The document URL refers to the physical location of the

document on the Internet. The document URL, along with the query terms found within it, is displayed in the query results for the three search engines Bing, Google, and Yahoo. In the case of Google the found search term is displayed in bold. In the case of the previously used search query 'Toronto Ontario', the first two results for Google displayed the following URLs in the query results:

1. www.toronto.ca
2. https://en.wikipedia.org/wiki/Toronto

An interesting point to note in the above is that the first result is the website for the city of

Toronto and the second result is the Wikipedia page for the city of Toronto. It should also be

pointed out that the search results for the first two entries were exactly the same for Bing and


Yahoo. In the result above, one of the query terms, 'Toronto', is also displayed in bold, as Google is directly informing the reader of the correlation between the search query and the located URL. Bing also displayed the search terms in bold for the viewer to see on the results page. Yahoo was the only search engine that did not place emphasis on the displayed URL by highlighting or bolding the search term within it; does this represent a distinct difference between the search engines? Again, another point to take note of when evaluating the regression model.

3.3.4 Meta Tags

The Meta tag is the fourth component that Jerkovic (2010) identifies as having

significance to the search engine providers. Meta tags physically reside within the header section

of a web document. The reader should note that the phrase web document is used here as there is a vast number of technologies that may be used to render web content, and while the mechanism changes, the web browsers will always read a hybrid of XML or HTML content. Web

pages may have varying extensions such as:

• .asp – Classic Active Server Pages

• .aspx - .NET Server Pages

• .php – PHP documents

• .js – JavaScript server-side scripts

Meta tags are atomic and contain varying attribute key values. The signature of a Meta tag is as

follows:

<meta name="" content="">

The key attribute 'name' may contain one of three values: 'description', 'keywords' or 'author'.

The description designation identifies the Meta tag containing a short description of content that

resides within the page content. The keywords designation identifies a Meta tag containing the

keywords that are pertinent to the document. King advises using no more than 20 keywords in the keywords Meta tag, as applying more results in the search engines deeming the act to be keyword stuffing, which may result in possible blacklisting of the entire page. The

author designation identifies the Meta tag as holding the value of the author of the current


document. Jerkovic states that the description Meta tag may be completely ignored by the search

engines and instead they may opt to use the description found on Dmoz.org. Dmoz.org is a web

directory for site content. This example highlights the extent to which search engine providers will go in order to validate web content. King does state, however, that description

Meta tags should be used and that the length of the description should be limited to 250 words at

most.

3.3.5 Keyword Proximity

The fifth component that Jerkovic (2010) identifies as having significance to the search

engine providers is keyword proximity. Keyword proximity refers to the physical distance

between keywords found in page copy. Physical distance may be measured in bytes or words for

all practical purposes. A common practice in optimizing web content is the placement of

keywords in page copy to coincide with search queries. Take for example the query ‘Toronto

Ontario’, a search engine optimization technique for keyword proximity would be to create page

content as follows:

<h1>The largest city in Ontario is Toronto. Ontario is a province located in eastern Canada.</h1>

Bringing the search words together as they appear in the submitted query results in a direct match to the search query; an optimization over similar page copy such as:

<h1>Ontario is a province located in eastern Canada. The largest city in Ontario is Toronto</h1>

This is a technique that I have applied directly to web site content in order to optimize search engine visibility, and I have seen positive results for the effort.
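
A simple proximity metric can be computed directly from page copy, as in the following minimal Python sketch; the copy strings are the examples used in this section and the word-splitting rule is an assumption.

def keyword_proximity(copy, term_a, term_b):
    """Smallest number of words separating two terms in the copy (0 = adjacent)."""
    words = [w.strip(".,").lower() for w in copy.split()]
    positions_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    if not positions_a or not positions_b:
        return None   # one of the terms is absent from the copy
    return min(abs(a - b) - 1 for a in positions_a for b in positions_b)

print(keyword_proximity("The largest city in Ontario is Toronto. Ontario is a province located in eastern Canada.", "Toronto", "Ontario"))   # 0 (adjacent across the sentence boundary)
print(keyword_proximity("Ontario is a province located in eastern Canada. The largest city in Ontario is Toronto", "Toronto", "Ontario"))   # 1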

3.3.6 Keyword Prominence

The sixth component that Jerkovic (2010) identifies as having significance to the search

engine providers is keyword prominence. Keyword prominence refers to the physical location of

keywords with respect to the top of the document and the type of header tag used to display

content. King states that header tags that display larger text are perceived by the search engines


as having more important text, i.e. the text is more prominent on the page. It should be noted

however that this is a search optimization technique that has been exploited by the online

resellers to enhance their presence on the search engines. Through the use of cascading style

sheets (CSS) it is possible to format the physical appearance of page content with regards to

positioning or size. Two examples follow where the physical location or the physical size of content is changed using CSS.

H1 { font-size: 7px; }

.showBottom { position: absolute; left: 0px; top: 300px; }

In the header tag definition the font size is changed to be smaller than the default; font size for all

header one tags is set to seven pixels. In the second definition for the class ‘showBottom’ the

physical location is changed to be absolute - zero pixels from the left, and three hundred pixels

from the top. An online reseller that is optimizing a page for iPhones for example could display a

series of iPhones in the top of the document and then format the encompassing tag to physically

display the content at the very bottom of the page!
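
As with proximity, a crude prominence metric can be computed from where a keyword first appears in the copy, as in this minimal Python sketch; the normalization to a 0-1 scale and the sample copy are assumptions, and the metric reflects copy order only, which, as discussed above, CSS can decouple from visual position.

def keyword_prominence(copy, keyword):
    """Position of the first keyword occurrence relative to the top of the copy.
    0.0 means the keyword opens the copy; values near 1.0 mean it appears at the end."""
    words = [w.strip(".,!").lower() for w in copy.split()]
    try:
        first = words.index(keyword.lower())
    except ValueError:
        return None   # keyword not present in the copy
    return first / max(len(words) - 1, 1)

copy = "Toronto attractions, hotels and restaurants. Plan your visit to Ontario today."
print(keyword_prominence(copy, "Toronto"))   # 0.0 -> maximally prominent
print(keyword_prominence(copy, "Ontario"))   # 0.9 -> low prominence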

3.3.7 Anchor Text

The seventh component that Jerkovic (2010) identifies as having significance to the

search engines is keywords found in anchor text. Anchor tags are the physical mechanism by

which web pages are linked to one another. Anchor tags are the basis for the creation of the web by Lawrence G. Roberts; the physical linking of dispersed content across a network. Anchor text is important in the search paradigm as the search engines will display the search term in the

URL. For the search performed previously ‘Toronto Ontario’, Bing provided the following links

in the search results:

• www.mapquest.com/maps?city=Toronto&country=ca

• www.torontosun.com/news/ontario


The links identified as being significant show the search terms directly in the URLs. When the

crawlers parse page content any anchor text that is found that would display the above link

references would then be perceived as being relevant to the words found within the text string.

While the above example is derived directly from the search page of Bing the reader should note

that the links do not point to the root URL such as www.mapquest.com, but rather point to an

internal page within the domain. This type of internal link path would only be found by directly

linking from the main page or some subpage within the main domain, or through the inclusion of the link in a site map.

3.3.8 Domain Age

The eighth component that Jerkovic (2010) identifies as having significance to the search

engines is the duration for which the domain has been in existence. Domains that are newer are viewed as having less significance than domains that have been in existence for a longer period of time. Jerkovic also makes the argument that multiyear registrations for domains are viewed more favorably by the search engines than single year registrations.

3.3.9 Back Links

The ninth and final component that Jerkovic (2010) deems to be significant to the search

engine providers is the quality and quantity of back links to the page. The reader should note that

the literature up to now has only identified the value of back links as coinciding with the type of domain, and not with some perceived quality metric of the linking page such as is already derived for individual pages by all the search engine providers. This will be the focus of the next

investigation that will be undertaken. Back links to the page are related to the infamous metric

called PageRank and the basis for the popular ranking algorithm of Google. Sun and Wei (2005)

define the PageRank algorithm as:

"… link structure-based algorithm, which gives a rank of importance of all the pages crawled in the Internet by the Google's web crawler."

An interesting part of the definition given by Sun and Wei is 'importance'; for all intents and purposes this metric may be viewed as the popularity, or the sum of back links from one site to another. Each site essentially has a voting share in what it deems to be pertinent content on the


web; this content pertinence is voted on by each site by its linking to some destination site.

Jerkovic makes the argument that back links from pages that have a PageRank value of at least 4

yield the best results. The result of this assertion by Jerkovic (2010) is that linking to some site

from just any site is fruitless or less than optimal unless the source site has a positive PageRank.

PageRank is defined by the PageRank wiki as follows:

PR(u) = Σ_{v ∈ B_u} PR(v) / L(v)

where B_u is the set of pages that link to page u and L(v) is the number of outbound links on page v.

The PageRank for an individual page is stated as being the sum of the PageRank of each linking page divided by the number of outbound links from that linking site. The end result of the paradigm is that each contributing link adds to a site's presence on the search engines, but does so on an incremental scale that is directly proportional to the PageRank of the contributing site. A search engine optimization technique could then be to go through sites that allow an individual to place comments and proceed to create links back to some desired site. The HTML markup contains an attribute of the anchor tag called 'rel' that, when set to 'nofollow', informs the crawlers not to consider the target URL in the PageRank algorithm; once again an example of the evolving search indexing paradigm and the cat and mouse game that is played between the search engine providers and the would-be high-index aspirers. King (2008) brings to light an interesting aspect of the PageRank algorithm, this being that internal links will dilute the PageRank of a page, i.e. they elevate the denominator L(v) of the above equation.

3.3.10 Click Through Counter

There has been much investigative work performed on search engine query logs and the

study of the correlation between query term(s) and links clicked from a resultant set presented.

Two of the major search engines currently track clickthrough information as it provides the

search engine service a direct correlation between query and user input and all of this for free for

the search engine providers. The major search engines that track user input are Google and

Yahoo. A simple query to each of the tracking search engines results in a DOM structure as follows for each (extraneous content removed).

Yahoo:


<a title="..." href="http://r.search.yahoo.com/_ylt=AwrSbnhhJsJWag4Af1RXNyoA;_ylu=X3oDMTByYnR1Zmd1BGNvbG8DZ3ExBHBvcwMyBHZ0aWQDBHNlYwNzcg--/RV=2/RE=1455593186/RO=10/RU=http%3a%2f%2fwww.siliconvalleyrealestateteam.com%2f/RK=0/RS=SG2kugSazxVgi23QkEhL2RQIi_s-" target="_blank">Silicon Valley Real Estate</a>

Google:

<a href="http://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=1&amp;cad=rja&amp;uact=8&amp;ved=0ahUKEwi8tOnIv_rKAhUQwGMKHX2IB-UQFgjXATAA&amp;url=http%3A%2F%2Fwww.tripadvisor.com%2FVacation_Packages-g28930-Florida-Vacations.html&amp;usg=AFQjCNE9oPZpHjWLTlmd-4vUWmO-yLijwg&amp;sig2=JLXlbtAql51vHolN72kblA&amp;bvm=bv.114195076,d.cGc">The Best Florida Vacation Packages 2016 - TripAdvisor</a>

In each of the examples the direct link is not routed to the destination, but is first routed to the

search provider. Selection of the link does take you to the desired destination location, but only

after first being routed to the search provider. In the case of Bing, the resultant DOM structure

will generate an output similar to the following for the anchor tag structure (once again

extraneous content removed).

Bing:

<a href="https://nodejs.org/">Node.js - Official Site</a>

The tracking of DOM level events for the purpose of clickthroughs is not part of the Bing

paradigm, different from that of the Google and Yahoo search engines. In the current model

developed it is not possible to include the clickthrough rate as part of the overall equation as this

is not a direct measurement that may be derived from the DOM structure or the link structure for

each node in the model, i.e. each of the pages. Carterette and Jones (2008) make note of the issue

with using clickthrough data as a distinguishing characteristic of value as they point to

clickthrough data being skewed. Mathematically speaking the only way to circumvent the

skewness of the data points would be to either collect all possible click events on the node or to

offset the skewness by some compensating factor. In the former case it would not be possible to

have one single search engine to process all clickthrough events on the web. In the latter case this

bias would simply entail an error factor in the equation and move the paradigm to an unusable

state rather than to refinement.


In the case of search results, the findings presented to an audience after the query is sent to the search provider are specific to the DOM elements of the node, not taking into account link structure, at least for now. The title tag is presented, as are the content of the page and the URL of the page. If each of these result entry domains contains search specific content, that content is highlighted for the user to view. The higher the degree of matching content present in each domain shown to the viewer, the more relevant the content is deemed by the search engine and consequently presented to the user. Given this premise, the mathematical context is next presented over each of the domains in the search view.

Title Bold Content ∝ Amount of Query Terms Found in Text
URL Bold Content ∝ Amount of Query Terms Found in URL
Verbiage Bold Content ∝ Amount of Query Terms Found in Verbiage Text

The degree of bold content in the results output of the Title tag, the URL, and the page verbiage

is directly proportional to the text matching in the particular domain. In this investigative study

the argument is made that given the fallibility with clickthrough rate mapping to search terms it

should be excluded from the paradigm. Jansen and Spink (2006) make the argument for

clickthrough data being relevant, but a counter argument to their premise is that a user identifier

or IP address to be specific, cannot be used to judge the behavior of one specific user. Take for

example the case of the query term ‘China’ from two devices at one household, which to the

outside world could be presented as the same IP address. If the query is performed by one

individual where the intent was for ‘Trips to China’ and for a second individual searching for

dinner ware, the context is lost. The preference between search term and selection is skewed in

this case. Smyth et al. (2004) make notion of the paradigm that exists between user query and

language used; the author’s make note that it is paramount for the search engine to understand

the historical context between query and preferences. What Smyth et al. point to is the evolution

of the web and search indexing where derived search content needs to be specific to the user

and/or profile of the actor in the use case model. Smyth et al. point to this evolutionary process

materializing in the form of a Google search service (labs.google.com/personalized). The

argument for personalization is further enhanced by Smyth et al. when they point to a statistical

data set where 15% of the Excite queries were duplicates; so while specific users recycle past


queries in the general case the relationship between query and search maps to a probability in the

teens!

The main hurdle that exists in using clickthrough data to derive a conclusion for worth is

that the widget is not identical. Each user has their own biases and impediments to performing

the task at hand. Smyth et al. point to the fact that 90% of selections, given a data set for some query, are conducted over the top five results; not even the full first page. If a metric is significant over

a data set then this metric needs to be encompassed over the complete data pool not just the top

five results in a typical result set of R (results per page) * P (pages of results), where P is 10 for

Google (10 results per page), 5 for Yahoo (12 results per page), and 5 for Bing (8 results per

page), or stated differently the amount of pages that are available for selection in the result set.

Given the best possible outcome in this scenario, i.e. P = 5 and R = 8, this would entail that the query log analysis or the clickthrough rate would apply to 5/(8*5) = 5/40 = 12.5% of the search results presented in 90% of the cases. Please note that the figure calculated previously represents a best case, or optimum, scenario for the clickthrough data and its relevance to search. In the worst case scenario clickthrough data would apply to 5/(10*10) = 5/100 = 5%, and this would be the case for Google, the majority holder of search market share on the World Wide Web. A

deterministic model cannot be relied upon that has a total scope of 5% of truth to its underlying

argument. It should also be pointed out that the denominator in the equation is derived from the total pages available for viewing on the first query; this value actually has a higher upper limit, as selection of the last link on the results grid provides further results for all the search engines. The 5% estimate from above should therefore be a lot lower if an optimistic point of view is not taken over

the data set.

Another issue that exists with the use of clickthrough data in analyzing preference is that it opens up the model to fraud, as is alluded to by Smyth et al. (2004). In the current age of data

mining it becomes a very simple exercise to create a tool that would generate a query to the

search engine provider and then select some known entity from the resultant set, i.e. the shoe

store link of the unscrupulous website owner. This paradigm is similar to the paradigm that

existed once on the World Wide Web where website owners were placing keywords into the

background of the page, thus fooling the index engines into believing that the page contained

more relevant content than a competitor. While the argument was made in the research literature for the tracking of clickthrough data, what was not addressed, given the context above, is the justification for tracking the data. If the clickthrough data is not used to inform search rank, what else could it possibly be used for? What comes to mind immediately is keywords and pay per click relationships. From an academic perspective this cannot be addressed here, but

it is enough to say that another possibility does exist outside of simple search engine indexing of

pages.

Huang et al. (2013) part from the premise that clicked documents directly represent a sink

that has relevance to the query. There is a fallibility in this argument in that it does not account for errors in judgement, for example the user selecting the wrong link. A counterargument, however, could be that this represents a small proportion of the population; but then please define small. Wang et al. (2014), like Huang and his colleagues, make a similar premise in their case for using clickthroughs as a basis for measuring worth by proxy through search logs collected from the Yahoo news search engine. The researchers used data from May 2011 to July 2011 and drew a comparison of intent based upon the context of a user group. The authors did arrive at a favorable model to predict link favorability given a query under a context, which, much like the work of Smyth et al. (2004), points to the evolution of the World Wide Web and its

indexing. What is not addressed in the work of Wang et al. is the issue that cannot be addressed -

the utilization of third party software to tilt the scales in favor of one sink as opposed to another;

once again what is seen is that measurement of a system context needs to be by way of direct

measurement of the system attributes as is defined in Table 3.1 of this body of work. It is the

only solution put forth that treats each widget as being the same and void of a gross speculative

factor, i.e. the human element.

While many have argued for clickthrough data to be incorporated into an underlying model to predict sink relevance, such a component cannot be justified from a systems perspective, as it would entail that the behavior of a system could be measured external to the specific attributes of the system at hand; by definition this type of attribute represents a proxy to the system. While a proxy element may mirror an actual system variable, it does not follow that it is a fundamental component of the system. Does a clickthrough rate represent a system attribute, as some have argued, or does it represent the manifestation of link placement combined with highlighted text in the search results, which is a direct result of a query matching the verbiage (or lexical context) of the text found by the search engine? The indexing parameter that can be derived by way of the Title tag, together with the index derived by way of the URL and the index derived by way of the message content, is directly measurable from the sink at hand and does not require the incorporation of dissimilar widgets in the equation; this is argued here to be a fundamentally more sound premise from which to part, and the reason for which clickthrough data is absent from the regression formula derived. Also worthy of mention is the fact that this 'clickthrough' metric is not available for measurement given the current state of the system, i.e. there is no mapping of current page rank to clickthrough data available for measurement, which is the underlying reason for its dismissal from this body of work.

3.3.11 Lexical Context

One of the attributes that is being considered as part of the paradigm is the lexical

mapping of search content. Take for example the query of ‘Tshirt’ which could yield ‘T-Shirt’ or

‘T Shirt’. When each of the search engines under investigation was queried for the term ‘Tshirt’

the result set did bring up entries that differed in their textual representation, but where the

lexical context was the same. The regression formula utilized in this endeavor will incorporate

the lexical context of a search through the use of WordNet which may be found at

https://wordnet.princeton.edu. There exists an API that may be used with the Python programming language, the 'Natural Language Toolkit', that may be found at http://www.nltk.org. The algorithm that may be found in the appendix incorporates the Natural Language Toolkit library to break down the search term into its lexical synonyms, from which point the desired indexing parameters can be determined. The ability to

incorporate the lexical context in the search endeavor and the indexing work effort will allow for

the modelling of the indexing to fit in line with the physical world and the process that is

followed by the search engine providers.
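For illustration, a minimal sketch of how the Natural Language Toolkit can expand a query term into its WordNet lemma variants is given below. The helper name lexical_variants is hypothetical and the exact output depends on the installed WordNet data; the authoritative extraction logic used in this study is the algorithm in Appendix A.

    import nltk
    from nltk.corpus import wordnet as wn

    nltk.download("wordnet", quiet=True)  # one-time download of the WordNet corpus

    def lexical_variants(term):
        # Collect the lemma names of every WordNet synset matching the term.
        variants = set()
        for synset in wn.synsets(term):
            for lemma in synset.lemmas():
                variants.add(lemma.name().replace("_", " ").lower())
        return variants

    print(lexical_variants("t-shirt"))  # e.g. variants such as 't-shirt', 'tee shirt', 'jersey'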

3.3.12 Attribute Summary

Table 3.1 presents the list of attribute pairs that have been defined through the literature review to have relevance to search indexing and for which no proxy variable is utilized to

gauge value. This is the direct list of attributes that will be drawn upon to build a predictive

model to gauge an approximation to the search index of each of the major search engine


providers – Bing and Yahoo. In section 3.5 the literature made available by Google will be analyzed to derive those specific page attributes that Google has explicitly laid bare as holding value to its indexing algorithm; this is done here for the sake of completeness. While that list proves to be a subset of the list below, it does highlight those indicators from below that are explicitly validated by Google.

Index  Attribute
1      Title Tag
2      Copy Text
3      URL
4      Meta Tags
5      Keyword Proximity
6      Keyword Prominence
7      Anchor Text
8      Domain Age
9      Back Links

Table 3.1 – Attribute Summary

3.4 Premise Introspection

Studies such as the one being undertaken have been performed in the past, but none of those found incorporated all of the page attributes identified in this paper, and they also made assumptions about the paradigm that led to inconsistent results. Dahiwale, Raghuwanshi & Malik (2014) used a subset of the attributes identified in this section to gauge search index relevance. Dahiwale et al. (2014) utilized the header, title, body, and meta tags along with the URL to build a regression model to gauge search index correlation. The formula derived by Dahiwale et al. (2014) is given below:

t = (Nb*B) + (Nt*T) + (Nm*M) + (Nh*H) + (Nu*U)

Where:
Nb = Number of occurrences of the search string in the body tag <BODY>
Nt = Number of occurrences of the search string in the title tag <TITLE>
Nm = Number of occurrences of the search string in the meta tag <META>
Nh = Number of occurrences of the search string in the head tag <HEAD>
Nu = Number of occurrences of the search string in the URL

The authors also placed the following system constraints on their model: M = 5, U = 4, T = 3, H = 2, B = 1. The authors further surmised that page content where the value of t > 3 was significant and content where the value of t <= 3 was irrelevant. The conclusion reached by the authors was that their algorithm proved to be between 20% and 70% accurate. From the work of King and Jerkovic it stands to reason that the underlying assumptions, along with the lack of the additional components identified in this section of the literature, may have proved to be the demise of their thesis.
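As a concrete illustration of the Dahiwale et al. (2014) scoring rule, a minimal sketch is given below; the function name dahiwale_score and the example counts are assumptions made here purely for illustration.

    def dahiwale_score(counts, weights=None):
        # Weighted occurrence score t, using the constraint weights quoted above.
        if weights is None:
            weights = {"body": 1, "head": 2, "title": 3, "url": 4, "meta": 5}
        return sum(counts.get(tag, 0) * weight for tag, weight in weights.items())

    # Hypothetical page: the search string appears twice in the title and once in the URL.
    t = dahiwale_score({"title": 2, "url": 1})
    print(t, "significant" if t > 3 else "irrelevant")  # 10 significant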

Pal, Tomar, and Shrivastava (2009) studied search engine results based upon link structures and found positive results. A point of contention with their research effort must be acknowledged: it contains a speculative component in the form of a weight table. The research does show, however, that a positive correlation between content structure and search results can be derived. This finding by Pal et al. (2009) does highlight a parallel between their body of work and the large search engine providers such as Google.

Another body of work that mirrors the work of Pal et al (2009) is that of Mukhopadhyay,

Biswas, and Kim (2006). The authors studied ranking from the perspective of a weighted attribute correlation paradigm. A significant component of the Mukhopadhyay et al. (2006) algorithm was the concept of 'Authority' as a weight, or a load bearing component, on the system. The work of Mukhopadhyay et al. (2006) highlights the findings of Jerkovic (2010) and specifically the argument that Jerkovic makes with regards to the source of information having a bearing on rank. Jerkovic states that searching for the term 'Hilltop Algorithm' brings up the Wikipedia page first on Google even though the content of the page, i.e. the header and page tags inner text, is minuscule. The source of information will need to be addressed in any regression model that

is created to approximate the actual findings of the search engines unless the data dictates

differently. To the point of Mukhopadhyay et al (2006) there is another argument that is made by


Spirin and Han (2014) that highlights the complex dynamic of search engine results. Spirin and

Han point to the fact that a link farm affects the PageRank in a positive way; so while Mukhopadhyay et al. (2006) make the argument of source being a factor, Spirin and Han make the argument that contributing links also affect rank irrespective of source. This divergence in paradigm between Spirin and Han and Mukhopadhyay et al. (2006) highlights the complexity of the search engine paradigm – using a network nomenclature, sinks are affected by both the source and a weighted paradigm. This paradigm may be stated as follows:

Let:
i = Inbound Links
µ = Authority Factor
PR = PageRank
Where: µ = { X | X ∈ [Real Numbers] }
Such That: PR = ∑ i * µ

The above represents a partial definition of what needs to be created – a regression model

encompassing all of the constraints identified in this section of the paper.
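To make the partial definition above concrete, a minimal sketch follows; the inbound links and their authority factors are hypothetical values chosen purely for illustration.

    # Hypothetical inbound links paired with assumed authority factors (mu).
    inbound_links = [("wikipedia.org", 0.9), ("someblog.example", 0.2), ("forum.example", 0.1)]

    # PR = sum over the inbound links of i * mu, with each link counting once (i = 1).
    page_rank = sum(mu for _, mu in inbound_links)
    print(page_rank)  # 1.2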

3.5 Google

Google provides its user base with a search engine guide to utilize when formatting page

content; this guide may be found at the link below.

http://static.googleusercontent.com/media/www.google.com/en/us/webmasters/docs/search-

engine-optimization-starter-guide.pdf

The Google guide provides a general reference to follow when optimizing page content for the web. While the guide identifies some of the attributes that are deemed significant by

Google it does not disclose the complete algorithm. The specific page attributes that Google

deems to be relevant in their evaluation of page content follows next.

3.5.1 Title Tag

The first component that Google identifies as being significant to their crawler is the title

tag <title></title>. The title tag is displayed for users by Google in the search results and words


entered in the query that are found in the title of the document are highlighted for the reader.

Google also states that each page created should have a unique title tag; does this allude to a

possible glimpse into their algorithm? Could it be that Google has placed the title of the page in

some sort of search tree that is traversed and that each duplicate node is flagged?

3.5.2 Meta Tags

The second component that Google identifies as having significance to their algorithm is

the description meta tag. The description meta tag contains the key value pair

name=”description”. Google also makes note that the description text may be used by Google in

the search engine results. Content that is displayed for the user on the search engine results page

is significant because the bold text shown is more likely to catch the eye of a normal reader and

consequently be clicked by the reader.

3.5.3 URL

The third component that is identified by Google as having significance to their search

engine is the URL. Keywords in the URL will be displayed for the user on the search results

page and be highlighted – again another eye catcher for the reader. Google also makes note that

short URLs are preferred over long ones and that a consistent directory structure should be used to display content. Take for example the case of an online retailer selling shoes; then a directory structure, and consequently a URL structure, such as that given below would be preferred.

/mens/shoes/nike/1890.html
/womens/shoes/Adidas/1890.html

The above is the preferred method by Google to structure content as opposed to some URL structure such as any of the following:

/mens_shoes_nike_1890.html
http://www.somesite.com?id=18902569775

Does the above point to an indicator of what Google is doing internally to evaluate worth? A

string of text separated by some specific character can be split into an array at which point the

specific keywords associated with the URL become clear. To compute an index from this


perspective becomes easier to perform than performing the same operation over the first or

second immediate example from above.
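A minimal sketch of that splitting operation is given below; the helper name url_keywords and the token filtering rules are assumptions made for illustration, not a statement of what Google actually does internally.

    import re
    from urllib.parse import urlparse

    def url_keywords(url):
        # Split the URL path on slashes, underscores, hyphens, and dots.
        tokens = re.split(r"[/_\-.]+", urlparse(url).path.lower())
        # Drop empty tokens, bare numbers, and the file extension.
        return [t for t in tokens if t and not t.isdigit() and t != "html"]

    print(url_keywords("http://www.somesite.com/mens/shoes/nike/1890.html"))
    # ['mens', 'shoes', 'nike']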

3.5.4 Anchor Text

The fourth component that Google identifies as having value to its search engine is

navigation text or anchor text if you will. Google states that anchor text should be simple text

and as short as possible. Google has directly stated in this case that length does matter. Does

length matter because it entails a smaller byte array to store internally to their system or does it

matter because an index is calculated for each anchor text component?

3.5.5 Image Alternate Tags

The fifth component that Google identifies as having value to their search engine is the

alternate attribute of the image tag. The 'alt' component identifies a text string to associate with each image, meta text of the image if you will. For the visually impaired the text component of the image tag has significant meaning, and it is a clear indicator of how Google is tailoring content to its audience.

3.5.6 Header Tags

The sixth element that Google identifies as having value to their search engine is the

header tags <hx></hx>, where ‘x’ is an integer and a value between one and six. This assertion

by Google is consistent with King and Jerkovic and a clear indicator that header tags need to be

encompassed in the regression model to build.

While the Google specification only alludes to six components driving search results, the work of the researchers discussed in this section clearly indicates a more complex model at work.

The regression model to build must incorporate the six components defined by Google, but it

must also encompass a larger framework and is the focus of the discussion to follow.

3.5.7 Google Attribute Summary

Table 3.2 lists those variables that have been identified by way of the Google documentation to provide worth to the search indexing paradigm of the search provider. Two


distinct differences from the attribute pairs identified in Table 3.1 are the ‘Image Alternate Tags’

and the ‘Header Tags’. The header tags have been classified in the generic model defined as

‘Copy Text’ as the current styling and formatting paradigm using Cascading Style Sheets makes

presentation flexible and what may be classified as a header one tag may actually be paragraph

content. Also, the 'Image Alternate Tags' attribute is absent from the generic formula since these tags point to images and not to web pages directly, i.e. not to a sink that has a mappable attribute array, and this is the reason why they were left off of the generic model.

Index  Attribute
1      Title Tag
2      Meta Tags
3      URL
4      Anchor Text
5      Image Alternate Tags
6      Header Tags

Table 3.2 – Google Attribute Summary

3.6 Yahoo

Yahoo does not provide a search engine guide for optimizing page content. While there may exist some similarity between Yahoo's search engine categorization and that of another provider such as Google, there is no documentation provided by Yahoo alluding to the page attributes the search engine provider finds relevant. For Yahoo this paper will assume that page indexing follows a similar pattern to that of Google, and as such this paper will map those same page attributes to the search engine provider.

3.7 Bing

Bing, much like Yahoo, does not provide a general guide to optimize search engine page content. Bing does provide its user base with an online tool to help determine which keywords are relevant for users, in an effort to help the user create page copy that would be of interest to its user base. Implicitly this implies that page content is deemed significant by the search engine


provider, but explicitly there is no key indicator as to what methodology may be used to

maximize search engine presence.

3.8 Algorithm

An algorithm has been defined to extract content from the search engine providers and

categorize the content for model fitting. The algorithm derived and available for viewing follows

the following steps:

1. Retrieve search engine content for a query term, by search engine provider. A repository of available English words was used to create the query terms, where each query term was selected at random from the list. A partial list of English words may be downloaded here: http://www-01.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt. The document contains a total of 109,583 words. The algorithm selects a word in sequence from a file called words_chosen.txt that may be found in the GitHub repository https://github.com/guillermorodriguez/Dissertation. The words_chosen.txt file was created by parsing through words.txt, as downloaded from the first link given above, and choosing a word at random that was then written to words_chosen.txt. The process of choosing a random word was repeated a total of 100 times in order to create 100 distinct words to query upon. The specific Python algorithm used to extract the 100 distinct words may be found within the file named words_set.py; a minimal sketch of this word-selection step is given after this list. The files referenced here may be found in the GitHub repository given above and also in Appendix 'A' of this body of work.

2. Crawl each URL retrieved and determine if the system attributes contain the query term. The lexical dictionary WordNet is used to index lexical content. The lexical content is mapped with the use of the Natural Language Toolkit library. For each attribute determined, the specific indexing value over each system attribute is tabulated.

3. Collectively map the attributes to the mathematical model and write them to file for later processing through the R utility.
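The sketch below illustrates the word-selection step referenced in step 1, assuming words.txt (the downloaded word list) is in the working directory; the sampling call shown is a simplification, and the authoritative version is words_set.py in the repository and Appendix A.

    import random

    with open("words.txt", encoding="utf-8") as source:
        words = [line.strip() for line in source if line.strip()]

    chosen = random.sample(words, 100)  # 100 distinct query terms chosen at random

    with open("words_chosen.txt", "w", encoding="utf-8") as target:
        target.write("\n".join(chosen))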

The algorithm is evaluated for a series of terms to create a database mapping queries to results. The results are then used to create an approximation model for each of the major search engines

Bing and Yahoo. The regression model that is used by the algorithm is explored in the next

section of this body of work.

3.9 Search Engine Approximation Model

Each of the search engine providers Bing and Yahoo has a paradigm that is adhered to and used to rank content. While it was only Google that provided a definitive guide to its search paradigm, it can be stated from past experience that the algorithms are similar in the criteria utilized to rank content; otherwise a search engine optimization effort would be vastly different for each search engine – something which is not found in the practical domain.

What follows is the best approximation that may be determined from the research literature as to

what the ranking algorithm may be for each of the search engine providers in the form of a

formula.

Bing is one of the search engine providers that does not disclose the intricacies of its paradigm; in this case, as is also the case for the Yahoo search engine, the base formula utilized will be derived from the research literature. The base formula for these two search engine providers will be identical and will be refined by way of the data obtained from each of the two search engine providers respectively. The data to be used as inputs into the model formula defined below will come directly by way of the data mining algorithm defined in Appendix A of this document.

Let:
S = Search Engine Index
Bn = Slope of Component 'n'
D = Meta Tag Description Index → Keywords / Total Words in Description
I = Inbound Links → External Links to Page
K = Meta Tag Keywords Index → Keywords / Total Words in Keywords
O = Outbound Links → Links to External Pages
P = Page Copy Index → Keywords in Copy / Total Words in Copy
T = Title Index → Keywords in Title / Total Words in Title
U = URL Index → Keywords in URL / Total Words in URL

Given the above parameter definitions then the formula for each of the search engine

providers Bing and Yahoo may now be defined as follows:

S = B1D * B2I * B3K * B4O * B5P * B6T * B7U + µ    [3.2]

Missing from the above equation, though identified in the literature as being relevant, is the user clickthrough index. This value cannot be measured from the system attributes and has hence been left out of the equation.
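As a concrete illustration of how the ratio indices defined above (D, K, T, and U) can be tabulated for a single field, a minimal sketch follows; the helper name keyword_index and the whitespace tokenization are assumptions made here, and the production extraction code is the Appendix A algorithm.

    def keyword_index(text, query_terms):
        # Keywords found in the text divided by the total words in the text.
        words = text.lower().split()
        if not words:
            return 0.0
        hits = sum(1 for word in words if word.strip(".,;:!?") in query_terms)
        return hits / len(words)

    # Hypothetical title and query, used only to show the ratio.
    print(keyword_index("Cheap T Shirt Store", {"t", "shirt"}))  # 2 of 4 words -> 0.5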


In the model formula the parameter index 'I' may be measured for Yahoo by way of the advanced search parameters and the issuing of a query such as 'link:[URL] -site:[BASE_URL]', where URL is the end point being indexed and BASE_URL is the root URL. The list of

available advanced query parameters for Yahoo may be found here

http://www.wikihow.com/Count-Inbound-Links-to-a-Website-With-Yahoo%21. In the case of

Bing the inbound link index may also be tabulated through the issuing of the query term ‘link:

[URL] -site:[BASE_URL]’ just as in the case for Yahoo.
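A minimal sketch of assembling that advanced query string is shown below; the function name and the example URLs are hypothetical, and only the query syntax quoted above is taken from the text.

    def inbound_link_query(url, base_url):
        # Build the 'link:' query used to count inbound links, excluding the site itself.
        return "link:{} -site:{}".format(url, base_url)

    print(inbound_link_query("http://www.somesite.com/mens/shoes/nike/1890.html",
                             "www.somesite.com"))
    # link:http://www.somesite.com/mens/shoes/nike/1890.html -site:www.somesite.com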

3.10 Analysis Underpinnings

The algorithms depicted in Appendix A were used to create the data input files that were

later fed into the statistical modeling utility R (https://www.r-project.org/). The algorithms

utilized were created using the Python programming language (https://www.python.org/) version

3.4.3. The version of R that was used was 3.3.1. The complete source code along with the data

files may be obtained through a GitHub repository at the following URL

https://github.com/guillermorodriguez/Dissertation.

Given equation 3.2 from above there is an identified attribute of the formula termed

‘Page Copy Index’ that represents the keywords in the page copy divided by the length of the

copy text. Page copy is found in HTML documents in variants of tag definitions such as the

header tags or paragraph tags for example. This composition element termed ‘Page Copy Index’

may further be defined as follows.

Let:
P = Page Copy Index → Keywords in Copy / Total Words in Copy
DI = Division Tag Copy Index → Keywords in Tag / Total Words in Tag
H1 = Header One Index → Keywords in H1 Tag / Total Words in H1 Tag
H2 = Header Two Index → Keywords in H2 Tag / Total Words in H2 Tag
H3 = Header Three Index → Keywords in H3 Tag / Total Words in H3 Tag
H4 = Header Four Index → Keywords in H4 Tag / Total Words in H4 Tag
H5 = Header Five Index → Keywords in H5 Tag / Total Words in H5 Tag
H6 = Header Six Index → Keywords in H6 Tag / Total Words in H6 Tag
PA = Paragraph Index → Keywords in Paragraph Tag / Total Words in Paragraph
SP = Span Index → Keywords in Span Tag / Total Words in Span Tag

Where: P = DI + H1 + H2 + H3 + H4 + H5 + H6 + PA + SP

Such That:

B5P = B51DI + B52H1 + B53H2 + B54H3 + B55H4 + B56H5 + B57H6 + B58PA + B59SP    [3.3]
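For illustration, a minimal sketch of tabulating the per-tag copy indices from a fetched page is given below, assuming the BeautifulSoup library is available; the tag list, the helper names, and the example markup are assumptions made here, and the production code is the Appendix A algorithm.

    from bs4 import BeautifulSoup

    COPY_TAGS = ["div", "h1", "h2", "h3", "h4", "h5", "h6", "p", "span"]

    def ratio(text, query_terms):
        # Keywords found in the text divided by the total words in the text.
        words = text.lower().split()
        return sum(w.strip(".,;:!?") in query_terms for w in words) / len(words) if words else 0.0

    def copy_indices(html, query_terms):
        # One keyword ratio per copy tag; their sum is the Page Copy Index P.
        soup = BeautifulSoup(html, "html.parser")
        return {tag: ratio(" ".join(el.get_text(" ", strip=True) for el in soup.find_all(tag)),
                           query_terms)
                for tag in COPY_TAGS}

    indices = copy_indices("<h1>T Shirt Shop</h1><p>Buy a t shirt today</p>", {"t", "shirt"})
    print(indices, "P =", round(sum(indices.values()), 3))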

The page copy content is essentially the amalgamation of each of the individual copy tags

that are composed of the header tags, division tag, paragraph tag, and span tag text. Once

equation 3.3 is substituted into equation 3.2 the following formula is derived.

S = B1D * B2I * B3K * B4O * (B51DI + B52H1 + B53H2 + B54H3 + B55H4 + B56H5 + B57H6 + B58PA + B59SP) * B6T * B7U + µ    [3.4]

Equation 3.4 given above is a composite formula that combines the system attributes to

assess value. The equation given is a departure from the standard linear regression formula such

as that given by Dahiwale et al (2014) and for which the composition may be supported by

plotting the linear correlation between the dependent and independent variable pairs. This

attribute may be investigated through R with the use of the pairs function. The pairs function of

R displays a grid plot of independent to dependent variable for a given data matrix. The data

matrix in this case was loaded into R through the use of the table read function.

Figure 3.1 below shows this plot for the data matrix given in Appendix B. The data in

Appendix B was collected through the use of the BING algorithm and was composed of data

collected for the search term 'adheres'. The data plots in figure 3.1 show that there does not exist a simple linear correlation between the dependent variable – the page index – and any of the independent variables. The relationship plots are a clear indicator that system homeostasis is complex and not bound to a single relationship, which further supports the argument posed

in the form of equation 3.4.

Figure 3.2 below shows this plot for the data matrix given in Appendix C. The data in

Appendix C was collected through the use of the YAHOO algorithm and was composed of data

collected for the search term 'adheres'. The data plots in figure 3.2 show that there does not exist a simple linear correlation between the dependent variable – the page index – and any of the independent variables. The relationship plots are a clear indicator that, much like the case of the BING search engine, there is a complex dynamic that affects system homeostasis.


The mapping of the system attributes and the collection file fields is displayed in Table 3.3. Each of the two data mining algorithms outputs a tab-delimited file structure with the fields identified in Table 3.3, mapped to the system variables as listed in the table.

The next step in the process is to evaluate the formula posed for each of the search engine

providers that would allow for the determination of goodness of fit given the data collected. Each

of the search engines had data collected through the algorithm provided in Appendix A of this

body of work that provided the data input to the R statistical tool.

Given that the data retrieval process was not uniform, as each of the search engine providers tailored the output to prevent bodies of work such as this from disclosing proprietary information, each of the data mining algorithms was executed multiple times for the same search terms in order to obtain a statistical average of search indexing position. In the case of Bing the

algorithm was executed 4 times over the 100 randomly chosen query terms yielding a total of

400 data files; the base files that the algorithm generated may be found in the GitHub repository

under the node src/BING/data/Source.

In the case of the Yahoo data extract the algorithm was executed a total of 4 times for

each of the 100 random keywords that were identified earlier. This process created a total of 400

input files that may be found in the GitHub repository under the directory

src/YAHOO/data/Source.

In each of these two data extract processes the data files were generated in a tab-delimited format that allowed for the parsing of the data sets in a consistent manner. Please note that while the .DATA file extension was used for the data extraction files, this does not designate these files as being proprietary to any system, application, or programming language; it was merely the chosen convention.


File Header (Attribute)                        System Attribute
Index                                          S – Search Engine Index
Url                                            U – URL Index
Description                                    D – Meta Tag Description Index
Div, H1, H2, H3, H4, H5, H6, P, Span           P – Page Copy Index
Inbound_Links                                  I – Inbound Links
Keywords                                       K – Meta Tag Keywords Index
Outbound_Links                                 O – Outbound Links
Title                                          T – Title Index

Table 3.3 – Attribute Mapping


Fig. 3.1 – Pair Relationship Plots in R – Bing Data


Fig. 3.2 – Pair Relationship Plots in R – Yahoo Data


3.11 Bing Formula

The sample data provided in Appendix B shows the column headers on the first line of the data sample file; the fields are tab delimited and the rows are newline terminated. Given the space allocations, each line bleeds through to the second line in the sample data example. The maximum, minimum, and average values are given in Table 3.4 for each of the base system attributes as tabulated through the input file _complete_r.dat. The algorithm that was used to create Table 3.4 may be found in Appendix A under the title pySummary.py.

Attribute        Minimum  Maximum              Average
Description      0.0      1.3                  0.0
DIV              0.0      100.0                0.18378059758120993
H1               0.0      10.0                 0.20716051797863566
H2               0.0      40.0                 0.21060367446109785
H3               0.0      14.0                 0.04897304115383433
H4               0.0      7.0                  0.012779480243865575
H5               0.0      3.545454545454545    0.0022305186012005532
H6               0.0      2.8137651821862346   0.0013185283482377767
Inbound Links    0.0      582.0                14.889566203572441
Keywords         0.0      2.0                  0.04248639370365007
Outbound Links   0.0      1643.0               35.47462432662319
Paragraph        0.0      66.0171397982293     0.23883382748042553
URL              0.0      1.3571428571428572   0.12196129911513241
Span             0.0      93.01220720322118    1.1248293028668193
Title            0.0      2.723809523809524    0.16058082343991514

Table 3.4 – Bing Attribute Statistics

The data contained in the file _complete_r.dat was fed into R through the use of the read.table function, such as in the sample given below. The place holder '[File Including Path]' is the

resource location of the data to provide as input. The place holder ‘[Field Separator]’ is the

escape character sequence that delimits each field such as the tab character or ‘\t’. The attribute

‘header’ designates the header fields in the file where a value of ‘TRUE’ implies existence.

lm.data ← read.table("[File Including Path]", sep="[Field Separator]", header=TRUE)    [3.5]

The model or equation to fit by way of a regression is specified using the notation given

immediately below – equation 3.6. The sample data file provided in Appendix B has the system

attributes in the header which are defined in Table 3.3. Equation 3.6 and the variable mapping

pairs by way of Table 3.3 result in the generalized model shown in equation 3.7.

lm.fit ← lm(y ~ x)    [3.6]

S = D * (Div + H1 + H2 + H3 + H4 + H5 + H6 + P + Span) * I * O * K * U * T    [3.7]

The model derivation that results given equation 3.7 and the format required by R results

in the model defined in equation 3.8

lm.fit ← lm(1/index^15 ~ inbound_links / (root * outbound_links * title * description *
keywords * (div + h1 + h2 + h3 + h4 + h5 + h6 + p + span)))    [3.8]

Equation 3.8 is equation 3.7 with two components modified; the first change being the

expression of the dependent variable and the second being the creation of the ratio. The exponent

of the dependent variable is the quantity of independent variables in the equation set. The ratio

portion of the equation was arrived at by way of the R regression utility function. The use of R and the model as input led to the identification of the significant variable pairs and their correct position within the overall equation. It was this iterative process that led to the optimal formula identified in equation 3.8. By definition of the Google Rank

paradigm the position or rank of a page is the probability of the end point being selected by an

end user during a search. The probability of choosing one in a series of options becomes x/n or

stated differently, it is the ratio of selection options over the data set. It was under this premise that the dependent variable was first investigated in the form of 1/index. Subsequent mathematical


iterations of the dependent variable over a series of options were investigated, namely the series 1/index^n, where n was set to an integer between 1 and the total number of attributes available, i.e. the number of left hand rows in Table 3.3. It was under this primitive discourse that the optimum was discovered and was found to be 15 for the BING formula.
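A minimal Python sketch of that exponent search is given below; it assumes _complete_r.dat is tab delimited with the Table 3.3 headers (lower-cased) and, for brevity, uses a simplified additive right-hand side rather than the full form of equation 3.8, which was fitted in R.

    import pandas as pd
    import statsmodels.formula.api as smf

    data = pd.read_csv("_complete_r.dat", sep="\t")
    data.columns = [c.lower() for c in data.columns]
    data = data[data["index"] > 0]              # guard against a zero rank position

    rhs = "inbound_links + outbound_links + title + description + keywords + url"
    best_n, best_r2 = None, float("-inf")
    for n in range(1, 16):                      # candidate exponents for 1/index^n
        data["y"] = 1.0 / data["index"] ** n    # transformed dependent variable
        fit = smf.ols("y ~ " + rhs, data=data).fit()
        if fit.rsquared_adj > best_r2:          # keep the best adjusted R-squared
            best_n, best_r2 = n, fit.rsquared_adj
    print(best_n, best_r2)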

Equation 3.8, once fed into R, allows for the statistical factors to be provided by the statistical tool through the use of the summary function, which was performed on the file _complete_r.dat. This file contained a total of 7,055 entries which, once fed into R with the summary function applied to the model as defined in equation 3.8, resulted in an Adjusted R-Squared value of 0.31. This entails that the model fed into R proved to be 31% effective. The

generalized formula that results from this analysis given the R summary data is provided in

equation 3.9.

1/index^15 = 0.001141*I / ( -0.002473*U * -0.000009302*O * -0.001788*T * -0.02655*D * -0.03175*K *
( -0.0005829*DIV - 0.003683*H1 + 0.003760*H2 - 0.0008250*H3 + 0.003665*H4 + 7.854*H5
+ 0.004228*H6 - 0.001312*P + 0.00004138*Span ) ) - 0.002291    [3.9]

The formula derived by way of the R statistical tool gave a Multiple R-Squared value of

0.3523 and a Residual Standard Error of 0.0721. The analysis herein shows that a complex dynamic is at work that could only be approximated with the system variables in an all-encompassing model. While the derived model was 31% effective, it does show that more work is needed to be able to predict search results. It further needs to be pointed out that the p value is much less than 0.05 (2.2x10^-16), which implies that the analysis herein has a very low probability of having been arrived at by chance, further validating the results.

An investigation also took place where a second model was fitted using only the attributes that received the three-star significance code in R. In this case the model resulted in essentially the same prediction factor, though slightly higher at 31.05%. While this refinement does yield a prediction factor that is 0.0005 points better, it removes base system constraints from the model, and thus the original premise of equation 3.9 is retained. The removal of some of the base system variables from equation 3.9 would simply entail that an overfitting was taking place, moving the argument from the general to the specific case.

Taking equation 3.8 and performing a logistic regression on the model through R allows

for an alternative analysis to be performed. The model fed into the R statistical package is given

in equation 3.10 below.

lm.fit ← glm(1/index^15 ~ inbound_links / (root * outbound_links * title * description *
keywords * (div + h1 + h2 + h3 + h4 + h5 + h6 + p + span)), family="binomial")    [3.10]

The data summary from the logistic regression may be obtained from the R statistical package through the use of the summary function, which does provide an alternative hypothesis to equation 3.9. This alternative hypothesis is given in the form of equation 3.11 and stated below.

log(p/(1-p)) = 1/index^15 = 1.198e12*inbound_links / ( -2.88e12*root * (-3.866e10)*outbound_links *
(-5.963e12)*title * (-1.017e14)*description * (-5.387e14)*keywords * ( (-8.851e12)*div
+ (-9.563e13)*h1 + (3.953e12)*h2 + (-6.215e13)*h3 + (8.406e13)*h4 + (2.060e18)*h5
+ (-2.888e13)*h6 + (-2.637e13)*p + (1.522e13)*span ) ) - 4.069e14    [3.11]

Utilization of the ROC function in R allows for the derivation of a ROC plot, which may be rendered through the use of the plot function in R; the resulting plot is shown in Fig. 3.3 below. Calculating the area under the curve, or the probability of accuracy, yields a total of 0.4242. This analysis shows a dramatic improvement over the linear regression that was performed earlier and reinforces the argument for a better predictive model for the Bing search engine; an improvement of 11.42 percentage points. The full analysis performed herein may be found in the GitHub repository under the directory /BING/R and in the file 'commands - chapter 3 - glm.txt'.


Fig. 3.3 – Receiver Operating Characteristic (ROC) Curve – Bing Data


3.12 Yahoo Formula

The sample data provided in Appendix C represents data captured for the YAHOO search engine; it shows the column headers on the first line of the data sample file, with fields that are tab delimited and rows that are newline terminated. Given the space allocations, each line bleeds through to the second line in the sample data, as was the case for the Bing data file.

The maximum, minimum, and average values are given below for each of the base system

attributes as tabulated through the input file _complete_r.dat. The algorithm that was used to

create Table 3.5 may be found in Appendix A under the title pySummary.py; this was a common

algorithm that was derived for both Bing and Yahoo data processing.

Attribute Minimum Maximum Average

Description 0.0 1.3 0.0

DIV 0.0 100.0 0.16042360354688315

H1 0.0 10.0 0.24207932924893227

H2 0.0 43.3287784679089 0.33377193794607646

H3 0.0 14.0 0.059488461850749914

H4 0.0 7.0 0.016739919888699308

H5 0.0 3.545454545454545 0.004076729165437885

H6 0.0 2.8137651821862346 0.002874918873375806

Inbound Links 0.0 582.0 11.479391944836232

Keywords 0.0 1.6666666666666667 0.05287994966319452

Outbound Links 0.0 1389.0 36.668077103902206

Paragraph 0.0 124.5787019648197 0.2639037744783405

URL 0.0 1.3571428571428572 0.13570601600935772

Span 0.0 93.01220720322117 1.408042685627342

Title 0.0 2.723809523809524 0.17548509294130696

Table 3.5 – Yahoo Attribute Statistics


The data contained in the file _complete_r.dat was fed into the statistical software package R through the use of the read.table function, as was given in equation 3.5 and done previously for BING. The generalized model that was used to derive system value was as shown in equation 3.7.

The process was once again repeated as in the BING modeling exercise, but with the Yahoo data file as input; this file may be found as _complete_r.dat in the code repository under the directory src/YAHOO/data, and its sample data was provided in Appendix C. Data was imported into the R statistical tool and the regression model was evaluated using the same equation as depicted in equation 3.7. An iterative process was followed to determine the exponent of the index factor, for which an optimum value of 1 was determined. This resulted in a model with an Adjusted R-Squared value of 0.1615, or stated differently, a model accuracy of 16.15%. The generalized formula that results from the R analysis is provided in the form of equation 3.12 given below.

1/index = -0.05977*K * (-0.01003)*U * (-0.001494)*T * (-0.00000477)*I * (-0.000007963)*O * (-0.04570)*D *
( -0.01385*DIV + 0.002041*H1 + 0.05928*H2 + 0.0003611*H3 + 0.08388*H4 + 0.04422*H5
+ 1.951*H6 - 0.01690*P + 0.003334*SPAN ) + 0.04643    [3.12]

The formula parameters derived by way of the R statistical tool gave a Multiple R-Squared value of 0.2353 and a Residual Standard Error of 0.09038. The p value is 2.2x10^-16, which is significantly below 0.05 and entails that the probability of having arrived at the result by chance is minuscule, thus validating the results found.

As was the case for BING, the Yahoo formula of equation 3.12 was investigated by only looking at base system attributes that received R significance codes at the three-star level, of which there was one: H2. Using this single element and removing the excess attributes from the model resulted in an Adjusted R-Squared value of 0.01364. This prediction factor leads to the conclusion that the original premise as given in equation 3.12 must stand, for now, as the general solution to the problem frame for the linear regression.


Taking equation 3.12 and performing a logistic regression on the model through R allows

for an alternative analysis to be performed. The model fed into the R statistical package is given

in equation 3.13 below.

lm.fit ← glm(1/index ~ keywords*root*title*inbound_links*outbound_links*description*
(div + h1 + h2 + h3 + h4 + h5 + h6 + p + span), family="binomial")    [3.13]

The data summary from the logistic regression may be obtained from the R statistical package through the use of the summary function, which does provide an alternative hypothesis to equation 3.12. This alternative hypothesis is given in the form of equation 3.14 and stated below.

log( p / (1 - p) ) = 1/index = (1.867e15)*keywords * (2.025e15)*root * (1.880e15)*title *
(-2.487e12)*inbound_links * (-1.671e12)*outbound_links * (2.877e15)*description *
( (-1.786e14)*div + (7.990e14)*h1 + (3.239e14)*h2 + (1.692e14)*h3 + (-2.830e15)*h4
+ (1.001e16)*h5 + (3.773e16)*h6 + (4.123e13)*p + (2.270e12)*span ) - 1.272e15    [3.14]

Utilization of the ROC function in R allows for the derivation of a ROC plot, which may be rendered through the use of the plot function in R; the resulting plot is shown in Fig. 3.4 below. Calculating the area under the curve, or the probability of accuracy, yields a total of 0.5333. This analysis shows a dramatic improvement over the linear regression that was performed earlier and reinforces the argument for a better predictive model for the Yahoo search engine; an improvement of 0.3718, or 37.18 percentage points. The full analysis performed herein may be found in the GitHub repository under the directory /YAHOO/R and in the file 'glm - commands - Chapter 3.txt'.

As was the case for the BING formula, the YAHOO formula shows that there is a strong dynamic that is not addressed by the attribute formulas defined in equations 3.12 and 3.14, and as such it furthers the argument that a deeper discussion needs to be had. Section 4 of this body of work takes a deeper look at one of the components of the equations defined in 3.9, 3.11, 3.12, and 3.14: the link equity component of the equation under various regression constraints.


Fig. 3.4 – Receiver Operating Characteristic (ROC) Curve – Yahoo Data


3.13 Data Collection Challenges

The original premise of this discourse laid the groundwork for the derivation of system

formulas for three search engines – GOOGLE, BING, and YAHOO. What was found during the

data collection proved to limit the ability to collect the data for the GOOGLE search engine. The

GOOGLE search engine does a tremendous job of not allowing data to be collected from its

website. The search engine provider tracks IP addresses against query submittals, and if these reach some company-defined threshold then it simply blocks the data request. In an effort to

overcome this hurdle proxy software was used to try to mask the data request; software packages

that were used may be found at the following URLs:

• www.eliteproxyswitcher.com

• www.steganos.net

Even with the use of proxy software a limitation was reached in the data crawling effort; this is why the GitHub repository contains modules for GOOGLE even though no GOOGLE model is presented. While the code was written and part of the data collection was achieved, it was simply not possible to extract the needed data from the GOOGLE search engine for the purposes of this investigative study.

The second hurdle encountered in the data collection effort was that some of the APIs provided by the search providers were rendered void during the writing of this body of work, which meant that the original algorithms written to do the data extract had to be retrofitted to make the data calls directly to the search providers and forgo an easier API implementation. The BING search API was deprecated December 15, 2016 and the YAHOO search API was deprecated March 31, 2016.
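A minimal sketch of the kind of direct request that replaced the deprecated API calls is shown below; it assumes the requests library is available, that the public search endpoint accepts a q parameter, and a hypothetical query term. It is only an outline of the retrofit, not the Appendix A code itself.

    import requests

    def fetch_results_page(term):
        # Request the public results page directly instead of the deprecated search API.
        response = requests.get(
            "https://www.bing.com/search",
            params={"q": term},
            headers={"User-Agent": "Mozilla/5.0"},  # a browser-like agent; markup varies by client
            timeout=30,
        )
        response.raise_for_status()
        return response.text  # raw HTML to be parsed for the Table 3.3 attributes

    html = fetch_results_page("adheres")
    print(len(html))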

The third challenge encountered with the data collection effort concerned the results that were obtained from the search engine providers. Some of the search engine providers, such as GOOGLE, will tailor the XML content to the specific browser type that they are providing the data to, and when the algorithm being used is specifically looking for particular markup definitions this will cause the complete process of data collection to come to a screeching halt.


3.14 The Systems Paradigm

This body of work brings forth the argument that a systems perspective is a necessity when modeling a given domain; if the alternative is sought, it becomes more difficult to find a direct correlation. If this paradigm had incorporated some of the aspects investigated in the research literature then the paradigm would have been skewed, and not for the better. Previous research literature on the subject matter involved incorporating query logs, for example, to try to understand index value, but this argument fails to respect the boundary constraint of the problem - the page. Query logs are manifestations of user preferences and are not based on the fundamental attributes of the base entity of study – the page. This study parted from the systems perspective by placing the boundary around the page and not, for example, around a proxy variable for true worth as in the case of the Big Mac index. If worth can be measured then this measurement must start with, and be restricted to, the system boundary constraint. It is only by understanding the fundamental attributes of the entity under study that one may begin to understand how a delta in behavior may be explained; that is what makes discovery possible and is the fundamental reason why this investigative study was a systems study.

3.15 Summary

Even with all the challenges that were encountered and the inconsistent results that were achieved between the YAHOO search engine and the BING search engine, I must state that this experience was positive, because the search results and the paradigm brought forth have moved the argument forward and, for the first time, there has been a body of work that has created a predictive formula for the search engine paradigm.

In section four of this body of work the current state of affairs is examined from the perspective of backlink quality, and an optimization to the current paradigm is proposed in which the quality aspect is investigated further. Section 4 builds on the current work in an effort to refine the formulas proposed, move the argument further down the line, and make the case for the systems perspective in modelling.


3.16 Bibliography

Ageev, M., Guo, Q., Lagun, D., Agichtein, E. (2011). Find It If You Can: A Game for Modeling

Different Types of Web Search Success Using Interaction Data. Proceedings of the 34th

International ACM SIGIR Conference on Research and Development In Information Retrieval.

ACM

Berry, M., Browne, M. (2005). Understanding Search Engines: Mathematical Modeling and

Text Retrieval. SIAM

Carterette, B., Jones, R. (2008). Evaluating Search Engines by Modeling the Relationship

Between Relevance and Clicks. Advances in Neural Information Processing Systems

Dahiwale, P., Raghuwanshi, M., Malik, L. (2014). PDD Crawler: A Focused Web Crawler Using

Link and Content Analysis for Relevance Prediction. SEAS-2014, Dubai, UAE, International

Conference.

Google. Search Engine Optimization Starter Guide. Retrieved July 15, 2015, from

http://static.googleusercontent.com/media/www.google.com/en/us/webmasters/docs/search-

engine-optimization-starter-guide.pdf

Hassan, A. (2012). A Semi-Supervised Approach to Modeling Web Search Satisfaction.

Proceedings of the 35th International ACM SIGIR Conference on Research and Development in

Information Retrieval. ACM

Henzinger, M. (2007) Combinatorial Algorithms for Web Search Engines – Three Success

Stories. SODA '07 Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete

Algorithms. 1022-1026

Huang, P., He, X., Gao, J., Deng, L., Acero, A., Heck, L. (2013). Learning Deep Structured

Semantic Models for Web Search Using Clickthrough Data. Proceedings of the 22nd ACM

International Conference on Information & Knowledge Management. ACM


Jansen, B., Spink, A. (2006). How Are We Searching the World Wide Web? A Comparison of

Nine Search Engine Transaction Logs. Information Processing and Management. Vol 40, 248-

263

Jerkovic, J. (2010). SEO Warrior. Sebastopol, CA. O’Reilly Media Inc.

King, A. (2008). Website Optimization. Sebastopol, CA. O’Reilly Media Inc.

Mukhopadhyay, D., Biswas, P., Kim, Y. (2006). A Syntactic Classification Based Web Page

Ranking Algorithm. Retrieved May 2015 from

http://arxiv.org/ftp/arxiv/papers/1102/1102.0694.pdf

Pal, A., Tomar, D., Shrivastava, S. (2009). Effective Focused Crawling Based on Content and

Link Structure Analysis. International Journal of Computer Science and Information Security

(IJCSIS). Vol. 2, No. 1

Radinsky, K., Svore, K., Dumais, S., Teevan, J., Bocharov, A., Horvitz, E. (2012). Modeling and

Predicting Behavioral Dynamics on the Web. Proceedings of the 21st International Conference

on World Wide Web. ACM

Sedigh, A., Roudaki, M. (2003). Identification of the Dynamics of the Google's Ranking

Algorithm. 13th IFAC Symposium on System Identification

Smyth, B., Balfe, E., Freyne, J., Briggs, P., Coyle, M., Boydell, O. (2004). Exploiting Query

Repetition and Regularity in an Adaptive Community-Based Web Search Engine. User

Modelling and User-Adaptive Interaction. Vol 14, 383-423

Spirin, N., Han, J. Survey on Web Spam Detection: Principles and Algorithms. Retrieved May

2015 from http://www.kdd.org/sites/default/files/issues/13-2-2011-12/V13-02-08-Spirin.pdf

Sun, H., Wei, Y. (2005). A Note On The PageRank Algorithm. Elsevier. Vol. 179, Issue 2, 799-

806


Wang, H., Zhai, C., Liang, F., Dong, A., Chang, Y. (2014). User Modeling in Search Logs Via a

Nonparametric Bayesian Approach. Proceedings of the 7th ACM International Conference on

Web Search and Data Mining. ACM.

Yue, Z., Han, S., He, D. (2014). Modeling Search Processes Using Hidden States in

Collaborative Exploratory Web Search. Proceedings of the 17th ACM Conference on Computer

Supported Cooperative Work & Social Computing. ACM


CHAPTER IV

PAGE INDEXING OPTIMIZATION PROPOSITIONS

This study will examine the enhancement to the model proposed in section three of this

body of work. In this investigative study the link structure from source to sink will be

investigated to derive a weighted factor to the system models for both Bing and Yahoo. This

section of the body of work seeks to enhance and build upon the previous body of knowledge to

create an enhancement to the current framework. The network structure of the link equity from source to sink will be evaluated here to determine if an optimization may exist in the system models derived previously, one which will allow for a better solution to arise.

While current practice by the search engine providers is to weigh a link from some authority domain, such as Wikipedia or an academic contributor, more heavily than a link from some other domain, as has been claimed by King (2008), the question does arise here as to whether this fact can be leveraged by the existing paradigm to allow for a refinement in the models created to date. This investigative study will seek to address the issue with current

practice and set forth a search model that takes into account the system attributes as the focal

point of the discussion. The derivation of the new approximation model is a further step in the

discussion as the argument here is on the enhancement or refinement of one of the system

attributes used in the previous models – link equity. This link equity component is studied under

the perspective of its underlying components to derive a new metric that leads to a new dynamic

where the link contribution of each source is weighed across all contributors to change the

dynamic from a simple link equity contribution to a quality metric. It is this quality metric that

once derived will be used to change the models already defined for Bing and Yahoo. This

refinement will be evaluated as was done previously through the statistical software package R

and allow for a direct comparison between past results and the new paradigm, which will allow

the discussion to come to a termination point as to whether or not value can be added by

examining link structure in page indexing.

4.1 Introduction

Page indexing by the search engine providers contains a large system component in the overall model: the network of link structures from source to sink. Each link between a benefactor and a beneficiary becomes a contributing factor in the overall page ranking algorithm. Jerkovic (2010) makes the argument that the link structure plays an integral part in the search index algorithm in determining overall worth. This link structure, however, has up to now been largely ignored and only considered on a sliding scale in terms of quantity. The question in this investigative study, and a question that up to now has failed to be taken into account, is the degree to which this measure of link structure needs to be quantified rather than serve as a simple linear contributor in an overall equation.

The general thesis of this paper is that while network link structures should be a contributing factor, this factor needs to be assessed at a deeper level than is currently done. Once exposed to a systems perspective and evaluated for its individual worth within a larger constraint, the link structure creates a different paradigm from current practice. This new paradigm discards the simple constraint now in use and puts a new, enhanced metric in place: a true weight for the link structure dynamic.

This paper will define a model to assess the worth of link components based upon the system attributes contained in the linking document. From the summation of these sources an overall weighted component may be derived to create an enhanced mathematical model that predicts page indexing by way of link equity value as derived from system attributes. This body of work relies on the work to date; it will take each of the end nodes identified through the data mining algorithm and then expose the linking documents. These linking documents will be investigated for their worth under each element identified as relevant in Table 3.3. The same data mining algorithm will be used to extract index values for each of the page attributes already identified as relevant from the literature review. The algorithm that was used to evaluate the index elements may be found in Appendix A. Once each of the link contributors is evaluated for worth given the indexing elements, they will be combined into an overall qualifier to gauge link worth. While this argument parallels the argument posed previously, what is being accomplished through this body of work is the creation of an enhanced model: a refinement of the previous proposal and one that, given the literature review, aligns more closely with at least one indexing paradigm, the Google search engine and the PageRank algorithm. The HITS algorithm as developed by Jon Kleinberg has the premise that good sinks are pointed to by good sources, an argument that feeds directly into the context of this body of work.


The contributors to link quality need to be assessed by merit and not simply allowed to affect the paradigm through their mere existence. To argue against this dynamic is to open the door for link farms and the potential skewing of system homeostasis. Another dynamic that was broached previously is the PageRank algorithm as developed by Brin and Page, the founders of Google. The PageRank of a sink can be defined as the summation, over the sources that point to the sink, of each source's rank divided by the number of outbound links from that source. The link structure plays an important role in the classification of the big data landscape that is the World Wide Web, and a model that is void of a quality metric representative of the link structure strips the underlying argument of its worth. In this section of the body of work an effort is made to enhance the previous work to account for link quality in the overall model and thus improve the argument posed; at least, this is the aspiration of the work herein.
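For reference, the commonly cited formulation of PageRank from the literature (not a claim about any provider's production system) can be written as follows, where d is a damping factor, N the number of pages, B(p) the set of pages linking to page p, and L(q) the number of outbound links on page q:

PR(p) = (1 - d)/N + d * Σ over q in B(p) of [ PR(q) / L(q) ]

The informal description given above corresponds to the summation term: each source q passes on its own rank diluted by the number of its outbound links.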

4.2 Big Data

Big data is a term used to classify a large amount of bytes; the term is credited to Roger Magoulas of O'Reilly Media according to Ularu, Puican, Apostu, and Velicanu (2012). Magoulas defined the term to encompass data that is too complex to manage with normal data management techniques. This definition is time specific, as the tools available to manage data change with time. A 1 GB data set would have been completely unmanageable under a 16 MB RAM environment in the early 1990s, yet today databases a hundredfold larger than that baseline are commonplace in technology departments and can by no means be classified as 'big data' under current conditions, at least with regard to the volume metric. While the definition provided by Magoulas is a classifier that allows for qualification over some domain, a definition must be fixed and constant over time, much as π is. If definitions are not constant, new jargon must be learned continuously, which is counterproductive to the academic effort. While the classification by Magoulas is a good step forward, what the domain needs is a time independent classification; what is missing is the symbolic π. Big data is characterized by four distinguishing attributes, Volume, Velocity, Variety, and Veracity, according to Ularu et al. (2012). These attributes are defined as follows.

• Volume – The quantity of data

• Velocity – Time in which data can be processed

• Variety – Diversity of the data encompassed


• Veracity – The degree of trust in the data

In this definition of the attributes of big data there are two components that are variable with time, volume and velocity, and it is with these two attributes that the definition of big data becomes malleable over time. Advances in CPU power and computational volume via extended RAM mean that the quantity of data, and the processing of that data, is a moving target. The attribute of veracity is purely subjective and consequently, by definition, not a component upon which an academic investigation may unfold. It is my contention that what makes big data 'big' is variety. The determination of a computational formula, which after all is what is always sought in decision science, is the specific determination of those attributes of some system that may be combined under some constraint such that they embody the dependent variable of a formula. This brings the discussion to modes of variety; data could be presented to an end user for analysis in a variety of modes such as structured, semi structured, or randomized (chaotic in appearance).

Ularu et al. (2012) point to the statistic that every day 2.5 quintillion bytes of data are created. This statistic leads to the realization that 90% of the data surrounding us has been created in the past two years, according to Ularu et al. (2012). A discussion point that has been lacking up to now, and a contention point never identified in the research literature, is why there is no discussion surrounding algorithms for big data model derivations. Given F = ma as a law, this law holds true for one data point {m = 8 kg, a = 25 m/s²} or for a series of data points [{m = 1 kg, a = 1 m/s²}, {m = 6 kg, a = 2 m/s²}, {m = 9 kg, a = 65 m/s²}]. The argument for the truth in the data is being lost and replaced with the notion of the big data paradigm, under which derivation must surely be easier if there is simply more data. Taylor, Schroeder, and Meyer (2014) bring this reality to the forefront when they quote Professor David Hendry as stating during an interview in 2013:

“…whether the dataset’s big or small doesn’t actually matter in establishing

change, but if it’s big and the system is complex the only way to establish change is to model that

complexity”

Provost and Fawcett (2013) note a specific problem when dealing with large data sets: the problem of overfitting the data to the problem frame. The problem with overfitting, the authors argue, is that while the truism holds for the data set, the derived solution does not fit the general problem frame. Gandomi and Haider (2015) identify this problem of overfitting in a different context and call it spurious correlation. Gandomi and Haider define spurious correlation as uncorrelated variables being falsely found to be correlated due to the massive size of the data set. So when looking at big data, it appears that once the jargon and the novelty of the problem domain are stripped away, what is left is a very similar problem: the correct determination of those system attributes that affect change or, put differently, the definition of some function that is dependent on a series of attributes or components that when combined model behavior.

f → U(x1, x2, …, xn) [4.1]

Something all scientists hold near and dear is taking a new problem and using existing theory to explain the phenomenon. Once the jargon and the noise around the fundamental definition of big data are stripped away, what is left is variety. The volume and velocity components of big data are resource centric, such that given time the obstacle is removed. It could even be argued that outside of special cases this obstacle does not exist, as today an individual could take hardware purchased at a local electronics store, link together a network, and install a solution such as Hadoop (https://hadoop.apache.org/) on the machines to leverage a distributed file structure for computational purposes. The big data component of veracity is purely subjective by definition and as such has no place when building a predictive model. What this ultimately leaves the system with is one specific attribute that makes data big data: variety.

The variety component of data is complex and a fundamental reason why model derivation is a challenge. Under this banner of variety, however, there exist three distinct buckets of categorization: structured data, semi structured data, and randomized data. An example of structured data would be data contained within a database table. A database table contains a specific schema that defines its context. A series of columns carries a container type (string, integer, etc.), constraints such as a string length that must be at most 12 characters, and referential integrity, such as column 'Id' of the employee table being a foreign key to the payroll table's column 'EmployeeId'.


Semi structured data is what this paper deals with: the content that is presented to the web browser for interpretation and viewing by an audience. Semi structured data represents data that has a flexible boundary constraint, such as HTML. The HTML standard is maintained by the W3C (http://www.w3.org/MarkUp/) and represents a series of containers for displaying content on a web page. The domain is semi structured because tags carry attributes, which may be defined by an encompassing tool such as AngularJS (https://angularjs.org/) or by the individual user for that matter. The HTML domain is also considered semi structured because tag position is largely arbitrary outside of a few tags, such as the <HTML> tag for example. For completeness, the definition of randomized data also needs to be addressed; very simply stated, this is data that has no apparent pattern to it, such as an encrypted password.

Search engine indexing deals with big data due to the fundamental constant attribute of big data: variety. It could very well be argued that page content also changes over time, so does the attribute of variety not fall under the same constraint as volume and velocity? While time may yield a difference in the data landscape, it does not make this a constant or a given; variety in the data pool simply means a delta between data points. Variety denotes a difference, and as such the model must deal with this delta to predict a truism independent of time, thus making variety the true constant of big data, unlike any of the other attributes used to categorize it. This brings the discussion to a lapse in the dialog over big data. While the variety attribute is specific, it is so at the macro level; at the micro level, variety in data concerns the differing end points and their amalgamation into a paradigm to predict truth. This detail is addressed in the next section of the body of work, where this underlying truth in big data is defined.

4.3 System Variables

A system variable, simply put, is some metric that may be used to gauge a performance indicator of some model under investigation. Big data systems become difficult to decipher because the data stream points to a multitude of conclusion points, each of which, if not investigated under a systems perspective, only helps to cloud the picture and forces the researcher to view, at best, the shadow of the model at hand. A far worse scenario, and a problem that current society is plagued with, is the creation of indexes when no such system attributes support the said system.


Take, for example, the case of the much heralded publication The Economist; its researchers have come up with a metric to measure inflation between isolated entities, namely countries, and this metric is called the Big Mac Index. The rationale of the index is as follows: if you have a product that is the same in a series of countries, then monetary inflation in those countries may be gauged by comparing the price of the Big Mac in each of them. Problem number one with this rationale is that the product is not the same in all countries. In India the patty is not beef, so does the price differential represent a difference in the supply chain, and perhaps even legislative factors in the system, or is it simply inflation, as the current argument holds? While the component inflation (unit of measure, dollars) may be similar in magnitude to the price of Big Macs, it may not necessarily be tied conclusively to inflation and be provable mathematically, which of course can be the only judge of scientific discourse.

What is missing from the discourse is a fundamental artifact when analyzing big data and systems: the rules of governance over the factors that dominate the discussion, or the formula if you will. In order to make a prediction of the performance of some metric over some domain, that metric must come from, and be measurable in, the system itself. The engineering domain takes this artifact as a rule that is never broken; take for example the case of calculating stress on a beam, calculating fluid flow, or the turbulence a plane exhibits in flight. Admittedly, some of these variables, such as force, are not simply taken as first order measurements directly from the system, but are rather amalgamations, such as velocity, which is a unit of distance over time. This leads the discussion to an all too familiar point: the discussion has come full circle and returned to the initial conditions, system variable constraints. It has become too easy to steer an argument away from the fundamental metrics toward the Big Mac Index, if you will, and lead the argument astray. This body of work aims at steering the argument back on course and relying on the directly measurable attributes of the system to derive truth, void of the external factors that merely represent proxy variables to the datum.

4.4 Literature Review

The work of Jerkovic (2010), SEO Warrior, defines a series of attributes that determine page worth as assessed by the major search engine providers. In total Jerkovic defines nine attributes of a page that help define its own worth or help in defining the worth of linked pages. The nine attributes that Jerkovic (2010) defines as aiding in the total page ranking algorithm by the major search engines are as follows:

1. Title Tag
2. Page Copy
3. Document URL
4. Meta Tag
5. Keyword Proximity
6. Search Term Prominence
7. Anchor Text
8. Domain Registration Length
9. Quality & Quantity of Referral Links

Page content is defined through the Hypertext Markup Language (HTML), a markup language in the same family as the Extensible Markup Language (XML). The HTML definition provides a series of tags that serve as the protocol for content placement, the first of which is the title tag <title></title>. The title tag is indexed by the search engines and displayed on search engine result pages for the reader to view, the goal being that the user will select relevant links by reading the content displayed. The search 'San Ramon CA' was performed on Google.com on November 17, 2015 at 8:42 PM and yielded a series of results, the first of which (non-advertisement entry) was as follows:

Welcome to the City of San Ramon

www.ci.san-ramon.ca.us/ San Ramon Public Engagement Join Instagram Public Engagement Follow us Subscribe to Email Notification Public Engagement. 2226 Camino Ramon San Ramon, CA ...

The result above displays an entry on its first line, 'Welcome to the City of San Ramon'. Inspecting the contents of the link www.ci.san-ramon.ca.us results in the identification of the following title tag:

<title>Welcome to the City of San Ramon</title>

The title tag was used by the search engine provider to help the user determine relevant content.

It should be pointed out to the reader that when the title tag does not provide value then the

search engine provider Google.com replaces the entry on the first line with its own derived


relevant content. Bing and Yahoo display a different entry for the user, with one subtle difference in the first line of output by the providers, which can be seen in the example below.

San Ramon, California - Official Site

www.ci.san-ramon.ca.us Official site CONNECT WITH US. 2226 Camino Ramon San Ramon, CA 94583 Phone: (925) 973-2500 Fax: (925) 866-1436 Monday to Friday except Holidays 8:30 am - 5:00 pm

The first line in the above entry is not the content of the title tag, but is rather derived from a

different source. Could this source be the content URL or some other page attribute that signifies

the root source of the content? Inspection of the second result in the data set shows that the

second entry is the Wikipedia entry for all three search engine providers – Bing, Google, and

Yahoo. In this case the title tag does display on the search results page for all three search engine

providers. In summary then we can state that in some instances Google will replace the content

of the first entry in the results page with its own text, text for which keyword prominence is more

poignant. In the case of Bing or Yahoo the search providers display additional text on the first

line to help the user narrow down their search; this text is not derived from the page copy, but is

rather determined from some other source as was verified by inspection of the content.

Page copy is the second attribute that Jerkovic (2010) identifies as having value to the search engine providers. Page copy is displayed by way of the header tags <hX></hX> or a paragraph tag <p></p> and, more recently, division tags <div></div> or span tags <span></span>. In the case of header tags the 'X' component displayed above may be a value between one and six, inclusive; the higher the value of 'X', the smaller the content displayed. Jerkovic (2010) and King (2008) both identify page copy as having significance to the search engines and even go as far as identifying its subject matter as making a large contribution to the overall indexing efforts of the search engine providers. Inspection of the above examples from the search engine providers shows that all three providers put the query term and/or the query words in bold. A query term is a series of query words separated by spaces, and a query word is an individual word in a query term. In the example given, all three search engine providers make a concentrated effort to highlight the relevant keywords searched upon by the user in the results page, a nudge in the right direction if you will.


The document URL is the third component that Jerkovic identifies as adding value to the

search index performed by the search engine providers. In the search examples given above each

of the search engine providers displayed the document URL for the reader. Each of the search

engine providers also went as far as to highlight the query terms found in the URL to show some

degree of relevance for the reader. The highlighting effort by the search engine providers points to a disclosure of the paradigm and makes the argument of both King (2008) and Jerkovic (2010) more apparent: keywords in the URL play an important role in determining worth by the search engine providers.

The fourth attribute that Jerkovic (2010) points to as adding value to the search engine providers is the Meta tag. Meta tags represent pseudo content of the displayed text, yet hold value to the indexing process. Meta tags are found in the head section of a document and provide information for the search engines; Meta tags have the following definition:

<Meta name="" content="">

The name attribute of the tag may take three values: 'description', 'keywords', or 'author'. A description value indicates that the content holds a short description of the document, which should be less than 250 words as suggested by King (2008). The keywords value identifies the keywords to be associated with the document; King (2008) advises this to be no more than 20 words in length.
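As an illustration only, and not the Appendix A extraction algorithm itself, the title and Meta tags discussed above can be pulled from a document's head section with a short Python sketch built on the standard library html.parser; the sample HTML string below is hypothetical.

from html.parser import HTMLParser

class HeadTagExtractor(HTMLParser):
    # Collects the <title> text and the content of <meta name="..."> tags.
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            if name in ("description", "keywords", "author"):
                self.meta[name] = attributes.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Hypothetical document fragment used purely for illustration.
sample = ('<html><head><title>Welcome to the City of San Ramon</title>'
          '<meta name="description" content="Official city site"></head></html>')
parser = HeadTagExtractor()
parser.feed(sample)
print(parser.title)   # Welcome to the City of San Ramon
print(parser.meta)    # {'description': 'Official city site'}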

The fifth component that Jerkovic (2010) identifies as having value to the search engines

is keyword proximity. Keyword proximity refers to the physical distance between query words.

Jerkovic (2010) makes the argument that placing content on a document so that query terms

align yields better results in a search engine optimization effort. If an optimization is desired on

the query ‘Mountain House California’ then the following page copy would be treated

differently.

<p>Mountain House is located in Northeastern California</p>

And <p>Mountain House, California is located in the Northeastern part of the state </p>


The second page copy would be superior to the first given the distance between the city name and the state name: a delta of 28 - 2 = 26 characters in favor of the second page copy.
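The character distance used in the delta above can be computed with a few lines of Python; the helper below is purely illustrative and is not the measurement used by the search engine providers or by the Appendix A code.

def proximity(copy, word_a, word_b):
    # Characters between the end of word_a and the start of word_b (case-insensitive).
    text = copy.lower()
    start_a = text.find(word_a.lower())
    start_b = text.find(word_b.lower())
    if start_a == -1 or start_b == -1:
        raise ValueError("both query words must occur in the page copy")
    return start_b - (start_a + len(word_a))

print(proximity("Mountain House is located in Northeastern California",
                "Mountain House", "California"))   # 28
print(proximity("Mountain House, California is located in the Northeastern part of the state",
                "Mountain House", "California"))   # 2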

The sixth component that Jerkovic (2010) identifies as having value to the search engine

providers is the prominence of the keywords on the physical page. Keyword prominence refers

to the physical location of the query words or term with respect to the top of the document. The general idea here, according to Jerkovic, is that the higher the placement in the document, the more important the query term.

The seventh component that Jerkovic (2010) identifies as having significance to the

search engines is keywords in anchor text. The anchor tag is a tag that allows for the linking of

content between documents. In the examples given above the reader will notice that the query

terms identified in the URL were placed with emphasis in the search results. King makes the

argument that shorter URLs are preferred to longer URLs by search engine providers – once

again an implicit indication of the algorithm. Could it be possible that the search engine

providers utilize an index structure with keywords in some phrase and thus create an algorithm

on this attribute to assess value?

The eighth attribute that Jerkovic (2010) identifies as having value to the search engine

providers is the length of time that the domain has been in existence. The argument by Jerkovic is that the age of content has a bearing on the worth assigned by the indexing process. In its optimization guide, Google does not put this attribute forward as having value to the indexing process, but it is nevertheless a point that may be worth investigating further.

The ninth and final component that Jerkovic (2010) identifies as having value to the search engines is the quality and quantity of inbound links. King makes the argument that each web page has a voting share, and the linking between source and sink is essentially a vote cast by the source for the sink. The PageRank algorithm by Google utilizes the link structure as a weight to measure value. Sun and Wei (2005) define PageRank as the importance of a page on the internet. It is this final component of the system attribute mapping structure that needs to be investigated further and that is the focus of this paper. While Jerkovic makes the argument that it is the quantity and quality of the inbound links that help to determine worth, what is not considered is the systems perspective of the source. The search engine providers may deem inbound links from .org or .edu domains as having a special, elevated contribution factor, but this still fails to address the complex dynamic of link structures. For example, linking for the sake of linking can add value, as identified by King (2008), but should it? The argument of this paper is that link structures need to be evaluated from a systems perspective and not viewed as a binary measure, even in the case of the special domains.

Information retrieval is the process of extracting organized data from some source. While data represents unorganized text, information represents actionable, organized text that may be disseminated and acted upon. In the realm of information retrieval the concepts of recall and precision are of paramount interest. Recall, according to Losee (2000), refers to the percentage of the relevant documents in some repository that are retrieved. Precision, on the other hand, refers to the percentage of the retrieved documents that are deemed to be relevant. Mehlitz et al. (2007) take the definitions a step further and provide an equation for the two dependent variables identified previously, as follows.

P = Gr / r [4.2]

R = Gr / g [4.3]

Where
P = Precision
R = Recall
Gr = Relevant Documents Among Returned Documents
r = Documents Retrieved
g = Relevant Documents in a Collection Set

Mehlitz et al. go a step further than the textbook definitions of recall and precision and define an

effectiveness measurement for the information retrieval system. This effectiveness measurement

is defined as follows.

E = 1 – [(β² + 1) * P * R] / [β² * P + R] [4.4]

Where
E = Degree of Effectiveness
β = Relative Importance between Precision and Recall

Given a document retrieval system such as a search engine, the total effectiveness of the search paradigm can be explicitly evaluated as a tangible, quantifiable metric. The optimization model proposed in this paper specifically addresses the precision component of the equation. What is not being contended is the recall; the set of documents retrieved is fixed and definable at some specific point in time 't'. The next section of the discourse addresses the proposed optimization mechanism and the analysis for its evaluation.
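Equations 4.2 through 4.4 translate directly into code; the following minimal Python sketch (illustrative only, with made-up counts in the example) evaluates precision, recall, and the effectiveness measure for a retrieval run.

def precision(relevant_returned, returned):
    # Equation 4.2: P = Gr / r
    return relevant_returned / returned

def recall(relevant_returned, relevant_in_collection):
    # Equation 4.3: R = Gr / g
    return relevant_returned / relevant_in_collection

def effectiveness(p, r, beta=1.0):
    # Equation 4.4: E = 1 - [(beta^2 + 1) * P * R] / [beta^2 * P + R]
    return 1.0 - ((beta ** 2 + 1) * p * r) / (beta ** 2 * p + r)

# Made-up example: 8 relevant documents among 20 returned, 10 relevant in the collection.
p = precision(8, 20)   # 0.4
r = recall(8, 10)      # 0.8
print(p, r, effectiveness(p, r))   # E is roughly 0.467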

4.5 Theoretical Formulation

While the page attributes defined by researchers are finite, they represent a complex dynamic, one which has been in constant flux since its inception. The search engine paradigm is much sought after, as its full disclosure would signify a windfall for online retailers. The collective of attributes identified through the research literature creates a system definition that may be modelled as follows.

Let
A = Anchor Text
C = Page Copy
D = Domain Registration Length
I = Search Index
M = Meta Tags
P = Keyword Proximity
Q = Quality & Quantity of Referrals
S = Search Term Prominence
T = Title Tag
U = Document URL

Where
I = f(A, C, D, M, P, Q, S, T, U) [4.5]

The proposed optimization mechanism addresses the 'Q' component of the relationship given above. King makes the argument that while referrals from pages with a higher equity rank are more desirable, links from low ranking link equity sources still contribute to overall rank. This fact exposes the search algorithm to a bias towards link farms or search engine optimizers who understand how to create link spam. The scenario and environment that currently exist must be adjusted for in order to keep the system in homeostasis and remove bias.

In section three of this body of work an approximation model was created for the Bing and Yahoo search engines. That paradigm took into account the referrers to the sink as a volume metric, but it did not address the quality of the referrers. It is this quality metric that is addressed here. In an effort to define this quality metric, the same system variables that were

investigated earlier will be indexed across the group of nodes that point to some sink. It is this

collective under study that will be scrutinized to create a weighted quality metric component that

will be incorporated into the same equations derived previously for both the Bing and Yahoo

search engines to create a possible refinement of the model. What is being proposed here is to

actually define the β component of equation 4.4, that is, to quantify the relative importance between precision and recall. To calculate this source index, or β, the linking pages will be evaluated against a corresponding index defined as follows.

Let
c = Copy Index = [∑ Query Terms Found in Copy] / [∑ Words in Copy]
d = Description Index = [∑ Query Terms Found in Description] / [∑ Words in Description]
k = Keyword Index = [∑ Query Terms Found in Keywords] / [∑ Words in Keywords]
t = Title Index = [∑ Query Terms Found in Title] / [∑ Words in Title]
u = URL Index = [∑ Query Terms Found in URL] / [∑ Words in URL]
W = Weighted Index

Where
Wi = ∑ci + ∑di + ∑ki + ∑ti + ∑ui [4.6]

The evaluation of equation 4.6 for each of the pages linking to the sink then provides a total index 'W'; the average will be taken to derive the mean index for the sink, defined as follows.

Let
n = Number of Sources
SST = Source to Sink Total

Where
SST = [ ∑c + ∑d + ∑k + ∑t + ∑u ] / n [4.7]
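A minimal Python sketch of equations 4.6 and 4.7 is given below. The page representation and field names are illustrative assumptions made for this sketch and do not reflect the exact data structures used by the Appendix A algorithms.

def term_index(query_terms, words):
    # Fraction of words in a field that are query terms (the pattern behind c, d, k, t, and u).
    if not words:
        return 0.0
    terms = {t.lower() for t in query_terms}
    return sum(1 for w in words if w.lower() in terms) / len(words)

def weighted_index(page, query_terms):
    # Equation 4.6: W = c + d + k + t + u for a single linking page.
    fields = ("copy", "description", "keywords", "title", "url")
    return sum(term_index(query_terms, page.get(field, [])) for field in fields)

def source_to_sink_total(linking_pages, query_terms):
    # Equation 4.7: SST, the mean weighted index across the n sources linking to a sink.
    if not linking_pages:
        return 0.0
    return sum(weighted_index(p, query_terms) for p in linking_pages) / len(linking_pages)

# Hypothetical linking page; each field is represented as a list of its words.
source = {"title": ["san", "ramon", "city", "guide"],
          "url": ["www", "example", "com", "san", "ramon"],
          "copy": ["san", "ramon", "is", "a", "city", "in", "california"]}
print(source_to_sink_total([source], ["san", "ramon"]))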

In the previous section of this dissertation an approximation formula was defined for each of the search engine providers. The proposal made here is an adjustment to the general formula defined in equation 3.2 of the previous section, reproduced below.

Let:
S = Search Engine Index
Bn = Slope of Component 'n'
D = Meta Tag Description Index → Keywords / Total Words in Description
I = Inbound Links → External Links to Page
K = Meta Tag Keywords Index → Keywords / Total Words in Keywords
O = Outbound Links → Links to External Pages
P = Page Copy Index → Keywords in Copy / Total Words in Copy
T = Title Index → Keywords in Title / Total Words in Title
U = URL Index → Keywords in URL / Total Words in URL

Then:
S = B1D * B2I * B3K * B4O * B5P * B6T * B7U + µ [3.2]

The approximation formulas derived in the previous section were void of a quality metric with regard to inbound links (I). Inbound links were deemed to be relevant if they existed, but they were not scrutinized for value. This value metric is of great importance, as it filters out bias such as may be found with links from link farms or link spam. Applying a quality metric to equation 3.2 leads to a new paradigm, as given below.

S = B1D * B2I * B8SST * B3K * B4O * B5P * B6T * B7U + µ [4.8]

Equation 4.8 contains a simple but fundamental change: the quality evaluation of the link equity from a series of links or sources. It is this paradigm that is investigated further in this section in order to seek an optimization of the previously defined formula (3.2) and derive an optimized approximation to the search algorithms of the two search engine providers.

4.6 Underpinnings

The algorithms depicted in Appendix A were used to create the data input files that were

later fed into the statistical modeling utility R (https://www.r-project.org/). The algorithms

utilized were created using the Python programming language (https://www.python.org/) version

3.4.3. The version of R that was used was 3.3.1. The complete source code along with the data

files may be obtained through a GitHub repository at the following URL

https://github.com/guillermorodriguez/Dissertation.

As was the case in the previous section, the Natural Language Toolkit for the Python programming language was used to search for the lexical context of a given query term. The query

terms percolated from the previous section and were simply carried over for the given searches

that had been performed previously. The GitHub repository contains the needed information to


install the Natural Language Toolkit for Python and these instructions may be found in the

Python file named wordnet_install.py.
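The repository's wordnet_install.py holds the actual setup steps; as a hedged illustration, the standard way to obtain and query the WordNet corpus through the Natural Language Toolkit looks roughly as follows.

import nltk

# One-time download of the WordNet corpus used for lexical context lookups.
nltk.download("wordnet")

from nltk.corpus import wordnet

# List the synonym sets (lexical contexts) for an example query word.
for synset in wordnet.synsets("search"):
    print(synset.name(), "-", synset.definition())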

4.7 Bing Formula

The Bing data file _historical_complete_r.dat contains the quantified quality attribute that

is summarized in Table 4.1 and given below. The quality metric table shows the mean, maximum

and minimum values for the quality attribute that was used in equation 4.9. The data was

extracted through the use of the algorithm pySummary.py as given in Appendix A.
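The figures reported in Table 4.1 were produced by pySummary.py; a rough Python equivalent, assuming the .dat file is a whitespace-delimited table with a header row that includes a quality column (the authoritative format is the one in the GitHub repository), would be:

import pandas as pd

# Assumed layout: whitespace-delimited text file with a header row and a
# 'quality' column holding the link quality metric for each observation.
data = pd.read_csv("src/BING/data/_historical_complete_r.dat", sep=r"\s+")

print("Minimum:", data["quality"].min())
print("Maximum:", data["quality"].max())
print("Average:", data["quality"].mean())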

The data needed to perform the calculations was fed into the R statistical package through the use of the table read function as defined in equation 3.5. The data file that was used may be found in the directory src/BING/data/ under the name _historical_complete_r.dat. The directory noted may be found in the GitHub repository for downloading; an extract of the top five lines is included in Appendix D. The Bing formula derived in section three is given below.

1/index^15 = 0.001141*I / ( -0.002473*U * -0.000009302*O * -0.001788*T * -0.02655*D * -0.03175*K * ( -0.0005829*DIV - 0.003683*H1 + 0.003760*H2 - 0.0008250*H3 + 0.003665*H4 + 7.854*H5 + 0.004228*H6 - 0.001312*P + 0.00004138*Span ) ) - 0.002291 [3.9]

The equation given above is utilized in this body of work as the basis for an extension that incorporates the quality metric identified as warranting further study. Using the _historical_complete_r.dat file as input to the R modeling software, together with the modeling formula of equation 4.9, provides further insight into the true system homeostasis for the Bing search engine.

1/index^15 ~ inbound_links*quality / ( root*outbound_links*title*description*keywords*(div+h1+h2+h3+h4+h5+h6+p+span) ) [4.9]

Table 4.1 – Bing Quality Metric

Minimum     Maximum     Average
0.0         9146.0      73.767


Execution of the summary function in R over the model definition of equation 4.9 leads to

summary statistics of an Adjusted R-Squared value of 0.102, a Multiple R-Squared value of

0.1962, a Residual Standard Error of 0.06385, and a p value of 2.2*10^-16. The results are

significant, as the system has conveyed a shift in homeostasis to the negative. The conclusion here is that, given the modeling paradigm used, the Bing search engine algorithm probably does not incorporate the quality metric of link referrers. While the true nature of the complete modeling paradigm incorporated by the Bing search engine is a mystery, what can be claimed, given the research presented here, is that adding the quality metric has shifted the model towards chaos and away from order. The best model available for the Bing search engine to date is therefore void of a quality metric and centered on page attributes along with the quantity of page referrers.

As an alternative analysis, a logistic regression was performed through the R statistical software package. The regression model was defined as given below, using the modeling formula of equation 4.9.

lm.fit <- glm(1/index^15 ~ inbound_links*quality/(root*outbound_links*title*description*keywords*(div+h1+h2+h3+h4+h5+h6+p+span)), family="binomial")

Taking equation 4.9 and submitting this into the R statistical package yields a regression

formula as defined in equation 4.10 and given below.

log(p/(1-p)) = 1/index^15 = (1.816e12)*inbound_links * (-2.183e11)*quality / ( (8.299e10)*root * (-6.009e7)*outbound_links * (-5.643e10)*title * (-8.159e11)*description * (4.966e12)*keywords * ( (1.036e11)*div + (-5.951e11)*h1 + (-8.875e11)*h2 + (6.512e9)*h3 + (2.781e12)*h4 + (2.281e14)*h5 + (1.132e12)*h6 + (-7.327e9)*p + (-1.695e10)*span ) ) - 4.052e15 [4.10]

Utilization of the ROC function in R allows for the derivation of an ROC plot, rendered through the plot function in R; the resulting curve is shown in Fig. 4.1 below. Calculating the area under the curve, or the probability of accuracy, yields a total of 0.375. This analysis shows a degradation relative to the previous analysis, which was void of the quality component of the model. The degradation in model accuracy amounts to 0.0492 or, stated



differently, approximately a 5% difference. In the case of Bing, the approximation model and the optimization model are one and the same. As far as the modeling paradigm here is concerned, Bing does not appear to account for link quality in its model derivation. The full analysis performed herein may be found in the GitHub repository under the directory /BING/R in the file 'commands - Chapter 4 - glm.txt'.

4.8 Yahoo Formula

The Yahoo data file _historical_complete_r.dat contains the quantified quality attribute

that is summarized in Table 4.2 and given below. The quality metric table shows the mean,

maximum and minimum values for the quality attribute that was used in equation 4.11. The data

was extracted through the use of the algorithm pySummary.py as given in Appendix A.

The data needed to perform the calculations was fed into the R statistical package through the use of the table read function as defined in equation 3.5. The data file that was used may be found in the directory src/YAHOO/data/ under the name _historical_complete_r.dat. The directory noted may be found in the GitHub repository for downloading; an extract of the top five lines is included in Appendix E. The Yahoo formula derived in section three is given below.

1/index = -0.05977*K * (-0.01003)*U * (-0.001494)*T * (-0.00000477)*I * (-0.000007963)*O * (-0.04570)*D * ( -0.01385*DIV + 0.002041*H1 + 0.05928*H2 + 0.0003611*H3 + 0.08388*H4 + 0.04422*H5 + 1.951*H6 - 0.01690*P + 0.003334*SPAN ) + 0.04643 [3.10]

Given the quality metric that is under investigation here the equation given above needs

to be modified in order to account for the quality attribute of the model. The generalized

regression model that was used in R is given in equation 4.11 below.

Table 4.2 – Yahoo Quality Metric

Minimum     Maximum     Average
0.0         3165.5      71.832


Fig. 4.1 – Receiver Operating Characteristic (ROC) Curve – With Quality Component –Bing Data


1/index ~ keywords*root*title*(inbound_links*quality*outbound_links)*description*(div+h1+h2+h3+h4+h5+h6+p+span) [4.11]

Execution of the summary function in R over the model definition given in the form of equation

4.11 leads to summary statistics of a Residual Standard Error of 0.09163, a Multiple R-Squared

value of 0.5702, an Adjusted R-Squared value of 0.2798, and a p value of 2.2*10^-16. The

findings here are significant, as they show that in the case of the Yahoo search engine the quality metric does help to improve the prediction of the search indexes. The derived R model for the Yahoo classification index is given in the form of equation 4.12 below. As may be validated through visual inspection, the newly derived formula is the previous formula as given in equation 3.12 with one additional component, the quality metric, shown as variable 'Q' in equation 4.12.

1/index = -5.39*K * 0.004452*U * 0.07541*T * 0.00003367*I * (-0.000009344)*Q * 0.00003294*O * 0.1243*D * ( 0.02064*DIV + 0.03605*H1 + 0.08839*H2 + (-0.08222)*H3 + (-7551)*H4 + 10130*H5 + 62230*H6 - 0.05492*P + (-0.006268)*SPAN ) + 0.04244 [4.12]

Equation 4.12 above provides a better indexing approximation, or an optimization if you will, over the original modeling paradigm of section three for the linear regression. The argument may even be made that one of the reasons the indexing approximation derived in section three was poor is that it was missing the quality component.

As an alternative analysis, as was done previously, a logistic regression was performed for the Yahoo data using the R statistical package. The regression model was defined as given below in equation 4.13.

lm.fit <- glm(1/index ~ keywords*root*title*(inbound_links*quality*outbound_links)*description*(div+h1+h2+h3+h4+h5+h6+p+span), family="binomial") [4.13]



The R statistical package provides a summary function that allows for the derivation of the component values. This tabulation by R yields equation 4.14, given below.

1/index = (-3.762e25)*K * (-7.973e14)*U * (6.461e15)*T * (2.028e12)*I * (4.150e12)*Q * (9.992e12)*O * (-1.662e16)*D * ( (1.892e15)*DIV + (2.101e15)*H1 + (-4.845e14)*H2 + (-3.738e15)*H3 + (-6.486e19)*H4 + (-8.662e19)*H5 + (-2.983e20)*H6 + (7.302e14)*P + (4.086e14)*SPAN ) - 2.641e15 [4.14]

Utilization of the ROC function in R allows for the derivation of an ROC plot, rendered through the plot function in R; the resulting curve is shown in Fig. 4.2 below. Calculating the area under the curve, or the probability of accuracy, yields a total of 0.7778. This analysis shows a refinement over the previous analysis, which was void of the quality component of the model. The refinement amounts to an improvement of 0.3536 points. This level of accuracy is significant, as it not only puts the argument over the 50% threshold, but also makes a strong case for the modeling paradigm adopted in this body of work. The full analysis performed herein may be found in the GitHub repository under the directory /YAHOO/R in the file 'glm - commands - Chapter 4.txt'.

4.9 Summary

The creation of an approximation to some datum is complex given the nature of the current environment. There exists a plethora of data around all of us that is semi structured and difficult to leverage and, to complicate matters further, the tools available to solve these problems are still evolving. One of the questions I believe I was able to answer in this body of work is that, while the landscape may be complex and the paradigm may appear to sit on a far and fuzzy horizon from a current perspective, it is the systems perspective that can shed light on the problem frame and at least help us understand the world a little better. At the very least it lets us wrap our hands around the problem frame, begin to understand the dynamic at work, and in some measure help chip away at the stone for others to put the face to the bust.


Fig. 4.2 – Receiver Operating Characteristic (ROC) Curve – With Quality Component –Yahoo Data


4.10 Bibliography

Acceleration. (n.d.). Retrieved March 16, 2016, from http://www.merriam-

webster.com/dictionary/acceleration

Agarwal, R., Dhar, V. (2014). Big Data, Data Science, and Analytics: The Opportunity and

Challenge for IS Research. Institute of Operations Research and the Management Science. Vol.

25, No. 3, 443-448

Ageev, M., Guo, Q., Lagun, D., Agichtein, E. (2011). Find It If You Can: A Game for Modeling

Different Types of Web Search Success Using Interaction Data. Proceedings of the 34th

International ACM SIGIR Conference on Research and Development In Information Retrieval.

ACM

Gandomi, A., Haider, M. (2015). Beyond the Hype: Big Data Concepts, Methods, and Analytics.

International Journal of Information Management. Vol. 35, 137-144

Gwizdka, J., Chignell, M. (1999). Towards Information Retrieval Measures for Evaluation of

Web Search Engines. Retrieved May 2015 from

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.15.3212&rep=rep1&type=pdf

Hersch, W., Elliot, D., Hickam, D., Wolf, S., Molnar, A., Leichtenstien, C. (1995). Towards New

Measures of Information Retrieval Evaluation. SIGIR ’95 Proceedings of the 18th Annual

International ACM SIGIR Conference on Research and Development in Information Retrieval.

164-170

Imafouo, A., Tannier, X. (2005). Retrieval Status Values in Information Retrieval Evaluation.

12th International Conference – String Processing and Information Retrieval

Jacobs, A. (2009). The Pathologies of Big Data. Communications of the ACM. Vol. 52, No. 9

Jerkovic, J. (2010). SEO Warrior. O’Reilly Media Inc.


Kandefer, M., Shapiro, S. (2008). An F-Measure for Context Based Information Retrieval.

Retrieved October 2015 from

http://commonsensereasoning.org/2009/papers/commonsense2009paper13.pdf

King, A. (2008). Website Optimization. Sebastopol, CA. O’Reilly Media Inc.

Laurila, J., Perez, D., Aad, I., Blom, J., Bornet, O., Do, T., Dousse, O., Eberle, J., Miettinen, M.

(2012). The Mobile Data Challenge: Big Data for Mobile Computing Research. Pervasive

Computing, Newcastle.

Lavalle, S., Lesser, E., Shockley, R., Hopkins, M., Kruschwitz, N. (2011). Big Data, Analytics

and the Path from Insights to Value. MIT Sloan Review. Vol. 52, No. 2

Lazer, D., Kennedy, R., King, G., Vespignani, A. (2014). The Parable of Google Flu: Traps in

Big Data Analysis. Science. Vol. 343

Losee, R. (2000). When Information Retrieval Measures Agree About the Relative Quality of

Document Rankings. Journal of the American Society for Information Science. Vol. 51, 834-840

Mehlitz, M., Kunegis, J., Bauckhage, C., Albayrak, S. (2007). A New Evaluation Measure for

Information Retrieval Systems. IEEE International Conference on Systems, Man and

Cybernetics.

Provost, F., Fawcett, T. (2013). Data Science and Its Relationship to Big Data and Data-Driven

Decision Making. Mary Ann Liebert, Inc. Vol. 1, No. 1

Snijders, C., Matzat, U., Reips, U. (2012). “Big Data”: Big Gaps of Knowledge in the Field of

Internet Science. International Journal of Internet Science. Vol. 7, 1-5

Sun, H., Wei, Y. (2005). A Note On The PageRank Algorithm. Elsevier. Vol. 179, Issue 2, 799-

806


Taylor, L., Schroeder, R., Meyer, E. (2014). Emerging Practices And Perspectives on Big Data

Analysis in Economics: Bigger and Better or More of the Same. Big Data and Society. July-

December 2014, 1-10

Ularu, E., Puican, F., Apostu, A., Velicanu, M. (2012). Perspectives on Big Data and Big Data

Analytics. Database Systems Journal. Vol. III, No. 4

Variable. (n.d.). Retrieved March 16, 2016, from http://www.merriam-

webster.com/dictionary/variable

Zhou, B., Yao, Y. (2010). Evaluating Information Retrieval System Performance Based on User

Preference. Journal of Intelligent Information Systems. Vol. 34, Issue 3, 227-248


CHAPTER V

CONCLUSION

The research questions at the onset of this body of work were defined as follows.

1. What are the system attributes for each of the search engine providers studied – Bing and Yahoo?

2. Can the system attributes be combined into a regression model to predict search results?

3. Can the big data paradigm be investigated from a systems perspective to help define system homeostasis?

4. What is the optimized classification formula that may be derived using systems theory?

Each of the research questions was answered in this collective body of work. It was discovered that while the two major search engines studied had similarities in their derived approximation formulas, they did differ. The system attributes identified in the analysis proved to be as defined in Table 5.1, given below.

Table 5.1 – System Attributes Summary

System Attributes
URL
Description
Copy
Inbound Links
Keywords
Outbound Links
Title

The second research question that was answered was whether these base attributes could be combined into a predictive model, and once again the answer was yes, to an accuracy of 77.78% for Yahoo and an accuracy of 42.42% for Bing. The third question that was answered in this body of work was whether the big data paradigm could be studied from a systems perspective to help determine homeostasis. This question was answered positively because of the modeling methodology that was followed. First the system was set to a boundary constraint, i.e.

the page, and then the page elements were derived to create a finite set of attributes over the problem frame. So while the World Wide Web may contain a plethora of nodes, these nodes have order since they are indexed by the providers, and as such this order must stem from some base properties. It was this assumption that was followed and for which positive results were obtained.

The fourth research question that was posed was to determine the optimum derivable classification formula. In the case of Bing this formula was determined to be equation 3.11, and in the case of Yahoo it was determined to be equation 4.14.

This body of work has been a first step in the process of understanding search engine formulas from the systems perspective and as such has laid a fundamental stepping stone for what may come. One of the attributes that was not investigated in this body of work was the location element. If the current search paradigm is geographically centric, as was hinted at in the research literature, then another measurement can be made here: the physical distance between query node and sink node. This element, or notion of distance, may very well be incorporated directly into the derived formulas 3.11 and 4.14. For that matter, additional attributes identified as having merit in the future could be incorporated directly into the formulas given and thus create refinements to the formulas presented here.


BIBLIOGRAPHY

Acceleration. (n.d.). Retrieved March 16, 2016, from http://www.merriam-

webster.com/dictionary/acceleration

Agarwal, R., Dhar, V. (2014). Big Data, Data Science, and Analytics: The Opportunity and

Challenge for IS Research. Institute of Operations Research and the Management Science. Vol.

25, No. 3, 443-448

Ageev, M., Guo, Q., Lagun, D., Agichtein, E. (2011). Find It If You Can: A Game for Modeling

Different Types of Web Search Success Using Interaction Data. Proceedings of the 34th

International ACM SIGIR Conference on Research and Development In Information Retrieval.

ACM

Al-Maolegi, M., Arkok, B. (2014). An Improved Apriori Algorithm for Association Rules.

International Journal on Natural Language Computing (IJNLC). Vol. 3, No. 1

Alla, H. (2008). A Novel Efficient Classification Algorithm for Search Engines. 8th WSEAS

International Conference on Applied Informatics and Communications. 20-22

Bar-Yosser, Z., Gurevich, M. (2008). Mining Search Engine Query Logs via Suggestion

Sampling. ACM. Vol. 1, Issue 1, 54-65

Barbay, J., Kenyon, C. (2003). Deterministic Algorithm for the t-Threshold Set Problem.

Retrieved May 2015 from http://users.dcc.uchile.cl/~jbarbay/Publications/2003-ISAAC-

DeterministicAlgorithmForTheTThresholdProblem-BarbayKenyon.pdf

Beeferman, D., Berger, A. (2000). Agglomerative Clustering of a Search Engine Query Log.

KDD '00 Proceedings of the sixth ACM SIGKDD International Conference on Knowledge

Discovery and Data Mining. 407-416

Beel, J., Gipp, B. (2009). Google Scholar’s Ranking Algorithm: The Impact of Article’s Age

(An Empirical Study). Information Technology: New Generations, 2009, ITNG ’09. Sixth

International Conference. 160-164


Berry, M., Browne, M. (2005). Understanding Search Engines: Mathematical Modeling and

Text Retrieval. Siam

Bijral, S., Mukhopadhyay, D. Efficient Fuzzy Search Engine with B-Tree Search Mechanism.

Retrieved May 2015 from http://arxiv.org/abs/1411.6773

Carterette, B., Jones, R. (2008). Evaluating Search Engines by Modeling the Relationship

Between Relevance and Clicks. Advances in Neural Information Processing Systems

Chang, J., Chiou, S. (2009). An EM Algorithm for Context-Based Searching and Disambiguation

with Application to Synonym Term Alignment. 23rd Pacific Asia Conference on Language,

Information and Computing. 630-637.

Choudhary, L., Burdak, B. Role of Ranking Algorithms for Information Retrieval. Retrieved

May 2015 from http://arxiv.org/abs/1208.1926

Chuklin, A., Rijke, M. (2014). The Anatomy of Relevance. Retrieved May 2015 from

http://arxiv.org/abs/1501.06412

Cohen, A., Vitanyi, P. Web Similarity. Retrieved May 2015 from

http://arxiv.org/abs/1502.05957

Dahiwale, P., Raghuwanshi, M., Malik, L. (2014). PDD Crawler: A Focused Web Crawler Using

Link and Content Analysis for Relevance Prediction. SEAS-2014, Dubai, UAE, International

Conference.

Erdani, Y. (2012). Developing Backward Chaining Algorithm of Inference Engine in Ternary

Grid Expert System. International Journal of Advanced Computer Science and Applications

(IJACSA). Vol. 3, No. 9

Frees, A., Gamble, J., Rudinger, K., Bach, E., Friesen, M., Joynt, R., Coppersmith, S. Power

Law Scaling for the Adiabatic Algorithm for Search Engine Ranking. Retrieved May 2015 from

http://arxiv.org/abs/1211.2248

Page 111: Copyright 2017, Guillermo Antonio Rodriguez

Texas Tech University, Guillermo Antonio Rodriguez, May 2017

103

Gandomi, A., Haider, M. (2015). Beyond the Hype: Big Data Concepts, Methods, and Analytics.

International Journal of Information Management. Vol. 35, 137-144

Garnerone, S., Zanardi, P., Lidar, D. Adiabatic Quantum Algorithm for Search Engine Ranking.

Retrieve May 2015 from http://arxiv.org/abs/1109.6546

Gwizdka, J., Chignell, M. (1999). Towards Information Retrieval Measures for Evaluation of

Web Search Engines. Retrieved May 2015 from

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.15.3212&rep=rep1&type=pdf

Google. Search Engine Optimization Starter Guide. Retrieved July 15, 2015, from

http://static.googleusercontent.com/media/www.google.com/en/us/webmasters/docs/search-

engine-optimization-starter-guide.pdf

Hassan, A. (2012). A Semi-Supervised Approach to Modeling Web Search Satisfaction.

Proceedings of the 35th International ACM SIGIR Conference on Research and Development in

Information Retrieval. ACM

Henzinger, M. (2007) Combinatorial Algorithms for Web Search Engines – Three Success

Stories. SODA '07 Proceedings off the Eighteenth Annual ACM -SIAM Symposium on Discrete

Algorithms. 1022-1026

Hersch, W., Elliot, D., Hickam, D., Wolf, S., Molnar, A., Leichtenstien, C. (1995). Towards New

Measures of Information Retrieval Evaluation. SIGIR ’95 Proceedings of the 18th Annual

International ACM SIGIR Conference on Research and Development in Information Retrieval.

164-170

Heuer, J., Dupke, S. Towards a Spatial Search Engine Using Geotags. Retrieved May 2015 from

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.463.8323&rep=rep1&type=pdf

Huang, P., He, X., Gao, J., Deng, L., Acero, A., Heck, L. (2013). Learning Deep Structured

Semantic Models for Web Search Using Clickthrough Data. Proceedings of the 22nd ACM

International Conference on Information & Knowledge Management. ACM

Page 112: Copyright 2017, Guillermo Antonio Rodriguez

Texas Tech University, Guillermo Antonio Rodriguez, May 2017

104

Imafouo, A., Tannier, X. (2005). Retrieval Status Values in Information Retrieval Evaluation.

12th International Conference – String Processing and Information Retrieval

Ishii, H., Tempo, R. (2010). Distributed Randomized Algorithms for the PageRank Computation.

Automatic Control, IEEE Transactions. Vol. 55, Issue 9, 1987-2002

Jacobs, A. (2009). The Pathologies of Big Data. Communications of the ACM. Vol. 52, No. 9

Jansen, B., Spink, A. (2006). How Are We Searching the World Wide Web? A Comparison of

Nine Search Engine Transaction Logs. Information Processing and Management. Vol 40, 248-

263

Jerkovic, J. (2010). SEO Warrior. Sebastopol, CA. O’Reilly Media Inc.

Jones, S. (2002). Encyclopedia of New Media: An Essential Reference to Communication and

Technology. SAGE Publications, Inc.

Kandefer, M., Shapiro, S. (2008). An F-Measure for Context Based Information Retrieval.

Retrieved October 2015 from

http://commonsensereasoning.org/2009/papers/commonsense2009paper13.pdf

King, A. (2008). Website Optimization. Sebastopol, CA. O’Reilly Media Inc.

Koorangi, M., Zamanifar, K. (2007). A Distributed Agent Based Web Search Using Genetic

Algorithm. International Journal of Computer Science and Network Security. Vol. 7, No. 1

Kumar, R., Saini, S. (2011). A Study on SEO Monitoring System Based on Corporate Website

Development. International Journal on Computer Science, Engineering and Information

Technology (IJCSEIT). Vol. 1, No. 2

Lardin-Schweitzer, Y., Collet, P., Lutton, E., Prost, T. (2003). Introducing Lateral Thinking in

Search Engines with Interactive Evolutionary Algorithms. SAC '03 Proceedings of the 2003

ACM Symposium on Applied Computing, 214-219

Page 113: Copyright 2017, Guillermo Antonio Rodriguez

Texas Tech University, Guillermo Antonio Rodriguez, May 2017

105

Laurila, J., Perez, D., Aad, I., Blom, J., Bornet, O., Do, T., Dousse, O., Eberle, J., Miettinen, M.

(2012). The Mobile Data Challenge: Big Data for Mobile Computing Research. Pervasive

Computing, Newcastle.

Lavalle, S., Lesser, E., Shockley, R., Hopkins, M., Kruschwitz, N. (2011). Big Data, Analytics

and the Path from Insights to Value. MIT Sloan Review. Vol. 52, No. 2

Lazer, D., Kennedy, R., King, G., Vespignani, A. (2014). The Parable of Google Flu: Traps in

Big Data Analysis. Science. Vol. 343

Li, C., Hong, M., Cogill, R., Garcia, A. An Adaptive Online Ad Auction Scoring Algorithm for

Revenue Maximization. Retrieved May 2015 from http://arxiv.org/abs/1207.4701

Li, H., Wang, Y, Zhang, D., Zhang, M., Chang, E. PFP: Parallel FP-Growth for Query

Recommendation. Retrieved May 2015 from

http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/34668.pdf

Losee, R. (2000). When Information Retrieval Measures Agree About the Relative Quality of

Document Rankings. Journal of the American Society for Information Science. Vol. 51, 834-840

Mehlitz, M., Kunegis, J., Bauckhage, C., Albayrak, S. (2007). A New Evaluation Measure for

Information Retrieval Systems. IEEE International Conference on Systems, Man and

Cybernetics.

Mukhopadhyay, D., Biswas, P., Kim, Y. (2006). A Syntactic Classification Based Web Page

Ranking Algorithm. Retrieved May 2015 from

http://arxiv.org/ftp/arxiv/papers/1102/1102.0694.pdf

PageRank. Retrieved May 25, 2015 from http://en.wikipedia.org/wiki/PageRank

Pal, A., Tomar, D., Shrivastava, S. (2009). Effective Focused Crawling Based on Content and

Link Structure Analysis. International Journal of Computer Science and Information Security

(IJCSIS). Vol. 2, No. 1

Page 114: Copyright 2017, Guillermo Antonio Rodriguez

Texas Tech University, Guillermo Antonio Rodriguez, May 2017

106

Provost, F., Fawcett, T. (2013). Data Science and Its Relationship to Big Data and Data-Driven

Decision Making. Mary Ann Liebert, Inc. Vol. 1, No. 1

Radinsky, K., Svore, K., Dumais, S., Teevan, J., Bocharov, A., Horvitz, E. (2012). Modeling and

Predicting Behavioral Dynamics on the Web. Proceedings of the 21st International Conference

on World Wide Web. ACM

Rani, M., Parashar, A., Chaturvedi, J., Malviya, A., Search Space Engine Optimize Search Using

FCC_STF Algorithm in Fuzzy Co-Clustering. Retrieved May 2015 from

http://arxiv.org/abs/1407.6952

Rojas, M. A Semantic Association Page Rank Algorithm for Web Search Engines. Retrieved

May 2015 from http://arxiv.org/abs/1211.6159

Sedigh, A., Roudaki, M. (2003). Identification of the Dynamics of the Google's Ranking

Algorithm. 13th IFAC Symposium on System Identification

Sheng, C., Zhang, N., Tao, Y., Jin, X. (2012). Optimal Algorithms for Crawling a Hidden

Database in the Web. Proceedings of the VLDB Endowment. Vol. 5, No. 11

Smyth, B., Balfe, E., Freyne, J., Briggs, P., Coyle, M., Boydell, O. (2004). Exploiting Query

Repetition and Regularity in an Adaptive Community-Based Web Search Engine. User

Modelling and User-Adaptive Interaction. Vol 14, 383-423

Snijders, C., Matzat, U., Reips, U. (2012). “Big Data”: Big Gaps of Knowledge in the Field of

Internet Science. International Journal of Internet Science. Vol. 7, 1-5

Spamdexing. Retrieved May 25, 2015 from http://en.wikipedia.org/wiki/Spamdexing

Spirin, N., Han, J. Survey on Web Spam Detection: Principles and Algorithms. Retrieved May

2015 from http://www.kdd.org/sites/default/files/issues/13-2-2011-12/V13-02-08-Spirin.pdf

Sun, H., Wei, Y. (2005). A Note On The PageRank Algorithm. Elsevier. Vol. 179, Issue 2, 799-

806

Page 115: Copyright 2017, Guillermo Antonio Rodriguez

Texas Tech University, Guillermo Antonio Rodriguez, May 2017

107

Suri, P., Taneja, H. (2012). An Integrated Ranking Algorithm for Efficient Information

Computing In Social Networks. International Journal on Web Service Computing (IJWSC). Vol.

3, No. 1

Taylor, L., Schroeder, R., Meyer, E. (2014). Emerging Practices And Perspectives on Big Data

Analysis in Economics: Bigger and Better or More of the Same. Big Data and Society. July-

December 2014, 1-10

Turney, P. (2008). The Latent Relation Mapping Engine: Algorithm and Experiments. Journal of

Artificial Intelligence Research. Vol. 33, 615-655

U.S. Department of Commerce. E-Stats. May 22, 2014. Web. April 9, 2015

Ularu, E., Puican, F., Apostu, A., Velicanu, M. (2012). Perspectives on Big Data and Big Data

Analytics. Database Systems Journal. Vol. III, No. 4

Variable. (n.d.). Retrieved March 16, 2016, from http://www.merriam-

webster.com/dictionary/variable

Vogel, D., Bickel, S., Haider, P., Schimpfky, R., Siemen, P., Bridges, S., Scheffer, T. (2005).

Classifying Search Engine Queries Using the Web as Background Knowledge. ACM SIGKDD

Explorations Newsletter. Vol. 7, Issue 2, 117-122.

W3C. 4.01. World Wide Web Consortium. December 24, 1999. Web. April 9, 2015

Wang, F., Du, Y., Dong, Q. (2008). A Search Quality Evaluation Based On Objective-Subjective

Method. Journal of Convergence Information Technology. Vol. 3, No. 2, 50-56

Wang, H., Zhai, C., Liang, F., Dong, A., Chang, Y. (2014). User Modeling in Search Logs Via a

Nonparametric Bayesian Approach. Proceedings of the 7th ACM International Conference on

Web Search and Data Mining. ACM.

Wei, Z., Zhao, P., Zhang, L. (2014). Design and Implementation of Image Search Algorithm.

American Journal of Software Engineering and Applications. Vol. 3, No. 6, 90-94

Page 116: Copyright 2017, Guillermo Antonio Rodriguez

Texas Tech University, Guillermo Antonio Rodriguez, May 2017

108

White, C. (2007). Sergey Brin and Larry Page: the founders of Google. The Rosen Publishing

Group, Inc. New York, NY.

Xu, S., Zhu, Y., Jiang, H., Lau, F. A User-Oriented Webpage Ranking Algorithm Based On user

Attention Time. Retrieved May 2015 from https://www.aaai.org/Papers/AAAI/2008/AAAI08-

199.pdf

Younes, A., Rowe, J., Miller, J. (2008). A Hybrid Quantum Search Engine: A Fast Quantum

Algorithm for Multiple Matches. Retrieved from http://arxiv.org/abs/quant-ph/0311171

Yue, Z., Han, S., He, D. (2014). Modeling Search Processes Using Hidden States in

Collaborative Exploratory Web Search. Proceedings of the 17th ACM Conference on Computer

Supported Cooperative Work & Social Computing. ACM

Zhou, B., Yao, Y. (2010). Evaluating Information Retrieval System Performance Based on User

Preference. Journal of Intelligent Information Systems. Vol. 34, Issue 3, 227-248

Zhou, L., Personalized Web Search. Retrieved May 2015 from http://arxiv.org/abs/1502.01057


APPENDIX A

The data mining algorithms were written in the Python programming language (https://www.python.org). Python is a structured programming language with a vast array of libraries that facilitate software development. It was chosen for its ability to parse HTML through the HTMLParser library and for the libraries available for querying search engines such as Google, Bing, and Yahoo. Listed below are the Python programs created to query the search engines and to index each of the system attributes identified in chapter three of this dissertation. The programs are displayed by file name, along with any additional non-Python files used, such as XML configuration files.
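Before the individual listings, the following is a minimal sketch, inferred from the data files that each script reads and writes, of one possible order in which the programs below could be invoked for a single search engine. The exact workflow used in the study is not stated here; the sequence, file names, and flags shown are assumptions drawn only from the argparse definitions and file paths visible in the listings.

import subprocess

ENGINE = "BING"  # or "GOOGLE" / "YAHOO"
steps = [
    ["python", "pySearch.py", "-engine", ENGINE],                            # query the engine; writes <term>_<timestamp>.data files
    ["python", "merge.py", "-engine", ENGINE],                               # combine raw .data files into _compiled.dat
    ["python", "index.py", "-engine", ENGINE, "-operation", "index"],        # re-index each result URL into _indexed.dat
    ["python", "index.py", "-engine", ENGINE, "-operation", "consolidate"],  # average duplicate URLs into _clean.dat
    ["python", "pyBacklinks.py", "-engine", ENGINE],                         # count inbound links; writes _complete.dat and _historical.dat
    ["python", "pyOptimization.py", "-engine", ENGINE, "-operation", "INDEX"],
    ["python", "pyOptimization.py", "-engine", ENGINE, "-operation", "FILTER"],
    ["python", "pyOptimization.py", "-engine", ENGINE, "-operation", "QUALITY"],
    ["python", "pySummary.py", "-engine", ENGINE, "-file", "_complete.dat"], # summary statistics over an indexed file
]
for step in steps:
    subprocess.run(step, check=True)

Note that the listings are not entirely consistent about locations: pySearch.py writes its .data files into the engine folder itself, while the later scripts read from that folder's data subdirectory, so the intermediate files may need to be moved between steps.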

config.py

from xml.dom import minidom
import os

class config:

    bing_settings = { }    # 'url': '', 'externallinks': ''
    yahoo_settings = { }   # 'url': '', 'externallinks': ''
    google_settings = { }  # 'url': '', 'externallinks': ''

    def __init__(self):
        print('\n==================================================================')
        print('%s Initialized' % self.__class__.__name__)
        print('==================================================================')
        self.file = os.path.dirname(os.path.abspath(__file__)) + '\\' + 'config.xml'
        self.xmldoc = minidom.parse(self.file)

    def bing(self):
        for child in self.xmldoc.getElementsByTagName("bing")[0].childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                self.bing_settings[child.nodeName] = child.firstChild.nodeValue

    def google(self):
        for child in self.xmldoc.getElementsByTagName("google")[0].childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                self.google_settings[child.nodeName] = child.firstChild.nodeValue

    def yahoo(self):
        for child in self.xmldoc.getElementsByTagName("yahoo")[0].childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                self.yahoo_settings[child.nodeName] = child.firstChild.nodeValue


config.xml

<?xml version="1.0" encoding="windows-1252"?>
<root>
    <!-- Bing Account Settings // -->
    <bing>
        <url>http://www.bing.com/search?q=</url>
        <externallinks>http://bing.com/search?q=link: [URL] -site:[BASE_URL]</externallinks>
    </bing>
    <!-- Google Account Settings // -->
    <google>
        <url>http://www.google.com/search?q=</url>
        <externallinks>http://www.google.com/search?q=link:[URL]+-site:[BASE_URL]&amp;num=1000</externallinks>
    </google>
    <!-- Yahoo Account Settings // -->
    <yahoo>
        <url>http://search.yahoo.com/search?p=</url>
        <externallinks>http://search.yahoo.com/search?p=link: [URL] -site:[BASE_URL]</externallinks>
    </yahoo>
</root>


extract.py

import urllib.request
from htmlHelper import *
import sys
import random
import os

class Extract:

    _proxies = []
    _agents = []

    def __init__(self):
        print('\n==================================================================')
        print('%s Initialized' % self.__class__.__name__)
        print('==================================================================')
        _proxies = []
        _source = os.path.dirname(os.path.realpath(__file__)) + '\\proxies.txt'
        with open(_source, 'r') as _file:
            for _line in _file:
                _proxies.append(_line.strip())

    # Retrieve series of links from search engine query
    def extract_links(self, url, engine, use_proxy = False):
        _html = HTMLhelper()
        _html.search_engine(engine, 'URLS')
        try:
            if len(self._agents) == 0:
                _source = os.path.dirname(os.path.abspath(__file__)) + '\\' + 'agents.txt'
                _file = open(_source, 'r')
                _agent = _file.readline().strip()
                while _agent:
                    self._agents.append(_agent)
                    _agent = _file.readline().strip()
                _file.close()
            _request = None
            url = url.replace(' ', '%20')
            if use_proxy:
                if len(self._proxies) == 0:
                    _source = os.path.dirname(os.path.realpath(__file__)) + '\\proxies.txt'
                    with open(_source, 'r') as _file:
                        for _line in _file:
                            self._proxies.append(_line.strip())
                _prox = random.choice(self._proxies)
                print("Using Proxy: %s" % _prox)
                proxies = {'http': 'http://'+_prox}
                opener = urllib.request.FancyURLopener(proxies)
                with opener.open(url) as f:
                    print(f.read().decode('utf-8'))
            else:
                _request = urllib.request.Request(url)
                if engine.upper() == 'YAHOO' or engine.upper() == 'GOOGLE':
                    _request.add_header('User-agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36')
                else:
                    _request.add_header('User-Agent', random.choice(self._agents))
                _response = urllib.request.urlopen(_request)
                _html.feed(_response.read().decode('utf-8'))
        except ValueError as v:
            print("Error: %s" % v)
            exit()
        return _html

    # Extract indexes from end point URL given keywords array
    def extract_indexes(self, url, keywords):
        _html = HTMLhelper()
        try:
            _request = urllib.request.Request(url.replace(' ', '%20'))
            _request.add_header('User-agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36')
            _response = urllib.request.urlopen(_request, timeout=180)
            _html.search_indexes(url, 'KEYS', keywords)
            _html.feed(_response.read().decode('utf-8'))
        except:
            print(sys.exc_info()[1])
            return {}
        return _html.indexes

    # Extract embedded links from search engine
    def external_links(self, url, engine):
        _html = HTMLhelper()
        try:
            url = url.replace(' ', '%20')
            print("Back Links: %s" % url)
            _request = urllib.request.Request(url)
            _request.add_header('User-agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36')
            _response = urllib.request.urlopen(_request, timeout=180)
            _html.get_backlinks(url, engine)
            _html.feed(_response.read().decode('utf-8'))
        except:
            print(sys.exc_info()[1])
            return {}
        return _html


htmlHelper.py

class HTMLhelper(HTMLParser): """ type = BING | GOOGLE | YAHOO operation = URLS | KEYS | INDEX URLS = Links on search engine KEYS = Primary indexing attributes on result link for each individual link extracted from main query to search engine INDEX = Backlinks to main URL """ indexes = { 'description': 0, 'div': 0, 'outbound_links': 0, 'h1': 0, 'h2': 0, 'h3': 0, 'h4': 0, 'h5': 0, 'h6': 0, 'inbound_links': 0, 'keywords': 0, 'p': 0, 'span': 0, 'title': 0, 'root': 0 } links = [] operation = '' next = '' root_url = '' tag = '' type = '' url = '' words = {} backlinks = [] need_proxy = False # Extract links from search engine query. Base search to retrieve series of URLs that match a given query to search engine. # operation = URLS def search_engine(self, type, operation): self.links = [] self.operation = operation self.next = '' self.type = type self.need_proxy = False self.previous_tag = '' # Retrieve individual indexes from keywords for a specific end point, i.e. URL # operation = KEYS def search_indexes(self, url, operation, words): self.indexes = { 'description': 0, 'div': 0, 'outbound_links': 0, 'h1': 0, 'h2': 0, 'h3': 0, 'h4': 0, 'h5': 0, 'h6': 0, 'inbound_links': 0, 'keywords': 0, 'p': 0, 'span': 0, 'title': 0, 'root': 0 } self.operation = operation self.url = url self.root_url = self.url.lower().replace('https', '').replace('http', '').replace(':', '').replace('//', '').replace('www.', '')


self.words = words self.need_proxy = False for _word in self.words: if len(self.root_url) > 0: self.indexes['root'] += ( self.root_url.lower().count(_word.lower()) * len(_word) ) / len(self.root_url) # Get backlinks to main URL found from the invocation of the search_engine(). def get_backlinks(self, url, type): self.url = url self.next = '' self.type = type self.backlinks = [] self.operation = 'INDEX' self.need_proxy = False def handle_starttag(self, tag, attrs): self.tag = tag self.attrs = attrs if self.operation.upper() == 'URLS': if self.tag == 'a' and self.type.upper() == 'GOOGLE': if len(attrs) == 2: _hasMouseDown = False _hasHref = False _link = '' _iterate = True for items in attrs: for key in items: if _iterate: if 'href' in key and items[1].strip() != '#' and ('google' not in items[1].strip() and items[1][0] != '/'): _hasHref = True _link = items[1] elif 'onmousedown' in key and 'return' in items[1] and 'rwt(this' in items[1]: _hasMouseDown = True elif ('class' == key.strip() and 'fl' == items[1].strip()) or ('data-' in key.strip()): _hasMouseDown = False _hasHref = False _iterate = False if _hasHref and _hasMouseDown: self.links.append({'url': _link, 'count': len(self.links) + 1 }) else: _hasClass = False _hasHref = False _hasId = False


_link = '' for items in attrs: for key in items: if 'class' in key and 'pn' in items[1]: _hasClass = True elif 'href' in key: _hasHref = True _link = items[1] elif 'id' in key and 'pnnext' in items[1]: _hasId = True if _hasClass and _hasHref and _hasId and _link != '': self.next = 'https://www.google.com'+_link elif self.tag == 'a' and self.type.upper() == 'BING': _hasHref = False _hasH = False _hasTitle = False _hasClass = False _link = '' for items in attrs: for key in items: if 'href' == key and ('bing.com' not in items[1] ) and ('go.microsoft.com' not in items[1] ) and items[1][0] != '/' and 'javascript:' not in items[1] and '#' != items[1][0]: _hasHref = True _link = items[1] elif 'h' == key and 'Ads' not in items[1].strip(): _hasH = True elif 'title' == key and 'next page' == items[1].strip().lower(): _hasTitle = True elif 'class' == key and 'sb_pagn' == items[1].strip().lower(): _hasClass = True elif 'href' == key and '/search?q=' == items[1][:10].strip().lower() and '&first=' in items[1] and 'FORM=PORE' in items[1]: _hasHref = True _link = 'http://www.bing.com' + items[1] if _hasHref and _hasH and (not _hasTitle) and (not _hasClass): _exists = False for _element in self.links: if 'url' in _element and ( _element.get('url').strip().lower().replace('https', '').replace('http','') == _link.strip().lower().replace('https', '').replace('http','') or _link.strip().lower().replace('https', '').replace('http','') in _element.get('url').strip().lower().replace('https', '').replace('http','') or _element.get('url').strip().lower().replace('https', '').replace('http','') in _link.strip().lower().replace('https', '').replace('http','')): _exists = True break


if not _exists: self.links.append({'url': _link, 'count': len(self.links)+1}) elif _hasHref and _hasH and _hasTitle and _hasClass: self.next = _link elif self.tag == 'a' and self.type.upper() == 'YAHOO': _hasClass = False _hasLink = False _hasTarget = False _hasData = False _hasReferrerPolicy = False _link = '' for items in attrs: for key in items: if 'class' == key: _hasClass = True elif 'href' == key and 'javascript' not in items[1].strip().lower() and '#' not in items[1].strip().lower() and 'search' not in items[1].strip().lower() and 'yahoo.com' not in items[1].strip().lower(): _hasLink = True _link = items[1] elif 'referrerpolicy' == key and 'origin' == items[1].strip().lower(): _hasReferrerPolicy = True elif 'target' == key: _hasTarget = True elif 'data' in key and 'beacon' in items[1].strip().lower(): _hasData = True if _hasClass and _hasLink and _hasTarget and _hasData and _hasReferrerPolicy: self.links.append({'url': _link, 'count': len(self.links) + 1 }) elif self.operation.upper() == 'KEYS': # Meta Tag Indexes - description and keywords if self.tag.lower().strip() == 'meta': _attrs = dict(attrs) if 'name' in _attrs and 'content' in _attrs and len(_attrs) == 2: # Meta Tags for _word in self.words: if _attrs['name'].lower().strip() == 'description' and len(_attrs['content'].strip()) > 0: self.indexes['description'] += ( _attrs['content'].lower().count(_word.lower()) * len(_word) ) / len(_attrs['content'].strip()) elif _attrs['name'].lower().strip() == 'keywords' and len(_attrs['content'].strip()) > 0: self.indexes['keywords'] += (_attrs['content'].lower().count(_word.lower()) * len(_word) ) / len(_attrs['content'].strip()) elif self.tag.lower().strip() == 'a': # Outbound Links


_nofollow = False _validLink = False for items in attrs: for key in items: _baseURL = self.root_url if self.root_url.find('/') != -1: _baseURL = self.root_url[:self.root_url.find('/')] if 'href' == key and _baseURL not in items[1] and '#' != items[1][:1] and '/' != items[1][:1] and 'javascript' not in items[1]: _validLink = True elif 'rel' == key and items[1].strip().lower() == '': _nofollow = True if not _nofollow and _validLink: self.indexes['outbound_links'] += 1 elif self.operation.upper() == 'INDEX': if self.tag == 'a' and self.type.upper() == 'BING': _hasHref = False _hasH = False _hasTitle = False _hasClass = False _link = '' for items in attrs: for key in items: if 'href' == key.lower() and ('bing.com' not in items[1] ) and ('go.microsoft.com' not in items[1] ) and items[1][0] != '/' and 'javascript:' not in items[1] and '#' != items[1][0]: _hasHref = True _link = items[1] elif 'h' == key.lower() and 'ID=SERP' in items[1].strip(): _hasH = True elif 'class' == key.lower() and 'sb_pagn' == items[1].strip().lower(): _hasClass = True elif 'title' == key.lower() and 'next page' == items[1].strip().lower(): _hasTitle = True elif 'href' == key.lower() and '/search?q=' in items[1].strip().lower() and ( 'FORM=' not in items[1] ) and ('bing.com' not in items[1]) : _hasHref = True self.next = items[1] if _hasHref and _hasH: _exists = False if ( not _hasTitle ) and ( not _hasClass ): for _element in self.backlinks: if ( _element.strip().lower().replace('https', '').replace('http','') == _link.strip().lower().replace('https', '').replace('http','') or _link.strip().lower().replace('https', '').replace('http','') in _element.strip().lower().replace('https', '').replace('http','') or _element.strip().lower().replace('https', '').replace('http','') in _link.strip().lower().replace('https', '').replace('http','')):


_exists = True break if not _exists: self.backlinks.append(_link) elif self.tag == 'a' and self.type.upper() == 'GOOGLE': # Extract Links for Backlink Analysis if len(attrs) == 2: _hasMouseDown = False _hasHref = False _link = '' _iterate = True for items in attrs: for key in items: if _iterate: if 'href' in key and items[1].strip() != '#' and ('google' not in items[1].strip() and items[1][0] != '/'): _hasHref = True _link = items[1] elif 'onmousedown' in key and 'return' in items[1] and 'rwt(this' in items[1]: _hasMouseDown = True elif ('class' == key.strip() and 'fl' == items[1].strip()) or ('data-' in key.strip()): _hasMouseDown = False _hasHref = False _iterate = False if _hasHref and _hasMouseDown: self.backlinks.append(_link) else: _hasClass = False _hasHref = False _hasId = False _link = '' for items in attrs: for key in items: if 'class' in key and 'pn' in items[1]: _hasClass = True elif 'href' in key: _hasHref = True _link = items[1] elif 'id' in key and 'pnnext' in items[1]: _hasId = True if _hasClass and _hasHref and _hasId and _link != '': self.next = 'https://www.google.com'+_link elif self.tag == 'a' and self.type.upper() == 'YAHOO': _hasClass = False


_hasLink = False _hasTarget = False _hasData = False _hasReferrerPolicy = False _link = '' for items in attrs: for key in items: if 'class' == key: _hasClass = True elif 'href' == key and 'javascript' not in items[1].strip().lower() and '#' not in items[1].strip().lower() and 'search' not in items[1].strip().lower() and 'yahoo.com' not in items[1].strip().lower(): _hasLink = True _link = items[1] elif 'referrerpolicy' == key and 'origin' == items[1].strip().lower(): _hasReferrerPolicy = True elif 'target' == key: _hasTarget = True elif 'data' in key and 'beacon' in items[1].strip().lower(): _hasData = True if _hasClass and _hasLink and _hasTarget and _hasData and _hasReferrerPolicy: self.backlinks.append(_link) def handle_endtag(self, tag): self.tag = '' def handle_data(self, data): # Search Engine Prevents Response if self.type.upper() == 'BING': pass elif self.type.upper() == 'GOOGLE': pass elif self.type.upper() == 'YAHOO': if 'error 999' in data: print("Proxy Required") self.need_proxy = True if self.operation.upper() == 'URLS': # Extract Search Engine URLS if self.tag == 'd:url' and self.type.upper() == 'BING': pass elif self.tag == 'a' and self.type.upper() == 'GOOGLE': pass elif self.tag == 'a' and self.type.upper() == 'YAHOO': if data == 'Next': _hasLink = False


for items in self.attrs: for key in items: if key == 'href': self.next = items[1] elif self.operation.upper() == 'KEYS': # Keywords in Text Tag Elements - div, h1, h2, h3, h4, h5, h6, p, span, title for _key in self.indexes: if _key == self.tag: for _word in self.words: if len(data.strip()) > 0: self.indexes[_key] += ( data.lower().count(_word.lower()) *len(_word) ) / len(data.strip()) elif self.operation.upper() == 'INDEX': # Backlinks - Next Link on Page if self.type.upper() == 'BING': # Found in Anchor Tag pass elif self.type.upper() == 'GOOGLE': pass elif self.type.upper() == 'YAHOO': if data == 'Next': _hasLink = False for items in self.attrs: for key in items: if key == 'href': self.next = items[1]


index.py

import os
import argparse
from extract import *
from nltk.corpus import wordnet

print('Started ....')

# --------------------------------------------------------------------------
# Global Configuration
# --------------------------------------------------------------------------
_MODE = "DEBUG"
# --------------------------------------------------------------------------

parser = argparse.ArgumentParser(prog='index.py')
parser.add_argument('-engine', help='Search Engine [BING | GOOGLE | YAHOO]')
parser.add_argument('-operation', help='Operation [index | consolidate]')
parse = parser.parse_args()

if parse.engine and parse.operation:
    if parse.operation.lower() == 'index':
        _path = os.getcwd()+'\\'+parse.engine+'\\data\\'
        _type = '.data'
        _destination = '_indexed.dat'
        if os.path.exists(_path+_destination):
            os.remove(_path+_destination)
        with open(_path+_destination, 'w') as _target:
            _target.write('index\tkey\turl\tdescription\tdiv\th1\th2\th3\th4\th5\th6\tinbound_links\tkeywords\toutbound_links\tp\troot\tspan\ttitle\n')
        for _file in os.listdir(_path):
            if _type in _file:
                _name_pairs = _file.split('_')
                _keywords = []
                for _synonyms in wordnet.synsets(_name_pairs[0]):
                    for _s in _synonyms.lemmas():
                        _s = _s.name().replace('_', ' ')
                        if _s not in _keywords:
                            _keywords.append(_s)
                if len(_keywords) == 0:
                    _keywords.append(_name_pairs[0])
                with open(_path+_file) as _from:
                    for _line in _from:
                        _data = _line.split('\t')
                        if _data[0] != 'index':
                            print("Indexing: %s" % _data[1])
                            _extract = Extract()
                            _indexes = sorted(_extract.extract_indexes(_data[1], _keywords).items())
                            if _indexes:
                                with open(_path+_destination, 'a') as _target:
                                    _target.write(_data[0]+'\t'+_name_pairs[0]+'\t'+_data[1])
                                    for _index in _indexes:
                                        _target.write('\t'+str(_index[1]))
                                    _target.write('\n')
    elif parse.operation.lower() == 'consolidate':
        # Determines the average index for each node found
        print("Consolidating....")
        _path = os.getcwd()+'\\'+parse.engine+'\\data\\'
        _source = '_indexed.dat'
        _destination = '_clean.dat'
        # Get Distinct URLs
        _sites = []
        with open(_path+_source, 'r') as _from:
            for _line in _from:
                _data = _line.split('\t')
                if _data[0] != 'index':
                    if len(_sites) == 0:
                        _sites.append({'index': _data[0], 'key': _data[1], 'url': _data[2], 'indexes': _data[3:]})
                    _exists = False
                    for _site in _sites:
                        if _data[1] == _site['key'] and _site['url'] == _data[2]:
                            _exists = True
                            break
                    if not _exists:
                        _sites.append({'index': _data[0], 'key': _data[1], 'url': _data[2], 'indexes': _data[3:]})
                        print("Added: %s" % _data[2])
        # Get Average Index
        _cleaned = []
        for _site in _sites:
            _compositeIndex = float(_site['index'])
            _count = 1.0
            with open(_path+_source, 'r') as _from:
                for _line in _from:
                    _data = _line.split('\t')
                    if _data[0] != 'index' and _data[1] == _site['key'] and _data[2] == _site['url']:
                        _count += 1
                        _compositeIndex += float(_data[0])
            _cleaned.append({'index': _compositeIndex/_count, 'key': _site['key'], 'url': _site['url'], 'indexes': _site['indexes']})
        if os.path.exists(_path+_destination):
            os.remove(_path+_destination)
        with open(_path+_destination, 'w') as _target:
            _target.write('index\tkey\turl\tdescription\tdiv\th1\th2\th3\th4\th5\th6\tinbound_links\tkeywords\toutbound_links\tp\troot\tspan\ttitle\n')
        for _clean in _cleaned:
            with open(_path+_destination, 'a') as _file:
                _file.write(str(_clean['index']) + '\t' + _clean['key'] + '\t' + _clean['url'])
                for _entry in _clean['indexes']:
                    _file.write('\t'+str(_entry).strip())
                _file.write('\n')
else:
    parser.print_help()

print('Ended ....')


merge.py

import os
import argparse

print('Started ....')

# --------------------------------------------------------------------------
# Global Configuration
# --------------------------------------------------------------------------
_MODE = "DEBUG"
# --------------------------------------------------------------------------

parser = argparse.ArgumentParser(prog='merge.py')
parser.add_argument('-engine', help='Search Engine [BING | GOOGLE | YAHOO]')
parse = parser.parse_args()

if parse.engine:
    path = os.getcwd()+'\\'+parse.engine+'\\data'
    type = '.data'
    destination = path + '\\_compiled.dat'
    if os.path.exists(destination):
        os.remove(destination)
    for _file in os.listdir(path):
        if type in _file:
            with open(path+'\\'+_file) as _source:
                if not os.path.exists(destination):
                    with open(destination, 'w') as _target:
                        for _line in _source:
                            _target.write(_line)
                else:
                    with open(destination, 'a') as _target:
                        for _line in _source:
                            if 'index' in _line:
                                continue
                            else:
                                _target.write(_line)
else:
    parser.print_help()

print('Ended ....')


pyBacklinks.py

import os import time import argparse from pyBing import * from pyGoogle import * from pyYahoo import * print ('Started ....') # -------------------------------------------------------------------------- # Global Configuration # -------------------------------------------------------------------------- _MODE = "DEBUG" _OPERATION = 'APPEND' # -------------------------------------------------------------------------- parser = argparse.ArgumentParser(prog='pyBacklinks.py') parser.add_argument('-engine', help='Search Engine [BING | GOOGLE | YAHOO]') parse = parser.parse_args() _out = os.getcwd()+'\\'+parse.engine.upper()+'\\data\\_complete.dat' if _OPERATION != 'APPEND': if os.path.exists(_out): os.remove(_out) _historical = os.getcwd()+'\\'+parse.engine.upper()+'\\data\\_historical.dat' if _OPERATION != 'APPEND': if os.path.exists(_historical): os.remove(_historical) with open(_historical, 'w') as _file: _file.write('key\tsink\tsource\n') if parse.engine: source = os.getcwd()+'\\'+parse.engine.upper()+'\\data\\_clean.dat' with open(source) as _source: for _line in _source: if 'index' in _line and 'url' in _line and 'description' in _line and 'div' in _line and _OPERATION != 'APPEND': _file = open( _out, "w" ) _file.write(_line) _file.close() elif _line[0:1] != "#": _data = _line.strip().split('\t') _repository = {} if parse.engine.upper() == 'BING':


_bing = pyBing() _repository = _bing.getBackLinks(_data[2]) elif parse.engine.upper() == 'GOOGLE': _google = pyGoogle() _repository = _google.getBackLinks(_data[2]) while _repository == 'ERROR': _repository = _google.getBackLinks(_data[2]) elif parse.engine.upper() == 'YAHOO': _yahoo = pyYahoo() _repository = _yahoo.getBackLinks(_data[2]) if len(_repository) > 0 and len(_repository[0]) > 0: _data[11] = len(_repository) else: _data[11] = 0 _file = open(_out, 'a') _entries = '' for _entry in _data: if len(_entries) > 0: _entries += '\t' _entries += str(_entry) _file.write(_entries+'\n') _file.close() # Log Total Links to File try: for _sink in _repository: with open(_historical, 'a') as _log: _log.write(_data[1] + '\t' + _data[2] + '\t' + _sink + '\n') except: for _sink in _repository: with open(_historical, 'a') as _log: _log.write(_data[1] + '\t' + _data[2] + '\t[Unicode Error Write]\n') time.sleep(5) else: parser.print_help() print ('Ended ....')


pyBing.py

import sys from config import * import urllib.request as request from xml.dom import minidom from htmlHelper import * from extract import * import time import os import math from nltk.corpus import wordnet class pyBing(Extract): def __init__(self): print('\n==================================================================') print( '%s Initialized' % self.__class__.__name__ ) print('==================================================================') def getBackLinks(self, url): _repository = {} _page = 1 _iteration = "&first=[ITERATION_STEP]&FORM=PORE" _MAX = 100 try: print("Page %i" % _page) _config = config() _config.bing() _source = url.replace('https', '').replace('http', '').replace(':', '').replace('//', '').replace('www.', '') if _source.find('/') != -1: _source = _source[:_source.find('/')] _url = _config.bing_settings['externallinks'].replace('[URL]', url).replace('[BASE_URL]', _source ) _base = _url _html = self.external_links(_url, 'BING') _repository = _html.backlinks


print("Page %i Contains %i Links" % (_page, len(_html.backlinks))) while _html.next != '' and len(_html.backlinks) > 0 and len(_html.backlinks[0]) > 0: # Will go maximum 100 pages deep in search ... 1000 links approximately if _MAX - _page == 0 or len(_html.backlinks) == 0: break time.sleep(3) _html = self.external_links(_base + _iteration.replace("[ITERATION_STEP]", str( len(_repository) +1 )), 'BING') _page += 1 print("Page %i Contains %i Links" % (_page, len(_html.backlinks))) _new = False for _link in _html.backlinks: if _link not in _repository: _repository.append(_link) _new = True # Stop if no new links were found if not _new: break print("Total Links = %i" % len(_repository)) except request.URLError as e: print("Error: %s" % e.reason ) except ValueError as v: print("Non urllib Error: %s" % v) return _repository # Extract Link Attributes for a Given Search def getLinks(self, query): # Obtain Configuration Settings _config = config() _config.bing() try: # Keywords _keywords = [] for _synonyms in wordnet.synsets(query): for _s in _synonyms.lemmas(): _s = _s.name().replace('_', ' ') if _s not in _keywords: _keywords.append(_s) if len(_keywords) == 0: _keywords.append(query)


# Set Repository Structure _name = os.getcwd()+'\\BING\\'+query+'_'+time.strftime("%Y%m%d%H%M%S")+".data" _file = open( _name, "w" ) _file.write('index\turl\tdescription\tdiv\th1\th2\th3\th4\th5\th6\tinbound_links\tkeywords\toutbound_links\tp\troot\tspan\ttitle\n') _file.close() # Direct Call _html = self.extract_links(_config.bing_settings['url']+query, 'BING') _indexValue = 0 _page = 1 _maxPages = 6 while _html.next != '' and _page < _maxPages: for _entry in _html.links: _indexValue += 1 if '.pdf' not in _entry: print("Searching: %s" % _entry.get('url')) _indexes = sorted(self.extract_indexes(_entry.get('url'), _keywords ).items()) if _indexes: _file = open(_name, 'a') _file.write(str(_indexValue) + '\t' + _entry.get('url') ) for _index in _indexes: _file.write('\t'+str(_index[1])) _file.write('\n') _file.close() if _html.next != '': print("Next: %s" % _html.next) print("Page: %i" % _page) _html = self.extract_links(_html.next, 'BING') print("Pausing for 3 seconds") time.sleep(3) _page += 1 except request.URLError as e: print("Error: %s" % e.reason ) except ValueError as v: print("Non urllib Error: %s" % v)


pyGoogle.py

import sys from config import * from rauth import OAuth2Service import urllib.request from xml.dom import minidom from htmlHelper import * from extract import * import time import os from nltk.corpus import wordnet class pyGoogle(Extract): """ """ _proxy = False def __init__(self): print('\n==================================================================') print( '%s Initialized' % self.__class__.__name__ ) print('==================================================================') def getBackLinks(self, url): _repository = [] try: _config = config() _config.google() _source = url.replace('https', '').replace('http', '').replace(':', '').replace('//', '').replace('www.', '') if _source.find('/') != -1: _source = _source[:_source.find('/')] _url = _config.google_settings['externallinks'].replace('[URL]', url).replace('[BASE_URL]', _source ) _html = self.external_links(_url, 'GOOGLE')


try: for _link in _html.backlinks: if _link not in _repository: _repository.append(_link) print("Total Links = %i" % len(_repository)) except: print("Error Extracting Google Data.") _repository = 'ERROR' except urllib.request.URLError as e: print("Error: %s" % e.reason ) except ValueError as v: print("Non urllib Error: %s" % v) return _repository # Extract Link Attributes for a Given Search def getLinks(self, query, use_proxy = False): # Obtain API Settings _config = config() _config.google() self._proxy = use_proxy try: # Keywords _keywords = [] for _synonyms in wordnet.synsets(query): for _s in _synonyms.lemmas(): _s = _s.name().replace('_', ' ') if _s not in _keywords: _keywords.append(_s) if len(_keywords) == 0: _keywords.append(query) # Set Repository Structure _name = os.getcwd()+'\\GOOGLE\\'+query+'_'+time.strftime("%Y%m%d%H%M%S")+".data" _file = open( _name, "w" ) _file.write('index\turl\tdescription\tdiv\th1\th2\th3\th4\th5\th6\tinbound_links\tkeywords\toutbound_links\tp\troot\tspan\ttitle\n') _file.close() # Direct Call _html = self.extract_links(_config.google_settings['url']+query, 'GOOGLE', use_proxy) _indexValue = 0


_page = 1 _maxPages = 11 while _html.next != '' and _page < _maxPages: for _entry in _html.links: _indexValue += 1 if '.pdf' not in _entry: print("Searching: %s" % _entry['url']) _indexes = sorted(self.extract_indexes(_entry['url'], _keywords).items()) if _indexes: _file = open(_name, 'a') _file.write(str(_indexValue) + '\t' + _entry['url'] ) for _index in _indexes: _file.write('\t'+str(_index[1])) _file.write('\n') _file.close() if _html.next != '': print("Next: %s" % _html.next) _html = self.extract_links(_html.next, 'GOOGLE') print("Pausing for 3 seconds") time.sleep(3) _page += 1 except urllib.request.URLError as e: print("URL Error: %s" % e.reason ) self._proxy = True self.getLinks(query, True) except ValueError as v: print("Value Error: %s" % v)


pyOptimization.py

import os import argparse from xml.dom import minidom from htmlHelper import * from extract import * from nltk.corpus import wordnet print ('Started ....') # -------------------------------------------------------------------------- # Global Configuration # -------------------------------------------------------------------------- _MODE = "DEBUG" # -------------------------------------------------------------------------- parser = argparse.ArgumentParser(prog='pyOptimization.py') parser.add_argument('-engine', help='Search Engine [BING | GOOGLE | YAHOO]') parser.add_argument('-operation', help='Optimization Type [INDEX | FILTER | QUALITY]') parse = parser.parse_args() nodes = [] if parse.engine and parse.operation: if parse.operation and parse.operation.upper() == 'INDEX': _path = os.getcwd()+'\\'+parse.engine+'\\data\\' _source = '_historical.dat' _destination = '_historical_indexed.dat' if os.path.exists(_path+_destination): os.remove(_path+_destination) with open(_path+_source, 'r') as _process: for _line in _process: if 'key' in _line and 'sink' in _line and 'source' in _line: with open(_path+_destination, 'w') as _file: _file.write('sink\tquality\n') else: _data = _line.strip().split('\t') if ( len(_data) == 3 ) and ( len(_data[2].strip()) > 0 ) and ('Unicode Error' not in _data[2].strip()): try: print("Searching: %s" % _data[2]) _keywords = [] for _synonyms in wordnet.synsets(_data[0]): for _s in _synonyms.lemmas():


_s = _s.name().replace('_', ' ') if _s not in _keywords: _keywords.append(_s) if len(_keywords) == 0: _keywords.append(_data[0]) print(_keywords) _indexes = sorted(Extract().extract_indexes(_data[2], _keywords ).items()) if _indexes: with open(_path+_destination, 'a') as _file: _file.write(_data[0]+'\t'+_data[1]+'\t'+_data[2]) for _index in _indexes: _file.write('\t'+str(_index[1])) _file.write('\n') except: print("Error Processing URL") elif parse.operation and parse.operation.upper() == 'FILTER': _path = os.getcwd()+'\\'+parse.engine+'\\data\\' _source = '_historical_indexed.dat' _destination = '_historical_filtered.dat' if os.path.exists(_path+_destination): os.remove(_path+_destination) with open(_path+_destination, 'w') as _file: _file.write('key\tsink\tdescription\tdiv\th1\th2\th3\th4\th5\th6\tinbound_links\tkeywords\toutbound_links\tp\troot\tspan\ttitle\n') # Distinct URLs _summary = [] with open(_path+_source, 'r') as _process: for _line in _process: if ('key' not in _line) and ('sink' not in _line) and ('source' not in _line): _data = _line.strip().split('\t') _found = False for _entry in _summary: if (_data[1].strip().lower() == _entry.strip().lower()): _found = True break if not _found: print("ADDING: %s" % _data[1]) _summary.append( _data[1]) for _individual in _summary:


_totals = [] _found = 0 with open(_path+_source, 'r') as _process: for _line in _process: if ('key' not in _line) and ('sink' not in _line) and ('source' not in _line): _data = _line.strip().split('\t') if _individual.strip().lower() == _data[1].strip().lower(): _index = 3 while _index < len(_data[3:]): if _found == 0: if _data[_index].isnumeric(): try: _totals.append(float(_data[_index])) except: _totals.append(float(0)) else: _totals.append(float(0)) else: if _data[_index].isnumeric(): try: _totals[_index-3] += float(_data[_index]) except: pass else: _totals[_index-3] += 0 _index+=1 _found+=1 else: if _found > 0: break _total = 0 print(_totals) for _entry in _totals: _total += _entry _total = _total / _found with open(_path+_destination, 'a') as _file: print("Totals %s = %s" % (_individual, str(_total))) _file.write(_individual+'\t'+str(_total)+'\n') elif parse.operation and parse.operation.upper() == 'QUALITY': _path = os.getcwd()+'\\'+parse.engine+'\\data\\' _filtered = '_historical_filtered.dat' _indexed = '_complete.dat' _destination = '_historical_complete.dat' if os.path.exists(_path+_destination):


os.remove(_path+_destination) with open(_path+_destination, 'w') as _file: _file.write('index\tkey\turl\tdescription\tdiv\th1\th2\th3\th4\th5\th6\tinbound_links\tkeywords\toutbound_links\tp\troot\tspan\ttitle\tquality\n') with open(_path+_filtered, 'r') as _filtered_file: for _filtered_line in _filtered_file: if ('sink' not in _filtered_line) and ('quality' not in _filtered_line): _filtered_data = _filtered_line.strip().split('\t') print("..%s" % _filtered_data[0]) with open(_path+_indexed, 'r') as _indexed_file: for _indexed_line in _indexed_file: if ('index' not in _indexed_line) and ('key' not in _indexed_line) and ('url' not in _indexed_line): _indexed_data = _indexed_line.strip().split('\t') if _filtered_data[0].strip().lower() == _indexed_data[2].strip().lower(): print(">>>>>FOUND: %s" % _filtered_data[0]) with open(_path+_destination, 'a') as _file: for _element in _indexed_data: _file.write(str(_element)+'\t') _file.write(str(_filtered_data[1])) _file.write('\n') break else: parser.print_help() print ('Ended ....')


pySearch.py

import sys
import argparse
from pyBing import *
from pyGoogle import *
from pyYahoo import *
import os
import time
import random

print("Starting Search ......")

parser = argparse.ArgumentParser(prog='pySearch.py')
parser.add_argument('-engine', help='Search Engine [BING | GOOGLE | YAHOO]')
parse = parser.parse_args()

# --------------------------------------------------------------------------
# Global Configuration
# --------------------------------------------------------------------------
_MODE = "DEBUG"
_source = os.getcwd()+'\\words_chosen.txt'
_use_proxy = False
_words = []

# Ensure Default Search Engine Directory Exists
if not os.path.exists(os.getcwd()+'\\'+parse.engine.upper()):
    os.makedirs(os.getcwd()+'\\'+parse.engine.upper())
# --------------------------------------------------------------------------

if parse.engine:
    if _MODE == "DEBUG":
        print("Engine: %s" % parse.engine)
    with open(_source, 'r') as _file:
        for _line in _file:
            _term = _line.strip()
            if _term[0:1] != "#":
                print("Searching: %s" % _term)
                if parse.engine.upper() == 'BING':
                    _bing = pyBing()
                    _bing.getLinks(_term)
                elif parse.engine.upper() == 'GOOGLE':
                    _google = pyGoogle()
                    _google.getLinks(_term, _use_proxy)
                    time.sleep(10)
                elif parse.engine.upper() == 'YAHOO':
                    _yahoo = pyYahoo()
                    _yahoo.getLinks(_term)
else:
    parser.print_help()

print("Ending Search ......")


pySummary.py

import sys import argparse import os print("Starting Summary Calculations ......") parser = argparse.ArgumentParser(prog='pySummary.py') parser.add_argument('-engine', help='Search Engine [BING | GOOGLE | YAHOO]') parser.add_argument('-file', help='Input File Name') parse = parser.parse_args() # -------------------------------------------------------------------------- # Global Configuration # -------------------------------------------------------------------------- _MODE = "DEBUG" if parse.engine and parse.file: _source = os.getcwd()+"\\"+parse.engine.upper()+"\\data\\"+parse.file if _MODE == "DEBUG": print("Engine: %s" % parse.engine) print("File: %s" % _source) _max = {'description': 0.0, 'div': 0.0, 'h1': 0.0, 'h2': 0.00, 'h3': 0.0, 'h4': 0.0, 'h5': 0.0, 'h6': 0.0, 'inbound_links': 0.0, 'keywords': 0.0, 'outbound_links': 0.0, 'p': 0.0, 'root': 0.0, 'span': 0.0, 'title': 0.0, 'quality': 0.00} _min = {'description': 1000.0, 'div': 1000.0, 'h1': 1000.0, 'h2': 1000.00, 'h3': 1000.0, 'h4': 1000.0, 'h5': 1000.0, 'h6': 1000.0, 'inbound_links': 1000.0, 'keywords': 1000.00, 'outbound_links': 1000.0, 'p': 1000.0, 'root': 1000.0, 'span': 1000.0, 'title': 1000.0, 'quality': 1000.00} _totals = {'description': 0.0, 'div': 0.0, 'h1': 0.0, 'h2': 0.00, 'h3': 0.0, 'h4': 0.0, 'h5': 0.0, 'h6': 0.0, 'inbound_links': 0.0, 'keywords': 0.0, 'outbound_links': 0.0, 'p': 0.0, 'root': 0.0, 'span': 0.0, 'title': 0.0, 'quality': 0.00} _lines = 0 _show_quality = False if os.path.exists(_source): with open(_source, 'r') as _file: for _line in _file: _data = _line.strip().split('\t') if len(_data) == 19 : _show_quality = True if _data[0].upper() != "INDEX": _lines += 1 # Totals


_totals['description'] = float(_data[3]) _totals['div'] += float(_data[4]) _totals['h1'] += float(_data[5]) _totals['h2'] += float(_data[6]) _totals['h3'] += float(_data[7]) _totals['h4'] += float(_data[8]) _totals['h5'] += float(_data[9]) _totals['h6'] += float(_data[10]) _totals['inbound_links'] += float(_data[11]) _totals['keywords'] += float(_data[12]) _totals['outbound_links'] += float(_data[13]) _totals['p'] += float(_data[14]) _totals['root'] += float(_data[15]) _totals['span'] += float(_data[16]) _totals['title'] += float(_data[17]) if len(_data) == 19: _totals['quality'] += float(_data[18]) # Maximum if float(_data[3]) > _max['description']: _max['description'] = float(_data[3]) if float(_data[4]) > _max['div']: _max['div'] = float(_data[4]) if float(_data[5]) > _max['h1']: _max['h1'] = float(_data[5]) if float(_data[6]) > _max['h2']: _max['h2'] = float(_data[6]) if float(_data[7]) > _max['h3']: _max['h3'] = float(_data[7]) if float(_data[8]) > _max['h4']: _max['h4'] = float(_data[8]) if float(_data[9]) > _max['h5']: _max['h5'] = float(_data[9]) if float(_data[10]) > _max['h6']: _max['h6'] = float(_data[10]) if float(_data[11]) > _max['inbound_links']: _max['inbound_links'] = float(_data[11]) if float(_data[12]) > _max['keywords']: _max['keywords'] = float(_data[12]) if float(_data[13]) > _max['outbound_links']: _max['outbound_links'] = float(_data[13]) if float(_data[14]) > _max['p']: _max['p'] = float(_data[14]) if float(_data[15]) > _max['root']: _max['root'] = float(_data[15]) if float(_data[16]) > _max['span']: _max['span'] = float(_data[16]) if float(_data[17]) > _max['title']: _max['title'] = float(_data[17]) if ( len(_data) == 19 ) and ( float(_data[18]) > _max['quality'] ): _max['quality'] = float(_data[18]) # Minimum if float(_data[3]) < _min['description']: _min['description'] = float(_data[3]) if float(_data[4]) < _min['div']: _min['div'] = float(_data[4]) if float(_data[5]) < _min['h1']: _min['h1'] = float(_data[5]) if float(_data[6]) < _min['h2']: _min['h2'] = float(_data[6]) if float(_data[7]) < _min['h3']: _min['h3'] = float(_data[7]) if float(_data[8]) < _min['h4']: _min['h4'] = float(_data[8]) if float(_data[9]) < _min['h5']: _min['h5'] = float(_data[9])


if float(_data[10]) < _min['h6']: _min['h6'] = float(_data[10]) if float(_data[11]) < _min['inbound_links']: _min['inbound_links'] = float(_data[11]) if float(_data[12]) < _min['keywords']: _min['keywords'] = float(_data[12]) if float(_data[13]) < _min['outbound_links']: _min['outbound_links'] = float(_data[13]) if float(_data[14]) < _min['p']: _min['p'] = float(_data[14]) if float(_data[15]) < _min['root']: _min['root'] = float(_data[15]) if float(_data[16]) < _min['span']: _min['span'] = float(_data[16]) if float(_data[17]) < _min['title']: _min['title'] = float(_data[17]) if ( len(_data) == 19 ) and ( float( _data[18] ) < _min['quality'] ): _min['quality'] = float(_data[18]) else: print("Invalid Input File Specified") if( not _show_quality ): _max.pop('quality', None) _min.pop('quality', None) _totals.pop('quality', None) print("MAXIMUM ---------------------------") print(_max) print("MINIMUM ---------------------------") print(_min) print("AVERAGE ---------------------------") for key, value in _totals.items(): _totals[key] = float(value)/_lines print(_totals) else: parser.print_help() print("Ending Summary Calculations ......")


pyYahoo.py

import sys
from config import *
from rauth import OAuth2Service
import urllib.request
from xml.dom import minidom
from htmlHelper import *
from extract import *
import time
import os
from nltk.corpus import wordnet


class pyYahoo(Extract):
    """Yahoo search-engine extractor: collects result-page index attributes and back-links for a query."""

    def __init__(self):
        print('\n==================================================================')
        print('%s Initialized' % self.__class__.__name__)
        print('==================================================================')

    def getBackLinks(self, url):
        _repository = []  # back-links collected so far (a list, so new links can be appended)
        _page = 1
        _iteration = "&b=[ITERATION_STEP]&pz=10&bct=0&xargs=0"
        _MAX = 100
        try:
            print("Page %i" % _page)
            _config = config()
            _config.yahoo()
            _source = url.replace('https', '').replace('http', '').replace(':', '').replace('//', '').replace('www.', '')
            if _source.find('/') != -1:
                _source = _source[:_source.find('/')]
            _url = _config.yahoo_settings['externallinks'].replace('[URL]', url).replace('[BASE_URL]', _source)
            _base = _url
            _html = self.external_links(_url, 'YAHOO')
            _repository = _html.backlinks
            print("Page %i Contains %i Links" % (_page, len(_html.backlinks)))
            while _html.next != '':
                # Will go maximum 100 pages deep in search ... 1000 links approximately
                if _MAX - _page == 0 and len(_html.backlinks) > 0:
                    break
                time.sleep(3)
                _html = self.external_links(_base + _iteration.replace("[ITERATION_STEP]", str(len(_repository) + 1)), 'YAHOO')
                _page += 1
                print("Page %i Contains %i Links" % (_page, len(_html.backlinks)))
                _new = False
                for _link in _html.backlinks:
                    if _link not in _repository:
                        _repository.append(_link)
                        _new = True
                # Stop if no new links were found
                if not _new:
                    break
            print("Total Links = %i" % len(_repository))
        except urllib.request.URLError as e:
            print("Error: %s" % e.reason)
        except ValueError as v:
            print("Non urllib Error: %s" % v)
        return _repository

    # Extract Link Attributes for a Given Search
    def getLinks(self, query):
        # Obtain API Settings
        _config = config()
        _config.yahoo()
        try:
            # Keywords
            _keywords = []
            for _synonyms in wordnet.synsets(query):
                for _s in _synonyms.lemmas():
                    _s = _s.name().replace('_', ' ')
                    if _s not in _keywords:
                        _keywords.append(_s)
            if len(_keywords) == 0:
                _keywords.append(query)
            # Set Repository Structure
            _name = os.getcwd() + '\\YAHOO\\' + query + '_' + time.strftime("%Y%m%d%H%M%S") + ".data"
            _file = open(_name, "w")
            _file.write('index\turl\tdescription\tdiv\th1\th2\th3\th4\th5\th6\tinbound_links\tkeywords\toutbound_links\tp\troot\tspan\ttitle\n')
            _file.close()
            # Direct Call
            _html = self.extract_links(_config.yahoo_settings['url'] + query, 'YAHOO')
            _indexValue = 0
            _page = 1
            _maxPages = 6
            while _html.next != '' and _page < _maxPages:
                for _entry in _html.links:
                    _indexValue += 1
                    # Skip PDF results; _entry is a mapping whose 'url' field holds the link
                    if '.pdf' not in _entry['url']:
                        print("Searching: %s" % _entry['url'])
                        _indexes = sorted(self.extract_indexes(_entry['url'], _keywords).items())
                        if _indexes:
                            _file = open(_name, 'a')
                            _file.write(str(_indexValue) + '\t' + _entry['url'])
                            for _index in _indexes:
                                _file.write('\t' + str(_index[1]))
                            _file.write('\n')
                            _file.close()
                if _html.next != '':
                    print("Next: %s" % _html.next)
                    _html = self.extract_links(_html.next, 'YAHOO')
                    print("Pausing for 3 seconds")
                    time.sleep(3)
                    _page += 1
        except urllib.request.URLError as e:
            print("Error: %s" % e.reason)
        except ValueError as v:
            print("Non urllib Error: %s" % v)


wordnet_install.py

# First execute the command line
# pip install -U nltk
import nltk
# nltk.download()
from nltk.corpus import brown
brown.words()
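The interactive nltk.download() call above opens the NLTK corpus browser. A non-interactive alternative, shown here only as a convenience sketch, is to download just the corpora this appendix relies on: wordnet for the synonym expansion in pyYahoo.getLinks, and brown for the installation check.

# Sketch: non-interactive download of only the corpora used in this appendix.
import nltk

nltk.download('wordnet')
nltk.download('brown')

from nltk.corpus import wordnet, brown
print(len(brown.words()))               # confirms the Brown corpus is readable
print(wordnet.synsets('adheres')[:3])   # confirms WordNet lookups work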


words_set.py

import sys
import os
import random

print("Started ....")

_source = os.getcwd() + '\\words.txt'
_words = []
_archive = os.getcwd() + '\\words_chosen.txt'

# Create Query History File
if not os.path.isfile(_archive):
    _queries = open(_archive, 'w')
    _queries.close()

# Read Dictionary
with open(_source, 'r') as _file:
    for _line in _file:
        if len(_line.strip()) > 1:
            _words.append(_line.strip())
_file.close()

# Select 100 Random Words
for iteration in range(100):
    _term = random.choice(_words)
    with open(_archive, 'a') as _file:
        _file.write(_term)
        _file.write('\n')
    _file.close()

print("Completed ....")
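Because random.choice draws each of the 100 words independently, the same term can be selected more than once. If distinct query terms were required, random.sample provides a duplicate-free draw; the variant below is a sketch and not the script used for the experiments.

# Sketch: draw 100 distinct query terms in one pass.
import os
import random

_source = os.getcwd() + '\\words.txt'
with open(_source, 'r') as _file:
    _words = [_line.strip() for _line in _file if len(_line.strip()) > 1]

_archive = os.getcwd() + '\\words_chosen.txt'
with open(_archive, 'a') as _file:
    for _term in random.sample(_words, 100):   # raises ValueError if fewer than 100 words exist
        _file.write(_term + '\n')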


APPENDIX B

Sample data for the Bing search engine algorithm, for which the search term was ‘adheres’. The sample data may be found in the solution files under the directory ‘src/BING/data/Source’ in the file adheres_20170124002946.data.

index url description div h1 h2 h3 h4 h5 h6 inbound_links keywords outbound_links p root span title
1 http://www.oxforddictionaries.com/us/ 0.0 0 0.0 0.0 0 0.0 0 0 0 0.0 14 0.0 0.0 0.0 0.0
2 http://www.thefreedictionary.com/adheres 0.8458781362007167 3.0263273893007403 6.0 48.0 0 0 0 0 0 0.6033519553072626 171 0.0 1.2413793103448276 84.79654875000406 1.3333333333333335
3 http://legal-dictionary.thefreedictionary.com/adheres 0.7868852459016394 0.6792452830188679 24.0 0 0 0 0 0 0 0.6033519553072626 266 0.0 0.782608695652174 68.5714285714285 2.0571428571428574
4 http://medical-dictionary.thefreedictionary.com/adheres 1.0778443113772456 0.6545454545454545 6.0 0 0 0 0 0 0 0.6033519553072626 173 0.0 0.75 71.99999999999994 1.3584905660377358
5 http://www.thesaurus.com/browse/adheres 0.2903225806451613 2.7714285714285714 6.0 0.0 0 0 0 0 0 2.4406779661016946 14 0.7982062780269057 1.2857142857142856 23.454978354978348 1.44
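The .data files are plain tab-separated text with the header row shown above, so they can be inspected with any tabular tool. As one example (pandas is assumed to be installed; it is not a dependency of the extraction scripts), the Bing sample could be loaded as follows.

# Load a tab-separated sample file for inspection.
import pandas as pd

_frame = pd.read_csv('src/BING/data/Source/adheres_20170124002946.data',
                     sep='\t', index_col='index')
print(_frame[['description', 'keywords', 'title']].describe())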


APPENDIX C

Sample data for the Yahoo search engine algorithm, for which the search term was ‘adheres’. The sample data may be found in the solution files under the directory ‘src/YAHOO/data/Source’ in the file adheres_20170126024148.data.

index url description div h1 h2 h3 h4 h5 h6 inbound_links keywords outbound_links p root span title
1 http://www.thefreedictionary.com/adheres 0.8458781362007167 3.0263273893007403 6.0 48.0 0 0 0 0 0 0.6033519553072626 171 0.0 1.2413793103448276 84.79654875000406 1.3333333333333335
2 http://www.thesaurus.com/browse/adheres 0.2903225806451613 2.7714285714285714 6.0 0.0 0 0 0 0 0 2.4406779661016946 14 0.7982062780269057 1.2857142857142856 23.454978354978348 1.44
3 https://en.wiktionary.org/wiki/adheres 0 0.0 5.142857142857142 0.0 0.0 0 0 0 0 0 20 0 1.2 0.0 1.8
4 http://www.macmillandictionary.com/dictionary/british/adhere-to 0.6585365853658537 0.0 0 0 0 0 0 0 0 0.5373134328358209 3 0.0 0.6923076923076924 22.800000000000008 0.5070422535211268
5 http://legal-dictionary.thefreedictionary.com/adhere 0.8044692737430167 0.6923076923076924 24.0 0 0 0 0 0 0 0.6136363636363636 270 0.0 0.7999999999999999 75.42857142857143 2.181818181818182


APPENDIX D

Sample data for the Bing search engine algorithm utilized in section four of this dissertation. The sample data may be found in the solution files under the directory ‘src/BING/data’ in the file _historical_complete_r.dat.

index key url description div h1 h2 h3 h4 h5 h6 inbound_links keywords outbound_links p root span title quality
1.0 adheres http://www.oxforddictionaries.com/us/ 0.0 0 0.0 0.0 0 0.0 0 0 498 0.0 14 0.0 0.0 0.0 0.0 82.54609929078015
10.333333333333334 adheres http://www.adherishealth.com/ 0 0.0 0.13636363636363635 0 0 0.0 0.0 0 5 0 7 0.05294453973699256 0.0 0.022641509433962263 0.0 26.8
10.0 adheres https://en.m.wiktionary.org/wiki/adheres 0 0.0 0.8571428571428571 0.0 0 0 0 0 1 0 2 0 0.1875 0.0 0.3 20.0
11.0 adheres https://en.m.wikipedia.org/wiki/Adhesion 0 0.21739130434782608 0.0 0.0 0 0 0 0 1 0 5 0.18143798379105275 0.0 0.0 0.0 7.0
12.0 adheres https://en.wikipedia.org/wiki/AdhesionSurface_energy 0 0.0 0.0 0.0 0.0 0 0 0 9 0 56 0.18143798379105275 0.0 0.0 0.0 43.22222222222222
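Relative to Appendices B and C, the _historical_complete_r.dat layout adds a key column holding the query term and a trailing quality column. Assuming the same tab-separated layout as the .data samples (an assumption, since the file extension differs), the quality scores could be summarized per query term as follows; this snippet is illustrative and not part of the dissertation tooling.

# Sketch: average quality score per query term in the historical Bing data,
# assuming the file is tab-separated like the .data samples.
import pandas as pd

_frame = pd.read_csv('src/BING/data/_historical_complete_r.dat', sep='\t')
print(_frame.groupby('key')['quality'].mean().sort_values(ascending=False).head())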


APPENDIX E

Sample data for the Yahoo search engine algorithm utilized in section four of this dissertation. The sample data may be found in the solution files under the directory ‘src/YAHOO/data’ in the file _historical_complete_r.dat.

index key url description div h1 h2 h3 h4 h5 h6 inbound_links keywords outbound_links p root span title quality
9.571428571428571 adheres http://www.adherishealth.com/ 0 0.0 0.13636363636363635 0 0 0.0 0.0 0 5 0 7 0.05294453973699256 0.0 0.022641509433962263 0.0 26.8
12.666666666666666 adheres https://en.wikipedia.org/wiki/Adhesion 0 0.0 0.0 0.0 0.0 0 0 0 5 0 56 0.18143798379105275 0.0 0.0 0.0 53.2
14.6 adheres https://phys.org/chemistry-news/ 0 0.0 0.0 0 0 0 0.0 0 1 0 28 0.07361838648826734 0.0 0.0 0.0 101.0
19.25 adheres http://atherys.com/ 0 0 0 0.0 0 0 0 0 62 0 2 0.0 0.0 0 0.0 40.69387755102041
22.0 adheres https://en.wikipedia.org/wiki/Anders_Behring_Breivik 0 0.0 0.0 0.0 0.0 0 0 0 50 0 446 0.08955223880597014 0.0 0.0 0.0 194.10810810810
