Clustering for Innovation

Clustering Economic Activities Based on Textual Webpage Content

Master’s Thesis

Communication and Information Sciences

School of Humanities

Specialization Data Science: Business & Governance

Tilburg University

Author: Sjaak van der Zwan

ANR: 915601

SNR: 2002477

June 8th 2018

Supervisor / First reader: dr. E.E. van der Vaart

Second reader: prof. E. Postma


Preface

This thesis is the result of research I performed at the Center for Big Data Statistics at Statistics Netherlands. Most of the research for this thesis was performed between April 2017 and October 2017.

During my bachelor’s in Public Administration at Leiden University I became interested in the quantitative research methods taught in the program and started to look for a more quantitative master’s program to complement my bachelor’s degree. Tilburg University offered a Data Science specialization within its Communication and Information Sciences program. The Data Science program proved to be challenging yet rewarding, adding much to my understanding of the possible uses and applications of data science in both the public and private sectors. The extracurricular text mining course offered within the program finally led me to the subject of this thesis.

I would like to thank everyone who has supported me in completing this research. In the first place, Piet Daas and Ali Hurriyetoglu, for helping me design my research questions, pointing me to relevant sources and offering possible alternative solutions when I ran aground. I would also like to thank dr. van der Vaart for her constructive feedback on organizing and clarifying this thesis. Many thanks to Marga for proofreading, checking and correcting grammar and spelling. Thanks to my father for financially supporting my decision to continue my education and, most importantly, a heartfelt thank you to Laura, Sophia-Amy and Amélie for their love and inspiration.

Sjaak van der Zwan

Leiden, June 2018

Abstract

For economic and development purposes, municipalities and other local governmental organisations in the Netherlands have shown an interest in knowing which innovative companies can be found in the area that they govern. Statistics Netherlands would like to provide this kind of information but is unable to do so using the current business categorization system (SBI), since the SBI does not have separate categories for innovative companies, nor is it able to separate innovative businesses from non-innovative businesses. Another problem with the SBI is that new innovative companies seem to end up in a remainder category. This thesis investigates whether one or more clusters containing innovative companies form when an unsupervised clustering algorithm is applied to the texts found on the main webpages of a selected number of companies. For this purpose, Statistics Netherlands provided a list of 956,540 web URLs that have been identified as belonging to Dutch companies. Since EU Regulation 1893/2006 obliges Statistics Netherlands to follow the structure of NACE, Statistics Netherlands is not able to formally change its standard industrial classification independently of other EU Member States. An alternative categorization system might, however, exist next to the current SBI in order to meet the demands of municipalities and other local governmental organisations. For data collection, the Common Crawl web archive was explored, but it was found to have insufficient coverage (less than 30%) of the provided list of URLs. Web scraping was therefore used instead to extract the main texts from the webpages. K-means mini batch was chosen as the clustering algorithm because of its low run-time complexity and its compatibility with large data sets. The elbow method was used to find the theoretically optimal number of clusters (k). Besides k = 100, found with the elbow method, k = 1500 was chosen because it equals the total number of subcategories in the SBI, and k = 500 was chosen as an arbitrary number between k = 100 and k = 1500. After applying the k-means mini batch algorithm, no clear innovative clusters surfaced. The research also showed that most clusters that did form are not sufficiently stable to be used for official statistical purposes as discussed by Daas, Puts, Buelens, & van den Hurk (2015). This is because clustering is strongly affected by initial centroid placement, as explained by MacKay (2003: 288). The elbow method led neither to the most stable nor to the most cohesive clusters and thus does not seem to be an effective method for larger data sets. With regard to the texts belonging to innovative companies (innovative documents), innovative companies have a relatively stronger tendency to become part of larger clusters. For the innovative documents used in this research, it cannot be said that the majority fall into a remainder category. While the k = 500 clustering proved to be the most stable and the most similar to the SBI, the k = 1500 clustering proved to be the most cohesive. The most cohesive clusters for k = 500 and k = 1500 fall within the level 2 SBI category (86) Human health activities. For k = 100 this is (74) Industrial design, photography, translation and other consultancy, followed by (86) Human health activities. It is recommended that Statistics Netherlands further research the possibilities of supervised machine learning in order to categorize innovative companies.

Key words: Clustering, Websites, Webpages, Common Crawl, k-means mini batch, SBI, Statistics Netherlands.

Table of Contents

Section 1: Introduction
1.1. Background and Relevance
1.2. Research Question and Sub-Questions
1.3. Thesis Outline
Section 2: Theoretical Framework
2.1. Statistics Netherlands and Big Data
2.2. Standard Industrial Classification (SBI)
2.3. Innovation
2.4. Text Mining, Text Clustering
Section 3: Experimental Setup
3.1. Data Description
3.2. Data Collection
3.2.1. Indirect Collection [Common Crawl]
3.2.2. Direct Collection [Web scraping]
3.3. Pre-processing
3.3.1. Language Detection
3.3.2. Tokenization, Stop Word Removal and Stemming
3.3.3. Document Selection
3.3.4. Feature Selection
3.4. The Algorithm
3.5. Vector Space Model Setup
3.6. Elbow Method
3.7. Analysing Clusters
3.7.1. Innovative Clusters
3.7.2. Similarity to SBI and Cohesiveness
3.7.3. Analysing Stability of Surfacing Clusters
3.8. Practical Outline
Section 4: Results
4.1. Finding Optimum k
4.2. Cluster Analysis
4.2.1. Innovative Cluster Analysis
4.2.2. Analysing Cohesiveness and Similarity to SBI
4.5. Cluster Stability
Section 5: Discussion and Conclusions
5.1. Discussing Results
5.2. Answering Research Questions
5.3. Discussing Shortcomings
5.4. Recommendations to Statistics Netherlands
5.5. Directions for further research
Cited Works
Appendix I: Common Crawl Results
Appendix II: Scraping without Grequest – Language Detection
Appendix III: Innovative Documents per Cluster k = 100
Appendix IV: Sample Websites Cluster Quality based on URL names and website spot-checks
Appendix V: Percentage of Smaller Clusters on different parameters and sizes for k
Appendix VI: Relative Overrepresentation
Appendix VII: Dominant level 4 / 5 SBI Category within Clusters for k = 100
Appendix VIII: Most Dominant level 2 SBI Category in Clusters for k = 100
Appendix IX: Percentage Innovative URLs per SBI level 4/5 Category
Appendix X: Stability of Clusters
Appendix XI: Code Excerpts
Appendix XII: Software, Libraries and Hardware
Appendix XIII: Explanation SBI Levels

Section 1: Introduction

This section gives the background and relevance of this research, presents the research question and sub-questions, and outlines the structure of this thesis.

1.1. Background and Relevance

Innovation plays a vital role in western society and its economic progress (Schumpeter, 1975: 82-85). Because of this, governments should play an active role in stimulating innovation in order for society to reap the desired welfare effects (Scott, 2011: 62). According to a study by the OECD (2016), one major way in which states promote innovation is by granting preferential tax treatment to support research and development investments. In 2016, 29 of 35 OECD countries and 22 of 28 non-OECD countries promoted innovation in this manner. This being the case, it makes sense that governments that have better knowledge of the innovative companies in their country are better equipped to support them in order to strengthen their economy. Categorizing innovative companies, locating them and procuring relevant and up-to-date information about them is thus vital for any well-functioning 21st-century society. It is therefore in the public interest that innovative business activities are categorized in such a way that governments can recognize innovative companies and give them the appropriate support.

The Dutch government and the Dutch municipalities usually rely on the statistics provided by the “Centraal Bureau voor de Statistiek” (hereafter referred to as Statistics Netherlands) for purposes such as adjusting current policies or designing new policy. Statistics Netherlands is known for its high-quality statistics. To provide the necessary information concerning the Dutch economy and its companies, Statistics Netherlands uses the Dutch Standard Industrial Classification (SBI). In doing so, Statistics Netherlands hierarchically categorizes companies and economic activities into 22 categories and 99 subcategories (Centraal Bureau voor de Statistiek, 2017a). Dutch municipalities have shown an increased interest in being informed about the number and variety of innovative companies in their region in order to be able to support their development. The current SBI, however, does not make it possible to provide this information effectively, because innovative companies are not bound to specific categories or subcategories.

Mr Piet Daas, senior methodologist and data scientist at Statistics Netherlands, heads a major research project to improve business-related statistics. Within this project, two problems have been identified with regard to identifying innovative companies. The first is the one mentioned in the previous paragraph: innovative companies are not bound to specific SBI categories. The second is that innovative businesses from new business branches tend to end up in the remainder category, since they cannot be categorized with conventional companies and do not have a category of their own in the SBI. Since the current SBI does not provide a way in which Statistics Netherlands could present the requested information concerning innovative businesses, it would add value for Statistics Netherlands to discover what kind of categorization surfaces when an unsupervised (clustering) algorithm is used on the textual content of company websites.¹ Using a clustering algorithm on the main text of company websites would ideally result in an identifiable cluster for innovative companies, alongside conventional categories. Such a cluster could in turn be used to identify innovative companies in, for example, the different Dutch municipalities. The following lists are used in the research of this thesis:

• A list of 956,540 web URLs belonging to Dutch companies.
• Eight lists of 100 URLs each, containing the top 100 innovative companies for each of the years 2009 up to and including 2016, as compiled by the Dutch Chamber of Commerce (KvK).

1.2. Research Question and Sub-Questions

Based on this suggestion the following research question (RQ) is the central focus of this thesis:

RQ: “To what extent will clusters of innovative companies surface when unsupervised machine learning is used on the textual content of the webpages of Dutch companies?”

Besides the aforementioned societal relevance of this research, there is an academic relevance, which lies in the question whether abstract characteristics such as ‘being innovative’ are reflected in the text on the main page of businesses in such a way that this leads to separate clusters consisting solely of documents of companies with such characteristics. This research furthermore adds to the existing body of knowledge concerning the clustering of webpages. In addition, it adds to the scientific knowledge about clustering a large body of documents, and thus to the scientific knowledge about processing Big Data.

Texts of webpages can be extracted through a method called web scraping or can be requested from so-called web archives. Statistics Netherlands chose the web archive Common Crawl to be explored further, to determine whether it could be used to answer the research question. Knowing whether or not Common Crawl can be used to answer the main research question is important, since extracting data from a web archive is less time-consuming and thus more efficient than web scraping. Exploring Common Crawl might also help to determine whether Common Crawl can be seen as a valuable source for future research regarding Dutch websites. It is essential, of course, to establish whether the Common Crawl archive sufficiently covers the URL list used in this research, because only then would it provide enough data for this research. Therefore, the first sub-question (SQ1) reads:

¹ The statements of Daas concerning categorizing innovative companies within the SBI and the interest in clustering company websites have been derived from a meeting prior to this research.

SQ1: “Does the Common Crawl archive have sufficient coverage of the Dutch company websites specified by Statistics Netherlands for this research?”
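As a rough illustration of what answering SQ1 involves, the sketch below computes the share of a target URL list that also occurs in an archive index. It is a minimal sketch and not the procedure used in this thesis: the helper names `host_of` and `coverage`, the choice to match on host names only, and the `.example` domains are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Reduce a URL to a lowercase host name without a leading 'www.'."""
    if "//" not in url:
        url = "//" + url  # lets urlsplit treat a bare domain as a host
    host = urlsplit(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def coverage(target_urls, archived_urls):
    """Fraction of target hosts that also appear among the archived hosts."""
    targets = {host_of(u) for u in target_urls}
    archived = {host_of(u) for u in archived_urls}
    if not targets:
        return 0.0
    return len(targets & archived) / len(targets)


# Hypothetical data: four company URLs, two of which occur in the archive.
targets = [
    "http://www.bakkerij.example",
    "https://fietsen.example/",
    "slagerij.example",
    "http://bloemen.example/contact",
]
archived = [
    "https://bakkerij.example/index.html",
    "http://www.slagerij.example",
]
print(f"coverage: {coverage(targets, archived):.0%}")
```

Matching on host names is only one possible design choice; matching on full URLs would be stricter and would generally yield a lower coverage figure.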

Finally, a second sub-question (SQ2) that is relevant to this research with regard to the SBI used by

Statistics Netherlands is:

SQ2: “Do the clusters that surface show similarities to the SBI and to what extent are these clusters cohesive and stable?”

This second sub-question (SQ2) is important since the answer will show whether, under the chosen approach, clusters surface that are similar to the SBI and whether the clustering could easily be implemented within the structure already used by Statistics Netherlands.

1.3. Thesis Outline

This thesis is structured as follows. Section 2 describes the relevant literature concerning the research, the k-means algorithm and the vector space model. Section 3 outlines the experimental setup, with a description of the data, the methods for data collection, and the methods for pre-processing and processing the data. Section 4 shows the results of the clustering and the cluster stability. In Section 5 the results are discussed, the research questions are answered, the shortcomings of this research are discussed, recommendations to Statistics Netherlands are given and directions for further research are suggested.

Section 2: Theoretical Framework

This section describes how Statistics Netherlands uses Big Data for official statistics and the pitfalls of doing so, the Standard Industrial Classification (SBI), and the definition of innovation used by Statistics Netherlands and the Dutch government. It also gives an overview of the relevant literature concerning text mining and clustering.

2.1. Statistics Netherlands and Big Data

With the launch of the Center for Big Data Statistics (CBDS) on September 27th 2016, Statistics Netherlands created a platform where national and international governments, businesses, academia and educational organisations can collaborate in the field of Big Data technology and methods for the creation of official statistics (Centraal Bureau voor de Statistiek, 2016). On June 23rd 2017 Statistics Netherlands presented the Urban Data Centre / The Hague, a partnership between the municipality of The Hague and Statistics Netherlands that aims to deepen, broaden and improve their knowledge of The Hague’s local data in order to improve policy making and decision making (Centraal Bureau voor de Statistiek, 2017a). According to Struijs, Braaksma, & Daas (2014), the increasing volume, velocity and variety of data (Big Data) present both opportunities and challenges for National Statistical Institutes (NSIs) such as Statistics Netherlands. While a major opportunity lies in the increase in sources from which data can be collected, e.g. smartphones, Twitter posts, other social media and click traces, the major challenge is guaranteeing the quality of statistics derived from Big Data. These new data sources are not purposely designed for data analysis, so they often lack a well-defined target population as well as the structure and quality found in the traditional datasets used for official statistics. With data so widely available, other, commercial parties have entered the information market. These parties challenge the traditional role of the NSIs and force these institutes to prove their thus far unique capabilities and added value in providing high-quality statistics (Struijs, Braaksma, & Daas, 2014). According to Daas, Puts, Buelens, & van den Hurk (2015), it is no easy task to extract relevant and reliable data from Big Data sources in order to produce official high-quality statistics. The three main issues when working with Big Data are missing data, volatility and selectivity of data. Daas, Puts, Buelens, & van den Hurk (2015) are positive about the usefulness of Big Data for official statistics in the future, but stress the importance of knowledge from the fields of data mining and high-performance computing and of skills from the newly emerging discipline of data science.

2.2. Standard Industrial Classification (SBI)

Statistics Netherlands uses the Dutch Standard Industrial Classification (SBI 2008), a hierarchical classification of economic activities, to classify industrial units according to their core activities (Centraal Bureau voor de Statistiek, 2017b). The SBI hierarchy has five levels and is based on the NACE (Statistical Classification of Economic Activities in the European Community) and on the United Nations ISIC (International Standard Industrial Classification of All Economic Activities). In the SBI, the first four digits correspond to the first four digits of the NACE and the first two digits correspond to the ISIC. Categories with a fifth digit are specifically Dutch categories (Centraal Bureau voor de Statistiek, 2017c).²
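The digit-based hierarchy can be made concrete with a small sketch. It assumes, for illustration only, that level n of the SBI (for n from 2 to 5) is identified by the first n digits of a code and that level 1 is not encoded in the digits. The function names and the example codes are hypothetical and are not taken from Statistics Netherlands.

```python
def sbi_levels(code):
    """Split a numeric SBI code into its hierarchical prefixes.

    Assumption: level n (2 <= n <= 5) is identified by the first n digits.
    """
    digits = code.replace(".", "")
    if not digits.isdigit() or not 2 <= len(digits) <= 5:
        raise ValueError(f"expected 2 to 5 digits, got {code!r}")
    return {level: digits[:level] for level in range(2, len(digits) + 1)}


def deepest_shared_level(code_a, code_b):
    """Return the deepest level at which two codes share a prefix, or None."""
    levels_a, levels_b = sbi_levels(code_a), sbi_levels(code_b)
    shared = [lvl for lvl in levels_a if levels_b.get(lvl) == levels_a[lvl]]
    return max(shared) if shared else None


# Illustrative codes; only the two-digit prefix 86 is named in this thesis.
print(sbi_levels("86.10.1"))
print(deepest_shared_level("86101", "8621"))
```

Under these assumptions, two companies can be compared at any level of the hierarchy by comparing the corresponding prefixes of their codes.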

Under EU Regulation 1893/2006, Statistics Netherlands is obliged to follow the structure of the NACE. A ‘regulation’ of the EU is a binding legislative act that has direct effect in all EU Member States, as opposed to EU ‘directives’, which need to be transposed into and implemented in national legislation. It is therefore impossible for Statistics Netherlands to alter the setup of the SBI on its own. In its preamble, the regulation establishing the statistical classification of economic activities (NACE) gives the following motivation for establishing a standard for statistical classification in the Member States:

² See Appendix XIII for an explanation of the different SBI levels.

“(4) In order to function, the internal market requires statistical standards applicable to the collection, transmission and publication of national and Community statistics so that businesses, financial institutions, governments and all other operators in the internal market can have access to reliable and comparable statistical data. To this end, it is vital that the various categories for classifying activities in the Community be interpreted uniformly in all the Member States. (5) Reliable and comparable statistics are necessary to enable businesses to assess their competitiveness and are useful to the Community institutions in preventing distortions of competition.”

While the above-mentioned regulation prevents Statistics Netherlands from formally changing its standard industrial classification independently of the other Member States bound by the regulation, the research in this thesis remains useful for Statistics Netherlands. It can increase the understanding of the natural clusters that surface when clustering algorithms are applied to the main page text of the websites of innovative companies, and it can help to create an alternative categorization that Statistics Netherlands can use next to the SBI for specific requests from municipalities and other organisations. This knowledge can also be used to improve the NACE in the future.

2.3. Innovation

The Cambridge dictionary describes ‘to innovate’ as “to introduce changes and new ideas”. Similarly, but more specifically, the Oxford dictionary describes ‘to innovate’ as to “Make changes in something established, especially by introducing new methods, ideas, or products.”

Statistics Netherlands differentiates between two kinds of innovation: technical innovation and non-technical innovation. Technical innovation is when companies renew and improve their products and processes, while non-technical innovation is when improvements are made within an organization or in the manner in which a product or service is marketed (Statistics Netherlands, 2016: 174-176). According to Statistics Netherlands (2016: 177), technical innovation can be seen as the classical definition of innovation, while adding non-technical innovation leads to a broader definition. Statistics Netherlands measures the innovativeness of companies according to the rules of the European Community Innovation Survey (CIS), which is part of the EU science and technology statistics.

In its official publications the Dutch government follows the definition of the Oslo Manual, which defines innovation as:

“[...] the implementation/commercialization of a product with improved performance characteristics such as to deliver objectively new or improved services to the consumer. A technological process innovation is the implementation/adoption of new or significantly improved production or delivery methods. It may involve changes in equipment, human resources, working methods or a combination of these.” (Centraal Planbureau, 2016)

Closely related to, and often used in combination with, innovation is the phrase ‘research and development’. The Dutch government and Statistics Netherlands use the following definition from the Frascati Manual to describe research and development (R&D):

“Research and experimental development (R&D) comprise creative and systematic work undertaken in order to increase the stock of knowledge – including knowledge of humankind, culture and society – and to devise new applications of available knowledge.”

Bessant and Tidd (2011: 6) write that most economists agree that innovation is one of the main drivers of economic growth, and continue by quoting William Baumol as saying that “virtually all of the economic growth that has occurred since the eighteenth century is ultimately attributed to innovation”.

In various reports, studies and indexes the Netherlands is among the leading countries when it comes to innovation. The World Economic Forum, in its Global Competitiveness Report 2016-2017, ranked the Netherlands 4th on the Global Competitiveness Index (World Economic Forum, 2017). In order to stay competitive, strengthen the labour market and solve other problems in society, the Dutch government seeks to stimulate various forms of innovation (OECD, 2014). In 2016 the European Commission marked the Netherlands, along with Denmark, Finland, Germany, Sweden and the United Kingdom, as an ‘Innovation Leader’. ‘Innovation Leaders’ is the top group of the four performance groups (Modest Innovators, Moderate Innovators, Strong Innovators, Innovation Leaders) into which the Member States of the EU are classified. While Switzerland outperforms all other European countries with regard to innovation, it is not included as an Innovation Leader since it is not a Member State of the EU. Innovation Leaders are all Member States whose innovation performance is more than 20% above the average performance of the EU Member States (Hollanders & Es-Sadki, 2017: 79). Between 2010 and 2016 the Netherlands was also one of the countries with the highest increase in innovation performance, with an increase of almost 10%, while the innovation performance of Germany, Denmark and Finland decreased by almost 5% over the same period (Hollanders & Es-Sadki, 2017: 16).

As mentioned in section 1.1, eight lists are used, each containing the top 100 most innovative companies in the Netherlands as compiled by the Dutch Chamber of Commerce (KvK), one list for each of the years 2009 up to and including 2016. These lists were created by innovation experts, who judged and ranked each nominated company with regard to its respective branch, society as a whole, originality and realized potential for growth. These lists form the baseline for innovation in this research. While other companies could also be identified as innovative, for consistency the innovative companies in this research are limited to the companies in these lists.

2.4. Text Mining, Text Clustering

Text mining is part of the field of data mining in which data scientists and other users try to discover

interesting and useful patterns from big quantities of text documents. Text mining is also known as

Intelligent Text Analysis, Knowledge Discovery in Texts (KDT) and Text Data Mining (Sheshasayee

& Thailambal, 2016). With the rise of the internet, including the explosion of social media applications,

text mining has become an increasingly interesting subject for research for both businesses and aca-

demia (Miner, Delen, Elder, Hill, & Nisbet, 2012). Witten, Frank, & Hall (2011: xxi) define data mining

as: “(…) the extraction of implicit, previously unknown, and potentially useful information from data.”

They go on to say that, unlike numeric data, textual data holds no hidden implicit information, since authors explicitly state the information they want to convey. In data science textual data is known as unstructured data, which computers cannot consume and use as easily as numeric data. It therefore has to be transformed into a form that computers can work with. According to Witten, Frank, & Hall (2011: xxi), Machine Learning provides a technical basis for extracting information from (raw) data, which in turn can be used for other purposes. Two main types of Machine Learning can be distinguished: supervised and unsupervised learning. Supervised learning schemes use labelled datasets to train models, which are then used to predict how unseen instances should be labelled (Raschka, 2015: 2-8). In unsupervised learning problems the instances are unlabelled. The most common form of unsupervised learning is clustering, or cluster analysis (Raschka, 2015: 312; Manning, Raghaven, & Schütze, 2009: 348-350).

Text clustering, according to Aboualigah, Khader, Al-Betar, & Alomari (2017: 24), is one of the most efficient techniques used in the text mining field. Its goal is to divide instances into natural groups (clusters) in such a way that the instances most similar to each other end up in the same cluster, while less similar instances end up in different clusters (Yi, Zhang, Zhao, & Wan, 2017: 1). Within the field of text mining or information retrieval, the separate texts that are processed are referred to as documents. Depending on the project a document can be a chapter from a book, a poem, the lyrics of a song, a tweet, a news article or, as in this research, the textual content extracted from the main webpage of a company website. At the core of the techniques used for effective document selection and retrieval lies the so-called Cluster Hypothesis formulated by van Rijsbergen (1989): “Closely associated documents tend to be relevant to the same requests.” Manning, Raghaven, & Schütze (2009) define this hypothesis as: “Documents in the same cluster behave similarly with respect to relevance to information needs”. Manning, Raghaven, & Schütze (2009: 14) also note that in essence this cluster hypothesis is identical to the contiguity hypothesis, which is the basic hypothesis used in vector space models and is defined as: “Documents in the same class form a contiguous region and regions of different classes do not overlap”. The cluster hypothesis has proven

to be effective and successful in search result clustering (Zeng, He, Chen, Ma, & Ma, 2004) and in creating alternative user interfaces and improved information presentation for web browsing (Cutting, Karger, Pedersen, & Yukey, 1992; McKeown, Barzilay, & Evans, 2002). Another example of document clustering is the work of Kohonen et al. (2000), who clustered 6.8 million patents based on similarities in the texts of their abstracts, using a statistical representation of their vocabularies as feature vectors. Clustering algorithms are also used to improve web searches and to find similar search results (Zang, Pang, Xie, & Wu, 2006). Wang & Koopman (2017) studied the clustering of scientific articles based on semantic similarity using k-means, k-means mini batch and the Louvain community detection algorithm. They found that k-means was more widely applicable, e.g. when citation data was missing. Furthermore, k-means proved to be highly scalable and produced results which are in high agreement with other solutions. No specific studies were found in which clustering algorithms were applied to webpages in order to find innovative companies. Studies on finding a specific theme or topic have been conducted, however. In several of these studies k-means is also proposed as a more efficient algorithm (Inderjit & Dharmendra, 2000; Steinbach, Karypis, & Kumar, 2000; Ramage, Heymann, Manning, & Garcia-Molina, 2008). While many other clustering algorithms exist and are being proposed, not all of these algorithms are fit for large-scale text clustering. Kulis & Jordan (2012) state in their study on new Bayesian algorithms that:

“(…) despite the success and flexibility of the Bayesian framework, simpler methods such as k-means remain the preferred choice in many large-scale applications. (…) whereas Bayesian models require sampling algorithms or variational inference techniques which can be difficult to implement and are often not scalable, k-means is straightforward to implement and works well for a variety of applications.”

K-means is a centroid-based algorithm, meaning that for each cluster a centre (the mean) is chosen (see figure 1). K-means is a partitioning or flat clustering algorithm, meaning that clusters are formed independently of each other. The first step in running k-means is choosing a value for k. The value of k equals the number of centroids and thus the number of clusters the output will have. The fact that k needs to be determined by the user is viewed as one of the big downsides of the k-means algorithm. Several techniques, e.g. the elbow method and the silhouette method, have been proposed to determine the optimal k for the data (Gupta & Srivastava, 2014: 7). After k has been determined, k-means alternates between two steps while iterating over all data points or instances. The first step, the assignment step, assigns each instance to its nearest centroid. Once the instances are assigned, the second step, the update step, recalculates the mean of each cluster. Figure 1 illustrates this for k = 4 and N (number of instances) = 11.

Figure 1: k-means explained
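The assignment and update steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in this research: the function names, the random choice of initial centroids and the stopping rule (stop when no instance changes cluster) are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=100, seed=0):
    """Plain k-means: alternate assignment and update steps until stable."""
    rng = random.Random(seed)
    # Assumption: the initial centroids are k randomly chosen instances.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each instance goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no instance changed cluster: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids
```

Called with 11 two-dimensional points and k = 4, the function returns one cluster label per instance and four centroids. Because the initial centroids are drawn at random, a different seed may yield different clusters.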

Another downside of the k-means algorithm, mentioned by MacKay (2003: 288), is that a change in the placement of the initial centroids might lead to different clusters. According to Bradley, Bennet, & Demiriz (2000), k-means setups with n ≥ 10 dimensions and k ≥ 20 clusters will result in a percentage of clusters which are empty or contain only one or very few data points. In order to meet the greater requirements of web-based applications, Google’s D. Sculley proposed k-means mini-batch. Sculley explains that the classic k-means algorithm requires O(kns) computations, in which n is the number of examples and s the maximum number of non-zero elements in each feature vector, and thus the number of computations increases linearly with the number of documents. The k-means algorithm explained above is traditionally run in an ‘offline’ fashion, meaning that the complete dataset is available and is processed as a whole. Especially with big data sets this becomes expensive in computation and memory, or even impossible. Online machine learning was developed to solve this problem. In online machine learning the data is streamed and the full data set does not have to be loaded into memory (Cho & An, 2014: 362). K-means mini-batch takes random samples of a predetermined size (mini batches) and applies the assignment and update steps of the traditional k-means algorithm to each of them. The use of k-means mini-batch thus enables its users to run k-means on bigger datasets without running into memory errors. Yadav & Baria (2014) have shown that mini-batch k-means, applied to the reuters21578 data set (one of the most widely used test sets for text categorization research), is less time consuming and over 10% more accurate than the classic k-means algorithm.
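The difference with classic k-means can be made concrete with a small sketch of the mini-batch variant. It follows the general scheme of assigning a random mini batch and then nudging each centroid towards its new members with a per-centroid learning rate. The function names, the parameter names and the sampling of initial centroids are assumptions made for this illustration; this is not the code used in this research.

```python
import random


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((x - y) ** 2 for x, y in zip(point, centroids[c])),
    )


def mini_batch_k_means(points, k, batch_size, iterations, seed=0):
    """Mini-batch k-means: only batch_size instances are touched per iteration."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k  # how many instances each centroid has absorbed so far
    for _ in range(iterations):
        batch = rng.sample(points, batch_size)
        # Assignment step, on the mini batch only.
        assigned = [nearest(p, centroids) for p in batch]
        # Update step: move each centroid towards its new members with a
        # learning rate that shrinks as the centroid absorbs more instances.
        for p, c in zip(batch, assigned):
            counts[c] += 1
            rate = 1.0 / counts[c]
            centroids[c] = [(1 - rate) * m + rate * x
                            for m, x in zip(centroids[c], p)]
    return centroids
```

In this sketch `points` is an in-memory list for simplicity. The memory advantage described above comes from the fact that each iteration only needs the current mini batch, so in a real setting the batches could be read from disk or from a stream.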

Besides k-means and k-means mini-batch, other algorithms are used for clustering. Examples are affinity propagation (Frey & Dueck, 2007) and DBSCAN (Ester, Kriegel, Sander, & Xu, 1996). For both algorithms the computation and memory costs are relatively high compared to the k-means variants, which makes them less suitable for large datasets. Another form of clustering, next to these flat clustering algorithms, is so-called hierarchical clustering. Hierarchical clustering does not need a priori knowledge about the number of clusters and is known in many cases to lead to superior results compared to flat clustering (Steinbach, Karypis, & Kumar, 2000: 1). The downside is its time complexity: whereas standard k-means algorithms have a time complexity that is linear in the number of documents, O(kns), most hierarchical clustering algorithms have a time complexity that is at least quadratic in the number of documents, O(kn²s).3 This also makes hierarchical clustering unsuitable for very large data sets.

Clustering text documents poses one major problem: compared to other forms of data, texts have relatively high numbers of features, which can be either informative or uninformative, because text data is both unstructured and noisy (Aboualigah, Khader, Al-Betar, & Alomari, 2017: 24; Sarkar, 2016: 265). The performance of (clustering) algorithms tends to decline as the number of dimensions increases. In the literature the problems that occur in relation to dimensions are referred to as ‘The Curse of Dimensionality’: each added feature adds a dimension to the virtual space in which the instances (documents) are placed, and thus these instances become sparser or more scattered with respect to each other (Bellman, 1957). This dramatically increases memory usage and computation time. The vast number of (uninformative) features also decreases the accuracy of the clustering of text documents. According to Bharti & Singh (2013), effective dimension reduction methods meet five conditions.

1. The method should identify the relevant features and remove the irrelevant features;

2. The method should remove the redundant features;

3. The method should remove features that contain no information (noisy features);

4. The method should preserve useful information in the original feature space;

5. The dimension reduction should not compromise the performance of the algorithm.

The simplest feature selection method is Document Frequency-based Selection. Document Frequency-based Selection removes terms which appear with a higher frequency in the corpus than other terms and which, as such, hold less meaning for the text. These terms are often referred to as stop words. These high-frequency terms can be removed using a stop-word list, e.g. by removing the 10 percent most frequent terms, by removing the top 100 most frequent terms, or by a combination of these. The lowest-frequency terms can be removed as well, since they do not add to the similarity or distance computations used by clustering algorithms. Sometimes these low-frequency terms are the result of misspellings or typographical errors. An alternative strategy, proposed by Wilbur & Sirotkin (1992), is to select features by using Term Strength. In this method the strength of a term is compared to the expected strength of a random term with the same frequency in the corpus.

3 The definition of time complexity given by Sridhar (2014: 19) is the following: “Time complexity refers to the measurement of run time of an algorithm in terms of its input size (…)”.

In order to cluster a collection of unstructured text documents, the documents are transformed into vectors in a feature space model, often referred to as a Bag of Words model (Almeida, Vasconcelos, & Maia, 2009: 47; Feldman & Sanger, 2007: 102). In the Bag of Words approach each text document is first divided into separate words or tokens (tokenization). Further pre-processing might consist of case folding (making all letters uppercase or lowercase) and of removing various forms of punctuation (Manning, Raghaven, & Schütze, 2009: 30). All terms (unique tokens) found in the different documents together form the vocabulary of the corpus. The number of terms in this vocabulary determines the number of features for each document. This can be visualized in a term frequency matrix (see figure 2). Each document becomes a vector in the vector space model and each term becomes one dimension of the vector. In order to determine similarities and differences between documents it is important to apply some sort of weighting scheme to the features of each document. The most widely used weighting scheme is called TF-IDF, which stands for term frequency – inverse document frequency, and is determined with the following formula:

$$w_{t,d} = \left(1 + \log tf_{t,d}\right) \cdot \log_{10} \frac{N}{df_{t}}$$

In the formula above w = weight, t = term, d = document, tf = term frequency, df = document frequency and N = the total number of documents in the corpus. This means that the weight of a term increases with each appearance of the term in a document, but decreases with the number of documents in the corpus in which the term appears. The underlying reasoning is that a term which appears in the majority of the documents is unlikely to be an effective determinant for a specific document, whereas a term which is very rare in the whole corpus but does appear in a specific document must be a strong determinant for that document.

Source: modelled after http://brandonrose.org/clustering

Figure 2: Term Frequency Matrix
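As an illustration of the TF-IDF weighting described in this section, the sketch below computes the weight of every term in a handful of tokenized documents. It assumes that both logarithms in the formula are base 10 and that a term which is absent from a document simply gets no weight; the function name and the toy documents are made up for the example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in documents for term in set(doc))
    weighted = []
    for doc in documents:
        tf = Counter(doc)  # term frequency within this document
        weighted.append({
            term: (1 + math.log10(count)) * math.log10(n_docs / df[term])
            for term, count in tf.items()
        })
    return weighted


docs = [
    ["bakker", "brood", "brood"],
    ["bakker", "fiets"],
    ["bakker", "software"],
]
weights = tf_idf(docs)
```

In this toy corpus "bakker" occurs in every document and therefore receives a weight of zero, while "brood", which occurs twice in a single document, receives the highest weight.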

Section 3: Experimental Setup

This section describes the setup and the steps that were used in the experiment and that led to the clustering results. It also reports the results of sub-steps which were needed in order to proceed or which form the basis for elements of the experimental setup, such as direct and indirect data collection and language detection. For a complete overview of the experimental setup see the practical outline in Section 3.8.

3.1. Data Description

The main type of data needed for this research is the text found on the main pages of the websites of Dutch companies, which are listed in a list of 956,540 URLs. Inspection of the URLs reveals that 80.11% of the URLs are found in the .nl domain, 14.23% in the .com domain, 2.7% in the .eu domain, 0.92% in the .net domain and a total of 2.05% in various other domains such as .org, .biz, .nu, .li and .be (see chart 1).

Chart 1: Domain Distribution of URLs
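A domain distribution such as the one in chart 1 can be derived from a list of URLs along the following lines. This is only a sketch: the function name is hypothetical, and it assumes that the last label of the host name is the domain of interest and that URLs may appear with or without a scheme.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_distribution(urls):
    """Percentage of URLs per top-level domain, e.g. {'nl': 75.0, 'com': 25.0}."""
    counts = Counter()
    for url in urls:
        # urlparse only recognises the host when a "//" separator is present.
        host = urlparse(url if "//" in url else "//" + url).hostname or ""
        counts[host.rsplit(".", 1)[-1]] += 1
    total = sum(counts.values())
    return {tld: 100 * n / total for tld, n in counts.items()}
```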

In order to determine whether a website can be classified as innovative, eight top 100 lists of the most innovative companies in the Netherlands are used, which were published by the Dutch Chamber of Commerce (KvK) for the years 2009 up to and including 2016. The selection of companies and their ranking in the top 100 was done by innovation experts, who judged each nominated company on its originality, its realized potential for growth and the impact it has had on its branch or on society as a whole. Comparing the innovative URLs with the 956,540 URLs provided by Statistics Netherlands makes the following clear (see table 1). Of the 703 unique URLs found in the top 100 lists of 2009 to 2016, 460 URLs (65.43%) also appear in the main URL list. This shows that the main list does not cover all innovative companies acknowledged by the Dutch Chamber of Commerce in its published top 100 lists. Accordingly, 0.05% of all the companies in the main URL list have been in the Chamber of Commerce top 100 of innovative companies between 2009 and 2016. Logically, not all innovative companies have made it into the top 100 lists, and thus the main URL list might contain more innovative companies than these 460 URLs.

Table 1: KvK Innovative URLs found in main list

In addition to the above-mentioned data, a spreadsheet is used that contains company URLs and their corresponding SBI numbers as classified by Statistics Netherlands, which makes it possible to link a URL to its specific SBI category. A second spreadsheet contains the names of each category and sub-category of the SBI and their corresponding SBI numbers. It is used throughout this research to visualize results and generate tables and charts (see e.g. chart 2).

Thus, there are five sets of data.

1. List of 956,540 URLs of Dutch companies

2. Texts collected from or belonging to the company main webpage.

3. Lists of the KvK top 100 innovative companies

4. A spreadsheet containing the URLs linked to the SBI numbers

5. A spreadsheet containing the SBI numbers linked to the names of categories and sub categories.

In order to evaluate innovation and the surfaced clusters, different elements will be evaluated. With the use of a dataset containing both URLs and SBI codes, the URLs are located in the SBI. This is done beforehand for the innovative URLs, to see their original distribution, and afterwards for the surfaced clusters, to analyse whether there is a resemblance between the surfaced clusters and the SBI. When studying the innovative URLs used for this research it becomes obvious that the innovative companies are found in a wide range of categories: in total 40 level 2 categories and 106 level 4/5 categories (see chart 2 and appendix IX).4 Chart 2 below and the list in appendix IX make clear that innovative companies from the top 100 lists are found in multiple categories and that most of these do not appear in a remainder category.5 While most of the KvK innovative companies fall into the categories wholesale trade, architects, engineers, technical design and consultancy, testing and analysis and financial institutions, innovative companies are found in many other kinds of categories as well. This underlines the fact that the SBI is rather ineffective in providing information about innovative companies.

Chart 2: Percentage Innovative URLs per SBI category

3.2. Data Collection

The list of 956,540 URLs, the URLs of the top 100 lists and the spreadsheet with URLs linked to the SBI codes are provided by Statistics Netherlands. The spreadsheet with the SBI codes linked to the category names can be found on the Statistics Netherlands website. The texts belonging to the URLs in the list need to be collected or extracted. For this there are two distinct methods, indirect and direct collection. Both methods are described below.

5 Remainder categories have n.e.c.* added, meaning “not elsewhere classified”.

3.2.1. Indirect Collection [Common Crawl]

The first method described here is the indirect method, which means collecting the data from a web archive, in this case the Common Crawl September 2016 archive. Collecting data from a web archive can be considered an indirect form of data collection, since the data has already been collected and stored by the web archive. Both the raw HTML data and plaintext extracts are available through Commoncrawl.org. Common Crawl provides a monthly dataset which consists of billions of webpages stored in so-called WARC files. WARC stands for Web ARChive; a WARC file stores raw crawl data and metadata. Besides WARC files Commoncrawl.org provides WAT files and WET files, which contain specific data from the WARC files. A WAT file contains only the computed metadata, while a WET file contains the plaintext, and thus the actual textual content of the website. Since applying unsupervised learning to the actual webpage content is central to this thesis, the plain text extracted from the WET files will be used in the model (Commoncrawl.org, 2017).

In order to check whether the Common Crawl archive sufficiently covers the URLs used in this research, 10 random samples of 1,000 URLs each were taken from the main list of 956,540 URLs. The result is that on average only 27.98% of the URLs is covered by the Common Crawl archive of September 2016 and 29.35% by the archive of March 2017 (see appendix I for results per sample and appendix XI for code excerpts).
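The sampling procedure behind these coverage figures can be sketched as follows. The sketch assumes that the set of URLs present in the archive has already been obtained separately; how the archive is queried is left out, and the function and parameter names are hypothetical.

```python
import random


def average_coverage(all_urls, archived_urls, n_samples=10, sample_size=1000, seed=0):
    """Mean percentage of sampled URLs that are present in the archive."""
    rng = random.Random(seed)
    archived = set(archived_urls)  # set membership keeps each lookup cheap
    percentages = []
    for _ in range(n_samples):
        sample = rng.sample(all_urls, sample_size)
        hits = sum(1 for url in sample if url in archived)
        percentages.append(100 * hits / sample_size)
    return sum(percentages) / n_samples, percentages
```

The function returns both the average over all samples and the percentage per sample, so that the spread between samples can be inspected as well.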

While the coverage of Common Crawl seems to be slightly increasing with regard to the URLs set apart

for this research which mainly lie in the .nl domain, it has become clear that indirect collection of

webpages through Common Crawl would not be sufficient for this research. The next sub-section will

therefore elaborate on the direct method of extracting webpages, namely through web scraping.

3.2.2. Direct Collection [Web scraping]

According to Mitchell (2015: 5), web scraping is the automatic gathering of information through any means other than a program interacting with an API. Web scraping is also known as screen scraping and web harvesting. It is most commonly done by writing a program which automatically queries web servers, requests data in the form of HTML code and parses this data in order to extract information (Mitchell, 2015: 5).

In order to extract the texts of the main pages belonging to the URLs, the Python urllib3 library was used in combination with grequests and the Beautiful Soup library6 (see appendix XI for code excerpts).7 With this program the HTML code of each main webpage is retrieved and the text (information) is extracted from this HTML code. In addition, the documents are collected in such a way that they can be used to produce clusters when a clustering algorithm is applied to them.
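The step from HTML code to plain text can be illustrated with the sketch below. To keep the example self-contained it uses Python's standard `html.parser` module as a stand-in for Beautiful Soup, and it leaves out the downloading of the page. The class and function names are hypothetical, and skipping the contents of script and style elements is an assumption about what counts as text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string as one space-separated line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```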

From the 956,540 URLs, 299,759 URLs could not be extracted or no text was found after extraction,

which is a reduction of 31.34%. From 460 Innovative URLs, the text of 70 innovative URLs could not

be extracted and thus resulted in a reduction of 17.95% (see appendix XI for code excerpts). Although

nearly one third of webpages could not be extracted, the direct method is still substantially more suc-

cessful than the indirect method. Thus, the direct method is chosen as the method for generating docu-

ments.8

3.3. Pre-processing

3.3.1. Language Detection

The Lang-Detect library from PyPI, which is a port of Google’s language detection library, is used to determine the language of each text. This is necessary for further (pre-)processing, e.g. stop-word removal, tokenisation and stemming, since all these operations depend on the language of the text.

The process of language detection started with a sample of 1,000 URLs, which were manually checked. From this sample it became evident that most (over 80%) of the texts are in Dutch (nl), as expected, and about 15% are in English (en). A manual check of a sample of the texts classified as Afrikaans (af) showed that these were in reality Dutch texts which had been misclassified. Since Afrikaans and Dutch are alike, this misclassification is understandable. The texts that were classified as French or Japanese were correctly classified. Of the texts classified as German (de) only 25% were correctly classified, while the remaining 75% were (misclassified) Dutch texts. The texts classified as Catalan were either English (50%) or Dutch (50%). The text classified as Polish (pl) was in reality a Dutch website which also had some hyperlinks to Polish Facebook pages. The texts classified as Tagalog (tl) (an Austronesian language) were in reality English texts about yoga, which contained some Asian words that led to the misclassification. The text classified as Chinese (zh-cn) was indeed a Chinese text. While many of the texts classified as English were correctly classified, there were also texts that were in reality Dutch texts with a lot of English terms (see chart 3).

6 urllib3 is an HTTP client for Python; see also https://pypi.python.org/pypi/urllib3.

7 Total running time was less than 48 hours when running simultaneously on 12 separate Jupyter Notebooks on an Intel® Xeon® CPU E5-2670 0 @ 2.60 GHz.

8 The direct collection could be improved if more time and effort were invested in solving the errors which occurred during the process. It has also been considered and proposed to combine the direct and indirect methods in order to create a larger corpus; this, however, was rejected in order to keep the collected data consistent.

Chart 3: Distribution of Languages

The decision was made to use only texts that were classified as Dutch or as Afrikaans, which results in a completely Dutch corpus. The reason for this decision is that it considerably reduces the number of dimensions and greatly simplifies the pre-processing and clustering processes. The reduction does not have any impact on the validity of the results that are needed to answer the research question.

3.3.2. Tokenization, Stop Word Removal and Stemming

Before the clustering algorithm is applied to the collected and selected documents, it is important to structure the data in such a way that it can be processed by the algorithm. The first step in creating structure is cutting the strings into smaller pieces (tokens or terms). This process is called tokenization and is done in this research with the NLTK word tokenizer. After tokenization additional “noise” is removed by removing stop words. Stop words are words that are used often in a language but do not add any meaning or value from a data science perspective (see appendix XI for code excerpts). Examples of Dutch stop words are: “aan”, “af”, “al”, “andere”, “dan”, “die”, “dit”, “doen”, “een”, “er”, “heb”, “hem”, “het”, “met”, “zei”, “zo”, “zou”, and so on. Each language has its own stop words. For this research the Dutch stop word library found in PyPI is used.
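Tokenization followed by stop word removal can be sketched as below. A simple regular expression stands in for the NLTK word tokenizer here, and the stop word set contains only the examples listed above rather than a complete Dutch stop word list; both are simplifications made for this illustration.

```python
import re

# Only the example stop words mentioned in the text; a real list is much longer.
STOP_WORDS = {"aan", "af", "al", "andere", "dan", "die", "dit", "doen", "een",
              "er", "heb", "hem", "het", "met", "zei", "zo", "zou"}


def tokenize(text):
    """Lower-case the text and split it into tokens consisting of letters only."""
    return re.findall(r"[^\W\d_]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop word set."""
    return [token for token in tokens if token not in stop_words]


tokens = tokenize("Het bedrijf levert met een team innovatieve software aan klanten.")
terms = remove_stop_words(tokens)
```

For the example sentence the remaining terms are "bedrijf", "levert", "team", "innovatieve", "software" and "klanten".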

3.3.3. Document Selection

Working with several languages would ultimately lead to several different clusterings, since e.g. English and Dutch texts share very few words even when they cover the same topic. Working with two or more languages with little word similarity would also increase the number of features. When Lang-Detect is applied to all documents an even larger list of languages appears, most of them representing only a small percentage of the documents (see appendix II, table ii). Among the languages that account for more than 1% of the documents, Dutch, English and Afrikaans clearly score highest. The sample showed that the documents classified as Dutch and those classified as Afrikaans were all Dutch. Since using more than one language would greatly increase the number of features, and thus the number of dimensions and the memory usage, all documents which are not classified as either Dutch or Afrikaans are omitted. This left 510,755 Dutch documents for further processing. From these 510,755 documents an additional 66,283 documents were removed because they had fewer than 20 terms. Documents with very few terms are unlikely to contain much information about the company; they only add noise to the experiments and do not form a good basis for clustering. A sample of the documents with fewer than 20 terms was reviewed, and the majority of these documents turned out to be error messages or ‘page under construction’ messages, which supported the decision to remove them. The threshold of 20 terms is in itself arbitrarily chosen. Thus 444,472 documents remained for further experiments (see appendix XI for code excerpts).
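The two selection rules described above, on language and on document length, amount to a simple filter. The sketch below assumes that each document is represented as a dictionary with a detected language code and a list of tokens, and that "terms" are counted as tokens; the function name and the data layout are hypothetical.

```python
def select_documents(documents, languages=("nl", "af"), min_terms=20):
    """Keep documents in an accepted language with at least min_terms tokens.

    Each document is a dict with a 'lang' code and a list of 'tokens'.
    """
    return [
        doc for doc in documents
        if doc["lang"] in languages and len(doc["tokens"]) >= min_terms
    ]
```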

3.3.4. Feature Selection

After scraping the text from the URL, all text is made lower case and punctuation is removed. In this way no distinction is made between identical words that appear at the beginning of a sentence or in the middle of a sentence. After tokenization, stemming is applied with the use of the NLTK SnowballStemmer. Stemming reduces the number of dimensions by bringing each word back to its stem. E.g. Works, Working, Worked, Worker and Workers will all be brought back to the word Work.

3.4. The Algorithm

While many algorithms may exist that are superior to k-means on small datasets, k-means mini-batch seems to be the most logical starting point when working with large datasets and limited resources, given the time frame of this project and the additional process of extracting texts from the webpages. The k-means mini-batch algorithm is explained more extensively in sub-section 2.4.

3.5. Vector Space Model Setup

After the previously mentioned pre-processing steps, the vector space model was constructed with a TF-IDF vectorizer. TF-IDF is the most commonly used weighting scheme and was therefore chosen for this setup. The sklearn TfidfVectorizer is used to build the vector space model (see appendix XI for code excerpts). For feature selection both uni-grams (one term) and bi-grams (two-term combinations) are used. Bi-grams may increase the effectiveness of the clustering algorithm, since documents in which the same word combinations are found are logically more similar to each other (Furnkranz, 1998). Features are selected by removing the most infrequent terms (min_df) and the most frequent terms (max_df). Several setups were tried (e.g. min_df = 0.07, min_df = 0.05, min_df = 0.01 and min_df = 0.007). Based on manual exploration, min_df = 0.01 appeared to give both a low number of very small clusters and clear cohesive clusters (see appendix V). After min_df = 0.01 (meaning that terms which appear in less than 1% of the documents are removed) was selected, several values of max_df were tried. Everything below max_df = 0.6 (meaning that terms which appear in more than 60% of the documents are removed) gave less obvious clusters, and everything from max_df = 0.7 upwards gave similar clusterings which appeared to be more cohesive. Therefore min_df was set to 0.01 and max_df to 0.7, which gave both a low number of very small clusters and a fair number of clear cohesive clusters. Although a combination of uni-grams and bi-grams was used for building the vector space model, the frequency of recurring uni-gram features far exceeded that of the bi-grams. As a result, removing the features that appear in less than 1% of the documents meant losing all bi-grams except two.
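What the min_df and max_df thresholds do can be shown with a small pure-Python sketch. It is not the sklearn implementation: the function name is hypothetical, documents are assumed to be lists of tokens, and both thresholds are interpreted as proportions of the number of documents, with terms exactly on a threshold being kept.

```python
from collections import Counter


def select_features(documents, min_df=0.01, max_df=0.7):
    """Return the terms whose document frequency lies within [min_df, max_df].

    documents is a list of token lists; min_df and max_df are proportions
    of the total number of documents.
    """
    n_docs = len(documents)
    # Count in how many documents each term appears (not how often in total).
    df = Counter(term for doc in documents for term in set(doc))
    return {
        term for term, count in df.items()
        if min_df * n_docs <= count <= max_df * n_docs
    }
```

With ten documents, min_df = 0.2 and max_df = 0.7, a term that appears in every document is dropped as too frequent, a term that appears in a single document is dropped as too rare, and a term that appears in three documents is kept.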

3.6. Elbow Method

As Bholowalia & Kumar (2014) explain in their article, the elbow method tries to find the ideal number of clusters for modelling data. In the elbow method the number of clusters is increased incrementally, while the sum of squared errors is calculated for each number of clusters. As the number of clusters increases the sum of squared errors goes down, until a certain point after which the sum of squared errors remains more or less stable. This point is the so-called “elbow” and theoretically this is the ideal number of clusters, since from this point on each added cluster no longer leads to a substantially smaller sum of squared errors.
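The two ingredients of the elbow method, the sum of squared errors for one clustering and the search for the point where it levels off, can be sketched as follows. The elbow is usually read from a plot; the numeric rule used here (stop when an extra step lowers the error by less than 10%) is a hypothetical threshold chosen for the illustration, as are the function names.

```python
def sum_of_squared_errors(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(point, centroids[label]))
        for point, label in zip(points, labels)
    )


def find_elbow(sse_by_k, min_relative_gain=0.1):
    """Smallest k after which a further increase of k lowers the SSE by less
    than min_relative_gain, expressed as a share of the current SSE."""
    ks = sorted(sse_by_k)
    for current, following in zip(ks, ks[1:]):
        if sse_by_k[current] == 0:  # a perfect fit cannot be improved
            return current
        gain = (sse_by_k[current] - sse_by_k[following]) / sse_by_k[current]
        if gain < min_relative_gain:
            return current
    return ks[-1]
```

Here `sse_by_k` maps each tried number of clusters to the sum of squared errors obtained for it, for example from repeated runs of a clustering algorithm.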

3.7. Analysing Clusters

3.7.1. Innovative Clusters

In order to analyse to what extent innovative clusters have surfaced, the documents classified as innovative are identified within the surfaced clusters. The relative overrepresentation of innovative documents9 within a cluster is calculated by subtracting the percentage of all documents that fall in that cluster from the percentage of all innovative documents that fall in that cluster.

$$\text{Relative Overrepresentation} = \left(\frac{\text{Number of innovative documents in cluster}}{\text{Total number of innovative documents}} - \frac{\text{Number of documents in cluster}}{\text{Total number of documents}}\right) \times 100\%$$

9 Within this research innovative documents are documents which are classified in accordance with the lists containing the KvK top 100 of most innovative companies.

The relative overrepresentation helps to establish whether innovative documents behave differently than

other documents. A greater (smaller) relative overrepresentation might indicate that a cluster has a

greater (smaller) pull on innovative documents.
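The relative overrepresentation can be computed directly from four counts. The sketch below follows the formula given in this sub-section; the function and parameter names, as well as the numbers in the example, are made up for the illustration.

```python
def relative_overrepresentation(innovative_in_cluster, total_innovative,
                                documents_in_cluster, total_documents):
    """Difference, in percentage points, between the cluster's share of the
    innovative documents and its share of all documents."""
    innovative_share = innovative_in_cluster / total_innovative
    overall_share = documents_in_cluster / total_documents
    return (innovative_share - overall_share) * 100


# Hypothetical cluster: 30 of 300 innovative documents, 5,000 of 100,000 documents.
example = relative_overrepresentation(30, 300, 5000, 100000)
```

In this hypothetical case the cluster holds 10% of the innovative documents but only 5% of all documents, which gives a relative overrepresentation of 5 percentage points. A negative outcome indicates that innovative documents are underrepresented in the cluster.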

3.7.2. Similarity to SBI and Cohesiveness

In order to analyse surfacing clusters, the surfaced clusters are compared to the SBI and the innovative documents are located in the surfaced clusters. The similarity to the SBI and the cohesiveness of the clusters can be analysed by identifying the SBI codes of the documents within each cluster and determining the most dominant category. This is illustrated in figure 3. Figure 3 represents one cluster; the differently coloured circles are documents which can be traced back to different categories in the SBI. In this research the documents are compared to level 2 and level 4/5 of the SBI.10

Figure 3: Analyzing Clusters

As shown in figure 3, different SBI categories can be present within one cluster. Calculating the most dominant category gives an insight into the document distribution within a cluster. Comparing the documents in the cluster of figure 3 to the level 4/5 categories of the SBI results in a most dominant level 4/5 category (Processing of vegetable or fruit (no juice)), which represents 40% of the total documents in that cluster. Since level 4/5 SBI categories are the most specific categories in the SBI, this percentage indicates how similar the unsupervised clustering is to the designed SBI categorisation. Logically, clusters with a relatively larger dominant category are more cohesive than clusters with a relatively smaller dominant category. Comparing the documents in clusters to the level 2 SBI categories indicates whether documents that are clustered together fall into the same business sector. In figure 3, comparing the documents to the level 2 SBI categories leads to a most dominant level 2 category (Manufacture of food), which represents 60% of the total documents in that cluster. Comparing the clustering to level 2 SBI categories therefore gives a more general indication of the cohesion within clusters.
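The dominant-category calculation can be sketched as follows. This is an illustration under stated assumptions rather than the script used in this research: the function name is hypothetical, the SBI codes in the example are placeholders, and the level 2 category is assumed to be given by the first two digits of a code.

```python
from collections import Counter


def dominant_category(sbi_codes, digits=None):
    """Return (category, share in %) of the most frequent SBI category.

    If `digits` is given, codes are first truncated to that many leading
    digits (digits=2 is assumed here to yield the level 2 category)."""
    categories = [code[:digits] if digits else code for code in sbi_codes]
    category, count = Counter(categories).most_common(1)[0]
    return category, 100 * count / len(categories)


# Illustrative cluster of ten documents; the SBI codes are placeholders.
cluster = ["1039", "1039", "1039", "1039", "1071", "1071",
           "4711", "5610", "6201", "9602"]

print(dominant_category(cluster))            # ('1039', 40.0)
print(dominant_category(cluster, digits=2))  # ('10', 60.0)
```

Averaging the returned share over all clusters of a clustering would give one similarity (level 4/5) and one cohesiveness (level 2) score per value of k.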

3.7.3. Analysing Stability of Surfacing Clusters

As mentioned in the theoretical framework, a changing placement of the initial centroids might lead to different clusters. In order to estimate the stability of the surfacing clusters, the model is run five times for each value of k. This leads to three collections (k=100, k=500 and k=1500), each containing five sets of clusters. Within each collection, every cluster in a set is compared to the most similar cluster (the one containing the highest percentage of identical documents) in each of the other sets of that collection. The clusters in set 1 are compared to sets 2, 3, 4 and 5; those in set 2 to sets 3, 4 and 5; those in set 3 to sets 4 and 5; and finally those in set 4 to set 5. For each collection the average of the similarities is taken. Figure 4 gives an example of one cluster and the measure in which it resurfaces in each set. The example gives an average of 71%, which indicates that on average 71% of the documents in the cluster resurface together in one cluster. The same process is repeated for all clusters and for all three collections. This results in a collection of stability measures which can be categorized and reported (see chart 4). The colours and numbers in figure 4 do not represent a value or feature, but show how a cluster might resurface containing different combinations of documents.

Figure 4: Consistency of Surfacing Clusters
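The pairwise comparison of sets can be sketched as follows. The sketch assumes that the similarity between two clusters is the fraction of the first cluster's documents that also occur in the second, and that clusters are non-empty; the function names and the toy runs are hypothetical and only serve as an illustration.

```python
from itertools import combinations


def best_overlap(cluster, other_run):
    """Highest fraction of `cluster`'s documents found together in a single
    cluster of `other_run` (clusters are assumed to be non-empty)."""
    return max(len(cluster & other) / len(cluster) for other in other_run)


def average_stability(runs):
    """Average best-overlap score over all clusters and all pairs of runs.

    `runs` is a list of clusterings; each clustering is a list of sets of
    document identifiers."""
    scores = []
    # combinations() yields set 1 vs 2..n, set 2 vs 3..n, and so on.
    for run_a, run_b in combinations(runs, 2):
        for cluster in run_a:
            scores.append(best_overlap(cluster, run_b))
    return sum(scores) / len(scores)


# Three toy runs over six documents (identifiers are made up).
runs = [
    [{"a", "b", "c"}, {"d", "e", "f"}],
    [{"a", "b", "c"}, {"d", "e", "f"}],
    [{"a", "b", "d"}, {"c", "e", "f"}],
]
print(round(average_stability(runs), 2))  # 0.78
```

Keeping the individual scores per cluster, instead of only their average, would allow the distribution of stability over clusters to be reported as well.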


3.8. Practical Outline

The following figure (5) presents the practical outline. It gives an overview of the (pre-)processing steps leading to the end results and of the number of documents in each phase of the research process.

Figure 5: Practical Outline


Section 4: Results

This section describes the results of applying the k-means algorithm to the pre-processed data. For k=100 the URLs within each cluster are compared to each other in order to check for cluster cohesiveness; samples are checked manually for each cluster. This is not done for k=500 and k=1500, however, since it would be too time consuming for larger values of k. Therefore, this method is not reported further in this section (the results and a description of this method can be found in appendix IV).

In order to analyse the measure in which innovative clusters surface, the relative overrepresentation of innovative documents within clusters is calculated as explained in sub-section 3.7.1. The cohesiveness of the clustering and its similarity to the SBI are analysed as explained in sub-section 3.7.2. Appendices VII and VIII show complete tables of the results for the level 4/5 and level 2 SBI category comparisons.

4.1. Finding Optimum k

In order to determine the optimum number of clusters for the k-means mini batch algorithm, the elbow method is applied to the data. This led to graphs 1 and 2. Graph 1 shows that the SSE has a downward trend when the number of clusters is increased. However, the trend line does not go down steadily but is very volatile. This is in accordance with the article of Daas, Puts, Buelens, & van den Hurk (2015), in which volatility is named as one of the three characteristics of Big Data. The so-called ‘elbow’ appears around k=100. This becomes even clearer when the curve is smoothed by showing fewer points (see graph 2).

Graph 1: Results Elbow Method

Graph 2: Results Elbow Method "Smoothed"

While k=100 is the starting position, since theoretically it should lead to the clustering with the highest quality, this research also considers two other values of k: k=1500, since 1500 is the total number of sub-categories which can be found in the SBI, and k=500, an arbitrarily chosen value between the value found by the elbow method and the value corresponding to the total number of categories in the SBI. k=500 is chosen to better comprehend how the surfacing clusters change when scaling up from the number of clusters which should be the optimum according to the elbow method (k=100) to a number of clusters comparable to the number of SBI categories.

4.2. Cluster Analysis

4.2.1. Innovative Cluster Analysis

While a relative overrepresentation of innovative documents has been found in some clusters, no clear innovative clusters have surfaced in any of the clusterings. The relative overrepresentation is calculated as explained in sub-section 3.7.1. Analysing the relative overrepresentation in clusters makes the following clear:

For k=100 the cluster with the greatest overrepresentation is cluster 83, which has an overrepresentation of 12.94%. Cluster 83 contains 2.85% of the total documents and 15.79% of all innovative documents. The most dominant level 4/5 category in cluster 83 is 6201 – Writing, producing and publishing software, representing 19.6% of the documents in the cluster. Another cluster that stands out in the k=100 clustering is cluster 12, which has an overrepresentation of 2.71%. This cluster contains 23.61% of the total documents and 26.3% of all innovative documents, which clearly makes it the biggest cluster in the k=100 clustering. The most dominant level 4/5 category in cluster 12 is 94997 – Other interest organizations n.e.c., representing 4% of the documents in the cluster (see appendix VI).

In the k=500 clustering the innovative documents are more spread out amongst different clusters. The highest overrepresentation is found in cluster 191, which has an overrepresentation of 5.39%. The most dominant level 4/5 category in cluster 191 is 6420 – Financial holding, representing 6% of the documents in the cluster (see appendix VI).

With regard to the innovative documents in the k=1500 clustering, two clusters stand out: clusters 103 and 309. Both clusters contain an above average number of documents, especially when compared to the clusters with an overrepresentation in k=500. Cluster 103 contains 10.3% of the total documents and 15.45% of the total innovative documents, and thus has a relative overrepresentation of 5.42%. Cluster 309 contains 17.46% of the total documents and 19.92% of the total innovative documents, and thus has a relative overrepresentation of 2.46% (see appendix VI).

4.2.2. Analysing Cohesiveness and Similarity to SBI

The cohesiveness of the clustering and its similarity to the SBI are analysed as explained in sub-section 3.7.2. Table 2 below summarizes the average similarity and cohesiveness scores for each value of k; these are explained further thereafter.


| | k=100 | k=500 | k=1500 |
|---|---|---|---|
| Level 4/5 (Similarity) | 23.11% | 32.9% | 30.2% |
| Level 2 (Cohesiveness) | 35.21% | 38.41% | 44.97% |

Table 2: Similarity and Cohesiveness Scores

When comparing the surfaced clusters of k=100 to the level 4/5 SBI categories, the following table (3) can be constructed:

| Cluster | Level 4/5 most dominant sub-category | SBI 2008 | Percentage documents with SBI-code represented by most dominant category | Number of documents with SBI-code |
|---|---|---|---|---|
| 89 | Photography | 74201 | 83.30% | 12 document(s) |
| 88 | Beauty treatment, pedicures and manicures, make-up and image consulting | 96022 | 62.30% | 2939 document(s) |
| 54 | Advertising agencies | 7311 | 58.30% | 12 document(s) |
| 0 | Sale and repair of passenger cars and light motor vehicles (no import of new cars) | 45112 | 56.80% | 3776 document(s) |
| 24 | Other service activities n.e.c.* | 9609 | 56.70% | 930 document(s) |
| 17 | Restaurants | 56101 | 56.40% | 2355 document(s) |
| 85 | Landscape service activities | 8130 | 54.10% | 1709 document(s) |

Table 3: Most similar clusters to SBI for k=100

For k=100, the most dominant level 4/5 SBI category represents on average 23.11% of the documents in a cluster. When considering the cohesiveness of the clustering by comparing to the level 2 SBI categories, the most dominant category represents on average 35.21% of the documents.


When scaling up to k=500, more clusters with a higher similarity appear: the average percentage of documents in the dominant category rises from 23.11% to 32.9% when compared to the SBI level 4/5 categories. Remarkable are the clusters in the sorted table extract below, which reach almost 99% similarity (see table 4). When comparing to the level 2 categories the average becomes 38.41%.



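| Cluster | Level 4/5 most dominant sub-category | SBI 2008 | Percentage documents with SBI-code represented by most dominant category | Number of documents with SBI-code |
|---|---|---|---|---|
| 216 | Hairdressing | 96021 | 98.50% | 65 document(s) |
| 217 | General dental practices | 86231 | 93.20% | 59 document(s) |
| 45 | Dispensing chemists | 4773 | 86.60% | 149 document(s) |
| | Sale and repair of passenger cars and light motor vehicles (no import of new cars) | 45112 | 81.20% | 16 document(s) |
| 128 | Football | 93121 | 75.00% | 60 document(s) |
| 228 | Insurance agents | 6622 | 72.70% | 165 document(s)* |
| 243 | Insurance agents | 6622 | 71.20% | 66 document(s)* |
| | Beauty treatment, pedicures and manicures, make-up and image consulting | 96022 | 70.50% | 611 document(s)* |
| | Beauty treatment, pedicures and manicures, make-up and image consulting | 96022 | 70.30% | 912 document(s)* |
| 200 | Photography | 74201 | 69.40% | 631 document(s) |

Table 4: Most similar clusters to SBI for k=500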

* For some of the clusters (228, 70) which contain a high percentage of documents falling in the same SBI category, a similar cluster (243, 74) has surfaced nearby, with a nearly identical score but a different number of documents.


When scaling up to 1500 clusters, the average percentage of documents in the dominant category becomes 30.2%, which is 2.7 percentage points lower than for k=500. This clustering also creates many clusters (over 1000, or 66%) with fewer than 10 documents (see table 5). At level 2 of the SBI the average percentage of the dominant category is 44.96%, which is surprisingly higher than the level found for k=500.



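| Cluster | Level 4/5 most dominant sub-category | SBI 2008 | Percentage documents with SBI-code represented by most dominant category | Number of documents with SBI-code* |
|---|---|---|---|---|
| 1364 | General dental practices | 86231 | 96.7% | 60 document(s) |
| 925 | Practices of psychotherapists and psychologists | 86913 | 96.6% | 58 document(s) |
| | Sale and repair of passenger cars and light motor vehicles (no import of new cars) | 45112 | 95.2% | 21 document(s) |
| 819 | Other interest organizations n.e.c.* | 94997 | 95.2% | 5617 document(s) |
| 64 | Driving schools | 8553 | 94.3% | 1051 document(s) |
| | Beauty treatment, pedicures and manicures, make-up and image consulting | 96022 | 93.9% | 33 document(s) |
| 9 | Non-specialised stores with non-food (no department stores) | 47192 | 92.9% | 14 document(s) |

Table 5: Most similar clusters to SBI for k=1500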




The results from this section (4.2) lead to the conclusion that applying k-means mini batch with a higher value of k produces more cohesive clusters, while more innovative documents fall in bigger, less cohesive clusters. k=500 performs best when comparing to the SBI. When looking at the most cohesive clusters, hairdressers score highest in both the k=500 and k=1500 clustering, and dental practices take second place in both instances. When looking at more general cohesion using the level 2 SBI categories, k=1500 is most cohesive. Remarkably, when comparing to the level 2 SBI categories, category 86 – Human health activities is the most dominant category of the two best performing clusters in both the k=500 and the k=1500 clustering, in three out of four cases representing 100% of the documents (see appendix XIII). When looking at the level 2 cohesiveness of k=100, the best performing cluster is cluster 89, in which the most dominant level 2 category is 74 – Industrial design, photography, translation and other consultancy, representing 92% of the documents in that cluster. The second and third best performing clusters are clusters 4 and 66 respectively, in which the most dominant level 2 category is in both cases 86 – Human health activities, representing 81% and 80% respectively of the total documents in those clusters (see appendix XIII).

4.5. Cluster Stability

Chart 4 below shows the distribution of the stability of resurfacing clusters for the different values of k. The stability of the clusters is estimated as described in sub-section 3.7.3.

Chart 4: Stability of Resurfacing Clusters (x-axis: percentage of resurfacing documents in cluster; y-axis: percentage of resurfacing clusters; series: k = 100, k = 500, k = 1500)

For k=100, 15% of the clusters resurface with 90% to 100% identical documents, meaning that this part of the clusters will very likely surface each time the algorithm is run. For 65% of all clusters, however, less than 10% of the documents resurface together. In other words, only 15% of the clusters that surface when applying the algorithm will resurface with over 90% of their documents when it is run again. k=100 has an average stability of 25.71% (see appendix X), meaning that on average 25.71% of the documents that are clustered together will be clustered together again when the algorithm is run again. When applying the same test to the results of k=1500, only 11% of the clusters are 90% to 100% stable and the average stability is 22.73%, meaning that the appearing clusters are less stable than for k=100 (see appendix X). Finally, when the same test is applied to k=500, 18% of the clusters are 90% to 100% stable and the average stability is 32.07% (see appendix X).
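The distribution reported above can be obtained by sorting the per-cluster stability scores into bins of ten percentage points. The sketch below illustrates this; the function name, the bin width and the example scores are assumptions made for this illustration and do not come from the scripts of this research.

```python
def stability_distribution(scores, bin_width=10):
    """Percentage of clusters per stability bin.

    `scores` are per-cluster stability values in percent (0-100); a score
    of exactly 100 is counted in the top bin."""
    n_bins = 100 // bin_width
    counts = [0] * n_bins
    for score in scores:
        index = min(int(score // bin_width), n_bins - 1)
        counts[index] += 1
    return [100 * count / len(scores) for count in counts]


# Made-up stability scores for ten clusters.
scores = [3, 5, 8, 9, 12, 47, 55, 91, 95, 100]
distribution = stability_distribution(scores)
print(distribution[0], distribution[-1])  # 40.0 30.0
```

In this toy example 40% of the clusters fall in the lowest bin (less than 10% of documents resurfacing) and 30% in the highest bin (90% to 100%).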

An interesting fact with regard to the most stable clusters (90% - 100%) is that the companies that ended up in these clusters offer a very specific service or product, e.g. bakeries, floors, yoga or education. Among the +90% clusters of the k=500 runs, one cluster contains 123 different URLs which all lead to the Cool Blue store website (a web shop for electronic devices), e.g. https://www.3dprinterspecialist.nl/, http://www.fonduesetstore.nl, http://www.stofzuigerstore.nl, http://www.kookboekstore.nl, http://www.autoradiostore.nl and http://www.bestekstore.nl. Many hosting sites, such as Hosting2Go and CCV Shop, are also among these +90% clusters: URLs which might once have belonged to other companies now all point to the same kind of website, and thus generated similar documents which are clustered together.

Also remarkable is that, while k=100 should theoretically have led to the most stable clustering setup, in reality the arbitrarily chosen k=500 setup turned out to be the most stable. This may be an indication that the elbow method becomes insufficient when working with larger corpora (see Gupta & Srivastava (2014: 7), Bholowalia & Kumar (2014) and Daas, Puts, Buelens, & van den Hurk (2015)). Further research is needed to confirm this, however. Alternatives to the currently available methods are required to determine the best value of k when working with larger corpora.

The conclusion we draw from this is that clustering with k-means mini batch under the given conditions is not a stable method that leads to high quality statistics in accordance with the standards of National Statistics Institutes (NSIs). Moreover, clustering does not seem to be stable or effective enough for creating a specific categorization of innovative companies which could be effectively used by municipalities or other organizations.


Section 5: Discussion and Conclusions

5.1. Discussing Results

Big data implies working with many dimensions and large numbers of data points, and it allows for many options and angles to approach a problem. That means that many choices have to be made and a lot of optional parameters have to be adjusted. Only a few of these options were used in this research in order to answer the research question. It has been shown that with the setup parameters and pre-processing techniques used, no distinct clusters have appeared which primarily contain innovative companies. While some clusters in the k=100, k=500 and k=1500 setups do show a relative overrepresentation of innovative documents, the innovative documents are still distributed over many clusters. More robust clusters did surface for documents which contain specific descriptions of the goods and services offered by the corresponding companies, and thus k-means mini batch clustering worked fairly well for clustering these particular documents. Clusters which mainly contained innovative documents did not appear, however, with the pre-processing and algorithm used in this research. This research also showed that innovative companies tend to end up within various SBI categories and not only in a remainder category. Besides the fact that the performed experiments have not led to clearly identifiable innovative clusters, only 11% of the clusters (in k=1500) appeared to be stable (90% - 100%). For the majority of clusters, the initial centroid placement led to a different resurfacing of clusters (MacKay, 2003: 288). This is not nearly sufficient for the standards required for the official statistical purposes of NSIs, and thus of Statistics Netherlands (Struijs, Braaksma, & Daas, 2014). Unfortunately, this means that the method does not transform the data into useful information for municipalities or other public institutions.

5.2. Answering Research Questions

After collecting the results, we can answer the research questions. For the main research question:

RQ: “In what measure will clusters of innovative companies surface when unsupervised machine learning is used on the textual content of the webpages of Dutch companies?”

We can state that no clear cohesive clusters of innovative companies have surfaced as a result of using the techniques in this thesis. Although some clusters surfaced in which innovative URLs were overrepresented, no actual innovative clusters have been found.

With regard to the first sub-question (SQ1):

SQ1: “Does the Common Crawl Archive have sufficient coverage of the Dutch company websites circumscribed by Statistics Netherlands for this research?”


Common Crawl was found to have insufficient coverage of Dutch companies. As a consequence, for

this research we needed to collect the data with a direct method. It also meant that at the time of the

research Common Crawl did not prove to be a reliable source for statistical research concerning Dutch

websites, which connects with the concern posed by Daas, Puts, Buelens, & van den Hurk (2015), about

missing data in Big Data analytics for official statistics.

With regard to the second sub-question (SQ2):

SQ2: “Do the clusters that surface show similarities to the SBI, and in what measure are these clusters cohesive and stable?”

For companies which offer services or goods which are explicitly circumscribed, e.g. hairdressers and dentists, the surfaced clusters tend to be very similar (up to 98.5%) to the level 4/5 categories of the SBI. For the most part, however, the clusters were less similar to the SBI. The similarity and cohesiveness are shown in table 2 in sub-section 4.2.2. While the k=500 clustering is most similar to the SBI, the k=1500 clustering is most cohesive when comparing to the level 2 SBI categories. The most cohesive clusters for k=500 and k=1500 fall within the level 2 SBI category (86) Human health activities. For k=100 this is (74) Industrial design, photography, translation and other consultancy, followed by (86) Human health activities.

For k=100, only 15% of the clusters resurface with 90% to 100% of their documents. For k=500 this was 18%, and for k=1500 11%. For the majority of the clusters, less than 10% of the documents resurfaced within the same cluster: for k=100 this is 65% of the clusters, for k=500 57% and for k=1500 69% (see chart 4). This shows that for most documents in a cluster, and thus for most websites, clustering under the chosen conditions does not lead to consistent results. Therefore, it can be said that the clustering is very much affected by the initial centroid placement, as explained by MacKay (2003: 288).

5.3. Discussing Shortcomings

This research has been about clusters that have naturally surfaced with the use of the k-means mini batch algorithm. No alternative pre-processing (e.g. POS tagging) has been performed on the corpus. Another shortcoming of this research is the fact that only a limited number of innovative companies were available for validation purposes, while the list of URLs will have contained many more unidentified innovative documents. A more complete set of innovative documents would have led to clearer results. Nevertheless, when a more complete list of URLs belonging to innovative companies is made available or constructed, the data and scripts from this research could be used to further analyse the way the innovative companies are distributed amongst the clusters. While an added list of innovative URLs might lead to more insight into this distribution, it is highly unlikely that it would change the final conclusions of this research regarding the surfacing of innovative clusters.

5.4. Recommendations to Statistics Netherlands

The problem posed by Statistics Netherlands, that municipalities and other local public institutions could not identify innovative companies with the current SBI, could not be solved by applying k-means mini batch with the chosen setup. It has thus become apparent that the main texts of innovative documents are not sufficiently different to be clustered separately from the non-innovative documents. It is unlikely that further fine-tuning would lead to better results concerning the innovative companies, nor is it likely that using different unsupervised algorithms would lead to a better result. This research focussed on the use of unsupervised techniques to distinguish innovative documents from non-innovative documents; for further research a supervised machine learning algorithm is recommended. The innovative companies which are already identified could then be used as training/validation sets and test sets in a supervised machine learning setup. Ikonomakis, Kotsiantis, & Tampakas (2005) mention various algorithms which are used for the classification of documents: Naïve Bayes, Decision Trees, Nearest Neighbour, Support Vector Machines, or a combination of these in an ensemble machine learning setup. When a successful method for categorizing innovative companies is found, however, Statistics Netherlands should note that under current EU regulations it cannot be used to replace the existing SBI, because EU regulation No 1893/2006 obliges Statistics Netherlands to follow the structure of the NACE. Statistics Netherlands would thus not be able to formally change the SBI (which follows the NACE structure) independently from other EU Member States (Centraal Bureau voor de Statistiek, 2017c). An alternative categorization system might, however, exist next to the current SBI in order to meet the demands of municipalities and other organisations.

As shown in sub-section 3.1, most innovative URLs did not end up in a remainder category, while according to Daas this is one of the problems Statistics Netherlands has with categorizing new innovative companies. This suggests that the innovative companies chosen for this research do not represent the full scope of innovative companies that Statistics Netherlands is struggling with. While in this research it was clear that innovative documents were scattered amongst many different clusters, and thus no clear innovative cluster or set of clusters was formed, it could be of added value for further research to create a more representative data set of innovative documents, especially if a supervised machine learning algorithm is applied. The data collected throughout this research, as well as the scripts for collecting and processing the data, have been made available to Statistics Netherlands and can thus be used for further research. The textual corpus created throughout this research can also be used when applying supervised machine learning techniques.


It is further advised not to use Common Crawl as a data source for official statistics. As shown in this research, the coverage of Common Crawl for the list provided by Statistics Netherlands improved by 2% between September 2016 and March 2017, to a little over 29%, but that is still insufficient. The coverage will likely continue to improve, but there is still a long way to go before it can be used as a valid source for similar statistical research, at least for the list of URLs provided by Statistics Netherlands. In other words, Common Crawl is missing too much data to fully rely on for statistical purposes (Daas, Puts, Buelens, & van den Hurk, 2015).

Since the majority of the clusters that did appear proved to be unstable, clustering webpage documents with k-means mini batch does not seem to lead to the stable results needed by Statistics Netherlands. It is therefore advised not to use k-means mini batch for statistical purposes, at least not in projects similar to this research (Daas, Puts, Buelens, & van den Hurk, 2015).

5.5. Directions for further research

In this study it has become apparent that documents which were labelled as “innovative” did not naturally cluster together when applying the k-means mini batch algorithm. While studies about the clustering of webpages do exist, especially for creating or improving search engines (Zang, Pang, Xie, & Wu, 2006; Zeng, He, Chen, Ma, & Ma, 2004), no specific studies could be found that deal with clustering on a higher abstraction level when working with large data sets. The elbow found in this research did not lead to the optimal value of k, as would follow from Gupta & Srivastava (2014: 7) and Bholowalia & Kumar (2014). The curve found in section 4.1 shows much volatility, which is in accordance with Daas, Puts, Buelens, & van den Hurk (2015), who mark volatility as one of the characteristics of Big Data. Different methods should thus be researched for determining the optimal value of k when working with larger corpora.


Cited Works

Abualigah, L. M., Khader, A. T., Al-Betar, M. A., & Alomari, O. A. (2017). Text feature selection with a robust weight scheme and dynamic dimension reduction to text document clustering. Expert Systems With Applications, 24-36.

Almeida, L. G., Vasconcelos, A. T., & Maia, M. A. (2009). A Simple and Fast Term Selection Procedure for Text Clustering. In N. Nedjah, L. de Macedo Mourelle, J. Kacprzyk, F. M. França, & A. F. de Souza, Intelligent Text Categorization and Clustering (pp. 47-64). Berlin: Springer.

Bellman, R. E. (1957). Dynamic programming. Princeton: Princeton University Press.

Bessant, J., & Tidd, J. (2011). Innovation and Entrepreneurship. Chichester: John Wiley & Sons Ltd.

Bholowalia, P., & Kumar, A. (2014). EBK-Means: A Clustering Technique based on Elbow Method and K-Means in WSN. International Journal of Computer Applications, 17-24.

Bradley, P. S., Bennett, K. P., & Demiriz, A. (2000). Constrained K-Means Clustering. Microsoft Research Technical Report (MSR-TR) 2000-65, 1-9.

Centraal Bureau voor de Statistiek. (2016, September 27). CBS start uniek initiatief voor big data-onderzoek. Retrieved from CBS.nl: https://www.cbs.nl/nl-nl/nieuws/2016/39/cbs-start-uniek-initiatief-voor-big-data-onderzoek

Centraal Bureau voor de Statistiek. (2017a, September 8). CBS Urban Data Centre the Hague Launched. Retrieved from CBS.nl: https://www.cbs.nl/en-gb/corporate/2017/26/cbs-urban-data-centre-the-hague-launched

Centraal Bureau voor de Statistiek. (2017b, August 20). SBI 2008 - Standaard bedrijfsindeling 2008. Retrieved from CBS.nl: https://www.cbs.nl/nl-nl/onze-diensten/methoden/classificaties/activiteiten/sbi-2008-standaard-bedrijfsindeling-2008

Centraal Bureau voor de Statistiek. (2017c, August 17). Standard Industrial Classifications (Dutch SBI 2008, NACE and ISIC). Retrieved from CBS.nl: https://www.cbs.nl/en-gb/our-services/methods/classifications/activiteiten/standard-industrial-classifications--dutch-sbi-2008-nace-and-isic--

Centraal Planbureau. (2016). Kansrijk innovatiebeleid. Den Haag: Centraal Planbureau.

Cho, H., & An, M. K. (2014). Co-Clustering Algorithm: Batch, Mini-Batch, and Online. International Journal of Information and Electronics Engineering, 340-346.

Commoncrawl.org. (2017, April 04). CC-mrjob. Retrieved from Common Crawl: https://github.com/commoncrawl/cc-mrjob


Cutting, D. R., Karger, D. R., Pedersen, J. O., & Yukey, J. W. (1992). Scatter/ Gather: a cluster-based

approach to browsing large document collection. SIGIR.

Daas, P. J., Puts, M. J., Buelens, B., & van den Hurk, P. A. (2015). Big Data as a Source for Official

Statistics. Journal of Official Statistics, 249-262.

Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A Density-Based Algorithm for Discovering

Clusters. Proceedings of the Second International Conference on Knowledge Discovery and

Data Mining (KDD-96), 288-231.

Feldman, R., & Sanger, J. (2007). The text Mining Handbook: Advanced Approaches in Analyzing

Unstructured Data. Cambridge: Cambridge University Press.

Frey, B. J., & Dueck, D. (2007). Clustering by Passing Messages. Science, 972-974.

Furnkranz, J. (1998). A Study Using n-gram Features for Text Categorization. Wien: Austrian Research

Institute for Artificial Intelligence.

Gupta, H., & Srivastava, R. (2014). k-means Based Document Clustering with automatic "k" Selection

and Cluster Refinement . International Journal of Computer Science and Mobile Applications,

7-13.

Hollanders, H., & Es-Sadki, N. (2017). European Innovative Scoreboard 2017. Brussels: European

Commision.

Ikonomakis, I., Kotsiantis, S., & Tampakas, V. (2005). Text Classification Using Machine Learning

Techniques. WSEAS TRANSACTIONS on COMPUTERS, 966-974.

Inderjit, S. D., & Dharmendra, S. M. (2000). Concept Decompositions for Large Sparse Text Data using

Clustering. IBM Research Report RJ 10147.

Kohonen, S., Kaski, S., Lagus, K., Salojarvi, J., Honkela, J., Paatero, V., & Saarela, A. (2000). Self

organization of a masive document collection. IEEE Transactions, 574-585.

Kulis, B., & Jordan, M. I. (2012). Revisiting k-means: New Algorithms via Bayesian. Proceedings of

the 29th International Conference on Machine Learning (ICML-12) (pp. 513-520). WUSTL

Machine Learning Group.

MacKay, D. J. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge:

Cambridge University Press.

Manning, C. D., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval.

Cambridge: Cambridge University Press.

McKeown, K. R., Barzilay, R., & Evans, D. (2002). Tracking and summarizing news on a daily basis

with Columbia's Newsblaster. HLT.


Miner, G., Delen, D., Elder, J., Hill, T., & Nisbet, R. (2012). The Seven Practice Areas of Text Analysis.

In G. Miner, D. Delen, J. Elder, T. Hill, & R. Nisbet, Practical Text Mining and Statistical

Analysis for Non-Structured Text Data Applications (pp. 29-41). Amsterdam: Amsterdam.

Mitchell, R. (2015). Web Scraping with Python. Sebastopol: O'Reilly books.

OECD. (2014). OECD Reviews of Innovation Policy: Netherlands. Paris: OECD.

Raschka, S. (2015). Python Machine Learning. Birmingham: Packt Publishing.

Rijsbergen, C. J. (1989). Information Retrieval. London: Butterworths.

Sarkar, D. (2016). Text Analysis with Python. Bangalore: Apress.

Schumpeter, J. A. (1975). Capitalism, Socialism and Democracy. New York: Harper.

Scott, B. R. (2011). Capitalism: Its Origins and Evolution as a System of Governance. Boston: Springer.

Sheshasayee, A., & Thailambal, G. (2016). A Study on K-means Clustering in Text Mining Using

Python. International Journal of Computer Systems, 560-564.

Sridhar, S. (2014). Design and Analysis of Algorithms. New Delhi: Oxford University Press.

Steinbach, M., Karypis, G., & Kumar, V. (2000). A Comparison of Document Clustering Techniques.

Proceedings of the International KDD Workshop on Text Mining (pp. 1-20). Minneapolis:

Department of Computer Science and Engineering, University of Minnesota.

Struijs, P., Braaksma, B., & Daas, P. J. (2014). Official statistics and Big Data. Big Data & Society, 1-6.

UN Classifications Registry. (2017, August 18). Retrieved from UNSD:

https://unstats.un.org/unsd/cr/registry/regcs.asp?Cl=27&Lg=1&Co=63

Wang, S., & Koopman, R. (2017). Clustering articles based on semantic similarity.

Scientometrics, 1017-1031.

Wilbur, J., & Sirotkin, K. (1992). The automatic identification of stopwords. Journal of Information

Science, 45-55.

Witten, I. H., Frank, E., & Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and

Techniques. Amsterdam: Morgan Kaufmann.

World Economic Forum. (2017). The Global Competitiveness Report 2016-2017. Geneva: World

Economic Forum.


Yadav, K., & Baria, J. (2014). Mini-Batch K-means Clustering Using Map-Reduce in Hadoop.

International Journal of Computer Science and Information Technology Research, 336-342.

Zang, H., Pang, B., Xie, K., & Wu, H. (2006). An Efficient Algorithm for Clustering Search Engine

Result. International Conference, Computational Intelligence and Security (pp. 661-671).

Guangzhou: Springer.

Zeng, H. J., He, Q. C., Chen, Z., Ma, W. Y., & Ma, J. (2004). Learning to cluster web search results.

Special Interest Group on Information Retrieval.


Appendix I: Common Crawl Results

Sample         Results September 2016:                     Results March 2017:
               percentage of URLs found in Common Crawl    percentage of URLs found in Common Crawl
1              28.0%                                       27.8%
2              28.3%                                       30.3%
3              29.7%                                       30.6%
4              28.2%                                       30.8%
5              28.8%                                       28.9%
6              25.8%                                       29.9%
7              27.1%                                       30.5%
8              28.2%                                       27.8%
9              27.5%                                       27.8%
10             28.2%                                       29.1%
Average        27.98%                                      29.35%
Running time   3:31:44                                     2:22:45


Appendix II: Scraping without Grequest – Language Detection

Table i: Extracting HTML

Sample      Produced an error   Returned no text   Successfully extracted   Time
1           17.40%              19.30%             63.30%                   0:43:29
2           18.00%              19.40%             62.60%                   0:44:28
3           16.30%              18.60%             65.10%                   0:43:19
4           15.60%              20.70%             63.70%                   0:29:27
5           15.80%              19.90%             64.30%                   0:34:05
6           16.60%              17.70%             65.70%                   0:28:39
7           16.30%              19.70%             64.00%                   0:40:27
8           16.50%              16.50%             67.00%                   0:40:57
9           18.60%              16.10%             65.30%                   0:31:21
10          17.10%              19.70%             63.20%                   0:37:08
Total                                                                       6:13:20
Mean        16.82%              18.76%             64.42%                   0:37:20
Std. dev.   0.009542            0.015226           0.013456                 0.00421

Table ii: Language detection

Language   URLs
af         1.26%
ca         0.24%
cs         0.01%
cy         0.06%
da         0.28%
de         0.25%
en         15.32%
es         0.07%
et         0.06%
fi         0.02%
fr         0.27%
hr         0.17%
hu         0.01%
id         0.04%
it         0.13%
lt         0.02%
lv         0.01%
nl         81.20%
no         0.14%
pl         0.08%
pt         0.05%
ro         0.08%
sk         0.03%
sl         0.03%
so         0.03%
sq         0.02%
sv         0.06%
tl         0.04%
tr         0.02%
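The summary rows of Table i can be recomputed from the ten sample rows. The sketch below is illustrative only and is not the script used for the thesis; the variable and function names are assumptions. It treats the standard deviation as the sample standard deviation expressed as a fraction rather than a percentage, which appears to match the values printed in the table.

```python
from datetime import timedelta
from statistics import mean, stdev


def parse_hms(text):
    """Convert an 'h:mm:ss' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


# 'Produced an error' percentages and running times copied from Table i.
error_pct = [17.40, 18.00, 16.30, 15.60, 15.80, 16.60, 16.30, 16.50, 18.60, 17.10]
times = ["0:43:29", "0:44:28", "0:43:19", "0:29:27", "0:34:05",
         "0:28:39", "0:40:27", "0:40:57", "0:31:21", "0:37:08"]

durations = [parse_hms(t) for t in times]
total_time = sum(durations, timedelta())   # 'Total' row
mean_time = total_time / len(durations)    # 'Mean' row, time column

mean_error = mean(error_pct)               # 'Mean' row, error column (percent)
# Sample standard deviation, divided by 100 to express it as a fraction.
std_error_fraction = stdev(error_pct) / 100

print(total_time, mean_time, round(mean_error, 2), round(std_error_fraction, 6))
```

The same two calls, mean and stdev, can be applied to the other percentage columns.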


Appendix III: Innovative Documents per Cluster k = 100

Cluster   Innovative   Total       Percentage Total   Percentage Total        Difference
          Documents    Documents   Documents          Innovative Documents
0         3            6652        1.50%              1.32%                   -0.18%
1         3            12836       2.89%              1.32%                   -1.57%
3         1            4323        0.97%              0.44%                   -0.53%
4         3            8356        1.88%              1.32%                   -0.56%
7         6            23793       5.35%              2.63%                   -2.72%
9         1            3551        0.80%              0.44%                   -0.36%
10        5            3394        0.76%              2.19%                   1.43%
12        60           104918      23.61%             26.32%                  2.71%
13        1            2745        0.62%              0.44%                   -0.18%
21        20           57699       12.98%             8.77%                   -4.21%
26        1            8990        2.02%              0.44%                   -1.58%
28        2            6496        1.46%              0.88%                   -0.58%
29        4            14483       3.26%              1.75%                   -1.51%
31        1            8734        1.97%              0.44%                   -1.53%
33        3            4286        0.96%              1.32%                   0.36%
34        3            5133        1.15%              1.32%                   0.17%
36        3            2638        0.59%              1.32%                   0.73%
40        16           34871       7.85%              7.02%                   -0.83%
47        3            1690        0.38%              1.32%                   0.94%
48        3            5892        1.33%              1.32%                   -0.01%
49        10           6031        1.36%              4.39%                   3.03%
56        2            4442        1.00%              0.88%                   -0.12%
61        8            11567       2.60%              3.51%                   0.91%
69        2            2029        0.46%              0.88%                   0.42%
71        1            4602        1.04%              0.44%                   -0.60%
77        4            4843        1.09%              1.75%                   0.66%
79        1            5816        1.31%              0.44%                   -0.87%
80        2            16240       3.65%              0.88%                   -2.77%
83        36           12654       2.85%              15.79%                  12.94%
85        1            3342        0.75%              0.44%                   -0.31%
87        1            6700        1.51%              0.44%                   -1.07%
90        1            4125        0.93%              0.44%                   -0.49%
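The Difference column is the cluster's share of all innovative documents minus its share of all documents, so a positive value means innovative documents are over-represented in that cluster. A minimal sketch of that calculation follows. The two totals are assumptions back-calculated from the percentages in the table (for example, 3 innovative documents correspond to 1.32%); they are not figures stated in this appendix, and the function names are hypothetical.

```python
# Assumed totals, inferred from the table's percentages (not stated in the appendix).
TOTAL_INNOVATIVE = 228
TOTAL_DOCUMENTS = 444_400


def share(part, total):
    """Return part as a percentage of total."""
    return 100.0 * part / total


def representation_gap(innovative_in_cluster, documents_in_cluster,
                       total_innovative=TOTAL_INNOVATIVE,
                       total_documents=TOTAL_DOCUMENTS):
    """Share of innovative documents minus share of all documents, in percentage points."""
    return (share(innovative_in_cluster, total_innovative)
            - share(documents_in_cluster, total_documents))


# Cluster 83: 36 innovative documents out of 12654 documents in the cluster.
print(round(representation_gap(36, 12654), 2))
```

Under these assumed totals the sketch gives a gap of about 12.94 percentage points for cluster 83, the largest over-representation in the table.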


Appendix IV: Sample Websites Cluster Quality Based on URL Names and Website Spot-checks

The original URL belonging to a document can reveal much about the kind of website it leads to.
For example: (1) http://www.mobilecarfix.nl, http://www.carwrapservice.nl and http://www.autobedrijfkhan.nl
(cluster 1) are all related to cars; (2) http://www.dekkerfietsen.com, http://www.tielemanfietsen.nl and
http://www.fietsershoptslimmer.nl (cluster 47) are all bicycle related; and (3) http://www.massage4all.nl,
http://www.liesbethmassagepraktijk.nl and http://www.mvmassage.nl (cluster 67) all offer some kind of
massage service. When URLs like these, along with other similar URLs, are found in the sample of a
cluster, it is clear that the cluster has a cohesive theme.

In the three examples above the URLs themselves contain recurring words, e.g. “fiets”, which makes them
easy to identify when looking through them manually. Another way to approach this is to look at the
string similarity of the URLs. To automate this, the Python class difflib.SequenceMatcher was used in
combination with itertools.combinations on samples of 300 URL second-level names, e.g. ‘dekkerfietsen’,
per cluster. This yields all possible pairs of URL names per cluster together with their similarity
scores. For each cluster the average of these similarity scores is calculated, which indicates how
similar the URL names in that cluster are and thus gives a score for each cluster.
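The pairwise comparison described above can be sketched as follows. This is a minimal illustration rather than the script used for the thesis: the function names, the way the second-level name is extracted (which assumes a single-label suffix such as .nl or .com) and the seeded sampling are assumptions.

```python
import random
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean
from urllib.parse import urlparse


def second_level_name(url):
    """Return the second-level name of a URL, e.g. 'dekkerfietsen'."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Assumes a single-label suffix such as .nl or .com.
    return labels[-2] if len(labels) >= 2 else host


def mean_url_similarity(urls, sample_size=300, seed=0):
    """Average SequenceMatcher ratio over all pairs of URL names in a cluster."""
    if len(urls) > sample_size:
        # Draw a reproducible sample when the cluster is larger than sample_size.
        urls = random.Random(seed).sample(urls, sample_size)
    names = [second_level_name(u) for u in urls]
    scores = [SequenceMatcher(None, a, b).ratio()
              for a, b in combinations(names, 2)]
    return mean(scores) if scores else 0.0


cluster_55 = ["http://www.paardenmiddel.nl", "http://www.paardenmiddel.com"]
print(f"{mean_url_similarity(cluster_55):.2%}")
```

For the two URLs of cluster 55 the second-level names are identical, so the score is 100%, in line with the value reported for that cluster in the table below.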

This method, however, is only effective when the URL name explicitly displays the product or service
the company is offering, e.g. cars (cluster 1: 27.53%), bikes (cluster 47: 28.53%), floors (cluster 51:
27.04%) or photography (cluster 89: 43.20%). Notice that most clusters without a clear similarity score
between 22 and 24 percent. For the clusters in which the product or service is explicitly mentioned,
the score ranges from 26% to 45%. The instances where a very high percentage was found, up to 100%,
were without exception very small clusters containing only 2 or 3 URLs. These scores should not be
taken into account, since these are usually identical documents with near-identical URL names, which
should have been clustered with other documents. Other clusters score low because the URL name itself
does not reveal much about the product or service delivered, e.g. (1) http://www.vanleeuwen-advocaat.nl,
http://www.ankerenanker.nl and http://www.barentskrans.nl, which are all lawyers or law firms. In these
cases some of the websites needed to be visited in order to determine whether or not there is a measure
of coherence within the cluster. Other examples of this are (2) http://www.operagorinchem.nl,
http://www.sari-djaya.nl and http://www.venezia-wijkbijduurstede.nl (cluster 5), which all turned out to
be food delivery services affiliated with Thuisbezorgd.nl; and (3) http://eazyict.nl,
http://www.djmissdeedy.com and http://www.hoenuverder.nl (cluster 13), which were all domain names
reserved by a company called TransIP. A second way to spot cohesiveness within clusters is by looking
at the top terms of the cluster. The top terms of cluster 3, for example, indicate cohesion within the
cluster: “cli cli ebnt ebnt wij zorg onz kantor person


werk juridisch begeleid mens mogelijk goed behandel”. Note that some of the terms above are not
correctly spelled Dutch, since stemming was applied before clustering. An example of a non-cohesive
cluster has the following URLs: http://www.hoogtechniek.eu, http://www.kimmenkehorst.nl and
http://www.meubelherstel.nl, and the following top terms: “info mail btw tel mail info kvk contact fax
onz den wij telefon all nummer com”. These top terms merely contain general information that is most
likely to be found on many Dutch websites. See the table below for results.

Clusters *    URLs    Topic    Score    Top Terms    Percentage similarity in URL

0 http://www.mobilecarfix.nl,

http://www.carwrapservice.nl,

http://www.autobedrijfkhan.nl,

http://www.derooyautoschade.nl,

http://www.bpam.nl,

http://www.autoenfiscus.nl,

http://www.brouwerlpg.nl,

http://www.rhcleaningproducts.nl,

http://www.gertbrandsen.nl,

http://www.autodemon-

tagevanderven.nl

Cars and

related to

cars

2 auto wij onderhoud onz servic

merk kunt reparatie all nieuw

car verkop bent terecht goed

27.53%

1

http://www.veba-elektro.nl,

http://www.warmerdam-

lichtwerk.nl,

http://www.htogroep.nl,

http://www.dcmbeheer.nl,

http://www.profectadvies.nl,

http://www.frankenesveld.nl,

http://www.kastenwanden.nl,

http://www.planhus.nl,

http://www.aannemingsbed-

rijfkooistra.nl,

http://www.dirkjankarsten.nl

Housing

and affili-

ated

2 woning bouw verbouw wij re-

novatie nieuwbouw project

huis onderhoud onz werkzam

goed kunt badkamer nieuw

23.93%

2 http://www.vrolijk.nl,

http://www.wishpel-vijver.nl,

http://www.cipela.nl,

http://www.palletwagenshop.nl,

http://www.gevelridder.nl,

http://www.lundbypoppenhuis.nl,

http://www.estherkrop.nl,

Various webshops

1 normal prijs special javascript

browser btw functionaliteit

websit browser javascript

functionaliteit incl uitgescha-

keld

23.95%


http://www.polarprofilters.nl,

http://www.bedankjes.nl,

http://www.water-

sportartikelen.com

3 http://www.dntw.nl,

http://www.vanleeuwen-advo-

caat.nl, http://www.notarisdos-

sier.nl, http://www.alkcare.nl,

http://www.bekendeparag-

nost.com, http://www.ankerenan-

ker.nl, http://www.dub-

belgenieten.nl,

http://www.keijservandervelden.n

l, http://www.sibb.nl

Lawyers and notaries

2 cli cli ebnt ebnt wij zorg onz

kantor person werk juridisch

begeleid mens mogelijk goed

behandel

23.21%

4 http://www.fysiodenham.nl,

http://www.smelik.huisarts-

plus.nl, http://www.tandartsen-

praktijkburgum.nl, http://www.ka-

relmatla.praktijkinfo.nl,

http://www.fysta.nl,

http://www.fysiotherapiegale-

cop.nl, http://www.petriepedi-

cure.nl, http://www.life-in-en-

ergy.nl, http://www.fysiotherapie-

heuvelland.nl, http://www.geels-

psychotherapie.nl

2 praktijk behandel pati pati

ebnt kunt onz ebnt informatie

wij afsprak medisch websit

welkom zorg therapie

25.23%

5 http://www.operagorinchem.nl,

http://www.sari-djaya.nl,

http://www.venezia-wijk-

bijduurstede.nl,

http://www.snackhouse-twins.nl,

http://www.eethuis-bon-ap-

petit.nl, http://www.seleraanda-

amstelveen.nl, http://www.pizze-

ria-popeye.nl, http://www.cleopat-

ragrillenschede.nl,

http://www.bellamilanomoor-

drecht.nl

Food de-

livery,

thuisbe-

zorgd.nl

2 beoordel med gemaakt websit

mogelijk bekijk lekker heer-

lijk eten grot keuz warm

super goed vandag

25.00%


7 http://www.luukimberg.nl,

http://www.phytofemme.nl,

http://www.chrispellefoto-

grafie.nl, http://www.rijschool-

hennyleenen.nl, http://www.prins-

synergy.nl,

http://www.phdevriesrijsoord.nl,

http://www.cnip.nl,

http://www.joyboelens.nl,

http://www.sionsluis.nl,

http://www.npvbommel-

erwaard.com

No clear

similari-

ties be-

tween

docu-

ments

-2 lev jouw jij jou mens werk jezelf goed mak war wer gan ander person wet

23.13%

8 http://www.mnprojecten.nl,

http://www.friendlydolphin.nl,

http://www.prominent-vast-

goed.nl, http://www.vankan-

dronten.nl,

http://www.kastdesign.nl

No clear

similari-

ties be-

tween

docu-

ments

-2 gerealiseerd garantie project

eig jar jar ervar ervar zzp

goedkop gezin gezond ging

glas goed goed adres

26.69%

9 http://www.bregtjedeboer.nl,

http://misterbassman.nl,

http://www.kinderboekwin-

keldegiraf.nl, http://www.falcon-

air-online.nl, http://www.wil-

lemdegroot.nl,

http://www.anessche.nl,

http://www.margriet4kids.nl,

http://www.pannen-

koekenboerderij.com,

http://www.uitvaartvereniging-

dokkum.nl, http://www.noord-

stee.nl

No clear

similari-

ties be-

tween

docu-

ments

-2 addy var document getele-

mentbyid prefix path getele-

mentbyid document var addy

text spambot prefix var var

path path var var prefix path

prefix

22.93%

10 http://www.oploss-

ingsgerichtdenkenenwerken.com,

http://www.kiwienzo.nl,

http://www.woonmallvil-

laarena.com, http://www.zwem-

bad-info.nl, http://www.monni-

kendam.nl, http://www.relatiebe-

middeling-info.nl, http://www.in-

ternetconnections.nl,

http://www.smale.nl,

No clear

similari-

ties be-

tween

docu-

ments

-2 cookies gebruik cookies ge-

bruik websit onz maakt ge-

bruik gebruikt instell onz web-

sit informatie maakt wij brow-

ser sit klik

22.75%


http://www.artinsteel.nl,

http://www.roflexinternational.nl

12 http://www.deballetboetiek.nl,

http://www.bleijmakelaardij.nl,

http://www.movietrader.nl,

http://www.gizo.nl,

http://www.bouwbedrijfmeyer.nl,

http://www.apeldoornsegolfkam-

pioenschappen.nl,

http://www.deleukstelu-

iertaarten.nl,

http://www.tomoveyourbody.nl,

http://www.frendz.nl,

http://www.bedrijvenuitvoor-

burg.nl

No clear

similari-

ties be-

tween

docu-

ments

-2 onz kunt nieuw info wij all jar

product mak contact neder-

land informatie mail mogelijk

via

22.97%

13 http://eazyict.nl,

http://www.djmissdeedy.com,

http://www.hoenuverder.nl,

http://www.studyrussian.nl,

http://www.metmeerplezi-

erproductief.nl, http://baba-

nicongo.org, http://www.het-vo-

gelparadijs.nl, http://www.de-

onlinetafelwinkel.com,

http://www.honigfabriek.org,

http://www.freshfashionguy.com

Websites reserved by TransIP

2

onz kunt nieuw info wij all jar

product mak contact neder-

land informatie mail mogelijk

via

22.41%

15 http://www.shemalesexdates.nl,

http://www.betaaldesexdates.nl

Two

URLs =

Same

Destina-

tion

0 zin toegang led gratis krijg

will mak mann functies ac-

count automatisch gebruiker

sprek aanmeld word

77.42%

16 http://www.javatimmerwerken.nl,

http://www.ovensvoordeindus-

trie.nl, http://www.franje.com,

http://www.correctsystems.nl,

http://www.desidesign.nl,

http://www.correct-systems.nl,

http://www.toolsupport.nl,

http://www.beachline.nl,

No clear

similari-

ties be-

tween

docu-

ments

-2 afbeeld www material tech-

nisch bureau advies allen stan

klar verlop rendement glas ex-

pert hog vaandel vaandel pro-

ductie

25.01%


http://www.loopbaanineigen-

hand.nl, http://www.eck-

hardtbouw.nl

17 http://www.defuik.nl,

http://www.landgoeddesalen-

tein.nl, http://www.watermolen-

singraven.nl, http://www.mya-

sia.nu, http://www.restaurant-

gustavino.com, http://www.de-

nieuwenhofvoorst.nl,

http://www.bijdeluts.nl,

http://www.restaurantxiexie.nl,

http://www.napoli-maastricht.nl,

http://www.residencerhenen.nl

Restau-

rants

2 restaurant gerecht diner reser-

ver geniet heerlijk gezell wij

onz eten lunch keuk kunt caf

terras

25.73%

19 http://www.diederikstevens.com,

http://www.michielmeijers.nl,

http://www.kraaima-media.nl,

http://www.jaike.nl,

http://www.lifejoy.nl,

http://www.denb-retail.nl,

http://www.janwil-

lemvandegroep.com,

http://www.indepaskamer.nl,

http://www.lisastolk.nl,

http://www.margotcpol.nl

No clear

similari-

ties be-

tween

docu-

ments

-2 mor spannend social media

druk social media tegelijker-

tijd sted event welk onlin

kwam ten zakelijk bezig

24.73%

20 http://www.sprenkeler-pr.nl,

http://www.nldcommunicatie.nl,

http://www.birdycommuni-

catie.nl, http://www.fhcommuni-

catie-advies.nl, http://www.green-

communicatie.nl, http://www.ku-

buscommunicatie.nl,

http://www.proracom.nl,

http://www.herdercommuni-

catie.nl, http://www.breed-

beeld.nl, http://www.com-

passcommunicatie.nl

Commu-

nication

2 communicatie intern adviseur

advies rad ontwikkel onder-

wijs nieuw huisstijl hoger

partner strategisch activiteit

strategie diver

43.41%

21 http://www.ikhebeenverstop-

ping.nl, http://www.dakpannen-

verkoop.nl,

http://www.thephoneshopper.nl,

No clear

similari-

ties be-

tween

-2 wij onz kunt product lever

klant grag contact kwaliteit

23.21%


http://www.carreaux.nl,

http://www.reproxchange.nl,

http://www.kaasadministraties.nl,

http://www.mgbrandhout.nl,

http://www.cultwheels.nl,

http://www.hetsterrenhuis.nl,

http://www.w-sec.nl

docu-

ments

servic mogelijk all bedrijf

goed mak

22 http://www.bruidsbeurslim-

burg.nl, http://www.limburgreno-

veert.nl, http://www.velde-

kekids.nl, http://www.wat-

tedoeninlimburg.nl,

http://www.meteolimburg.nl,

http://www.veldekeremunj.nl,

http://www.wpm.nl,

http://www.lim-

burgsvakantiehuis.nl,

http://www.hsdgroep.nl,

http://www.advlimburg.nl

Limburg 2 limburg rendement reinig

energie rest kop hog hoeft wij

onderhoud kwalitatief jaarlijk

kijk mei hoogwaard

25.70%

24 http://www.kwispelstaartje.nl,

http://www.doggyfun.nl,

http://www.canilos.org,

http://www.trimsalondiane.nl,

http://www.hus-walkabout.nl,

http://www.hondenschoon.nl,

http://www.kwispel-tijd.nl,

http://www.heppie-hond.nl,

http://www.taketheleash.nl,

http://www.uniquedog.nl

Dogs 2 hond dier wij onz kunt goed

welkom gedrag all wandel be-

handel natur vind informatie

les

25.38%

26 http://www.mzcwaalwijk.nl,

http://www.beppebaukje.nl,

http://www.schietbaan.com,

http://www.tcroomburg.nl,

http://www.waterland.nl,

http://www.dierenkliniek-

dewaard.nl, http://www.vishen-

kkok.nl, http://www.recreama.nl,

http://www.bijoumoderne.nl,

http://www.spesautobanden.nl

No clear

similari-

ties be-

tween

docu-

ments

-2 uur uur uur vrijdag zaterdag

maandag vrijdag uur dinsdag

wij donderdag openingstijd

zondag onz woensdag geslot

info

24.03%

27 http://www.livemediums.nl,

http://www.datcoaching.nl,

Coaching 2 coach hartelijk beroep eig suc-

cesvoll talent onderwijs

40.51%


http://www.edu1coach.nl,

http://www.smart-coach.nl,

http://www.p-coach.nl,

http://www.resultaatgericht-

coachen.nl

voorop var doe maatwerk ver-

ander inzicht waarin duurzam

28 http://www.hoogtechniek.eu,

http://www.kimmenkehorst.nl,

http://www.meubelherstel.nl,

http://www.duvah.nl,

http://www.kneib.com,

http://www.powerkilo.nl,

http://www.sketchuppro.eu,

http://www.eurobalans.nl,

http://www.wijkbouw.nl,

http://www.duinker.eu

No clear

similari-

ties be-

tween

docu-

ments

-2 right all right right reserved re-

served all copyright wij onz

websit kunt designed info con-

tact nieuw welkom

23.26%

29 http://www.twentyfour-shops.nl,

http://www.quicklunchshop.nl,

http://www.dekleineveer-

sepoort.nl,

http://www.wijnboerderijvlaar-

dingen.nl,

http://www.pouww.keurslager.nl,

http://www.grandcafededijk.nl,

http://www.ravanello.nl,

http://www.breshulpmiddelen.nl,

http://www.onsbakhuis.nl,

http://www.lekkerkoken.nu

food 1 onz heerlijk lekker ver wij

product geniet smak kunt wijn

winkel natur eten koffie gezell

24.41%

31 http://www.schaapjeblij.nl,

http://www.gastoudercindy.org,

http://www.bijmijles.nl,

http://www.isg-arcus.nl,

http://www.leskracht.nl,

http://www.sbodedijk.nl,

http://www.leukerik.nl,

http://www.kleurrijkvilt.nl,

http://www.hetzonnetje.nl,

http://www.hlinssen.nl

primary

and day-

care

2 kinder kind schol ouder onder-

wijs leerling ler wij groep ont-

wikkel onz jar spel begeleid

goed

23.87%

33 http://www.evertz.nl,

http://www.vakgaragewolters.nl,

http://www.firimass.nl,

http://www.thetroupe.nl,

No clear

similari-

ties be-

tween

-2 les verder verder les wij nieuw

onz jar goed all project neder-

land juni wer mak werk

22.56%


http://www.urotex.nl,

http://www.bedrijfsinterview.nl,

http://www.restauratiecentrum.nu,

http://www.merkwaaardig.nl,

http://www.vrijwilligerssteun-

puntommen.nl, http://www.hzzon-

wering.nl

docu-

ments

34 http://www.swiebertje.org,

http://www.filmuwbedrijf.nl,

http://www.omroepwest.nl,

http://www.zeelandvakantie-

woningen.eu, http://www.schoen-

tauf.de, http://www.haagrecht.nl,

http://www.printvandemaand.nl,

http://www.sarimanis.nl,

http://www.vdakker.nl,

http://www.elsen-uden.nl

No clear

similari-

ties be-

tween

docu-

ments

-2 den hag den hag wij onz rot-

terdam info jar amsterdam

nieuw kunt all nederland werk

utrecht

23.44%

35 http://www.tonertradinggroup.nl,

http://www.theemaatje.nl,

http://www.spijkenissekoi.nl,

http://www.mijngeschenk.lu-

ondo.nl, http://juniorvintag-

eanddesign.eu, http://www.ad-

mir.nl, http://www.mariekelode-

wijk.nl, http://www.shop.tro-

pacafe.nl, http://www.spray-

tancity.nl, http://www.seasons.nu

No clear

similari-

ties be-

tween

docu-

ments

-2 webwinkel adres www shop

support controler geschrev

wellicht beginn beschik ver-

wacht indien jouw product

nem contact

23.13%

36 http://www.amotex.nl,

http://www.vanhooft-transport.nl,

http://www.pkwaterbouw.nl,

http://www.slaats-dierenvoed-

ers.nl, http://www.bonotrans.nl,

http://www.embassyfreight.nl,

http://www.tiniemander-

stransport.nl, http://www.piano-

verhuuramsterdam.nl,

http://www.vanwaveren-

transport.nl, http://www.correu-

ten.nl

mainly

transpor-

tation

2 transport logistiek wij onz ver-

voer international klant eu-

ropa bedrijf nederland servic

jar gespecialiseerd all dienst

25.76%


37 http://www.gifts4thegreen.nl,

http://www.surpreza.com,

http://www.sell-to-you.nl,

http://www.huidverzorging-ko-

pen.nl, http://www.tunertape.com,

http://www.realperro.nl,

http://www.goedkoop-eroken.nl,

http://www.houtvanheidi.nl,

http://www.publicsolutions.bied-

meer.nl, http://www.shopat-

work.nl

No clear

similari-

ties be-

tween

docu-

ments

-2 webwinkel www adres con-

troler geschrev wellicht shop

support beginn verwacht in-

dien nem contact beschik nam

wel

23.04%

39 http://www.hetkompas-opende.nl,

http://www.derietvink-breda.nl,

http://www.delinderte.nl,

http://www.cbssamenopweg.nl,

http://www.cbsdewel.nl,

http://www.movendi.nl,

http://www.obsdebolder.nl,

http://www.demorgenster-

kampen.nl, http://www.obshar-

rybannink.nl, http://www.gerar-

duswinkel.nl

Elemen-

tary

schools

2 groep juli oktober september

addy kinder schol leerling on-

derwijs lunch vrij ging woens-

dag eig var

25.62%

40 http://www.jmadvies.nl,

http://www.boxbudgetbeheer.nl,

http://www.brehoff.nl,

http://www.chzorg.nl,

http://www.p-oatwork.nl,

http://www.muus.nl,

http://www.werknemerstevreden-

heid.eu, http://www.tele-

comhuys.nl, http://www.fgbfacili-

tygroup.nl, http://www.sobm.nl

No clear

similari-

ties be-

tween

docu-

ments

-2 organisatie wij project werk

ervar onz ebl management fi-

nanci organisaties advies fi-

nanci ebl ondernem kennis on-

dernemer

22.24%

41 http://www.hofstrasteel.nl,

http://www.gruythuysen.nl,

http://www.dijkinkelektra.nl,

http://www.dijkinkbeveiliging.nl,

http://www.thermos-inst.nl,

http://www.europlant.nl,

http://www.pauluskerkvluchtel-

ingenwerk.nl,

http://www.daldrup.eu,

No clear

similari-

ties be-

tween

docu-

ments

-2 tel fax sneller hoger fax lag

maatwerk tel lever informatie

info goed ging glas goed mo-

gelijk gezond

22.70%


http://www.sanitaskliniek.nl,

http://www.hummelelektra.nl

43 http://www.progay.nl,

http://www.walkinngoes.nl,

http://www.telecom-erfgoed.nl,

http://www.gezondheidscentru-

melst.nl, http://www.stichting-

srz.nl, http://www.triviumdiag-

nostiek.nl, http://www.wal-

fridus.nl, http://www.roefelen.nl,

http://www.tdw-advies.nl,

http://www.opleiding-particuli-

eronderzoeker.nl

Mostly

founda-

tions

1 stichting nederland onz activi-

teit jar wij gemeent doel web-

sit project mens nieuw veren

kinder informatie

22.74%

45 http://www.nbz.nl, http://www.e-

beat.biz, http://www.wen-

sinkdancemasters.nl,

http://www.bootbouwschool.nl,

http://www.persoonsbeveiliger.nl,

http://www.rudolfholleman.com,

http://www.spaansetaal.org,

http://www.vu-dekempen.nl,

http://www.akkonderwijs.nl,

http://www.stimulans-fysiothera-

pie.nl

Educa-

tion

2 cursus cursuss opleid work-

shop ler volg kunt wij onz in-

formatie les docent mak trai-

ning werk

23.34%

47 http://www.dekkerfietsen.com,

http://www.tielemanfietsen.nl,

http://www.louwerenburg.nl,

http://hotelbrinkzicht.com,

http://www.fietsershoptslim-

mer.nl, http://www.lem-

mentweewielers.nl,

http://www.telutci.com,

http://www.bito.nl,

http://www.defruitgaard.nl,

http://www.defietsenwinkel.nl

Bikes 2 fiet fiets elektrisch wij onz

kunt nieuw merk winkel ac-

cessoires servic reparatie as-

sortiment all onderdel

28.53%

48 http://www.ikbenarie.nl,

http://www.amigo.nl,

http://www.verstraatengroep.nl,

http://www.elkedagietsleuks.nl,

http://www.pointofsales.nl,

http://www.justflow.nl,

Web-

shops

1 javascript browser functiona-

liteit websit browser javas-

cript functionaliteit uitgescha-

keld javascript lijkt lijkt uitge-

22.76%


http://www.shop4networks.nl,

http://www.cadjobs.nl,

http://www.fpcollection.mobi.nl,

http://www.steenstripwinkel.nl

schakeld uitgeschakeld brow-

ser your geactiveerd lijkt web-

sit benut winkelwag

49 http://www.flowfoundation.nl,

http://www.newcase-audiovisu-

als.com, http://www.fbeyeproduc-

tions.nl, http://www.citystarsou-

venirs.com, http://www.tokata.nl,

http://www.winenetwork.nl,

http://www.zichtopjezelf.com,

http://www.buywine.nl,

http://www.davenschot.nl,

http://www.janheinarens.nl

No clear

similari-

ties be-

tween

docu-

ments

-2 the and for you this with are

your not that from can websit

wij hav

22.67%

50 http://www.vanderreest.nl,

http://www.reestmachines.nl

Two

URLs =

Same

Destina-

tion

0 verwijz reparer vernieuwd

machines bent exclusief wij

verhur bent zoek ruim jar ja-

nuari wij onz total jar ervar

juist adres

41.67%

51 http://www.beterisoleren.nu,

http://www.therdex.com,

http://www.houseofpandomo.nl,

http://www.deweerd-

wonenenslapen.nl,

http://www.kjfloorsolutions.com,

http://www.lambooparket.nl,

http://www.domcity.nl,

http://www.natuursteen-tegel-

werken.nl, http://www.brent-

jens.nl, http://www.par-

ketlijm.com

Floors

and

Flooring

2 vloer hout wij onz showrom

legg onderhoud kunt mogelijk

nieuw lever all jar kleur kwa-

liteit

27.04%

52 http://www.silverdaletraining.nl,

http://www.rijschoolgeduld.nl,

http://www.zangstudiodelft.nl,

http://www.simondewit.nl,

http://www.lgmusicschool.nl,

http://www.yukta.nl,

http://www.marinusterpstra.nl,

http://www.sterkeschool.nl,

Schools

and train-

ings

2 less leerling les jar ler docent

gegev wij onz workshop mu-

ziek volg goed mogelijk kun

24.68%


http://www.rijschooltalander.nl,

http://www.de-bolster.nl

54 http://www.vormgevenenzo.nl,

http://www.maren74.nl,

http://www.dehaancreative.nl,

http://www.agasi.nl,

http://www.eijsbroek.com,

http://www.schootvormgeving.nl,

http://www.drd-support.nl,

http://www.charlotluiting.nl,

http://www.miesign.nl,

http://www.thesculpfactory.com

Design 2 vormgev grafisch ontwerp

huisstijl rod websit stap be-

drijf complet professionel

snelheid maakt gebruik ge-

zicht rest kenmerk

22.44%

55 http://www.paardenmiddel.nl,

http://www.paardenmiddel.com

Two

URLs =

Same

Destina-

tion

0 sit bedoeld gezond dagelijk in-

formatie leuk krijg consult ge-

bruik stukj wet welkom sit

goed mogelijk dier beant-

woord

100.00%

56 http://www.vandaan-media.com,

http://www.possibilit.webs.com,

http://www.rodanthedecoratie.nl,

http://www.rekenenoprekenen.nl,

http://www.strandhuisje.com,

http://www.horecaoutlet.nl,

http://www.haarstudioalina.nl,

http://www.ctbaa.nl,

http://www.camerafilterstore.nl,

http://www.ggspecialsizes.nl

No clear

similari-

ties be-

tween

docu-

ments

-2 var document function script

http https com getelementbyid

for data twitter new window

wij www

23.04%

58 http://www.in-visible.nl,

http://www.groenveldboeken.nl,

http://www.grimbergenboeken.nl,

http://www.iwema.nl,

http://www.voordeelboeke-

nonline.nl, http://www.agentsaft-

erall.nl, http://www.kolstein.nl,

http://www.agentsafterall.com,

http://www.boekhandel-

vandervelde.nl,

http://www.venstra.nl

books 2 boek activiteit onz onz winkel

winkel aanmeld regelmat foto

hoogt blijv wilt presentatie

onz nieuwsbrief wij grag lang

25.62%


61 http://www.cellebroederskapel.nl,

http://www.wingbergermolen.nl,

http://www.bikeparkspaarn-

woude.com, http://www.der-

ooijmakelaars.nl, http://www.be-

zoekerscentrumleudal.nl,

http://www.miataonderdelen.nl,

http://www.bureaucicero.nl,

http://www.nootdorp-

slotenmaker-specialist.nl,

http://www.hetkleineparadijs.nl,

http://www.indonesiatravel.nl

No clear

similari-

ties be-

tween

docu-

ments

-2 les les verder verder wij onz

les mer mer nieuw jar all goed

kunt mak nederland werk

22.92%

62 http://www.littlebuds.nl,

http://www.jmgamesonline.nl,

http://www.herbanaatje.nl,

http://www.dewoonwinkel.eu,

http://www.mijnwebwin-

kel.nl/winkel/debontebazaar.nl,

http://www.rhmpstore.nl,

http://www.vlindertjevrolijk.nl,

http://www.no1els.nl,

http://www.meneerdebock.nl,

http://www.decreatievevlinder.nl

All

closed

webshops

of

mijnweb-

winkel.nl

2 helas webwinkel later mail-

adres prober zeker pro per

maand product domeinnam

functies betaalt inschrijv blog

inspiratie

24.69%

63 http://www.customicebp.com,

http://www.netster.nl

Two

URLs =

Same

Destina-

tion

0 tekst realiser ruim domeinnam

optimaliser afbeeld hosting

controler lat voldoet ruim er-

var kleding vertal email info

aanpass

33.33%

64 http://www.blasteq.nl,

http://www.blasteq.com,

http://www.blasteq.eu

Three

URLs =

same des-

tination

0 direct kunt biedt les reinig ver-

led all soort schon hiervan be-

hor sector belangrijkst voor-

beeld gebruik mak gecertifi-

ceerd

100.00%

66 http://www.zijnenschijn.nl,

http://www.praktijkasem.nl,

http://www.acupunctuur-scha-

gen.nl, http://www.praktijko-

penblik.nl,

http://www.maulany.nl,

http://www.podothera-

Yoga and

alterna-

tive heal-

ing

2 klacht licham behandel thera-

pie praktijk gezond beweg ba-

lan geest oorzak lev stres kunt

ontspann wer

24.05%


pievenray.nl, http://www.heel-

bewust.nl, http://www.hildatop-

per.nl, http://www.sensbewee-

gtje.nl, http://www.prak-

tijkvanrumpt.nl

67 http://www.massage4all.nl,

http://www.ariana-lamberts.nl,

http://www.sedoc.nl,

http://www.spier.nu,

http://www.carlawinkelman.nl,

http://liesbethmassagepraktijk.nl,

http://www.westriknatuurgenees-

wijzen.nl,

http://www.timiselamassage.nl,

http://www.bobodywork.nl,

http://www.mvmassage.nl

Massage 2 massag ontspann licham be-

handel klacht praktijk geest

rust stres jezelf kunt heerlijk

balan aandacht goed

27.50%

68 http://www.bodewes.nu,

http://www.jackbodewes.nl

Two URLs = Same Destination

0 sport geopend voll augustus

geslot training wer tijd gang

hanter draait maandag vrijdag

gezet hom verbouw

77.78%

69 http://www.sierhekwerkdejong.nl,

http://www.tinnemans-

scheepswerf.nl, http://www.rhino-

bv.nl, http://www.stal-de-elzen.nl,

http://www.vdkengineering.com,

http://www.meijerbv.nl,

http://www.stal-skulenboarch.nl,

http://www.rollen.nl,

http://www.lava3.nl,

http://www.vanhartskampmetaal-

werken.nl

No clear similarities between documents

-2 stal wij onz product metal le-

ver bedrijf all jar kunt info

mogelijk hout mat project

23.79%

71 http://www.timkorenhofftekst.nl,

http://www.rotorcommunicatie.nl,

http://www.willemnijeboer.com,

http://www.matthijsmeulblok.nl,

http://www.marionvanes.com,

http://www.barbarabreedijk.com,

http://www.lotjevanlieshout.com,

http://www.matchwinner-

Text editors, writers,

2 tekst verhal schrijv vertal

communicatie boek goed tal

boodschap mak jouw beeld

woord werk nederland

24.13%


shop.com, http://www.mar-

tijngort.nl, http://www.evi-

dentpr.nl

72 http://www.dewijnprins.nl,

http://www.wijnvoordeel.nl,

http://www.westerveldwijnen.nl,

http://www.wijnenmeat.nl,

http://www.taste-trade.eu,

http://www.wijnenvanegbert.nl,

http://www.wijngaard-zon-

nestraal.nl, http://www.vi-

novelzky.nl,

http://www.griekswijnhuis.nl,

http://www.ctwfinewines.com

wines 2 wijn amsterdam drink smak

rod onz del wit juist bestell

mooi kwaliteit zorg biologisch

wij

29.37%

76 http://www.heuvellandhotels.nl,

http://www.hotelvelsen.nl,

http://www.hotel-rido.nl,

http://www.renl.nl,

http://www.zaaninnhotel.nl,

http://www.hotelvanoranje.nl,

http://www.hulsmanvenray.nl,

http://www.invast.nl,

http://www.turkije.nl,

http://www.bellevuegroothoofd.nl

Hotels and vacations

2 hotel kamer restaurant geniet

onz heerlijk wij prachtig kunt

geleg vakantie lux centrum

ligt all

29.66%

77 http://www.projectinrichting-

lavoir.nl, http://www.difofoto-

grafie.nl, http://www.den-

kontwerp.nl, http://www.foot-

printchallenge.nl,

http://www.crea-art.nl,

http://www.frozenwebshop.nl,

http://www.gymsportcongres.nl,

http://www.broekenbuuren.nl,

http://www.sfeerenmeer-

events.nl, http://www.meur-

swerkt.nl

No clear similarities between documents

-2 cre cre ebr ebr wij onz mak

werk klant ontwerp nieuw sam

product war goed ontwikkel

22.75%

78 http://www.bobbledesign.nl,

http://www.bobinterieurbouw.nl

Two URLs = Same

0 project diver opgedan meubel

besteld gedur klassiek acces-

soires webwinkel binnenkort

28.57%


Destination

bureau kijkj voorbeeld eis

deur

79 http://www.officecontent.nl,

http://www.fysiotherapie-deriet-

landen.nl, http://www.mh-loop-

baanadvies.nl,

http://www.presentatieo-

pleiding.nl, http://www.knubb.nl,

http://www.kompastraining.nl,

http://www.bontalen.nl,

http://www.marcelfuchs.nl,

http://www.inneraction.nl,

http://www.profact.org

Coaching and training

2 training coaching trainer werk

person opleid begeleid train

onz wij mens ontwikkel sport

ler ervar

23.31%

80 http://www.adviseyou.nl,

http://www.newharttings.com,

http://www.dwain.nl,

http://www.koemeester.nl,

http://www.dutchcre8.com,

http://www.eco-communi-

catie.com, http://www.me-

diaflame.nl, http://www.vrijzin-

nigevangelisch.nl,

http://www.semweb.nl,

http://www.ictlimburg.nl

No clear similarities between documents

-2 websit websites wij onlin do-

meinnam onz ontwerp hosting

klant contact kunt huisstijl

mak informatie nieuw .

22.59%

81 http://www.ag-transporten.com,

http://www.ag-transporten.nl

Two URLs = Same Destination

0 les verder transport fax verder

les tel fax vervoer tel del sales

specialiteit allround duitsland

vestig uitgevoerd

100.00%

82 http://www.ijmonduitvaart.nl,

http://www.beeldendetherapie-

mcdejager.nl, http://www.pa-visu-

als.nl, http://www.studioacan-

thus.nl, http://www.grapheus.nl,

http://www.reedemannaerts.nl,

http://www.romi-kin-

deropvang.com,

http://www.paulsellers.nl,

http://www.sandalfon.eu,

http://www.liedschrijvers.nl

No clear similarities between documents

-2 document writ writ document

addy var span prefix path var

addy styl prefix addy var pre-

fix path var var path path pre-

fix .

23.03%


83** http://www.crmconnectors.com,

http://www.nl.cremer.com,

http://www.limbra-ict.nl,

http://www.rely.nl,

http://www.leasebits.eu,

http://www.vhi.nl, http://www.ba-

nanajama.net, http://www.peri-

cia.nl, http://www.hipposoft-

ware.com, http://www.grout-

mij.com

Data, software and ICT

2 softwar system ict oploss wij

onz klant ontwikkel product

computer dienst beher bedrijf

all mogelijk

22.30%

84 http://www.interiorinput.nl,

http://www.interiorinput.com

Two URLs = Same Destination

0 gebouw effici ebnt initiatief

effici onz opdrachtgever ebnt

ontwerp opdrachtgever interi-

eur concept onz realiser wijz

ieder ruimt

100.00%

85 http://www.detuinnatuurlijk.nl,

http://www.terstegentuinen.nl,

http://www.groengennep.nl,

http://www.westbeplanting.nl,

http://www.dethuismeester.nl,

http://www.gemmavermeulen.nl,

http://www.belshoftuin.nl,

http://www.fravin-sierbestrat-

ing.nl, http://www.moree-

groen.nl, http://www.raatjestui-

nontwerp.nl

Gardening

2 tuin wij onderhoud ontwerp

groen onz kunt plant wens

mak geniet goed grag particu-

lier wilt .

28.99%

87 http://www.lpg.nl,

http://www.tegelhandelallertz.nl,

http://www.biggelaar.eu,

http://www.weltevredegroep.nl,

http://www.janvanzanten.nl,

http://www.werkvanuithuis.nl,

http://www.berenkind.nl,

http://www.pizzaovenfeestje.nl,

http://www.drukwerkvergelijker.n

et, http://www.tools-and-more.nl

No clear similarities between documents

-2 per dag stuk per jar btw incl

per dag dag per per wek wij

uur onz uur per jar wek .

23.06%

88 http://www.100procentvoet.nl,

http://www.lisabella.eu,

http://www.beaulissa.nl,

http://www.wellshaped.nl,

http://zhi-shiatsu.nl,

Beauty and cosmetics

2 behandel huid salon voet af-

sprak product ontspann kunt

mak wij verzorg natur terecht

onz goed .

25.13%


http://www.salonlapromesse.nl,

http://www.abeauty.nl,

http://www.cestcabeauty.nl,

http://www.beautysalonantoi-

nette.nl, http://www.huidcen-

trumlimburg.nl

89 http://www.schnek.nl,

http://www.esthergoldstein.nl,

http://www.leanderfoto-

grafie.com, http://www.cre-

anita.nl, http://www.ellefoto-

grafie.nl, http://www.frbfoto-

grafie.nl, http://www.nl-foto-

grafie.nl, http://www.stonewood-

fotografie.nl, http://www.bartgul-

demond.nl, http://www.vlieland-

foto.nl

Photography

2 fotografie websit momentel

blijf druk welkom websit kvk

foto btw com volg breng houd

wer snel

43.20%

90 http://www.seeyou.nl,

http://www.identipack.com,

http://www.pasopswingtuit.nl,

http://www.see-listen.nl,

http://www.goatmilkpowder.nl,

http://www.mobielewasstraat.nl,

http://www.whistler-it.nl,

http://www.promovlag.nl,

http://www.biozuiger.nl,

http://www.raphaelconsult.nl

No clear similarities between documents

-2 www http http www com info

websit onz domeinnam infor-

matie tel this nieuw https link

wij .

23.05%

92 http://www.berkersadvies.nl,

http://www.wildeharen.nl,

http://www.ebicus.com,

http://www.friendly-fire.nl,

http://www.magnetronmusic.com,

http://www.onna.nl,

http://www.persens.nl,

http://www.krijnsenteksten.nl,

http://www.e-flexpersoneel.nl,

http://www.vbku.nl

No clear similarities between documents

-2 twitter geled maand for via

com web ongever consult

coach inzicht beweg geeft gra-

tis out .

21.57%


95 http://www.vanders.nl,

http://www.kikkerkinderyoga.nl

Two URLs = Same Destination

0 train bereikt verstand rustig

beweg licham gevoel zorgt lo-

catie vast manier les mann

jong oud bedoeld

34.78%

97 http://www.de-glashut.nl,

http://www.a1glas.nl,

http://www.glasinloodstudio.nl,

http://www.pfann.nl,

http://www.glasinlood-amers-

foort.nl, http://www.vanderwal-

glasinlood.nl,

http://www.loligo.nl,

http://www.glasinloodateliercooij

mans.nl, http://www.glas-linq.nl,

http://www.gbb.nl

Glas 2 glas gevoel passend techniek

ontwerp behalv onz showrom

bron boodschap karakter

daarin vlak kijkt beperk onz

33.38%

*Clusters with one or fewer documents are omitted from this table for readability.

When comparing the clustering to the SBI classification, the following becomes clear. (1) Some clusters that seemed cohesive when comparing URLs and checking websites manually were not as cohesive as suspected. For example, based on the available SBI data, only 18% of cluster 3 consists of “Lawyers and curators” (69101, level 5), and 35% of its documents, including the “Lawyers and curators” documents, fall under the general level 2 category “Legal services, accounting, tax consultancy, administration” (69). Looking at the total distribution of this cluster, 200 documents (6.5%) appear in the SBI as “Other paramedical practitioners (no physiotherapy and psychology) and alternative healers” (86919), and another 181 documents (5.9%) fall under “Practices of psychotherapists and psychologists” (86913). Both categories fall under the SBI level 2 category “Human health activities” (86) and the SBI level 3 category “Paramedical practitioners and other human health activities without accommodation” (869). Even more interesting in this regard is that the neighbouring cluster 4 does have a substantial number of documents in this level 3 category (869), most prominently in “Other paramedical practitioners (no physiotherapy and psychology) and alternative healers” (86919) (see Appendix VIII).

Cluster | Most dominant subcategory | SBI 2008 | Percentage of documents with SBI code represented by most dominant category | Number of documents with SBI code
3 (level 2) | Legal services, accounting, tax consultancy, administration | 69 | 35% | 3058 document(s)
3 (level 5) | Lawyers and curators | 69101 | 18% | 3058 document(s)
4 (level 2) | Human health activities | 86 | 80.80% | 5279 document(s)
4 (level 3) | Paramedical practitioners and other human health activities without accommodation | 869 | 51.70% | 5279 document(s)
4 (level 5) | Other paramedical practitioners (no physiotherapy and psychology) and alternative healers | 86919 | 22% | 5286 document(s)
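The level comparison above relies on the hierarchical structure of SBI codes: the first two digits of a code give its level 2 category, the first three its level 3 category, and so on. A minimal sketch of how the dominant category at a given level could be derived for one cluster is shown below. The function name and the toy list of codes are hypothetical and serve only as illustration; this is not the code used in the thesis.

```python
from collections import Counter


def dominant_category(sbi_codes, digits):
    """Return (prefix, share in %) of the most common SBI prefix of the given length.

    Codes are kept as strings so that leading zeros are not lost.
    Codes shorter than `digits` cannot belong to that level and are skipped,
    but they still count towards the denominator.
    """
    counts = Counter(code[:digits] for code in sbi_codes if len(code) >= digits)
    prefix, count = counts.most_common(1)[0]
    return prefix, 100.0 * count / len(sbi_codes)


# Toy cluster: hypothetical SBI codes of five documents.
codes = ["86919", "86919", "86913", "86221", "69101"]

level_2 = dominant_category(codes, 2)  # level 2: two digits
level_3 = dominant_category(codes, 3)  # level 3: three digits
level_5 = dominant_category(codes, 5)  # level 5: five digits
print(level_2, level_3, level_5)
```

The share of the dominant category can only stay equal or shrink when moving to a deeper level, which is the pattern visible for clusters 3 and 4 in the table.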

(2) Some clusters did not show any similarity between their documents. For example, cluster 40 did not hold an apparent common theme amongst its documents; compared to the SBI, 20% of its documents belong to the fifth-level SBI category “Organisational planning”. In cluster 83, the largest share of documents (44%) falls into the level 2 category “Support activities in the field of information technology” (62). Looking deeper (level 4), almost 20% of all documents in the cluster, a little less than half of that share, fall into the category “Writing, producing and publishing of software” (6201). A further 16.68% of all documents in the cluster fall into “Computer consultancy activities” (6202) and 5.6% into “Other information technology and computer service activities” (6209). Another 5.6%, however, fall into “Financial holdings” (6420), 4.8% into “Wholesale of computers, peripheral equipment and software” (4651), and another 4.8% into “Engineers and other technical design and consultancy” (7112).11

11 Results are stored in a table which is too large to include here or in the appendix, but which can be provided on request.


Appendix V: Percentage of Smaller Clusters on different parameters and sizes for k

Thresholds | k | Percentage of clusters <= 1 document | Percentage of clusters <= 10 documents | Percentage of clusters <= 100 documents
min_df: 0.1 | | 1% (1 cluster) | 4% (4 clusters) | 4% (4 clusters)
min_df: 0.01 | 50 | 34% (17 clusters) | 48% (24 clusters) | 58% (29 clusters)
min_df: 0.01 | 100 | 34% (34 clusters) | 43% (43 clusters) | 43% (43 clusters)
min_df: 0.01, max_df: 0.7 | 50 | 34% (17 clusters) | 44% (22 clusters) | 50% (25 clusters)
min_df: 0.01, max_df: 0.7 | 100 | 28% (28 clusters) | 40% (40 clusters) | 50% (50 clusters)
min_df: 0.01, max_df: 0.7 | 300 | 20% (62 clusters) | 24% (72 clusters) | 29.3% (88 clusters)
min_df: 0.008, max_df: 0.7 | 50 | 48% (24 clusters) | 58% (29 clusters) | 64% (32 clusters)
min_df: 0.008, max_df: 0.7 | 100 | 32% (32 clusters) | 44% (44 clusters) | 51% (51 clusters)
min_df: 0.007, max_df: 0.7 | 50 | 40% (20 clusters) | 58% (29 clusters) | 66% (33 clusters)
min_df: 0.007, max_df: 0.7 | 100 | 32% (32 clusters) | 44% (44 clusters) | 50% (50 clusters)
min_df: 0.005, max_df: 0.7 | 100 | 29% (29 clusters) | 44% (clusters) | 52% (52 clusters)
min_df: 0.003, max_df: 0.7 | 100 | 39% (39 clusters) | 45% (45 clusters) | 53% (53 clusters)
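The percentages in this table can be obtained by counting, for each threshold, how many of the k clusters hold at most that many documents. The sketch below illustrates this under the assumption that the clustering result is available as one integer label per document and that empty clusters count as small; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def small_cluster_shares(labels, k, thresholds=(1, 10, 100)):
    """For each threshold, return (number, percentage) of clusters with <= threshold documents."""
    sizes = Counter(labels)
    shares = {}
    for threshold in thresholds:
        # Clusters without any document have size 0 and are counted as small.
        n_small = sum(1 for cluster in range(k) if sizes.get(cluster, 0) <= threshold)
        shares[threshold] = (n_small, 100.0 * n_small / k)
    return shares


# Toy result for k = 4: cluster sizes are 150, 8, 1 and 0 documents.
labels = [0] * 150 + [1] * 8 + [2]
shares = small_cluster_shares(labels, k=4)
print(shares)
```

Because the thresholds are cumulative, the percentage in each column is at least as large as in the column to its left.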


Appendix VI: Relative Overrepresentation

K=100
Cluster | Level 4/5 most dominant category | SBI | Percentage represented by most dominant category | Innovative documents | Total documents | Relative overrepresentation
83 | Writing, producing and publishing of software | 6201 | 20% | 36 (15.79%) | 12654 (2.85%) | 12.94%
49 | Organisational planning | 70221 | 7% | 10 (4.39%) | 6031 (1.36%) | 3.03%
12 | Other interest organizations n.e.c.* | 94997 | 4% | 60 (26.32%) | 104918 (23.61%) | 2.71%
10 | Other interest organizations n.e.c.* | 94997 | 10% | 5 (2.19%) | 3394 (0.76%) | 1.43%

K=500
Cluster | Level 4/5 most dominant category | SBI | Percentage represented by most dominant category | Innovative documents | Total documents | Relative overrepresentation
191 | Financial holdings | 6420 | 6.00% | 16 (7.58%) | 9754 (2.19%) | 5.39%
113 | Writing, producing and publishing of software | 6201 | 11.10% | 12 (5.69%) | 2741 (0.62%) | 5.07%
279 | Engineers and other technical design and consultancy | 7112 | 14.00% | 12 (5.69%) | 2877 (0.65%) | 5.04%
53 | Organisational planning | 70221 | 5.00% | 22 (10.43%) | 32079 (7.22%) | 3.21%
42 | Organisational planning | 70221 | 7.60% | 6 (2.84%) | 3385 (0.76%) | 2.08%

K=1500
Cluster | Level 4/5 most dominant category | SBI | Percentage represented by most dominant category | Innovative documents | Total documents | Relative overrepresentation
103 | Financial holdings | 6420 | 6.30% | 38 (15.45%) | 44594 (10.3%) | 5.42%
309 | Other interest organizations n.e.c.* | 94997 | 5.60% | 49 (19.92%) | 77618 (17.46%) | 2.46%
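The values in the last column appear to be a simple difference: the share of all innovative documents that falls in the cluster minus the share of all documents that falls in the cluster. A minimal sketch of that calculation follows, using three rows from the tables above as input; the function name is hypothetical and the formula is inferred from the reported numbers rather than taken from the thesis code.

```python
def relative_overrepresentation(innovative_share, total_share):
    """Difference, in percentage points, between a cluster's share of the
    innovative documents and its share of all documents."""
    return innovative_share - total_share


# (cluster, share of innovative documents in %, share of all documents in %)
rows = [
    (191, 7.58, 2.19),
    (113, 5.69, 0.62),
    (279, 5.69, 0.65),
]

results = {
    cluster: round(relative_overrepresentation(innovative, total), 2)
    for cluster, innovative, total in rows
}
print(results)
```

A positive value means that the cluster contains more innovative documents than its size alone would suggest.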


Appendix VII: Dominant level 4 / 5 SBI Category within Clusters for k = 100

Cluster | Level 4/5 most dominant category | SBI | Percentage represented by most dominant category | Number of documents
89 | Photography | 74201 | 83% | 12 document(s)
 | sulting | 96022 | 62% | 2940 document(s)
54 | Advertising agencies | 7311 | 58% | 12 document(s)
 | Sale and repair of passenger cars and light motor vehicles (no import of new cars) | 45112 | 57% | 3786 document(s)
24 | Other service activities n.e.c.* | 9609 | 57% | 931 document(s)
17 | Restaurants | 56101 | 56% | 2357 document(s)
85 | Landscape service activities | 8130 | 54% | 1709 document(s)
66 | Other paramedical practitioners (no physiotherapy and psychology) and alternative healers | 86919 | 51% | 3411 document(s)
 | Other paramedical practitioners (no physiotherapy and psychology) and alternative healers | 86919 | 47% | 975 document(s)
20 | Organisational planning | 70221 | 41% | 17 document(s)
47 | Shops selling bicycles and mopeds | 47641 | 41% | 822 document(s)
97 | Shaping and processing of flat glass | 2312 | 38% | 24 document(s)
76 | Hotels with restaurants | 55101 | 37% | 846 document(s)
72 | Wholesale of beverages (no dairy products) | 4634 | 32% | 403 document(s)
36 | Freight transport by road (no removal services) | 4941 | 31% | 1901 document(s)
71 | Writing and other artistic creation | 9003 | 28% | 2715 document(s)
51 | Floor and wall covering | 4333 | 27% | 928 document(s)
79 | Business education and training | 85592 | 25% | 3492 document(s)
52 | Driving schools | 8553 | 23% | 1999 document(s)
 | Other paramedical practitioners (no physiotherapy and psychology) and alternative healers | 86919 | 22% | 5286 document(s)
31 | Day nurseries for pupils | 88911 | 22% | 4898 document(s)
80 | Writing, producing and publishing of software | 6201 | 22% | 8867 document(s)
1 | Construction of residential and non-residential buildings | 4120 | 21% | 8197 document(s)
62 | Retail sale via internet of clothes and clothing accessories | 47914 | 21% | 187 document(s)
 | Other interest organizations n.e.c.* | 94997 | 20% | 3848 document(s)
40 | Organisational planning | 70221 | 20% | 28396 document(s)
 | Writing, producing and publishing of software | 6201 | 20% | 7878 document(s)
5 | Fast-food restaurants, cafeterias, ice cream parlours, take-out eating places etc. | 56102 | 19% | 53 document(s)
3 | Lawyers and curators | 69101 | 18% | 3058 document(s)
45 | Business education and training | 85592 | 16% | 1674 document(s)
 | Retail sale via internet of clothes and clothing accessories | 47914 | 13% | 148 document(s)
 | Other paramedical practitioners (no physiotherapy and psychology) and alternative healers | 86919 | 12% | 13940 document(s)
 | Writing, producing and publishing of software | 6201 | 12% | 170 document(s)
 | Other interest organizations n.e.c.* | 94997 | 12% | 1867 document(s)
 | Other interest organizations n.e.c.* | 94997 | 10% | 2039 document(s)
69 | Machining | 2562 | 8% | 1139 document(s)
 | Retail sale via internet of clothes and clothing accessories | 47914 | 6% | 576 document(s)
 | Other interest organizations n.e.c.* | 94997 | 6% | 34651 document(s)
29 | Fast-food restaurants, cafeterias, ice cream parlours, take-out eating places etc. | 56102 | 6% | 7412 document(s)
 | Other interest organizations n.e.c.* | 94997 | 6% | 7955 document(s)
48 | Dispensing chemists | 4773 | 5% | 2464 document(s)
87 | Financial holdings | 6420 | 5% | 3395 document(s)
9 | Financial holdings | 6420 | 4% | 2140 document(s)
 | Other interest organizations n.e.c.* | 94997 | 4% | 56982 document(s)
 | Other interest organizations n.e.c.* | 94997 | 4% | 4854 document(s)




Appendix VIII: Most Dominant level 2 SBI Category in Clusters for k = 100

Cluster | Level 2 most dominant category | SBI | Percentage represented by most dominant category | Number of documents
89 | Industrial design, photography, translation and other consultancy | 74 | 92% | 12 document(s)
4 | Human health activities | 86 | 81% | 5286 document(s)
 | Human health activities | 86 | 80% | 3411 document(s)
17 | Food and beverage service activities | 56 | 75% | 2357 document(s)
0 | Sale and repair of motor vehicles, motorcycles and trailers | 45 | 73% | 3786 document(s)
62 | Retail trade (not in motor vehicles) | | |
88 | Wellness and other services; funeral activities | 96 | 72% | 2940 document(s)
52 | Education | 85 | 65% | 1999 document(s)
54 | Advertising and market research | | |
 | motor vehicles) | | |
24 | Wellness and other services; funeral activities | | |
85 | Facility management | 81 | 54% | 1709 document(s)
20 | Holding companies (not financial) | | |
 | motor vehicles) | | |
 | ties | | |
76 | Accommodation | 55 | 50% | 846 document(s)
97 | Manufacture of other non-metallic mineral products | | |
 | motor vehicles) | | |
 | motor vehicles) | | |
 | Retail trade (not in motor vehicles) | 47 | 45% | 2464 document(s)
83 | Support activities in the field of information technology | 62 | 44% | 7878 document(s)
41 | Specialised construction activities | | |
5 | Food and beverage service activities | | |
72 | Wholesale trade (no motor vehicles and motorcycles) | | |
3 | Legal services, accounting, tax consultancy, administration | 69 | 35% | 3058 document(s)
36 | Land transport | 49 | 32% | 1901 document(s)
51 | Specialised construction activities | | |
71 | Arts | 90 | 32% | 2715 document(s)
31 | Social work activities without accommodation | 88 | 31% | 4898 document(s)
 | Support activities in the field of information technology | 62 | 28% | 8867 document(s)
 | Retail trade (not in motor vehicles) | 47 | 25% | 7412 document(s)
 | Holding companies (not financial) | 70 | 25% | 28396 document(s)
1 | Construction of buildings and development of building projects | 41 | 24% | 8197 document(s)
43 | World view and political organizations, interest and ideological organizations, hobby clubs | 94 | 24% | 1867 document(s)
69 | Manufacture of fabricated metal products, except machinery and equipment | 25 | 24% | 1139 document(s)
 | Human health activities | 86 | 22% | 13940 document(s)
 | World view and political organizations, interest and ideological organizations, hobby clubs | 94 | 22% | 3848 document(s)
 | Retail trade (not in motor vehicles) | 47 | 19% | 4854 document(s)
 | Support activities in the field of information technology | 62 | 15% | 1385 document(s)
 | (not financial) | | |
 | World view and political organizations, interest and ideological organizations, hobby clubs | 94 | 13% | 2039 document(s)
 | Holding companies (not financial) | 70 | 13% | 3057 document(s)
 | Retail trade (not in motor vehicles) | 47 | 12% | 56982 document(s)
21 | Wholesale trade (no motor vehicles and motorcycles) | 46 | 12% | 34651 document(s)
49 | Arts | 90 | 12% | 3374 document(s)
 | Retail trade (not in motor vehicles) | 47 | 11% | 3395 document(s)
 | Retail trade (not in motor vehicles) | 47 | 10% | 2064 document(s)
 | (not financial) | | |
 | motor vehicles) | | |
 | cal organizations, hobby clubs | | |
 | ties | | |
 | ties | | |
 | motor vehicles) | | |



TOP 10 Most Dominant SBI Level 2 Categories for k=500 and k=1500

K=500
Cluster | Dominant SBI level 2 category | SBI | Percentage of documents in cluster in dominant category | Total number of documents in cluster
39 | Human health activities | 86 | 100.00% | 68 document(s)
274 | Travel agencies, tour operators, tourist information and reservation services | 79 | 93.30% | 45 document(s)
245 | Legal services, accounting, tax consultancy, administration | 69 | 93.20% | 44 document(s)
317 | Legal services, accounting, tax consultancy, administration | 69 | 92.60% | 27 document(s)
345 | Sale and repair of motor vehicles, motorcycles and trailers | 45 | 90.40% | 73 document(s)
190 | Retail trade (not in motor vehicles) | 47 | 89.70% | 29 document(s)
99 | Sale and repair of motor vehicles, motorcycles and trailers | 45 | 87.50% | 104 document(s)

K=1500
Cluster | Dominant SBI level 2 category | SBI code | Percentage represented by most dominant category | Total number of documents in cluster
1328 | Food and beverage service activities | 56 | 100.00% | 19 document(s)
1217 | Wellness and other services; funeral activities | 96 | 98.50% | 68 document(s)
447 | Education | 85 | 97.10% | 206 document(s)
858 | World view and political organizations, interest and ideological organizations, hobby clubs | 94 | 96.90% | 5713 document(s)




Appendix IX: Percentage Innovative URLs per SBI level 4/5 Category

Count | Level 4/5 category name | SBI | Percentage
0 | Research and development on technology | 72192 | 2.31%
1 | Treatment and coating of metals | 2561 | 0.46%
2 | Machining | 2562 | 0.46%
3 | Agents involved in the sale of agricultural raw materials, live animals, textile raw materials | 4611 | 0.46%
4 | Agents involved in the sale of fuels, ores, metals and chemicals | 4612 | 0.46%
5 | Agents involved in the sale of machinery, industrial equipment, ships and aircraft | 4614 | 0.46%
6 | Manufacture of non-domestic cooling and ventilation equipment | 2825 | 0.46%
7 | Specialised hospitals (not for mental health) | 86103 | 0.46%
8 | Manufacture of man-made fibres | 2060 | 0.46%
9 | Manufacture of agricultural and forestry machinery | 2830 | 0.46%
10 | Development of building projects | 4110 | 0.46%
11 | Social clubs | 94991 | 0.93%
12 | Manufacturing of bodies (coachwork) for motor vehicles | 29201 | 0.46%
13 | Manufacturing of trailers and semi-trailers | 29202 | 0.46%
14 | Financial holdings | 6420 | 9.26%
15 | Other interest organizations n.e.c.* | 94997 | 0.46%
16 | Growing of arboricultural crops in open fields | 1305 | 1.39%
17 | Non-specialised wholesale of food | 4639 | 0.46%
18 | Renting of trucks, busses and motor homes | 7712 | 0.93%
19 | Roofing | 4391 | 0.46%
20 | Production of electricity by solar cells, heat pumps and hydropower | 35113 | 3.70%
21 | Wholesale of computers, peripheral equipment and software | 4651 | 0.46%
22 | Wholesale of electronic and communication equipment and related parts | 4652 | 1.39%
23 | Investment funds in financial assets | 64301 | 0.46%
24 | Investment funds in real estate | 64302 | 0.46%
25 | Wholesale of agricultural machinery, equipment and tractors | 4661 | 0.46%
26 | Sale and repair of passenger cars and light motor vehicles (no import of new cars) | 45112 | 0.46%
27 | Writing, producing and publishing of software | 6201 | 5.09%
28 | Computer consultancy activities | 6202 | 1.39%
29 | Manufacture of basic pharmaceutical products | 2110 | 0.46%
30 | Weighing and measuring | 52292 | 0.46%
31 | Manufacture of communication equipment | 2630 | 0.93%
32 | Private security | 8010 | 0.46%
33 | Wholesale of video and music recordings | 46435 | 0.46%
34 | Organisational planning | 70221 | 1.85%
35 | Removal services | 4942 | 0.46%
36 | Wholesale of internal transport equipment | 46691 | 0.93%
37 | Security systems service activities | 8020 | 0.46%
38 | Wholesale of flowers and plants | 4622 | 0.93%
39 | Renting and leasing of other machinery and equipment and of other goods (no vending and slot | 77399 | 0.46%
40 | Business education and training | 85592 | 0.93%
41 | Manufacture of instruments for measuring, testing, navigation and controlling | 2651 | 0.46%
42 | Repair and maintenance of machinery for general use and machine parts (no tools) | 33121 | 0.93%
43 | Repair and maintenance of machinery for specific industries | 33123 | 0.46%
44 | Wholesale of heating and cooling equipment | 46692 | 0.93%
45 | Wholesale of combustion engines, pumps and compressors | 46693 | 0.46%
46 | Wholesale of fittings and equipment for industrial use | 46694 | 0.46%
47 | Wholesale of measuring and control equipment | 46695 | 0.46%
48 | Wholesale of packaging | 46696 | 0.46%
49 | Wholesale of detergents and cleaners | 46442 | 0.46%
50 | Wholesale of other machines, equipment and supplies for manufacturing and trade n.e.c.* | 46699 | 0.93%
51 | Renting of non-residential real estate | 68204 | 0.46%
52 | Educational support activities | 8560 | 0.46%
53 | Retail sale via internet of food and medical goods | 47911 | 0.46%
54 | Other information service activities n.e.c.* | 6399 | 0.46%
55 | Product design | 74102 | 0.46%
56 | Interior and spatial design | 74103 | 0.46%
57 | Wholesale of medical and dental instruments, nursing and orthopaedic articles and laboratory | 46462 | 1.39%
58 | Other retail sale | 47999 | 0.93%
59 | Practices of physiotherapists | 86912 | 0.46%
60 | Wholesale of ferrous metals and ferrous semi-finished products | 46722 | 0.46%
61 | Wholesale of articles for lighting | 46473 | 0.46%
62 | Support activities for the own enterprise group | 70101 | 0.46%
63 | Trust offices | 66191 | 0.93%
64 | Umbrella organisations in the field of health care and other support activities for health care | 86929 | 0.46%
65 | Specialised wholesale of other construction materials | 46738 | 1.85%
66 | Non-specialised wholesale of construction materials | 46739 | 0.46%
67 | Wholesale of hardware | 46741 | 0.46%
68 | Mixed farming | 150 | 0.46%
69 | Wholesale of sports goods (not for water sports) | 46496 | 0.46%
70 | Inland passenger water transport and ferry-services | 5030 | 0.46%
71 | Community centres, other consultancy and cooperative bodies in the field of welfare | 88999 | 0.93%
72 | Web portals | 6312 | 2.31%
73 | Manufacture of plastic packing goods | 2222 | 0.46%
74 | Manufacture of builders’ ware of plastic | 2223 | 0.46%
75 | Management of real estate | 6832 | 0.46%
76 | Manufacture of electric lighting equipment | 2740 | 0.46%
77 | Manufacture of other plastic products | 2229 | 0.46%
78 | Trade of electricity and gas through pipes | 3514 | 2.31%
79 | Burial and cremation services | 96031 | 0.46%
80 | Other cleaning | 8129 | 0.93%
81 | Forging, pressing, stamping and roll-forming of metal; powder metallurgy | 2550 | 0.46%
82 | Architects (no interior architects) | 71111 | 0.46%
83 | Engineers and other technical design and consultancy | 7112 | 10.19%
84 | Specialist medical practices and outpatients' clinics (no dentistry or psychiatry) | 86221 | 0.46%
85 | Freight transport by road (no removal services) | 4941 | 0.93%
86 | Stockbrokers, investment consultants etc. | 6612 | 0.46%
87 | Management and business consultancy (no public relations and organisational planning) | 70222 | 0.46%
88 | Holding companies (not financial) | 70102 | 1.85%
89 | Earth moving | 4312 | 0.46%
90 | Manufacture of metal structures and parts of structures | 2511 | 1.39%
91 | Manufacture of metal tanks and reservoirs | 2529 | 0.93%
92 | Collection of non-hazardous waste | 3811 | 0.46%
93 | Impregnation of wood | 16102 | 0.46%
94 | Other manufacturing n.e.c.* | 32999 | 1.39%
95 | Retail sale via internet of articles for house and garden | 47915 | 0.46%
96 | Data processing, hosting and related activities | 6311 | 0.46%
97 | Treatment of non-hazardous waste | 3821 | 0.46%
98 | Manufacture of paints, varnishes and similar coatings, printing ink and mastics | 2030 | 0.46%
99 | Manufacture of other special-purpose machinery and equipment n.e.c.* | 2899 | 0.46%
100 | Actuarial and pension consultancy; management of pension funds | 66292 | 0.46%
101 | Processing of meat (no prepared dishes) | 1013 | 0.46%
102 | Manufacture of medical instruments and supplies (no dental laboratories) | 32502 | 1.39%
103 | Recovery of sorted materials | 3832 | 0.93%
104 | Installation of electronic and optical equipment | 3323 | 0.46%
105 | Processing of fish | 1020 | 0.93%
106 | Wholesale and commission trade of motor vehicle parts and accessories (no tyres) | 45311 | 0.46%


Appendix X: Stability of Clusters

K=100: Percentage of Clusters Falling Within Similarity Percentage

Percentage of Similarity Within Cluster

Sets | 100%>90% | 90%>80% | 80%>70% | 70%>60% | 60%>50% | 50%>40% | 40%>30% | 30%>20% | 20%>10% | 10%>0% | Average similarity within all clusters*
1,2 | 10% | 3% | 0% | 6% | 2% | 3% | 4% | 2% | 4% | 66% | 27.06%
1,3 | 15% | 4% | 5% | 2% | 2% | 0% | 4% | 3% | 3% | 62% | 21.36%
1,4 | 16% | 3% | 3% | 2% | 2% | 0% | 2% | 3% | 4% | 65% | 24.81%
1,5 | 14% | 3% | 1% | 4% | 2% | 5% | 4% | 1% | 1% | 65% | 25.06%
2,3 | 15% | 3% | 4% | 5% | 1% | 1% | 2% | 0% | 2% | 67% | 25.72%
2,4 | 14% | 3% | 9% | 2% | 2% | 1% | 1% | 0% | 4% | 65% | 25.97%
2,5 | 16% | 6% | 5% | 4% | 1% | 1% | 2% | 1% | 0% | 64% | 29.20%
3,4 | 15% | 3% | 5% | 2% | 3% | 2% | 0% | 3% | 0% | 67% | 25.78%
3,5 | 18% | 2% | 2% | 3% | 1% | 2% | 1% | 2% | 3% | 66% | 25.41%
4,5 | 14% | 5% | 4% | 2% | 3% | 1% | 4% | 2% | 2% | 63% | 26.70%
Mean | 15% | 4% | 4% | 3% | 2% | 2% | 2% | 2% | 2% | 65% | 25.71%
St.dev. | 0.02 | 0.01 | 0.02 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.02 | 0.02


K=500: Percentage of Clusters Falling Within Similarity Percentage

Sets | 100%>90% | 90%>80% | 80%>70% | 70%>60% | 60%>50% | 50%>40% | 40%>30% | 30%>20% | 20%>10% | 10%>0% | Average similarity within all clusters*
1,2 | 17% | 6% | 5% | 3% | 2% | 2% | 2% | 2% | 4% | 57% | 31.91%
1,3 | 15% | 6% | 5% | 3% | 2% | 1% | 2% | 2% | 3% | 60% | 30.29%
1,4 | 16% | 8% | 4% | 1% | 2% | 1% | 2% | 1% | 5% | 59% | 30.67%
1,5 | 17% | 7% | 4% | 3% | 2% | 1% | 1% | 2% | 4% | 59% | 30.89%
2,3 | 18% | 7% | 4% | 4% | 3% | 2% | 2% | 2% | 5% | 53% | 33.99%
2,4 | 19% | 6% | 5% | 4% | 3% | 2% | 1% | 2% | 4% | 54% | 34.58%
2,5 | 18% | 8% | 4% | 3% | 3% | 2% | 0% | 2% | 4% | 56% | 33.67%
3,4 | 18% | 6% | 4% | 3% | 1% | 3% | 2% | 2% | 3% | 58% | 31.60%
3,5 | 19% | 5% | 4% | 3% | 2% | 1% | 2% | 2% | 3% | 58% | 31.58%
4,5 | 18% | 7% | 4% | 2% | 3% | 2% | 1% | 3% | 4% | 58% | 31.54%
Mean | 18% | 7% | 4% | 3% | 2% | 2% | 2% | 2% | 4% | 57% | 32.07%
St.dev. | 0.01 | 0.01 | 0.00 | 0.01 | 0.01 | 0.01 | 0.01 | 0.00 | 0.01 | 0.02 | 0.01


K=1500: Percentage of Clusters Falling Within Similarity Percentage

Sets | 100%>90% | 90%>80% | 80%>70% | 70%>60% | 60%>50% | 50%>40% | 40%>30% | 30%>20% | 20%>10% | 10%>0% | Average similarity within all clusters*
1,2 | 11% | 6% | 4% | 2% | 1% | 1% | 1% | 2% | 3% | 69% | 23.09%
1,3 | 10% | 6% | 3% | 2% | 2% | 1% | 1% | 2% | 3% | 69% | 21.84%
1,4 | 11% | 6% | 3% | 3% | 2% | 1% | 1% | 2% | 3% | 67% | 23.77%
1,5 | 11% | 5% | 3% | 2% | 2% | 1% | 1% | 2% | 4% | 67% | 23.00%
2,3 | 10% | 5% | 3% | 2% | 2% | 2% | 1% | 1% | 2% | 70% | 22.15%
2,4 | 12% | 6% | 3% | 3% | 2% | 1% | 1% | 2% | 3% | 68% | 23.41%
2,5 | 11% | 6% | 3% | 3% | 2% | 1% | 1% | 1% | 3% | 69% | 23.00%
3,4 | 11% | 6% | 4% | 2% | 1% | 1% | 1% | 1% | 3% | 70% | 22.00%
3,5 | 10% | 5% | 3% | 2% | 2% | 1% | 1% | 1% | 3% | 71% | 21.00%
4,5 | 12% | 6% | 3% | 2% | 2% | 2% | 1% | 1% | 4% | 67% | 24.00%
Mean | 11% | 6% | 3% | 2% | 2% | 1% | 1% | 2% | 3% | 69% | 22.73%
St.dev. | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.01 | 0.01
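Each row of the stability tables compares two clustering runs over the same documents. The similarity measure used in the thesis is defined in Section 3.6 and is not restated here; the sketch below only illustrates one plausible way to obtain such a table. It assumes that every cluster of the first run is matched to the cluster of the second run with the highest Jaccard overlap, and that the resulting percentages are then binned in steps of ten. All function names and the toy labels are hypothetical.

```python
from collections import defaultdict


def cluster_members(labels):
    """Map each cluster label to the set of document indices assigned to it."""
    members = defaultdict(set)
    for doc_id, label in enumerate(labels):
        members[label].add(doc_id)
    return members


def best_match_similarity(labels_a, labels_b):
    """For every cluster in run A, the highest Jaccard overlap (in %) with any cluster in run B."""
    clusters_a = cluster_members(labels_a)
    clusters_b = cluster_members(labels_b)
    similarity = {}
    for label, docs_a in clusters_a.items():
        best = max(
            len(docs_a & docs_b) / len(docs_a | docs_b)
            for docs_b in clusters_b.values()
        )
        similarity[label] = 100.0 * best
    return similarity


def similarity_histogram(similarity):
    """Count clusters per 10-point bin, keyed by the lower bound of the bin (100% goes into the 90 bin)."""
    bins = defaultdict(int)
    for value in similarity.values():
        lower = min(int(value // 10), 9) * 10
        bins[lower] += 1
    return dict(bins)


# Two toy runs over six documents; label values need not correspond between runs.
run_1 = [0, 0, 0, 1, 1, 2]
run_2 = [5, 5, 5, 7, 8, 8]

similarity = best_match_similarity(run_1, run_2)
histogram = similarity_histogram(similarity)
average = sum(similarity.values()) / len(similarity)
print(similarity, histogram, average)
```

Dividing each bin count by the number of clusters gives percentages of the kind shown in the columns above, and the mean of the per-cluster values corresponds to an average similarity column.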


Appendix XI: Code Excerpts

Jupyter Notebook – sampling and checking Common Crawl index (Section 3.2.1)

Jupyter Notebook Web Scraping Code Excerpts (Section 3.2.2)

Gets 100 html code into _pickle files


Gets texts from pickled file


Language Detection (Section 3.3.1)

Finding URLs classified as e.g. Afrikaans

Stemming and Removing Stop-words (Section 3.3.2)


TF-IDF (Section 3.5)
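As a rough illustration of what the min_df and max_df thresholds from Appendix V do, the sketch below prunes a vocabulary by document frequency and then computes a basic TF-IDF weighting in plain Python. It assumes the usual meaning of these parameters (a term is kept only if the fraction of documents containing it lies between min_df and max_df) and uses an unsmoothed idf, so the weights will differ from those of a library implementation. The function names and toy documents are hypothetical and this is not the thesis's own code.

```python
import math
from collections import Counter


def prune_vocabulary(docs, min_df=0.01, max_df=0.7):
    """Keep terms whose document frequency, as a fraction of all documents, lies in [min_df, max_df]."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term for term, count in doc_freq.items() if min_df <= count / n_docs <= max_df}


def tf_idf(docs, vocabulary):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens) & vocabulary)
    vectors = []
    for tokens in docs:
        term_freq = Counter(t for t in tokens if t in vocabulary)
        vectors.append({t: term_freq[t] * math.log(n_docs / doc_freq[t]) for t in term_freq})
    return vectors


# Toy corpus of already stemmed, stop-word-filtered documents.
docs = [
    ["wij", "hotel", "kamer"],
    ["wij", "tuin", "plant"],
    ["wij", "hotel", "restaurant"],
    ["wij", "tuin", "onderhoud"],
]

vocabulary = prune_vocabulary(docs, min_df=0.5, max_df=0.7)
vectors = tf_idf(docs, vocabulary)
print(vocabulary, vectors)
```

In this toy corpus the term "wij" occurs in every document and is removed by max_df, while terms that occur in only one document are removed by min_df, which is the effect the thresholds in Appendix V are tuned for.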

k-means mini-batch (Section 4)


Stability of Surfacing Clusters (Section 3.6)

Creating the Clusters / SBI comparison table, including retrieving the SBI category name for each code (Section 4).



Appendix XII: Software, Libraries and Hardware

In this research, the Python programming language was used for extracting the texts, clustering the documents and analysing the resulting clusters. For most of the research Jupyter Notebook was used. For some processes, however, especially when running mini-batch k-means, Spyder was used because it proved more efficient in terms of system memory. Both Windows 10 and Linux Ubuntu environments were used in the research.

Besides the standard Python libraries, cdx-index-client was used to retrieve URLs from the Common Crawl archive in order to address the second research question.12 GRequests (a combination of Requests and Gevent) was used to retrieve HTML pages from the web.13 In addition, the NLTK package was used for pre-processing.

For the great majority of the work, two machines were used for creating models, performing experiments and documenting results. Their specifications are listed in the following table.

Located                 Portable                     Statistics Netherlands (The Hague)

System specifications   HP Elitebook 2560p-02        Big Data PC
                        Intel® Core™ i7-2620M        Intel® Xeon® CPU E5-2670 0
                        CPU 2.70 GHz (4 cores)       @ 2.60 GHz (16 cores)
                        6.00 GB RAM                  Gallium 0.4 NVD 9 64-bit
                        Windows 10 Pro

Memory                  62.8 GB                      64 GB

Operating System        Windows 10 Professional      Linux Ubuntu 16

12 cdx-index-client is a tool designed especially for Common Crawl. It retrieves URLs in bulk from the Common Crawl database through the Common Crawl index API, and does so in parallel.

13 GRequests is a combination of the Requests module and the Gevent module and is used to make asynchronous HTTP requests. This is a more efficient way to issue HTTP requests, since it enables users to make multiple requests simultaneously.
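The idea in footnote 13, issuing several HTTP requests at once instead of one after another, can be sketched with the Python standard library alone. This is an illustration and not the code used in this research: it swaps the Gevent-based library named in the footnote for a thread pool, and the names fetch_html, fetch_all and max_workers are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from http.client import HTTPException
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download one page and return its body as text, or None on failure."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError, HTTPException):
        # Unreachable hosts, time-outs, HTTP errors and malformed URLs
        # are all treated as "no page retrieved".
        return None


def fetch_all(urls, fetch=fetch_html, max_workers=8):
    """Fetch many URLs concurrently and return a {url: result} dict.

    `fetch` can be replaced (assumption: any callable taking a URL),
    which makes the function usable without network access.
    """
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs the calls in parallel threads but yields the
        # results in the same order as the input URLs.
        return dict(zip(urls, pool.map(fetch, urls)))
```

Calling fetch_all with a list of URLs returns a dictionary in which failed downloads are stored as None, so they can be filtered out before text extraction.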


Appendix XIII: Explanation SBI Levels

The SBI excerpt below has different levels of categories. Throughout this thesis different levels are used.

1. The first level is, e.g., C Manufacturing; in this thesis this is called a level 1 category (one letter).

2. The second level is, e.g., 10 Manufacture of food products; in this thesis this is called a level 2 category (two digits).

3. The third level is, e.g., 105 Manufacture of dairy products; in this thesis this is called a level 3 category (three digits).

In the level 3 category 869 Paramedical practitioners and other human health activities without accommodation, there are both categories with 4 digits and categories with 5 digits. Categories with 5 digits are unique to the Netherlands. Both 4-digit and 5-digit categories exist on the same level, namely one level beneath a level 3 category.

This level is therefore called a level 4/5 category (four or five digits).
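The naming convention above can be expressed as a small lookup on the shape of the code. The following is a minimal sketch, assuming codes are written without dots as in this appendix. The function name sbi_level is hypothetical, and the 4-digit and 5-digit strings in the usage note are illustrative shapes rather than verified SBI entries.

```python
import re


def sbi_level(code):
    """Return the level label used in this thesis for an SBI code.

    One letter          -> "level 1"
    Two digits          -> "level 2"
    Three digits        -> "level 3"
    Four or five digits -> "level 4/5"
    """
    code = code.strip()
    if re.fullmatch(r"[A-Z]", code):
        return "level 1"
    if re.fullmatch(r"[0-9]{2}", code):
        return "level 2"
    if re.fullmatch(r"[0-9]{3}", code):
        return "level 3"
    if re.fullmatch(r"[0-9]{4,5}", code):
        return "level 4/5"
    raise ValueError("not a recognised SBI code shape: %r" % code)
```

For the examples in this appendix, sbi_level("C"), sbi_level("10") and sbi_level("105") give level 1, level 2 and level 3 respectively. Any 4-digit or 5-digit string, such as "8691" or "86913", is reported as level 4/5.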