Clustering for Innovation:
Clustering Economic Activities Based on Textual Webpage Content
Master’s Thesis
Communication and Information Sciences
School of Humanities
Specialization Data Science: Business & Governance
Tilburg University
Author: Sjaak van der Zwan
ANR: 915601
SNR: 2002477
June 8th 2018
Supervisor / First reader: dr. E.E. van der Vaart
Second reader prof. E. Postma
Preface
This thesis is the result of my research performed at the Center for Big Data Statistics at Statistics Netherlands.
Most of the research for this thesis was performed between April 2017 and October 2017.
During my Public Administration bachelor at Leiden University I became interested in the quantitative research
methods taught during the program and started to look around for a more quantitative master's program to
complement my bachelor's degree. Tilburg University offered a Data Science specialization within their
Communication and Information Sciences program. The Data Science program proved to be challenging yet rewarding,
adding much to my understanding of the possible use and application of data science in both the public and private
sectors. The extra-curricular text mining course offered within the program finally led me to the subject of
this thesis.
I would like to thank everyone who has supported me in completing this research. In the first place
Piet Daas and Ali Hurriyetoglu, for helping me design my research questions, pointing me to relevant sources
and offering possible alternative solutions when I ran aground. I would also like to thank dr. van der Vaart for her
constructive feedback on organizing and clarifying this thesis. Many thanks to Marga for proofreading, checking
and correcting grammar and spelling. Thanks to my father for financially supporting my decision to continue
my education and, most importantly, a heartfelt thank you to Laura, Sophia-Amy and Amélie for love and
inspiration.
Sjaak van der Zwan
Leiden, June 2018
Abstract
For economic and development purposes, municipalities and other local governmental organisations in the
Netherlands have shown an interest in knowing which innovative companies can be found in the area that they
govern. Statistics Netherlands would like to provide this kind of information but is unable to do so using the
current Dutch Standard Industrial Classification (SBI), since the SBI does not have separate categories for innovative
companies, nor is it able to separate innovative businesses from non-innovative businesses. Another problem with
the SBI is that new innovative companies seem to end up in a remainder category. This thesis investigates whether
one or more clusters containing innovative companies form when an unsupervised clustering algorithm is applied
to the texts found on the main webpages of a selected number of companies. For this purpose, Statistics
Netherlands provided a list of 956,540 web URLs that have been identified as belonging
to Dutch companies. Since EU Regulation 1893/2006 obliges Statistics Netherlands to follow the structure of
NACE, Statistics Netherlands cannot formally change its standard industrial classification
independently of other EU Member States. An alternative categorization system could, however, exist alongside
the current SBI in order to meet the demands of municipalities and other local governmental organisations. For
the data collection, the Common Crawl web archive was explored, but it was found to have insufficient coverage
(less than 30%) of the provided list of URLs. Web scraping was therefore used instead to extract the main texts
from the webpages. The k-means mini batch algorithm was chosen as the clustering algorithm because of its low
run-time complexity and its compatibility with large data sets. The elbow method was used to find the theoretically
optimal number of clusters (k). Besides k=100, found with the elbow method, k=1500 was chosen because it equals
the total number of subcategories in the SBI, and k=500 as an arbitrarily chosen number between k=100 and k=1500.
After applying the k-means mini batch algorithm, no clear innovative clusters surfaced. The research also showed
that most clusters that did form are not sufficiently stable to be used for official statistical purposes, as discussed
by Daas, Puts, Buelens, & van den Hurk (2015). This is because the clustering is strongly affected by the initial
centroid placement, as explained by MacKay (2003: 288). The elbow method led neither to the most stable nor
to the most cohesive clusters and thus does not seem to be an effective method for larger data sets. With regard to
the texts belonging to innovative companies (innovative documents), these documents have a relatively
stronger tendency to become part of larger clusters. For the innovative documents used in this
research, it cannot be said that the majority fall into a remainder category. While the k=500 clustering proved to be
the most stable and the most similar to the SBI, the k=1500 clustering proved to be the most cohesive. The most cohesive
clusters for k=500 and k=1500 fall within the level 2 SBI category (86) Human health activities. For k=100 this
is (74) Industrial design, photography, translation and other consultancy, followed by (86) Human health
activities. Statistics Netherlands is recommended to further research the possibilities of supervised
machine learning for categorizing innovative companies.
Key words: Clustering, Websites, Webpages, Common Crawl, k-means mini batch, SBI, Statistics Netherlands.
Table of Contents
Section 1: Introduction
1.1. Background and Relevance
1.2. Research Question and Sub-Questions
1.3. Thesis Outline
Section 2: Theoretical Framework
2.1. Statistics Netherlands and Big Data
2.2. Standard Industrial Classification (SBI)
2.3. Innovation
2.4. Text Mining, Text Clustering
Section 3: Experimental Setup
3.1. Data Description
3.2. Data Collection
3.2.1. Indirect Collection [Common Crawl]
3.2.2. Direct Collection [Web scraping]
3.3. Pre-processing
3.3.1. Language Detection
3.3.2. Tokenization, Stop Word Removal and Stemming
3.3.3. Document Selection
3.3.4. Feature Selection
3.4. The Algorithm
3.5. Vector Space Model Setup
3.6. Elbow Method
3.7. Analysing Clusters
3.7.1. Innovative Clusters
3.7.2. Similarity to SBI and Cohesiveness
3.7.3. Analysing Stability of Surfacing Clusters
3.8. Practical Outline
Section 4: Results
4.1. Finding Optimum k
4.2. Cluster Analysis
4.2.1. Innovative Cluster Analysis
4.2.2. Analysing Cohesiveness and Similarity to SBI
4.5. Cluster Stability
Section 5: Discussion and Conclusions
5.1. Discussing Results
5.2. Answering Research Questions
5.3. Discussing Shortcomings
5.4. Recommendations to Statistics Netherlands
5.5. Directions for further research
Cited Works
Appendix I: Common Crawl Results
Appendix II: Scraping without Grequest – Language Detection
Appendix III: Innovative Documents per Cluster k = 100
Appendix IV: Sample Websites Cluster Quality based on URL names and website spot-checks
Appendix V: Percentage of Smaller Clusters on different parameters and sizes for k
Appendix VI: Relative Overrepresentation
Appendix VII: Dominant level 4 / 5 SBI Category within Clusters for k = 100
Appendix VIII: Most Dominant level 2 SBI Category in Clusters for k = 100
Appendix IX: Percentage Innovative URLs per SBI level 4/5 Category
Appendix X: Stability of Clusters
Appendix XI: Code Excerpts
Appendix XII: Software, Libraries and Hardware
Appendix XIII: Explanation SBI Levels
Section 1: Introduction
This section gives the background and relevance of this research, states the research question and
sub-questions, and outlines the structure of this thesis.
1.1. Background and Relevance
Innovation plays a vital role in western society and its economic progress (Schumpeter, 1975: 82-85).
Governments should therefore play an active role in stimulating innovation in order for society to
reap the desired welfare effects (Scott, 2011: 62). According to a study by the OECD (2016),
one major way in which states promote innovation is by granting preferential tax treatment to support
research and development investments. In 2016, 29 of 35 OECD countries and 22 of 28 non-OECD
countries promoted innovation in this manner. It therefore makes sense that governments
with better knowledge of the innovative companies in their country are better equipped to support
them in order to strengthen their economy. Categorizing innovative companies, locating them and
procuring relevant, up-to-date information about them is thus vital for any well-functioning 21st century society.
It is therefore in the public interest that innovative business activities are categorized in such a way that
governments can recognize innovative companies and give them the appropriate support.
The Dutch government and the Dutch municipalities usually rely on the statistics provided by the
“Centraal Bureau voor de Statistiek” (hereafter referred to as Statistics Netherlands) for purposes such as
adjusting current policies or designing new policy. Statistics Netherlands is known for its high-quality
statistics. To provide the necessary information concerning the Dutch economy and its companies,
Statistics Netherlands uses the Dutch Standard Industrial Classification (SBI). In doing so, Statistics
Netherlands hierarchically categorizes companies and economic activities into 22 categories and 99
subcategories (Centraal Bureau voor de Statistiek, 2017a). Dutch municipalities have shown an increased
interest in being informed about the number and variation of innovative companies in their region in order
to be able to support their development. The current SBI, however, does not make it possible to
provide this information effectively, because innovative companies are not bound to specific
categories or subcategories.
Mr Piet Daas, senior methodologist and data scientist at Statistics Netherlands, heads a major research
project to improve business-related statistics. Within this project two problems have been identified
with regard to identifying innovative companies. The first is described in the previous paragraph;
the second is that innovative businesses from new business branches tend to end up in the
remainder category, since they cannot be categorized with conventional companies and do not have a category
of their own in the SBI. Since the current SBI does not provide a way for Statistics Netherlands
to present the requested information concerning innovative businesses, it would add value for
Statistics Netherlands to discover what kind of categorization surfaces when an unsupervised
(clustering) algorithm is applied to the textual content of company websites.1 Applying a clustering algorithm to
the main text of company websites would ideally result in an identifiable cluster of innovative
companies alongside conventional categories, which in turn could be used to identify innovative companies
in, for example, the different Dutch municipalities. The following lists are used in the research of this thesis:
A list of 956,540 web URLs belonging to Dutch companies.
Eight lists, each containing the URLs of the top 100 innovative companies for one of the years 2009
up to and including 2016, as compiled by the Dutch Chamber of Commerce (KvK).
1.2. Research Question and Sub-Questions
Based on this suggestion, the following research question (RQ) is the central focus of this thesis:
RQ: “In what measure will clusters of innovative companies surface when unsupervised machine
learning is used on the textual content of the webpages of Dutch companies?”
Besides the aforementioned societal relevance of this research, there is an academic relevance, which
lies in the question whether abstract characteristics such as ‘being innovative’ are reflected in the text on
the main page of businesses in such a way that it leads to separate clusters consisting merely of
documents of companies with such characteristics. This research furthermore adds to the existing body of
knowledge concerning the clustering of webpages. It also adds to the scientific knowledge about
clustering a large body of documents, and thus to the scientific knowledge about processing Big
Data.
Texts of webpages can be extracted through a method called web scraping or can be requested from
so-called web archives. The web archive Common Crawl was chosen by Statistics Netherlands to be
explored further, to determine whether it could be used to answer the research question. Knowing whether or not
Common Crawl can be used to answer the main research question is important, since extracting data
from a web archive is less time consuming and thus more efficient than web scraping.
Exploring Common Crawl might also help to determine whether it can be seen as a
valuable source for future research regarding Dutch websites. It is essential, of course, to establish whether the
Common Crawl archive sufficiently covers the URL list used in this research, because only then would it provide
enough data for this research. Therefore, the first sub-question (SQ1) reads:
1 The statements of Daas concerning categorizing innovative companies within the SBI and the interest in clus-
tering company websites have been derived from a meeting prior to this research.
SQ1: “Does the Common Crawl Archive have sufficient coverage of the Dutch company websites
specified by Statistics Netherlands for this research?”
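As an illustration of what coverage means in SQ1, the following minimal sketch computes the share of a target URL list whose host names also occur in a set of archived URLs. It is not the procedure used in this research: the function names, the host normalization rule and the example URLs are assumptions made for illustration only, and the sketch does not query Common Crawl itself.

```python
from urllib.parse import urlsplit


def normalize_host(url: str) -> str:
    """Reduce a URL to a lowercase host name without a leading 'www.' (assumed rule)."""
    # urlsplit only fills in netloc when a scheme (or '//') is present.
    if "//" not in url:
        url = "//" + url
    host = urlsplit(url).netloc.lower()
    # Drop an optional port and a leading 'www.' so that variants compare equal.
    host = host.split(":")[0]
    return host[4:] if host.startswith("www.") else host


def coverage(target_urls, archived_urls) -> float:
    """Fraction of distinct target hosts that also occur among the archived hosts."""
    targets = {normalize_host(u) for u in target_urls}
    archived = {normalize_host(u) for u in archived_urls}
    if not targets:
        return 0.0
    return len(targets & archived) / len(targets)


# Hypothetical example data, not taken from the thesis.
company_urls = ["http://www.example.nl", "https://bakkerij.example.org/", "shop.example.com"]
archive_urls = ["https://example.nl/contact", "http://other.example.net"]
share = coverage(company_urls, archive_urls)
print(f"coverage: {share:.1%}")
```

Matching on host names rather than on full URLs is a design choice of this sketch; a stricter or looser matching rule would yield a different coverage figure.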
Finally, a second sub-question (SQ2) that is relevant to this research with regard to the SBI used by
Statistics Netherlands is:
SQ2: “Do the clusters that surface show similarities to the SBI and in what measure are these clusters
cohesive and stable?”
This second sub-question (SQ2) is important since the answer will show whether, under the chosen
approach, clusters surface that are similar to the SBI and whether the clustering could easily be
implemented within the structure already used by Statistics Netherlands.
1.3. Thesis Outline
This thesis is structured as follows. Section 2 describes the relevant literature concerning the
research, the k-means algorithm and the vector space model. Section 3 outlines the experimental setup with
a description of the data, the methods for data collection, and the methods for pre-processing
and processing the data. Section 4 presents the results of the clustering and the cluster stability. In section
5 the results are discussed, the research questions are answered, recommendations to Statistics
Netherlands are given, the weaknesses of this research are discussed and directions for further
research are given.
Section 2: Theoretical Framework
This section describes how Statistics Netherlands uses Big Data for official statistics and the pitfalls of
doing so, the Standard Industrial Classification (SBI), and the definition of innovation as used by
Statistics Netherlands and the Dutch Government. It also gives an overview of the
relevant literature concerning text mining and clustering.
2.1. Statistics Netherlands and Big Data
With the launch of the Center for Big Data Statistics (CBDS) on September 27th 2016, Statistics
Netherlands created a platform where national and international governments, businesses, academia and
educational organisations can collaborate in the field of Big Data technology and methods for the
creation of official statistics (Centraal Bureau voor de Statistiek, 2016). On June 23rd 2017 Statistics
Netherlands presented the Urban Data Centre / The Hague, a partnership between the municipality
of The Hague and Statistics Netherlands that aims to deepen, broaden and improve their knowledge of
The Hague’s local data in order to improve policy making and decision making (Centraal
Bureau voor de Statistiek, 2017a). According to Struijs, Braaksma, & Daas (2014) the increasing
volume, velocity and variety of data (Big Data) present both opportunities and challenges for National
Statistical Institutes (NSIs) such as Statistics Netherlands. While a major opportunity lies in the increase
in sources from which data can be collected, e.g. smart phones, Twitter posts, other social media and
click traces, the major challenge is guaranteeing the quality of statistics derived from Big Data. These
new data sources are not purposely designed for data analysis, so they often lack a well-defined target
population as well as the structure and quality that is found in traditional datasets used for official
statistics. With data so widely available, other, commercial parties have entered the information market.
These parties challenge the traditional role of the NSIs and force these institutes to
prove their thus far unique capabilities and added value in providing high-quality statistics (Struijs,
Braaksma, & Daas, 2014). According to Daas, Puts, Buelens, & van den Hurk (2015) it is no easy task
to extract relevant and reliable data from Big Data sources in order to produce official high-quality
statistics. The three main issues when working with Big Data are missing data, volatility and selectivity
of data. Daas, Puts, Buelens, & van den Hurk (2015) are positive about the usefulness of Big Data for
official statistics in the future, but stress the importance of the knowledge required from the fields of
data mining and high-performance computing, and of skills from the newly emerging discipline of data science.
2.2. Standard Industrial Classification (SBI)
Statistics Netherlands uses the Dutch Standard Industrial Classification (SBI 2008), a hierarchical
classification of economic activities, to classify industrial units according to their core activities (Centraal
Bureau voor de Statistiek, 2017b). The SBI hierarchy has five levels and is based on the NACE
(Statistical Classification of Economic Activities in the European Community) and on the United Nations
ISIC (International Standard Industrial Classification of All Economic Activities). The first four digits of an
SBI code correspond to the first four digits of the NACE, and the first two digits correspond to the ISIC.
Categories with a fifth digit are specifically Dutch categories (Centraal Bureau voor de Statistiek, 2017c).2
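The digit structure described above can be made concrete with a short sketch. It only splits a code into the prefixes mentioned in the text (two digits aligned with ISIC, four digits aligned with NACE, an optional fifth Dutch digit). The function name, the dictionary keys and the example code string are hypothetical and chosen for illustration; the sketch does not validate codes against the actual SBI.

```python
def sbi_prefixes(code: str) -> dict:
    """Split an SBI-style code into the prefixes described in the text.

    Assumption: the code consists of two to five digits, optionally
    separated by dots, e.g. '86.10.1'.
    """
    digits = "".join(ch for ch in code if ch.isdigit())
    if not 2 <= len(digits) <= 5:
        raise ValueError(f"expected 2 to 5 digits, got {len(digits)}")
    # The first two digits correspond to ISIC.
    result = {"isic_aligned": digits[:2]}
    # The first four digits correspond to NACE.
    if len(digits) >= 4:
        result["nace_aligned"] = digits[:4]
    # A fifth digit marks a specifically Dutch category.
    if len(digits) == 5:
        result["dutch_specific"] = digits
    return result


# Hypothetical code string within division 86 (Human health activities).
example = sbi_prefixes("86.10.1")
print(example)
```

Grouping companies by the two-digit prefix corresponds to comparing them at a coarse level of the hierarchy, while the longer prefixes give finer categories.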
Under European Union (EU) law, Statistics Netherlands is obliged by EU Regulation 1893/2006
to follow the structure of NACE. A ‘regulation’ of the EU is a binding legislative act that has direct
effect in all EU Member States, as opposed to EU ‘directives’, which need to be transposed and
implemented in national legislation. It is therefore not possible for Statistics Netherlands to alter the setup
of the SBI on its own. The regulation establishing the statistical classification of economic
activities (NACE) gives the following motivation in its preamble for establishing a standard for
statistical classification in Member States:
2 See appendix XIII for an explanation of different SBI levels.
“(4) In order to function, the internal market requires statistical standards applicable to the collection,
transmission and publication of national and Community statistics so that businesses, financial institu-
tions, governments and all other operators in the internal market can have access to reliable and com-
parable statistical data. To this end, it is vital that the various categories for classifying activities in the
Community be interpreted uniformly in all the Member States. (5) Reliable and comparable statistics
are necessary to enable businesses to assess their competitiveness and are useful to the Community
institutions in preventing distortions of competition.”
While the above-mentioned regulation prevents Statistics Netherlands from formally changing its
standard industrial classification independently of the other Member States that are bound by the EU
regulation, the research in this thesis remains useful for Statistics Netherlands. It can increase the
understanding of the natural clusters that surface when clustering algorithms are applied to the main page text of
company websites, and it can help to create an alternative categorization which Statistics Netherlands
can use next to the SBI for specific requests from municipalities and other
organisations. This knowledge can also be used to improve the NACE in the future.
2.3. Innovation
The Cambridge dictionary describes ‘to innovate’ as “to introduce changes and new ideas”. Similarly,
but more specifically, the Oxford dictionary describes ‘to innovate’ as to: “Make changes in something
established, especially by introducing new methods, ideas, or products.”
Statistics Netherlands differentiates between two kinds of innovation: technical innovation and
non-technical innovation. Technical innovation is when companies renew and improve their products and
processes, while non-technical innovation is when improvements are made within an organization or in
the manner in which a product or service is marketed (Statistics Netherlands 2016: 174-176). According
to Statistics Netherlands (2016: 177) technical innovation can be seen as the classical definition of
innovation, while adding non-technical innovation leads to a broader definition. Statistics Netherlands
measures the innovativeness of companies according to the rules of the European Community Innovation
Survey (CIS), which is part of the EU science and technology statistics.
In its official publications the Dutch government follows the definition of the Oslo Manual, which de-
fines innovation as:
“[...] the implementation/ commercialization of a product with improved performance characteristics
such as to deliver objectively new or improved services to the consumer. A technological process inno-
vation is the implementation/adoption of new or significantly improved production or delivery methods.
It may involve changes in equipment, human resources, working methods or a combination of these.”
(Centraal Planbureau, 2016)
Closely related to, and often used in combination with, innovation is the phrase ‘research and
development’. The following definition from the Frascati Manual is used to describe research and development
(R&D) by the Dutch Government and Statistics Netherlands:
“Research and experimental development (R&D) comprise creative and systematic work undertaken
in order to increase the stock of knowledge – including knowledge of human kind, culture and society
– and to devise new applications of available knowledge.”
Bessant and Tidd (2011: 6) write that most economists agree that innovation is one of the
main drivers of economic growth, and continue by quoting William Baumol in saying that “virtually all
of the economic growth that has occurred since the eighteenth century is ultimately attributed to
innovation”.
In various reports, studies and indexes the Netherlands is amongst the leading countries when it comes to
innovation. The World Economic Forum, in its Global Competitiveness Report of 2016-2017, ranked the
Netherlands 4th on the Global Competitiveness Index (World Economic Forum, 2017). In order to stay
competitive, strengthen the labour market and solve other problems in society, the Dutch government
seeks to stimulate various forms of innovation (OECD, 2014). In 2016 the European Commission
marked the Netherlands, along with Denmark, Finland, Germany, Sweden and the United Kingdom,
as an ‘Innovation Leader’. The group ‘Innovation Leaders’ is the top group of the four performance groups
(Modest Innovators, Moderate Innovators, Strong Innovators, Innovation Leaders) into which the Member
States of the EU are classified. While Switzerland outperforms all of these countries with regard to
innovation, it is not included as an Innovation Leader since it is not a Member State of the EU. Innovation Leaders
are all Member States whose relative innovation performance is more than 20% above
the average performance of EU Member States (Hollanders & Es-Sadki, 2017: 79).
Between 2010 and 2016 the Netherlands was also one of the countries with the highest increase in
innovative performance, with an increase of almost 10%, while Germany, Denmark and Finland
decreased in innovative performance by almost 5% over the same period (Hollanders & Es-Sadki,
2017: 16).
As mentioned in section 1.1, eight lists are used containing the top 100 most innovative companies in the
Netherlands, compiled by the Dutch Chamber of Commerce (KvK) for the years 2009 up to and including 2016.
These lists have been created by innovation experts, who judged and ranked each nominated company
with respect to its branch, society as a whole, originality and realized potential for growth. These lists
form the baseline for innovation in this research. While other companies
could also be identified as innovative, for consistency the innovative companies are limited to the companies
in these lists.
2.4. Text Mining, Text Clustering
Text mining is part of the field of data mining, in which data scientists and other users try to discover
interesting and useful patterns in large quantities of text documents. Text mining is also known as
Intelligent Text Analysis, Knowledge Discovery in Texts (KDT) and Text Data Mining (Sheshasayee
& Thailambal, 2016). With the rise of the internet, including the explosion of social media applications,
text mining has become an increasingly interesting subject of research for both businesses and
academia (Miner, Delen, Elder, Hill, & Nisbet, 2012). Witten, Frank, & Hall (2011: xxi) define data mining
as: “(…) the extraction of implicit, previously unknown, and potentially useful information from data.”
They go on to say that, unlike numeric data, textual data has no hidden implicit information,
since authors explicitly state the information they want to bring across in the text. Textual data is known in data
science as unstructured data and cannot be consumed and used by computers as easily as
numeric data. Therefore, it has to be transformed or translated into a form which computers can work
with. According to Witten, Frank, & Hall (2011: xxi), Machine Learning provides a technical basis for
extracting information from (raw) data, which in turn can be used for other purposes. Two
main types of Machine Learning can be distinguished: supervised and unsupervised learning.
Supervised learning schemes use labelled datasets to train models, which in turn are used to predict
how unseen instances should be labelled (Raschka, 2015: 2-8). In unsupervised learning problems,
instances are unlabelled. The most common form of unsupervised learning is clustering or cluster analysis
(Raschka, 2015: 312; Manning, Raghavan, & Schütze, 2009: 348-350).
Text clustering, according to Aboualigah, Khader, Al-Betar, & Alomari (2017: 24), is one of the most efficient techniques used in the text mining field. Its goal is to divide instances into natural groups (clusters) in such a way that the instances most similar to each other end up in the same cluster, while less similar instances end up in different clusters (Yi, Zhang, Zhao, & Wan, 2017: 1). Within the field of text mining or information retrieval, the separate texts which are processed are referred to as documents. Depending on the project a document can be a chapter from a book, a poem, the lyrics of a song, a tweet, a news article or, as in the case of this research, the textual content extracted from the main webpage of a company website. At the core of techniques used for effective document selection and retrieval lies the so-called Cluster Hypothesis formulated by van Rijsbergen (1989): “Closely associated documents tend to be relevant to the same requests”, or as Manning, Raghaven, & Schütze (2009) define this hypothesis: “Documents in the same cluster behave similarly with respect to relevance to information needs”. Manning, Raghaven, & Schütze (2009: 14) also note that in essence this cluster hypothesis is identical to the contiguity hypothesis, which is the basic hypothesis used in vector space models and is defined as: “Documents in the same class form a contiguous region and regions of different classes do not overlap”. The cluster hypothesis has proven
to be effective and successful in search result clustering (Zeng, He, Chen, Ma, & Ma, 2004) and in creating alternative user interfaces and improved information presentation for web browsing (Cutting, Karger, Pedersen, & Yukey, 1992), (McKeown, Barzilay, & Evans, 2002). Another example of document clustering is the work of Kohonen et al. (2000), who clustered 6.8 million patents based on similarities in the texts of their abstracts, using a statistical representation of their vocabularies as feature vector. Clustering algorithms are also used to improve web searches and to find similar search results (Zang, Pang, Xie, & Wu, 2006). Wang & Koopman (2017) studied the clustering of scientific articles based on semantic similarity with the use of k-means, k-means mini-batch and the Louvain community detection algorithm. They found that k-means was more widely applicable, e.g. when citation data was missing; furthermore, k-means proved to be highly scalable and produced results which are in high agreement with other solutions. No specific studies were found in which clustering algorithms were applied to webpages in order to find innovative companies. Studies aimed at finding a specific theme or topic have been conducted, however. In several of these studies k-means is also proposed as the more efficient algorithm (Inderjit & Dharmendra, 2000), (Steinbach, Karypis, & Kumar, 2000), (Ramage, Heymann, Manning, & Garcia-Molina, 2008). While many other clustering algorithms exist and are being proposed, not all of these algorithms are fit for large-scale text clustering. Kulis & Jordan (2012), in their study on new Bayesian algorithms, state that:
“(…) despite the success and flexibility of the Bayesian framework, simpler methods such as k-means remain the preferred choice in many large-scale applications. (…) whereas Bayesian models require sampling algorithms or variational inference techniques which can be difficult to implement and are often not scalable, k-means is straightforward to implement and works well for a variety of applications.”
K-means is a centroid-based algorithm, meaning that for each cluster a centre (the mean) is chosen (see figure 1). K-means is a partitioning or flat clustering algorithm, meaning that clusters are formed independently of each other. The first step in running k-means is choosing a value for k. The value of k equals the number of centroids and thus the number of clusters the output will have. The fact that k needs to be determined by the user is viewed as one of the big downsides of the k-means algorithm. Several techniques, e.g. the elbow method and the silhouette method, have been proposed to determine the optimal k for the data (Gupta & Srivastava, 2014: 7). After k has been determined, k-means alternates between two steps while iterating over all data points or instances. The first step, the assignment step, assigns an instance to its nearest centroid. Once the instance is assigned, the second step, the update step, recalculates the mean of the cluster. Figure 1 shows an example with k = 4 and N (number of instances) = 11.
Figure 1: k-means explained
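The assignment and update steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in this research: the function names, the toy data and the fixed number of iterations are assumptions, and the sketch uses the common batch variant in which all instances are assigned before the centroids are updated.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=20, seed=0):
    """Plain k-means: alternate between the assignment and the update step."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k random instances
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every instance to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, labels


# Hypothetical toy data: N = 11 two-dimensional instances, clustered with k = 4.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8),
          (1, 8), (2, 9), (8, 1), (9, 2), (9, 1)]
centroids, labels = k_means(points, k=4)
```

Because the initial centroids are drawn at random, a different seed may give a different partition of the same data.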
Another downside of the k-means algorithm, mentioned by MacKay (2003: 288), is that a change in the placement of the initial centroids might lead to different clusters. According to Bradley, Bennet, & Demiriz (2000), k-means setups with n ≥ 10 dimensions and k ≥ 20 clusters will result in a percentage of clusters which are empty or contain only one or very few data points. In order to meet the heavier requirements of web-based applications, Google’s D. Sculley proposed k-means mini-batch. Sculley explains that the classic k-means algorithm requires O(kns) computations, in which k is the number of clusters, n is the number of examples and s the maximum number of non-zero elements in each feature vector, and thus the number of computations increases linearly with the number of documents. The k-means algorithm explained above is traditionally processed and analysed in an ‘offline’ fashion, meaning that the complete dataset is available and processed as a whole. Especially with big data sets this becomes computationally and memory expensive, or even impossible. Online machine learning was developed to solve this problem. In online machine learning the data is streamed and it is not required to load the full data set into memory (Cho & An, 2014: 362). K-means mini-batch takes random samples of a predetermined size (mini batches) and applies the assignment and update steps of the traditional k-means algorithm to each of them. The use of k-means mini-batch thus enables its users to run k-means on bigger datasets without running into memory errors. Yadav & Baria (2014) have shown that k-means mini-batch, when applied to the reuters21578 data set (a text data set widely used for text categorization research), is less time consuming and over 10% more accurate than the classic k-means algorithm.
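The mini-batch idea can be illustrated with the following sketch, which follows the general scheme of sampling a batch, assigning its instances and then moving the centroids with a per-centroid learning rate. It is an assumption-laden illustration rather than the code used in this research: the function names, the batch size, the number of iterations and the toy data are all hypothetical.

```python
import random


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((x - y) ** 2 for x, y in zip(point, centroids[c])),
    )


def mini_batch_k_means(points, k, batch_size=4, iterations=50, seed=0):
    """Mini-batch k-means: each iteration only touches a small random sample."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k  # number of instances each centroid has absorbed so far
    for _ in range(iterations):
        batch = rng.sample(points, batch_size)  # only this batch is processed
        assigned = [nearest(p, centroids) for p in batch]  # assignment step
        for p, c in zip(batch, assigned):  # update step, one instance at a time
            counts[c] += 1
            rate = 1.0 / counts[c]  # the learning rate shrinks as the cluster grows
            centroids[c] = [
                (1 - rate) * m + rate * x for m, x in zip(centroids[c], p)
            ]
    return centroids


# Hypothetical toy data in two dimensions.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8),
          (1, 8), (2, 9), (8, 1), (9, 2), (9, 1)]
batch_centroids = mini_batch_k_means(points, k=2)
```

Only one batch is held in the working set per iteration, which is what keeps memory usage low on large collections.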
Besides k-means and k-means mini-batch, other algorithms are used for clustering. Examples are affinity-based clustering (Frey & Dueck, 2007) and DBSCAN clustering (Ester, Kriegel, Sander, & Xu, 1996). For both algorithms the computation and memory costs are relatively high compared to the k-means variants, which makes them less suitable for large datasets. Another form of clustering, next to these flat clustering algorithms, is so-called hierarchical clustering. Hierarchical clustering does not need a priori knowledge or information about the number of clusters and is known in many cases to lead to superior results compared to flat clustering (Steinbach, Karypis, & Kumar, 2000: 1). The downside is its time complexity: whereas standard k-means algorithms have a time complexity which is linear in the number of documents, O(kns), most hierarchical clustering algorithms have a time complexity which is at least quadratic in the number of documents, O(kn²s).3 This also makes hierarchical clustering unsuitable for very large data sets.
Clustering text documents poses one major problem, namely that, compared to other forms of data, texts have a relatively high number of features, which can be either informative or uninformative, because text data is both unstructured and noisy (Aboualigah, Khader, Al-Betar, & Alomari, 2017: 24), (Sarkar, 2016: 265). The performance of (cluster) algorithms tends to decline as the number of dimensions increases. In the literature, the problems which occur in relation to dimensions are referred to as ‘The Curse of Dimensionality’: with each added feature the number of dimensions of the virtual space in which the instances (documents) are placed increases, and thus these instances become sparser or more scattered with respect to each other (Bellman, 1957). This dramatically increases memory usage and computation time. The vast number of (uninformative) features also decreases the accuracy with which text documents are clustered. According to Bharti & Singh (2013) effective dimension reduction methods meet five conditions:
1. The method should identify the relevant features and remove the irrelevant features;
2. The method should remove the redundant features;
3. The method should remove features that contain no information (noisy features);
4. The method should preserve useful information in the original feature space;
5. The dimension reduction should not compromise the performance of the algorithm.
The simplest feature selection method is Document Frequency-based Selection. Document Frequency-based Selection removes terms which appear with a higher frequency in the corpus than other terms and, as such, hold less meaning for the text. These terms are often referred to as stop words. These high-frequency terms can be removed using a stop-word list, by removing e.g. the 10 percent most frequent terms or the top 100 most frequent terms, or by a combination of these. The lowest-frequency terms can be removed as well, since they do not add to the similarity or distance computations used by clustering algorithms. Sometimes these low-frequency terms are the result of misspellings or typographical errors. An alternative strategy, proposed by Wilbur & Sirotkin (1992), is to select features by using Term Strength. In this method the strength of a term is compared to the expected strength of a random term in the corpus with the same frequency.
3 The definition of time complexity given by Sridhar (2014: 19) is the following: “Time complexity refers to the measurement of run time of an algorithm in terms of its input size (…)”.
In order to cluster a collection of unstructured text documents the documents are transformed into vectors in a feature space model, often referred to as a Bag of Words model (Almeida, Vasconcelos, & Maia, 2009: 47), (Feldman & Sanger, 2007: 102). In the Bag of Words approach each text document is first divided into separate words or tokens (tokenization). Further pre-processing might consist of case folding (making all letters uppercase or lowercase) and of removing various forms of punctuation (Manning, Raghaven, & Schütze, 2009: 30). All terms (unique tokens) found in the different documents together form the vocabulary of the corpus. The number of terms in this vocabulary determines the number of features for each document. This can be visualized in a term frequency matrix (see figure 2). Each document becomes a vector in the vector space model and each term becomes one dimension of the vector. In order to determine similarity and difference between documents it is important to apply some sort of weighting scheme to the features of each document. The most widely used weighting scheme is called TF-IDF, which stands for term frequency – inverse document frequency and is determined with the following formula:
$$w_{t,d} = \left(1 + \log tf_{t,d}\right) \cdot \log_{10} \frac{N}{df_t}$$
In the formula above w = weight, t = term, d = document, f = frequency and N = the total number of documents in the corpus. This means that the weight of a term increases with each appearance in a document, but decreases with the number of documents in the corpus in which it appears. The underlying reasoning is that a term which appears in the majority of the documents is unlikely to be an effective determinant for a specific document, whereas a term which is very rare in the whole corpus but does appear in a specific document must be a strong determinant for that document.
Source: modelled after http://brandonrose.org/clustering
Figure 2: Term Frequency Matrix
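As a worked illustration of the TF-IDF weighting formula given above, the weight of a single term in a single document can be computed as follows. The function name and the example counts are hypothetical, and the sketch assumes that the first logarithm is also taken in base 10, which the formula itself does not specify.

```python
import math


def tf_idf_weight(term_count, doc_freq, n_docs):
    """Weight of term t in document d: (1 + log tf) * log10(N / df).

    term_count: how often the term occurs in the document (tf)
    doc_freq:   number of documents that contain the term (df)
    n_docs:     total number of documents in the corpus (N)
    """
    if term_count == 0 or doc_freq == 0:
        return 0.0  # a term that does not occur gets no weight
    # Assumption: base 10 is used for the term-frequency logarithm as well.
    return (1 + math.log10(term_count)) * math.log10(n_docs / doc_freq)


# Hypothetical counts: the term occurs 10 times in the document and appears
# in 10 of 1,000 documents, so the weight is (1 + 1) * 2.
rare_term = tf_idf_weight(term_count=10, doc_freq=10, n_docs=1000)

# A term that appears in every document gets weight 0, however often it occurs.
common_term = tf_idf_weight(term_count=5, doc_freq=1000, n_docs=1000)
```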
Section 3: Experimental Setup
In this section the setup and steps are described that are used in the experiment and that have led to the
clustering results. This setup also mentions the results of sub steps which were needed in order to pro-
ceed or which form the base for elements of the experimental setup such as direct and indirect data
collection and language detection. For a total overview of the experimental setup see the practical out-
line in Section 3.8.
3.1. Data Description
The main type of data needed for this research is the text found on the main pages of the websites of Dutch companies, which are found on the list of 956,540 URLs. Inspection of the URLs reveals that 80.11% of the URLs are found in the .nl domain, 14.23% in the .com domain, 2.7% in the .eu domain, 0.92% in the .net domain and a total of 2.05% in various other domains such as .org, .biz, .nu, .li and .be (see chart 1).
Chart 1: Domain Distribution of URLs
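A domain distribution such as the one in chart 1 can be derived from a URL list with the Python standard library. The sketch below is illustrative only: the helper names and the example URLs are hypothetical and it is not the code used to produce the chart.

```python
from collections import Counter
from urllib.parse import urlparse


def top_level_domain(url):
    """Return the top-level domain of a URL, e.g. '.nl' (empty if unknown)."""
    # URLs without a scheme are prefixed with '//' so the host is recognised.
    host = urlparse(url if "//" in url else "//" + url).hostname or ""
    return "." + host.rsplit(".", 1)[-1] if "." in host else ""


def domain_shares(urls):
    """Percentage of URLs per top-level domain."""
    counts = Counter(top_level_domain(u) for u in urls)
    total = sum(counts.values())
    return {tld: 100.0 * n / total for tld, n in counts.items()}


# Hypothetical example URLs.
urls = [
    "http://www.example.nl",
    "www.voorbeeld.nl",
    "https://example.com/about",
    "example.nl/contact",
    "http://example.eu",
]
shares = domain_shares(urls)  # three of the five URLs are in the .nl domain
```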
In order to determine whether a website can be classified as being innovative, eight top 100 lists of the most innovative companies in the Netherlands are used, which are published by the Dutch Chamber of Commerce (KvK) for the years 2009 until 2016. The selection of companies and their ranking in the top 100 have been done by innovation experts who have judged each nominated company according to the impact it has had on its respective branch or on society as a whole, its originality and its realized potential for growth. Comparing the innovative URLs with the 956,540 URLs provided by Statistics Netherlands, the following becomes clear (see table 1). Of the 703 unique URLs found in the top 100 lists of 2009 until 2016, 460 URLs (65.43%) also appear in the main URL list. This shows that the main list does not cover all innovative companies which are acknowledged by the Dutch Chamber of Commerce in their published top 100. Accordingly, 0.05% of all the companies in the main URL list have been in the Chamber of Commerce top 100 of innovative companies between 2009 and 2016. Logically not all innovative companies have made it into the top 100 lists, and thus the main URL list might contain more innovative companies than the 460 URLs in this paragraph.
Table 1: KvK Innovative URLs found in main list
In addition to the above-mentioned data, a spreadsheet is used containing company URLs and the corresponding SBI numbers as classified by Statistics Netherlands. This makes it possible to link a URL to its specific SBI category. A second spreadsheet is used which contains the names of each category and sub category of the SBI and their corresponding SBI numbers. This is used throughout this research to visualize results and generate tables and charts (see e.g. chart 2).
Thus, there are five sets of data:
1. A list of 956,540 URLs of Dutch companies;
2. Texts collected from or belonging to the company main webpages;
3. Lists of the KvK top 100 innovative companies;
4. A spreadsheet containing the URLs linked to the SBI numbers;
5. A spreadsheet containing the SBI numbers linked to the names of categories and sub categories.
In order to evaluate innovation and the surfaced clusters, different elements will be evaluated. With the use of a dataset containing both URLs and SBI codes, the URLs are located in the SBI. This is done beforehand for the innovative URLs to see their original distribution, and afterwards for the surfaced clusters to analyse whether there is a resemblance between the surfaced clusters and the SBI. When studying the innovative URLs as used for this research it becomes obvious that the innovative companies are found in a wide range of categories: in total 40 level 2 categories and 106 level 4/5 categories (see chart 2 and appendix IX).4 Chart 2 below and the list in appendix IX make clear that innovative companies from the top 100 lists are found in multiple categories and that most of these do not appear in a remainder category.5 While most of the KvK innovative companies fall into the categories wholesale trade, architects, engineers, technical design and consultancy, testing and analysis and financial institutions, innovative companies are found in many other kinds of categories as well. This underlines the fact that the SBI is rather ineffective in providing information about innovative companies.
Chart 2: Percentage Innovative URLs per SBI category
3.2. Data Collection
The list of 956,540 URLs, the URLs of the top 100 lists and the spreadsheet with URLs linked to the SBI codes are provided by Statistics Netherlands. The spreadsheet with the SBI codes linked to the category names is easily found on the Statistics Netherlands website. The texts belonging to the URLs in the list need to be collected or extracted. For this there are two distinct methods, indirect and direct collection. Both methods will be described.
4 See appendix XIII for an explanation of different SBI levels.
5 Remainder categories have n.e.c.* added, meaning “not elsewhere classified”.
3.2.1. Indirect Collection [Common Crawl]
The first method, described hereafter, is the indirect method, which means collecting the data from a web archive, in this case the Common Crawl September 2016 archive. Collecting data from a web archive can be considered an indirect form of data collection since the data is pre-collected and stored by the web archive. The raw HTML data and plaintext extracts are available through Commoncrawl.org. Common Crawl provides a dataset on a monthly basis which consists of billions of webpages stored in so-called WARC files. WARC stands for Web Archive; these files store raw crawl data and metadata. Besides WARC files Commoncrawl.org provides WAT files and WET files, which contain specific data from the WARC files. The WAT file only contains the computed metadata and the WET file contains the flat or plain text, and thus the actual textual content of the website. Since it is central to this thesis to apply unsupervised learning to the actual webpage content, the plain text extracted from the WET files will be used in the model (Commoncrawl.org, 2017).
In order to check whether the Common Crawl archive sufficiently covers the URLs used in this research, 10 random samples of 1,000 URLs each were taken from the main list of 956,540 URLs. The result is that on average only 27.98% is covered by the Common Crawl archive of September 2016 and 29.35% by the archive of March 2017 (see appendix I for results per sample and appendix XI for code excerpts).
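The sampling procedure used for this coverage check can be sketched as follows. The sketch assumes that the set of URLs present in the archive is already known; how that set is obtained from Common Crawl is not shown, and the function name and the generated example URLs are hypothetical.

```python
import random


def average_coverage(all_urls, archived, n_samples=10, sample_size=1000, seed=0):
    """Average percentage of sampled URLs that are present in the archive."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_samples):
        sample = rng.sample(all_urls, sample_size)  # one random sample of URLs
        hits = sum(1 for url in sample if url in archived)
        rates.append(100.0 * hits / sample_size)
    return sum(rates) / n_samples


# Hypothetical data: 5,000 generated URLs, of which every fourth is archived.
all_urls = ["site{}.nl".format(i) for i in range(5000)]
archived = set(all_urls[::4])
coverage = average_coverage(all_urls, archived)
```

Averaging over several samples reduces the influence of a single unrepresentative draw from the main list.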
While the coverage of Common Crawl seems to be slightly increasing with regard to the URLs set apart
for this research which mainly lie in the .nl domain, it has become clear that indirect collection of
webpages through Common Crawl would not be sufficient for this research. The next sub-section will
therefore elaborate on the direct method of extracting webpages, namely through web scraping.
3.2.2. Direct Collection [Web scraping]
According to Mitchell (2015: 5) web scraping is the automatic gathering of information through any means other than a program interacting with an API. Web scraping is also known as screen scraping and web harvesting. It is most commonly done by writing a program which automatically queries web servers, requests data in the form of HTML code and parses this data in order to extract information (Mitchell, 2015: 5).
In order to extract the texts of the main pages belonging to the URLs, the Python urllib3 library was used in combination with grequests and the Beautiful Soup library6 (see appendix XI for code excerpts).7 With this program the HTML code is retrieved from the main webpage and the text (information) is extracted from this HTML code. In addition, the documents are collected in such a way that they can be used to produce clusters when a cluster algorithm is applied to the documents.
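The step from HTML code to plain text can be illustrated with the sketch below. It deliberately uses only the Python standard library (html.parser) as a stand-in for the Beautiful Soup based extraction used in this research, and it works on an HTML string instead of fetching a page; the class name, function name and sample page are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string as one document."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical main page of a company website.
sample_html = (
    "<html><head><title>Bakkerij Jansen</title>"
    "<style>p{color:red}</style></head>"
    "<body><h1>Welkom</h1><p>Vers brood, elke dag.</p>"
    "<script>var x = 1;</script></body></html>"
)
document = html_to_text(sample_html)
```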
Of the 956,540 URLs, 299,759 URLs could not be extracted or yielded no text after extraction, which is a reduction of 31.34%. Of the 460 innovative URLs, the text of 70 URLs could not be extracted, a reduction of 15.22% (see appendix XI for code excerpts). Although nearly one third of the webpages could not be extracted, the direct method is still substantially more successful than the indirect method. Thus, the direct method is chosen as the method for generating documents.8
3.3. Pre-processing
3.3.1. Language Detection
The langdetect library from PyPI, a port of Google’s language-detection library, is used to determine the language of the text. This is necessary for further (pre-)processing, e.g. stop-word removal, tokenisation and stemming, since all these operations depend on the language of the text.
The process of language detection started with a sample of 1,000 URLs, which were manually checked. From this sample it became evident that most (over 80%) of the texts are in Dutch (nl), as expected, and about 15% are in English (en). A manual check of a sample of the texts classified as Afrikaans (af) showed that these texts were in reality Dutch texts which had been misclassified. Since Afrikaans and Dutch are alike, this misclassification is understandable. The texts that were classified as French or Japanese were correctly classified. Of the texts which were classified as German (de) only 25% were correctly classified, while the remaining 75% were (misclassified) Dutch texts. The texts classified as Catalan were either English (50%) or Dutch (50%). The text which was classified as Polish (pl) was in reality a Dutch website which also had some hyperlinks to Polish Facebook pages. The texts which were classified as Tagalog (tl) (an Austronesian language) were in reality English texts about yoga, which contained some Asian words that led to the misclassification. The text classified as Chinese (zh-cn) was indeed a Chinese text. While many of the texts classified as English were correctly classified, there were also texts that were in reality Dutch texts with a lot of English terms (see chart 3).
6 Urllib3 is an HTTP client for Python, also see https://pypi.python.org/pypi/urllib3.
7 Total running time was less than 48 hours when running simultaneously on 12 separate Jupyter Notebooks on an Intel® Xeon® CPU E5-2670 0 @ 2.60 GHz.
8 The direct collection could be improved if more time and effort were invested in solving the errors which occurred during the process. It has also been considered and proposed to combine both direct and indirect methods in order to create a larger corpus; this however was rejected to keep the collected data consistent.
Chart 3: Distribution of Languages
The decision was made to use only texts that were correctly classified as Dutch and texts which were
classified as Afrikaans. This will result in a completely Dutch corpus. The reason for this decision is
that it considerably diminishes the number of dimensions and it also greatly simplifies pre-processing
and clustering processes. The reduction does not have any impact on the validity of the results that are
needed to answer the research question.
3.3.2. Tokenization, Stop Word Removal and Stemming
Before the clustering algorithm is applied to the collected and selected documents it is important to structure the data in such a way that it can be processed by the algorithm. The first step in creating structure is cutting the strings into smaller pieces (tokens or terms). This process is called tokenization and, in this research, is done with the NLTK word tokenizer. After tokenization additional “noise” is removed by removing stop words. Stop words are words in a language which are used often but do not add any meaning or value from a data science perspective (see appendix XI for code excerpts). Examples of Dutch stop words are: “aan”, “af”, “al”, “andere”, “dan”, “die”, “dit”, “doen”, “een”, “er”, “heb”, “hem”, “het”, “met”, “zei”, “zo”, “zou”, and so on. Each language has its unique stop words. For this research the Dutch stop word library found in PyPI is used.
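Tokenization followed by stop word removal can be illustrated with a small sketch. It uses a regular expression as a simplified stand-in for the NLTK word tokenizer and only the example stop words listed above instead of a full stop word library; the function names and the example sentence are hypothetical.

```python
import re

# Only the example stop words mentioned in the text, not a complete list.
DUTCH_STOP_WORDS = {
    "aan", "af", "al", "andere", "dan", "die", "dit", "doen", "een",
    "er", "heb", "hem", "het", "met", "zei", "zo", "zou",
}


def tokenize(text):
    """Split a text into lower-case word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def remove_stop_words(tokens, stop_words=DUTCH_STOP_WORDS):
    """Keep only the tokens that are not in the stop word list."""
    return [token for token in tokens if token not in stop_words]


tokens = remove_stop_words(tokenize("Dit is een bakkerij met vers brood."))
# tokens == ["is", "bakkerij", "vers", "brood"]
```

A complete stop word list would also remove "is"; it survives here only because it is not among the examples quoted in the text.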
3.3.3. Document Selection
Working with several languages would ultimately lead to several different clusterings, since e.g. English and Dutch texts would share very few terms even when covering the same topic. Working with two or more languages with little word similarity would also increase the number of features. When applying language detection to all documents an even larger list of languages appears, most of them representing only a small percentage of the documents (see appendix II, table ii). When looking at the languages with more than 1% of the documents, Dutch, English and Afrikaans are clearly the highest scoring. From the sample it followed that both the documents classified as Dutch and those classified as Afrikaans were 100% Dutch. Since using more than one language would greatly increase the number of features, and thus the number of dimensions and memory usage, all documents which are not classified as either Dutch or Afrikaans are omitted. This left 510,755 Dutch documents for further processing. From these 510,755 documents an additional 66,283 documents were removed because they contain fewer than 20 terms. Documents with very few terms are unlikely to contain much information about the company. They only add noise to the experiments and do not form a good base for clustering. A sample of the documents with fewer than 20 terms was reviewed and the majority of these documents turned out to be error messages or ‘page under construction’ messages, which supported the decision to remove them. The threshold of 20 terms is in itself arbitrarily chosen. Thus 444,472 documents remained for further experiments (see appendix XI for code excerpts).
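The two selection rules described above (keep only documents labelled Dutch or Afrikaans, and drop documents with fewer than 20 terms) can be expressed as a simple filter. This is a sketch under assumptions: the dictionary layout of a document, the field names and the example documents are hypothetical, and terms are counted by splitting on whitespace.

```python
def select_documents(documents, keep_languages=("nl", "af"), min_terms=20):
    """Keep documents in the wanted languages that have at least min_terms terms."""
    selected = []
    for doc in documents:
        if doc["language"] not in keep_languages:
            continue  # omit documents detected as another language
        if len(doc["text"].split()) < min_terms:
            continue  # omit very short documents, e.g. error pages
        selected.append(doc)
    return selected


# Hypothetical documents with a detected language label.
documents = [
    {"url": "a.nl", "language": "nl", "text": " ".join(["woord"] * 25)},
    {"url": "b.nl", "language": "en", "text": " ".join(["word"] * 30)},
    {"url": "c.nl", "language": "af", "text": " ".join(["woord"] * 20)},
    {"url": "d.nl", "language": "nl", "text": "pagina in aanbouw"},
]
kept_urls = [doc["url"] for doc in select_documents(documents)]
# kept_urls == ["a.nl", "c.nl"]
```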
3.3.4. Feature Selection
After scraping the text from the URL, all text is converted to lower case and punctuation is removed. In this way no distinction is made between identical words that appear at the beginning of a sentence or in the middle of a sentence. After tokenization, stemming is applied with the use of the NLTK SnowballStemmer. Stemming reduces dimensionality by bringing each word back to its stem. E.g. Works, Working, Worked, Worker and Workers will all be brought back to the word Work.
3.4. The Algorithm
While many algorithms exist which may be superior to k-means on small datasets, k-means mini-batch seems to be the most logical starting point when working with large datasets and limited resources, given the time frame of this project and the additional process of extracting texts from the webpages. The k-means mini-batch algorithm is explained more extensively in sub-section 2.4.
3.5. Vector Space Model Setup
After the previously mentioned pre-processing steps, the vector space model was constructed with a TF-IDF vectorizer. TF-IDF is the most commonly used weighting scheme and was therefore chosen in this setup. The sklearn TfidfVectorizer is used to build the vector space model (see appendix XI for code excerpts). For feature selection both uni-grams (one term) and bi-grams (two-term combinations) are used. Bi-grams may increase the effectiveness of the clustering algorithm, since documents in which the same word combinations are found are logically more similar to each other (Furnkranz, 1998). Features are selected by removing the most infrequent terms (min_df) and the most frequent terms (max_df). Several setups were tried (e.g. min_df = 0.07, min_df = 0.05, min_df = 0.01 and min_df = 0.007). Based on manual exploration, min_df = 0.01 appeared to give both a low number of very small clusters and clear cohesive clusters (see appendix V). After min_df = 0.01 (meaning that terms which appear in less than 1% of the documents are removed) was selected, several values of max_df were tried. Everything below max_df = 0.6 (meaning that terms which appear in more than 60% of the documents are removed) gave less obvious clusters, and everything from max_df = 0.7 and higher gave similar clusterings which appeared to be more cohesive. Therefore min_df was set to 0.01 and max_df to 0.7, which gave both a low number of very small clusters and a fair amount of clear cohesive clusters. Although a combination of uni-grams and bi-grams was used for building the vector space model, the frequency of recurring uni-gram features far exceeded that of the bi-grams. Cutting off the features that appear in less than 1% of the documents resulted in losing all bi-grams with the exception of two.
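The effect of the min_df and max_df thresholds can be illustrated without any library. The sketch below mimics the document-frequency cut-offs on already tokenized documents; it is not the sklearn TfidfVectorizer itself, and the function name, the toy documents and the thresholds used in the example are hypothetical.

```python
def select_features(tokenized_docs, min_df=0.01, max_df=0.7):
    """Keep terms whose document-frequency share lies within [min_df, max_df]."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for tokens in tokenized_docs:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(
        term for term, df in doc_freq.items()
        if min_df <= df / n_docs <= max_df
    )


# Hypothetical toy corpus of four tokenized documents.
docs = [
    ["wij", "bakkerij", "brood"],
    ["wij", "garage", "auto"],
    ["wij", "bakkerij", "taart"],
    ["wij", "garage", "banden"],
]
# "wij" occurs in every document (above max_df); "brood", "auto", "taart" and
# "banden" occur in only one of four documents (below min_df).
features = select_features(docs, min_df=0.3, max_df=0.7)
# features == ["bakkerij", "garage"]
```

The same cut-off explains why rare bi-grams disappear: a two-term combination has to occur in at least the min_df share of all documents to survive.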
3.6. Elbow Method
As Bholowalia & Kumar (2014) explain in their article, the elbow method tries to find the ideal number of clusters for modelling data. In the elbow method the number of clusters is incrementally increased, while the sum of squared errors is calculated for each number of clusters. As the number of clusters increases, the sum of squared errors will go down until a certain point, after which it remains more or less stable. This point is the so-called “elbow” and theoretically this is the ideal number of clusters, since each additional cluster no longer leads to a substantially smaller sum of squared errors.
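The quantities involved in the elbow method can be sketched as follows. The first function computes the sum of squared errors for a given clustering; the second picks the elbow with one simple heuristic, namely the first number of clusters after which the relative decrease falls below a threshold. Both function names, the threshold of 10% and the example values are assumptions made for illustration; in practice the elbow is often read from a plot.

```python
def sum_of_squared_errors(points, centroids, labels):
    """Sum of squared distances between each point and its assigned centroid."""
    return sum(
        sum((x - m) ** 2 for x, m in zip(point, centroids[label]))
        for point, label in zip(points, labels)
    )


def find_elbow(sse_by_k, min_relative_gain=0.10):
    """Return the first k after which adding a cluster barely lowers the SSE."""
    ks = sorted(sse_by_k)
    for previous, current in zip(ks, ks[1:]):
        if sse_by_k[previous] == 0:
            return previous  # a perfect fit cannot be improved any further
        gain = (sse_by_k[previous] - sse_by_k[current]) / sse_by_k[previous]
        if gain < min_relative_gain:
            return previous
    return ks[-1]


# Hypothetical SSE values for k = 1..5: the curve flattens after k = 3.
sse_by_k = {1: 1000.0, 2: 520.0, 3: 260.0, 4: 245.0, 5: 238.0}
elbow = find_elbow(sse_by_k)
```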
3.7. Analysing Clusters
3.7.1. Innovative Clusters
In order to analyse to what extent innovative clusters have surfaced, the documents classified as innovative are identified within the surfaced clusters. The relative overrepresentation of innovative documents9 within a cluster is calculated by subtracting the percentage of all documents that fall in that cluster from the percentage of all innovative documents that fall in that cluster.
$$\text{Relative Overrepresentation} = \left( \frac{\text{Number of innovative documents in cluster}}{\text{Total number of innovative documents}} - \frac{\text{Number of documents in cluster}}{\text{Total number of documents}} \right) \cdot 100\%$$
9 Within this research innovative documents are documents which are classified in accordance with the lists containing the KvK top 100 of most innovative companies.
The relative overrepresentation helps to establish whether innovative documents behave differently than
other documents. A greater (smaller) relative overrepresentation might indicate that a cluster has a
greater (smaller) pull on innovative documents.
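The relative overrepresentation can be computed directly from four counts. The function below follows the formula given above; the function name and the numbers in the example are hypothetical and are not results of this research.

```python
def relative_overrepresentation(innovative_in_cluster, total_innovative,
                                docs_in_cluster, total_docs):
    """Share of innovative documents in the cluster minus share of all documents, in %."""
    innovative_share = innovative_in_cluster / total_innovative
    overall_share = docs_in_cluster / total_docs
    return (innovative_share - overall_share) * 100.0


# Hypothetical cluster holding 10% of all innovative documents (39 of 390)
# but only 5% of all documents (22,000 of 440,000): about +5 percentage points.
over = relative_overrepresentation(39, 390, 22000, 440000)

# A cluster holding 1% of the innovative documents and 5% of all documents
# is underrepresented, which gives a negative value.
under = relative_overrepresentation(4, 400, 22000, 440000)
```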
3.7.2. Similarity to SBI and Cohesiveness
In order to analyse surfacing clusters, the surfaced clusters are compared to the SBI and innovative documents are located in the surfaced clusters. The similarity to the SBI and the cohesiveness of the clusters can be analysed by identifying the SBI codes of the documents within each cluster and calculating the most dominant category. This is illustrated in figure 3. Figure 3 represents one cluster; the differently coloured circles are documents which can be traced back to different categories in the SBI. In this research the documents are compared to level 2 and level 4/5 of the SBI.10
Figure 3: Analyzing Clusters
10 See appendix XIII for an explanation of different SBI levels.
As shown in figure 3, different SBI categories can be present within one cluster. Calculating the most
dominant category gives an insight into the document distribution within a cluster. Comparing the
documents in the cluster of figure 3 to the level 4/5 categories of the SBI results in a most dominant
level 4/5 category (Processing of vegetable or fruit (no juice)), which represents 40% of the total
documents in that cluster. Since level 4/5 SBI categories are the most specific categories in the SBI,
this percentage indicates how similar the unsupervised clustering is to the designed SBI categorisation.
Logically, clusters with a relatively larger dominant category are more cohesive than clusters with
a relatively smaller dominant category. Comparing the documents in clusters to the level 2 SBI categories
indicates whether documents that are clustered together fall into the same business sector. In figure 3,
comparing the documents to the level 2 SBI categories leads to a most dominant level 2 category
(Manufacture of food), which represents 60% of the total documents in that cluster. Comparing the
clustering to level 2 SBI categories therefore gives a more general indication of the cohesion within
clusters.
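The dominant-category calculation described above could be sketched as follows. The helper `dominant_category` is hypothetical, and it assumes that the level 2 category of a document is given by the first two digits of its SBI code (as in 74201 and category 74 later in this thesis). The codes in the toy cluster are invented to mirror the 40% and 60% outcome described for figure 3; they are not taken from the research data.

```python
from collections import Counter


def dominant_category(sbi_codes, digits=None):
    """Return (category, share in %) of the most frequent SBI category.

    sbi_codes: SBI codes (as strings) of the documents in one cluster
    digits:    None compares the full (level 4/5) codes; 2 truncates every
               code to its first two digits (assumed here to be SBI level 2)
    """
    if not sbi_codes:
        raise ValueError("cluster has no documents with an SBI code")
    keys = [code[:digits] if digits else code for code in sbi_codes]
    category, count = Counter(keys).most_common(1)[0]
    return category, count * 100 / len(keys)


# Invented toy cluster of 10 documents.
cluster_codes = ["1039"] * 4 + ["1071", "1085", "4711", "5610", "6201", "9602"]

level_45 = dominant_category(cluster_codes)           # full codes
level_2 = dominant_category(cluster_codes, digits=2)  # first two digits only
```

Averaging the returned share over all clusters of one clustering gives a single similarity score (full codes) or cohesiveness score (two-digit prefixes) per value of k.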
3.7.3. Analysing Stability of Surfacing Clusters
As mentioned in the theoretical framework, a different placement of the initial centroids might lead
to different clusters. In order to estimate the stability of the surfacing clusters, the model is run five
times for each value of k. This leads to three collections (k=100, k=500 and k=1500), each containing
five sets of clusters. Within each collection, every cluster in a set is compared to the most similar
cluster (the one containing the highest percentage of identical documents) in each of the other sets
within that collection. The clusters in set 1 are compared to sets 2, 3, 4 and 5; set 2 is compared to
sets 3, 4 and 5; set 3 is compared to sets 4 and 5; finally, set 4 is compared to set 5. For each
collection the average of the similarities is taken. Figure 4 gives an example of one cluster and the
measure in which it resurfaces in each set. The example gives an average of 71%, which indicates that on
average 71% of the documents in the cluster resurface together in one cluster. The same process is
repeated for all clusters and is done for all three collections. This results in a collection of
stability measures which can be categorized and reported (see chart 4). The colours and numbers in
figure 4 do not represent a value or feature but show how a cluster might resurface containing different
combinations of documents.
Figure 4: Consistency of Surfacing Clusters
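The pairwise comparison of cluster sets can be sketched as below. This is an illustrative reading of the procedure rather than the original script. It assumes that every run is available as a list of cluster labels (one per document, in the same document order) and that the "percentage of identical documents" is measured relative to the size of the cluster from the earlier set. All function names are hypothetical.

```python
from itertools import combinations


def clusters_of(labels):
    """Group document indices by cluster label: {label: set of indices}."""
    groups = {}
    for doc, label in enumerate(labels):
        groups.setdefault(label, set()).add(doc)
    return groups


def best_overlap(cluster, other_run):
    """Largest share of `cluster` that stays together in one cluster of `other_run`."""
    return max(len(cluster & other) for other in other_run.values()) / len(cluster)


def average_stability(runs):
    """Average best-overlap share over all clusters and all run pairs (i < j)."""
    grouped = [clusters_of(labels) for labels in runs]
    scores = [
        best_overlap(cluster, later)
        for earlier, later in combinations(grouped, 2)
        for cluster in earlier.values()
    ]
    return sum(scores) / len(scores)


# Toy example: three runs over six documents. Run 2 only renames the labels
# of run 1; run 3 splits the large cluster into two halves.
runs = [
    [0, 0, 0, 0, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 2, 2],
]
stability = average_stability(runs)
```

Because clusters are matched on shared documents, the arbitrary label numbers that k-means assigns in each run do not influence the score.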
3.8. Practical Outline
The following figure (figure 5) presents the practical outline and gives an overview of the
(pre-)processing steps leading to the end results, together with the number of documents in each phase
of the research process.
Figure 5: Practical Outline
Section 4: Results
This section describes the results of applying the k-means algorithm to the pre-processed data. For k=100
the URLs within each cluster are compared to each other in order to check for cluster cohesiveness.
Samples are checked manually for each cluster. This, however, is not done for k=500 and k=1500, since
it would be too time consuming for larger values of k. Therefore, this method is not further reported
in this section (the results and a description of this method can be found in appendix IV).
In order to analyse the extent to which innovative clusters surface, the relative overrepresentation
of innovative documents within clusters is calculated as explained in sub-section 3.7.1. The
cohesiveness of the clustering and its similarity to the SBI are analysed as explained in sub-section
3.7.2. Appendices VII and VIII show complete tables of the results for the level 4/5 and level 2 SBI
category comparisons.
4.1. Finding Optimum k
In order to determine the optimum number of clusters for the k-means mini batch algorithm, the
elbow method is applied to the data, which led to graphs 1 and 2. Graph 1 shows that the SSE has a
downward trend when the number of clusters is increased. However, the trend line does not go down
steadily but is very volatile. This is in accordance with Daas, Puts, Buelens, & van den Hurk (2015),
who describe volatility as one of the three characteristics of Big Data. The so-called ‘elbow’ appears
around k=100. This becomes even clearer when the curve is smoothed by showing fewer points (see graph 2).
Graph 1: Results Elbow Method
Graph 2: Results Elbow Method "Smoothed"
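In this research the elbow was read from the graphs by eye. As an aid to the reader, the sketch below shows one way such a reading could be automated once the SSE per value of k is known. Both the "largest distance below the chord" heuristic and the SSE values are assumptions made for illustration; they are not the method or the data of this thesis. The `subsample` helper mimics the smoothing of graph 2 by keeping fewer points.

```python
def subsample(points, step):
    """Keep every `step`-th (k, sse) point, always including the last one."""
    kept = list(points[::step])
    if kept[-1] != points[-1]:
        kept.append(points[-1])
    return kept


def elbow_by_chord_distance(points):
    """Return the k whose point lies furthest below the line joining the
    first and last (k, sse) points, after scaling both axes to [0, 1]."""
    (k_first, sse_first), (k_last, sse_last) = points[0], points[-1]
    k_range = k_last - k_first
    sse_range = sse_first - sse_last
    if k_range <= 0 or sse_range <= 0:
        raise ValueError("expected increasing k and an overall decreasing SSE")
    best_k, best_gap = k_first, 0.0
    for k, sse in points:
        x = (k - k_first) / k_range
        y = (sse - sse_last) / sse_range
        gap = 1.0 - x - y  # proportional to the distance below the chord
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k


# Invented, slightly volatile SSE curve (not the values behind graph 1).
sse_curve = [
    (25, 1000.0), (50, 700.0), (75, 520.0), (100, 400.0), (125, 410.0),
    (150, 380.0), (175, 390.0), (200, 360.0), (225, 365.0), (250, 340.0),
]
elbow_full = elbow_by_chord_distance(sse_curve)
elbow_smoothed = elbow_by_chord_distance(subsample(sse_curve, 3))
```

A heuristic of this kind still depends on the range of k that is scanned, so it does not remove the concern about volatile curves raised above.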
While k=100 will be the starting position, since it theoretically must lead to the clustering with the
highest quality, this research also considers two other values of k: k=1500, since 1500 is the total
number of sub-categories which can be found in the SBI, and k=500 as an arbitrarily chosen value between
the value found by the elbow method and the number corresponding to the total number of categories in
the SBI. k=500 is chosen to better comprehend how the surfacing clusters change when scaling up from the
number of clusters which should be an optimum according to the elbow method (k=100) to the number of
clusters comparable to the SBI categories.
4.2. Cluster Analysis
4.2.1. Innovative Cluster Analysis
While a relative overrepresentation of innovative documents has been found in some clusters, no
clear innovative clusters have surfaced in any of the clusterings. The relative overrepresentation is
calculated as explained in sub-section 3.7.1. When the relative overrepresentation in clusters is
analysed, the following becomes clear:
For k=100 the cluster with the greatest overrepresentation is cluster 83, which has an overrepresentation
of 12.94%. Cluster 83 contains 2.85% of total documents and 15.79% of all innovative documents. The
most dominant level 4/5 category in cluster 83 is 6201 – Writing, producing and publishing software,
representing 19.6% of the documents in the cluster. Another cluster that stands out in the k=100
clustering is cluster 12, which has an overrepresentation of 2.71%. This cluster contains 23.61% of the
total documents and 26.3% of all innovative documents, and is clearly the biggest cluster in the k=100
clustering. The most dominant level 4/5 category in cluster 12 is 94997 – Other interest organizations
n.e.c., representing 4% of the documents in the cluster (see appendix VI).
When looking at the innovative documents in the k=500 clustering, innovative documents are more
spread out amongst different clusters. The highest overrepresentation is found in cluster 191, which has
an overrepresentation of 5.39%. The most dominant level 4/5 category in cluster 191 is 6420 - Financial
holding, representing 6% of the documents in the cluster (see appendix VI).
With regard to the innovative documents of k=1500, two clusters stand out: clusters 103 and 309. Both
clusters have an above-average number of documents, especially when compared to the clusters with an
overrepresentation in k=500. Cluster 103 contains 10.3% of total documents and 15.45% of total
innovative documents and thus has a relative overrepresentation of 5.42%. Cluster 309 contains 17.46% of
total documents and 19.92% of total innovative documents and thus has a relative overrepresentation of
2.46% (see appendix VI).
4.2.2. Analysing Cohesiveness and Similarity to SBI
The cohesiveness of the clustering and its similarity to the SBI are analysed as explained in sub-section
3.7.2. Table 2 below summarizes the average similarity and cohesiveness scores for each value of k;
these are further explained thereafter.
Similarity and Cohesiveness Scores

| | k=100 | k=500 | k=1500 |
|---|---|---|---|
| Level 4/5 (Similarity) | 23.11% | 32.9% | 30.2% |
| Level 2 (Cohesiveness) | 35.21% | 38.41% | 44.97% |

Table 2: Similarity and Cohesiveness Scores
When comparing the surfaced clusters of k=100 to the level 4/5 SBI categories, the following table
(table 3) can be constructed:
Cluster Most similar to SBI for k=100

| Cluster | Most dominant level 4/5 sub-category | SBI 2008 | Percentage of documents with SBI code represented by most dominant category | Number of documents with SBI code |
|---|---|---|---|---|
| 89 | Photography | 74201 | 83.30% | 12 document(s) |
| 88 | Beauty treatment, pedicures and manicures, make-up and image consulting | 96022 | 62.30% | 2939 document(s) |
| 54 | Advertising agencies | 7311 | 58.30% | 12 document(s) |
| 0 | Sale and repair of passenger cars and light motor vehicles (no import of new cars) | 45112 | 56.80% | 3776 document(s) |
| 24 | Other service activities n.e.c.* | 9609 | 56.70% | 930 document(s) |
| 17 | Restaurants | 56101 | 56.40% | 2355 document(s) |
| 85 | Landscape service activities | 8130 | 54.10% | 1709 document(s) |

Table 3: Most similar clusters to SBI for k=100
The average of the percentages represented by the most dominant level 4/5 SBI category for k=100 is
23.11%. When considering the cohesiveness of the clustering by comparing to the level 2 SBI categories,
the average of the percentages represented by the most dominant category is 35.21%.
When scaling up to k=500, more clusters with a higher similarity appear, with the average percentage of
dominant-category documents rising from 23.11% to 32.9% when compared to SBI level 4/5 categories.
Remarkable are the clusters in the sorted table extract below, which reach almost 99% similarity (see
table 4). When comparing to the level 2 categories, the average becomes 38.41%.
Cluster Most similar to SBI for k=500

| Cluster | Most dominant level 4/5 sub-category | SBI 2008 | Percentage of documents with SBI code represented by most dominant category | Number of documents with SBI code |
|---|---|---|---|---|
| 216 | Hairdressing | 96021 | 98.50% | 65 document(s) |
| 217 | General dental practices | 86231 | 93.20% | 59 document(s) |
| 45 | Dispensing chemists | 4773 | 86.60% | 149 document(s) |
| 248 | Sale and repair of passenger cars and light motor vehicles (no import of new cars) | 45112 | 81.20% | 16 document(s) |
| 128 | Football | 93121 | 75.00% | 60 document(s) |
| 228 | Insurance agents | 6622 | 72.70% | 165 document(s)* |
| 243 | Insurance agents | 6622 | 71.20% | 66 document(s)* |
| 70 | Beauty treatment, pedicures and manicures, make-up and image consulting | 96022 | 70.50% | 611 document(s)* |
| 74 | Beauty treatment, pedicures and manicures, make-up and image consulting | 96022 | 70.30% | 912 document(s)* |
| 200 | Photography | 74201 | 69.40% | 631 document(s) |

Table 4: Most similar clusters to SBI for k=500
* For some of the clusters (228, 70) which contain a high percentage of documents that fall in the
same SBI category, a similar cluster (243, 74) has surfaced nearby, having a nearly identical score but
a different number of documents.
When scaling up to 1500 clusters, the average percentage of dominant-category documents becomes
30.2%, which is 2.7 percentage points lower than for k=500. This setup also creates many clusters (over
1000, or 66%) with fewer than 10 documents (see table 5). At level 2 SBI the average percentage of the
dominant category is 44.96%, which is surprisingly higher than the level of similarity found for k=500.
Cluster Most similar to SBI for k=1500

| Cluster | Most dominant level 4/5 sub-category | SBI 2008 | Percentage of documents with SBI code represented by most dominant category | Number of documents with SBI code* |
|---|---|---|---|---|
| 11 | Hairdressing | 96021 | 98.4% | 64 document(s) |
| 1364 | General dental practices | 86231 | 96.7% | 60 document(s) |
| 925 | Practices of psychotherapists and psychologists | 86913 | 96.6% | 58 document(s) |
| 794 | Sale and repair of passenger cars and light motor vehicles (no import of new cars) | 45112 | 95.2% | 21 document(s) |
| 819 | Other interest organizations n.e.c.* | 94997 | 95.2% | 5617 document(s) |
| 64 | Driving schools | 8553 | 94.3% | 1051 document(s) |
| 118 | Beauty treatment, pedicures and manicures, make-up and image consulting | 96022 | 93.9% | 33 document(s) |
| 9 | Non-specialised stores with non-food (no department stores) | 47192 | 92.9% | 14 document(s) |
| 510 | Hairdressing | 96021 | 92.3% | 769 document(s) |

Table 5: Most similar clusters to SBI for k=1500
The results from this section (4.2) lead to the conclusion that when applying k-means mini batch with
a higher value for k, more cohesive clusters appear, while more innovative documents fall in bigger,
less cohesive clusters. k=500 performs best when comparing to the SBI. When looking at the most cohesive
clusters, hairdressers score highest in both the k=500 and k=1500 clustering, and dental practices take
second place in both instances. When looking at more general cohesion using the level 2 SBI categories,
k=1500 is most cohesive. Remarkable is the fact that within both the k=500 and k=1500 clustering, when
comparing to the level 2 SBI categories, category (86) Human health activities is the most dominant
category of the two best performing clusters, in three out of four cases representing 100% of the
documents (see appendix XIII). When looking at the level 2 cohesiveness of k=100, the best performing
cluster is cluster 89, in which the most dominant level 2 category is 74 – Industrial design,
photography, translation and other consultancy, representing 92% of the documents in that cluster. The
second and third best performing clusters are clusters 4 and 66 respectively, in which the most dominant
level 2 category is in both cases 86 - Human health activities, representing 81% and 80% respectively
of total documents in those clusters (see appendix XIII).
4.5. Cluster Stability
Chart 4 below shows the distribution of the stability of resurfacing clusters for the different values
of k. The stability of the clusters is estimated as described in sub-section 3.7.3.
Chart 4: Stability of Resurfacing Clusters (vertical axis: percentage of resurfacing clusters, 0% to 80%; horizontal axis: percentage of resurfacing documents in cluster; series: k = 100, k = 500, k = 1500)
For k=100, 15% of the clusters resurface with 90% to 100% identical documents, meaning that this
part of the clusters will very likely surface each time the algorithm is run. However, 65% of all
clusters show a resurfacing of less than 10% of the documents in the cluster. On average, therefore,
only 15% of the clusters that surface when applying the algorithm will resurface with over 90% of their
documents when it is run again. k=100 has an average stability of 25.71% (see appendix X), meaning
that on average 25.71% of the documents that are clustered together will be clustered together again
when the algorithm is run again. When applying the same test to the results of k=1500, only 11% of the
clusters are 90% to 100% stable and the average stability is 22.73%, meaning that the appearing clusters
are less stable compared to k=100 (see appendix X). Finally, when the same test is applied to k=500, 18%
of the clusters are 90% to 100% stable and the average stability is 32.07% (see appendix X).
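A distribution such as the one reported above can be derived from per-cluster stability scores by counting how many clusters fall into each 10% band. The sketch below is an assumption about how such a tally could be made; the function name and the ten scores are invented for illustration and do not reproduce the appendix data.

```python
def stability_distribution(stabilities, n_bins=10):
    """Return the percentage of clusters per equal-width stability bin.

    stabilities: per-cluster stability scores between 0.0 and 1.0
    Entry i of the result covers [i/n_bins, (i+1)/n_bins); a score of
    exactly 1.0 is counted in the last bin.
    """
    if not stabilities:
        raise ValueError("need at least one stability score")
    counts = [0] * n_bins
    for score in stabilities:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        counts[min(int(score * n_bins), n_bins - 1)] += 1
    return [count * 100 / len(stabilities) for count in counts]


# Invented scores for ten clusters.
toy_scores = [0.05, 0.02, 0.08, 0.95, 1.0, 0.45, 0.09, 0.03, 0.07, 0.92]
distribution = stability_distribution(toy_scores)
```

Here `distribution[0]` is the share of clusters of which less than 10% of the documents resurface, and `distribution[-1]` is the share of clusters that are 90% to 100% stable.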
An interesting fact with regard to the most stable clusters (90% - 100%) is that the companies that ended
up in these clusters offer a very specific service, e.g. bakeries, floors, yoga, education etc. When
looking at the +90% clusters of the k=500 runs, one of the clusters contains 123 different URLs all
leading to the Cool Blue store website (a web shop for electronic devices), e.g.
https://www.3dprinterspecialist.nl/, http://www.fonduesetstore.nl, http://www.stofzuigerstore.nl,
http://www.kookboekstore.nl, http://www.autoradiostore.nl, http://www.bestekstore.nl. Also, many hosting
sites such as Hosting2Go and CCV Shop are among these +90% clusters: URLs which might have belonged to
other companies now all point to the same kind of website, and thus generate similar documents which
are clustered together.
Also remarkable is that while k=100 would theoretically have led to the most stable clustering setup,
in reality the arbitrarily chosen k=500 setup ended up being the most stable. This may be an indication
that the elbow method becomes insufficient when working with larger corpora (see Gupta & Srivastava
(2014: 7), Bholowalia & Kumar (2014) and Daas, Puts, Buelens, & van den Hurk (2015)). Further research
is needed, however, in order to prove this. Alternatives to the currently available methods are required
to determine the best value for k when working with larger corpora.
The conclusion we draw from this is that clustering with k-means mini batch under the given conditions
is not a stable method leading to high quality statistics in accordance with the standards of National
Statistics Institutes (NSIs). Moreover, clustering seems to be neither stable enough nor effective
enough for creating a specific categorization of innovative companies which could be effectively used
by municipalities or other organizations.
Section 5: Discussion and Conclusions
5.1. Discussing Results
Big data implies working with many dimensions and large numbers of data points, which allows
for many options and angles to approach a problem. That means that many choices have to be made
and a lot of optional parameters have to be adjusted. Only a few of these options were used in this
research in order to answer the research question. The results show that with the setup parameters
and pre-processing techniques used, no distinct clusters have appeared which primarily contain
innovative companies. While some clusters in the k=100, k=500 and k=1500 setups do show a relative
overrepresentation of innovative documents, the innovative documents are still distributed over
many clusters. More robust clusters did surface for documents which contain specific descriptions of
the goods and services offered by the corresponding companies, and thus k-means mini-batch clustering
worked fairly well for clustering these particular documents. Clusters which mainly contain innovative
documents did not appear, however, with the pre-processing and algorithm used in this research.
This research also showed that innovative companies tend to end up within various SBI categories and
not only in a remainder category. Besides the fact that the performed experiments have not led to clearly
identifiable innovative clusters, only 11% of the clusters (in k=1500) appeared to be stable (90% -
100%). The initial centroid placement led to a different resurfacing of clusters for the majority of
clusters (MacKay, 2003: 288). This is not nearly sufficient for the standards required for the official
statistical purposes of NSIs and thus Statistics Netherlands (Struijs, Braaksma, & Daas, 2014).
Unfortunately, this means that this method does not transform the data into useful information
for municipalities or other public institutions.
5.2. Answering Research Questions
After collecting the results, we can answer the research questions. For the main research question:
RQ: “In what measure will clusters of innovative companies surface when unsupervised machine
learning is used on the textual content of the webpages of Dutch companies?”
We can definitely state that no clear cohesive clusters of innovative companies have surfaced as a result
of using the techniques in this thesis. Although some clusters surfaced in which innovative URLs were
overrepresented, unfortunately no actual innovative clusters have been found.
With regard to the first sub-question (SQ1):
SQ1: “Does the Common Crawl Archive have a sufficient coverage of the Dutch companies’ websites
circumscribed by Statistics Netherlands for this research?”
Common Crawl was found to have insufficient coverage of Dutch companies. As a consequence, for
this research we needed to collect the data with a direct method. It also meant that at the time of the
research Common Crawl did not prove to be a reliable source for statistical research concerning Dutch
websites, which connects with the concern posed by Daas, Puts, Buelens, & van den Hurk (2015), about
missing data in Big Data analytics for official statistics.
With regard to the second sub-question (SQ2):
SQ2: “Do the clusters that surface show similarities to the SBI, and in what measure are these clusters
cohesive and stable?”
For companies which offer services or goods which are explicitly circumscribed, e.g. hairdressers and
dentists, the surfaced clusters tend to be very similar (up to 98.5%) to the level 4/5 categories of the
SBI. For the most part, however, the clusters were less similar to the SBI. The similarity and
cohesiveness are shown in table 2 in sub-section 4.2.2. While the k=500 clustering is most similar to
the SBI, the k=1500 clustering is most cohesive when comparing to level 2 SBI categories. The most cohesive
clusters for k=500 and k=1500 fall within the level 2 SBI category (86) Human health activities. For
k=100 this is (74) Industrial design, photography, translation and other consultancy, followed by (86)
Human health activities.
For k=100, only 15% of the clusters resurface with 90% to 100% of their documents. For k=500 this was
18%, and for k=1500 11%. For the majority of the clusters, less than 10% of the documents resurfaced
within that cluster: for k=100 this is 65%, for k=500 57% and for k=1500 69% (see chart 4).
This shows that for most documents in a cluster, and thus websites, clustering under the chosen
conditions does not lead to consistent results. Therefore, it can be said that the clustering is very
much affected by initial centroid placement, as explained by MacKay (2003: 288).
5.3. Discussing Shortcomings
This research has been about clusters that have naturally surfaced with the use of the k-means mini
batch algorithm. No alternative pre-processing (e.g. POS tagging) has been performed on the corpus.
Another shortcoming of this research is the fact that only a limited number of innovative companies
were available for validation purposes, while the list of URLs would have contained many more
unidentified innovative documents. A more complete set of innovative documents would have led
to clearer results. Nevertheless, the data and scripts from this research could be used, when a more
complete list of URLs belonging to innovative companies is made available or is constructed, to further
analyse the way the innovative companies are distributed amongst the clusters. While an added list of
innovative URLs might lead to more insight into their distribution, it is highly unlikely it would
change the final conclusions of this research regarding the surfacing of innovative clusters.
5.4. Recommendations to Statistics Netherlands
The problem posed by Statistics Netherlands, that municipalities and other local public institutions
could not identify innovative companies with the current SBI, could not be solved by applying k-means
mini batch with the chosen setup. It has thus become apparent that the main texts of innovative documents
are not sufficiently different to be clustered separately from the non-innovative documents. It is
unlikely that further fine-tuning would lead to better results concerning the innovative companies, nor
is it likely that using different unsupervised algorithms would lead to a better result. While this
research was focussed on the use of unsupervised techniques to distinguish innovative documents from
non-innovative documents, for further research a supervised machine learning algorithm is recommended.
The innovative companies which are already identified could then be used as training/validation sets
and test sets in a supervised machine learning setup. Ikonomakis, Kotsiantis, & Tampakas (2005)
mention various algorithms which are used for the classification of documents. These are Naïve Bayes,
Decision Tree, Nearest Neighbor, Support Vector Machines or a combination of these as an ensemble
machine learning setup. When a successful method for categorizing innovative companies is found,
however, Statistics Netherlands should note that under current EU regulations it cannot be used to
replace the existing SBI. This is because EU regulation No 1893/2006 forces Statistics Netherlands to
follow the structure of the NACE. Statistics Netherlands would thus not be able to formally change the
SBI (which follows the NACE structure) independently from other EU Member States (Centraal Bureau
voor de Statistiek, 2017c). An alternative categorization system might exist next to the current
SBI, however, in order to meet the demands of municipalities and other organisations.
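To make the recommended supervised setup more concrete, the sketch below implements a very small multinomial Naïve Bayes classifier, one of the algorithm families mentioned above. It is written from scratch for illustration only and was not part of this research. The class name, the whitespace tokenisation, the two labels and the four training texts are all assumptions; a real follow-up study would use the labelled corpus and a proper train, validation and test split.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(documents, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # log prior of the class
            for word in text.lower().split():
                if word in self.vocabulary:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training texts and labels, for illustration only.
train_docs = [
    "smart sensor platform for drone analytics",
    "machine learning software for robotics",
    "haircut and hair colouring salon",
    "fresh bread and pastry bakery",
]
train_labels = ["innovative", "innovative", "other", "other"]

classifier = NaiveBayesTextClassifier().fit(train_docs, train_labels)
prediction = classifier.predict("robotics software with sensor analytics")
```

The known innovative companies would supply the positive labels in such a setup, which is why a more complete list of innovative URLs matters more for a supervised approach than for clustering.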
As shown in sub-section 3.1, most innovative URLs did not end up in a remainder category, while
according to Daas this is one of the problems Statistics Netherlands has with categorizing new
innovative companies. This suggests that the innovative companies chosen for this research do not
represent the full scope of innovative companies that Statistics Netherlands is struggling with. While in
the underlying research it was clear that innovative documents were scattered amongst many different
clusters and thus no clear innovative cluster or set of clusters was formed, it could be of added value
for further research to create a more representative data set of innovative documents, especially if a
supervised machine learning algorithm is applied. The data collected throughout this research, as well
as the scripts for collecting and processing the data, have been made available to Statistics
Netherlands and can thus be used for further research. The textual corpus created throughout this
research can also be used for further research when applying supervised machine learning techniques.
It is further advised not to use Common Crawl as a data source for official statistics. As shown in this
research, the coverage of Common Crawl for the list provided by Statistics Netherlands improved
between September 2016 and March 2017 by 2% to a little over 29% coverage, but that is still
insufficient. The coverage will likely continue to improve, but there is still a long way to go before
it can be used as a valid source for similar statistical research, at least for the list of URLs
provided by Statistics Netherlands. In other words, Common Crawl is missing too much data to fully rely
on for statistical purposes (Daas, Puts, Buelens, & van den Hurk, 2015).
Since the majority of the clusters that did appear proved to be unstable, clustering webpage documents
with k-means mini batch does not seem to lead to the stable results needed by Statistics Netherlands. It
is therefore advised not to use k-means mini batch for statistical purposes, at least not in projects
similar to this research (Daas, Puts, Buelens, & van den Hurk, 2015).
5.5. Directions for further research
In this study it has become apparent that documents which were labelled as “innovative” did not
naturally cluster together when applying the k-means mini batch algorithm. While studies about the
clustering of webpages do exist (especially for creating or improving search engines: Zang, Pang, Xie,
& Wu, 2006; Zeng, He, Chen, Ma, & Ma, 2004), no specific studies could be found that deal with
clustering on a higher abstraction level when working with large data sets. The elbow found in this
research did not lead to the optimal number for k, as would follow from Gupta & Srivastava (2014: 7)
and Bholowalia & Kumar (2014). The curve found in section 4.1 shows much volatility, which is in
accordance with Daas, Puts, Buelens, & van den Hurk (2015), who mark volatility as one of the
characteristics of Big Data. Different methods should thus be researched for determining the optimal
number for k when working with larger corpora.
Cited Works
Abualigah, L. M., Khader, A. T., Al-Betar, M. A., & Alomari, O. A. (2017). Text feature selection with
a robust weight scheme and dynamic dimension reduction to text document clustering. Expert
Systems With Applications, 24-36.
Almeida, L. G., Vasconcelos, A. T., & Maia, M. A. (2009). A Simple and Fast Term Selection
Procedure for Text Clustering. In N. Nedjah, L. de Macedo Mourelle, J. Kacprzyk, F. M.
França, & A. F. de Souza, Intelligent Text Categorization and Clustering (pp. 47-64). Berlin:
Springer.
Bellman, R. E. (1957). Dynamic programming. Princeton: Princeton University Press.
Bessant, J., & Tidd, J. (2011). Innovation and Entrepreneurship. Chichester: John Wiley & Sons Ltd.
Bholowalia, P., & Kumar, A. (2014). EBK-Means: A Clustering Technique based on Elbow Method
and K-Means in WSN. International Journal of Computer Applications, 17-24.
Bradley, P. S., Bennett, K. P., & Demiriz, A. (2000). Constrained K-Means Clustering. Microsoft
Research Technical Report (MSR-TR) 2000-65, 1-9.
Centraal Bureau voor de Statistiek. (2016, September 27). CBS start uniek initiatief voor big data-
onderzoek. Retrieved from CBS.nl: https://www.cbs.nl/nl-nl/nieuws/2016/39/cbs-start-uniek-
initiatief-voor-big-data-onderzoek.
Centraal Bureau voor de Statistiek. (2017a, September 8). CBS Urban Data Centre the Hague
Launched. Retrieved from CBS.nl: https://www.cbs.nl/en-gb/corporate/2017/26/cbs-urban-
data-centre-the-hague-launched
Centraal Bureau voor de Statistiek. (2017b, August 20). SBI 2008 - Standaard
bedrijfsindeling 2008. Retrieved from CBS.nl: https://www.cbs.nl/nl-nl/onze-
diensten/methoden/classificaties/activiteiten/sbi-2008-standaard-bedrijfsindeling-2008
Centraal Bureau voor de Statistiek. (2017c, August 17). Standard Industrial Classifications (Dutch
SBI 2008, NACE and ISIC). Retrieved from CBS.nl: https://www.cbs.nl/en-gb/our-
services/methods/classifications/activiteiten/standard-industrial-classifications--dutch-sbi-
2008-nace-and-isic--
Centraal Planbureau. (2016). Kansrijk innovatiebeleid. Den Haag: Centraal Planbureau.
Cho, H., & An, M. K. (2014). Co-Clustering Algorithm: Batch, Mini-Batch, and Online. International
Journal of Information and Electronics Engineering, 340-346.
Commoncrawl.org. (2017, April 04). CC-mrjob. Retrieved from Common Crawl:
https://github.com/commoncrawl/cc-mrjob
Cutting, D. R., Karger, D. R., Pedersen, J. O., & Tukey, J. W. (1992). Scatter/Gather: a cluster-based
approach to browsing large document collections. SIGIR.
Daas, P. J., Puts, M. J., Buelens, B., & van den Hurk, P. A. (2015). Big Data as a Source for Official
Statistics. Journal of Official Statistics, 249-262.
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A Density-Based Algorithm for Discovering
Clusters. Proceedings of the Second International Conference on Knowledge Discovery and
Data Mining (KDD-96), 226-231.
Feldman, R., & Sanger, J. (2007). The text Mining Handbook: Advanced Approaches in Analyzing
Unstructured Data. Cambridge: Cambridge University Press.
Frey, B. J., & Dueck, D. (2007). Clustering by Passing Messages. Science, 972-974.
Furnkranz, J. (1998). A Study Using n-gram Features for Text Categorization. Wien: Austrian Research
Institute for Artificial Intelligence.
Gupta, H., & Srivastava, R. (2014). k-means Based Document Clustering with automatic "k" Selection
and Cluster Refinement . International Journal of Computer Science and Mobile Applications,
7-13.
Hollanders, H., & Es-Sadki, N. (2017). European Innovation Scoreboard 2017. Brussels: European
Commission.
Ikonomakis, I., Kotsiantis, S., & Tampakas, V. (2005). Text Classification Using Machine Learning
Techniques. WSEAS TRANSACTIONS on COMPUTERS, 966-974.
Inderjit, S. D., & Dharmendra, S. M. (2000). Concept Decompositions for Large Sparse Text Data using
Clustering. IBM Research Report RJ 10147.
Kohonen, T., Kaski, S., Lagus, K., Salojarvi, J., Honkela, J., Paatero, V., & Saarela, A. (2000). Self
organization of a massive document collection. IEEE Transactions, 574-585.
Kulis, B., & Jordan, M. I. (2012). Revisiting k-means: New Algorithms via Bayesian. Proceedings of
the 29th International Conference on Machine Learning (ICML-12) (pp. 513-520). WUSTL
Machine Learning Group.
MacKay, D. J. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge:
Cambridge University Press.
Manning, C. D., Raghavan, P., & Schütze, H. (2009). An introduction to Information Retrieval.
Cambridge: Cambridge University Press.
McKeown, K. R., Barzilay, R., & Evans, D. (2002). Tracking and summarizing news on a daily basis
with Columbia's Newsblaster. HLT.
Miner, G., Delen, D., Elder, J., Hill, T., & Nisbet, R. (2012). The Seven Pratice Areas of Text Analysis.
In G. Miner, D. Delen, J. Elder, T. Hill, & R. Nisbet, Practical Text Mining and Statistical
Analysis for Non-Structured Text Data Applications (pp. 29- 41). Amsterdam: Amsterdam.
Mitchell, R. (2015). Web Scraping with Python. Sebastopol: O'Reilly books.
OECD. (2014). OECD Reviews of Innovation Policy: Netherlands. Paris: OECD.
Raschka, S. (2015). Python Machine Learning. Birmingham: Packt Publishing.
Rijsbergen, C. J. (1989). Information Retrieval. London: Butterworths.
Sarkar, D. (2016). Text Analysis with Python. Bangalore: Apress.
Schumpeter, J. A. (1975). Capitalism, Socialism and Democracy. New York: Harper.
Scott, B. R. (2011). Capitalism: Its Origins and Evolution as a System of Governance. Boston: Springer.
Sheshasayee, A., & Thailambal, G. (2016). A Study on K-means Clustering in Text Mining Using
Python. International Journal of Computer Systems, 560-564.
Sridhar, S. (2014). Design and Analysis of Algorithms. New Delhi: Oxford University Press.
Steinbach, M., Karypis, G., & Kumar, V. (2000). A Comparison of Document Clustering Techniques.
Proceedings of the International KDD Workshop on Text Mining (pp. 1-20). Minneapolis:
Department of Computer Science and Engineering, University of Minnesota.
Struijs, P., Braaksma, B., & Daas, P. J. (2014). Official statistics and Big Data. Big Data & Society, 1-
6.
UN Classifications Registry. (2017, August 18). Retrieved from UNSD:
https://unstats.un.org/unsd/cr/registry/regcs.asp?Cl=27&Lg=1&Co=63
Wang, S., & Koopman, R. (2017). Clustering articles based on semantic similarity.
Scientometrics, 1017-1031.
Wilbur, J., & Sirotkin, K. (1992). The automatic identification of stopwords. Journal of Information
Science, 45-55.
Witten, I. H., Frank, E., & Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and
Techniques. Amsterdam: Morgan Kaufmann.
World Economic Forum. (2017). The Global Competitiveness Report 2016-2017. Geneva: World
Economic Forum.
Yadav, K., & Baria, J. (2014). Mini-Batch K-means Clustering Using Map-Reduce in Hadoop.
International Journal of Computer Science and Information Technology Research, 336-342.
Zang, H., Pang, B., Xie, K., & Wu, H. (2006). An Efficient Algorithm for Clustering Search Engine
Result. International Conference, Computational Intelligence and Security (pp. 661-671).
Guangzhou: Springer.
Zeng, H. J., He, Q. C., Chen, Z., Ma, W. Y., & Ma, J. (2004). Learning to cluster web search results.
Special Interest Group on Information Retrieval.
Appendix I: Common Crawl Results
Percentage of URLs found in Common Crawl, per sample
Sample | Results September 2016 | Results March 2017
1 | 28.0% | 27.8%
2 | 28.3% | 30.3%
3 | 29.7% | 30.6%
4 | 28.2% | 30.8%
5 | 28.8% | 28.9%
6 | 25.8% | 29.9%
7 | 27.1% | 30.5%
8 | 28.2% | 27.8%
9 | 27.5% | 27.8%
10 | 28.2% | 29.1%
Average | 27.98% | 29.35%
Running Time | 3:31:44 | 2:22:45
Appendix II: Scraping without Grequest – Language Detection
Table i: Extracting HTML
Sample | Produced an error | Returned no text | Successfully extracted | Time
1 | 17.40% | 19.30% | 63.30% | 0:43:29
2 | 18.00% | 19.40% | 62.60% | 0:44:28
3 | 16.30% | 18.60% | 65.10% | 0:43:19
4 | 15.60% | 20.70% | 63.70% | 0:29:27
5 | 15.80% | 19.90% | 64.30% | 0:34:05
6 | 16.60% | 17.70% | 65.70% | 0:28:39
7 | 16.30% | 19.70% | 64.00% | 0:40:27
8 | 16.50% | 16.50% | 67.00% | 0:40:57
9 | 18.60% | 16.10% | 65.30% | 0:31:21
10 | 17.10% | 19.70% | 63.20% | 0:37:08
Total | | | | 6:13:20
Mean | 16.82% | 18.76% | 64.42% | 0:37:20
Std. dev. | 0.009542 | 0.015226 | 0.013456 | 0.00421

Table ii: Language detection
Language | URLs
af | 1.26%
ca | 0.24%
cs | 0.01%
cy | 0.06%
da | 0.28%
de | 0.25%
en | 15.32%
es | 0.07%
et | 0.06%
fi | 0.02%
fr | 0.27%
hr | 0.17%
hu | 0.01%
id | 0.04%
it | 0.13%
lt | 0.02%
lv | 0.01%
nl | 81.20%
no | 0.14%
pl | 0.08%
pt | 0.05%
ro | 0.08%
sk | 0.03%
sl | 0.03%
so | 0.03%
sq | 0.02%
sv | 0.06%
tl | 0.04%
tr | 0.02%
Appendix III: Innovative Documents per Cluster k = 100
Cluster | Innovative Documents | Total Documents | Percentage Total Documents | Percentage Total Innovative Documents | Difference
0 | 3 | 6652 | 1.5% | 1.32% | -0.18%
1 | 3 | 12836 | 2.89% | 1.32% | -1.57%
3 | 1 | 4323 | 0.97% | 0.44% | -0.53%
4 | 3 | 8356 | 1.88% | 1.32% | -0.56%
7 | 6 | 23793 | 5.35% | 2.63% | -2.72%
9 | 1 | 3551 | 0.8% | 0.44% | -0.36%
10 | 5 | 3394 | 0.76% | 2.19% | 1.43%
12 | 60 | 104918 | 23.61% | 26.32% | 2.71%
13 | 1 | 2745 | 0.62% | 0.44% | -0.18%
21 | 20 | 57699 | 12.98% | 8.77% | -4.21%
26 | 1 | 8990 | 2.02% | 0.44% | -1.58%
28 | 2 | 6496 | 1.46% | 0.88% | -0.58%
29 | 4 | 14483 | 3.26% | 1.75% | -1.51%
31 | 1 | 8734 | 1.97% | 0.44% | -1.53%
33 | 3 | 4286 | 0.96% | 1.32% | 0.36%
34 | 3 | 5133 | 1.15% | 1.32% | 0.17%
36 | 3 | 2638 | 0.59% | 1.32% | 0.73%
40 | 16 | 34871 | 7.85% | 7.02% | -0.83%
47 | 3 | 1690 | 0.38% | 1.32% | 0.94%
48 | 3 | 5892 | 1.33% | 1.32% | -0.01%
49 | 10 | 6031 | 1.36% | 4.39% | 3.03%
56 | 2 | 4442 | 1.0% | 0.88% | -0.12%
61 | 8 | 11567 | 2.6% | 3.51% | 0.91%
69 | 2 | 2029 | 0.46% | 0.88% | 0.42%
71 | 1 | 4602 | 1.04% | 0.44% | -0.6%
77 | 4 | 4843 | 1.09% | 1.75% | 0.66%
79 | 1 | 5816 | 1.31% | 0.44% | -0.87%
80 | 2 | 16240 | 3.65% | 0.88% | -2.77%
83 | 36 | 12654 | 2.85% | 15.79% | 12.94%
85 | 1 | 3342 | 0.75% | 0.44% | -0.31%
87 | 1 | 6700 | 1.51% | 0.44% | -1.07%
90 | 1 | 4125 | 0.93% | 0.44% | -0.49%
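
Judging from the rows, the Difference column is the cluster's share of innovative documents minus its share of all documents, in percentage points. The following minimal Python sketch illustrates that reading on three rows copied from the table; the function and variable names are hypothetical and are not taken from the thesis code.

```python
# Three rows copied from the table above:
# (cluster, percentage of total documents, percentage of total innovative documents)
rows = [
    (0, 1.50, 1.32),
    (12, 23.61, 26.32),
    (83, 2.85, 15.79),
]


def difference(pct_total, pct_innovative):
    """Percentage-point gap: share of innovative documents minus share of all documents."""
    return round(pct_innovative - pct_total, 2)


# Map each cluster to its difference; positive values mean innovative
# documents are over-represented in that cluster.
differences = {cluster: difference(total, innov) for cluster, total, innov in rows}

# Cluster with the largest over-representation among the sample rows
most_overrepresented = max(differences, key=differences.get)
print(differences, most_overrepresented)
```

A positive value marks a cluster that holds a larger share of the innovative documents than of the corpus as a whole.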
Appendix IV: Sample Websites Cluster Quality Based on URL Names and Website Spot-checks
The original URL belonging to a document can reveal much about the kind of website it leads to,
e.g. (1) http://www.mobilecarfix.nl, http://www.carwrapservice.nl and http://www.autobedrijfkhan.nl
(cluster 1) are all related to cars; (2) http://www.dekkerfietsen.com, http://www.tielemanfietsen.nl and
http://www.fietsershoptslimmer.nl (cluster 47) are all bicycle related; and (3) http://www.massage4all.nl,
http://www.liesbethmassagepraktijk.nl and http://www.mvmassage.nl (cluster 67) all offer
some kind of massage service. When one of these three examples is found in the sample of a cluster
along with other similar URLs, it is clear that the cluster has a cohesive theme.
In the three examples above the URLs themselves contain recurring words, e.g. “fiets”, which makes them
easy to identify when looking through them manually. Another way to approach this is to look at
the string similarity of the URLs. To automate this, the Python class difflib.SequenceMatcher
was used in combination with itertools.combinations on samples of 300 URL second-level names, e.g.
‘dekkerfietsen’, per cluster. This yields all possible pairs of URL names per cluster together with their
similarity scores. For each cluster the average of these similarity scores is calculated, which gives a
single score indicating how similar the URL names in that cluster are.
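
The pairwise comparison described above can be sketched as follows. This is a minimal illustration and not the thesis code: the helper names are hypothetical, the second-level name is taken naively as the label directly left of the top-level domain, and only three of the bicycle URLs are used instead of a sample of 300.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean
from urllib.parse import urlparse


def second_level_name(url):
    """Return the label left of the top-level domain, e.g. 'dekkerfietsen'."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Simplification (assumption): a one-label suffix such as .nl or .com
    return labels[-2] if len(labels) >= 2 else host


def cluster_url_similarity(urls):
    """Average SequenceMatcher ratio over all unordered pairs of URL names."""
    names = [second_level_name(u) for u in urls]
    ratios = [SequenceMatcher(None, a, b).ratio() for a, b in combinations(names, 2)]
    return mean(ratios) if ratios else 0.0


# Three URLs from the bicycle example above (cluster 47)
sample = [
    "http://www.dekkerfietsen.com",
    "http://www.tielemanfietsen.nl",
    "http://www.fietsershoptslimmer.nl",
]
score = cluster_url_similarity(sample)
print(f"{score:.2%}")
```

With n names per cluster this compares n(n-1)/2 pairs, which for a sample of 300 amounts to 44,850 comparisons per cluster.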
This method, however, is only effective when the URL name explicitly displays the product or service
the company is offering, e.g. cars (cluster 1: 27.53%), bikes (cluster 47: 28.53%), floors (cluster 51:
27.04%) or photography (cluster 89: 43.20%). Notice that most clusters without a clear similarity
score between 22 and 24 percent. For the clusters in which the product or service is explicitly
mentioned, the score ranges from 26% to 45%. The instances where a very high percentage was found, up to
100%, were without exception very small clusters containing only 2 or 3 URLs. These scores
should not be taken into account, since these are usually identical documents with near-identical URL
names, which should have been clustered with other documents. Other clusters score low because
the URL name itself does not reveal much about the product or service offered, e.g. (1)
http://www.vanleeuwen-advocaat.nl, http://www.ankerenanker.nl and http://www.barentskrans.nl,
which all belong to lawyers or law firms. In these cases, some of the websites needed to be visited in order
to determine whether or not there is a measure of coherence within the cluster. Other examples of this
are (2) http://www.operagorinchem.nl, http://www.sari-djaya.nl and
http://www.venezia-wijk-bijduurstede.nl (cluster 5), which all turned out to be food delivery services
affiliated with Thuisbezorgd.nl; and (3) http://eazyict.nl, http://www.djmissdeedy.com and
http://www.hoenuverder.nl (cluster 13), which were all domain names reserved by a company called TransIP.
A second way to spot cohesiveness within clusters is to look at the top terms of the cluster.
The top terms of the law firm cluster above indicate cohesion within that cluster: “cli cli ebnt ebnt wij
zorg onz kantor person werk juridisch begeleid mens mogelijk goed behandel”. Note that some of the terms
above are not correctly written Dutch, since stemming is applied before clustering. An example of a
non-cohesive cluster has the following URLs: http://www.hoogtechniek.eu, http://www.kimmenkehorst.nl and
http://www.meubelherstel.nl, and the following top terms: “info mail btw tel mail info kvk contact fax
onz den wij telefon all nummer com”. These top terms merely contain general information that is most
likely to be found on many Dutch websites. See the table below for results.
Clusters * URLs Topic Score Top Terms percentage
similarity
in URL
0 http://www.mobilecarfix.nl,
http://www.carwrapservice.nl,
http://www.autobedrijfkhan.nl,
http://www.derooyautoschade.nl,
http://www.bpam.nl,
http://www.autoenfiscus.nl,
http://www.brouwerlpg.nl,
http://www.rhcleaningproducts.nl,
http://www.gertbrandsen.nl,
http://www.autodemon-
tagevanderven.nl
Cars and
related to
cars
2 auto wij onderhoud onz servic
merk kunt reparatie all nieuw
car verkop bent terecht goed
27.53%
1
http://www.veba-elektro.nl,
http://www.warmerdam-
lichtwerk.nl,
http://www.htogroep.nl,
http://www.dcmbeheer.nl,
http://www.profectadvies.nl,
http://www.frankenesveld.nl,
http://www.kastenwanden.nl,
http://www.planhus.nl,
http://www.aannemingsbed-
rijfkooistra.nl,
http://www.dirkjankarsten.nl
Housing
and affili-
ated
2 woning bouw verbouw wij re-
novatie nieuwbouw project
huis onderhoud onz werkzam
goed kunt badkamer nieuw
23.93%
2 http://www.vrolijk.nl,
http://www.wishpel-vijver.nl,
http://www.cipela.nl,
http://www.palletwagenshop.nl,
http://www.gevelridder.nl,
http://www.lundbypoppenhuis.nl,
http://www.estherkrop.nl,
div. web-
shops
1 normal prijs special javascript
browser btw functionaliteit
websit browser javascript
functionaliteit incl uitgescha-
keld
23.95%
43
http://www.polarprofilters.nl,
http://www.bedankjes.nl,
http://www.water-
sportartikelen.com
3 http://www.dntw.nl,
http://www.vanleeuwen-advo-
caat.nl, http://www.notarisdos-
sier.nl, http://www.alkcare.nl,
http://www.bekendeparag-
nost.com, http://www.ankerenan-
ker.nl, http://www.dub-
belgenieten.nl,
http://www.keijservandervelden.n
l, http://www.sibb.nl
lawyers
and
notery
2 cli cli ebnt ebnt wij zorg onz
kantor person werk juridisch
begeleid mens mogelijk goed
behandel
23.21%
4 http://www.fysiodenham.nl,
http://www.smelik.huisarts-
plus.nl, http://www.tandartsen-
praktijkburgum.nl, http://www.ka-
relmatla.praktijkinfo.nl,
http://www.fysta.nl,
http://www.fysiotherapiegale-
cop.nl, http://www.petriepedi-
cure.nl, http://www.life-in-en-
ergy.nl, http://www.fysiotherapie-
heuvelland.nl, http://www.geels-
psychotherapie.nl
2 praktijk behandel pati pati
ebnt kunt onz ebnt informatie
wij afsprak medisch websit
welkom zorg therapie .
25.23%
5 http://www.operagorinchem.nl,
http://www.sari-djaya.nl,
http://www.venezia-wijk-
bijduurstede.nl,
http://www.snackhouse-twins.nl,
http://www.eethuis-bon-ap-
petit.nl, http://www.seleraanda-
amstelveen.nl, http://www.pizze-
ria-popeye.nl, http://www.cleopat-
ragrillenschede.nl,
http://www.bellamilanomoor-
drecht.nl
Food de-
livery,
thuisbe-
zorgd.nl
2 beoordel med gemaakt websit
mogelijk bekijk lekker heer-
lijk eten grot keuz warm
super goed vandag
25.00%
44
7 http://www.luukimberg.nl,
http://www.phytofemme.nl,
http://www.chrispellefoto-
grafie.nl, http://www.rijschool-
hennyleenen.nl, http://www.prins-
synergy.nl,
http://www.phdevriesrijsoord.nl,
http://www.cnip.nl,
http://www.joyboelens.nl,
http://www.sionsluis.nl,
http://www.npvbommel-
erwaard.com
No clear
similari-
ties be-
tween
docu-
ments
-2 lev jouw jij jou mens werk je-
zelf goed mak war wer gan an-
der person wet .",
23.13%
8 http://www.mnprojecten.nl,
http://www.friendlydolphin.nl,
http://www.prominent-vast-
goed.nl, http://www.vankan-
dronten.nl,
http://www.kastdesign.nl
No clear
similari-
ties be-
tween
docu-
ments
-2 gerealiseerd garantie project
eig jar jar ervar ervar zzp
goedkop gezin gezond ging
glas goed goed adres
26.69%
9 http://www.bregtjedeboer.nl,
http://misterbassman.nl,
http://www.kinderboekwin-
keldegiraf.nl, http://www.falcon-
air-online.nl, http://www.wil-
lemdegroot.nl,
http://www.anessche.nl,
http://www.margriet4kids.nl,
http://www.pannen-
koekenboerderij.com,
http://www.uitvaartvereniging-
dokkum.nl, http://www.noord-
stee.nl
No clear
similari-
ties be-
tween
docu-
ments
-2 addy var document getele-
mentbyid prefix path getele-
mentbyid document var addy
text spambot prefix var var
path path var var prefix path
prefix
22.93%
10 http://www.oploss-
ingsgerichtdenkenenwerken.com,
http://www.kiwienzo.nl,
http://www.woonmallvil-
laarena.com, http://www.zwem-
bad-info.nl, http://www.monni-
kendam.nl, http://www.relatiebe-
middeling-info.nl, http://www.in-
ternetconnections.nl,
http://www.smale.nl,
No clear
similari-
ties be-
tween
docu-
ments
-2 cookies gebruik cookies ge-
bruik websit onz maakt ge-
bruik gebruikt instell onz web-
sit informatie maakt wij brow-
ser sit klik
22.75%
45
http://www.artinsteel.nl,
http://www.roflexinternational.nl
12 http://www.deballetboetiek.nl,
http://www.bleijmakelaardij.nl,
http://www.movietrader.nl,
http://www.gizo.nl,
http://www.bouwbedrijfmeyer.nl,
http://www.apeldoornsegolfkam-
pioenschappen.nl,
http://www.deleukstelu-
iertaarten.nl,
http://www.tomoveyourbody.nl,
http://www.frendz.nl,
http://www.bedrijvenuitvoor-
burg.nl
No clear
similari-
ties be-
tween
docu-
ments
-2 onz kunt nieuw info wij all jar
product mak contact neder-
land informatie mail mogelijk
via .",
22.97%
13 http://eazyict.nl,
http://www.djmissdeedy.com,
http://www.hoenuverder.nl,
http://www.studyrussian.nl,
http://www.metmeerplezi-
erproductief.nl, http://baba-
nicongo.org, http://www.het-vo-
gelparadijs.nl, http://www.de-
onlinetafelwinkel.com,
http://www.honigfabriek.org,
http://www.freshfashionguy.com
Websites
reserved
by Tran-
sIp
2
onz kunt nieuw info wij all jar
product mak contact neder-
land informatie mail mogelijk
via
22.41%
15 http://www.shemalesexdates.nl,
http://www.betaaldesexdates.nl
Two
URLs =
Same
Destina-
tion
0 zin toegang led gratis krijg
will mak mann functies ac-
count automatisch gebruiker
sprek aanmeld word
77.42%
16 http://www.javatimmerwerken.nl,
http://www.ovensvoordeindus-
trie.nl, http://www.franje.com,
http://www.correctsystems.nl,
http://www.desidesign.nl,
http://www.correct-systems.nl,
http://www.toolsupport.nl,
http://www.beachline.nl,
No clear
similari-
ties be-
tween
docu-
ments
-2 afbeeld www material tech-
nisch bureau advies allen stan
klar verlop rendement glas ex-
pert hog vaandel vaandel pro-
ductie
25.01%
46
http://www.loopbaanineigen-
hand.nl, http://www.eck-
hardtbouw.nl
17 http://www.defuik.nl,
http://www.landgoeddesalen-
tein.nl, http://www.watermolen-
singraven.nl, http://www.mya-
sia.nu, http://www.restaurant-
gustavino.com, http://www.de-
nieuwenhofvoorst.nl,
http://www.bijdeluts.nl,
http://www.restaurantxiexie.nl,
http://www.napoli-maastricht.nl,
http://www.residencerhenen.nl
Restau-
rants
2 restaurant gerecht diner reser-
ver geniet heerlijk gezell wij
onz eten lunch keuk kunt caf
terras
25.73%
19 http://www.diederikstevens.com,
http://www.michielmeijers.nl,
http://www.kraaima-media.nl,
http://www.jaike.nl,
http://www.lifejoy.nl,
http://www.denb-retail.nl,
http://www.janwil-
lemvandegroep.com,
http://www.indepaskamer.nl,
http://www.lisastolk.nl,
http://www.margotcpol.nl
No clear
similari-
ties be-
tween
docu-
ments
-2 mor spannend social media
druk social media tegelijker-
tijd sted event welk onlin
kwam ten zakelijk bezig
24.73%
20 http://www.sprenkeler-pr.nl,
http://www.nldcommunicatie.nl,
http://www.birdycommuni-
catie.nl, http://www.fhcommuni-
catie-advies.nl, http://www.green-
communicatie.nl, http://www.ku-
buscommunicatie.nl,
http://www.proracom.nl,
http://www.herdercommuni-
catie.nl, http://www.breed-
beeld.nl, http://www.com-
passcommunicatie.nl
Commu-
nication
2 communicatie intern adviseur
advies rad ontwikkel onder-
wijs nieuw huisstijl hoger
partner strategisch activiteit
strategie diver
43.41%
21 http://www.ikhebeenverstop-
ping.nl, http://www.dakpannen-
verkoop.nl,
http://www.thephoneshopper.nl,
No clear
similari-
ties be-
tween
-2 wij onz kunt product lever
klant grag contact kwaliteit
23.21%
47
http://www.carreaux.nl,
http://www.reproxchange.nl,
http://www.kaasadministraties.nl,
http://www.mgbrandhout.nl,
http://www.cultwheels.nl,
http://www.hetsterrenhuis.nl,
http://www.w-sec.nl
docu-
ments
servic mogelijk all bedrijf
goed mak
22 http://www.bruidsbeurslim-
burg.nl, http://www.limburgreno-
veert.nl, http://www.velde-
kekids.nl, http://www.wat-
tedoeninlimburg.nl,
http://www.meteolimburg.nl,
http://www.veldekeremunj.nl,
http://www.wpm.nl,
http://www.lim-
burgsvakantiehuis.nl,
http://www.hsdgroep.nl,
http://www.advlimburg.nl
Limburg 2 limburg rendement reinig
energie rest kop hog hoeft wij
onderhoud kwalitatief jaarlijk
kijk mei hoogwaard
25.70%
24 http://www.kwispelstaartje.nl,
http://www.doggyfun.nl,
http://www.canilos.org,
http://www.trimsalondiane.nl,
http://www.hus-walkabout.nl,
http://www.hondenschoon.nl,
http://www.kwispel-tijd.nl,
http://www.heppie-hond.nl,
http://www.taketheleash.nl,
http://www.uniquedog.nl
Dogs 2 hond dier wij onz kunt goed
welkom gedrag all wandel be-
handel natur vind informatie
les
25.38%
26 http://www.mzcwaalwijk.nl,
http://www.beppebaukje.nl,
http://www.schietbaan.com,
http://www.tcroomburg.nl,
http://www.waterland.nl,
http://www.dierenkliniek-
dewaard.nl, http://www.vishen-
kkok.nl, http://www.recreama.nl,
http://www.bijoumoderne.nl,
http://www.spesautobanden.nl
No clear
similari-
ties be-
tween
docu-
ments
-2 uur uur uur vrijdag zaterdag
maandag vrijdag uur dinsdag
wij donderdag openingstijd
zondag onz woensdag geslot
info
24.03%
27 http://www.livemediums.nl,
http://www.datcoaching.nl,
Coaching 2 coach hartelijk beroep eig suc-
cesvoll talent onderwijs
40.51%
48
http://www.edu1coach.nl,
http://www.smart-coach.nl,
http://www.p-coach.nl,
http://www.resultaatgericht-
coachen.nl
voorop var doe maatwerk ver-
ander inzicht waarin duurzam
28 http://www.hoogtechniek.eu,
http://www.kimmenkehorst.nl,
http://www.meubelherstel.nl,
http://www.duvah.nl,
http://www.kneib.com,
http://www.powerkilo.nl,
http://www.sketchuppro.eu,
http://www.eurobalans.nl,
http://www.wijkbouw.nl,
http://www.duinker.eu
No clear
similari-
ties be-
tween
docu-
ments
-2 right all right right reserved re-
served all copyright wij onz
websit kunt designed info con-
tact nieuw welkom
23.26%
29 http://www.twentyfour-shops.nl,
http://www.quicklunchshop.nl,
http://www.dekleineveer-
sepoort.nl,
http://www.wijnboerderijvlaar-
dingen.nl,
http://www.pouww.keurslager.nl,
http://www.grandcafededijk.nl,
http://www.ravanello.nl,
http://www.breshulpmiddelen.nl,
http://www.onsbakhuis.nl,
http://www.lekkerkoken.nu
food 1 onz heerlijk lekker ver wij
product geniet smak kunt wijn
winkel natur eten koffie gezell
24.41%
31 http://www.schaapjeblij.nl,
http://www.gastoudercindy.org,
http://www.bijmijles.nl,
http://www.isg-arcus.nl,
http://www.leskracht.nl,
http://www.sbodedijk.nl,
http://www.leukerik.nl,
http://www.kleurrijkvilt.nl,
http://www.hetzonnetje.nl,
http://www.hlinssen.nl
primary
and day-
care
2 kinder kind schol ouder onder-
wijs leerling ler wij groep ont-
wikkel onz jar spel begeleid
goed
23.87%
33 http://www.evertz.nl,
http://www.vakgaragewolters.nl,
http://www.firimass.nl,
http://www.thetroupe.nl,
No clear
similari-
ties be-
tween
-2 les verder verder les wij nieuw
onz jar goed all project neder-
land juni wer mak werk
22.56%
49
http://www.urotex.nl,
http://www.bedrijfsinterview.nl,
http://www.restauratiecentrum.nu,
http://www.merkwaaardig.nl,
http://www.vrijwilligerssteun-
puntommen.nl, http://www.hzzon-
wering.nl
docu-
ments
34 http://www.swiebertje.org,
http://www.filmuwbedrijf.nl,
http://www.omroepwest.nl,
http://www.zeelandvakantie-
woningen.eu, http://www.schoen-
tauf.de, http://www.haagrecht.nl,
http://www.printvandemaand.nl,
http://www.sarimanis.nl,
http://www.vdakker.nl,
http://www.elsen-uden.nl
No clear
similari-
ties be-
tween
docu-
ments
-2 den hag den hag wij onz rot-
terdam info jar amsterdam
nieuw kunt all nederland werk
utrecht
23.44%
35 http://www.tonertradinggroup.nl,
http://www.theemaatje.nl,
http://www.spijkenissekoi.nl,
http://www.mijngeschenk.lu-
ondo.nl, http://juniorvintag-
eanddesign.eu, http://www.ad-
mir.nl, http://www.mariekelode-
wijk.nl, http://www.shop.tro-
pacafe.nl, http://www.spray-
tancity.nl, http://www.seasons.nu
No clear
similari-
ties be-
tween
docu-
ments
-2 webwinkel adres www shop
support controler geschrev
wellicht beginn beschik ver-
wacht indien jouw product
nem contact
23.13%
36 http://www.amotex.nl,
http://www.vanhooft-transport.nl,
http://www.pkwaterbouw.nl,
http://www.slaats-dierenvoed-
ers.nl, http://www.bonotrans.nl,
http://www.embassyfreight.nl,
http://www.tiniemander-
stransport.nl, http://www.piano-
verhuuramsterdam.nl,
http://www.vanwaveren-
transport.nl, http://www.correu-
ten.nl
mainly
transpor-
tation
2 transport logistiek wij onz ver-
voer international klant eu-
ropa bedrijf nederland servic
jar gespecialiseerd all dienst
25.76%
50
37 http://www.gifts4thegreen.nl,
http://www.surpreza.com,
http://www.sell-to-you.nl,
http://www.huidverzorging-ko-
pen.nl, http://www.tunertape.com,
http://www.realperro.nl,
http://www.goedkoop-eroken.nl,
http://www.houtvanheidi.nl,
http://www.publicsolutions.bied-
meer.nl, http://www.shopat-
work.nl
No clear
similari-
ties be-
tween
docu-
ments
-2 webwinkel www adres con-
troler geschrev wellicht shop
support beginn verwacht in-
dien nem contact beschik nam
wel
23.04%
39 http://www.hetkompas-opende.nl,
http://www.derietvink-breda.nl,
http://www.delinderte.nl,
http://www.cbssamenopweg.nl,
http://www.cbsdewel.nl,
http://www.movendi.nl,
http://www.obsdebolder.nl,
http://www.demorgenster-
kampen.nl, http://www.obshar-
rybannink.nl, http://www.gerar-
duswinkel.nl
Elemen-
tary
schools
2 groep juli oktober september
addy kinder schol leerling on-
derwijs lunch vrij ging woens-
dag eig var
25.62%
40 http://www.jmadvies.nl,
http://www.boxbudgetbeheer.nl,
http://www.brehoff.nl,
http://www.chzorg.nl,
http://www.p-oatwork.nl,
http://www.muus.nl,
http://www.werknemerstevreden-
heid.eu, http://www.tele-
comhuys.nl, http://www.fgbfacili-
tygroup.nl, http://www.sobm.nl
No clear
similari-
ties be-
tween
docu-
ments
-2 organisatie wij project werk
ervar onz ebl management fi-
nanci organisaties advies fi-
nanci ebl ondernem kennis on-
dernemer
22.24%
41 http://www.hofstrasteel.nl,
http://www.gruythuysen.nl,
http://www.dijkinkelektra.nl,
http://www.dijkinkbeveiliging.nl,
http://www.thermos-inst.nl,
http://www.europlant.nl,
http://www.pauluskerkvluchtel-
ingenwerk.nl,
http://www.daldrup.eu,
No clear
similari-
ties be-
tween
docu-
ments
-2 tel fax sneller hoger fax lag
maatwerk tel lever informatie
info goed ging glas goed mo-
gelijk gezond
22.70%
51
http://www.sanitaskliniek.nl,
http://www.hummelelektra.nl
43 http://www.progay.nl,
http://www.walkinngoes.nl,
http://www.telecom-erfgoed.nl,
http://www.gezondheidscentru-
melst.nl, http://www.stichting-
srz.nl, http://www.triviumdiag-
nostiek.nl, http://www.wal-
fridus.nl, http://www.roefelen.nl,
http://www.tdw-advies.nl,
http://www.opleiding-particuli-
eronderzoeker.nl
Mostly
founda-
tions
1 stichting nederland onz activi-
teit jar wij gemeent doel web-
sit project mens nieuw veren
kinder informatie
22.74%
45 http://www.nbz.nl, http://www.e-
beat.biz, http://www.wen-
sinkdancemasters.nl,
http://www.bootbouwschool.nl,
http://www.persoonsbeveiliger.nl,
http://www.rudolfholleman.com,
http://www.spaansetaal.org,
http://www.vu-dekempen.nl,
http://www.akkonderwijs.nl,
http://www.stimulans-fysiothera-
pie.nl
Educa-
tion
2 cursus cursuss opleid work-
shop ler volg kunt wij onz in-
formatie les docent mak trai-
ning werk
23.34%
47 http://www.dekkerfietsen.com,
http://www.tielemanfietsen.nl,
http://www.louwerenburg.nl,
http://hotelbrinkzicht.com,
http://www.fietsershoptslim-
mer.nl, http://www.lem-
mentweewielers.nl,
http://www.telutci.com,
http://www.bito.nl,
http://www.defruitgaard.nl,
http://www.defietsenwinkel.nl
Bikes 2 fiet fiets elektrisch wij onz
kunt nieuw merk winkel ac-
cessoires servic reparatie as-
sortiment all onderdel
28.53%
48 http://www.ikbenarie.nl,
http://www.amigo.nl,
http://www.verstraatengroep.nl,
http://www.elkedagietsleuks.nl,
http://www.pointofsales.nl,
http://www.justflow.nl,
Web-
shops
1 javascript browser functiona-
liteit websit browser javas-
cript functionaliteit uitgescha-
keld javascript lijkt lijkt uitge-
22.76%
52
http://www.shop4networks.nl,
http://www.cadjobs.nl,
http://www.fpcollection.mobi.nl,
http://www.steenstripwinkel.nl
schakeld uitgeschakeld brow-
ser your geactiveerd lijkt web-
sit benut winkelwag
49 http://www.flowfoundation.nl,
http://www.newcase-audiovisu-
als.com, http://www.fbeyeproduc-
tions.nl, http://www.citystarsou-
venirs.com, http://www.tokata.nl,
http://www.winenetwork.nl,
http://www.zichtopjezelf.com,
http://www.buywine.nl,
http://www.davenschot.nl,
http://www.janheinarens.nl
No clear
similari-
ties be-
tween
docu-
ments
-2 the and for you this with are
your not that from can websit
wij hav
22.67%
50 http://www.vanderreest.nl,
http://www.reestmachines.nl
Two
URLs =
Same
Destina-
tion
0 verwijz reparer vernieuwd
machines bent exclusief wij
verhur bent zoek ruim jar ja-
nuari wij onz total jar ervar
juist adres
41.67%
51 http://www.beterisoleren.nu,
http://www.therdex.com,
http://www.houseofpandomo.nl,
http://www.deweerd-
wonenenslapen.nl,
http://www.kjfloorsolutions.com,
http://www.lambooparket.nl,
http://www.domcity.nl,
http://www.natuursteen-tegel-
werken.nl, http://www.brent-
jens.nl, http://www.par-
ketlijm.com
Floors
and
Flooring
2 vloer hout wij onz showrom
legg onderhoud kunt mogelijk
nieuw lever all jar kleur kwa-
liteit
27.04%
52 http://www.silverdaletraining.nl,
http://www.rijschoolgeduld.nl,
http://www.zangstudiodelft.nl,
http://www.simondewit.nl,
http://www.lgmusicschool.nl,
http://www.yukta.nl,
http://www.marinusterpstra.nl,
http://www.sterkeschool.nl,
Schools
and train-
ings
2 less leerling les jar ler docent
gegev wij onz workshop mu-
ziek volg goed mogelijk kun
24.68%
53
http://www.rijschooltalander.nl,
http://www.de-bolster.nl
54 http://www.vormgevenenzo.nl,
http://www.maren74.nl,
http://www.dehaancreative.nl,
http://www.agasi.nl,
http://www.eijsbroek.com,
http://www.schootvormgeving.nl,
http://www.drd-support.nl,
http://www.charlotluiting.nl,
http://www.miesign.nl,
http://www.thesculpfactory.com
Design 2 vormgev grafisch ontwerp
huisstijl rod websit stap be-
drijf complet professionel
snelheid maakt gebruik ge-
zicht rest kenmerk
22.44%
55 http://www.paardenmiddel.nl,
http://www.paardenmiddel.com
Two
URLs =
Same
Destina-
tion
0 sit bedoeld gezond dagelijk in-
formatie leuk krijg consult ge-
bruik stukj wet welkom sit
goed mogelijk dier beant-
woord
100.00%
56 http://www.vandaan-media.com,
http://www.possibilit.webs.com,
http://www.rodanthedecoratie.nl,
http://www.rekenenoprekenen.nl,
http://www.strandhuisje.com,
http://www.horecaoutlet.nl,
http://www.haarstudioalina.nl,
http://www.ctbaa.nl,
http://www.camerafilterstore.nl,
http://www.ggspecialsizes.nl
No clear
similari-
ties be-
tween
docu-
ments
-2 var document function script
http https com getelementbyid
for data twitter new window
wij www
23.04%
58 http://www.in-visible.nl,
http://www.groenveldboeken.nl,
http://www.grimbergenboeken.nl,
http://www.iwema.nl,
http://www.voordeelboeke-
nonline.nl, http://www.agentsaft-
erall.nl, http://www.kolstein.nl,
http://www.agentsafterall.com,
http://www.boekhandel-
vandervelde.nl,
http://www.venstra.nl
books 2 boek activiteit onz onz winkel
winkel aanmeld regelmat foto
hoogt blijv wilt presentatie
onz nieuwsbrief wij grag lang
25.62%
54
61 http://www.cellebroederskapel.nl,
http://www.wingbergermolen.nl,
http://www.bikeparkspaarn-
woude.com, http://www.der-
ooijmakelaars.nl, http://www.be-
zoekerscentrumleudal.nl,
http://www.miataonderdelen.nl,
http://www.bureaucicero.nl,
http://www.nootdorp-
slotenmaker-specialist.nl,
http://www.hetkleineparadijs.nl,
http://www.indonesiatravel.nl
No clear
similari-
ties be-
tween
docu-
ments
-2 les les verder verder wij onz
les mer mer nieuw jar all goed
kunt mak nederland werk
22.92%
62 http://www.littlebuds.nl,
http://www.jmgamesonline.nl,
http://www.herbanaatje.nl,
http://www.dewoonwinkel.eu,
http://www.mijnwebwin-
kel.nl/winkel/debontebazaar.nl,
http://www.rhmpstore.nl,
http://www.vlindertjevrolijk.nl,
http://www.no1els.nl,
http://www.meneerdebock.nl,
http://www.decreatievevlinder.nl
All
closed
webshops
of
mijnweb-
winkel.nl
2 helas webwinkel later mail-
adres prober zeker pro per
maand product domeinnam
functies betaalt inschrijv blog
inspiratie
24.69%
63 http://www.customicebp.com,
http://www.netster.nl
Two
URLs =
Same
Destina-
tion
0 tekst realiser ruim domeinnam
optimaliser afbeeld hosting
controler lat voldoet ruim er-
var kleding vertal email info
aanpass .
33.33%
64 http://www.blasteq.nl,
http://www.blasteq.com,
http://www.blasteq.eu
Three
URLs =
same des-
tination
0 direct kunt biedt les reinig ver-
led all soort schon hiervan be-
hor sector belangrijkst voor-
beeld gebruik mak gecertifi-
ceerd
100.00%
66 http://www.zijnenschijn.nl,
http://www.praktijkasem.nl,
http://www.acupunctuur-scha-
gen.nl, http://www.praktijko-
penblik.nl,
http://www.maulany.nl,
http://www.podothera-
Yoga and
alterna-
tive heal-
ing
2 klacht licham behandel thera-
pie praktijk gezond beweg ba-
lan geest oorzak lev stres kunt
ontspann wer
24.05%
55
pievenray.nl, http://www.heel-
bewust.nl, http://www.hildatop-
per.nl, http://www.sensbewee-
gtje.nl, http://www.prak-
tijkvanrumpt.nl
67 http://www.massage4all.nl,
http://www.ariana-lamberts.nl,
http://www.sedoc.nl,
http://www.spier.nu,
http://www.carlawinkelman.nl,
http://liesbethmassagepraktijk.nl,
http://www.westriknatuurgenees-
wijzen.nl,
http://www.timiselamassage.nl,
http://www.bobodywork.nl,
http://www.mvmassage.nl
Massage 2 massag ontspann licham be-
handel klacht praktijk geest
rust stres jezelf kunt heerlijk
balan aandacht goed
27.50%
68 http://www.bodewes.nu,
http://www.jackbodewes.nl
Two
URLs =
Same
Destina-
tion
0 sport geopend voll augustus
geslot training wer tijd gang
hanter draait maandag vrijdag
gezet hom verbouw
77.78%
69 http://www.sierhekwerkdejong.nl,
http://www.tinnemans-
scheepswerf.nl, http://www.rhino-
bv.nl, http://www.stal-de-elzen.nl,
http://www.vdkengineering.com,
http://www.meijerbv.nl,
http://www.stal-skulenboarch.nl,
http://www.rollen.nl,
http://www.lava3.nl,
http://www.vanhartskampmetaal-
werken.nl
No clear
similari-
ties be-
tween
docu-
ments
-2 stal wij onz product metal le-
ver bedrijf all jar kunt info
mogelijk hout mat project
23.79%
71 http://www.timkorenhofftekst.nl,
http://www.rotorcommunicatie.nl,
http://www.willemnijeboer.com,
http://www.matthijsmeulblok.nl,
http://www.marionvanes.com,
http://www.barbarabreedijk.com,
http://www.lotjevanlieshout.com,
http://www.matchwinner-
Text edi-
tors, writ-
ters,
2 tekst verhal schrijv vertal
communicatie boek goed tal
boodschap mak jouw beeld
woord werk nederland
24.13%
56
shop.com, http://www.mar-
tijngort.nl, http://www.evi-
dentpr.nl
72 http://www.dewijnprins.nl,
http://www.wijnvoordeel.nl,
http://www.westerveldwijnen.nl,
http://www.wijnenmeat.nl,
http://www.taste-trade.eu,
http://www.wijnenvanegbert.nl,
http://www.wijngaard-zon-
nestraal.nl, http://www.vi-
novelzky.nl,
http://www.griekswijnhuis.nl,
http://www.ctwfinewines.com
wines 2 wijn amsterdam drink smak
rod onz del wit juist bestell
mooi kwaliteit zorg biologisch
wij
29.37%
76 http://www.heuvellandhotels.nl,
http://www.hotelvelsen.nl,
http://www.hotel-rido.nl,
http://www.renl.nl,
http://www.zaaninnhotel.nl,
http://www.hotelvanoranje.nl,
http://www.hulsmanvenray.nl,
http://www.invast.nl,
http://www.turkije.nl,
http://www.bellevuegroothoofd.nl
Hotels
and vaca-
tions
2 hotel kamer restaurant geniet
onz heerlijk wij prachtig kunt
geleg vakantie lux centrum
ligt all
29.66%
77 http://www.projectinrichting-
lavoir.nl, http://www.difofoto-
grafie.nl, http://www.den-
kontwerp.nl, http://www.foot-
printchallenge.nl,
http://www.crea-art.nl,
http://www.frozenwebshop.nl,
http://www.gymsportcongres.nl,
http://www.broekenbuuren.nl,
http://www.sfeerenmeer-
events.nl, http://www.meur-
swerkt.nl
No clear
similari-
ties be-
tween
docu-
ments
-2 cre cre ebr ebr wij onz mak
werk klant ontwerp nieuw sam
product war goed ontwikkel
22.75%
78 http://www.bobbledesign.nl,
http://www.bobinterieurbouw.nl
Two
URLs =
Same
Destina-
tion
0 project diver opgedan meubel
besteld gedur klassiek acces-
soires webwinkel binnenkort
bureau kijkj voorbeeld eis
deur
28.57%
79 http://www.officecontent.nl,
http://www.fysiotherapie-deriet-
landen.nl, http://www.mh-loop-
baanadvies.nl,
http://www.presentatieo-
pleiding.nl, http://www.knubb.nl,
http://www.kompastraining.nl,
http://www.bontalen.nl,
http://www.marcelfuchs.nl,
http://www.inneraction.nl,
http://www.profact.org
Coaching and
training
2 training coaching trainer werk
person opleid begeleid train
onz wij mens ontwikkel sport
ler ervar
23.31%
80 http://www.adviseyou.nl,
http://www.newharttings.com,
http://www.dwain.nl,
http://www.koemeester.nl,
http://www.dutchcre8.com,
http://www.eco-communi-
catie.com, http://www.me-
diaflame.nl, http://www.vrijzin-
nigevangelisch.nl,
http://www.semweb.nl,
http://www.ictlimburg.nl
No clear
similari-
ties be-
tween
docu-
ments
-2 websit websites wij onlin do-
meinnam onz ontwerp hosting
klant contact kunt huisstijl
mak informatie nieuw .
22.59%
81 http://www.ag-transporten.com,
http://www.ag-transporten.nl
Two
URLs =
Same
Destina-
tion
0 les verder transport fax verder
les tel fax vervoer tel del sales
specialiteit allround duitsland
vestig uitgevoerd
100.00%
82 http://www.ijmonduitvaart.nl,
http://www.beeldendetherapie-
mcdejager.nl, http://www.pa-visu-
als.nl, http://www.studioacan-
thus.nl, http://www.grapheus.nl,
http://www.reedemannaerts.nl,
http://www.romi-kin-
deropvang.com,
http://www.paulsellers.nl,
http://www.sandalfon.eu,
http://www.liedschrijvers.nl
No clear
similari-
ties be-
tween
docu-
ments
-2 document writ writ document
addy var span prefix path var
addy styl prefix addy var pre-
fix path var var path path pre-
fix .
23.03%
83** http://www.crmconnectors.com,
http://www.nl.cremer.com,
http://www.limbra-ict.nl,
http://www.rely.nl,
http://www.leasebits.eu,
http://www.vhi.nl, http://www.ba-
nanajama.net, http://www.peri-
cia.nl, http://www.hipposoft-
ware.com, http://www.grout-
mij.com
Data,
software
and ICT
2 softwar system ict oploss wij
onz klant ontwikkel product
computer dienst beher bedrijf
all mogelijk
22.30%
84 http://www.interiorinput.nl,
http://www.interiorinput.com
Two
URLs =
Same
Destina-
tion
0 gebouw effici ebnt initiatief
effici onz opdrachtgever ebnt
ontwerp opdrachtgever interi-
eur concept onz realiser wijz
ieder ruimt
100.00%
85 http://www.detuinnatuurlijk.nl,
http://www.terstegentuinen.nl,
http://www.groengennep.nl,
http://www.westbeplanting.nl,
http://www.dethuismeester.nl,
http://www.gemmavermeulen.nl,
http://www.belshoftuin.nl,
http://www.fravin-sierbestrat-
ing.nl, http://www.moree-
groen.nl, http://www.raatjestui-
nontwerp.nl
Garden-
ing
2 tuin wij onderhoud ontwerp
groen onz kunt plant wens
mak geniet goed grag particu-
lier wilt .
28.99%
87 http://www.lpg.nl,
http://www.tegelhandelallertz.nl,
http://www.biggelaar.eu,
http://www.weltevredegroep.nl,
http://www.janvanzanten.nl,
http://www.werkvanuithuis.nl,
http://www.berenkind.nl,
http://www.pizzaovenfeestje.nl,
http://www.drukwerkvergelijker.n
et, http://www.tools-and-more.nl
No clear
similari-
ties be-
tween
docu-
ments
-2 per dag stuk per jar btw incl
per dag dag per per wek wij
uur onz uur per jar wek .
23.06%
88 http://www.100procentvoet.nl,
http://www.lisabella.eu,
http://www.beaulissa.nl,
http://www.wellshaped.nl,
http://zhi-shiatsu.nl,
http://www.salonlapromesse.nl,
http://www.abeauty.nl,
http://www.cestcabeauty.nl,
http://www.beautysalonantoi-
nette.nl, http://www.huidcen-
trumlimburg.nl
Beauty
and cos-
metics
2 behandel huid salon voet af-
sprak product ontspann kunt
mak wij verzorg natur terecht
onz goed .
25.13%
89 http://www.schnek.nl,
http://www.esthergoldstein.nl,
http://www.leanderfoto-
grafie.com, http://www.cre-
anita.nl, http://www.ellefoto-
grafie.nl, http://www.frbfoto-
grafie.nl, http://www.nl-foto-
grafie.nl, http://www.stonewood-
fotografie.nl, http://www.bartgul-
demond.nl, http://www.vlieland-
foto.nl
Photography
2 fotografie websit momentel
blijf druk welkom websit kvk
foto btw com volg breng houd
wer snel
43.20%
90 http://www.seeyou.nl,
http://www.identipack.com,
http://www.pasopswingtuit.nl,
http://www.see-listen.nl,
http://www.goatmilkpowder.nl,
http://www.mobielewasstraat.nl,
http://www.whistler-it.nl,
http://www.promovlag.nl,
http://www.biozuiger.nl,
http://www.raphaelconsult.nl
No clear
similari-
ties be-
tween
docu-
ments
-2 www http http www com info
websit onz domeinnam infor-
matie tel this nieuw https link
wij .
23.05%
92 http://www.berkersadvies.nl,
http://www.wildeharen.nl,
http://www.ebicus.com,
http://www.friendly-fire.nl,
http://www.magnetronmusic.com,
http://www.onna.nl,
http://www.persens.nl,
http://www.krijnsenteksten.nl,
http://www.e-flexpersoneel.nl,
http://www.vbku.nl
No clear
similari-
ties be-
tween
docu-
ments
-2 twitter geled maand for via
com web ongever consult
coach inzicht beweg geeft gra-
tis out .
21.57%
95 http://www.vanders.nl,
http://www.kikkerkinderyoga.nl
Two
URLs =
Same
Destina-
tion
0 train bereikt verstand rustig
beweg licham gevoel zorgt lo-
catie vast manier les mann
jong oud bedoeld
34.78%
97 http://www.de-glashut.nl,
http://www.a1glas.nl,
http://www.glasinloodstudio.nl,
http://www.pfann.nl,
http://www.glasinlood-amers-
foort.nl, http://www.vanderwal-
glasinlood.nl,
http://www.loligo.nl,
http://www.glasinloodateliercooij
mans.nl, http://www.glas-linq.nl,
http://www.gbb.nl
Glass 2 glas gevoel passend techniek
ontwerp behalv onz showrom
bron boodschap karakter
daarin vlak kijkt beperk onz
33.38%
*Clusters with one or fewer documents are omitted from this table for readability.
Comparing the clustering to the SBI classification makes the following clear. (1) Some clusters that
seemed cohesive when comparing URLs and checking websites manually were not as cohesive as suspected.
For example, based on the available SBI data, only 18% of cluster 3 consists of “Lawyers and curators”
(69101, level 5), and 35% of its documents, including the “Lawyers and curators” documents, fall under
the general level 2 category Legal services, accounting, tax consultancy, administration (69).
Looking at the total distribution of this cluster, 200 documents (6.5%) appear in the SBI as “Other
paramedical practitioners (no physiotherapy and psychology) and alternative healers” (86919), and
another 181 documents (5.9%) fall under “Practices of psychotherapists and psychologists” (86913).
Both fall under the general SBI level 2 category Human health activities (86) and the SBI level 3
category Paramedical practitioners and other human health activities without accommodation (869).
Even more interesting in this regard is that the neighbouring cluster 4 does have a substantial number
of documents in this level 3 category (869), with the largest share of its documents (22%) falling into
“Other paramedical practitioners (no physiotherapy and psychology) and alternative healers” (86919)
(see Appendix VIII).
Cluster | Most dominant subcategory | SBI 2008 | Percentage of documents with SBI code represented by most dominant category | Number of documents with SBI code
3 (level 2) | Legal services, accounting, tax consultancy, administration | 69 | 35% | 3058 document(s)
3 (level 5) | Lawyers and curators | 69101 | 18% | 3058 document(s)
4 (level 2) | Human health activities | 86 | 80.80% | 5279 document(s)
4 (level 3) | Paramedical practitioners and other human health activities without accommodation | 869 | 51.70% | 5279 document(s)
4 (level 5) | Other paramedical practitioners (no physiotherapy and psychology) and alternative healers | 86919 | 22% | 5286 document(s)
Some clusters that did not show any apparent similarity between their documents do show some overlap
with SBI categories when compared to the SBI. For example, cluster 40 did not hold an apparent common
theme amongst its documents, yet according to the SBI 20% of its documents belong to the fifth-level
SBI category “Organisational planning”. In cluster 83, the greatest portion (44%) of documents falls
into the level 2 category Support activities in the field of information technology (62). Looking
deeper (level 4), almost 20% of all documents, and thus a little less than half of these, fall into the
“Writing, producing and publishing of software” (6201) category. A further 16.68% of all documents in
the cluster fall into “Computer consultancy activities” (6202) and 5.6% into “Other information
technology and computer service activities” (6209). Another 5.6%, however, fall into “Financial
holdings” (6420), 4.8% into “Wholesale of computers, peripheral equipment and software” (4651), and
another 4.8% into “Engineers and other technical design and consultancy” (7112).11
11 Results are stored in a table which is too large to include here or in the appendix but can be provided on request.
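The percentages discussed above can be illustrated with a small sketch. It assumes that the SBI codes of the documents in one cluster are available as strings, and that the level 2 and level 3 parents of a numeric code are its first two and three digits. The function name `dominant_category` and the toy cluster are hypothetical; this is not the code used to produce the tables in this thesis.

```python
from collections import Counter


def dominant_category(sbi_codes, level=None):
    """Return (category, share in %) of the most frequent SBI category.

    sbi_codes: SBI codes of the documents in one cluster, as strings so that
    leading zeros survive. level=2 or level=3 truncates each numeric code to
    its first two or three digits; level=None keeps the full 4/5-digit code.
    """
    if level not in (None, 2, 3):
        raise ValueError("level must be None, 2 or 3")
    keys = [code[:level] if level else code for code in sbi_codes]
    category, count = Counter(keys).most_common(1)[0]
    return category, 100.0 * count / len(keys)


# Hypothetical toy cluster of ten documents (not data from the thesis).
toy_cluster = ["6201", "6201", "6201", "6202", "6202",
               "6209", "6420", "4651", "7112", "70221"]

print(dominant_category(toy_cluster))           # full 4/5-digit codes
print(dominant_category(toy_cluster, level=2))  # two-digit parent categories
```

Truncating to a higher level can only merge categories, which is why the dominant share at level 2 is never smaller than the dominant share at level 4/5.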
Appendix V: Percentage of Smaller Clusters on different parameters and sizes for k
Thresholds | Percentage clusters <= 1 documents | Percentage <= 10 documents | Percentage <= 100 documents
min_df: 0.1 | 1% (1 cluster) | 4% (4 clusters) | 4% (4 clusters)
min_df: 0.01, k = 50 | 34% (17 clusters) | 48% (24 clusters) | 58% (29 clusters)
min_df: 0.01, k = 100 | 34% (34 clusters) | 43% (43 clusters) | 43% (43 clusters)
min_df: 0.01, max_df: 0.7, k = 50 | 34% (17 clusters) | 44% (22 clusters) | 50% (25 clusters)
min_df: 0.01, max_df: 0.7, k = 100 | 28% (28 clusters) | 40% (40 clusters) | 50% (50 clusters)
min_df: 0.01, max_df: 0.7, k = 300 | 20% (62 clusters) | 24% (72 clusters) | 29.3% (88 clusters)
min_df: 0.008, max_df: 0.7, k = 50 | 48% (24 clusters) | 58% (29 clusters) | 64% (32 clusters)
min_df: 0.008, max_df: 0.7, k = 100 | 32% (32 clusters) | 44% (44 clusters) | 51% (51 clusters)
min_df: 0.007, max_df: 0.7, k = 50 | 40% (20 clusters) | 58% (29 clusters) | 66% (33 clusters)
min_df: 0.007, max_df: 0.7, k = 100 | 32% (32 clusters) | 44% (44 clusters) | 50% (50 clusters)
min_df: 0.005, max_df: 0.7, k = 100 | 29% (29 clusters) | 44% (44 clusters) | 52% (52 clusters)
min_df: 0.003, max_df: 0.7, k = 100 | 39% (39 clusters) | 45% (45 clusters) | 53% (53 clusters)
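As an illustration of how such percentages can be derived, the sketch below counts the clusters at or below each size threshold from a list of cluster labels, one label per document. It assumes labels run from 0 to k-1 and that empty clusters count as small. The function name `small_cluster_shares` and the toy labels are hypothetical and not taken from the thesis code.

```python
from collections import Counter


def small_cluster_shares(labels, k, thresholds=(1, 10, 100)):
    """For each threshold t, return (number, percentage) of the k clusters
    that contain t documents or fewer. Clusters with no documents count too."""
    sizes = Counter(labels)
    shares = {}
    for t in thresholds:
        n_small = sum(1 for cluster in range(k) if sizes.get(cluster, 0) <= t)
        shares[t] = (n_small, 100.0 * n_small / k)
    return shares


# Hypothetical toy labelling with k = 5: cluster sizes 150, 5, 1, 12 and 0.
toy_labels = [0] * 150 + [1] * 5 + [2] + [3] * 12
for threshold, (count, pct) in small_cluster_shares(toy_labels, k=5).items():
    print(f"<= {threshold} documents: {pct:.0f}% ({count} clusters)")
```

The thresholds are cumulative, so a cluster with a single document is also counted under the 10 and 100 document thresholds.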
Appendix VI: Relative Overrepresentation
K=100
Cluster | Level 4/5 most dominant category | SBI | Percentage represented by most dominant category | Innovative documents | Total documents | Relative overrepresentation
83 | Writing, producing and publishing of software | 6201 | 20% | 36 (15.79%) | 12654 (2.85%) | 12.94%
49 | Organisational planning | 70221 | 7% | 10 (4.39%) | 6031 (1.36%) | 3.03%
12 | Other interest organizations n.e.c.* | 94997 | 4% | 60 (26.32%) | 104918 (23.61%) | 2.71%
10 | Other interest organizations n.e.c.* | 94997 | 10% | 5 (2.19%) | 3394 (0.76%) | 1.43%
K=500
Cluster | Level 4/5 most dominant category | SBI | Percentage represented by most dominant category | Innovative documents | Total documents | Relative overrepresentation
191 | Financial holdings | 6420 | 6.00% | 16 (7.58%) | 9754 (2.19%) | 5.39%
113 | Writing, producing and publishing of software | 6201 | 11.10% | 12 (5.69%) | 2741 (0.62%) | 5.07%
279 | Engineers and other technical design and consultancy | 7112 | 14.00% | 12 (5.69%) | 2877 (0.65%) | 5.04%
53 | Organisational planning | 70221 | 5.00% | 22 (10.43%) | 32079 (7.22%) | 3.21%
42 | Organisational planning | 70221 | 7.60% | 6 (2.84%) | 3385 (0.76%) | 2.08%
K=1500
Cluster | Level 4/5 most dominant category | SBI | Percentage represented by most dominant category | Innovative documents | Total documents | Relative overrepresentation
103 | Financial holdings | 6420 | 6.30% | 38 (15.45%) | 44594 (10.3%) | 5.42%
309 | Other interest organizations n.e.c.* | 94997 | 5.60% | 49 (19.92%) | 77618 (17.46%) | 2.46%
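Judging from the figures in these tables, the relative overrepresentation appears to be the share of all innovative documents that falls in a cluster minus the share of all documents that falls in that cluster (for example, 7.58% - 2.19% = 5.39%). The sketch below works this out for three rows. The function names are hypothetical and the formula is inferred from the table values rather than quoted from the method section.

```python
def share(count, total):
    """Percentage that `count` represents of `total`."""
    return 100.0 * count / total


def relative_overrepresentation(innovative_share, total_share):
    """Difference in percentage points between the cluster's share of
    innovative documents and its share of all documents (inferred formula)."""
    return innovative_share - total_share


# (innovative share %, total share %) as printed in the tables above.
rows = {191: (7.58, 2.19), 113: (5.69, 0.62), 279: (5.69, 0.65)}
for cluster, (innovative, total) in rows.items():
    print(cluster, round(relative_overrepresentation(innovative, total), 2))
```

A positive value means the cluster holds a larger share of the innovative documents than its overall size would suggest.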
Appendix VII: Dominant level 4 / 5 SBI Category within Clusters for k = 100
Cluster | Level 4/5 most dominant category | SBI | Percentage represented by most dominant category | Number of documents
89 | Photography | 74201 | 83% | 12 document(s)
88 | Beauty treatment, pedicures and manicures, make-up and image consulting | 96022 | 62% | 2940 document(s)
54 | Advertising agencies | 7311 | 58% | 12 document(s)
0 | Sale and repair of passenger cars and light motor vehicles (no import of new cars) | 45112 | 57% | 3786 document(s)
24 | Other service activities n.e.c.* | 9609 | 57% | 931 document(s)
17 | Restaurants | 56101 | 56% | 2357 document(s)
85 | Landscape service activities | 8130 | 54% | 1709 document(s)
66 | Other paramedical practitioners (no physiotherapy and psychology) and alternative healers | 86919 | 51% | 3411 document(s)
67 | Other paramedical practitioners (no physiotherapy and psychology) and alternative healers | 86919 | 47% | 975 document(s)
20 | Organisational planning | 70221 | 41% | 17 document(s)
47 | Shops selling bicycles and mopeds | 47641 | 41% | 822 document(s)
97 | Shaping and processing of flat glass | 2312 | 38% | 24 document(s)
76 | Hotels with restaurants | 55101 | 37% | 846 document(s)
72 | Wholesale of beverages (no dairy products) | 4634 | 32% | 403 document(s)
36 | Freight transport by road (no removal services) | 4941 | 31% | 1901 document(s)
71 | Writing and other artistic creation | 9003 | 28% | 2715 document(s)
51 | Floor and wall covering | 4333 | 27% | 928 document(s)
79 | Business education and training | 85592 | 25% | 3492 document(s)
52 | Driving schools | 8553 | 23% | 1999 document(s)
4 | Other paramedical practitioners (no physiotherapy and psychology) and alternative healers | 86919 | 22% | 5286 document(s)
31 | Day nurseries for pupils | 88911 | 22% | 4898 document(s)
80 | Writing, producing and publishing of software | 6201 | 22% | 8867 document(s)
1 | Construction of residential and non-residential buildings | 4120 | 21% | 8197 document(s)
62 | Retail sale via internet of clothes and clothing accessories | 47914 | 21% | 187 document(s)
34 | Other interest organizations n.e.c.* | 94997 | 20% | 3848 document(s)
40 | Organisational planning | 70221 | 20% | 28396 document(s)
83 | Writing, producing and publishing of software | 6201 | 20% | 7878 document(s)
5 | Fast-food restaurants, cafeterias, ice cream parlours, take-out eating places etc. | 56102 | 19% | 53 document(s)
3 | Lawyers and curators | 69101 | 18% | 3058 document(s)
45 | Business education and training | 85592 | 16% | 1674 document(s)
92 | Organisational planning | 70221 | 14% | 100 document(s)
37 | Retail sale via internet of clothes and clothing accessories | 47914 | 13% | 148 document(s)
7 | Other paramedical practitioners (no physiotherapy and psychology) and alternative healers | 86919 | 12% | 13940 document(s)
35 | Writing, producing and publishing of software | 6201 | 12% | 170 document(s)
43 | Other interest organizations n.e.c.* | 94997 | 12% | 1867 document(s)
10 | Other interest organizations n.e.c.* | 94997 | 10% | 2039 document(s)
13 | Organisational planning | 70221 | 10% | 1385 document(s)
77 | Organisational planning | 70221 | 10% | 3057 document(s)
69 | Machining | 2562 | 8% | 1139 document(s)
33 | Organisational planning | 70221 | 7% | 2517 document(s)
49 | Organisational planning | 70221 | 7% | 3374 document(s)
2 | Retail sale via internet of clothes and clothing accessories | 47914 | 6% | 576 document(s)
21 | Other interest organizations n.e.c.* | 94997 | 6% | 34651 document(s)
29 | Fast-food restaurants, cafeterias, ice cream parlours, take-out eating places etc. | 56102 | 6% | 7412 document(s)
61 | Other interest organizations n.e.c.* | 94997 | 6% | 7955 document(s)
28 | Organisational planning | 70221 | 5% | 3481 document(s)
48 | Dispensing chemists | 4773 | 5% | 2464 document(s)
82 | Organisational planning | 70221 | 5% | 1259 document(s)
87 | Financial holdings | 6420 | 5% | 3395 document(s)
9 | Financial holdings | 6420 | 4% | 2140 document(s)
12 | Other interest organizations n.e.c.* | 94997 | 4% | 56982 document(s)
26 | Other interest organizations n.e.c.* | 94997 | 4% | 4854 document(s)
56 | Organisational planning | 70221 | 4% | 2487 document(s)
90 | Organisational planning | 70221 | 4% | 2064 document(s)
Appendix VIII: Most Dominant level 2 SBI Category in Clusters for k = 100
Cluster | Level 2 most dominant category | SBI | Percentage represented by most dominant category | Number of documents
89 | Industrial design, photography, translation and other consultancy | 74 | 92% | 12 document(s)
4 | Human health activities | 86 | 81% | 5286 document(s)
66 | Human health activities | 86 | 80% | 3411 document(s)
17 | Food and beverage service activities | 56 | 75% | 2357 document(s)
0 | Sale and repair of motor vehicles, motorcycles and trailers | 45 | 73% | 3786 document(s)
62 | Retail trade (not in motor vehicles) | 47 | 73% | 187 document(s)
88 | Wellness and other services; funeral activities | 96 | 72% | 2940 document(s)
52 | Education | 85 | 65% | 1999 document(s)
54 | Advertising and market research | 73 | 58% | 12 document(s)
2 | Retail trade (not in motor vehicles) | 47 | 57% | 576 document(s)
24 | Wellness and other services; funeral activities | 96 | 57% | 931 document(s)
85 | Facility management | 81 | 54% | 1709 document(s)
20 | Holding companies (not financial) | 70 | 53% | 17 document(s)
37 | Retail trade (not in motor vehicles) | 47 | 53% | 148 document(s)
67 | Human health activities | 86 | 53% | 975 document(s)
76 | Accommodation | 55 | 50% | 846 document(s)
97 | Manufacture of other non-metallic mineral products | 23 | 50% | 24 document(s)
47 | Retail trade (not in motor vehicles) | 47 | 48% | 822 document(s)
35 | Retail trade (not in motor vehicles) | 47 | 45% | 170 document(s)
48 | Retail trade (not in motor vehicles) | 47 | 45% | 2464 document(s)
83 | Support activities in the field of information technology | 62 | 44% | 7878 document(s)
79 | Education | 85 | 43% | 3492 document(s)
45 | Education | 85 | 41% | 1674 document(s)
41 | Specialised construction activities | 43 | 39% | 28 document(s)
5 | Food and beverage service activities | 56 | 38% | 53 document(s)
72 | Wholesale trade (no motor vehicles and motorcycles) | 46 | 37% | 403 document(s)
3 | Legal services, accounting, tax consultancy, administration | 69 | 35% | 3058 document(s)
36 | Land transport | 49 | 32% | 1901 document(s)
51 | Specialised construction activities | 43 | 32% | 928 document(s)
71 | Arts | 90 | 32% | 2715 document(s)
31 | Social work activities without accommodation | 88 | 31% | 4898 document(s)
39 | Education | 85 | 29% | 14 document(s)
80 | Support activities in the field of information technology | 62 | 28% | 8867 document(s)
29 | Retail trade (not in motor vehicles) | 47 | 25% | 7412 document(s)
40 | Holding companies (not financial) | 70 | 25% | 28396 document(s)
1 | Construction of buildings and development of building projects | 41 | 24% | 8197 document(s)
43 | World view and political organizations, interest and ideological organizations, hobby clubs | 94 | 24% | 1867 document(s)
69 | Manufacture of fabricated metal products, except machinery and equipment | 25 | 24% | 1139 document(s)
7 | Human health activities | 86 | 22% | 13940 document(s)
34 | World view and political organizations, interest and ideological organizations, hobby clubs | 94 | 22% | 3848 document(s)
26 | Retail trade (not in motor vehicles) | 47 | 19% | 4854 document(s)
13 | Support activities in the field of information technology | 62 | 15% | 1385 document(s)
92 | Holding companies (not financial) | 70 | 15% | 100 document(s)
10 | World view and political organizations, interest and ideological organizations, hobby clubs | 94 | 13% | 2039 document(s)
77 | Holding companies (not financial) | 70 | 13% | 3057 document(s)
12 | Retail trade (not in motor vehicles) | 47 | 12% | 56982 document(s)
21 | Wholesale trade (no motor vehicles and motorcycles) | 46 | 12% | 34651 document(s)
49 | Arts | 90 | 12% | 3374 document(s)
87 | Retail trade (not in motor vehicles) | 47 | 11% | 3395 document(s)
90 | Retail trade (not in motor vehicles) | 47 | 10% | 2064 document(s)
33 | Holding companies (not financial) | 70 | 9% | 2517 document(s)
56 | Retail trade (not in motor vehicles) | 47 | 9% | 2487 document(s)
61 | World view and political organizations, interest and ideological organizations, hobby clubs | 94 | 9% | 7955 document(s)
9 | Human health activities | 86 | 8% | 2140 document(s)
82 | Human health activities | 86 | 8% | 1259 document(s)
28 | Retail trade (not in motor vehicles) | 47 | 7% | 3481 document(s)
TOP 10 Most Dominant SBI Level 2 Categories for k=500 and k=1500
K=500
Cluster | Dominant SBI level 2 category | SBI | Percentage of documents in cluster in dominant category | Total number of documents in cluster
39 | Human health activities | 86 | 100.00% | 68 document(s)
474 | Human health activities | 86 | 94.80% | 58 document(s)
274 | Travel agencies, tour operators, tourist information and reservation services | 79 | 93.30% | 45 document(s)
245 | Legal services, accounting, tax consultancy, administration | 69 | 93.20% | 44 document(s)
317 | Legal services, accounting, tax consultancy, administration | 69 | 92.60% | 27 document(s)
345 | Sale and repair of motor vehicles, motorcycles and trailers | 45 | 90.40% | 73 document(s)
190 | Retail trade (not in motor vehicles) | 47 | 89.70% | 29 document(s)
109 | Human health activities | 86 | 87.60% | 792 document(s)
99 | Sale and repair of motor vehicles, motorcycles and trailers | 45 | 87.50% | 104 document(s)
249 | Retail trade (not in motor vehicles) | 47 | 86.60% | 149 document(s)
K=1500
Cluster | Dominant SBI level 2 category | SBI code | Percentage represented by most dominant category | Total number of documents in cluster
375 | Human health activities | 86 | 100.00% | 68 document(s)
853 | Human health activities | 86 | 100.00% | 19 document(s)
950 | Retail trade (not in motor vehicles) | 47 | 100.00% | 16 document(s)
1328 | Food and beverage service activities | 56 | 100.00% | 19 document(s)
769 | Human health activities | 86 | 98.70% | 299 document(s)
1217 | Wellness and other services; funeral activities | 96 | 98.50% | 68 document(s)
921 | Human health activities | 86 | 97.40% | 76 document(s)
447 | Education | 85 | 97.10% | 206 document(s)
858 | World view and political organizations, interest and ideological organizations, hobby clubs | 94 | 96.90% | 5713 document(s)
354 | Human health activities | 86 | 96.70% | 61 document(s)
Appendix IX: Percentage Innovative URLs per SBI level 4/5 Category
Count Level 4/5 Category Name SBI Percentage
0 Research and development on technology 72192 2.31%
1 Treatment and coating of metals 2561 0.46%
2 Machining 2562 0.46%
3 Agents involved in the sale of agricultural raw materials, live animals, textile raw materials 4611 0.46%
4 Agents involved in the sale of fuels, ores, metals and chemicals 4612 0.46%
5 Agents involved in the sale of machinery, industrial equipment, ships and aircraft 4614 0.46%
6 Manufacture of non-domestic cooling and ventilation equipment 2825 0.46%
7 Specialised hospitals (not for mental health) 86103 0.46%
8 Manufacture of man-made fibres 2060 0.46%
9 Manufacture of agricultural and forestry machinery 2830 0.46%
10 Development of building projects 4110 0.46%
11 Social clubs 94991 0.93%
12 Manufacturing of bodies (coachwork) for motor vehicles 29201 0.46%
13 Manufacturing of trailers and semi-trailers 29202 0.46%
14 Financial holdings 6420 9.26%
15 Other interest organizations n.e.c.* 94997 0.46%
16 Growing of arboricultural crops in open fields 1305 1.39%
17 Non-specialised wholesale of food 4639 0.46%
18 Renting of trucks, busses and motor homes 7712 0.93%
19 Roofing 4391 0.46%
20 Production of electricity by solar cells, heat pumps and hydropower 35113 3.70%
21 Wholesale of computers, peripheral equipment and software 4651 0.46%
22 Wholesale of electronic and communication equipment and related parts 4652 1.39%
23 Investment funds in financial assets 64301 0.46%
24 Investment funds in real estate 64302 0.46%
25 Wholesale of agricultural machinery, equipment and tractors 4661 0.46%
26 Sale and repair of passenger cars and light motor vehicles (no import of new cars) 45112 0.46%
27 Writing, producing and publishing of software 6201 5.09%
28 Computer consultancy activities 6202 1.39%
29 Manufacture of basic pharmaceutical products 2110 0.46%
30 Weighing and measuring 52292 0.46%
31 Manufacture of communication equipment 2630 0.93%
32 Private security 8010 0.46%
33 Wholesale of video and music recordings 46435 0.46%
34 Organisational planning 70221 1.85%
35 Removal services 4942 0.46%
36 Wholesale of internal transport equipment 46691 0.93%
37 Security systems service activities 8020 0.46%
38 Wholesale of flowers and plants 4622 0.93%
39 Renting and leasing of other machinery and equipment and of other goods (no vending and slot 77399 0.46%
40 Business education and training 85592 0.93%
41 Manufacture of instruments for measuring, testing, navigation and controlling 2651 0.46%
42 Repair and maintenance of machinery for general use and machine parts (no tools) 33121 0.93%
43 Repair and maintenance of machinery for specific industries 33123 0.46%
44 Wholesale of heating and cooling equipment 46692 0.93%
45 Wholesale of combustion engines, pumps and compressors 46693 0.46%
46 Wholesale of fittings and equipment for industrial use 46694 0.46%
47 Wholesale of measuring and control equipment 46695 0.46%
48 Wholesale of packaging 46696 0.46%
49 Wholesale of detergents and cleaners 46442 0.46%
50 Wholesale of other machines, equipment and supplies for manufacturing and trade n.e.c.* 46699 0.93%
51 Renting of non-residential real estate 68204 0.46%
52 Educational support activities 8560 0.46%
53 Retail sale via internet of food and medical goods 47911 0.46%
54 Other information service activities n.e.c.* 6399 0.46%
55 Product design 74102 0.46%
56 Interior and spatial design 74103 0.46%
57 Wholesale of medical and dental instruments, nursing and orthopaedic articles and laboratory 46462 1.39%
58 Other retail sale 47999 0.93%
59 Practices of physiotherapists 86912 0.46%
60 Wholesale of ferrous metals and ferrous semi-finished products 46722 0.46%
61 Wholesale of articles for lighting 46473 0.46%
62 Support activities for the own enterprise group 70101 0.46%
63 Trust offices 66191 0.93%
64 Umbrella organisations in the field of health care and other support activities for health care 86929 0.46%
65 Specialised wholesale of other construction materials 46738 1.85%
66 Non-specialised wholesale of construction materials 46739 0.46%
67 Wholesale of hardware 46741 0.46%
68 Mixed farming 150 0.46%
69 Wholesale of sports goods (not for water sports) 46496 0.46%
70 Inland passenger water transport and ferry-services 5030 0.46%
71 Community centres, other consultancy and cooperative bodies in the field of welfare 88999 0.93%
72 Web portals 6312 2.31%
73 Manufacture of plastic packing goods 2222 0.46%
74 Manufacture of builders’ ware of plastic 2223 0.46%
75 Management of real estate 6832 0.46%
76 Manufacture of electric lighting equipment 2740 0.46%
77 Manufacture of other plastic products 2229 0.46%
78 Trade of electricity and gas through pipes 3514 2.31%
79 Burial and cremation services 96031 0.46%
80 Other cleaning 8129 0.93%
81 Forging, pressing, stamping and roll-forming of metal; powder metallurgy 2550 0.46%
82 Architects (no interior architects) 71111 0.46%
83 Engineers and other technical design and consultancy 7112 10.19%
84 Specialist medical practices and outpatients' clinics (no dentistry or psychiatry) 86221 0.46%
85 Freight transport by road (no removal services) 4941 0.93%
86 Stockbrokers, investment consultants etc. 6612 0.46%
87 Management and business consultancy (no public relations and organisational planning) 70222 0.46%
88 Holding companies (not financial) 70102 1.85%
89 Earth moving 4312 0.46%
90 Manufacture of metal structures and parts of structures 2511 1.39%
91 Manufacture of metal tanks and reservoirs 2529 0.93%
92 Collection of non-hazardous waste 3811 0.46%
93 Impregnation of wood 16102 0.46%
94 Other manufacturing n.e.c.* 32999 1.39%
95 Retail sale via internet of articles for house and garden 47915 0.46%
96 Data processing, hosting and related activities 6311 0.46%
97 Treatment of non-hazardous waste 3821 0.46%
98 Manufacture of paints, varnishes and similar coatings, printing ink and mastics 2030 0.46%
99 Manufacture of other special-purpose machinery and equipment n.e.c.* 2899 0.46%
100 Actuarial and pension consultancy; management of pension funds 66292 0.46%
101 Processing of meat (no prepared dishes) 1013 0.46%
102 Manufacture of medical instruments and supplies (no dental laboratories) 32502 1.39%
103 Recovery of sorted materials 3832 0.93%
104 Installation of electronic and optical equipment 3323 0.46%
105 Processing of fish 1020 0.93%
106 Wholesale and commission trade of motor vehicle parts and accessories (no tyres) 45311 0.46%
Appendix X: Stability of Clusters
K=100
Percentage of Clusters Falling Within Similarity Percentage
Percentage of Similarity Within Cluster
Sets 100%>90% 90%>80% 80%>70% 70%>60% 60%>50% 50%>40% 40%>30% 30%>20% 20%>10% 10%>0% Average similarity within all clusters*
1,2 10% 3% 0% 6% 2% 3% 4% 2% 4% 66% 27.06%
1,3 15% 4% 5% 2% 2% 0% 4% 3% 3% 62% 21.36%
1,4 16% 3% 3% 2% 2% 0% 2% 3% 4% 65% 24.81%
1,5 14% 3% 1% 4% 2% 5% 4% 1% 1% 65% 25.06%
2,3 15% 3% 4% 5% 1% 1% 2% 0% 2% 67% 25.72%
2,4 14% 3% 9% 2% 2% 1% 1% 0% 4% 65% 25.97%
2,5 16% 6% 5% 4% 1% 1% 2% 1% 0% 64% 29.20%
3,4 15% 3% 5% 2% 3% 2% 0% 3% 0% 67% 25.78%
3,5 18% 2% 2% 3% 1% 2% 1% 2% 3% 66% 25.41%
4,5 14% 5% 4% 2% 3% 1% 4% 2% 2% 63% 26.70%
Mean 15% 4% 4% 3% 2% 2% 2% 2% 2% 65% 25.71%
St.dev. 0.02 0.01 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.02 0.02
K=500
Percentage of Clusters Falling Within Similarity Percentage
Percentage of Similarity Within Cluster
Sets 100%>90% 90%>80% 80%>70% 70%>60% 60%>50% 50%>40% 40%>30% 30%>20% 20%>10% 10%>0% Average similarity within all clusters*
1,2 17% 6% 5% 3% 2% 2% 2% 2% 4% 57% 31.91%
1,3 15% 6% 5% 3% 2% 1% 2% 2% 3% 60% 30.29%
1,4 16% 8% 4% 1% 2% 1% 2% 1% 5% 59% 30.67%
1,5 17% 7% 4% 3% 2% 1% 1% 2% 4% 59% 30.89%
2,3 18% 7% 4% 4% 3% 2% 2% 2% 5% 53% 33.99%
2,4 19% 6% 5% 4% 3% 2% 1% 2% 4% 54% 34.58%
2,5 18% 8% 4% 3% 3% 2% 0% 2% 4% 56% 33.67%
3,4 18% 6% 4% 3% 1% 3% 2% 2% 3% 58% 31.60%
3,5 19% 5% 4% 3% 2% 1% 2% 2% 3% 58% 31.58%
4,5 18% 7% 4% 2% 3% 2% 1% 3% 4% 58% 31.54%
Mean 18% 7% 4% 3% 2% 2% 2% 2% 4% 57% 32.07%
St.dev. 0.01 0.01 0.00 0.01 0.01 0.01 0.01 0.00 0.01 0.02 0.01
K=1500
Percentage of Clusters Falling Within Similarity Percentage
Percentage of Similarity Within Cluster
Sets 100%>90% 90%>80% 80%>70% 70%>60% 60%>50% 50%>40% 40%>30% 30%>20% 20%>10% 10%>0% Average similarity within all clusters*
1,2 11% 6% 4% 2% 1% 1% 1% 2% 3% 69% 23.09%
1,3 10% 6% 3% 2% 2% 1% 1% 2% 3% 69% 21.84%
1,4 11% 6% 3% 3% 2% 1% 1% 2% 3% 67% 23.77%
1,5 11% 5% 3% 2% 2% 1% 1% 2% 4% 67% 23.00%
2,3 10% 5% 3% 2% 2% 2% 1% 1% 2% 70% 22.15%
2,4 12% 6% 3% 3% 2% 1% 1% 2% 3% 68% 23.41%
2,5 11% 6% 3% 3% 2% 1% 1% 1% 3% 69% 23.00%
3,4 11% 6% 4% 2% 1% 1% 1% 1% 3% 70% 22.00%
3,5 10% 5% 3% 2% 2% 1% 1% 1% 3% 71% 21.00%
4,5 12% 6% 3% 2% 2% 2% 1% 1% 4% 67% 24.00%
Mean 11% 6% 3% 2% 2% 1% 1% 2% 3% 69% 22.73%
St.dev. 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.01
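One way to arrive at a table of this shape is sketched below. It assumes that two runs label the same documents in the same order, that each cluster in the first run is matched to its most similar cluster in the second run, and that similarity is measured with the Jaccard index. These choices, and all function names, are assumptions made for illustration; the measure actually used is described in Section 3.6 and may differ.

```python
from collections import defaultdict


def clusters_from_labels(labels):
    """Group document positions by cluster label."""
    groups = defaultdict(set)
    for doc_id, label in enumerate(labels):
        groups[label].add(doc_id)
    return list(groups.values())


def best_match_similarity(run_a, run_b):
    """For every cluster in run_a, the highest Jaccard similarity (in %)
    with any cluster in run_b."""
    clusters_b = clusters_from_labels(run_b)
    return [
        100.0 * max(len(a & b) / len(a | b) for b in clusters_b)
        for a in clusters_from_labels(run_a)
    ]


def bin_similarities(similarities):
    """Count clusters per 10-point band; index 0 is 0-10%, index 9 is 90-100%."""
    bins = [0] * 10
    for s in similarities:
        bins[min(int(s // 10), 9)] += 1
    return bins


# Hypothetical toy runs over six documents (cluster labels are arbitrary ids).
run_1 = [0, 0, 0, 1, 1, 1]
run_2 = [5, 5, 5, 7, 7, 8]
sims = best_match_similarity(run_1, run_2)
print(sims, bin_similarities(sims), sum(sims) / len(sims))
```

Because cluster ids are arbitrary between runs, the comparison is made on the sets of documents rather than on the labels themselves.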
Appendix XI: Code Excerpts
Jupyter Notebook – sampling and checking the Common Crawl index (Section 3.2.1)
Jupyter Notebook Web Scraping Code Excerpts (Section 3.2.2)
Gets 100 html code into _pickle files
Gets texts from pickled file
Language Detection (Section 3.3.1)
Finding URLs classified as e.g. Afrikaans
Stemming and Removing Stop-words (Section 3.3.2)
TF-IDF (Section 3.5)
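As a rough illustration of this step, the sketch below computes TF-IDF weights with document-frequency thresholds of the kind listed in Appendix V (min_df and max_df as proportions of documents). It uses only the standard library and a plain log-based IDF, so the weights will differ from those of any particular library implementation. The function name `tfidf` and the toy documents are hypothetical, and this is not the code used in the thesis.

```python
import math
from collections import Counter


def tfidf(docs, min_df=0.0, max_df=1.0):
    """docs: list of token lists. Terms whose document frequency (as a
    proportion of documents) lies outside [min_df, max_df] are dropped.
    Returns the sorted vocabulary and one sparse {term: weight} dict per doc."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(t for t, c in doc_freq.items() if min_df <= c / n_docs <= max_df)
    idf = {t: math.log(n_docs / doc_freq[t]) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({t: term_freq[t] * idf[t] for t in vocab if t in term_freq})
    return vocab, vectors


# Hypothetical toy corpus of stemmed tokens.
toy_docs = [
    ["wijn", "drink", "wij"],
    ["hotel", "kamer", "wij"],
    ["wijn", "wit", "wij"],
    ["tuin", "groen", "wij"],
]
vocab, vectors = tfidf(toy_docs, min_df=0.3, max_df=0.7)
print(vocab, vectors)
```

In this toy corpus the term "wij" occurs in every document and is removed by max_df, while terms that occur in only one document are removed by min_df.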
k-means mini-batch (Section 4)
Stability of Surfacing Clusters (Section 3.6)
Creating the Clusters / SBI comparison table, including getting the SBI category name for each code
(Section 4).
Appendix XII: Software, Libraries and Hardware
In this research, the Python programming language is used for extracting the texts, clustering the
documents and analysing the clusters. For most of the research Jupyter Notebook is used. For some
processes, however, especially when running mini-batch k-means, Spyder is used, because it proved
more efficient in terms of system memory. Both Windows 10 and Linux Ubuntu environments are used in
the research.
Besides the standard Python libraries, cdx-index-client is used to retrieve URLs from the Common
Crawl archive, in order to address the second research question.12 Grequests (a combination of
Requests and gevent) is used to retrieve HTML pages from the web.13 In addition, the NLTK package
is used for pre-processing.
For the great majority of the work, two machines have been used for creating models, performing
experiments and documenting results. These can be found in the following table.
Located Portable Statistics Netherlands
(The Hague)
System specifications
HP Elitebook 2560p-02
Intel® Core™ i7-2620M
CPU 2.70 GHz (4 cores)
6.00 GB RAM
Windows 10 Pro
Big Data PC
Intel ® Xeon ® CPU E5-2670 0
@2.60 GHZ (16 cores)
Gallium 0.4 NVD 9 64- bit
Memory 62.8 GB 64GB
Operating System Windows 10 Professional Linux Ubuntu 16
12 cdx-index-client is a tool designed especially for Common Crawl. It can be used to retrieve URLs
in bulk from the Common Crawl database through the Common Crawl index API, and it does so using
parallel programming.
13 Grequests is a combination of the Requests module and the gevent module and is used to make
asynchronous HTTP requests. This is a more efficient way to retrieve web pages, since it allows
multiple requests to be made simultaneously.
Appendix XIII: Explanation SBI Levels
The SBI excerpt below has different levels of categories. Throughout this thesis different levels are
used.
1. The first level is, e.g., C Manufacturing; in this thesis this is called a level 1 category (one letter).
2. The second level is, e.g., 10 Manufacture of food products; in this thesis this is called a level 2
category (two digits).
3. The third level is, e.g., 105 Manufacturing of dairy products; in this thesis this is called a level 3
category (three digits).
In the level 3 category 869 Paramedical practitioners and other human health activities without
accommodation, there are both categories with 4 digits and categories with 5 digits. Categories with
5 digits are unique to the Netherlands. Both 4- and 5-digit categories exist on the same level,
namely one level beneath a level 3 category.
This level is therefore called a level 4/5 category (four or five digits).
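The naming convention above can be expressed as a small helper. The sketch assumes that codes are passed as strings (so that any leading zero is kept) and that the level follows from the form of the code alone. The function name `sbi_level` is hypothetical.

```python
def sbi_level(code):
    """Return the level name used in this thesis for an SBI code:
    one letter -> "1", two digits -> "2", three digits -> "3",
    four or five digits -> "4/5"."""
    code = code.strip()
    if len(code) == 1 and code.isalpha():
        return "1"
    if code.isdigit():
        if len(code) in (2, 3):
            return str(len(code))
        if len(code) in (4, 5):
            return "4/5"
    raise ValueError(f"not a recognised SBI code: {code!r}")


for example in ["C", "10", "105", "6201", "86919"]:
    print(example, "-> level", sbi_level(example))
```

For example, 6201 and 86919 are both reported as level 4/5, in line with the treatment of four- and five-digit categories as a single level.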