
Information Sciences 219 (2013) 41–72


Effectiveness of template detection on noise reduction and websites summarization

Derar Alassi, Reda Alhajj ⇑
Department of Computer Science, University of Calgary, Calgary, Alberta, Canada

Article info

Article history:
Received 15 January 2011
Received in revised form 20 December 2011
Accepted 14 July 2012
Available online 25 July 2012

Keywords:
Template detection
Noise detection
Web mining
Website summarization

0020-0255/$ - see front matter © 2012 Elsevier Inc. All rights reserved.
http://dx.doi.org/10.1016/j.ins.2012.07.022

⇑ Corresponding author. E-mail address: [email protected] (R. Alhajj).

Abstract

The World Wide Web is the most rapidly growing and accessible source of information. Its popularity has been largely influenced by the wide availability of the Internet in almost every modern house and even on the go after handheld devices became widespread. Yet, pages on the Web carry additional template information (we call it noisy) that does not add value to the actual content of the page. Even worse, it can harm the effectiveness of Web mining techniques; these templates could be eliminated by preprocessing. Templates form one popular type of noise on the Internet. In this paper, we introduce Noise Detector (ND) as an effective approach for detecting and removing templates from Web pages. ND segments Web pages into semantically coherent blocks. Then it computes content and structure similarities between these blocks; a presentational noise measure is used as well. ND dynamically calculates a threshold for differentiating noisy blocks. Provided that the investigated website has a single visible template, ND can detect the template with high accuracy using two pages only. However, ND can be expanded to detect multiple templates per website, and the challenge will be to minimize the number of pages to be checked. Further, ND leads to website summarization. The conducted experiments show that ND outperforms existing approaches in space complexity, time complexity (see Section 4.6 for more details on ND's processing time against other algorithms), minimum requirements to produce acceptable results, and results accuracy.

© 2012 Elsevier Inc. All rights reserved.

1. Introduction

With the emergence of the Internet and the massive amount of available content, mining the Web has become an inevitable fact that researchers have focused on. Bing Liu states that the Internet is perhaps the largest source of information that anyone can access at any time [16]. The document described in [25] shows that around 1.5 billion people use the Internet.

People have different interests when using the Internet. Some of them might be interested in searching for Web resources using the so-called search engines, such as Google, Yahoo, etc. Other users might want to get summaries of Web pages using summarization utilities to save the time required for reading the whole pages [9,21]. More advanced users could be interested in automatically extracting information from pages for later processing. This can be achieved by the so-called information extraction systems [6,15]. These different systems are all examples of Web mining systems.

A Web mining system takes the input data through three different stages to reach its final result: namely pre-processing, data mining and post-processing [17]. Since data is the input in a data mining system, pre-processing becomes essential to transform this raw data into an acceptable format. Pre-processing may include eliminating attributes that are irrelevant and/or cleaning the data from noisy information. The latter step is very important in data pre-processing since data on the Web includes a high level of noisy information [23].

Fig. 1.1. Noise Detector architecture.

Web noise can be defined as information present in a Web page and not relevant to the main content of the page. For example, when there is an article with a specific title, all the copyright notices, banners, advertisements and other redundant segments within that website are considered noise because they do not add any value to the article. If Web mining systems (search engines, summarization utilities, recommender systems, information extraction systems) have noisy data as their input, the results will not be accurate [13].

Two types of noise that exist on the Web have been defined in the literature, namely local noise and global noise [27]. The latter type of noise exists with large granularity, and includes mirror sites, duplicated Web pages and near-replicas. Global noise not only tampers with the quality of crawling and ranking by Web information retrieval systems, it also forces archiving applications to waste large disk space for storing the duplicated pages. On the other hand, the former type covers noisy content within a Web page. Local noise is usually incoherent with the main topic of the Web page, e.g., advertisements, navigation panels, copyright announcements, etc. Local noise makes it very difficult for programs to automatically grasp the topic content of a page, i.e., local noise worsens the quality of Web applications. The first step to discover local noise is segmenting a page into blocks.

Definition 1.1 (Block). A block is a semantically and visually coherent area of a Web page.

Definition 1.2 (Intra-Site Noise). Intra-site noise is the set of blocks that are redundant over pages in the same website.

In fact, websites do incorporate some information (like advertisements) that should be displayed with each page of the website. Such information is essential according to website owners because it might be a major source of their profit. However, it is considered to be noise by most visitors, who are interested in the actual content of the website. A recent study by Gibson et al. [9] shows that templates represent between 40% and 50% of the Web; they are growing every year. To cope with noisy information, the approach proposed in this paper segments Web pages into blocks (see Definition 1.1), and then processes the blocks to find out those which make up the site template. We are mainly interested in intra-site noise (templates) as specified in Definition 1.2.

This paper proposes an integrated approach called Noise Detector which is capable of investigating and eliminating intra-site (template) noise. The extensive experiments that we have conducted show that the proposed Noise Detector achieves high precision and recall rates. Better results can then be achieved by the different Web mining techniques that use its output.

Fig. 1.1 shows the main modules of the proposed Noise Detector. It has five main modules: page segmentation, DOM-tree builder, filter, noise matrix computer, and the cut-off value calculator. The input to the Noise Detector consists of only two pages, provided that the whole investigated website has a single template; as the number of templates per website increases, more effort will be required to decide on the most appropriate number of pages to consider for deriving all the templates; the latter multi-template case has been left as future work (see Section 4.7). The small number of pages required as input gives the proposed approach strength over the other approaches described in the literature (more details are covered in Section 2).

The two input pages, say p1 and p2, are segmented into blocks using a utility called Vision-based Page Segmentation (VIPS) [5]. Each page is segmented at the semantic level to generate coherent areas. Then a Document Object Model tree (DOM-tree) is built for each block. The DOM-tree is a standard for accessing and manipulating HTML documents (W3C).


The blocks are validated by the valid-pass filter. A valid block has size (amount of content) greater than or equal to a pre-defined threshold. This threshold is 20 words¹ and was determined through experiments to trim out all unnecessary blocks. This is considered as an early block elimination step. The valid blocks are passed to the noise matrix module that compares the content and structure of each block in p1 against all the blocks in p2. In other words, the noise measure employed by the Noise Detector relies on the content and the structure similarity between blocks in the input pages. Essentially, this is a template detection approach since we are using the similarity between blocks to determine components of the template. After we compute the noise matrix, we assign to each block in the two pages its corresponding noise value. If the numbers of blocks in p1 and p2 are the same, then the first block in p1 will be compared against the first block in p2, the second block from p1 against the second block in p2, and so on. In case the number of blocks in p1 is different from the number of blocks in p2, a block matcher algorithm is applied to find blocks from p2 that match blocks in p1. All blocks left in p1 without matching blocks from p2 will be assigned noise values using only their presentational features. Section 3.2 shows with detailed examples the different scenarios of how to compute the noise values of each block in p1 and p2. The last stage before outputting the valid blocks is to calculate the cut-off value, which defines the threshold that determines the valid blocks. This value is calculated dynamically; thus it is case-dependent, i.e., a page could have different cut-off values when it is matched against different pages. More information on calculating the cut-off value is included in Section 3.2.6. After calculating the cut-off value, blocks with a noise value higher than the calculated threshold are marked as noisy. The final result of the system is the set of blocks that are not noisy. We combine all the latter blocks to create the new clean HTML file.
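To make this flow concrete, the following is a minimal, self-contained Python sketch of the pipeline. The segmentation, similarity, and cut-off steps are deliberate simplifications (blank-line splitting and word overlap instead of VIPS and the full content/structure/presentation measure, and a mean-based cut-off instead of the dynamic calculation of Section 3.2.6), so the sketch illustrates only the control flow of Fig. 1.1, not the actual modules.

```python
from typing import List

def segment(page: str) -> List[str]:
    # Stand-in for VIPS: split a page into "blocks" on blank lines.
    return [b for b in page.split("\n\n") if b.strip()]

def valid(block: str, min_words: int = 20) -> bool:
    # Valid-pass filter: drop blocks with fewer than 20 words.
    return len(block.split()) >= min_words

def similarity(a: str, b: str) -> float:
    # Stand-in noise measure: word overlap (the real system combines
    # content, structure, and presentation similarities).
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(1, min(len(wa), len(wb)))

def clean_page(p1: str, p2: str) -> str:
    blocks1 = [b for b in segment(p1) if valid(b)]
    blocks2 = [b for b in segment(p2) if valid(b)]
    # Noise value of each block in p1 = similarity to its best match in p2.
    noise = [max((similarity(a, b) for b in blocks2), default=0.0) for a in blocks1]
    # Stand-in cut-off: the mean noise value (the paper computes it dynamically).
    cutoff = sum(noise) / max(1, len(noise))
    # Blocks whose noise value exceeds the cut-off are template (noisy) blocks.
    return "\n\n".join(a for a, v in zip(blocks1, noise) if v <= cutoff)
```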

As a byproduct, the Noise Detector is able to identify websites that follow a pre-defined template and whether the template is consistent for all pages of the analyzed website. This is achieved by comparing, in different pages from the site, the number of blocks and their respective noise values. In other words, it is possible to comment on the design of a website by determining whether the same template is used across the site to specify the extra information in each of its pages.

The proposed Noise Detector is a useful tool for cleaning websites before they are further processed to serve certain targets like Web mining. The following points summarize the main contributions of the work described in this paper:

• We propose an automated system called Noise Detector which is capable of detecting intra-site noise. This system outputs clean pages that can be used as input to Web mining techniques.
• Noise Detector operates with two pages as input; this is optimal as compared to the other approaches described in the literature. Experiments show high precision and recall values of detecting noisy blocks.
• We propose a complex and robust noise measure. This measure includes content, structure and presentation of a Web page. To the best of our knowledge, this is the first measure that combines these three features to detect templates.
• We have conducted a number of thorough experiments to evaluate our system and to validate the measures that we have proposed. To avoid any bias and to show that Noise Detector has not been engineered to serve specific cases, a variety of websites from different domains have been used in the testing. The reported results demonstrate the applicability and effectiveness of the Noise Detector.

The main component of this paper is the Noise Detector as outlined above. Its different components are described, tested and validated in the following sections. The necessary background is also covered in order to make this document more self-contained. The remainder of this paper is organized as follows. Section 2 discusses some of the related work. In addition, it provides an overview of the background. In Section 3 we describe the proposed Noise Detector system. Section 4 investigates the applicability of Noise Detector as a noise detection system. Section 5 presents the summary, conclusions and future research directions.

2. Related work

Web content analysis and extraction has received considerable attention in the research community, e.g., [1,9,13,19,23,27]. The main goal of this paper is to detect and eliminate Web noise that is part of templates. Our work as described in this paper overlaps with several Web-related research areas, including noise detection and elimination, template detection, and informative segments and blocks detection. Some of these topics are interchangeable, e.g., noise detection and template detection (see Definition 1.2). However, discovering informative blocks in a Web page and noise detection on the Web are complementary techniques. Therefore, we will survey the different pieces of work that cover these approaches. Most importantly, we will discuss how each technique tackles the problem and what shortcomings each approach has. Then, we will briefly discuss how we address these problems in our proposed approach.

2.1. Block vs. whole page based detection

We will investigate the approaches that have been introduced in the literature to solve the problem of detecting templates and informative blocks. We classify these approaches into three different types: presentation-based, DOM-based, and segmentation-based approaches.

¹ The PDOC value that determines the coherence level of VIPS should be considered as well in defining this threshold. The higher the PDOC is, the more coherent the blocks are, and hence the less content blocks contain in general.


Gupta et al. [10] use a naïve approach to remove advertisements and links; their approach is based on the ratio of the number of linked and non-linked words in a DOM-tree. The linkage ratio in a block is a general indication that the block is likely to be part of the template, but this is not enough. For example, a block that has copyright information at the end of a page does not have a high linkage ratio; however, it is part of the template. On the other hand, if a page has a lot of useful links, e.g., MSN,² most of them will be removed, which is not acceptable. Yi et al. [27] introduced an interesting approach to eliminate noisy information from Web pages. They proposed the Site Style Tree (SST), which is discovered by processing a set of Web pages from a specific website. Initially, the SST is the DOM-tree of the first document. Then they build the DOM-tree of each Web page and match it with the existing trees. Thus, the SST is an accumulative tree of all the mapped trees (union). While matching a new tree, if a node in the new DOM-tree exists in the SST, then the frequency of that node is incremented. Otherwise, the node is inserted into the SST with an initial frequency of 1. To build a "site-representative" SST, they require 400–500 pages from that site. This is a large number, especially when compared to our approach, which requires two pages at a time. After the SST is constructed, the frequency of nodes will identify nodes that are part of the template. Essentially, the higher frequency a node has, the more likely this node is part of the template. In addition, they use information entropy to calculate node importance; they eventually identify the noisy nodes that have low importance values. Presentational features are used to calculate the importance values as well.

Lo et al. [18] proposed CF-EXALG (Collaborative Finer-EXALG), which is based on the information extraction system EXALG (EXtraction ALGorithm). CF-EXALG parses HTML pages to extract sets of words that have similar occurrence patterns in different pages. As a result, equivalence classes are constructed to represent these occurrence patterns. Then any newly unseen pages will be parsed in the same fashion to extract their tokens (frequently occurring patterns). After that, the discovered tokens are matched against the equivalence classes that were deduced from previous pages. The last stage is to output the template schema using the frequent patterns as an XML DOM tree.³ This algorithm basically uses the content to extract tokens which will be used to identify the template of the site. This means that content is the only feature used, and therefore a large number of pages is needed to produce good results. Another shortcoming of their evaluation is not conducting extensive experiments to show precision and recall accuracy results. The only measure they used is "correctness", which is not defined clearly.

The algorithms proposed under the segmentation-based category either use a previously proposed segmentation algorithm or propose a new one. Lin and Ho [14] discover the informative content in news websites. They use the <Table> tag as a "container tag" to extract content strings in each block. Then, the entropy of each feature in each block is calculated. Finally, the final entropy value of a block is defined as the summation of its features' entropy values. This work is limited to news websites, as appears in their experimental section. In addition, it only uses the <Table> tag as the "container tag" to discover the informative blocks, which is not always the case. Clearly, there are other tags that are used to lay out blocks in Web pages, such as horizontal lines (<HR>) or the division tag (<DIV>). Yet, utilizing the visual layout of a Web page is more effective in the segmentation process. Bar-Yossef and Rajagopalan [3] do utilize the visual and functional structure of Web pages in their work. They segment Web pages into pagelets, which are equivalent to blocks in our work. A pagelet is essentially a DOM-tree element that has at least k hyperlinks, and none of its ancestor elements is a pagelet. Therefore, a pagelet is meant to be an area with single and well-defined functionality that is not shared with another area. After pagelets have been identified, almost-similarities are computed between pagelets using a shingling⁴ technique.

Song et al. [21] proposed a block importance model that automatically assigns importance weights to different blocks (regions) in a Web page. VIPS [5] is used first to segment Web pages. Spatial and content features are then used to calculate block importance. Spatial features include the x-coordinate of a block center, the y-coordinate of a block center, block width, and block height. On the other hand, content features include image number, interaction size, form number, and form size. A vector is built using a set of labeled blocks that have been already extracted from different pages. This vector represents all features that belong to the analyzed block; it is used later with its corresponding importance value to find the best function f that minimizes the expression |f(x) − y|, where x is the feature vector of a block and y is its importance. They solve this problem as a regression problem by considering that the input is a continuous variable. However, it is considered as a classification problem when the input is discrete. This consideration requires labeling example data sets first, and then building these models to be applied on unseen data sets.

Li and Ezeife [13] proposed WebPageCleaner, which detects and removes noisy blocks from Web pages. VIPS is used to segment a set of Web pages into a set of blocks; then the content similarity and linkage percentage of the extracted blocks are computed. Text fingerprints⁵ are also used to calculate the content similarity between blocks. Another measure is calculated using the position of these blocks, which will later be combined into one importance measure. Blocks with the highest N importance values are then used to create the final clean file. They assume N = 3, which is not always a valid assumption. In Section 4.3.3, we show that the number of valid blocks is variable; it is website-dependent; even worse, it varies from one page to another in the same website. We also compare our approach against the approach of Li and Ezeife [13].

² http://www.msn.com/.
³ An XML DOM-tree is a standard way for accessing and manipulating XML documents. An XML DOM-tree presents XML documents with elements, attributes and text as nodes, the same as in an HTML DOM-tree.
⁴ A shingle is a text fingerprint that is invariant under small perturbations (Template detection via data mining and its applications, 2002).
⁵ Let S be a string of n symbols over some alphabet Σ. We say that a substring S′ occurring within S has fingerprint Σ′ ⊆ Σ; the fingerprint Σ′ is also called the alphabet of S′ (Efficient Text Fingerprinting via Parikh Mapping, 2003).


Debnath et al. [7] proposed an approach close to our work. Page segmentation is first done using a set of tags that are defined as potential "container tags", including <TR>, <P>, <HR>, and <UL>. Consequently, this work may be considered a more general approach than that of [14]; the latter uses only the table tag (<Table>). A block is defined as the set of HTML tags, along with their corresponding content, that reside between an open tag and its corresponding closing tag. Then, the FeatureExtractor and ContentExtractor modules identify importance values for these blocks. FeatureExtractor depends on heuristics and the existence of certain features to consider whether a block is valid; however, it is not clear what these features are. The ContentExtractor module marks blocks as valid when they contain pre-defined content features, e.g., number of terms, number of images, number of java-scripts, etc. Afterwards, the Inverse Block-Document Frequency (IBDF) measure is calculated; it is inspired by the Term Frequency-Inverse Document Frequency (TF-IDF), which is a weight measure often used in information retrieval and text mining. TF-IDF is a statistical measure used to weight the importance of a word (term) in a document. The importance of a term increases proportionally to its frequency in a document, but it is offset by the frequency of the word in the corpus. Similarly, IBDF assigns a higher value to a block that appears less frequently than another block which appears more frequently in a set of Web pages extracted from a website. A threshold (like 0.9) provided by an expert determines whether two blocks are similar or not. The experimental results show that the approach proposed in [18,20] outperforms the entropy-based approach proposed by Yi et al. [27]. Nevertheless, the approach of Yi et al. [27] had been shown to outperform the template-based approach in [3].
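For comparison, the standard TF-IDF weighting referred to above is commonly written as follows (this is the textbook formulation, not a formula taken from [7]):

$$\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \cdot \log\frac{N}{\mathrm{df}(t)}$$

where tf(t, d) is the frequency of term t in document d, N is the number of documents in the corpus, and df(t) is the number of documents containing t. IBDF is the analogue with blocks in place of terms and a site's pages in place of the corpus: blocks that recur in many pages of the site receive low weights, marking them as likely template parts.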

Cai et al. [4] suggest that block-level Web search is more effective than page-level Web search. They proposed a combined segmentation approach using VIPS and fixed-length properties; this gives the best results compared to the VIPS-alone approach, DOM-based page segmentation, and fixed-length page segmentation. This approach does not apply any learning process that could reduce the complexity of the proposed algorithm; i.e., each page must be segmented first and then the importance values are computed for each block. Fernandes et al. [8] proposed an approach specifically for information retrieval algorithms; it operates at the block level. The authors claim that computing importance values of blocks based on the occurrence of terms in blocks (and not the whole page) gives much better results. Our approach (Noise Detector), however, proposes a method to identify blocks that make up a template in a website; this can later be used to infer the DOM structure of blocks that are part of the template. The latter step eliminates the overhead of segmenting Web pages and requires only traversing the Web page's DOM tree. This learning step has not been implemented in Noise Detector; more details are provided in Section 4.7. Another advantage of Noise Detector is using semantically coherent blocks as the processing units to discover a website's template. This makes Noise Detector unique in the way it detects templates, where it uses blocks as the processing unit and incorporates different aspects of Web pages, such as content, structure, and presentation.

Recently, Akbar et al. [2] proposed an algorithm to extract the main blocks in blog posts. They adapted the Content Structure Tree (CST)-based approach by considering the attenuation quotient suffered by HTML block nodes. This is achieved by detecting primary and secondary markers for page clusters and then extracting the main block using these markers. However, Li et al. [12] have used VIPS to segment Web pages, which makes their work close to ours. They form a feature vector considering the visual, spatial, and content aspects of the extracted blocks in order to cluster them later.

We have covered from the literature the main approaches which detect templates and consequently eliminate them from Web pages. Intuitively, we can classify these approaches into two categories according to their basic processing unit:

1. Page-based approaches: these approaches depend on the DOM-tree of the whole page to discover patterns that recur in multiple Web pages of the same site; [14,27] are examples of this category. For instance, the approach of Yi et al. [27] proved to be effective in its accuracy results, though it needs 400–500 pages from a site to detect its template. Furthermore, the algorithm is characterized by high complexity because of the need to process the whole DOM-tree of Web pages.

2. Block-based approaches: these approaches consider blocks as the processing unit, as opposed to a whole Web page; [7,13,21] are some examples of this category. They segment each page into a set of blocks and then process this set using different techniques. Processing DOM-trees of smaller-sized blocks is more efficient than processing DOM-trees of complete pages. More importantly, the visual design followed by every Web page justifies the segmentation process. Nevertheless, segmentation has to make use of the visual design, and consequently operates at the semantic level of a Web page.

The DOM tree was initially introduced in the browser for presentation. Therefore, it does not provide much information about the semantics of a Web page. That is, two nodes in the DOM tree might have the same parent, yet they might not be semantically related to each other. This is considered a drawback of the approaches that use the whole DOM tree without considering the semantics of a Web page. Another drawback of the whole-page-based approaches is the flexibility in the HTML syntax. Even worse, Web designers do not always follow the W3C HTML specifications.

On the other hand, block-based (segment-based) approaches overcome the aforementioned drawbacks by simply utilizing the semantics of Web pages. Usually, humans do not perceive a Web page as one segment, but rather as a multi-segment area [21]. In fact, a Web designer divides a Web page into semantically different areas, each of which is intended to serve a specific function in the page. For example, a page might have a copyright segment, a banners segment, and a links-to-related-pages segment, which have different purposes.

Fig. 2.1 shows two pages from the CNN website. Similar blocks in both pages are enclosed inside frames that have the same color. This emphasizes the importance of segmenting Web pages into coherent blocks.


Fig. 2.1. Similar blocks in different pages extracted from CNN: http://www.cnn.com.


3. The proposed system – Noise Detector

The effectiveness of a system depends largely on its input, processing power and the produced outcome. Thus, a system with good-quality input gives better results.

The shortcomings mentioned in Section 2 have motivated the work described in this paper; ours is intended to be a more comprehensive and robust approach. The proposed approach operates at the semantic level of a Web page as it segments a page into coherent blocks using VIPS. These blocks are then compared with each other using a robust measure that incorporates content, structure, and presentation. Then a dynamically computed cut-off value determines which blocks are valid. In general, only two input pages are needed for our approach to give good results for websites that have one template; the number of pages required in the process increases linearly with the number of templates characterizing the website.

This section investigates all the details of the proposed system, namely the Noise Detector. We start by presenting the general architecture of the Noise Detector. Then the different modules that the system consists of will be explained. Examples and case studies are provided along the way to enable the reader to understand the different practical cases that could arise.

3.1. Noise detector architecture

This paper concentrates on identifying templates as noisy parts of Web documents, and hence develops the Noise Detector as an effective approach capable of serving this target. Most of the existing approaches build a model using a training set of annotated documents [18,20]. The model is built using different machine learning techniques and similarity measures. To be able to build such a model, a relatively large set of documents is required. For example, the work described in [27] needs 400–500 pages from a site to be able to build the SST tree that will then be used to identify the parts of a template. The work of Vieira et al. [24] requires more than 100 pages to give satisfactory results in detecting templates. On the other hand, Noise Detector as described in this paper requires only two pages to detect their common template with high confidence. Input pages fed into Noise Detector go through five main modules to reach the final stage of outputting clean HTML files. The architecture of Noise Detector is depicted in Fig. 1.1 (see Section 1).


Fig. 3.1. VIPS input/output.


3.2. Template detection process

Noise Detector uses segmentation first in its template detection process. Blocks that result from the segmentation process of a page are then compared against blocks of another page. In this comparison, we find the similarity of their structure and content, which are later combined into one similarity measure. These similarity measures are considered to be noise measures as explained in previous sections. The latter two noise measures are then combined with the presentational noise measure.⁶ High combined measures for specific blocks mean that these blocks are part of the template. Although Noise Detector can be classified as one of the approaches that perform the analysis at the block level, other approaches zoom out to consider the whole page in the process. Section 3.2.1 discusses the different approaches described in the literature in terms of the unit used in the detection process, i.e., a block or a whole page. Section 3.2.2 discusses Noise Detector's detection process in more detail.

3.2.1. Page segmentation

The above discussion emphasizes that segmenting a Web page is a valid option. The challenge that we need to address in our work is to have a good segmentation algorithm. More precisely, a good segmentation algorithm would segment a Web page as closely as possible to the way a person perceives it. Actually, the basic processing unit should look as described in Definition 1.1. We will use segment and block interchangeably in the sequel because we use them to refer to semantic and coherent areas in a Web page.

Several segmentation algorithms have been proposed in the literature, including DOM-based segmentation [6], location-based segmentation [11], and VIPS [5]. The first two algorithms fail to take the visual structure of a Web page into consideration. Consequently, they do not perceive a Web page as it should actually be perceived. However, VIPS does utilize the visual cues of a Web page and gives satisfactory results. Moreover, VIPS is technically easy to integrate into a system, as the authors provide a DLL for VIPS. Last but not least, different permitted degree of coherence (PDOC) values identify how coherent the extracted segments will be. The latter point gives VIPS flexibility and robustness, i.e., different sites have different visual structures, and hence different PDOC values will be a necessity when segmenting them. The conducted experiments reported PDOC = 6 as a good general choice for our system, though it is site-variant.

VIPS processes the HTML DOM tree of a Web page to extract its set of blocks. The following are the four main cues that are being used:

1. Tag cue: there are some HTML tags that visually separate regions in a Web page from each other. For instance, the <HR> tag is used to draw a horizontal line, and so VIPS uses this tag as a separator cue.
2. Color cue: VIPS divides a node in the DOM tree if its background color is different from the color of its children.
3. Text cue: VIPS does not divide a node if most of its child nodes are text nodes.
4. Size cue: if the standard deviation of a node's size is greater than a threshold, then VIPS divides that node.

These are the general visual cues that VIPS uses in its segmentation process. Accordingly, there is a set of rules that totally depend on these cues.

Fig. 3.1 shows VIPS as a black box with an HTML page as the input and n blocks as the output. The number of output blocks n depends on the coherence value PDOC. The larger the PDOC is, the larger n will be. Intuitively, this is due to the higher coherence degree that each block must have.

⁶ The presentational noise measure is the percentage of visual elements in a block compared to its content.


VIPS adds metadata to each output block besides its HTML content, such as page width, page height, coordinates of the top-left corner, etc. So, each block has HTML content, which is essentially a sub-tree in the DOM tree of the original input document, as well as the metadata that VIPS adds.

3.2.2. Valid-pass filter

Usually, there will be some output blocks that have small textual content. This filter removes blocks whose textual content size is less than a predefined threshold. This is considered as an early elimination step.

Some of the conducted experiments produced 25–30 blocks per page. The valid-pass filter normally passes less than half of them (12 blocks on average). In fact, the eliminated blocks do not include valuable information, and removing them reduces the time complexity of the algorithms employed in the next modules.

3.2.3. Noise matrix

The basic idea of this work is to find similar blocks in multiple pages. This will, consequently, imply that these blocks are part of the template of the analyzed website. Intuitively, we will need to compare every block in the first page against all blocks in the second page. Therefore, when block b from the first page has a high similarity with block a in the second page, we deduce that b and a are part of the template of both pages. Thus, the similarity between blocks implies the noisiness of the two blocks. As mentioned in Section 2, the measures that have been used in the literature do not benefit from all the different aspects of HTML pages, namely content, structure, and presentation. Thus, a distinguishing feature of the proposed approach is combining all three measures to decide on the noisiness of a block.

Definition 3.1 (Similarity Matrix). Let p1 and p2 be two pages from the same website, i.e., both pages share the same layout. By assuming that p1 consists of m blocks a1, a2, . . ., am and p2 consists of n blocks b1, b2, . . ., bn, a similarity matrix S of size m × n can be computed, where S[i, j] is the similarity value between the ith block in p1 and the jth block in p2.

According to Definition 3.1, the similarity between blocks is the factor that determines the noisiness of a block in the first place. Therefore, selecting the features that we consider in our similarity measure is crucial to get good results. Considering the nature of HTML justifies the use of content and structure similarities to compute the similarity matrix (noise matrix).
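As a sketch, Definition 3.1 translates directly into a nested loop. Here `sim` is a placeholder for whichever block-similarity measure is plugged in (content, structure, or a combination); the function itself is ours, not from the paper:

```python
from typing import Callable, List

def similarity_matrix(blocks1: List[str], blocks2: List[str],
                      sim: Callable[[str, str], float]) -> List[List[float]]:
    # S[i][j] holds the similarity between the ith block of p1
    # and the jth block of p2 (Definition 3.1).
    return [[sim(a, b) for b in blocks2] for a in blocks1]
```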

3.2.3.1. Content similarity. Content is an important feature that enables the discovery of templates. Bar-Yossef and Rajagopalan [3] use shingles, which are text fingerprints that are invariant under small perturbations. In our algorithm, the content of each block is extracted. Any text between an open HTML tag and its corresponding close tag will be part of that block's content. The next step is to calculate the similarity between the content of every block in p1 and each block in p2. This is the worst case in terms of the time complexity of the algorithm. In later sections, we will discuss in more detail how we refine this approach to reduce its complexity.

Cosine similarity is known to perform well as far as text analysis is concerned. Given two vectors, X and Y, the cosine similarity between them is defined as follows:

$$\mathrm{CosSim}(X,Y) = \frac{X \cdot Y}{|X|\,|Y|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\;\sqrt{\sum_{i=1}^{n} y_i^2}} \qquad (3.1)$$

where |X| and |Y| represent the magnitudes of X and Y, respectively.

To get more accurate results, it is best to remove stop words ("the", "a", "is", "will", etc.) from the input vectors before applying the similarity measure. Also, applying a stemming algorithm gives much better similarity results. Porter stemming is a well-known stemming algorithm [20] that has been adopted into the Noise Detector described in this paper. It simply removes the commoner morphological and inflexional endings from English words. Fig. 3.2 illustrates the algorithm for finding the content similarity.

Fig. 3.2. Content similarity algorithm.


Fig. 3.3. Structure similarity example.



The value returned in step 7 of the content similarity algorithm in Fig. 3.2 always belongs to the interval [0,1]. The higher the similarity value between two blocks is, the more likely the blocks are part of the template.
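A minimal sketch of this content-similarity step (Fig. 3.2 and Eq. (3.1)) is given below. The stop-word list is abbreviated for illustration, and the `stem` hook is a no-op placeholder for the Porter stemmer the paper uses; a real implementation would substitute an actual stemmer.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "will", "and", "of", "to", "in"}  # abbreviated

def stem(word: str) -> str:
    # Placeholder for the Porter stemmer adopted in the paper.
    return word

def term_vector(text: str) -> Counter:
    # Tokenize, drop stop words, and stem before building the term vector.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(stem(w) for w in words if w not in STOP_WORDS)

def content_similarity(text1: str, text2: str) -> float:
    # Cosine similarity between the two term vectors, Eq. (3.1); result in [0, 1].
    x, y = term_vector(text1), term_vector(text2)
    dot = sum(x[t] * y[t] for t in x.keys() & y.keys())
    norm = math.sqrt(sum(c * c for c in x.values())) * math.sqrt(sum(c * c for c in y.values()))
    return dot / norm if norm else 0.0
```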

3.2.3.2. Structure similarity. As we have seen in Fig. 3.1, content similarity can be a strong indication that two blocks are part of a template. However, it is not the only measure that we can apply on HTML documents. In many cases, the content of two blocks may not be similar even if their general layout is very close. Fig. 3.3 shows two blocks which have different content but similar layout. These blocks have been extracted from two different pages of the CNN website.

Apparently, the two blocks shown in Fig. 3.3a and b are part of the CNN template. Thus, we need another measure that exploits the layout of these blocks rather than their content, i.e., block structure. It is normal to find similar cases in other websites where blocks are the same in terms of layout but different in their content. Therefore, we need to compare the tree structures against each other to find the similarity between their layouts.

Consider Fig. 3.4a and b, which are the HTML DOM trees for the blocks illustrated in Fig. 3.3a and b, respectively. Clearly, the two trees are identical; this is a strong indication that the corresponding two blocks are part of the template. Therefore, we can determine identical blocks regardless of their content. The next sections introduce the structure measure we use and include case studies which show content and structure similarity matrices.

3.2.3.2.1. Simple tree matching algorithm. The structure similarity problem is essentially a tree matching problem. This is particularly true because we intend to find the maximum number of matching nodes between the two trees. The tree edit distance was proposed to match two trees, A and B, which are labeled ordered rooted trees [28]. Tree edit distance is basically the cost associated with the minimum set of operations needed to transform A into B. The classical formulation of the tree edit algorithm proposed three different mapping operations: node removal, node insertion, and node replacement. A cost is assigned to each operation, and the tree edit distance is the minimum mapping cost of the two trees. Mapping of trees can be formulated as follows [26]:

Let X be a tree, and assume X[i] is the ith node of tree X in a preorder traversal of the tree. A mapping M between a tree A of size n1 and a tree B of size n2 is a set of ordered pairs (i, j), one from each tree, satisfying the following conditions for all (i1, j1), (i2, j2) ∈ M:

(1) i1 = i2 iff j1 = j2;
(2) A[i1] is to the left of A[i2] iff B[j1] is to the left of B[j2];
(3) A[i1] is an ancestor of A[i2] iff B[j1] is an ancestor of B[j2].


Fig. 3.4. Similarity between HTML DOM trees.

Fig. 3.5. General tree mapping example.


Intuitively, the definition requires that each node appears no more than once in a mapping, and that the order among siblings and the hierarchical relation amongst nodes are preserved.

Fig. 3.5 shows a general tree mapping example in which we can cross levels. For instance, title in A can be matched to title in B. Replacements are allowed as well, e.g., body in A and strong in B. A more restricted version is used in our system, i.e., Simple Tree Matching (STM). The main goal of STM is to find the number of pairs in a maximum matching between the two trees without doing any replacements or level crossing [26]. The more restricted version, namely STM, is used because its time complexity is less than that of Tai [22]. Yet, it has been demonstrated that STM is capable of producing satisfactory results in information extraction techniques [15].

Formally, STM defines a matching between two trees to be a set of pairs of nodes, one node from each tree, such that:

(1) Two nodes in a pair contain identical symbols.
(2) A node in the first tree can match at most one node in the other tree.
(3) The parent–child relationship, as well as the order between sibling nodes, should be preserved; hence, no level crossing is allowed.

Although it is a restrictive formulation of the general matching algorithm, it was found to be quite effective in Web data extraction [16].

Lemma 3.1. Let A and B be two trees. A mapping between the two trees A and B must comply with the above three points, and is defined as the number of pairs in a maximum matching between the two trees, according to [26].


Fig. 3.6. Simple Tree Matching algorithm (STM) [26].


Assume A = RA:⟨A1, . . ., Ak⟩ and B = RB:⟨B1, . . ., Bn⟩ are the two trees, where RA is the root of tree A and RB is the root of tree B, and Ai and Bj are the ith and jth first-level subtrees of A and B, respectively. Then, W[i, j] denotes the number of pairs in a maximum matching of the subtrees Ai and Bj. So, if RA and RB have identical symbols, i.e., similar nodes, then the maximum matching between A and B, i.e., W[A, B], is M(⟨A1, . . ., Ak⟩, ⟨B1, . . ., Bn⟩) + 1, where M(⟨A1, . . ., Ak⟩, ⟨B1, . . ., Bn⟩) is the number of pairs in the maximum matching of ⟨A1, . . ., Ak⟩ and ⟨B1, . . ., Bn⟩. But, if RA and RB have distinct symbols, then W[A, B] = 0. The STM algorithm is shown in Fig. 3.6.

As Fig. 3.6 shows, Simple Tree Matching is a top-down algorithm which uses dynamic programming to produce the maximum matching between two trees; its time complexity is O(n1 n2), where n1 and n2 are the sizes of trees A and B, respectively.

Fig. 3.7 is used as an example to illustrate the STM algorithm. The root nodes of A and B, namely N1 and N15, respectively, are compared first (line 1). Since they do contain identical symbols, the STM algorithm recursively finds the number of pairs in a maximum matching between the first-level subtrees of A and B, i.e., the W matrix. However, in case the root nodes did not contain identical symbols, the STM algorithm would return 0 as a result; 7 (6 + 1) is the returned result for the example in Fig. 3.7, where 6 is M[m, n].

In our approach, we need a normalized similarity value in the interval [0, 1] to represent the similarity between two trees. Therefore, we adapt STM to return a normalized value, which is the number of maximum matching pairs averaged with respect to the size of the smaller tree. So, in line 9 we return the following value instead:

$$\mathrm{Normalized\ Matching\ Score} = \frac{M[m,n] + 1}{\min(m,n)} \qquad (3.2)$$

where m and n are the numbers of nodes in the first level of the two trees A and B, and min(m, n) is a function that returns the minimum of m and n. In Fig. 3.7, A and B have a structure similarity (Normalized Matching Score) of 0.875 (7/8). So, we use STM without major modifications, except for line 9, where we return a normalized value instead of the number of pairs in a maximum matching.
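A compact sketch of the STM recursion of Fig. 3.6, together with the normalization of Eq. (3.2), is shown below. Trees are represented as (label, children) pairs. Note one interpretive choice: following the textual description, the normalization here divides by the size of the smaller tree, which reproduces the 7/8 = 0.875 of the Fig. 3.7 example.

```python
from typing import List, Tuple

# A tree is a (label, children) pair, e.g. ("table", [("tr", []), ("tr", [])]).
Tree = Tuple[str, List["Tree"]]

def stm(a: Tree, b: Tree) -> int:
    # Simple Tree Matching: number of pairs in a maximum matching of a and b.
    if a[0] != b[0]:                      # roots carry distinct symbols
        return 0
    ka, kb = a[1], b[1]                   # first-level subtrees
    m, n = len(ka), len(kb)
    # Dynamic program over the two sequences of subtrees (the W/M matrices).
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i][j - 1], M[i - 1][j],
                          M[i - 1][j - 1] + stm(ka[i - 1], kb[j - 1]))
    return M[m][n] + 1                    # +1 for the matched roots

def tree_size(t: Tree) -> int:
    return 1 + sum(tree_size(c) for c in t[1])

def normalized_stm(a: Tree, b: Tree) -> float:
    # Eq. (3.2): normalize the matching score by the size of the smaller tree.
    return stm(a, b) / min(tree_size(a), tree_size(b))

a = ("table", [("tr", []), ("tr", [])])
b = ("table", [("tr", []), ("td", [])])
print(normalized_stm(a, b))  # 2 matched pairs / min(3, 3) nodes ~ 0.667
```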

Fig. 3.7. STM example.


3.2.3.3. Case study. Before providing the illustrative example, we want to clarify how VIPS numbers the output blocks. Block indices are not assigned randomly to their respective blocks. VIPS segments a page into visually and semantically coherent blocks [5]. Consequently, VIPS assigns index number 1 (the first index) to the block closest to the top left corner and the last index to the block closest to the bottom right corner of a page.

In this section, we will study the effectiveness of using content and structure as the noise measures. Given two pages p1 and p2, each consisting of 8 blocks, let Table 3.1 be the content similarity matrix S (see Definition 3.1) for the blocks. For simplicity, we picked an example where m = n. In later sections we will show an example where m ≠ n.

The values in the first row represent the similarity between the first block of p1 (a1) and all the blocks in p2 (b1, . . ., b8). In general, S[i, j] is the similarity value between ai and bj. According to Definition 1.2, we need to find blocks with high similarity values, which will indicate that these blocks are part of the site's template. The cells with the highest similarity for each ai are shaded. The following is the set of matching pairs:

{a1, b1}, {a2, b2}, {a3, b8}, {a4, b7}, {a5, b5}, {a6, b6}, {a7, b7}, {a8, b8}

There are two observations regarding this set:

1. The matching block of ai is bi, ∀i ∈ [1, m], i ∉ {3, 4}.
2. The values of S[1,1], S[6,6], S[7,7], and S[8,8] are noticeably high. On the other hand, S[2,2], S[3,8], S[4,7], and S[5,5] have relatively smaller values.

The first point assures that VIPS segments a Web page spatially consistently. In fact, VIPS utilizes the spatial information of a block, and so it segments multiple pages in the same fashion in terms of the blocks' locations. Taking this fact into consideration, a block number, e.g., 1, 2, etc., provides not only a distinguishing number, but also spatial implications. Thus, block a1 is the closest block to the upper-left corner of p1, as is b1 with respect to p2. This causes a1 to be the most similar to b1, a2 to b2, and so on, which will result in high values along the diagonal of the similarity matrix compared to the other values.

The second observation strongly indicates that the four pairs {a1, b1}, {a6, b6}, {a7, b7}, and {a8, b8} are part of the site's template. For the other blocks, namely 2, 3, 4, and 5, we need to further check their structure to decide whether they are part of the template or not. One last observation is that a4's highest similarity value, 0.097704, is very small compared to the others.

Structure similarity is another important feature that we can use with HTML documents. Table 3.2 is the structure similarity matrix between the blocks of the two pages p1 and p2; it is computed using the normalized Simple Tree Matching algorithm. The shaded cells in Table 3.2 show the highest similarity value for each corresponding block. Explicitly, the following is the set of matching pairs according to Table 3.2:

{a1, b1}, {a2, b2}, {a3, b3}, {a4, b4}, {a5, b5}, {a6, b5}, {a7, b7}, {a8, b8}

Here, three observations can be made:

1. Block ai matches block bi, ∀i ∈ [1, 8], i ∉ {6}.
2. The pairs {a1, b1}, {a2, b2}, {a3, b3}, {a7, b7}, and {a8, b8} have noticeably high similarity values.
3. Structure similarity values tend to be high.

The first observation emphasizes the fact that when m = n, the values along the diagonal of the similarity matrix tend to be the correct values for each block in both p1 and p2. This is true in Table 3.2 for all blocks of p1 except a6. The second observation shows that {a1, b1, a2, b2, a3, b3, a7, b7, a8, b8} are highly likely to be part of the template because of the high consistency of their structure within the two pages, p1 and p2. The third observation shows that the average of the structure similarity values is higher than the average of the content similarity values: 0.75 for structure versus 0.61 for content. We may then conclude that the analyzed site, CNN in this case, has a highly consistent layout throughout its pages.

Table 3.1. Content similarity matrix.


Table 3.2. Structure similarity matrix.


3.2.4. Block matching

The example in Fig. 3.8 considers that the two input pages have the same number of blocks. Also, we noted that we do not need to compare block ai with all blocks of p2. Rather, we need to compare ai only with bi, where i ∈ [1, m]. Thus, the noisiness of ai is the similarity between ai and bi, which means that ai matches bi. This assumption reduces the complexity of our approach since there is no need to do m² comparisons; rather, we need to do only m comparisons.

Unfortunately, this is not always true in practice; e.g., we could have different numbers of blocks for pages even if they belong to the same site. In other words, pages that belong to the same site share the general layout (template), but some minor visual perturbations may occur. For example, these visual perturbations may cause VIPS to segment a page into 10 blocks, while another page from the same site may be segmented into 8 blocks. This means that the page with the bigger number of blocks will have a surplus of 2 blocks that will not have matching blocks from the other page.

Definition 3.2 (Neighboring Blocks). Let p1 and p2 be two pages from the same website, i.e., p1 consists of the blocks a1, a2, . . ., am and p2 consists of the blocks b1, b2, . . ., bn, with n ≥ m. Then the neighboring blocks of ai are bj, ∀j ∈ [x, y], where j ∈ Z, x = |i − (n − m)| and y = i + (n − m). If x = 0 then set x = 1.

Notice that, according to Definition 3.2, j = i when m = n is satisfied. In general, one of the following two scenarios may arise after segmenting two pages p1 and p2:

1. m = n: In this case, each block in p1 will have a matching block from p2. As illustrated in Fig. 3.8, the relationship between the blocks is a one-to-one mapping when m = n. Also, the ith block in p1 matches the ith block in p2 since VIPS segments a page into blocks by considering their spatial information. Section 4 gives more details about the number of blocks for pages of the different websites that we have tested.

2. m ≠ n: In this case there are blocks in one page that do not have matching blocks from the other page. Clearly, the number of blocks that do not have matching blocks is |m − n|. Fig. 3.9 shows an example of two pages with 5 and 7 blocks, respectively.

Fig. 3.8. Block matching when m = n.


Fig. 3.9. Block matching when m – n.

Fig. 3.10. MatchBlocks algorithm.


Since one block in p1 can match at most one block in p2, there will be two blocks in p2 without matching blocks from p1. In Fig. 3.9, b2 and b5 in p2 do not have matching blocks from p1. Therefore, we will need a measure other than content and structure that can help in sorting out the matching. Accordingly, the third measure used by the Noise Detector is presentation. Before we introduce the presentational noise measure, we present the MatchBlocks algorithm in Fig. 3.10.

For each block in p1, MatchBlocks looks for the most similar block from the neighboring blocks in p2. Note that the input to the MatchBlocks algorithm is a similarity matrix S that combines content and structure similarity. The combined matrix is computed by considering the following:

\[ \mathit{CombSim}[i,j] = CW \cdot \mathit{ContSim}[i,j] + SW \cdot \mathit{StructSim}[i,j] \tag{3.3} \]

where CW is the content weight and SW is the structure weight. Here CW = 0.5 and SW = 0.5 (equal weight is assigned to each of the two similarities; other scenarios are also possible). Later in the algorithm, when deciding on the valid blocks, different weights are dynamically calculated and used.

So, MatchBlocks does not guarantee that the pairs of blocks truly match; it just pairs the most similar blocks with each other. Section 4 shows the accuracy results of this approach.
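To make the matching concrete, the following minimal sketch (in Python rather than the authors' C# implementation; the helper names neighbors, comb_sim, and match_blocks are ours) shows how Definition 3.2 and Eq. (3.3) could drive the pairing; the authoritative MatchBlocks procedure remains the one given in Fig. 3.10.

```python
def neighbors(i, m, n):
    """Neighboring blocks of a_i in p2 per Definition 3.2
    (1-based indices; p1 has m blocks, p2 has n blocks, m <= n)."""
    x = abs(i - (n - m))
    y = i + (n - m)
    if x == 0:
        x = 1
    return range(x, y + 1)

def comb_sim(cont_sim, struct_sim, i, j, cw=0.5, sw=0.5):
    """Combined similarity of Eq. (3.3), with equal default weights."""
    return cw * cont_sim[i][j] + sw * struct_sim[i][j]

def match_blocks(cont_sim, struct_sim, m, n):
    """Pair every block of p1 with its most similar neighboring block in p2.
    The similarity matrices are assumed precomputed and indexed [i][j]
    with 1-based block indices."""
    matches = {}
    for i in range(1, m + 1):
        matches[i] = max(neighbors(i, m, n),
                         key=lambda j: comb_sim(cont_sim, struct_sim, i, j))
    return matches
```

With m = n the candidate set collapses to {i}, reproducing the one-to-one mapping of Fig. 3.8; with m ≠ n each block is compared against at most 2(n − m) + 1 neighbors instead of all n blocks.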

Fig. 3.11 reflects the details of our algorithm after the two pages are segmented using VIPS. As Fig. 3.11 shows, if p1 and p2 have the same number of blocks, then content and structure similarities are computed. Most importantly, there is no need to compare all blocks from the two pages against each other; rather, we compare the ith block in p1 against the ith block in p2. The mapping depicted in Fig. 3.8 is followed in this case.

If p1 and p2 do not have the same number of blocks, say p1 has fewer blocks, we find the content and structure similarities between each block in p1 and its corresponding set of neighboring blocks in p2. An alternative naïve, yet time-consuming, approach is to compute the whole similarity matrix S as described in Definition 3.1. Calculating the similarity between a block and only its neighboring blocks saves computation time. For example, block b1 in p1 (p1(b1)) has three neighboring blocks in p2, namely b1, b2, and b3. After calculating the similarity between p1(b1) and each of p2(b1), p2(b2), and p2(b3), the MatchBlocks algorithm is applied to find the actual matching block.


Fig. 3.11. Noise Detector overall algorithm.


It is possible to have a block equivalent to a combination of two or more blocks, i.e., in some cases two blocks in one page are equivalent to one block in another page. For example, b5 of p1 in Fig. 3.9 may match the combination of b6 and b7 in p2. This all depends on the similarity values of the neighboring blocks. In fact, Figs. 4.1 and 4.2 illustrate this case.

For blocks that have corresponding matching blocks, content and structure similarities are sufficient to find out whether they are part of the template or not. But blocks from p1 that do not have matching blocks in p2 cannot use content and structure similarities because there are no blocks to compare them to; the presentational noise measure is used instead. To illustrate the process further, we will discuss an example in which there are two pages p1 and p2 from the CNN website, with 7 blocks for p1 and 9 blocks for p2. Tables 3.3, 3.4, and 3.5 report the similarity matrices for these two pages.

Table 3.3
Content similarity matrix (ContSim).

Table 3.4
Structure similarity matrix (StructSim).

The shaded cells contain the maximum similarity value in their respective rows. Using Table 3.3, the MatchBlocks algorithm will match the following pairs:


{a1, b1}, {a2, b2}, {a3, b5}, {a4, b5}, {a5, b6}, {a6, b7}, {a7, b9}

b3, b4, and b8 do not have matching blocks, and hence presentational noise is used to judge their noisiness. According to Table 3.3, a3 matches b5. However, after manually investigating a3 and b5, this match is considered false because they are not equivalent. In fact, a3 should match b3.

We can also notice that the values in the two columns b3 and b4 are small compared to the other columns. This indicates that b3 and b4 are most likely informative blocks and not part of the template.

If we use structure similarity alone, the following set includes the resulting matching pairs: {a1, b1}, {a2, b2}, {a3, b3}, {a4, b5}, {a5, b5}, {a6, b7}, {a7, b9}. Notice that b4, b6, and b8 do not match with any blocks in p1.

By looking at the two result sets, we notice that b4 and b8 do not occur in either solution. Table 3.5 shows the combined similarity matrix using both content and structure. This combined similarity matrix is the one used by the MatchBlocks algorithm. The next figures show that combining content and structure gives better results.

The values in Table 3.5 are calculated using Eq. (3.3). The following set of matching pairs can be concluded from Table 3.5: {a1, b1}, {a2, b2}, {a3, b3}, {a4, b5}, {a5, b6}, {a6, b7}, {a7, b9}. This set of pairs is different from the two previous sets generated when using content or structure separately.

Fig. 3.12 shows three of the matching pairs of this example, as identified by the Noise Detector. In Fig. 3.12, a1 matches b1, a2 matches b2, and a3 matches b3. These blocks are highly similar in their structure and even in their content. Fig. 3.13 shows the two unmatched blocks in p2. We have not depicted all blocks in Fig. 3.12, though the blocks in Fig. 3.13 definitely do not match any blocks in p1.

3.2.4.1. Presentational noise measure. As some blocks do not have matching blocks from the other page, content and structure cannot be used to judge their noisiness value. Therefore, we need to use a different measure to identify the noisiness of a block. For this purpose, presentational features are used with the unmatched blocks.

By analyzing the Web, we can conclude that some presentational features add noisiness to blocks. Furthermore, the more presentational features are present in a block, the more likely the block is to be noisy, and therefore the block is identified as not informative. Fig. 3.14 supports our claim that informative blocks usually have less relative presentational content than noisy blocks.

Notice that using presentational features enables one to discover likely noisy blocks that are not necessarily part of the template. In fact, these blocks are not likely to be part of the template because in general they do not have matching blocks.

Table 3.5
Combined similarity matrix (CombSim).


Fig. 3.12. Visual layout of the matching block pairs.


Fig. 3.13. Unmatched blocks from p2.

Fig. 3.14. Examples of presentational noisy blocks from BBC (http://www.bbc.co.uk/).

Fig. 3.15. PresentationalNoise algorithm.


Fig. 3.16. UpdateWeights algorithm.


The presentational features include links, forms, input tags and some other tags, e.g., <A>, <INPUT>, <FORM>, <LINK>, etc. For each feature in the feature set, we calculate its relative content compared to the whole content of the unmatched block. So, if a block contains only links, then the block is most likely a noisy block even if it is not part of the template. For instance, the two dash-lined-framed blocks in Fig. 3.14 are likely to be noisy blocks because they contain high relative presentational content. On the other hand, the solid-framed block will likely be identified as informative since it does not contain high relative presentational content. After calculating the relative weights of all presentational features, these values are summed to calculate the total presentational noise value. Fig. 3.15 presents the PresentationalNoise algorithm.
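As an illustration only (the paper's PresentationalNoise algorithm is the one given in Fig. 3.15 and is not reproduced here), a rough Python proxy might measure the share of a block's text that sits inside presentational tags; the tag lists and the fixed per-element weight for void tags below are our assumptions.

```python
import re

CONTAINER_TAGS = ("a", "form")        # enclosed text counts as presentational
VOID_TAGS = ("input", "link", "img")  # void elements, counted by occurrence

def presentational_noise(block_html):
    """Rough proxy for a block's relative presentational content, in [0, 1];
    a block consisting almost entirely of links scores close to 1."""
    text = re.sub(r"<[^>]+>", "", block_html)
    total = len(text) or 1
    score = 0.0
    for tag in CONTAINER_TAGS:
        inside = "".join(re.findall(rf"<{tag}\b[^>]*>(.*?)</{tag}>",
                                    block_html, re.I | re.S))
        score += len(re.sub(r"<[^>]+>", "", inside)) / total
    for tag in VOID_TAGS:
        # assumed fixed contribution per void presentational element
        score += 0.05 * len(re.findall(rf"<{tag}\b", block_html, re.I))
    return min(score, 1.0)
```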

3.2.5. Final noise measure

As we have seen so far, the combined similarity matrix incorporates both content and structure. A combined similarity value in this matrix is calculated using Eq. (3.3), which assigns an equal weight of 0.5 to the content and structure similarities. These values will be used to identify which blocks belong to the site template and which do not. In the final stage, we use different weights which are biased towards the measure with the higher value, as incorporated in the UpdateWeights algorithm shown in Fig. 3.16. The values 0.3 and 0.7 in Fig. 3.16 were set after conducting some initial experiments; they gave good results for the final noise measure and worked well because similarity values are normally high when two blocks belong to a template. We elaborate on our future directions for calculating these weights more dynamically in Section 4.7.
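Since Fig. 3.16 itself is not reproduced in the text, the sketch below is only one plausible reading of UpdateWeights: the measure with the higher similarity value receives the 0.7 weight and the other receives 0.3.

```python
def final_noise(cont, struct, low=0.3, high=0.7):
    """Final noise value of a matched block pair, biasing the weights
    towards whichever similarity measure scored higher (cf. Fig. 3.16)."""
    cw, sw = (high, low) if cont >= struct else (low, high)
    return cw * cont + sw * struct
```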

The threshold that we used in our experiments is 0.9. Blocks that do not have matching blocks will use the presentational noise measure to identify their noisiness. This measure decides whether a block is informative or noisy. The values are calculated using the PresentationalNoise algorithm, which returns values in the range [0, 1].

3.2.6. Noise threshold value

After calculating the final noise value of a block in both pages, we need to determine which blocks are valid (informative) and which blocks are not. A dynamically calculated threshold is used to determine the cut-off value, which splits the range [0, 1] into two regions: valid and invalid.

\[ \text{cutoff} = \begin{cases} V_{\min} + \dfrac{\sigma}{\mu}, & V_{\min} + \dfrac{\sigma}{\mu} < 1 \\ 1, & V_{\min} + \dfrac{\sigma}{\mu} \ge 1 \end{cases} \tag{3.4} \]

where \(V_{\min}\), \(\sigma\), and \(\mu\) are the minimum noise value, the standard deviation, and the average of the noise values for a set of blocks from the analyzed page. This cut-off value proved to work well in our experiments as it exploits different characteristics of the set of noise values. The precision and recall results that we obtained are high for most of the evaluated websites. The cut-off value is forced to 1 if \(V_{\min} + \sigma/\mu\) is greater than 1, in which case all blocks are valid because noise values are always less than or equal to 1.
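As a concrete illustration, a minimal Python sketch of Eq. (3.4) follows; whether the population or the sample standard deviation is intended is not stated in the text, so pstdev and the guard for an all-zero noise set are our assumptions.

```python
import statistics

def cutoff(noise_values):
    """Dynamic threshold of Eq. (3.4): V_min + sigma/mu, capped at 1.
    noise_values are the final noise values (in [0, 1]) of the blocks
    of the analyzed page."""
    v_min = min(noise_values)
    sigma = statistics.pstdev(noise_values)  # population std dev (assumed)
    mu = statistics.mean(noise_values)
    if mu == 0:                              # degenerate all-zero case
        return 1.0
    return min(v_min + sigma / mu, 1.0)
```

Blocks whose final noise value does not exceed the returned cut-off are kept as valid; the rest are marked invalid (see Section 3.2.7).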

3.2.7. Further valid blocks refinement

We consider a refinement procedure in our approach through which we include blocks that have been marked as invalid (i.e., have a noise value greater than the cut-off) if they satisfy a certain criterion.

The best descriptive and representative sentence for a Web page is its title. Therefore, if a block has a big chunk of this title, then the block must have information related to the page topic. In other words, if the number of words in the title of a Web page is greater than a certain pre-specified value, say 10 words, and there is a block that has more than 90% of the title in its content, then the latter block is marked as valid even if its noise value is greater than the cut-off value.
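A minimal sketch of this check follows; word-level overlap is our simplification of "has more than 90% of the title in its content", and the parameter names are hypothetical.

```python
def title_refinement(title, block_text, min_title_words=10, overlap=0.9):
    """Re-mark an invalid block as valid if it carries most of a
    sufficiently long, topic-specific page title (Section 3.2.7)."""
    words = [w.lower() for w in title.split()]
    if len(words) <= min_title_words:
        return False  # title too short to be considered topic-specific
    block = block_text.lower()
    hits = sum(1 for w in words if w in block)
    return hits / len(words) > overlap
```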


For this refinement procedure to work well, it is important to choose a cut-off for the number of words that the Web page title must satisfy. This is because some Web pages might not have a page-topic-specific title. The 95% threshold is high, yet effective. For instance, if the title of a page is "BBC NEWS | Americas | Obama seeks to woo Canadians", and a block has both "BBC NEWS" and "Americas", then this block does not necessarily have topic-related information. On the contrary, this block is part of the BBC (http://www.bbc.co.uk/) template, and including it as a valid informative block would be a big mistake. This is why the refinement procedure should be strict.

3.3. Closing remarks

In this section, we have introduced the different components of the Noise Detector. The presented case studies demonstrated the different novel aspects of the Noise Detector. We have illustrated via several examples how the Noise Detector computes and uses the content and structure measures. Also, we have shown two cases: when the number of blocks is the same for the two pages, and when the number of blocks is different. The presentational noise measure has been explained along with the different corresponding cases. The methodology introduced in this section is evaluated in the next section.

4. Experimental analysis

In this section, we evaluate the proposed Noise Detector against the other approaches described in the literature. The conducted set of experiments evaluates the effectiveness of the Noise Detector in detecting and removing noisy blocks. We have investigated the validity and accuracy of the proposed approach by applying the Noise Detector to different websites. These websites have been used by different researchers to evaluate the approaches described in Section 2. Before reporting and discussing the results, we briefly introduce the data sets that have been used in the experiments and the testing environment.

4.1. Testing environment

The Noise Detector has been implemented using the C# programming language. A graphical user interface has been designed to make it more interactive and user friendly. We have used an API of VIPS to segment pages as part of the Noise Detector's detection process. All the experiments have been conducted on a laptop with the following specifications: x86-based PC with a ~1862 MHz CPU and 1 GB main memory, running Microsoft Windows XP Professional.

4.2. Data sets

In the evaluation process, we have used different websites that were used in the literature [20]; these websites were used to evaluate some of the existing approaches already discussed in Section 2. The following are the websites that have been used in the evaluation process:

1. BBC – http://www.bbc.co.uk/: news website
2. CNN – http://www.cnn.com/: news website
3. eJazzNews – http://www.ejazznews.com/: Jazz news resource
4. PCMag – http://www.pcmag.com/: PC Magazine for software, computers, hardware, news, reviews, and opinions
5. CNET – http://www.cnet.com/: website that provides product reviews, prices, software downloads, and technology news
6. J&R (JANDR) – http://www.jr.com/: retailer of electronics
7. MemoryExpress – http://www.memoryexpress.com/: retailer of computers, laptops, monitors, and other electronics
8. Wikipedia – http://www.wikipedia.org/: online free encyclopedia
9. Mythica Encyclopedia – http://www.pantheon.org/: encyclopedia of mythology, folklore, and religion

4.3. Validating Noise Detector

The main challenge for the Noise Detector is the consistency and accuracy of its results. In other words, as described in Section 3, the Noise Detector uses only two pages to detect templates. Therefore, its consistency should be questioned first because it does not use a large number of pages. Once its consistency is confirmed to be high, we can proceed in the evaluation process and test its accuracy in detecting templates of the investigated websites.

The remainder of this section investigates both the consistency and accuracy of the Noise Detector. We explain the measures and techniques used in the evaluation.

4.3.1. Consistency evaluation

One may ask: how consistent is the Noise Detector, given that it uses only two pages to differentiate noisy from non-noisy blocks? More clearly, by comparing two pages, say p1 and p2, the Noise Detector will produce the noise matrix S1. Values in S1 solely depend on the structure and content of both pages, p1 and p2; as a result, a cut-off value c1 will be computed. However, comparing p1 with another page, say p3, will result in the noise matrix S2; values in S2 depend solely on the content and structure of only p1 and p3; its cut-off value c2 will most likely not be equal to c1 and will determine the valid blocks of p1 as well. As a result, two possible sets of valid blocks of p1 are returned. If the valid blocks of p1 in the two cases are different, the Noise Detector is said to be inconsistent, and therefore cannot be trusted. However, if the valid blocks of p1 in both cases are the same, the Noise Detector is said to be consistent, and therefore can be trusted as a noise detection system.

So, our first step in the system evaluation process is to check its consistency. Detailed examples are shown in the next subsections for some of the websites listed in Section 4.2. These examples give the reader a complete view of the different scenarios and cases.
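The cross-testing procedure itself is simple to express; in the sketch below, the hypothetical callable detect_valid_blocks stands for one pairwise run of the Noise Detector, and the function computes the occurrence percentages reported in the tables that follow.

```python
from collections import Counter

def consistency(page, test_set, detect_valid_blocks):
    """Cross-test `page` against every page in `test_set` and report how
    often each of its blocks is output as valid, in percent.
    detect_valid_blocks(p, q) is assumed to return the set of valid
    block indices of p when p is compared against q."""
    counts = Counter()
    for other in test_set:
        counts.update(detect_valid_blocks(page, other))
    return {b: 100.0 * c / len(test_set) for b, c in counts.items()}
```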

4.3.1.1. BBC. We downloaded 100 pages from the BBC website. These pages were extracted from different sections: business, entertainment, health, and technology. Figs. 4.1 and 4.2 show two different pages from the business and technology sections of the BBC website. The page layout in both pages is highly similar, and therefore VIPS should be highly consistent in the segmentation process. Nonetheless, VIPS did not segment all pages into the same number of blocks, as shown in Figs. 4.1 and 4.2; these two pages were segmented into five and six blocks, respectively. b1 and b2 in Fig. 4.2 have been extracted as one block (b1) in Fig. 4.1; consequently, the page in Fig. 4.1 has one block fewer than the one in Fig. 4.2. We report in Table 4.1 the statistics of segmenting 100 pages from the BBC website.

In the segmentation process, we adjusted the permitted degree of coherence (PDOC) of VIPS to 6. Most of the BBC pages are segmented into 6 or 7 valid blocks as per Table 4.1. Here, it is worth mentioning that the number of blocks reported in Table 4.1 refers to the number of blocks that pass through the valid-pass filter (initial valid blocks).

Fig. 4.1. An example page from BBC business section.


Fig. 4.2. An example page from BBC technology section.

Table 4.1
Number of blocks per page for the BBC website.

Number of blocks    5        6     7        8        9
Percentage (%)      13.33    40    33.33    11.11    2.22

Table 4.2
Occurrence percentages of valid blocks of different pages from the BBC website.

Page    Occurrence percentages of its valid blocks    Average (%)
p1      98, 100, 74                                    91
p2      100, 100, 100                                  100
p3      100, 64                                        82
Overall average                                        91


We randomly picked three pages to be cross-tested against a set of other pages from the BBC website. We tested each of the three pages against every page in the test set, and recorded its valid block numbers. Table 4.2 depicts the results of cross-checking these three pages.

The percentages in Table 4.2 indicate the frequency of each block in p1, p2, and p3 being outputted as a valid block; p2 consists of 7 blocks, while each of p1 and p3 contains 6 blocks.

As the percentages are close to 91%, our approach is highly consistent in its output. However, this does not mean that the output is necessarily correct. Section 4.3.2 will evaluate the accuracy of detecting the informative blocks.

4.3.1.2. CNN. A data set of 245 pages was downloaded from the CNN website. These pages belong to five different categories: crime, entertainment, technology, travel, and world. The CNN website is highly consistent in its template design throughout its pages.

Again, we have conducted the same consistency experiment on pages downloaded from the CNN website. Four randomly selected pages were cross-tested against a test set of 50 pages. The valid output blocks were recorded in each test case, and the frequency of each block occurring in the output valid set is illustrated in Table 4.3.

Pages p1, p2, and p3 were all segmented into 8 blocks, while p4 was segmented into 9 blocks. The high percentages recorded in Table 4.3 indicate that the Noise Detector is consistent when operating on the CNN website. Blocks {B2, B3, B4} are consistently reported by the Noise Detector as the valid blocks in p1, p2, and p3. However, p4 does not follow the same trend; its valid blocks are B2 and B3. B6 is marked as a valid block in p4 only 18.6% of the time. In fact, outputting B6 as a valid block is a false positive (i.e., it is actually not a valid block). We calculated the average using only the percentages of the frequent blocks B2 and B3; if we include B6, the average is 68.3%, which is considered low. However, B6 is a false positive, which affects only precision and not recall. More details follow in the next sections.

The number of blocks a page is segmented into affects (to some extent) the consistency of the Noise Detector. This was clear with p4, since it consists of 9 blocks, while all the other pages consist of 8 blocks each. Table 4.4 shows the statistics about the number of blocks in the data set of the CNN pages that have been used in the experiments.

4.3.1.3. J&R. J&R is an online retailer for electronics. 155 pages were downloaded for testing. These pages were extracted from the different categories of the J&R website, such as MP3 players, GPS, LCD TVs, cameras, etc. We tested the consistency of the Noise Detector on five different pages, and the results are shown in Table 4.5.

Online retailers usually have more generic templates than news websites such as CNN and BBC. This is why there are more low values in Table 4.5 than in Tables 4.2 and 4.3. The first four pages in Table 4.5 consist of 8 blocks each, while p5 consists of 7 blocks. Clearly, p5 follows a different trend than the other pages by having 100% consistency with only one block being outputted. Table 4.6 shows the distribution of the number of blocks in the J&R set of pages. Around 65% of the pages are segmented into 8 blocks.

4.3.1.4. The other websites. This section investigates the consistency of the rest of the websites without presenting as much detail as the previous sections. The consistency results of the other six websites are shown in Table 4.7.

Some of the websites listed in Table 4.7 have low consistency values. This is due to the generic layout that these websites follow. For example, CNET pages have areas where customers can post their reviews and opinions. Consequently, the segmentation process produced a different number of blocks per page for the CNET website. This is evident in Table 4.8, which shows that CNET pages fall into 6 different block-count classes: these pages were segmented into 6, 7, 8, 9, 10, or 11 blocks. The PDOC value of VIPS, which affects how coherent the output blocks are, was adjusted to 6. Despite this variety in the number of blocks in the CNET pages, consistency is still high; the blocks that have been outputted as valid for p1 are B2, B4, B5, and B6. Block B2 seems to be a valid block since it always gets outputted (100%). The other blocks, namely B4 and B6, are outputted as valid blocks 43% of the time, and B5 is outputted 65% of the time. These low rates compared to the CNN and BBC websites are due to the more generic template that CNET has.

EJAZZ has notably high consistency values. In fact, one block is always outputted (100%) as the only valid block. This high consistency is due to the high stability of its page layout. Also, it is worth noting that EJAZZ is consistent in the number of blocks its pages are segmented into (see Table 4.8).

Table 4.3
Occurrence percentages of valid blocks of different pages from the CNN website.

Page    B2      B3      B4      B6      Average (%)
p1      91.3    95.7    91.3    –       92.8
p2      95.7    97.9    97.9    –       97.2
p3      84.8    97.8    95.7    –       92.8
p4      86.4    100     –       18.6    93.2
Overall average                          93.96


Table 4.4
Number of blocks per page for the CNN website.

Number of blocks    7       8       9       10
Percentage (%)      10.0    40.0    46.7    3.3

Table 4.5
Occurrence percentages of valid blocks of different pages from the J&R website.

Page    Occurrence percentages of its valid blocks    Average (%)
p1      78.9, 84.2, 84.2                               81.6
p2      100, 92.2                                      96.4
p3      100                                            100
p4      82.4, 100                                      91.2
p5      100                                            100
Overall average                                        93.8

Table 4.6
Number of blocks per page for the J&R website.

Number of blocks    7       8       9
Percentage (%)      21.4    64.3    14.3

Table 4.7
Average consistency results (occurrence percentage of valid blocks).

CNET            p1: B2 = 100, B4 = 42.9, B5 = 64.3, B6 = 42.9
EJAZZ           p1: B2 = 100
MemoryExpress   p1: B2 = 76.0, B4 = 100, B5 = 100
Mythica         p1: B2 = 100, B4 = 100, B5 = 58.3
                p2: B2 = 100, B4 = 50.0
PCMag           p1: B2 = 41.7, B4 = 100
Wikipedia       p1: 100 (single valid block)

Table 4.8
Number of blocks in different websites (% of pages).

CNET            6 blocks: 6.7, 7 blocks: 46.7, 8 blocks: 6.7, 9 blocks: 6.7, 10 blocks: 13.3, 11 blocks: 20.0
EJAZZ           9.1 / 54.5 / 36.4 over its three block-count classes
MemoryExpress   8 blocks: 100
Mythica         4 blocks: 38.5, 5 blocks: 61.5
Wikipedia       3 blocks: 100


MemoryExpress achieved high consistency values although its page layout is somewhat generic and rich. This is due to the consistency of the HTML code that is used to design its pages. Also, it does not include a comments section that could ruin the layout consistency. Interestingly, all of the downloaded pages from MemoryExpress have been segmented into 8 blocks, which shows how well-templated MemoryExpress is. Consequently, its consistency values are high and close to perfect: blocks B4 and B5 have been outputted 100% of the time, and B2 has been outputted 76% of the time.

Mythica has a fairly consistent behavior. As Table 4.8 shows, its pages are segmented into 4 or 5 blocks. Having the Mythica pages segmented into only two different block numbers increases its consistency; p1 of Mythica had B2 and B4 outputted as valid blocks in all the test cases, while B5 was outputted 58% of the time as a valid block. p2 has 4 blocks, and so it follows a slightly different trend than p1: again, B2 has been consistently outputted as a valid block, while B4 has been outputted as a valid block only 50% of the time.

PC Magazine is another rich-content website. Its pages have dense information, which causes its segmentation to generate a variety of block numbers. PCMag's pages were segmented into seven different numbers of blocks, namely 9–15 blocks. Despite this fact, the Noise Detector is still able to achieve high consistency results by always outputting B4 as a valid block. But having B2 outputted only 42% of the time as a valid block degrades the consistency results.

Wikipedia is another website that has a highly consistent layout. All of its pages were segmented into 3 blocks. Because of the very high consistency in the segmentation process, the consistency of the results is also perfect, at 100%. Although some pages are huge compared to others, VIPS is still able to segment them based on their visual cues, and consequently all of these pages are segmented into three blocks.

The values reported in Tables 4.7, 4.5, 4.3, and 4.2 show that the Noise Detector is consistent in its output over a set of different websites. Finally, it is worth reemphasizing that these same websites (from online retailers, magazines, news, and encyclopedias) have been used in testing the other approaches described in the literature [20].

The consistency of detecting the template of a website depends heavily on how generic its layout is and how much it differs from one page to another. This is particularly true because VIPS will segment two pages (according to their layout) into different numbers of blocks even if they belong to the same website. As a result, the different numbers of blocks will affect the template detection negatively. This is clear as far as the PCMag website is concerned. However, Wikipedia, MemoryExpress, CNN, BBC, and EJAZZ have fairly consistent layouts throughout their pages; this results in high consistency values.

In our testing, we do not compute an average of the consistency values of the different blocks in a page. The following example on the EJAZZ website explains why it is not recommended to use the average consistency as a measure of the goodness of a website.

Let D = {d1, d2, ..., d11} be a set of 11 pages downloaded from the EJAZZ website, and let d1 be checked against D − {d1}. Table 4.9 reports the results of this cross-testing.

We have considered 11 pages in this example for simplicity. Clearly, B2 is a valid block since it appears in all the results. However, B1 appears just once along with B2, which gives B1 a small probability of being a valid block. The average overall consistency is 55%, and this would indicate that the EJAZZ website is not consistent, which should not be the case. If we consider the appearance of B1 as noise and do not include it in the overall consistency calculation, the consistency of EJAZZ becomes 100%. As a result, we can conclude that calculating the average consistency does not give an insight into the website.

As we mentioned previously, the PDOC value affects the number of blocks in a page, and this in turn affects the template detection process. Table 4.10 reports the different PDOC values for the tested websites. These values have been chosen based on some initial test runs.

In all cases in Table 4.9, B2 is the valid block being outputted, except when d1 is checked against d11, where B1 is outputted alongside B2. After looking closely at d11, we notice that its content is considerably different from all other pages, which in turn causes B1 to be a false positive. When there is a big difference between the numbers of blocks in the two pages being processed, inconsistency is possible. Still, the robust measure that we use overcomes these issues in most cases, as our results show.

4.3.2. Accuracy measures

This section evaluates the accuracy of the Noise Detector in identifying valid blocks. Since the consistency of the Noise Detector as reported in Section 4.3.1 is fairly high, checking its accuracy is reasonable. The statistical measures to be used in evaluating the Noise Detector are computed by the following set of equations:

Table 4.9
Output of cross-checking d1 against D − {d1}.

Page checked against    Valid output blocks of d1
d2                      {B2}
d3                      {B2}
d4                      {B2}
d5                      {B2}
d6                      {B2}
d7                      {B2}
d8                      {B2}
d9                      {B2}
d10                     {B2}
d11                     {B1, B2}

Table 4.10
PDOC values of VIPS.

Website    BBC    CNET    CNN    EJAZZ    J&R    MemoryExpress    Mythica    PCMag    Wikipedia
PDOC       6      6       6      6        6      5                7          5        6


\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{4.1} \]

\[ \text{Specificity} = \frac{TN}{FP + TN} \tag{4.2} \]

\[ \text{Sensitivity (Recall)} = \frac{TP}{TP + FN} \tag{4.3} \]

\[ \text{Positive Predictive Value (Precision)} = \frac{TP}{TP + FP} \tag{4.4} \]

\[ \text{Negative Predictive Value} = \frac{TN}{TN + FN} \tag{4.5} \]

\[ F_\beta = (1 + \beta^2)\,\frac{\text{Precision} \times \text{Recall}}{\beta^2\,\text{Precision} + \text{Recall}} \tag{4.6} \]

where TN, TP, FP and FN stand for true negatives, true positives, false positives, and false negatives, respectively. These four measures could be interpreted as follows: true negatives refer to noisy blocks correctly identified as noisy; true positives refer to informative blocks correctly identified as informative; false positives refer to noisy blocks wrongly identified as informative; and false negatives refer to informative blocks wrongly identified as noisy.

The Noise Detector is capable of differentiating two classes of blocks, namely noisy and informative, with the latter being the target class. A high sensitivity score (recall of the target class) means that the informative blocks have been well recognized; and a high specificity score (recall of the other class) means that the noisy blocks have been well recognized. Specificity evaluates the effectiveness of the system in recognizing the negative cases; therefore, it does not tell us how well the system recognizes positive cases. On the other hand, the positive prediction rate stands for the precision of the target class, and the negative prediction rate stands for the precision of the other class. Precision is the most important measure since it reflects the probability that a block returned as positive truly belongs to the target class. F_β is a measure that combines precision and recall; it is mostly computed with β = 1 and then called the F1-score. Finally, the two measures recall and precision are redefined next for the target class by renaming informative blocks as valid blocks.

Definition 4.1 (Precision). Precision is the number of valid returned blocks divided by the total number of returned blocks: \( P = \frac{\text{returned valid blocks}}{\text{all returned blocks}} \).

Definition 4.2 (Recall). Recall is the number of valid returned blocks divided by the total number of valid blocks (the number of blocks that should have been returned as valid): \( R = \frac{\text{returned valid blocks}}{\text{all valid blocks}} \).

Definitions 4.1 and 4.2 (as well as Eqs. (4.1)–(4.6)) are specific to blocks in our case, but can be generalized to fit any discrete output: documents in retrieval systems, classes in classification problems, etc.
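For completeness, a small sketch computing Eqs. (4.1)–(4.6) from the four block counts (a helper of our own, shown only to pin down the definitions; it assumes the denominators are non-zero):

```python
def evaluation_measures(tp, tn, fp, fn, beta=1.0):
    """Eqs. (4.1)-(4.6) over counts of true/false positive/negative blocks."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),       # Eq. (4.1)
        "specificity": tn / (fp + tn),                     # Eq. (4.2)
        "sensitivity": recall,                             # Eq. (4.3)
        "precision": precision,                            # Eq. (4.4)
        "negative_predictive_value": tn / (tn + fn),       # Eq. (4.5)
        "f_beta": (1 + beta ** 2) * precision * recall
                  / (beta ** 2 * precision + recall),      # Eq. (4.6)
    }
```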

4.3.3. Noise Detector accuracy evaluation

We have applied the Noise Detector to an average of 100 pages per website. After downloading a set of pages from each website, we randomly picked a page. All pages in each set are then tested against the corresponding randomly picked page. As a result, the Noise Detector creates new HTML files that are made up of the valid-marked blocks, i.e., 100 new clean files will be outputted for a set of 100 original pages. These clean HTML files are checked afterwards by an evaluator who reports the number of valid and invalid blocks. In this evaluation process, the evaluator needs to look at the original pages to determine which blocks are valid and which are invalid. Also, the evaluator reports how many blocks are actually valid but are missing in the set of output blocks.

The precision, recall, and F1-score values (for the target class) of the Noise Detector for the different websites are shown in Table 4.11. These accuracy results are computed by testing (most of the time) only two pages against each other. Our approach still outperforms most of the other approaches described in the literature. Recall that this is all due to two main factors:

Table 4.11
Precision, recall and F1-score results of the Noise Detector in identifying noisy blocks.

Website          Precision (%)    Recall (%)    F1-score (%)
BBC              100              98.9          99.4
CNET             87               100           93
CNN              100              100           100
EJAZZ            96.2             100           98
J&R              100              100           100
MemoryExpress    94.8             94.8          94.8
Mythica          63.9             100           78
PCMag            91.3             100           95.4
Wikipedia        100              100           100


1. The Noise Detector operates at the semantic level by segmenting Web pages into coherent blocks.
2. The noise measure used by the Noise Detector incorporates three main features of HTML, namely content, structure, and presentation.

BBC, CNN, EJAZZ, J&R, PCMag, and Wikipedia have relatively high precision/recall results. By looking at their page layouts, we can see that they are more consistent compared to CNET and Mythica. The Mythica website has the lowest precision/recall results. Later in the evaluation, we show that Mythica's precision/recall results can be boosted by testing against an extra third page (see Section 4.3.6 for details).

Table 4.12 shows the average values of the measures computed by Eqs. (4.1)–(4.6), considering all the websites enumerated in Section 4.2. These high values clearly show that the Noise Detector is capable of correctly identifying both noisy and informative blocks.

The figures reported in Table 4.12 show that ND achieves high sensitivity, which means that ND can detect the target blocks with a very high accuracy of up to 99.3% on average. Also, the negative predictive rate is close to 100%, which means that ND is almost perfect in detecting the blocks that are not in the target class. The specificity of ND is 86%; this is not as high as the other measures, and it means that ND will misclassify 14% of the noisy blocks as informative. Finally, having the positive predictive rate at 95% indicates that ND will be accurate up to 95% in identifying the target informative blocks. These results as a whole demonstrate the robustness of ND in the sense that it is able to successfully detect the two types of blocks with high accuracy. Unfortunately, in this regard we are not able to comment on the other approaches described in the literature because they only reported the recall and precision of the target class. Having the latter two values high does not provide much information about the recall and precision of the other class.

4.3.4. Number of valid blocks

We have introduced a dynamic way to decide on the number of valid blocks. This is not applied, however, in [13], where Li and Ezeife use three as a static number of valid blocks. The extensive experiments we have conducted show that the number of valid blocks is variable and page-variant. Fig. 4.3 has detailed statistics about the number of valid blocks for the different websites that we used in the experiments.

4.3.5. Comparison against FR approach

In Section 4.3.3, we have shown the accuracy of the Noise Detector in detecting the templates of different websites. All of our tested websites have been taken from the set tested by Vieira et al. [24]. In this section, we compare our results with the results that were reported by Vieira et al. In the set of websites we have used, there are 8 common websites.

The Noise Detector uses a complex noise measure which enables it to infer noisy blocks using only two pages, yet with high confidence. More clearly, we executed our experiments on different websites. Each website has its own set of pages. Each page in a set was tested against another page from the same set. Results are then reported with the valid output blocks, invalid output blocks, and missing valid blocks, which are used to calculate precision, recall, and F1-score. So, the numbers reported in Figs. 4.4 and 4.5 pertaining to the Noise Detector are averages of testing two pages against each other. Table 4.13, however, suggests that when the Noise Detector does not perform well on a certain website, we can use a set of three pages (or more, depending on the number of templates per website) instead of using only two pages. See Section 4.3.6 for more details.

For all tested websites, our F1-score results were higher than the results of Vieira et al. [24]. However, our results for the Mythica website were poor when using only two pages. Still, the approach of Vieira et al. [24] requires more than 25 pages on average to start giving good results. We will use the abbreviation FR (Fast and Robust Method) to denote the approach of Vieira et al. [24] and ND for the Noise Detector.

Fig. 4.4 illustrates the F1-scores of ND and FR using only two pages. It is clear that ND outperforms FR on all websites except Mythica. Fig. 4.5 shows the F1-scores of ND using sets of two pages, and the F1-scores of FR using 90 pages. Two pages is the minimum for ND to give satisfactory results, and 90 pages is the minimum for FR.

The results reported in Fig. 4.5 reveal that ND does not perform much better than FR for some websites, such as PCMag and CNET. The common feature amongst these websites is having user comments and reviews. This adds to the complexity of these websites' structures and makes it more difficult to detect templates with high accuracy using only two pages. Still, ND has achieved high F1-scores of 93% and 95.4% for CNET and PCMag, respectively. We believe that using more than two pages with such websites will generally boost the F1-score. In fact, our claim is supported by the improvement in the F1-score of the Mythica website as reported in Table 4.13.

Table 4.12
Accuracy measures of the Noise Detector.

Measure                      Average (%)
Accuracy                     92.3
Sensitivity                  99.3
Specificity                  86.2
Positive predictive value    95.2
Negative predictive value    99.8


Fig. 4.3. Number of valid blocks for different websites.

Fig. 4.4. F1-score measure of ND vs. FR (two pages each).

Fig. 4.5. F1-score measure of ND vs. FR (ND with two pages, FR with 90 pages).

Table 4.13
Using three pages for cross-testing.

Website    Precision (%)    Recall (%)    F1-score (%)
Mythica    87.5             100           93.3


4.3.6. Refinement

To achieve even higher precision and recall results in detecting noisy blocks, a page can be tested against two other pages from the same website. The intersection of both outputs is considered the final valid result. Mythica has the lowest precision results in detecting noisy blocks. Table 4.13 shows how the accuracy of ND increases when we perform this type of extra testing.

Fig. 4.6. Noise Detector's performance against different template styles.

When sets of three pages are used, the F1-score reported by ND for Mythica increased from 78% to 93.3%. This extra step adds to the time consumed by the process, but increases the accuracy of ND. So, if the accuracy of detecting noisy blocks in a website is not acceptable, then this extra step becomes necessary. It may be applied recursively until either the accuracy stops improving or the time consumed grows beyond the expected limit. A human expert has to determine how many pages a given page needs to be cross-tested against. Yet, as the results show, for most of the tested websites high consistency results are achieved even when using only one page without cross-testing against many pages. The websites that required ND to process more than two pages could be classified as outliers as far as the trend in website development is concerned. In other words, it is very common to have one template per website, and even having the same template across a number of websites is not unusual.
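In code, this refinement is just a set intersection over two pairwise runs; detect_valid_blocks is the same assumed pairwise call to the Noise Detector used in the earlier sketch.

```python
def refined_valid_blocks(page, ref_a, ref_b, detect_valid_blocks):
    """Section 4.3.6 refinement: keep only the blocks marked valid when
    `page` is tested against each of two reference pages from the site."""
    return detect_valid_blocks(page, ref_a) & detect_valid_blocks(page, ref_b)
```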

4.3.7. Reasons for degraded results of ND

ND gives good results when processing well-templated websites. However, the results of ND start to degrade when processing websites that have generic templates like Amazon (http://www.amazon.com/) and eBay (http://www.ebay.com/). We depict the performance of ND against websites with different template styles in Fig. 4.6.

When ND operates on websites that are well-templated (almost fixed templates), it gives the best results. BBC, CNN, and MemoryExpress are examples from this category. However, as the graph in Fig. 4.6 depicts, the results of ND start to degrade when we move towards generic templates where pages across the website have user comments and different styles, similar to what we have in eBay and Amazon. Similarly, when there are user comments and a complex layout with tabs, frames and other Web objects, the performance of ND degrades even more; a good example of this category is CNET (http://www.cnet.com/).

4.3.8. Using more than two pages

As we have mentioned in the previous section, using more than two pages can give better accuracy. We used the Mythica website to find out the effect of using more pages. Fig. 4.7 illustrates the trend of the accuracy of ND when using more than two pages.

As Fig. 4.7 shows, the precision stabilizes at 91.7%: the precision of ND when using four pages is 91.7%, and its precision when using five pages is also 91.7%.

4.4. Granularity

The granularity that ND can achieve depends solely on VIPS. That is, the more coherent the blocks are, the higher the granularity will be. Recall that the basic unit that ND operates on is a block. So, a whole block will be either outputted or removed. Therefore, if a block has small noisy parts in it, these parts will be in the output if their respective block is valid. Fig. 4.8 shows an example from the BBC website where a small part of Block3 is part of the BBC template, i.e., it appears in all the BBC pages, though it is part of an output valid block.

The granularity problem can be overcome to some extent by increasing the PDOC value of VIPS. The higher the PDOC, the more coherent the output blocks are. In VIPS, PDOC can be adjusted to a maximum of 10. But still, there will be some parts which will not be separated into a single block. Furthermore, if ND has to handle many tiny blocks as a result of increasing PDOC, it will not give accurate results because of the confusing similarity measures amongst blocks in the two input pages. The PDOC values that we have reported in Table 4.10 are recommended for the set of websites used in the experiments. Notice that the PDOC for most of the tested sites is either 5 or 6.

4.5. General comparison with other approaches

Table 4.14 shows a comparison between our approach and the other related approaches. In this table, we compare our approach with LH [14], SST [27], and Bar-Yossef [3].

The approach that Lin and Ho (LH) proposed in [14] uses the <Table> tag to partition pages. Then, keywords are used to construct a feature-document matrix, through which they calculate the entropy of these features; i.e., content is the main feature that has been used.

The SST approach proposed in [27] mainly uses the structure of Web pages to construct a huge tree (SST); the entropy of its internal nodes is then calculated to decide their noise values. Presentational features are used to determine the noise values of the leaf nodes. This approach requires an average of 400–500 pages to build the SST; as a result, the noisy parts are identified.

Fig. 4.7. Precision of Noise Detector using more than two pages.

Fig. 4.8. Small noisy parts.

Table 4.14
Comparison between different approaches.

Approach      Involves       Incorporates   Incorporates   Incorporates   Requires a large        Was tested on different   Was tested only on
              segmentation   structure      content        presentation   training set of pages   types of sites            news websites
LH            X(a)                          X                             X                                                 X
SST                          X                             X              X                       X
Bar-Yossef    X                             X                             X                       X
FR                           X                                            X                       X
ND            X              X              X              X                                      X

(a) The segmentation used in [14] is based only on the <Table> tag. This limitation made their testing limited mostly to news websites. Besides, not all websites use the <Table> tag to segment their pages.


Bar-Yossef and Rajagopalan [3] proposed a template detection approach that segments a page into a set of pagelets depending on a linkage percentage. They use shingles to determine whether a block is part of the template or not. A shingle is essentially a content measure.

To detect the template in a page, our approach mostly requires only one other page to be tested against the given page. The strength of our measure, which enables the Noise Detector to give satisfactory results, comes from combining the structure, content, and presentation measures, which are all characteristics of HTML pages. Our approach showed excellent results on a variety of websites, including news websites.

To sum up, we have shown that the Noise Detector is accurate in detecting the noisy blocks that constitute templates in websites. As one of the main contributions is reducing the analyzed set of documents to only two pages, we started the evaluation process by establishing the consistency of the Noise Detector; then, we ran other experiments to report its accuracy. Our data set consists of websites that were used in the literature. These websites include news sites, online retailers, encyclopedias, and magazines. The F1-score results for these websites were high, and our approach outperformed the work described in [24] for most of the tested websites. However, the Mythica website results indicated a case for which the developed Noise Detector could not report good results by depending on only two pages in the analysis; hence, we extended the methodology to use an extra page when necessary, especially for websites that do not have a regular template. The results for Mythica improved by testing each page against two pages instead of one; explicitly, this boosted the F1-score result by 15%. Three websites achieved an F1-score of 100%.

4.6. Processing time

When a new algorithm is proposed, its superiority over other existing algorithms is determined by applying various criteria such as, but not limited to: space complexity, time complexity, minimum requirements to produce acceptable results, and accuracy. We have shown through the experiments described above that ND outperforms existing techniques in space complexity, minimum requirements to produce acceptable results (i.e., it requires two pages), and accuracy. For time complexity, however, our approach cannot be directly compared to the algorithms in Table 4.14 because they require a training set to derive a model and then use this model for new unseen pages, whereas ND always needs to process two pages at a time. Therefore, on a small scale, our approach has better time performance because only two pages are involved in the process. On the other hand, ND is not guaranteed to outperform the algorithms listed in Table 4.14 on a large scale since there is no model inferred from a training set. As we explain in Section 4.7, using our algorithm it is possible to derive a model that can be used in the same fashion as the algorithms listed in Table 4.14.

Fig. 4.9. Execution time of refined approach vs. naïve approach of ND.

The execution time of ND is good, averaging 1.57 s for comparing two pages and outputting the valid blocks. This average was calculated over different websites. We noticed that CNET, for example, takes more time on average than CNN, which is due to the larger content of CNET pages. Fig. 4.9 shows the improvement that ND achieved when applying the concept of neighboring blocks (see Definition 3.2). The processing time without this refinement is 4.38 s on average, so the refinement reflects a 65% improvement.

We notice that calculating the content, structure, and presentational noise measures for the neighboring blocks only reduced the execution time of ND dramatically. The optimal execution time is achieved when both pages have the same number of blocks. In such a case, the number of blocks that need to be processed is 2m, where m is the number of blocks in a page.

The SST approach [27] requires just below 20 s for building the SST tree and another 2 s to compute composite importance values. Our approach, however, requires on average less than 2 s, as Fig. 4.9 suggests. This is especially true when the numbers of blocks in the two pages are close.

4.7. Future work

Despite the promising results that ND shows, we have identified future directions to improve its results even more and automate its process with less human intervention.

The current version of ND cannot automatically decide which blocks are actually valid and which ones are not, i.e., noisy. Human intervention was inevitable in our experiments to validate the results. We believe that inferring a model through a training process, using the noise values calculated from presentation, content, and structure and/or block indices, can automate the process with high accuracy. Considering block indices is practical because VIPS considers the location of blocks when assigning their indices.

A future extension that benefits from the outcome of the Noise Detector is restructuring blocks within the pages of a website by forcing all the extra information to follow a common template. Even in the case of multiple templates per website, all may be arranged to follow a common layout. Such a process would make it easier for users to surf the website and examine its useful content (assuming the extra information, like advertisements, is not useful and is considered noise).

We also intend to apply regression methods and/or genetic algorithms to determine the weights of the content, structure, and presentation noise measures in a more dynamic way.

5. Conclusions

The Noise Detector developed in this paper could be seen as the result of realizing the importance of noise reduction in turning Web pages into a more useful source of information, especially for applications dedicated to dealing with the actual informative content of websites. This section summarizes the basic aspects of the developed approach and highlights the main lessons learned from this experience. The conducted experiments demonstrate that the approach described in this paper could be recognized as a major step in the right direction. However, further improvements and extensions could lead to a more comprehensive product.

The proposed Noise Detector applies a robust noise measure to discover templates. The experiments that we have conducted show that the Noise Detector is able to detect templates with high accuracy in websites from different domains, and not only news websites. The experiments also show that removing noisy information, i.e., templates in our case, boosts the accuracy of Web mining tasks. Since 40–50% of the websites on the Web are templated, removing these templates becomes a necessity for better mining results. In fact, our experiments show that the required storage space shrinks to about one third of the original space. Also, we have shown that the Noise Detector achieves better accuracy results in detecting templates; an average F1-score of 94% was achieved. This performance is achieved despite the fact that the Noise Detector uses far fewer input pages.

Lin and Ho [14] had tested only news websites, which are usually highly consistent in their layout, and hence it is easier to detect their templates. However, the Noise Detector has been tested on a variety of websites with diversity in the layout, including news websites, online retailers, online encyclopedias, and online magazines. These websites have been collected from different systems that were proposed in the literature.

References

[1] F. Ahmadi-Abkenari, A. Selamat, An architecture for a focused trend parallel Web crawler with the application of clickstream analysis, Information Sciences 184 (1) (2012) 266–281.
[2] S. Akbar, L. Slaughter, Ø. Nytrø, Extracting main content-blocks from blog posts, in: Proceedings of the International Conference on Knowledge Discovery and Information Retrieval, 2010, pp. 438–443.
[3] Z. Bar-Yossef, S. Rajagopalan, Template detection via data mining and its applications, in: Proceedings of the 11th International Conference on World Wide Web, ACM, Honolulu, Hawaii, USA, 2002, pp. 580–591.
[4] D. Cai, S. Yu, J.-R. Wen, W.-Y. Ma, Block-based Web search, in: SIGIR '04, ACM, Sheffield, South Yorkshire, UK, 2004.
[5] D. Cai, S. Yu, J.-R. Wen, W.-Y. Ma, VIPS: A Vision-based Page Segmentation Algorithm, Microsoft Corporation, Redmond, 2003.
[6] J. Chen, B. Zhou, J. Shi, H. Zhang, Q. Fenqwu, Function-based object model towards website adaptation, in: Proceedings of the 10th International Conference on World Wide Web, Hong Kong, 2001.
[7] S. Debnath, P. Mitra, C.L. Giles, Automatic extraction of informative blocks from webpages, in: Proceedings of the 2005 ACM Symposium on Applied Computing, ACM, Santa Fe, New Mexico, 2005, pp. 1722–1726.
[8] D. Fernandes, E. Moura, B. Ribiero-Neto, A. Silva, M. Goncalves, Computing block importance for searching on Web sites, in: CIKM 2007, 2007, pp. 165–174.
[9] D. Gibson, K. Punera, A. Tomkins, The volume and evolution of Web page templates, in: International World Wide Web Conference, ACM, Chiba, Japan, 2005, pp. 830–839.
[10] S. Gupta, G. Kaiser, D. Neistadt, P. Grimm, DOM-based Content Extraction of HTML Documents, ACM, Budapest, Hungary, 2003.
[11] M. Kovacevic, M. Diligenti, M. Gori, V. Milutinovic, Recognition of common areas in a Web page using visual information: a possible application in a page classification, in: Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM '02), IEEE Computer Society, Maebashi City, Japan, 2002.
[12] C. Li, J. Dong, J. Chen, Extraction of informative blocks from Web pages based on VIPS, Journal of Computational Information Systems 6 (1) (2010) 271–277.
[13] J. Li, C. Ezeife, Cleaning Web pages for effective Web content mining, in: 17th International Conference on Database and Expert Systems Applications, Krakow, Poland, 2006, pp. 560–571.
[14] S.-H. Lin, J.-M. Ho, Discovering informative content blocks from Web documents, in: The 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, 2002, pp. 588–593.
[15] B. Liu, Web Data Mining: Exploring Hyperlinks, Contents and Usage Data, Springer, New York, 2006.
[16] B. Liu, K. Chen-Chuan-Chang, Editorial: special issue on Web content mining, ACM SIGKDD Explorations Newsletter (2004) 1–4.
[17] B. Liu, W. Hsu, Y. Ma, Integrating classification and association rule mining, in: ACM SIGKDD, AAAI Press, 1998, pp. 80–86.
[18] L. Lo, V. To-Yee Ng, P. Ng, S.C. Chan, Automatic template detection for structured Web pages, in: Proceedings of the 10th International Conference on Computer Supported Cooperative Work in Design, Nanjing, China, 2006, pp. 1–6.
[19] M. Murata, H. Toda, Y. Matsuura, R. Kataoka, Access concentration detection in click logs to improve mobile Web-IR, Information Sciences 179 (12) (2009) 1859–1869.
[20] C. van Rijsbergen, S. Robertson, M. Porter, New models in probabilistic information retrieval, British Library Research and Development Report No. 5587, London, UK, 1980.
[21] R. Song, H. Liu, J.-R. Wen, W.-Y. Ma, Learning block importance models for Web pages, in: Proceedings of the 13th International Conference on World Wide Web, ACM, New York, NY, USA, 2004, pp. 203–211.
[22] K.-C. Tai, The tree-to-tree correction problem, Journal of the ACM 26 (3) (1979).
[23] B. van Gils, H.A.E. Proper, P. van Bommel, Th.P. van der Weide, On the quality of resources on the Web: an information retrieval perspective, Information Sciences 177 (21) (2007) 4566–4597.
[24] K. Vieira, A.S. Silva, N. Pinto, E.S. Moura, J.M. Cavalcanti, J. Freire, A fast and robust method for Web page template detection and removal, in: The 15th ACM International Conference on Information and Knowledge Management, ACM, Arlington, Virginia, USA, 2006, pp. 258–267.
[25] World Internet Usage Statistics News and World Population Stats, Internet Usage World Stats. <www.internetworldstats.com/stats.htm> (retrieved 16.01.09).
[26] W. Yang, Identifying syntactic differences between two programs, Software: Practice and Experience 21 (1991).
[27] L. Yi, B. Liu, X. Li, Eliminating noisy information in Web pages for data mining, in: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, Washington, 2003, pp. 296–305.
[28] K. Zhang, The Editing Distance Between Trees: Algorithms and Applications, PhD thesis, Courant Institute, Department of Computer Science, 1989.