Noise Reduction in Web Pages Using Featured DOM Tree

International Journal of Advance Foundation and Research in Computer (IJAFRC)

Volume 3, Issue 1, January - 2016. ISSN 2348 – 4853, Impact Factor – 1.317

© 2016, IJAFRC All Rights Reserved www.ijfarc.org

Athira Paniker1, Surabhi Panicker2, Preetika Ravkhande3, Nazneen Tamboli4, Prof. Renuka Puntambekar

Dept. of Computer Engineering, MIT Academy of Engineering, Savitribai Phule Pune University, Pune, India

ABSTRACT

The Web contains a vast amount of data, and a typical web page holds a large number of information blocks. Apart from the main content blocks, a page usually has blocks such as copyright notices, navigation panels, privacy notices, and advertisements. Such blocks, which are not the main content blocks of a web page, can be called noisy blocks. These items often serve a useful function for human viewers and are important to web site owners. However, they hamper automated information retrieval and web data mining. Implementing new web mining techniques that gather information from the Web has therefore become a major challenge for researchers. Because the Web is a repository of massive information, the volume of data is growing rapidly, and it often comes with an excess amount of noise. Noisy terms reduce the efficiency of feature extraction and the accuracy of the final classification. It is therefore critical to clean web pages before mining in order to improve mining results. Our project focuses on identifying and removing local noise in web sites or HTML pages so as to improve the overall performance of web mining. We propose a novel idea for the detection and removal of local noise using a tree structure called the featured Document Object Model (DOM) tree. We use a three-stage algorithm: feature selection is performed in the first phase, the web page is modelled as a featured DOM tree in the second phase, and noise is marked and pruned in the third phase. The output is a clean web page.

Index Terms: Feature Extraction, Web Data Mining, Local Noise, DOM Tree, Clean Web Page, Information Retrieval.

I. INTRODUCTION

Data mining is the analysis of large data sets to extract useful information. An enormous amount of data is available in the field of information technology, and useful information needs to be extracted from it. Applications such as market analysis, customer retention, production control, fraud detection, and science exploration make use of this information. Extracting useful information from the content of the Web is called Web Content Mining. Topic discovery, extraction of association patterns, clustering of web documents, and classification of web pages are among the research issues addressed in text mining. A significant amount of work has been done on extracting useful information from images in the field of image processing, whereas comparatively little research has been done on web content mining. Applying data mining methods to detect interesting usage patterns in Web data, in order to understand them and better serve the requirements of Web-based applications, is called Web Usage Mining. Identifying correct and relevant information in web pages has become a difficult task because of the rapid growth of content on the World Wide Web and the noise within web pages.

Noise such as advertisements, banners, and privacy notices surrounds the informative web content. Noise can be categorized as follows:

Global Noise: noise that is no smaller than a single web page, such as mirror sites and duplicated web pages.

Local Noise: noise that is present within a web page, such as navigation panels, advertisements, links, and banners.

Our project focuses on the removal of local noise.

II. LITERATURE SURVEY

The Classification based Cleaning method [1] is a simple method for web page cleaning, i.e. for detecting and eliminating specific noise from a web page using pattern classification techniques. This method is semi-automatic and supervised. Noise is detected with the help of a decision tree classifier, a classic machine learning technique used in many research fields. The limitation of this technique is that it can eliminate only certain noise items.

Lin and Ho [2] propose a Segmentation based Cleaning method, which is supervised. It categorizes the contents of a web page as distinguished contents and common contents. Distinguished content is the informative content, while common content is the redundant content. The drawback of this method is that it takes into consideration only the data contents.

Bar-Yossef and Rajagopalan [3] propose a Template based Cleaning method, which is automatic and unsupervised. This method treats the template of a web page as noise. A set of web pages, called a cluster, is passed as input, and the templates are cleaned from the cluster. It is efficient only for clusters that consist of web pages from different sites.

The SST based Cleaning technique [4] is a partially supervised cleaning technique that combines the Segmentation based and Template based Cleaning techniques. It analyses both the layout and the contents. The Site Style Tree (SST) is a generalized DOM tree representation that can be used to model HTML and XML web pages. This structure is useful for detecting and eliminating noise from web pages. However, some web sites with dynamic web pages do not share common content and presentation styles, and detecting noise on these sites with this technique is difficult. The technique is also less successful at detecting noise patterns that differ from the expected noise patterns.

The Feature Weighting Based Cleaning Method [5] is an improved version of the Site Style Tree. It is an automatic and unsupervised noise cleaning method. It uses a tree based approach that combines features based on HTML content, visual representation, and tree structure. Elements of the documents are combined within the tree if their child elements share identical tag names, attributes, and attribute values. The weight of an element is calculated from the number of different presentation styles. The resulting element weights are used in follow-up tasks such as classification. The efficiency of this approach depends on the availability of a relatively large number of web documents from a limited number of data sources.

Kao et al. [6] make use of the HITS (Hyperlink Induced Topic Search) algorithm, which evaluates the importance of the hyperlinks present in a web page. Its drawback is that it only rates the web page and does not eliminate noise.

InfoDiscoverer, by Kao, Ho, and Chen [7], extracts information from a set of tabular documents within a web site. Its limitation is that it applies only to web sites with tabular documents; it is not applicable to web sites that contain no tabular documents.

VIPS [8] detects and eliminates noise from a web page at the semantic level, using page layout features and some heuristic rules. Its drawback is that it is resource intensive.




Thanda Htwe et al. [9] propose a mechanism to detect and eliminate redundant and irrelevant data using Case Based Reasoning. It analyses multiple noise patterns in web pages and uses a back propagation neural network algorithm to match the current noise against the expected noise patterns, after which the noise is eliminated from the current page. It takes into consideration only the contents of the web page and not the layout. Therefore other noise items such as images and advertisements are not taken into consideration.

Guohua Hu and Qingshan Zhao [10] propose a new tree structure called the Style Tree. It captures the actual contents and the common layouts or presentation styles of the pages in a web site. A style tree is generated from the web page provided as input, and an information based mechanism determines and marks which parts of the style tree are noisy. The parts marked as noisy are deleted, and a clean web page is provided as output.

The technique proposed in this paper aims to overcome the drawbacks of the techniques described above, and it offers additional advantages. A DOM tree can be implemented in any programming language, its data structure is easy to modify, and data is easy to extract from it. The proposed three-phase algorithm aims at detecting noise and irrelevant and redundant data and removing them from web pages. Further work in this field will help achieve more efficient noise removal by directly detecting the informative content instead of detecting and eliminating noise. The technique can be incorporated into search engines for better indexing and web page ranking. Accuracy can be further improved by using more efficient methods for feature selection and feature set generation. [11]

III. PROJECT IDEA

Detection and elimination of noise can be implemented as a pre-processing step for web content mining. The objective of this project is to detect and eliminate noisy items from a web page so that complexity is reduced and efficiency is increased during processing. The three phases of the algorithm that we are implementing are featuring, modelling, and pruning. The algorithm combines different weighting approaches.

Featuring is the first phase, in which various standard web page preprocessing techniques are applied, such as tokenization, removal of HTML tags and stop words, and generation of feature sets. A featured set of tokens is generated. The tokens are assigned weights using a standard weighting scheme, and the assigned weights are then normalized using some basic approaches. Features are selected whose weight is above the threshold value. The threshold value varies dynamically according to the length of the document or the maximum weight of the terms.
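
The featuring phase can be sketched as follows. This is a minimal illustration, not the weighting scheme used in the project: it assumes term frequency as the weight, normalization by the maximum frequency, and a threshold set to a fixed fraction of the maximum weight. The stop-word list, the function names, and the `ratio` parameter are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

def strip_tags(html):
    """Remove script/style bodies, then all remaining HTML tags."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    return re.sub(r"(?s)<[^>]+>", " ", html)

def feature_set(html, ratio=0.5):
    """Return the tokens whose normalized weight is above a dynamic threshold."""
    tokens = [t for t in re.findall(r"[a-z]+", strip_tags(html).lower())
              if t not in STOP_WORDS]
    counts = Counter(tokens)
    if not counts:
        return {}
    max_count = max(counts.values())
    # Assumed weighting: term frequency normalized by the maximum frequency.
    weights = {t: c / max_count for t, c in counts.items()}
    # Assumed threshold: a fraction of the maximum term weight.
    threshold = ratio * max(weights.values())
    return {t: w for t, w in weights.items() if w > threshold}
```

Any other weighting scheme can be substituted for the term-frequency step without changing the selection logic.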

Modelling is the next phase, in which the DOM tree of the web page is generated. In the Document Object Model, a document is represented as a tree. DOM trees are highly transformable, so a complete web page can easily be reconstructed from them. The DOM tree is a well-defined HTML document model. Some HTML tags do not require a closing tag. For some tags, the closing tag is implied by the tag that follows; for instance, an <LI> element is closed by the next <LI> tag. To analyze a web page, we first check the syntax of the HTML document, since most HTML web pages are not well formed. We then pass the web page to an HTML parser, which rectifies the markup and creates a Document Object Model (DOM) tree. Once the DOM tree has been created, the system splits it into many sub-trees depending on the threshold level. Individual web sites adopt varied layouts and presentation styles, so the depth of a page's tree varies with its presentation style. The system needs to know the maximum level of the DOM tree in order to choose a good threshold level. Hence, the system traverses the entire DOM tree to obtain its maximum depth.
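
A minimal sketch of the modelling phase is given below, assuming Python's standard `html.parser` module rather than the parser used in the project. The `Node` and `DomBuilder` classes and the `build_dom` and `max_depth` helpers are hypothetical names. The builder tolerates only two common irregularities (void tags and an unclosed `<li>`), so it is far less thorough than a markup-repairing HTML parser.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag (a partial, assumed list).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], ""

class DomBuilder(HTMLParser):
    """Builds a simple tree, tolerating some missing closing tags."""
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # A new <li> implicitly closes a still-open sibling <li>.
        if tag == "li" and self.current.tag == "li":
            self.current = self.current.parent
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data

def build_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def max_depth(node):
    """Depth of the deepest node, counting the document root as level 0."""
    if not node.children:
        return 0
    return 1 + max(max_depth(child) for child in node.children)
```

For example, `max_depth(build_dom(page_source))` gives the maximum level that the next phase uses when choosing a threshold level.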




Pruning is the last phase of the algorithm. We pick the best-suited threshold level for the training data set by trying various threshold levels. The system then picks a suitable threshold level for the test data set by utilizing these known pairs of maximum level and threshold level. The relationship between the threshold level and the maximum level is estimated using linear regression analysis. Regression is a statistical analysis that helps to assess the association between two variables; it is also used to determine which independent variables are related to the dependent variable and to explore the forms of these relationships. Once the threshold level has been obtained, the system identifies the DOM nodes that fall below the threshold level as noise and discards them. The clean web page is then generated from the DOM tree after the marking and removal of nodes whose weight is less than the threshold value. [11]
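
The threshold-level selection described above can be illustrated with a simple least-squares fit. This is a sketch under stated assumptions: the training pairs are invented for illustration, the function names are hypothetical, and the fitted value is rounded and clamped to a valid tree level, which the source does not specify.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict_threshold_level(max_level, slope, intercept):
    """Round the fitted value to a whole level, kept within [1, max_level]."""
    return max(1, min(max_level, round(slope * max_level + intercept)))

# Hypothetical training pairs: maximum DOM level of a page and the
# threshold level that worked best for that page.
train_max_levels = [6, 8, 10, 12]
train_thresholds = [3, 4, 5, 6]
slope, intercept = fit_line(train_max_levels, train_thresholds)

# Threshold level for an unseen page whose DOM tree has maximum level 14.
level = predict_threshold_level(14, slope, intercept)
```

With real training data the fit would not be exact, and the residuals would indicate how well the maximum level alone predicts a good threshold level.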

The Featured DOM Tree is used for web data mining because its data structure is easy to modify and data is easy to extract from it, which makes noise detection and elimination in the pruning phase easier. The Featured DOM Tree provides a more detailed representation of the web page than a normal DOM tree. In addition, the DOM extension is independent of browser size and text size settings.

The output of the proposed technique is a set of clean web pages that are free of noisy content. It is obtained by a bottom-up traversal of the Featured DOM Tree. During the traversal, every node whose weight is less than the threshold is marked as noise and eliminated. If all the child nodes of a parent node are marked as noise, the parent node itself is eliminated. We thus obtain a set of web pages free from noisy content such as copyright notices, navigation panels, privacy notices, and advertisements.
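
The bottom-up traversal can be sketched as a post-order recursion. The `Node` class, the `prune` function, and the example weights are hypothetical; leaf weights are assumed to have been computed already in the earlier phases.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    weight: float = 0.0  # feature weight of a leaf block (assumed precomputed)
    children: list = field(default_factory=list)

def prune(node, threshold):
    """Post-order traversal; returns True if `node` should be kept."""
    if not node.children:
        # A leaf whose weight is below the threshold is marked as noise.
        return node.weight >= threshold
    node.children = [c for c in node.children if prune(c, threshold)]
    # A parent whose children were all marked as noise is removed as well.
    return bool(node.children)

# Illustrative page: a navigation block and an article block.
page = Node("body", children=[
    Node("div", children=[Node("a", 0.1), Node("a", 0.2)]),
    Node("div", children=[Node("p", 0.9), Node("span", 0.1)]),
])
kept = prune(page, threshold=0.5)
```

After the call, `page` holds only the second `div` with its `p` child. The first `div` is dropped because both of its children fall below the threshold.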

IV. ARCHITECTURE DIAGRAM

The figure illustrates the process involved in noise detection and elimination.

The web page is provided as input to both the Parser and the Featured DOM Tree Generator. The Parser generates tokens with weights assigned to them. The Featured DOM Tree Generator produces a Featured DOM tree, whose data structure is easy to modify. The Featured Set and the Featured DOM Tree are then passed to the Comparator, which uses the threshold value and the Featured Set to identify relevant and related data. On the basis of these comparisons, the Noise Detector and Eliminator marks and eliminates noisy and irrelevant data items, thereby modifying the Featured DOM Tree. The Web Page Generator generates the clean web page from the modified Featured DOM tree.

V. ALGORITHM

Detecting and eliminating noise is the major challenge. The solution can be presented as a strategy devised to address the problems that appear throughout the process. [11] The steps are as follows:

A. Noise reduction ()

1. Start

2. Take web page as input.

3. Call feature() to obtain Featured Set 1, F_Set, using the featuring technique.

4. Call modelling() to obtain Featured Set 2, F_DOM, using the DOM tree.

5. Input F_Set and F_DOM for Pruning Stage.

6. Return

B. Featuring

1. Input: web page including noisy items

2. Apply the Pre-processing() method to the web page for feature set generation.

3. Apply weight_scheme to the tokens generated by pre-processing to create F_Seti.

4. Select the features whose score is above the threshold value.

5. F_Seti = {F_Set1, F_Set2, ...} is obtained with optimal features and is further used for noise detection and similarity verification.

C. Modelling

1. The HTML document / web page is modelled as a DOM tree.

2. A Featured DOM Tree is created using optimal feature selection for the individual leaf nodes of the DOM tree.

3. As a result, the featured sets F_DOMi = {F_DOM1, F_DOM2, ...} are obtained.

D. Pruning

1. Noise detection is performed on each F_DOMi based on similarity verification.

2. Minimum Weight Overlapping (MWO) is applied for similarity verification.

Feature set terms    F_Set1    F_Set2    Min (W)
x1                   W11       0         0
x2                   W21       W22       min(W21, W22)
x3                   0         W32       0
x4                   W41       W42       min(W41, W42)
Total                100       100       MWO = Min (W)

3. F_DOMi is compared with F_Seti using MWO; nodes for which certain features cross the predefined threshold value t are marked as noisy nodes in the tree.

4. Noisy blocks are removed from the DOM tree by a bottom-up traversal, in which a parent node is marked noisy, and hence removed, if all of its child nodes are marked noisy.

5. Finally, a cleaned web page is returned.

6. End.
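
The MWO computation in step D.2 can be sketched as follows. This reads the Total row of the table as the sum of the Min (W) column, which is an interpretation rather than something the text states. The function name and the numeric weights are hypothetical.

```python
def minimum_weight_overlap(f_set1, f_set2):
    """Sum of the per-term minimum weights; a missing term has weight 0."""
    terms = set(f_set1) | set(f_set2)
    return sum(min(f_set1.get(t, 0.0), f_set2.get(t, 0.0)) for t in terms)

# Hypothetical weights; each set sums to 100, as in the table.
f_set1 = {"x1": 30, "x2": 20, "x4": 50}
f_set2 = {"x2": 40, "x3": 35, "x4": 25}

# Only x2 and x4 occur in both sets: min(20, 40) + min(50, 25).
mwo = minimum_weight_overlap(f_set1, f_set2)
```

The resulting overlap score is the quantity compared against the threshold value t in step D.3.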

VI. CONCLUSION

Our project aims at creating an application framework for information retrieval in web data mining that presents a clean web page free of noise. The user is thus presented with a clean web page without advertisements, images, etc.

VII. FUTURE WORK

This paper proposes the task of detecting and eliminating local noise from a web page. The proposed three-phase algorithm aims at detecting noise and irrelevant and redundant data and removing them from web pages. Further work in this field will help achieve more efficient noise removal by directly detecting the informative content instead of detecting and eliminating noise. The technique can be incorporated into search engines for better indexing and web page ranking. Accuracy can be further improved by using more efficient methods for feature selection and feature set generation.

VIII. ACKNOWLEDGEMENT

We would like to express our sincere gratitude for the assistance and support of the many people who helped us. We are thankful to our internal guide, Prof. Renuka Puntambekar, Department of Computer Engineering, MIT Academy of Engineering, for the valuable guidance she has provided at various stages of the project work. She has been a source of motivation, enabling us to give our best efforts to this project. We are also grateful to Prof. Uma Nagaraj, Head of the Computer Department, MIT Academy of Engineering.

IX. REFERENCES

[1] Detecting Image Purpose in World-Wide Web Documents. S. Paek and J. R. Smith. January 1998.

[2] Discovering informative content blocks from Web documents. S. H. Lin and J. M. Ho. In Proceedings of SIGKDD-2002.

[3] Template Detection via Data Mining and its Applications. Z. Bar-Yossef and S. Rajagopalan. In Proceedings of the 11th International World-Wide Web Conference, 2002.

[4] Eliminating noisy information in web pages for data mining. L. Yi, B. Liu, and X. Li. In Proceedings of the International ACM Conference, 2003.

[5] Web Page Cleaning for Web Mining through Feature Weighting. L. Yi and B. Liu. In International Joint Conference on Artificial Intelligence, 2003.

[6] Entropy-Based Link Analysis for Mining Web Informative Structures. Hung-Yu Kao, Ming-Syan Chen, Shian-Hua Lin, and Jan-Ming Ho. In CIKM, 2002.

[7] Wisdom Web Intrapage Informative Structure Mining based on Document Object Model. H. Y. Kao, J. M. Ho, and M. S. Chen. In IEEE Trans KDD, 2005.

[8] VIPS: A Vision Based Page Segmentation Algorithm. Cai Deng, Yu Shipeng, and Wen Jirong. In Microsoft Technical Report, 2003.

[9] Noise Removing from Web Pages Using Neural Network. Thanda Htwe and Khin Haymar Saw Hla. In ICCAE, 2010.

[10] Study to Eliminating Noisy Information in Web Pages. Guohua Hu and Qingshan Zhao.

[11] Eliminating Noisy Information in Web Pages using featured DOM tree. Shine N. Das, Pramod K. Vijayaraghavan, and Midhun Mathew.