My Undergraduate Thesis (Graduation Project) Presentation

EVALUATION of DOM TREE SIMILARITIES

UNDERGRADUATE THESIS, JUNE 2015

TEOMAN TURAN, 040100014

SUPERVISOR INSTRUCTOR: ASST. PROF. DR. TOLGA OVATMAN

A Brief Introduction to the Issue

Today, the total number of web pages around the world is growing at a tremendous pace. It is common to encounter many web sites of the same sort serving the same purpose: bulletin boards (forums), video sharing sites, social networks, video game distribution platforms, shopping sites, broadcasting sites, news portals, etc.

From http://www.internetlivestats.com/total-number-of-websites/

• How can the design needs of these millions of web pages be met?

• It is impossible to provide every web site with a design unlike all the others.

• «Templates» can be considered a solution to this design issue. A template (in other words, a skeleton or schema) can serve as the basis for the design of thousands of web sites, each one modified by its designers.

An example template:

Issue: Evaluation of Web Page Similarities

• Web page designs that resemble one another, as happens with «templates», raise a specific problem: the evaluation of web page similarities.

• A similarity ratio with respect to some criteria: how similar are the designs of two web pages to each other?

A Brief Theoretical and Basic Information: HTML

• HyperText Markup Language, commonly called «HTML», is a markup language used to build web pages.

• Once a text file containing HTML code is saved with the extension .html or .htm, it becomes the source of a new web page.

[Figure: HTML source code and its output as interpreted by a web browser]

• Three major components that form the syntax of HTML: element, attribute, text

[Figure: an HTML snippet with its element, attribute, and text labeled]

• Document Object Model, commonly called «DOM»

• An interface for accessing and updating the components of markup code that has a nested structure

• Provides a structural representation for HTML, XML and SVG documents

• Besides dedicated libraries such as dom4j in Java, JavaScript also provides built-in access to the HTML DOM, which is the DOM examined in this thesis project.

A Brief Theoretical and Basic Information: DOM

• The major components of HTML correspond to the major DOM objects: element objects, attribute objects, text objects

• The nesting order of the components in HTML, XML or SVG code (that is, of the DOM objects) forms a tree called the «DOM Tree».

• A software solution using the DOM can traverse a DOM tree extracted from HTML, XML or SVG code.

A Brief Theoretical and Basic Information: DOM

<element1>
    <element2>
        <element3>
        </element3>
        <element4>
        </element4>
    </element2>
</element1>

DOM Tree Example

• Element nodes representing elements in an HTML code

• Attribute nodes representing attributes in an HTML code

• Text nodes representing texts in an HTML code

• The HTML document itself forms the «document node», which is the root node of the tree.

• Comments also form «comment nodes» in the relevant DOM tree.
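The node types listed above can be observed with a short traversal. The following is only a sketch: it uses the DOM parser that ships with the JDK (javax.xml.parsers and org.w3c.dom) rather than dom4j, it assumes the input is well-formed markup, and the class name DomNodeKinds and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the thesis application.
final class DomNodeKinds {

    /** Parses well-formed markup and lists its nodes in document order. */
    static List<String> describe(String markup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            List<String> out = new ArrayList<>();
            walk(doc, out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("markup could not be parsed", e);
        }
    }

    private static void walk(Node node, List<String> out) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:
                out.add("document");
                break;
            case Node.ELEMENT_NODE:
                out.add("element:" + node.getNodeName());
                break;
            case Node.TEXT_NODE:
                // Whitespace-only text is skipped (a choice made for this sketch).
                if (!node.getNodeValue().trim().isEmpty()) {
                    out.add("text");
                }
                break;
            case Node.COMMENT_NODE:
                out.add("comment");
                break;
            default:
                break;
        }
        // In this API attributes are not child nodes; they hang off their element.
        NamedNodeMap attributes = node.getAttributes(); // null for non-element nodes
        if (attributes != null) {
            for (int i = 0; i < attributes.getLength(); i++) {
                out.add("attribute:" + attributes.item(i).getNodeName());
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, out);
        }
    }

    public static void main(String[] args) {
        String markup = "<html><body><!-- note --><p class=\"intro\">Hello</p></body></html>";
        describe(markup).forEach(System.out::println);
    }
}
```

Real-world HTML is often not well-formed XML, so a tolerant HTML parser would be needed in practice; the sketch only shows how the node kinds relate to each other.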

The Major Components of an HTML DOM Tree

• The similarity between the designs of two web pages comes down to the similarity between their HTML files.

• The similarity between these HTML files, in turn, comes down to the similarity between their DOM trees.

• Thus, evaluating DOM tree similarities is a solution to the main problem of web page design similarities.

The Main Problem: Similarity of DOM Trees

• To develop an algorithm that measures the level of similarity between two DOM trees extracted from two HTML files

• Hence, the main objective: to develop a system that measures the similarity between two web pages with respect to their designs

The Objective of the Project

• 1 – Parse the two HTML files loaded into the system.

• 2 – For each file, extract the DOM objects together with their relatives, that is, extract the DOM tree formed by the code in the file.

• 3 – Develop an algorithm to compare these DOM trees and to calculate the level of similarity between them. (This is the core of the project.)

• 4 – Develop a graphical user interface (GUI) as a simple application.

Major Steps for the Development

• The project has been developed in the Java programming language, using Eclipse IDE for Java EE Developers.

• The output of the project: «DOM Similarity Evaluator»

• A Java application with a simple GUI

• Can be launched directly from its JAR file

• It is open source, and cross-platform owing to being a Java application.

«DOM Similarity Evaluator» Application

• An easy-to-use Java library for working with HTML, XML (along with XPath and similar languages), SVG, etc.

• The library parses such a file and builds the DOM tree formed by the code in the file, which can then be traversed.

• The built-in data types of the library correspond to the node types in the DOM tree, such as the element nodes of an HTML DOM tree extracted from a parsed HTML file.

• The methods of the library give access to the parent and the children of a node, if they exist.

dom4j Library of Java

• Forms the core of the project

• The key value processed through the algorithm: the frequencies of distinct elements

• The «frequency» of a distinct element is the number of instances of that element in the DOM tree extracted from the relevant HTML file.

• For example, suppose that iterating over a DOM tree gives the following sequence of elements:

html head title body h1 p p p ul li li li p ul li li a li button button

Here, the «distinct element-frequency» pairs are as follows:

html-1 head-1 title-1 body-1 h1-1 p-4 ul-2 li-6 a-1 button-2
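The frequency count above can be reproduced with a few lines of Java. This is a minimal sketch, assuming the element names have already been collected into an array during the tree traversal; the class name ElementFrequencies is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the thesis application.
final class ElementFrequencies {

    /** Counts how often each distinct element name occurs, in first-seen order. */
    static Map<String, Integer> count(String[] elementNames) {
        Map<String, Integer> frequencies = new LinkedHashMap<>();
        for (String name : elementNames) {
            // Start at 1 for a new name, otherwise add 1 to the stored count.
            frequencies.merge(name, 1, Integer::sum);
        }
        return frequencies;
    }

    public static void main(String[] args) {
        String sequence = "html head title body h1 p p p ul li li li p ul li li a li button button";
        System.out.println(count(sequence.split(" ")));
    }
}
```

For the sequence from the slide, the resulting map holds the same ten pairs that are listed above (for instance p with 4 and li with 6).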

About the Similarity Evaluation Algorithm

• The algorithm compares the nodes of two DOM trees, then calculates three sorts of similarity ratios, with respect to element nodes, attribute nodes, and text nodes respectively.

• There is also a fourth similarity ratio, called «overall similarity», calculated with the formula below. This is the main ratio used to evaluate the design similarity between two web pages.

overall = (element * 60%) + (attribute * 30%) + (text * 10%)

Here, the percentages have been assigned according to how strongly each node type influences the design of a web page.
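As a small illustration of the weighted formula, the sketch below computes the overall similarity from three ratios given as percentages. The class and constant names are hypothetical, and the attribute and text values used in main are invented inputs, not results from the thesis.

```java
// Hypothetical helper, not part of the thesis application.
final class OverallSimilarity {

    // Weights taken from the formula on the slide.
    static final double ELEMENT_WEIGHT = 0.60;
    static final double ATTRIBUTE_WEIGHT = 0.30;
    static final double TEXT_WEIGHT = 0.10;

    /** Combines three similarity percentages into the overall percentage. */
    static double overall(double element, double attribute, double text) {
        return element * ELEMENT_WEIGHT + attribute * ATTRIBUTE_WEIGHT + text * TEXT_WEIGHT;
    }

    public static void main(String[] args) {
        // 40.0 and 80.0 are made-up attribute and text similarities.
        System.out.printf("overall = %.1f%%%n", overall(47.0, 40.0, 80.0));
    }
}
```

Because the three weights add up to 100%, the overall value stays between 0 and 100 whenever the three inputs do.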

Sorts of Similarity Ratios Calculated

• 1 – Extract all distinct elements with their frequencies for both DOM trees. For example, suppose the following two «element-frequency» collections are obtained for the two trees.

Tree 1: html-1 head-1 title-1 body-1 h1-1 p-4 ul-2 li-6 a-1 button-2

Tree 2: html-1 head-1 title-1 body-1 h2-1 p-6 ul-3 li-12 a-3 img-2 table-1 td-2 tr-2

How to Calculate the Element Node Similarity

• 2 – Find the elements that exist in both trees. In the previous slide, the following elements are common: html, head, title, body, p, ul, li, a.

• 3 – For each common element, take the lower of its two frequencies and push it to a special frequency list. (If the frequencies are equal, take that value directly.) In the current example, apart from the common elements with the same frequency in both trees,

p, ul, li, and a have lower frequencies in the first tree.

Special frequency list: 1, 1, 1, 1, 4, 2, 6, 1

How to Calculate the Element Node Similarity

• 4 – Sum up the frequencies in the list, then divide the total by the number of all element nodes in whichever tree contains more element nodes. Finally, multiply the result by 100 to obtain the percentage.

For the example being studied, the total is 17. The second tree has more element nodes than the first one: 36.

17/36 ≈ 0.47

0.47 * 100 = 47.0% (the element node similarity between Tree 1 and Tree 2)
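Steps 1 to 4 can be sketched in Java as follows. This is an illustration of the described procedure, not the code of the application: the class name, the parse helper for the "name-frequency" notation, and the choice to return 0 when both trees are empty are all assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the thesis application.
final class ElementNodeSimilarity {

    /** Steps 2 and 3: sums the lower frequency of every element found in both trees. */
    static int sharedCount(Map<String, Integer> a, Map<String, Integer> b) {
        int shared = 0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            Integer other = b.get(entry.getKey());
            if (other != null) {
                shared += Math.min(entry.getValue(), other);
            }
        }
        return shared;
    }

    /** Total number of element nodes represented by a frequency collection. */
    static int total(Map<String, Integer> frequencies) {
        int sum = 0;
        for (int frequency : frequencies.values()) {
            sum += frequency;
        }
        return sum;
    }

    /** Step 4: shared count divided by the size of the larger tree, as a percentage. */
    static double similarityPercent(Map<String, Integer> a, Map<String, Integer> b) {
        int larger = Math.max(total(a), total(b));
        // Returning 0 for two empty trees is an assumption of this sketch.
        return larger == 0 ? 0.0 : 100.0 * sharedCount(a, b) / larger;
    }

    /** Reads the "name-frequency" notation used on the slides, e.g. "p-4 ul-2". */
    static Map<String, Integer> parse(String pairs) {
        Map<String, Integer> frequencies = new LinkedHashMap<>();
        for (String pair : pairs.trim().split("\\s+")) {
            int dash = pair.lastIndexOf('-');
            frequencies.put(pair.substring(0, dash), Integer.parseInt(pair.substring(dash + 1)));
        }
        return frequencies;
    }

    public static void main(String[] args) {
        Map<String, Integer> tree1 = parse("html-1 head-1 title-1 body-1 h1-1 p-4 ul-2 li-6 a-1 button-2");
        Map<String, Integer> tree2 = parse("html-1 head-1 title-1 body-1 h2-1 p-6 ul-3 li-12 a-3 img-2 table-1 td-2 tr-2");
        System.out.printf("shared = %d, larger tree = %d, similarity = %.1f%%%n",
                sharedCount(tree1, tree2), Math.max(total(tree1), total(tree2)),
                similarityPercent(tree1, tree2));
    }
}
```

For Tree 1 and Tree 2 the sketch arrives at the same 17 and 36. The slide rounds the quotient to 0.47 before multiplying, which gives 47.0%; without that intermediate rounding the value is about 47.2%.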

How to Calculate the Element Node Similarity

• For the calculation of attribute node similarity, the parents of the attribute nodes, that is, the element nodes they are attached to, are considered along with the attribute nodes themselves.

• The method used for the element node similarity is applied to the attribute nodes themselves. The same method is also applied to these nodes’ parents, in other words, the element nodes that own these attribute nodes. As a result, two ratios are obtained.

• The average of these two values is the final similarity ratio between two HTML files with respect to attribute nodes. To express it as a percentage, it is multiplied by 100.
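One possible reading of this calculation is sketched below. It assumes that attribute nodes are counted by attribute name and their parents by element name, and that each ratio uses the same "lower frequency over larger total" rule as the element nodes; the slides do not spell out these details, so the class, its method names, and the sample data are all hypothetical.

```java
import java.util.Map;

// Hypothetical helper, not part of the thesis application.
final class AttributeNodeSimilarity {

    /** Lower-frequency overlap divided by the larger total, as a fraction from 0 to 1. */
    static double ratio(Map<String, Integer> a, Map<String, Integer> b) {
        int shared = 0;
        int totalA = 0;
        int totalB = 0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            totalA += entry.getValue();
            // Names missing from b contribute nothing to the shared count.
            shared += Math.min(entry.getValue(), b.getOrDefault(entry.getKey(), 0));
        }
        for (int frequency : b.values()) {
            totalB += frequency;
        }
        int larger = Math.max(totalA, totalB);
        return larger == 0 ? 0.0 : (double) shared / larger;
    }

    /** Averages the attribute ratio and the parent-element ratio, as a percentage. */
    static double attributeSimilarityPercent(
            Map<String, Integer> attributes1, Map<String, Integer> attributes2,
            Map<String, Integer> parents1, Map<String, Integer> parents2) {
        return 100.0 * (ratio(attributes1, attributes2) + ratio(parents1, parents2)) / 2.0;
    }

    public static void main(String[] args) {
        // Invented frequencies: attribute names, then the elements that own attributes.
        double percent = attributeSimilarityPercent(
                Map.of("class", 3, "href", 1), Map.of("class", 2, "id", 2),
                Map.of("p", 2, "a", 2), Map.of("p", 1, "div", 3));
        System.out.printf("attribute node similarity = %.1f%%%n", percent);
    }
}
```

With the invented data, the attribute ratio is 2/4 and the parent ratio is 1/4, so the average is 37.5%.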

How to Calculate the Attribute Node Similarity

• Here, the parents of the text nodes, in other words, the element nodes that own text nodes as their children, play the main role.

• The method used for the element node similarity is applied here as well, taking the parent element nodes of the text nodes into consideration.
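Collecting the parents of text nodes could look like the sketch below. It again relies on the JDK's own DOM parser instead of dom4j and on well-formed input; counting a parent once per non-blank text node and skipping whitespace-only text are assumptions, and the class name is hypothetical.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the thesis application.
final class TextParentFrequencies {

    /** Counts, per element name, how many non-blank text nodes that element owns. */
    static Map<String, Integer> count(String markup) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            Map<String, Integer> frequencies = new TreeMap<>();
            collect(root, frequencies);
            return frequencies;
        } catch (Exception e) {
            throw new IllegalArgumentException("markup could not be parsed", e);
        }
    }

    private static void collect(Node node, Map<String, Integer> frequencies) {
        // Only the parent's name is recorded; the text value itself is ignored.
        if (node.getNodeType() == Node.TEXT_NODE && !node.getNodeValue().trim().isEmpty()) {
            frequencies.merge(node.getParentNode().getNodeName(), 1, Integer::sum);
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, frequencies);
        }
    }

    public static void main(String[] args) {
        String markup = "<html><body><h1>Title</h1><p>One</p>"
                + "<p>Two <a href=\"#\">link</a></p><ul><li/></ul></body></html>";
        System.out.println(count(markup));
    }
}
```

The resulting frequency collection can then be compared between two trees with the same rule that is used for element nodes.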

How to Calculate the Text Node Similarity

• The system deals with the design of two web pages.

• Comment lines in HTML code are not reflected on the output page, so they are not taken into consideration.

• Since only the schema (skeleton, structure, construction) of the DOM trees is considered, the values that attribute and text nodes take do not play a role here.

• «Node connections» (and, of course, the existence and number of nodes) play the basic role in this system.

What About Comment Nodes, Attribute Values and Text Values?

Thank you for listening!

(Here is also a short demonstration for the application…)
