Upload
truongdat
View
229
Download
3
Embed Size (px)
Citation preview
Chapter 3
Comparative Study and Feasibility Analysis of Existing Systems and
Customization of Flash Web pages into HTML Structures
3.1 Preamble
In the previous chapter, we have discussed enhancement techniques to create
datasets and analyzed the issues encountered while dataset creation. In this
chapter, we describe about the working procedure of existing systems such as
VIPS (Vision Based Web page Segmentation), Boilerpipe, Gestalt Theory etc., on
HTML Web page structures. Further, we discuss the working procedure of
existing systems on our XML and Flash datasets and we discuss about
performance of each system. After realizing that the existing systems are not
viable for XML and Flash kind of Web structures, we describe their semantic
Some parts of the material in this chapter have appeared in the following research article (a) Krishna Murthy. A and Suresha, “Comparative Study on Browsing on SSD's”, in International Journal of Machine Intelligence (IJMI), Volume 3, pp 354-358, 2011.
(b) Krishna Murthy. A, Suresha and Anil Kumar K M. “Challenges and Issues in Adapting Web contents on Small Screen Devices”, in International Journal of Information Processing (IJIP), vol. 8 issue 1 of IJIP, Banglore-2013.
(c) Krishna Murthy. A, Suresha, Anil Kumar K M. “Analysis and Issues of Adapting Web pages on Small Screen Devices”, in International Conference on Information Processing (ICIP) page 75-85, Elsevier
publications, Bangalore - 2013.
Comparative Study and Feasibility Analysis of Existing Systems and Customization of Flash Web pages into HTML Structures
60
structure analysis. In addition to this, we present a novel scheme to customize
the Flash Web pages using Tansym Optical Character Recognizer (T-OCR) tool
by applying the Internet information Service (IIS) server concepts. The results
obtained are tested on various small screen devices. Further, the accessing time
of customized Web page results are compared with accessing time of original
Web pages using time analysis factor tools.
The organization of the chapter is as follows: after giving the overview in Section
3.1, we present the comparative performance analysis of existing systems in
Section 3.2. In Section 3.3, we present the analysis of working methods of VIPS
and Boilerpipe system on our datasets. In Section 3.4, we propose a novel
algorithm to customize the Flash Web page contents and finally the important
results are discussed in Section 3.5. We present the summary of this chapter in
Section 3.6
3.2 Comparative Analysis of Existing Systems
Web contents are currently designed for the Large Screen Devices (LSD) and rich
memory resources. LSD users can use convenient input devices such as a mouse,
keyboard to retrieve any Web page from any Website. Downloading time is
rarely a problem as the Personal Computers (PCs) are usually connected to the
Internet through high capacity lines and the large screen allows much irrelevant
(noise) information such as advertisements to be placed on the screen without
distracting the user. At present, experiencing the internet on Small Screen
terminals such as Mobile, Personal Digital Assistants (PDA) etc., is becoming
very popular. The current Web pages are intended for LSDs are not suitable for
Small Screen Devices (SSDs).
The straight forward solution for browsing on SSDs is to re-design the Web
pages using specific markup languages such as WML, XHTML. Some notable
sites, which have already implemented this include Yahoo, CNN, and Google
among others. Nonetheless, the vast majority of sites on the Web have not
Comparative Study and Feasibility Analysis of Existing Systems and Customization of Flash Web pages into HTML Structures
61
customized Web pages for SSDs, because it is a time consuming process and not
economical [32]. Compared to LSDs, SSDs are not ideal platforms for surfing the
Web. Because in SSDs, wireless bandwidth is quite limited, it is very expensive
and screen size varies for different devices such as Mobiles, PDA’s etc., and
devices such as Mobile phones have limited memory capabilities. Normally, the
content of a single Web page will be larger than what a Mobile phone can hold.
Therefore, methods to reconstruct LSDs optimized Web pages for Small Screen
terminals are essential.
In this section, the experimental results of Web page Segmentation, Noise
Removal and re-arranging retrieved information to fit on SSDs are discussed in
detail. Table 3.1, [33] gives the performance comparison of query expansion
using different page segmentation methods. The average retrieval precision can
be improved after partitioning pages into blocks, no matter which segmentation
algorithm is used. In the case of FULLDOC, the maximal average precision is
19.10% when the top 10 documents are used to expand the query. DOMPS
obtains 19.67% when the top 50 blocks are used, a little more than FULLDOC.
VIPS gets the best result 20.98% when the top 20 blocks are used and achieves
26.77% improvement [33]. It achieves best precision when no. of segments less
than 30. Figure 3.1 and Figure 3.2(a) show the processing time of Gestalt Theory
method [34]. About 87% of pages are processed in less than 2 seconds. The time
is almost in proportion to the size of the layout tree, which is about one third of
the size of the DOM tree. Author has used recall to evaluate the performance of
Gestalt method with VIPS [33] and PAV [18]. Here recall is the fraction of
correctly recognized blocks over the standard blocks marked manually. N is the
number of result blocks. As N grows, a page is broken into more and more
smaller blocks. It is observed that Gestalt method always outperforms than PAV.
But VIPS achieves the best results Figure 3.2(b) when N is small because
proximity plays the key role at that time [33].
Comparative Study and Feasibility Analysis of Existing Systems and Customization of Flash Web pages into HTML Structures
62
Three learning methods are used [38] to learn the models such as SVM, non-
linear SVM with RBF kernel and a RBF network. The performances obtained by
these methods are reported in Table 2. SVM with RBF kernel achieved the best
performance with Micro-F1 80.2% and Micro-Acc 86.8%. The linear SVM
performed worse than both SVM with RBF kernel and RBF network.
The results indicate that a nonlinear combination of the features is better than a
linear combination. In WPC [39] experiment, Naïve Baye’s text classification is
applied on three different data sets, cleaned using three approaches of Not
Cleaned (NC), Template (TPL) [29] and Web Page Cleaner (WPC) to check the
performance. Table 3.3 shows average classification (Naïve Baye’s text
classification) accuracy and standard error on four-fold cross validation for
method NC (79.68, 4.07), for method TPL [29] (96.23, 1.18) and for method WPC
(98.12, 0.29).
Table 3.1: Performance comparison using different page segmentation methods [33]
Number of
Segments
Baseline
(%)
FULL DOC (%) DOMPS
(%)
VIPS (%)
3
16.55
17.56 (+6.10)
17.94 (+8.40)
18.01 (+8.82)
5 17.46 (+5.50)
18.15 (+9.67)
19.39 (+17.16)
10 19.10 (+15.41)
18.05 (+9.06)
19.92 (+20.36)
20 17.89 (+8.10)
19.24 (+16.25)
20.98 (+26.77)
30 17.40 (+5.14)
19.32 (+16.74)
19.68 (+18.91)
40 15.50 (6.34)
19.57 (+18.25)
17.24 (+4.17)
50 13.82 (-16.50)
19.67 (+18.85)
16.63 (+0.48)
60 14.40 (-12.99)
18.58 (+12.27)
16.37 (-1.09)
Comparative Study and Feasibility Analysis of Existing Systems and Customization of Flash Web pages into HTML Structures
63
Table 3.2: Comparison of learning methods [38]
Methods Level 1 Level 2 Level 3 Micro-F1 Micro-Acc
SVM
(RBF)
0.787 (P)
0.813 (R)
0.807 (P)
0.808 R)
0.837 (P)
0.754 (R)
0.802 0.868
SVM
linear
0.691 (P)
0.740 (R)
0.745 (P)
0.737 (R)
0.823 (P)
0.675 (R)
0.731 0.821
RBF
Network
0.727 (P)
0.717(R)
0.752 (P)
0.755(R)
0.777 (P)
0.799 (R)
0.746
0.830
Table 3.3: WPC average accuracy and standard error [39]
Test Case
Train Test Methods Avg Accuracy (%)
Standard Error
1-1 25
(5 per class)
2475 NC 79.41 2013
TPL 88.63 1.41
WPC 91.10 0.69
1-2 50 (10 per class)
2450 NC 90.52 1.01
TPL 92.42 0.96
WPC 95.44 0.40
1-3 75
(15 per class)
2425 NC 95.44 0.42
TPL 94.19 0.70
WPC 97.05 0.20
1-4 100
(20 per class)
2400 NC 94.89 0.43
TPL 94.45 0.33
WPC 97.11 0.20
1-5 250
(50 per class)
2250 NC 97.40 0.37
TPL 97.33 0.21
WPC 98.64 0.12
1-6 500
(100 per class)
2000 NC 97.97 0.27
TPL 98.09 0.09
WPC 99.00 0.06
Comparative Study and Feasibility Analysis of Existing Systems and Customization of Flash Web pages into HTML Structures
64
Figure 3.1: Average precision vs. Number of segments [33]
The result [42] in Figure 3.3(a) indicates that, for “Bottom” target content
elements, the system is four times more efficient than the Google Wireless Trans-
coder, and twice as efficient for “Middle” elements. These results prove that the
proposed method significantly improves the usability of Web browsing on the
mobile phone.
Segmentation results of E-Gestalt are compared [43] with VIPS [33] and Gestalt
[21]. As illustrated in Figure 3.3(b), E-Gestalt produces the most PERFECT blocks
and the least ERROR blocks.
The count of NOT-BAD blocks is much higher in other two methods, mainly
because: large blocks tend to be split into sub-blocks much smaller than screen
size by VIPS and Gestalt. This reversely demonstrates the effectiveness of the
fine-tuned dividing process for generating mobile-fitted blocks.
Comparative Study and Feasibility Analysis of Existing Systems and Customization of Flash Web pages into HTML Structures
65
Figure 3.2(a): Time taken for page segmentation by GESTALT [21]
Figure 3.2(b): Recall of Gestalt, PAV and VIPS [21]
Comparative Study and Feasibility Analysis of Existing Systems and Customization of Flash Web pages into HTML Structures
66
Figure 3.3(a): Usability Evaluation Results [42]
Figure 3.3(b): Block content of E-Gestalt, Gestalt and VIPS [43]
Comparative Study and Feasibility Analysis of Existing Systems and Customization of Flash Web pages into HTML Structures
67
3.3 Working with Existing Systems
In this section, we describe the issues on adaptation of the existing systems on
SSDs. And in the next section, we propose a method to translate Flash Web pages
to be made browsable on SSDs after analyzing the semantic structure.
Normally, the content of a single Web page will be much larger than contents
that can be accommodated in a mobile screen. Therefore, methods to translate
LSDs optimized Web pages for Small Screen terminals are essential. Several
attempts have been made to enable Web pages to be browsable on SSDs. All the
methods were designed based on HTML tags as there is a vast availability of
HTML pages on Web [26][29].
In order to study the feasibility of the existing approaches on Web pages created
using XML and Flash, we began the process of creating data set for XML and
Flash Web page, since there are no such data sets. To check the feasibility of the
existing systems, we set up few existing approaches which predominantly
dealing with the Web page segmentation to display the Web contents on SSDs.
For this, we have considered two such popular approaches namely, VIPS [33]
and Boilerpipe [27] systems on our Datasets.
3.3.1 Testing VIPS system on our Data sets
VIPS is a Web content structure analyzer/Web page analyzer based on visual
representation algorithm. Many Web applications such as information retrieval,
information extraction and automatic page adaptation can benefit from this
structure [33]. It discusses about an automatic top-down, tag-tree independent
approach to detect Web content structure and simulates user understanding of
Web layout structure based on visual perception. After setting-up VIPS system,
we tested system feasibility on our dataset’s, but the pages does not get
segmented. Figure 3.4 depicts successful segmentation of normal HTML Web
page (ex: http://www.uni-mysore.ac.in) after applying PDoC value (Permitted
Comparative Study and Feasibility Analysis of Existing Systems and Customization of Flash Web pages into HTML Structures
68
Degree of Coherence – which controls the partition granularity); here the left part
shows the segmented blocks (different visual blocks). The block marked with red
rectangle is the selected content structure VB1-1-3 (Figure 3.4).
Figure 3.4: VIPS page analyzer on HTML Web page
Figure 3.5: VIPS page analyzer on Flash Web page
The upper right part shows the PDoC value and vision based content structure
(hierarchical structure of different blocks – VIPS tree) of Web page, bottom right
Comparative Study and Feasibility Analysis of Existing Systems and Customization of Flash Web pages into HTML Structures
69
part gives the features of selected content structure. In different applications, we
can control the partition granularity by setting PDoC value, whereas VIPS
system fails to segment Flash Web pages (ex: http://www.isim.ac.in/mlw) after
entering PDoC value. Figure 3.5 depicts failure of segmentation of Flash Web
pages after applying PDoC value.
It segments entire Web page as a single block, but it fails to segment as
individual blocks. The reason behind this is Flash Web pages are semantically
different from HTML pages. VIPS is developed based on pre-defined HTML
tags.
Figure 3.6: VIPS page analyzer on XML Web page
Similarly, we have tested on XML Dataset (ex:
http://shrubbery.mynetgear.net/c/disply/ W/Web.xml). VIPS system fails to
display the Web contents in display region. Figure 3.6 depicts the pictorial view
of XML Web pages on VIPS system. The reason for failure is that the semantic
structure of XML Web pages is different from HTML Web pages. Hence, we have
concluded that the existing VIPS model fails to segment the Flash Webpage.
Comparative Study and Feasibility Analysis of Existing Systems and Customization of Flash Web pages into HTML Structures
70
Table 3.4: Feasibility analysis of VIPS system on our Data sets
Figure 3.7: Feasibility analysis on VIPS systems on HTML, Flash and XML Web pages
Comparative Study and Feasibility Analysis of Existing Systems and Customization of Flash Web pages into HTML Structures
71
Table 3.4 and Figure 3.7 depict the feasibility analysis of VIPS systems on various
kinds of Web pages (HTML, Flash and XML respectively).
3.3.2 Testing Boilerpipe system on our Data sets
Boilerpipe provides algorithms to segment and remove the surplus “clutter”
(boilerplate, templates) around the main textual content of a Web page [27]. We
have tested the Boilerpipe model on our datasets, for all cross combinations of
extractor and output mode.
Figure 3.8: Boiler Pipe page analysis on HTML Web page
Figure 3.9: Boiler Pipe page analysis on Flash Web page
Comparative Study and Feasibility Analysis of Existing Systems and Customization of Flash Web pages into HTML Structures
72
They are employed to segment and retrieve informative contents from Web
pages. The model works fine for HTML Web pages, but fails to segment and
extract informative contents from Flash and XML Web pages. The model shows
the result as blank screen or returns ‘code error’ message. Figure 3.8 depicts
successful segmentation and information extraction of HTML Web page.
Figure 3.9 depicts the failure of segmentation and information extraction of flash
Web pages. Figure 3.10 depicts the failure on XML Web pages after applying
Extractor and Output mode options of Boilerpipe model.
Figure 3.10: Boiler Pipe page analysis on XML Web page
From this study, we have realized that the Boilerpipe model also fails to segment
and extract informative contents from Flash and XML Web pages. Therefore,
from the analysis of the existing systems, we have shown that these systems i.e.,
segmentation, noise removal and mobile browsing work well on Web pages
using pre-defined HTML tags.
Because of this failure, informative contents from Flash and XML Web pages face
a few limitations in real time applications such as inability to display these
contents on SSDs, inability to share contents on social Web sites, inability to
index these Web pages by Web search engines resulting in lack of its use in
Comparative Study and Feasibility Analysis of Existing Systems and Customization of Flash Web pages into HTML Structures
73
information retrieval, information extraction etc., and citation rate of these Web
pages are also reduced due to afore mentioned limitations.
In order to address these limitations, we worked on semantic structure of XML
and Flash Web pages to explore the segmentation technique. After analyzing the
structure of XML and Flash, we proposed a system to translate Flash Web pages
into HTML Structure format which is suitable and viable on SSDs. We have
analyzed the browsing performance using two metrics namely content coverage
and response time on different mobile phones.
3.4 Translation of Flash into HTML Structure
In this approach, after downloading the Flash Web pages in the format of .mht
(MHTML), each tab page is converted into images using snipping tool (this is an
inbuilt tool in windows 7 OS). The images are saved in bitmap format and text
contents of each image extracted by using the Transym Optical Character
Recognition (TOCR) tool. Then the corresponding HTML Web pages are
constructed similar to original Flash Web pages. Figure 3.11 shows the
architecture of proposed system.
Figure 3.11: Architecture of Proposed System
After the translation of HTML pages, it is hosted using reverse proxy concepts
with the help of Internet Information Service (IIS) Web server. A very common
Comparative Study and Feasibility Analysis of Existing Systems and Customization of Flash Web pages into HTML Structures
74
reverse proxy scenario is to make available several internal Web applications
over the Internet. An Internet-accessible Web server is used as a reverse-proxy
server that receives Web requests and then forwards them to several intranet
applications for processing.
We set up a Reverse Proxy server using the Application Request Routing (ARR)
and the URL Rewrite Module 2.0. After setting-up the reverse proxy concepts,
translated Web domains are hosted by assigning an individual port number to
respective domain names for easier access. Testing was done on different SSDs
based on content coverage and response time.
3.5 Experimental Results
Web pages are hosted after translation into HTML format (traditional Web
pages). The results have been analyzed based on accessing the Conventional
Flash Web pages (C-Web pages) as well as Traditional Web pages (T-Web pages).
Here, system performance analyzed by comparing the time factors of
downloading time on SSDs (based on system specification) and content
coverage.
We have performed the downloading time analysis on three different type of
hand held devices such as Sony Xperia X10, Sony Ericsson WT13i and LG
Optimus Net based on Wi-Fi connection. We obtain a better performance in
accessing T-Web pages compared to its corresponding C-Web pages. Table 3.5
and Table 3.6 provide results of response time and content coverage respectively
on our approach on Flash data set.
Similar processes were carried out on LG-Optimus hand held device. Two Web
sites namely www.noleath.com and www.marvismint.com are get downloaded
completely. It was also observed that very rarely Home Page (HP) of a few sites
were getting downloaded and the rest required Flash Player (FP) to display the
contents. In our approach, T-Web pages were displayed between 2.61 to 6.21
Comparative Study and Feasibility Analysis of Existing Systems and Customization of Flash Web pages into HTML Structures
75
seconds as shown in Table 3.5. Figures 3.12, 3.13 and 3.14 are depicts pictorial
view of difference between response time of C and T Web pages (Sony Xperia
X10, Sony Ericsson WT13i and LG Optimus Net respectively).
Table 3.5: Response Time analysis on various SSDs
Sl.No
Websites
Sony Xperia X10
Sony Ericsson WT13i
LG Optimus Net
T-Web pages
(Sec)
C-Web pages
(Sec)
T-Web pages
(Sec)
C-Web pages
(Sec)
T-Web pages
(Sec)
C-Web pages
(Sec)
1 www.isim.ac.in/mlw 1.06 32.47 14.59 ND 6.21 FP
2 www.comunicacion.com
1.3 30.77 59.56 HP+FP
2.95 HP+FP
3 www.urbansurvivours.org
1.22 170 7.57 ND 3.14 FP
4 www.asual.com/swfa
ddress/ samples/flash
1.28 27.87 4.77 JS+FP 3.32 HP+FP
5 www.noleath.com 2 100 5.47 ND 2.61 23.12
6 www.sensisoft.com 1.28 389.82 5.96 FP 3.24 FP
7 www.marvismint.com 1.27 107.88 5.96 68.58 2.95 4.71
Figure 3.12: Performance Analysis on SonyXPeria10
0
100
200
300
400
1 2 3 4 5 6 7
Tim
e(S
eco
nd
s)
Flash Websites
HTTPS
HTTP
Comparative Study and Feasibility Analysis of Existing Systems and Customization of Flash Web pages into HTML Structures
76
Figure 3.13: Performance Analysis on Sony Ericson WT13i
Figure 3.14: Performance Analysis on LG Optimus Net
After experimental analysis, we have performed the content coverage analysis
based on human perception as against the system view. The algorithm adopts
kappa[23] statistics to quantitatively measure the degree of the agreement as to
how Web pages share the similarity terms. Kappa result ranges from 0 to 1. The
higher the value of Kappa, the stronger the similarity. Kappa more than 0.7
typically indicates that similarity of two Web pages is strong. Kappa values
greater than 0.9 are considered excellent. Boolean values are used to represent
the content coverage (0100% content loss, 0.550% content loss, 10% content
0
20
40
60
80
1 2 3 4 5 6 7
Tim
e (S
eco
nd
s)
Flash Websites
HTTP
HTTPS
0
10
20
30
1 2 3 4 5 6 7
Tim
e(S
eco
nd
s)
Flash Websites
HTTPS
HTTP
Comparative Study and Feasibility Analysis of Existing Systems and Customization of Flash Web pages into HTML Structures
77
loss). Table 3.6 depicts the clear view of kappa value relation between human
perception (Human View- HV) and system view (SV).
Pr(a) Pr(e)K =
1 Pr(e)
Where Pr(a) is the relative observed agreement among raters, and Pr(e) is the
hypothetical probability of chance agreement, using the observed data to
calculate the probabilities of each observer randomly saying each category. If the
raters are in complete agreement then k = 1 (If there is no agreement among the
raters other than what would be expected by chance as defined by Pr(e)), κ = 0).
Table 3.6: Content coverage analysis on various SSDs
S.No
Websites
Sony Xperia X10
Sony Ericsson WT13i
LG Optimus Net
T-Web pages (HV)
C-Web pages (SV)
T-Web pages (HV)
C-Web pages (SV)
T-Web pages (HV)
C-Web pages (SV)
1 www.isim.ac.in/mlw 1 1 1 0 1 0.1
2 www.comunicacion.com
1 1 1 0.2 1 0.2
3 www.urbansurvivours.org
1 1 1 0 1 0.1
4 www.asual.com/swfad
dress/ samples/flash
1 1 1 0.2 1 0.2
5 www.noleath.com 1 1 1 0 1 1
6 www.sensisoft.com 1 1 1 0.2 1 0.2
7 www.marvismint.com 1 1 1 1 1 1
3.6 Summary
The results obtained by the existing systems are compared with one and other
existing systems with respect to segmentation ratio. After choosing the best
performance algorithms such as VIPS and Boilerpipe, they are used to test the
feasibility analysis on our Data sets (Flash and XML). We observed that both the
Comparative Study and Feasibility Analysis of Existing Systems and Customization of Flash Web pages into HTML Structures
78
predominantly leading algorithms fail to segment and retrieve contents from our
Data sets. Result tables clearly show that the existing systems are not viable for
XML and Flash kind of Web pages. Therefore, to propose the system to segment
and retrieve the contents, we performed the semantic structure analysis on both
the Data sets. After analyzing the semantic structure, Translation system has
been proposed to customize the Flash contents into HTML structure to make
them viable on SSDs. Here, the results obtained are tested on various hand held
devices such as Sony Xperia X10, Sony Ericsson WT13i and LG Optimus Net
based on various specification details. After achieving the better response time
for translated Web pages, obtained resultant vectors are compared with the
response time of respective conventional Web pages and Kappa statistical
analysis also has been performed to analyze the content coverage on SSDs.