
International Journal of Innovative Research in Information Security (IJIRIS) ISSN: 2349-7017(O) Issue 2, Volume 4 (April 2015) ISSN: 2349-7009(P) www.ijiris.com


Agent based Authentication for Deep Web Data Extraction

G. Muneeswari
Associate Professor, Department of Information Technology
SSN College of Engineering

Abstract— Deep web contents are accessed through queries submitted to web databases, and the returned data records are wrapped in dynamically generated web pages (hereafter called deep web pages). Extracting structured data from deep web pages is a challenging problem due to their intricate underlying structures. Because web pages are a popular two-dimensional medium, their contents are laid out regularly for users to browse. This motivates us to seek a different way to perform deep web data extraction, overcoming the limitations of previous work by exploiting common visual features of deep web pages. In this paper, an agent-based authentication mechanism is proposed in which an agent program running on the server authenticates the user. A novel vision-based approach that is independent of the web page programming language is also proposed. This approach primarily uses the visual features of deep web pages to perform deep web data extraction, including data record extraction and data item extraction. Our experiments on a large set of web databases show that the proposed vision-based approach is highly effective for deep web data extraction and provides security for registered users.

Keywords— Authentication, vision-based approach, web data, agent program, web page.

I. INTRODUCTION

The World Wide Web contains more and more online web databases that can be searched through their web query interfaces. The query results are wrapped in dynamically generated web pages in the form of data records. These web pages are called deep web pages, and they are hard for traditional crawler-based search engines to index. To ease human users' consumption of the retrieved information, deep web pages (data records and data items) are usually arranged with a visual regularity that matches human reading habits. We explore the visual regularity of data records and data items on web pages and propose a novel vision-based approach, the Vision-based Data Extractor (ViDE), to extract results from deep web pages automatically. It is primarily based on visual features (font, position, content, layout, and appearance) and also uses some simple non-visual information such as data types and frequent symbols. ViDE consists of two main components: the Vision-based Data Record Extractor (ViDRE) and the Vision-based Data Item Extractor (ViDIE).

A. Visual Features

Font: The fonts of the texts on a web page are very useful visual information, determined by attributes such as size, face, and color.

Position Features (PFs): These features indicate the location of a data region on a deep web page.
PF1: Data regions are always centered horizontally.
PF2: The size of the data region is usually large relative to the area of the whole page.

Layout Features (LFs): These features indicate how the data records in the data region are typically arranged.
LF1: Data records are usually left-aligned in the data region.
LF2: All data records are adjoining.
LF3: Adjoining data records do not overlap.

Appearance Features (AFs): These features capture the visual characteristics within the data records.
AF1: Data records are very similar in appearance, for example in the sizes of the images they contain and the fonts they use.
AF2: Data items of the same semantics in different data records have similar presentations.
AF3: Neighboring text data items of different semantics often use distinguishable fonts.

Content Features (CFs): These features hint at the regularity of contents in data records.
CF1: The first data item in each data record is always of a mandatory type.
CF2: The presentation of data items in data records follows a fixed order.
CF3: There are often some fixed static texts in data records.
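To make these definitions concrete, the sketch below models a rendered page block and tests PF1/PF2 and AF1. This is a minimal illustration, not the authors' implementation; the class, the field names, and the numeric thresholds (20% size tolerance, 10% centering slack, 40% area share) are assumptions.

```python
from dataclasses import dataclass

@dataclass
class VisualBlock:
    x: float            # left edge on the rendered page (pixels)
    y: float            # top edge (pixels)
    width: float
    height: float
    font_size: int
    font_face: str
    font_color: str

def similar_appearance(a: VisualBlock, b: VisualBlock, tol: float = 0.2) -> bool:
    """AF1: two blocks look alike if their fonts match and their rendered
    sizes differ by no more than tol (an assumed 20% threshold)."""
    same_font = (a.font_size, a.font_face, a.font_color) == (
        b.font_size, b.font_face, b.font_color)
    close_w = abs(a.width - b.width) <= tol * max(a.width, b.width)
    close_h = abs(a.height - b.height) <= tol * max(a.height, b.height)
    return same_font and close_w and close_h

def is_data_region(region: VisualBlock, page_w: float, page_h: float) -> bool:
    """PF1 + PF2: horizontally centered and large relative to the page.
    The centering slack and area share are assumed values."""
    centered = abs((region.x + region.width / 2) - page_w / 2) < 0.1 * page_w
    large = region.width * region.height > 0.4 * page_w * page_h
    return centered and large
```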

The problem of web data extraction has attracted a lot of attention, but most of the proposed solutions are based on analyzing the HTML source code or the tag trees of web pages. For a web page that is vividly and colorfully designed using complex JavaScript and CSS, it becomes difficult to infer the regularity of the page structure by analyzing the tag structures alone. Hence, a novel vision-based approach becomes necessary. The overall objective of our project is to develop an approach, independent of the web page programming language, for extracting data from dynamically generated web pages. A number of approaches have been reported for extracting information from web pages.


These are classified by the degree of automation into manual, semi-automatic, and automatic approaches. The earliest are manual approaches, in which languages were designed to assist programmers in constructing wrappers that identify and extract all the desired data items/fields; examples include Minerva, TSIMMIS, and WebOQL. Semi-automatic techniques can be classified into sequence-based (e.g., SoftMealy and Stalker) and tree-based (e.g., W4F and XWrap). These approaches require manual effort, for example labelling some sample pages. Automatic approaches improve efficiency and reduce manual effort; examples include Omini, IEPAD, RoadRunner, MDR, and DEPTA. However, these approaches do not generate wrappers; that is, they identify patterns and extract data from web pages directly, without using previously derived extraction rules.

B. Information Extraction and Web Intelligence

The problem of information extraction is to transform the contents of input documents into structured data. Unlike information retrieval, which concerns how to identify relevant documents in a collection, information extraction produces structured data ready for post-processing, which is crucial to many text mining applications. The IEPAD architecture, shown in Fig. 1, attempts to eliminate the need for user-labeled training examples. The idea is based on the fact that data on a semi-structured web page is often rendered in a particular alignment and order and thus exhibits regular and contiguous patterns. By discovering these patterns in target web pages, an extractor can be generated. IEPAD has a pattern discovery algorithm that can be applied to any input web page without training examples. This greatly reduces the cost of extractor construction; a huge number of extractors can now be generated, and sophisticated web intelligence applications become practical.

Fig 1. IEPAD Architecture

C. The IEPAD Architecture

Web page information extraction has been implemented in a system called IEPAD, which includes three components:

Pattern discoverer: accepts an input web page and discovers potential patterns that contain the target data to be extracted.

Rule generator: contains a graphical user interface, called the pattern viewer, which shows the discovered patterns. Users can select the one that extracts the interesting data; the rule generator then remembers the pattern and saves it as an extraction rule for later use. Users can also use the pattern viewer to assign attribute names to the extracted data.

Extractor: extracts the desired information from similar web pages based on the designated extraction rule. A simplified sketch of the pattern discoverer's idea follows.
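The sketch below is a deliberately simplified stand-in for IEPAD's PAT-tree pattern discoverer, not its actual algorithm: it counts repeated n-grams of tag tokens, which captures the core observation that a record list shows up as a contiguously repeating token pattern. The thresholds are assumptions.

```python
# Simplified stand-in for IEPAD-style pattern discovery: find token
# subsequences that repeat often, the signature of a record list.
from collections import Counter

def repeated_patterns(tokens, min_len=2, max_len=10, min_count=3):
    """Return candidate patterns: n-grams of tag tokens occurring at least
    min_count times. Real IEPAD builds a PAT tree and checks the regularity
    of occurrence spacing; n-gram counting is the crude equivalent."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return [p for p, c in counts.items() if c >= min_count]

# Example: a page skeleton encoded as tag tokens, four repeated records.
page = ["<b>", "text", "</b>", "<i>", "text", "</i>"] * 4
print(repeated_patterns(page)[:3])
```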

II. LITERATURE SURVEY

A. Automatic Information Extraction from Semi-Structured Web Pages by Pattern Discovery

In this paper, an unsupervised approach to semi-structured information extraction is presented. Its key features are as follows. First, by applying the PAT-tree pattern discovery algorithm, record boundaries can be discovered automatically without user-labeled training examples. Second, by applying a multiple alignment algorithm, discovered patterns can be generalized to unseen pages. Third, the extraction rule is pattern-based instead of delimiter-based. Finally, by allowing alternative expressions in the extraction rules, IEPAD can handle exceptions such as missing attributes, multiple attribute values, and variant attribute permutations. Compared to previous work, IEPAD performs equally well in extraction accuracy but requires much less human intervention to produce an extractor. In terms of tolerance to layout irregularity, the extraction rules generated by IEPAD allow alternative tokens and hence can tolerate exceptions and variance such as missing attributes in the input. Multi-level alignment can also be applied to extract finer-grained information by extracting attributes from a discovered data record.


B. Automatic Extraction of Dynamic Record Sections From Search Engine Result Pages

In this paper, a method is presented to automatically generate wrappers for extracting search result records from all dynamic sections on result pages returned by search engines. The method has the following novel features: (1) it aims to explicitly identify all dynamic sections, including those not seen on the sample result pages used to generate the wrapper; (2) it addresses the issue of correctly differentiating sections from records. Experimental results indicate that the method is very promising. Automatic search result record extraction is critical for applications that need to interact with search engines, such as automatic construction and maintenance of meta-search engines and deep web crawling. The paper presents an algorithm (MSE) that tackles the problem of automatically extracting dynamic sections as well as the search result records within them. The solution can be very useful for all web applications that need to interact with web-based search systems, including regular search engines and e-commerce search engines. By extracting search result records from all dynamic sections while maintaining the section-record relationships, MSE allows an application to select the desired sections for data extraction.

C. VIPS: a Vision-based Page Segmentation Algorithm

A large number of techniques have been proposed to address this problem, but all of them have inherent limitations because they depend on the web page programming language. This paper proposed a new approach for extracting web content structure based on visual representation. The produced web content structure is very helpful for applications such as web adaptation, information retrieval, and information extraction. By identifying the logical relationships of web content based on visual layout information, the web content structure can effectively represent the semantic structure of the web page. The approach simulates how a user understands the layout of a web page from its visual representation. Compared with traditional DOM-based segmentation methods, this scheme uses visual cues to obtain a better partition of a page at the semantic level. It is also independent of the physical realization and works well even when the physical structure differs greatly from the visual presentation.

III. VISION BASED APPROACH

We propose ViDE, a novel automatic vision-based approach to extract data from deep web pages. It uses the visual features of a web page and is therefore independent of the web page programming language (HTML-independent). The information on web pages consists of both text and images, and our approach provides an efficient mechanism to extract and cluster that information for easy use by the user.

A. Vision-based Data Extractor

Our proposed system, depicted in Fig. 2, is a vision-based data extraction tool. Given an input URL, the local web browser uses the crawler mechanism to access the web and fetch the requested page. The contents of the web page, such as images, tables, headers, emails, and phone numbers, are then extracted using the data extractor tool. The clustering process then uses the visual appearance of the web page to group semantically related content together and outputs it as an HTML file.

Fig 2. Vision based System Architecture

1) Crawler: A crawler is a mechanism that methodically scans, or crawls, through internet pages to create an index of the data it is looking for. Given a URL as input, our tool uses this mechanism to display the corresponding web page.



The process is as follows. A user enters the URL of the web page he/she wants to extract into the local browser. The local browser contacts the World Wide Web to identify and fetch the requested page, then displays it in the display panel.
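A minimal sketch of this fetch step follows, using only the Python standard library. The paper's tool drives a local browser, so this is an illustrative stand-in rather than the actual implementation; the example URL is hypothetical.

```python
# Fetch a page by URL and return its HTML, the crawler's basic operation.
import urllib.request

def fetch_page(url: str, timeout: int = 10) -> str:
    """Fetch the page at `url` and return its HTML as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")

html = fetch_page("http://example.com/")  # hypothetical input URL
```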

2) Extractor: First, the fetched web page is given as input to the extractor, which uses regular expressions to retrieve content from the page and save it in local file storage. The extraction process is as follows. Our web data extractor tool extracts different types of data from the acquired web page, such as images, tables, headers, content, traversed URLs, phone numbers, emails, and fax information. When the extraction process starts, regular expressions are constructed for each of these data types and matched against the fetched web page to extract the required information. Once extraction completes, the extracted information is stored as individual files in local file storage.

3) Clustering: Clustering groups the content extracted from the web pages in the previous step by the web data extractor tool. Clustering is performed using the visual appearance of the web page: related contents are grouped together based on visual features such as image size, image position, and font size. The clustering process is as follows. Images are first clustered by size and position; images with identical sizes are retained, and the remaining images are labeled as noise and removed. Tables and other contents are grouped using the font features of the text displayed on the web page. At the end of the clustering process, the relevant content extracted from the web pages is stored as an HTML file.
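The sketch below illustrates the size-based image step just described: images sharing (approximately) the same dimensions form the record cluster, and the rest are treated as noise and dropped. The rounding tolerance and the "largest bucket wins" rule are assumptions, not the paper's stated algorithm.

```python
# Bucket images by rounded size; keep only the dominant bucket, mirroring
# "identical size => record image, everything else => noise".
from collections import defaultdict

def cluster_images(images):
    """images: list of (path, width, height) tuples."""
    buckets = defaultdict(list)
    for path, w, h in images:
        key = (round(w, -1), round(h, -1))  # tolerate ~10px jitter (assumed)
        buckets[key].append(path)
    record_images = max(buckets.values(), key=len) if buckets else []
    noise = [p for group in buckets.values() for p in group
             if p not in record_images]
    return record_images, noise
```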

4) Display: The display process uses a search engine mechanism that shows results according to the submitted search query. The process is as follows. An HTML file containing the clustered content is produced for each web site. The individual HTML files are placed in local file storage, and the relevant pages are displayed when a search query is submitted.
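A minimal sketch of this lookup over the stored HTML files follows. The directory layout and the plain substring match are assumptions; the paper does not specify its query mechanism.

```python
# Scan locally stored cluster HTML files and return those matching a query.
import glob
import re

def search_local(query: str, store: str = "./clusters") -> list:
    hits = []
    for path in glob.glob(f"{store}/*.html"):
        with open(path, encoding="utf-8", errors="replace") as f:
            text = re.sub(r"<[^>]+>", " ", f.read())  # strip tags
        if query.lower() in text.lower():
            hits.append(path)
    return hits
```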

IV. AGENT BASED AUTHENTICATION

A. Agent Server for Authentication

Whenever a client wants to search for a web page, it must first send a query to the agent server. Before that, every client has to register with the agent server, which records all client details in its internal database. When a query is received, the encrypted username and password are verified; only authenticated clients can access the server or forward queries to the respective servers.

Fig 3. Agent Server implementation

The agent server acts as a proxy for any number of clients and servers on the web. It implements a linked list that stores usernames and passwords in encrypted form (see the sketch below).
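The sketch below mirrors the described credential store: a plain singly linked list holding credentials in non-plaintext form. The paper does not name its encryption scheme, so a salted SHA-256 hash stands in for it here; all names are illustrative.

```python
# Linked-list credential store for the agent server (scheme assumed).
import hashlib
import os

class Node:
    def __init__(self, username_h, salt, password_h):
        self.username_h = username_h
        self.salt = salt
        self.password_h = password_h
        self.next = None

class CredentialList:
    def __init__(self):
        self.head = None

    @staticmethod
    def _h(value: str, salt: bytes = b"") -> str:
        # Salted hash stands in for the paper's unspecified encryption.
        return hashlib.sha256(salt + value.encode()).hexdigest()

    def register(self, username: str, password: str) -> None:
        salt = os.urandom(16)
        node = Node(self._h(username), salt, self._h(password, salt))
        node.next, self.head = self.head, node  # prepend to the list

    def authenticate(self, username: str, password: str) -> bool:
        cur = self.head
        while cur:
            if cur.username_h == self._h(username):
                return cur.password_h == self._h(password, cur.salt)
            cur = cur.next
        return False
```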

B. Victim/Detection Model for Authentication

The victim model in our general framework consists of multiple back-end servers, which can be web/application servers, database servers, and distributed file systems. We do not take classic multi-tier web servers as the model, since our detection scheme is deployed directly on the victim tier and identifies attacks targeting that same tier; thus, multi-tier attacks should be separated into several classes to use this detection scheme. The victim model along with the front-end proxies is shown in Fig. 4. We assume that all the back-end servers provide multiple types of application services to clients using the HTTP/1.1 protocol over TCP connections. Each back-end server is assumed to have the same amount of resources.



Moreover, the application services are provided by K virtual private servers (K is an input parameter), which are embedded in the physical back-end server machine and operate in parallel. Each virtual server is assigned an equal amount of static service resources, e.g., CPU, storage, memory, and network bandwidth. The operation of any virtual server does not affect the other virtual servers on the same physical machine. The reasons for using virtual servers are twofold: first, each virtual server can reboot independently, which makes recovery from a fatal compromise feasible; second, the state transfer overhead of moving clients among different virtual servers is much smaller than that of moving them among physical machines. When client requests arrive at the front-end proxy, they are distributed to multiple back-end servers for load balancing, whether sessions are sticky or not. Note that our detection scheme sits behind this front-end tier, so the load balancing mechanism is orthogonal to our setting. On being accepted by a physical server, a request is first validated against the list of all identified attacker IDs (the blacklist). If it passes this authentication, it is dispatched to one of the virtual servers within the machine by means of a virtual switch. This distribution depends on the testing matrix generated by the detection algorithm. By periodically monitoring the average response time of service requests and comparing it with thresholds taken from a legitimate profile, each virtual server is labeled with a "negative" or "positive" outcome (see the sketch below). This victim model is deployed on the agent server for authentication; once clients are authenticated, web pages can be accessed with the proposed methodology.
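The sketch below shows the per-virtual-server labeling step just described: compare each virtual server's average response time against a threshold from the legitimate profile and mark it "positive" (suspect) or "negative". The threshold value, sample format, and K = 3 example are assumptions.

```python
# Label each of the K virtual servers from its recent response times.
from statistics import mean

def label_virtual_servers(response_times, threshold_ms):
    """response_times maps virtual-server id -> recent response times (ms);
    threshold_ms comes from the legitimate profile (assumed here)."""
    return {vs: ("positive" if mean(ts) > threshold_ms else "negative")
            for vs, ts in response_times.items() if ts}

# Example with K = 3 virtual servers and an assumed 200 ms profile threshold.
samples = {0: [120.0, 135.0], 1: [480.0, 510.0], 2: [90.0, 110.0]}
print(label_virtual_servers(samples, threshold_ms=200.0))
# -> {0: 'negative', 1: 'positive', 2: 'negative'}
```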

Fig 4. Victim/detection model

V. EVALUATION RESULTS

A performance snapshot for all 10 clients at a given instant is given in Table 1. Any misuse detected in any of the parameters raises or lowers the performance values. For example, in node n1 a hacker managed to authenticate himself and enter the system; user feedback on the compromised system is uniformly poor, so its authentication value is set to 0. For the first run we took a sample of 10 clients. The implementation uses MATLAB, with the Flame tool for agent simulation on the server. Performance factors such as confidentiality, security, privacy, non-repudiation, authentication, and authorization are taken into account, and the following table is formulated.

Table 1. Authentication and other related Performance Factors

First Run  Confidentiality  Security  Privacy  Non-repudiation  Authentication  Authorization
n0         0.625            0.625     0.5      0.375            0.75            0.375
n1         0.75             0.5       0.625    0.5              0               0.125
n2         0.625            0.875     0.5      0.375            0.5             0.625
n3         1                0.25      0.75     0.75             0.5             0.25
n4         0.125            0.25      0.5      0.625            0.875           0
n5         0.75             0.125     1        0.75             0.5             0.75
n6         0.25             1         0.125    1                0.375           0.25
n7         0.125            0.625     0        1                0.75            0.375
n8         0.625            0.5       0        0.125            0.75            0.75
n9         0.5              0.625     0        0.125            0.5             0.375
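As an illustration of how such a snapshot can be summarized per client, the sketch below averages the six factor scores from Table 1. The equal weighting is an assumption; the paper does not state one, and only two rows are reproduced here.

```python
# Average the six factor scores from Table 1 to rank clients (equal
# weights assumed; remaining rows omitted for brevity).
table = {
    "n0": [0.625, 0.625, 0.5, 0.375, 0.75, 0.375],
    "n1": [0.75, 0.5, 0.625, 0.5, 0.0, 0.125],   # compromised: auth = 0
}

def overall(scores):
    return sum(scores) / len(scores)

for node, scores in table.items():
    print(node, round(overall(scores), 3))
```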


Fig 5. Performance Factors Comparison

The performance factors generated by MATLAB are shown in Fig. 5. From this observation we can ensure that only authenticated clients access the server, and that deep web page data extraction is performed by our proposed method.

A. Browser Page

The browser page is the first window in the web data extraction process. The web page URL is given as input to the agent server. All the servers where the original data resides are registered with the agent server. Whenever a client requests a particular website, the request is forwarded to the agent server, whose main responsibility is to map the requested website against its own database. If a match is found, the agent server issues the client a valid key, which is the count of the number of accesses to that particular website, and it also sends the closely related authenticated websites to the client. Fig 6 and Fig 7 show the browser page and the fetched web page; the browser page uses the crawler mechanism to fetch and display the requested page.
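The sketch below illustrates this mapping step: look the requested site up in the agent server's database and, on a match, issue a key equal to the running access count for that site, as the text describes. The storage details and class names are assumptions.

```python
# Agent server lookup: known sites get a key equal to their access count.
from collections import defaultdict

class AgentServer:
    def __init__(self, known_sites):
        self.known_sites = set(known_sites)
        self.access_counts = defaultdict(int)

    def request(self, url: str):
        if url not in self.known_sites:
            return None  # no match in the database: request refused
        self.access_counts[url] += 1
        return self.access_counts[url]  # "valid key" = access count

agent = AgentServer({"http://example.com/"})  # hypothetical registered site
print(agent.request("http://example.com/"))   # -> 1
```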

Fig 6. Browser page    Fig 7. Web page display

B. Web Data Extractor

The next main window, shown in Figs 8 to 11, is the Web Data Extractor tool, which uses regular expressions to validate and extract the data from the fetched web page.


Fig 8. Web data extractor    Fig 9. Extracting images

Fig 10. Saving images    Fig 11. Saved images in folder

C. Clustering

The next window is the clustering window, which groups the extracted content of the web pages based on visual cues. Here the final HTML table, which contains the structured view of the web page, is created and stored.

Fig 12. Build file

Fig 13. Cluster file

D. Display

The final window in the extraction process (Fig 12 and Fig 13) is a search engine that traverses the saved HTML files and fetches the content when a search query is entered.

VI. CONCLUSION AND FUTURE ENHANCEMENTS

Several approaches have previously been proposed for extracting data from web pages. Their drawbacks are overcome by the vision-based approach. However, extracting data from web pages based entirely on visual features is beyond the scope of this project. Hence, we have proposed a system that uses visual appearance to cluster the data records extracted from semi-structured web pages. The variation we propose paves the way toward extracting web page contents purely from the visual regularity of the pages. Our web extraction tool lets novice users view the structured content of web pages efficiently. Once the extraction and clustering processes complete, the user has a complete structured view of the contents of the previously semi-structured web page.


Thus our system converts the unstructured information in web pages into a structured format, which eases the user's consumption of web data. Our system limits the use of visual features to the clustering process; in future, the same features can be used to extract contents directly from the web page. A separate search query mechanism is implemented to retrieve the collected information from web pages; in future, a search engine may be embedded in the Web Extractor tool itself to improve performance. We have also provided adaptive agent-based authentication for security.

REFERENCES

[1] G.O. Arocena and A.O. Mendelzon, "WebOQL: Restructuring Documents, Databases, and Webs," Proc. Int'l Conf. Data Eng. (ICDE), pp. 24-33, 1998.
[2] D. Buttler, L. Liu, and C. Pu, "A Fully Automated Object Extraction System for the World Wide Web," Proc. Int'l Conf. Distributed Computing Systems (ICDCS), pp. 361-370, 2001.
[3] D. Cai, X. He, J.-R. Wen, and W.-Y. Ma, "Block-Level Link Analysis," Proc. SIGIR, pp. 440-447, 2004.
[4] D. Cai, S. Yu, J. Wen, and W. Ma, "Extracting Content Structure for Web Pages Based on Visual Representation," Proc. Asia Pacific Web Conf. (APWeb), pp. 406-417, 2003.
[5] C.-H. Chang, M. Kayed, M.R. Girgis, and K.F. Shaalan, "A Survey of Web Information Extraction Systems," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 10, pp. 1411-1428, Oct. 2006.
[6] C.-H. Chang, C.-N. Hsu, and S.-C. Lui, "Automatic Information Extraction from Semi-Structured Web Pages by Pattern Discovery," Decision Support Systems, vol. 35, no. 1, pp. 129-147, 2003.
[7] V. Crescenzi and G. Mecca, "Grammars Have Exceptions," Information Systems, vol. 23, no. 8, pp. 539-565, 1998.
[8] V. Crescenzi, G. Mecca, and P. Merialdo, "RoadRunner: Towards Automatic Data Extraction from Large Web Sites," Proc. Int'l Conf. Very Large Data Bases (VLDB), pp. 109-118, 2001.
[9] D.W. Embley, Y.S. Jiang, and Y.-K. Ng, "Record-Boundary Discovery in Web Documents," Proc. ACM SIGMOD, pp. 467-478, 1999.
[10] W. Gatterbauer, P. Bohunsky, M. Herzog, B. Krüpl, and B. Pollak, "Towards Domain Independent Information Extraction from Web Tables," Proc. Int'l World Wide Web Conf. (WWW), pp. 71-80, 2007.
[11] J. Hammer, J. McHugh, and H. Garcia-Molina, "Semistructured Data: The TSIMMIS Experience," Proc. East-European Workshop Advances in Databases and Information Systems (ADBIS), pp. 1-8, 1997.
[12] C.-N. Hsu and M.-T. Dung, "Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web," Information Systems, vol. 23, no. 8, pp. 521-538, 1998.
[13] http://daisen.cc.kyushu-u.ac.jp/TBDW/, 2009.
[14] http://www.w3.org/html/wg/html5/, 2009.
[15] N. Kushmerick, "Wrapper Induction: Efficiency and Expressiveness," Artificial Intelligence, vol. 118, nos. 1/2, pp. 15-68, 2000.
[16] A. Laender, B. Ribeiro-Neto, A. da Silva, and J. Teixeira, "A Brief Survey of Web Data Extraction Tools," SIGMOD Record, vol. 31, no. 2, pp. 84-93, 2002.
[17] B. Liu, R.L. Grossman, and Y. Zhai, "Mining Data Records in Web Pages," Proc. Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 601-606, 2003.
[18] W. Liu, X. Meng, and W. Meng, "Vision-Based Web Data Records Extraction," Proc. Int'l Workshop Web and Databases (WebDB '06), pp. 20-25, June 2006.
[19] L. Liu, C. Pu, and W. Han, "XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources," Proc. Int'l Conf. Data Eng. (ICDE), pp. 611-621, 2000.
[20] Y. Lu, H. He, H. Zhao, W. Meng, and C.T. Yu, "Annotating Structured Data of the Deep Web," Proc. Int'l Conf. Data Eng. (ICDE), pp. 376-385, 2007.
[21] J. Madhavan, S.R. Jeffery, S. Cohen, X.L. Dong, D. Ko, C. Yu, and A. Halevy, "Web-Scale Data Integration: You Can Only Afford to Pay As You Go," Proc. Conf. Innovative Data Systems Research (CIDR), pp. 342-350, 2007.
[22] I. Muslea, S. Minton, and C.A. Knoblock, "Hierarchical Wrapper Induction for Semi-Structured Information Sources," Autonomous Agents and Multi-Agent Systems, vol. 4, nos. 1/2, pp. 93-114, 2001.
[23] Z. Nie, J.-R. Wen, and W.-Y. Ma, "Object-Level Vertical Search," Proc. Conf. Innovative Data Systems Research (CIDR), pp. 235-246, 2007.
[24] A. Sahuguet and F. Azavant, "Building Intelligent Web Applications Using Lightweight Wrappers," Data and Knowledge Eng., vol. 36, no. 3, pp. 283-316, 2001.
[25] K. Simon and G. Lausen, "ViPER: Augmenting Automatic Information Extraction with Visual Perceptions," Proc. Conf. Information and Knowledge Management (CIKM), pp. 381-388, 2005.
[26] R. Song, H. Liu, J.-R. Wen, and W.-Y. Ma, "Learning Block Importance Models for Web Pages," Proc. Int'l World Wide Web Conf. (WWW), pp. 203-211, 2004.
[27] J. Wang and F.H. Lochovsky, "Data Extraction and Label Assignment for Web Databases," Proc. Int'l World Wide Web Conf. (WWW), pp. 187-196, 2003.
[28] X. Xie, G. Miao, R. Song, J.-R. Wen, and W.-Y. Ma, "Efficient Browsing of Web Search Results on Mobile Devices Based on Block Importance Model," Proc. IEEE Int'l Conf. Pervasive Computing and Comm. (PerCom), pp. 17-26, 2005.
[29] Y. Zhai and B. Liu, "Web Data Extraction Based on Partial Tree Alignment," Proc. Int'l World Wide Web Conf. (WWW), pp. 76-85, 2005.
[30] H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C.T. Yu, "Fully Automatic Wrapper Generation for Search Engines," Proc. Int'l World Wide Web Conf. (WWW), pp. 66-75, 2005.
[31] H. Zhao, W. Meng, and C.T. Yu, "Automatic Extraction of Dynamic Record Sections from Search Engine Result Pages," Proc. Int'l Conf. Very Large Data Bases (VLDB), pp. 989-1000, 2006.
[32] J. Zhu, Z. Nie, J. Wen, B. Zhang, and W. Ma, "Simultaneous Record Detection and Attribute Labeling in Web Data Extraction," Proc. Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 494-503, 2006.