
Recognizing and Filtering Web Documents Using Ontology

Reza Mohamadi Bahram Abadi
Student, Azad University of Oloum-va-Tahghighat
Ahwaz, Iran
[email protected]

Mohammadi Hossein Yektaie
Faculty Member, Islamic Azad University of Abadan
Abadan, Iran
[email protected]

Mashallah Abbasi
Faculty Member, Islamic Azad University of Oloum-va-Tahghighat
Ahwaz, Iran
[email protected]

Abstract - For users to reach the data they need among the piles of documents on the World Wide Web, systematic search methods are required. This paper evaluates the degree of relationship between a semi-structured HTML document and an ontology using statistical techniques. For this purpose, given the data in the ontology, the expected density and expected values of an ontology-related document are calculated. To judge a sample document, the lexical objects viewed in the document that conform to the lexical objects of the ontology are counted, and from these counts the viewed density of the document is calculated. Then, using Pearson's lemma, an acceptance region is set for comparing the expected density and expected values with the viewed density and viewed values. If the calculations for both cases fall within the required range, the document is related to the ontology. Experimental results within a 95% confidence range show that the recommended method achieves a recall of 100% and a precision of 83%.

Keywords: application ontology; Web documents; information filtering.

I. INTRODUCTION

The World Wide Web consists of a vast number of Web documents, which makes reaching a desired document difficult for users. For users to find the information they are interested in, targeted search methods that locate valid data are needed. The main problem is that most of the information in Web pages is understandable to humans, while machines cannot understand its meaning [2]. If Web pages were designed semantically, extracting semantic information from them would be easy; however, not all pages of the World Wide Web are implemented that way. We must therefore use technology that simulates the meaning of contemporary Web pages. In other words, we need intelligent programs that can read Web pages and shape their data, and the relations between them, into a structured form [10]. Semantic extraction with ontologies is one of these methods.

Information extraction based on an ontology does not depend on constant Web-page structures; rather, it depends on the described content of the detected documents and is used within a specific field of knowledge [1]. To keep control over the vast and varied information on the Web, before semantic extraction we must ensure that a document is related to the ontology. In fact, filtering related documents apart from unrelated ones improves the search results from which information is extracted. In this paper, a specific domain ontology for Web pages is defined. To determine the status of a sample document, its text is matched against the ontology, and based on the results the relation between the document and the ontology is decided.

II. RELATED WORK

Before semantic extraction from the text of Web documents, we need to be sure of their relatedness to the scope of the ontology. In past years, different methods have been applied to diagnose the type of relation between a document and an ontology. One method used the density, expected-value, and grouping heuristics together with classification on a decision tree. In that method, machine learning is used to recognize the acceptance boundary for placing a document within the ontology, and algorithm C4.5 is used to implement the decision tree. Upon calculating the three heuristics on a sample document and evaluating the results on the decision tree, one can state whether the document is related to the ontology [8]. In 2001, Quan Wang proposed the use of a probabilistic retrieval model for distinguishing the type of relation between documents and an ontology. The same three heuristics are applied as before; the difference lies in the expected-value heuristic, which is not calculated for the document as a whole. Instead, the expected value is calculated for each lexical object separately. In order to represent the heuristic results for a



document, a vector of length n+2 is used. Two elements of the vector hold the density value (y) and the grouping value (z); the other n elements hold the expected values of the n lexical objects in a sample document.

To decide the type of relation between a document and the ontology, logistic regression and a probabilistic retrieval model are used. The degree of relation is expressed by the following formulas.
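The formulas themselves did not survive the scan of the source. For a logistic-regression-based probabilistic retrieval model of this kind, the standard form, written in our notation rather than Wang's, would be:

```latex
% Reconstruction: a logistic model over the n+2 vector elements
% x_1 .. x_{n+2} (the n expected values, the density y, and the grouping z)
u = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_{n+2} x_{n+2},
\qquad p = \frac{e^{u}}{1 + e^{u}}
```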

Considering the limits of probability (0 < p < 1), the smaller the difference between the values produced by the above formulas, the stronger the relation to the ontology [9]. Quan Wang reported experimental results on three types of documents (related, non-related, and similar). As a sample, he applied the probabilistic retrieval model to 10 related documents, shown in TABLE I; the probability value for all of them was one [9]. To run the tests, he first counted the number of lexical objects in each document and formed the document vector, which was then optimized. He also calculated the density and grouping values for the same document. Then, using logistic regression, the probabilistic retrieval model was applied to determine the type of relation between the document and the ontology. Further results showed that the system suffers from certain weaknesses.

TABLE I. List of web sites related to the ontology

Row | URL | Probability
1 | http://www.delmarvaclassfield.com | 1
2 | http://www.thetelegraph.com | 1
3 | http://www.vermontclassifieds.com | 1
4 | http://www.ndweb.com/mdnonline/mdnonline.html | 1
5 | http://www.adn.com | 1
6 | http://www.hawaiisnews.com/cars | 1
7 | http://www.brewtonstandard.com | 1
8 | http://www.aikenstandard.com/ | 1
9 | http://adaeveningnews.com | 1
10 | http://www.tahoe.com/classifieds/tdt/9100.html | 1

In semantic extraction, most methods use such heuristics to distinguish a document's relation to an ontology. In the method recommended in this paper, we use the density and expected-value heuristics for this purpose, applying both heuristics to the lexical objects. We also calculate the viewed and expected values for the density. The decision on a document's relation to the ontology is based on comparing the viewed values with the expected values within an acceptable range.

III. PRELIMINARIES

A. Application Ontology

To provide the theoretical grounding for this article, we define a sample application ontology as a cognitive model. In effect, this model represents a real environment in a limited space. The system uses two equivalent representations, graphical and textual. The application ontology of interest here covers the domain of car ads [11]. Figure 1 shows a portion of the textual representation of the car-ads ontology, including all object and relationship sets and cardinality constraints (lines 1-9) and a few lines of its data frames (lines 10-19). The figure shows only three of the regular expressions; a complete representation of the car-ads ontology needs 165 regular expressions. In the textual view, the symbol [-> object] marks the non-lexical object; in fact, this symbol designates the main title of the ontology, the ad itself. The min:max or min:ave:max constraint specified next to the connection between an object set and a relationship set in the graphical representation is the participation constraint of the object set in the relationship set. Min, ave, and max denote the minimum, average, and maximum number of times an object in the object set can, or is expected to, participate in the relationship set, respectively, whereas * designates an unknown but finite maximum number of participations. In the textual representation of the car-ads ontology, the participation constraints are listed in lines 2 through 9 [8].

1. Car [-> object];
2. Car [0:0.908:1] has Model [1:*];
3. Car [0:0.925:1] has Make [1:*];
4. Car [0:0.975:1] has Year [1:*];
5. Car [0:0.8:1] has Price [1:*];
6. Car [0:0.45:1] has Mileage [1:*];
7. PhoneNr [1:*] is for Car [0:1];
8. PhoneNr [0:1] has Extension [1:*];
9. Car [0:2.1:*] has Feature [1:*];
10. Make matches [11] case insensitive
11. constant
12. { extract "\bchev\b"; }, { extract "\bchevy\b"; }, { extract "\bdodge\b"; },
13. ...
14. end;
15. Model matches [16] case insensitive
16. constant
17. { extract "88"; context "\bolds\S*\s*88\b"; },
18. ...
19. end;

Figure 1. Car-ads ontology - textual
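As a quick illustration of how the participation constraints of this textual form can be consumed programmatically, the following sketch (our own, not part of the paper's system) parses the min:ave:max triples from lines 2-9 of Figure 1:

```python
import re

# Parse the participation constraints of the textual car-ads ontology into
# (min, average, max) triples; '*' is kept as None for "unbounded".
CONSTRAINT = re.compile(
    r"Car \[(\d+(?:\.\d+)?):(?:(\d+(?:\.\d+)?):)?(\d+(?:\.\d+)?|\*)\] has (\w+)")

def parse_constraints(ontology_text: str) -> dict:
    constraints = {}
    for mn, avg, mx, obj in CONSTRAINT.findall(ontology_text):
        constraints[obj] = (float(mn),
                            float(avg) if avg else None,
                            None if mx == "*" else float(mx))
    return constraints

lines = "Car [0:0.908:1] has Model [1:*]; Car [0:2.1:*] has Feature [1:*];"
print(parse_constraints(lines))
# {'Model': (0.0, 0.908, 1.0), 'Feature': (0.0, 2.1, None)}
```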



Regular expressions impose limits on lexical objects. For example, lines 10 through 14 define constraints for the Make object; the bracketed number in line 10 bounds the maximum length, in characters, of a matched value. The keywords related to the considered object are also defined in this section [8]. We can extract the related keywords by using the data frame provided with the ontology and comparing the strings occurring in the text against the regular expressions in the data frame.
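A minimal sketch of this matching step, assuming the ontology's extract expressions are available as plain regex strings per lexical object set (the patterns below are a hypothetical subset of the 165 in the full ontology):

```python
import re

# Hypothetical subset of the car-ads data frames, one pattern list per
# lexical object set.
DATA_FRAMES = {
    "Make":  [r"\bchev\b", r"\bchevy\b", r"\bdodge\b"],
    "Model": [r"\bolds\S*\s*88\b"],      # '88' accepted only in its context
    "Year":  [r"\b(19|20)\d{2}\b"],
}

def recognize(document: str) -> dict:
    """Return the strings recognized in the document for each object set."""
    hits = {}
    for obj, patterns in DATA_FRAMES.items():
        found = []
        for p in patterns:
            found += [m.group() for m in re.finditer(p, document, re.IGNORECASE)]
        hits[obj] = found
    return hits

print(recognize("1999 Dodge, red. Also: olds 88, low miles."))
```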

B. Regression Analysis

One of the main goals of many statistical studies is to establish dependencies that allow predicting one or more variables from others, and regression is a tool for finding such a relationship. Regression analysis is a statistical tool for studying the relationship between a dependent variable and a set of independent variables; the more relevant information about the subject we can take into account, the better we can correct the predictions. The most common linear equation used to implement a regression relation between the variables is the following [13]:

y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ

In the equation above, y is the random variable whose values we want to predict from the known values of the p predictor variables x₁, …, xₚ, and the multiple regression coefficients β₀, …, βₚ are constants that must be determined from the viewed data. One of the main conditions of multiple linear regression is the linear independence of the independent variables [13]. In this paper, we use the lexical objects of an application ontology as the independent variables of the multiple linear regression.

C. Density Heuristic

A Web document D that is relevant to a particular application ontology O should include many of the constants and keywords defined in the ontology. Based on this observation, we define a density heuristic and compute the density of D with respect to O as follows [12]:

density(D, O) = (total number of matched characters) / (total number of characters)

where the total number of matched characters is the number of characters of the constants and keywords recognized by O in D, and the total number of characters is the total number of characters in D [12].
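A minimal sketch of the density computation, assuming a handful of hypothetical patterns stand in for the ontology's constants and keywords (overlapping matches are not de-duplicated in this simple version):

```python
import re

# density(D, O) = matched characters / total characters
PATTERNS = [r"\bchevy?\b", r"\bdodge\b", r"\b(19|20)\d{2}\b", r"\$\d[\d,]*"]

def density(document: str) -> float:
    matched = sum(len(m.group())
                  for p in PATTERNS
                  for m in re.finditer(p, document, re.IGNORECASE))
    return matched / len(document) if document else 0.0

print(round(density("1995 Chevy Cavalier, $4,500 obo."), 3))
```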

D. Expected-Values Heuristic

We apply the VSM model to measure whether a multiple-record Web document D has the number of values expected for each lexical object set of the application ontology O. Based on the lexical object sets and the participation constraints in O, we construct an ontology vector OV. Based on the same lexical object sets and the number of constants recognized for these object sets by O in D, we construct a document vector DV. We measure the relevance of D to O with respect to our expected-values heuristic by observing the cosine of the angle between DV and OV [8]. To construct the ontology vector OV, we (1) identify the lexical object-set names, which become the names of the coefficients of OV, and (2) determine the average participation of each lexical object set with respect to the object set of interest specified in O, which become the values of the coefficients of OV [8]. The coefficients of the car ontology vector are therefore the average participation constraints defined in Figure 1 (Year: 0.975, Make: 0.925, Model: 0.908, Price: 0.8, Mileage: 0.45, Feature: 2.1).

The names of the coefficients of DV are the same as the names of the coefficients of OV. We obtain the value of each coefficient of DV by automatically counting the number of appearances of constant values in D that belong to each lexical object set. Observe that for document vectors we use the actual number of constants found in a document. To get the average (normalized for a single record), we would have to divide by the number of records, a number we do not know with certainty. Therefore, we do not normalize, but instead merely compare the cosine of the angle between the vectors to get a measure for our expected-values heuristic. As mentioned, we measure the similarity between an ontology vector OV and a document vector DV by the cosine of the angle between them. In particular, we use the similarity cosine function, which calculates the acute angle [8]:

cos θ = P / N

where P is the inner product of the two vectors and N is the product of the lengths of the two vectors. When the distribution of values among the object sets in DV closely matches the expected distribution specified in OV, the angle θ will be close to zero and cos θ will be close to one.
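As a sketch of this similarity computation (ours, not the paper's code; the DV counts below are made up, and PhoneNr's average participation is not stated in Figure 1, so 1.0 is an assumed placeholder):

```python
import math

# OV coefficients: average participations from Figure 1; PhoneNr assumed.
OV = {"Year": 0.975, "Make": 0.925, "Model": 0.908, "Mileage": 0.45,
      "Price": 0.8, "Feature": 2.1, "PhoneNr": 1.0}

def cosine(ov: dict, dv: dict) -> float:
    keys = sorted(ov)
    p = sum(ov[k] * dv.get(k, 0.0) for k in keys)          # inner product P
    n = math.sqrt(sum(ov[k] ** 2 for k in keys)) * \
        math.sqrt(sum(dv.get(k, 0.0) ** 2 for k in keys))  # product of lengths N
    return p / n if n else 0.0

# Hypothetical raw counts from a multi-record car-ads page:
DV = {"Year": 18, "Make": 13, "Model": 12, "Mileage": 6,
      "Price": 11, "Feature": 29, "PhoneNr": 15}
print(round(cosine(OV, DV), 4))
```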

IV. IMPLEMENTATION OF THE RECOMMENDED SYSTEM

We use statistical techniques to determine the type of relation between the ontology and a sample document. An acceptable error rate of α = 0.01 is considered; that is, the calculations are investigated within a 99% confidence interval. The Web documents used in this project are semi-structured HTML documents, and we again use the car-ads ontology. The recommended algorithm is performed in three steps:



A. Step 1: Using ontology-related documents to build the multiple linear regression equation

First, a formula is derived to determine the relation of the lexical objects to the ontology. This formula can be used for two purposes:

1- Predicting the angle between the document vector and the ontology vector; the formula can substitute for the cosine similarity function when calculating the angle θ.

2- Determining the value, or credit, of each lexical object for use in subsequent calculations.

The main goal in this phase is the second application.

For this purpose, we take some documents related to the car ontology and calculate the normalized vector of each document. As baseline information, the angle θ is also obtained for each document from the cosine function. In this phase, we use the related documents collected by Quan Wang, which were discussed above. The optimized vectors of the documents from these web sites are shown in TABLE II [9]. We use this information to build the multiple linear regression model (the variable C stands for cos θ).

TABLE II. Normalized term frequencies of the training-set documents according to the car-ads ontology

PhoneNr | Feature | Price | Mileage | Model | Make | Year | C
0.8 | 1.5 | 1.2 | 0.4 | 0.8 | 1.4 | 1.5 | 0.9
0.7 | 0.7 | 1.6 | 0.2 | 1.7 | 1.4 | 0.8 | 0.8
1.2 | 1.2 | 0.9 | 0.4 | 0.6 | 0.8 | 2.7 | 0.9
1.3 | 1.9 | 0.9 | 0.4 | 0.7 | 0.7 | 1.3 | 0.9
0.8 | 1.1 | 1.3 | 0.3 | 1.3 | 1.1 | 1.6 | 0.9
0.7 | 1.6 | 1.1 | 0.3 | 0.6 | 1.1 | 1.8 | 0.9
0.7 | 1.6 | 1.3 | 0.4 | 1.3 | 0.7 | 1.3 | 0.9
0.9 | 2.6 | 0.6 | 0.2 | 0.6 | 0.6 | 0.9 | 0.9
1.7 | 1.2 | 0.4 | 0.4 | 0.7 | 0.6 | 1.8 | 0.9
1.2 | 1.8 | 0.9 | 0.3 | 1.01 | 1.2 | 1.1 | 0.9

To create a model of the relationship between the lexical objects of the car-ads ontology, we use the information in TABLE II to form a multiple linear regression equation. For this purpose, we take the lexical objects of the car-ads ontology as the independent variables and the angle between the optimized document vector and the ontology vector as the dependent variable. The order of the lexical objects and of the regression variables used is shown in TABLE III, from which we form the regression equation.

TABLE III. Order of lexical objects and variables used in the regression

PhoneNr | Feature | Price | Mileage | Model | Make | Year | C

After this initial setup, the β coefficient belonging to each lexical object is estimated by the regression. Using the multiple linear regression formula and the fitted coefficients, the final model takes the form

ŷ = β₀ + β₁·PhoneNr + β₂·Feature + β₃·Price + β₄·Mileage + β₅·Model + β₆·Make + β₇·Year

This formula predicts the cosine of the angle between the document vector and the ontology vector from the numbers of lexical objects within the document. For example, substituting the values of the first row of TABLE II into the formula gives y = 0.9346. To verify the accuracy of the calculation, the values of the car ontology vector are placed in the formula; the resulting value is 0.999, almost 1. In the best case, when the two vectors match, the angle between them is zero and the cosine of the angle equals one, so the calculation above confirms the accuracy of the formula. The set of β coefficients belonging to the lexical objects is taken as the vector B; in fact, this vector specifies the weight, or credit, of each lexical object.
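A sketch of this fitting step on the TABLE II data (our reconstruction; since the paper's numeric coefficients did not survive the source scan, the fitted B here will not exactly reproduce its y = 0.9346 example):

```python
import numpy as np

# Fit the multiple linear regression C ~ lexical objects on the TABLE II
# training data (column order as in TABLE III).
X = np.array([  # PhoneNr, Feature, Price, Mileage, Model, Make, Year
    [0.8, 1.5, 1.2, 0.4, 0.8,  1.4, 1.5],
    [0.7, 0.7, 1.6, 0.2, 1.7,  1.4, 0.8],
    [1.2, 1.2, 0.9, 0.4, 0.6,  0.8, 2.7],
    [1.3, 1.9, 0.9, 0.4, 0.7,  0.7, 1.3],
    [0.8, 1.1, 1.3, 0.3, 1.3,  1.1, 1.6],
    [0.7, 1.6, 1.1, 0.3, 0.6,  1.1, 1.8],
    [0.7, 1.6, 1.3, 0.4, 1.3,  0.7, 1.3],
    [0.9, 2.6, 0.6, 0.2, 0.6,  0.6, 0.9],
    [1.7, 1.2, 0.4, 0.4, 0.7,  0.6, 1.8],
    [1.2, 1.8, 0.9, 0.3, 1.01, 1.2, 1.1],
])
C = np.array([0.9, 0.8, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9])

# Least-squares fit with an intercept column; B holds beta_0 .. beta_7.
A = np.hstack([np.ones((len(X), 1)), X])
B, *_ = np.linalg.lstsq(A, C, rcond=None)
print(B)         # the weight/credit vector B of the paper
print(A[0] @ B)  # predicted cos(theta) for the first training document
```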

B. Step 2: Comparison between viewed value and expected value in a sample document

After receiving a document whose relation to the ontology is to be determined, the document vector is constructed first. Then, using the following formula, we calculate the optimized document vector, normalizing the counts per record:

DV_opt = DV / r, where r is the number of records in the document

Then we multiply both vectors elementwise by the weight vector B obtained in step 1; the results are denoted E (the expected frequencies, from the ontology vector) and y (the viewed frequencies, from the optimized document vector). In this way, the credit of each lexical object is reflected in the two vectors E and y. With the following Pearson test, an acceptance region for the document-ontology relation can then be defined:

χ² = Σⱼ (yⱼ − Eⱼ)² / Eⱼ,  j = 1, …, k

where Eⱼ is the expected frequency, yⱼ is the viewed frequency, and k is the number of lexical objects defined in the ontology. If the relation

χ² ≤ χ²(α, k−1)

is true, the calculations are accepted within the confidence interval.
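A sketch of this acceptance test (ours), using SciPy's chi-square quantile function in place of a printed table; the E and y values below are illustrative only:

```python
from scipy.stats import chi2

# Pearson's chi-square acceptance test for step 2, assuming the expected
# frequencies E and viewed frequencies y are already weighted by B.
def accept(y, E, alpha=0.01):
    stat = sum((yj - ej) ** 2 / ej for yj, ej in zip(y, E))
    critical = chi2.ppf(1 - alpha, df=len(E) - 1)  # chi-square table lookup
    return stat <= critical

E = [0.975, 0.925, 0.908, 0.45, 0.8, 2.1, 1.0]   # illustrative values only
y = [1.1,   0.9,   0.85,  0.5,  0.7, 1.9, 1.2]
print(accept(y, E))
```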



The critical values χ²(α, k−1) are taken from the chi-square distribution table. To perform the calculations in the above formula, we use the values of the vector E as the expected frequencies and the values of the vector y as the viewed frequencies. If the condition is established, the optimized document vector is acceptable within the range of the valid ontology vector. If it is not met, we can declare the document unrelated to the ontology; if it is met, we go on to the next step to classify the document properly and complete the calculations.

C. Step 3: Calculating the document density and comparing the expected and viewed density of each record

This step is based on densities calculated per record. The calculations are performed in four stages:

1) First stage: calculating the number of characters and records of the sample document. Since the documents under study are semi-structured HTML, we can count the total records of a document from its structure. To continue, the number of records in the sample document and its total number of characters are counted.

2) Second stage: calculating the viewed density of each ontology lexical object in the sample document. The ontology specifies the maximum acceptable number of occurrences of each lexical object in one record. Given these limits and the number of records in the sample document, we decide how many of the viewed occurrences of each lexical object to accept. The viewed occurrences are those counted in the document vector, and the limits of the car ontology are likewise represented as a vector. Except for Feature and PhoneNr, the maximum number of each lexical object per record in the ontology equals one. For these two special cases the ontology is unbounded, but in this paper we adopt an acceptable limit for the calculations: usually no more than five Feature values and two PhoneNr values per record, and these same limits are used in the limit vector. Under these conditions, and with regard to the number of records in the document, we modify the occurrence vector into a vector of accepted occurrences. For example, if the Year object occurs 16 times and there are 15 records in the document, only 15 occurrences of the Year object are accepted. The maximum number of characters each lexical object can have is defined by the limits of the destination ontology (for example, line 10 of the car ontology), and this limit is kept in a maximum-characters vector. Using the following formula, the maximum number of viewed acceptable characters is calculated for each lexical object:

accepted characters = accepted occurrences × maximum characters per value

Also, dividing the number of accepted characters of each lexical object by the number of viewed characters of the document, we calculate the viewed density of each lexical object.

3) Third stage: calculating the expected density of each ontology lexical object in the sample document. Using the ontology vector and the number of records of the sample document, we can calculate the approximate number of expected occurrences of each lexical object:

expected occurrences = average participation × number of records

Then, with the following formula, the number of expected characters is calculated for each object:

expected characters = expected occurrences × maximum characters per value

After calculating the number of expected characters of each object, and considering the number of characters in the document, we obtain the expected density of each lexical object in the sample document:

expected density = expected characters / total characters of the document

Because we cannot know how the expected lexical values would actually occur, we use the maximum number of characters allowed for each lexical object as defined in the ontology. For uniformity of comparison, we therefore also ignore the actual character counts of the viewed lexical values and use the maximum for each lexical object, although this number differs slightly from the real size.
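The stage-2 and stage-3 formulas translate into a few lines. The following sketch (ours; the function names are assumptions, the maximum character count and document size are illustrative, and the limits and averages are read from Figure 1) reproduces the Year example above:

```python
# Viewed and expected density per lexical object (stages 2-3).
def viewed_density(occurrences, limit_per_record, records, max_chars, total_chars):
    accepted = min(occurrences, limit_per_record * records)  # cap viewed occurrences
    return (accepted * max_chars) / total_chars              # accepted chars / document size

def expected_density(avg_participation, records, max_chars, total_chars):
    expected_occurrences = avg_participation * records       # average * number of records
    return (expected_occurrences * max_chars) / total_chars  # expected chars / document size

# Year object in a 15-record document of 12,000 characters, 16 occurrences seen
# (max_chars=4 and total_chars=12000 are assumed for illustration):
print(viewed_density(16, 1, 15, 4, 12000))    # only 15 occurrences accepted
print(expected_density(0.975, 15, 4, 12000))  # average participation from Figure 1
```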

4) Fourth stage: comparing the two vectors of expected lexical-object density and viewed lexical-object density. In this stage, we compare the expected-density and viewed-density vectors of the sample document, and then test the validity of the calculations within a reliable range. Before doing so, all values of the two vectors are divided by the acceptable density value of the sample document.



With this division, the values are scaled by the acceptable density: the larger the acceptable density, the closer each element of the resulting density vectors is to zero, and hence the closer the test statistic is to the defined confidence region. The acceptable density of a document is calculated with the following formula:

acceptable density = total accepted characters / total characters of the document

where the total accepted characters are obtained by summing the elements of the accepted-characters vector. Dividing the viewed-density and expected-density vectors by this value forms their optimized counterparts. We then measure the two vectors with the same Pearson test used in step 2, taking the optimized expected densities as the expected frequencies and the optimized viewed densities as the viewed frequencies. If the condition χ² ≤ χ²(α, k−1) is established, the viewed-density vector is acceptable with respect to the expected-density vector within the acceptable range.

V. EXPERIMENTAL RESULTS

To evaluate the results, we test the recommended system on three categories of documents: the first group includes 10 related documents, the second group 10 non-relevant documents, and the third group 8 non-relevant documents that are similar to the car ontology. The model performance is evaluated by computing the recall and precision ratios on each test document set.

TP: the number of documents related to the ontology that the proposed method recognizes as related. FN: the number of documents related to the ontology that the proposed method wrongly detects as unrelated. FP: the number of documents unrelated to the ontology that the proposed method wrongly recognizes as related. The addresses of the related documents, the unrelated documents, and the unrelated documents similar to the ontology are cited from Mr. Wang's paper [9]. As stated, the proposed method comprises two test steps; for each step, a column titled accept is reported: if the calculation result lies in the acceptable range we mark accept, and otherwise reject. The last column of the table gives the type of diagnosis, which the proposed algorithm completes under the following conditions:

1- Related: the first and the second heuristic are both accepted.

2- Unrelated: the first heuristic is accepted and the second heuristic is rejected.

3- Unrelated: the first heuristic is rejected.

The test results of the proposed method are shown in TABLE IV: the first 10 rows are the related documents, the next 10 rows the unrelated documents, and the last 8 rows the unrelated documents that are similar to the ontology. According to

the information in TABLE IV, TP = 10, FN = 0, and FP = 2. Thus, the values of recall and precision follow:

Recall = TP / (TP + FN) = 10 / 10 = 100%

Precision = TP / (TP + FP) = 10 / 12 ≈ 83%

TABLE IV. Results of the proposed method tested on a set of documents

Row | Accept Step1 | Accept Step2 | Recognize
1 | 0.1281 | 0.1071 | Related
2 | 0.4001 | 0.073 | Related
3 | 0.2249 | 0.0795 | Related
4 | 0.0321 | 0.1287 | Related
5 | 0.2189 | 0.0868 | Related
6 | 0.1354 | 0.1712 | Related
7 | 0.1067 | 0.7997 | Related
8 | 0.0680 | 0.5440 | Related
9 | 0.2213 | 0.1731 | Related
10 | 0.0181 | 0.0700 | Related
11 | -1.044 | -9.963 | Unrelated
12 | 0.475 | -18.987 | Unrelated
13 | -1.229 | -13.980 | Unrelated
14 | 0.331 | -2.814 | Unrelated
15 | -0.972 | -1.840 | Unrelated
16 | 0.848 | -12.965 | Unrelated
17 | 0.434 | -2.920 | Unrelated
18 | 0.734 | -1.122 | Unrelated
19 | 0.667 | -2.521 | Unrelated
20 | 0.239 | 0.4261 | Related
21 | 0.649 | -0.963 | Unrelated
22 | 0.816 | -1.447 | Unrelated
23 | 0.653 | -1.016 | Unrelated
24 | 0.687 | -0.961 | Unrelated
25 | 0.654 | -1.309 | Unrelated
26 | 0.434 | -0.960 | Unrelated
27 | 0.700 | -1.331 | Unrelated
28 | 0.298 | 0.1165 | Related

If we increase the accepted error rate, the system accuracy is reduced. The test results were also evaluated within a 95%



confidence interval: the recall rate remained at 100%, while the precision decreased to 50%. Under normal conditions, the system performance is considerably better, because the set of irrelevant documents selected for this evaluation was deliberately very similar to the car ontology. Ordinarily, the unrelated documents submitted to the system for classification are not this similar to the related documents.

REFERENCES

[1] M. Vickers, "Ontology-Based Free-Form Query Processing for the Semantic Web," 2006.

[2] A. Wessman, "A Framework for Extraction Plans and Heuristics in an Ontology-Based Data-Extraction System," 2005.

[3] Y. Zhou, "Generating Data-Extraction Ontologies by Example," 2005.

[4] "Manually vs. Semiautomatic Domain-Specific Ontology Building" (tesi di laurea), 2003.

[5] Wang, "Source Discovery and Schema Mapping for Data Integration," 2003.

[6] Zhang and Chen, "Ontology-Driven Adaptive Web Information Extraction System," 2001.

[7] D. W. Embley, "Extracting and Structuring Web Data," Brigham Young University, 2002.

[8] D. W. Embley and Y.-K. Ng, "Recognizing Ontology-Applicable Multiple-Record Web Documents," 2001.

[9] Q. Wang, "A Binary-Categorization Approach for Classifying Multiple-Record Web Documents Using a Probabilistic Retrieval Model," 2001.

[10] J. Hendler and D. McGuinness, "Agents and the Semantic Web," 2001.

[11] D. Embley and D. Campbell, "Conceptual-Model-Based Data Extraction from Multiple-Record Web Pages," Data and Knowledge Engineering, 1999.

[12] D. Embley and N. Fuhr, "Ontology Suitability for Uncertain Extraction of Information from Multi-Record Web Documents," June 22, 1999.

[13] J. Neter and W. Wasserman, "Applied Linear Regression Models," 1983.
