[IEEE 2010 Second International Conference on Computer and Network Technology - Bangkok, Thailand (2010.04.23-2010.04.25)] 2010 Second International Conference on Computer and Network

Angle Prediction between document vector and ontology vector, using multiple linear regressions

Reza Mohamadi Bahram Abadi Student Of Azad University Of Oloum-va-Tahghighat

Ahwaz, Iran [email protected]

Mohammadi Hossein Yektaie Faculty Member

Islamic Azad University Of Abadan Abadan, Iran

[email protected]

Mashallah abbasi Faculty Member

Islamic Azad University Of Oloum-va-Tahghighat Ahwaz, Iran

[email protected]

Abstract - With the rapid growth of information on the World Wide Web, users find it difficult to access the documents they require. The purpose of this paper is to present a method that makes user searches more systematic and more narrowly scoped by means of statistical techniques. To this end, we use a multiple linear regression model to derive a formula that models the relation between the lexical objects of an ontology and the ontology itself. To apply the method to a sample document, we count the values observed in that document that conform to the lexical objects in the ontology, and from these counts we form the document vector. By substituting the optimized document vector into the regression formula, we can predict the angle between the document vector and the ontology vector. The closer the angle is to zero, the more closely the document is related to the ontology. Experimental results show that the recommended method predicts this angle with 100% accuracy.

Keywords: application-ontology; Web documents; multiple linear regression; Information filtering.

I. INTRODUCTION

The World Wide Web consists of a large number of Web documents, and it is difficult for users to reach the documents they want. To find information of interest and obtain valid data, users need targeted search methods. The main problem is that most of the information in Web pages is understandable to humans, while machines cannot understand its meaning [2]. If Web pages were designed semantically, extracting semantic information from them would be easy. However, the pages already on the World Wide Web have not been implemented semantically, so we must use technology that simulates meaning on top of contemporary Web pages. Therefore, we need intelligent programs that can read Web pages and their data and organize the relationships among them into a structured form [10]. Ontology-based semantic extraction is one such method. Ontology-based information extraction does not depend on a fixed Web page structure; instead, it depends on the content of the documents and is applied within a specific field of knowledge [1]. To cope with the vast and varied information on the Web, a document's relation to the ontology should be established before semantic information is extracted from it. In fact, filtering documents, that is, separating those related to the ontology from those that are not, improves the search results on which information extraction is performed. In this paper, we present a formula, obtained by multiple linear regression, that models the relation among the lexical objects in an ontology. Using this model and the optimized document vector, we predict the angle between the document vector and the ontology vector. The closer this angle is to zero, the more closely the document is related to the ontology.

II. RELATED WORKS

To limit the scope of users' searches when extracting semantic information from Web documents, the documents related to the ontology must be filtered from the documents available on the World Wide Web. Several methods have been proposed for determining how a document relates to an ontology. The expected-values heuristic is one of them. This heuristic uses the vector space model, in which there are two vectors: the ontology vector and the document vector. After constructing these two vectors, we judge the document's relation to the ontology by checking the cosine of the angle between them [8]. In 2001, Quan Wang proposed using a probabilistic retrieval model to determine how documents relate to an ontology. This method also uses the expected-values heuristic; the difference is that the expected value is not calculated for the document as a whole but for each lexical object separately. To represent the heuristic results for a document, a vector of length n+2 is used. Two elements of the vector are the density value (y) and the grouping value (z), and the other n elements are the expected values of the n lexical objects in a sample document.

978-0-7695-4042-9/10 $26.00 © 2010 IEEE

DOI 10.1109/ICCNT.2010.121

To decide how a document relates to the ontology, logistic regression and the probabilistic retrieval model are used. The degree of relation is given by the following formulas.

Given that the probability is bounded (0 < p < 1), the smaller the difference between the values obtained from the formulas above, the more closely the document is related to the ontology [9]. The method recommended in this paper attempts to merge the two methods described above and presents a new way of predicting the angle between the document vector and the ontology vector. This method can accurately predict the angle between the vector of a sample document and the ontology vector.

III. PRELIMINARIES

A. Application Ontology

To provide a concrete basis for this article, we define a sample application ontology as a conceptual model. This model represents a real environment within a limited scope. It can be expressed in two equivalent forms, graphical and textual. The application ontology considered here covers the domain of car ads [11]. Figure 1 shows a portion of the textual representation of the car-ads ontology, which includes all object and relationship sets and cardinality constraints (lines 1-9) and a few lines of its data frames (lines 10-19). The figure shows only three of the regular expressions; a complete car-ads ontology requires 165 regular expressions. In the textual view, the notation [-> object] marks the non-lexical object of interest, that is, the main subject of the ontology (here, the car ad). The min:max or min:ave:max constraint specified next to the connection between an object set and a relationship set in a graphical representation is the participation constraint of the object set in the relationship set. Min, ave, and max denote the minimum, average, and maximum number of times an object in an object set can, or is expected to, participate in a relationship set, respectively, whereas * designates an unknown but finite maximum number of times an object in an object set can participate in a relationship set. In the textual representation of the car-ads ontology, the participation constraints are listed from line 2 to line 9 [8].

1. Car [-> object];
2. Car [0:0.908:1] has Model [1:*];
3. Car [0:0.925:1] has Make [1:*];
4. Car [0:0.975:1] has Year [1:*];
5. Car [0:0.8:1] has Price [1:*];
6. Car [0:0.45:1] has Mileage [1:*];
7. PhoneNr [1:*] is for Car [0:1];
8. PhoneNr [0:1] has Extension [1:*];
9. Car [0:2.1:*] has Feature [1:*];
10. Make matches [10] case insensitive
11. constant
12. { extract "\bchev\b"; }, { extract "\bchevy\b"; }, { extract "\bdodge\b"; },
13. …
14. end;
15. Model matches [16] case insensitive
16. constant
17. { extract "88"; context "\bolds\S*\s*88\b"; },
18. …
19. end;

Figure 1. Car-ads ontology - textual

Regular expressions place constraints on lexical objects. For example, lines 10 to 14 define the constraints for the Make object: a value of this object can be at most 10 characters long, and the keywords associated with the object are defined in this section [8]. We can extract the related keywords by using the data frames provided with the ontology, comparing the strings in the text against the regular expressions in the data frames.
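The matching step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three patterns are the ones visible in the Make data frame of Figure 1, while the helper name `count_constants` and the sample ad text are hypothetical and do not come from the paper.

```python
import re

# Patterns taken from the Make data frame shown in Figure 1 (lines 10-14).
# Only the three expressions visible in the figure are used here.
MAKE_PATTERNS = [r"\bchev\b", r"\bchevy\b", r"\bdodge\b"]


def count_constants(text, patterns):
    """Count the substrings of `text` that match any of the given patterns."""
    total = 0
    for pattern in patterns:
        # re.IGNORECASE mirrors the "case insensitive" clause of the data frame.
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total


# Hypothetical ad text, invented for illustration only.
sample_ad = "1995 Chevy Blazer, clean. Also a Dodge Ram and a chev pickup."
make_count = count_constants(sample_ad, MAKE_PATTERNS)
print(make_count)
```

A count obtained this way for each lexical object set would supply one coefficient of a document vector.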

B. Regression analysis

One of the main goals of much statistical research is to establish dependencies that allow one or more variables to be predicted from others. Regression is one of the tools with which such a relationship can be obtained. Regression analysis is a statistical tool for studying the relationship between a dependent variable and a set of independent variables. If more information associated with the subject is taken into account, the predictions can be improved. The most common linear equation used for regression relations is as follows [13]:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$$

In the equation above, y is a random variable whose values we want to predict from the known values $x_1, x_2, \dots, x_k$, and $\beta_0, \beta_1, \dots, \beta_k$, the multiple regression coefficients, are constants that must be determined from the observed data. One of the main conditions of multiple linear regression is that the independent variables be linearly independent [13]. In this paper, we use the lexical objects of an application ontology as the independent variables of the multiple linear regression.
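To make the role of the regression coefficients concrete, the following sketch estimates them by ordinary least squares through the normal equations. It is an assumption-laden illustration, not the procedure used in the paper (which relies on SPSS): the function names are hypothetical, and the data are synthetic values generated from y = 1 + 2*x1 + 3*x2.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so that row operations update both sides together.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_multiple_regression(xs, ys):
    """Return [b0, b1, ..., bk] for the least-squares fit y = b0 + sum(bi * xi)."""
    design = [[1.0] + list(row) for row in xs]  # leading 1 yields the intercept b0
    k = len(design[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(beta, x):
    """Evaluate the fitted linear equation at one observation x."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))


# Synthetic observations generated from y = 1 + 2*x1 + 3*x2 (not from the paper).
xs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in xs]
beta = fit_multiple_regression(xs, ys)
print([round(b, 6) for b in beta])
```

The normal equations have a unique solution only when the columns of the design matrix are linearly independent, which is the condition on the independent variables mentioned above.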

C. Expected-Values Heuristic

We apply the vector space model (VSM) to measure whether a multiple-record Web document D has the number of values expected for each lexical object set of an application ontology O. Based on the lexical object sets and the participation constraints in O, we construct an ontology vector OV. Based on the same lexical object sets and the number of constants recognized for these object sets by O in D, we construct a document vector DV. We measure the relevance of D to O with respect to our expected-values heuristic by observing the cosine of the angle between DV and OV [8]. To construct the ontology vector OV, we (1) identify the lexical object-set names, which become the names of the coefficients of OV, and (2) determine the average participation for each lexical object set with respect to the object set of interest specified in O, which become the values of the coefficients of OV [8]. The car ontology vector, based on the lexical objects defined in the ontology, is as follows:

The names of the coefficients of DV are the same as the names of the coefficients of OV. We obtain the value of each coefficient of DV by automatically counting the number of appearances of constant values in D that belong to each lexical object set. Observe that for document vectors we use the actual number of constants found in a document. To get the average (normalized for a single record), we would have to divide by the number of records, a number we do not know with certainty. Therefore, we do not normalize, but instead merely compare the cosine of the angles between the vectors to get a measure for our expected-values heuristic. As mentioned, we measure the similarity between an ontology vector OV and a document vector DV by measuring the cosine of the angle between them. In particular, we use the Similarity Cosine Function, which calculates the acute angle [8]:

$$\cos\theta = \frac{P}{N}$$

Here P is the inner product of the two vectors, and N is the product of the lengths of the two vectors. When the distribution of values among the object sets in DV closely matches the expected distribution specified in OV, the angle θ will be close to zero, and cos θ will be close to one.
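The construction of OV and the cosine measure can be sketched as follows. The sketch rests on several assumptions: it reads only the averages that appear in the Figure 1 excerpt (the PhoneNr line shows no average there, so that object set is left out), the two document vectors are hypothetical, and the function names are illustrative rather than taken from [8].

```python
import math
import re

# Participation constraints copied from Figure 1 (lines 2-6 and 9).
# Line 7 (PhoneNr) shows no average in the figure, so it is omitted here.
CONSTRAINTS = """
Car [0:0.908:1] has Model [1:*];
Car [0:0.925:1] has Make [1:*];
Car [0:0.975:1] has Year [1:*];
Car [0:0.8:1] has Price [1:*];
Car [0:0.45:1] has Mileage [1:*];
Car [0:2.1:*] has Feature [1:*];
"""


def ontology_vector(constraints):
    """Map each lexical object-set name to its average participation (ave)."""
    pattern = re.compile(r"Car \[[^:\]]+:([0-9.]+):[^\]]+\] has (\w+)")
    return {name: float(avg) for avg, name in pattern.findall(constraints)}


def cosine(ov, dv):
    """Return cos(theta) = P / N for two vectors keyed by object-set name."""
    names = sorted(ov)
    p = sum(ov[k] * dv.get(k, 0.0) for k in names)  # inner product P
    length_ov = math.sqrt(sum(ov[k] ** 2 for k in names))
    length_dv = math.sqrt(sum(dv.get(k, 0.0) ** 2 for k in names))
    n = length_ov * length_dv  # product of the two lengths, N
    return p / n if n else 0.0


ov = ontology_vector(CONSTRAINTS)

# Hypothetical raw counts: a page whose counts follow the expected distribution
# (about ten records), and a page that contains only Feature values.
matching_dv = {name: 10 * value for name, value in ov.items()}
skewed_dv = {"Feature": 5}

cos_matching = cosine(ov, matching_dv)
cos_skewed = cosine(ov, skewed_dv)
print(round(cos_matching, 4), round(cos_skewed, 4))
```

Because the cosine depends only on direction, multiplying every count by the (unknown) number of records leaves the measure unchanged, which is why the document vector does not need to be normalized.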

IV. IMPLEMENTING THE RECOMMENDED SYSTEM

A. Definition of the recommended system

Statistical methods are used to predict and calculate the angle between the document vector and the ontology vector. First, we consider a set of documents associated with the ontology. Then, using the expected-values heuristic, information is extracted from these documents and used to build a multiple linear regression model, which is presented in this section. If the values of a document's optimized vector are substituted into the regression formula, the angle between the ontology vector and the document vector is obtained. The smaller this angle, and the closer its cosine is to one, the more closely the document is related to the ontology. The Web documents used in this project are semi-structured HTML documents, and the ontology used is the car-ads ontology.

B. Implementation

To determine how the lexical objects of a sample ontology relate to that ontology, and to model this relation, a formula is presented. This formula is used to predict the angle between the document vector and the ontology vector. The method can also be substituted for the cosine formula in order to calculate the angle θ. To build the multiple linear regression model, we need a set of ontology-related documents. For this purpose, we use the ontology-related documents shown in TABLE I [9].

TABLE I: Websites related to the car ontology

Row  URL
1    http://www.delmarvaclassfield.com
2    http://www.thetelegraph.com
3    http://www.vermontclassifieds.com
4    http://www.ndweb.com/mdnonline/mdnonline.html
5    http://www.adn.com
6    http://www.hawaiisnews.com/cars
7    http://www.brewtonstandard.com
8    http://www.aikenstandard.com/
9    http://adaeveningnews.com
10   http://www.tahoe.com/classifieds/tdt/9100.html

With the help of the expected-values heuristic, we calculate the document vector and then the optimized document vector for each document separately. As baseline information, the angle θ is first obtained from its cosine for each document. The optimized document vectors for these websites are shown in TABLE II [9]. We use this information to build the multiple linear regression model (variable C is the same as cos θ).

TABLE II: Expected values of the car ontology lexical objects per document

PhoneNr  Feature  Price  Mileage  Model  Make  Year  C
0.8      1.5      1.2    0.4      0.8    1.4   1.5   0.9
0.7      0.7      1.6    0.2      1.7    1.4   0.8   0.8
1.2      1.2      0.9    0.4      0.6    0.8   2.7   0.9
1.3      1.9      0.9    0.4      0.7    0.7   1.3   0.9
0.8      1.1      1.3    0.3      1.3    1.1   1.6   0.9
0.7      1.6      1.1    0.3      0.6    1.1   1.8   0.9
0.7      1.6      1.3    0.4      1.3    0.7   1.3   0.9
0.9      2.6      0.6    0.2      0.6    0.6   0.9   0.9
1.7      1.2      0.4    0.4      0.7    0.6   1.8   0.9
1.2      1.8      0.9    0.3      1.01   1.2   1.1   0.9
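As an illustration of how the rows of TABLE II could feed a regression, the sketch below splits each row into seven independent values and the dependent value C, and converts each C back into an angle. The numbers are transcribed from TABLE II; the variable names and the conversion to degrees are assumptions made for this example.

```python
import math

COLUMNS = ["PhoneNr", "Feature", "Price", "Mileage", "Model", "Make", "Year", "C"]

# Rows transcribed from TABLE II; the last entry of each row is C = cos(theta).
TABLE_II = [
    [0.8, 1.5, 1.2, 0.4, 0.8, 1.4, 1.5, 0.9],
    [0.7, 0.7, 1.6, 0.2, 1.7, 1.4, 0.8, 0.8],
    [1.2, 1.2, 0.9, 0.4, 0.6, 0.8, 2.7, 0.9],
    [1.3, 1.9, 0.9, 0.4, 0.7, 0.7, 1.3, 0.9],
    [0.8, 1.1, 1.3, 0.3, 1.3, 1.1, 1.6, 0.9],
    [0.7, 1.6, 1.1, 0.3, 0.6, 1.1, 1.8, 0.9],
    [0.7, 1.6, 1.3, 0.4, 1.3, 0.7, 1.3, 0.9],
    [0.9, 2.6, 0.6, 0.2, 0.6, 0.6, 0.9, 0.9],
    [1.7, 1.2, 0.4, 0.4, 0.7, 0.6, 1.8, 0.9],
    [1.2, 1.8, 0.9, 0.3, 1.01, 1.2, 1.1, 0.9],
]

# Independent variables (the seven lexical objects) and the dependent variable C.
X = [row[:-1] for row in TABLE_II]
y = [row[-1] for row in TABLE_II]

# Angle in degrees that corresponds to each cosine value.
angles_deg = [math.degrees(math.acos(c)) for c in y]
for index, (c, angle) in enumerate(zip(y, angles_deg), start=1):
    print(f"row {index}: cos = {c:.1f}, theta = {angle:.2f} degrees")
```

With ten observations and seven independent variables plus an intercept, only two degrees of freedom remain for the residuals, so a fit on this table alone would follow the training values very closely.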

The lexical objects in the car ontology are used as the independent variables in the regression. These objects are not related to one another. The angle between the optimized document vector and the ontology vector is the dependent variable. The order of the lexical objects in document i and the corresponding regression variables are given in TABLE III, and the regression equation is formed on this basis.

TABLE III: Lexical object and variables used in regression

PhoneNr Feature Price Mileage Model Make Year c

SPSS software is used to carry out the regression. After defining the initial settings and executing the commands that build the regression, the β coefficient belonging to each lexical object is determined. The multiple linear regression formula and the coefficients obtained in the final model are shown below.

The formula above predicts the cosine of the angle between the document vector and the ontology vector from the number of occurrences of each lexical object in the document. To check the accuracy of the calculations, the values of the ontology vector are substituted into the formula above and y is calculated.

In the best case, when the two vectors match, the angle between them is equal to zero and the cosine of this angle is equal to one. This calculation confirms the accuracy of the obtained formula.

C. Downloading Web document samples to evaluate

Document D¹ is tested as the sample. Using the car ontology, the document vector is calculated as

We optimize this vector using the ontology vector. To do this, we calculate these two vectors.

1 http://www.elkintribune.com

The optimized document vector is calculated as follows.

The values of this vector are substituted into the regression formula, and we then calculate the y value, which is equal to:

y = 0.99489

The value calculated with formula 5 is cos θ = 0.9956. It can be concluded that the recommended method calculates the angle between the ontology vector and the document vector with high accuracy.
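The two reported values can be compared directly as angles. The short sketch below only restates the arithmetic: it takes y = 0.99489 and cos θ = 0.9956 from the text above and converts both to degrees; the variable names are illustrative.

```python
import math

predicted_cos = 0.99489  # y obtained from the regression formula (reported above)
measured_cos = 0.9956    # cos(theta) obtained from the cosine formula (reported above)

predicted_deg = math.degrees(math.acos(predicted_cos))
measured_deg = math.degrees(math.acos(measured_cos))

abs_error_cos = abs(predicted_cos - measured_cos)
abs_error_deg = abs(predicted_deg - measured_deg)

print(f"predicted: {predicted_deg:.2f} deg, measured: {measured_deg:.2f} deg")
print(f"difference: {abs_error_cos:.5f} in cosine, {abs_error_deg:.2f} deg")
```

Both values correspond to angles between five and six degrees, and they differ by less than half a degree.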

V. EXPERIMENTAL RESULTS

In this section, we evaluate the recommended model. For this purpose, the recommended method was tested on a set of documents. In all cases, the recommended method correctly predicted the angle between the document vector and the ontology vector, with recall and precision rates of 100%.

REFERENCES

[1] Mark Vickers, "Ontology-Based Free-Form Query Processing Web", 2006.

[2] Alan Wessman, "A Framework for Extraction Plans and Heuristics in an Ontology-Based", 2005.

[3] Yuanqiu Zhou, "Generating Data-Extraction Ontologies by Example", 2005.

[4] Tesi di Laurea, "Manually vs Semiautomatic Domain Specific Ontology Building", 2003.

[5] Wang, "Source Discovery and Schema Mapping for Data Integration", 2003.

[6] Zhang, Chen, "Ontology-Driven Adaptive Web Information Extraction System", 2001.

[7] David W. Embley, "Extracting and Structuring Web Data", Brigham Young University, 2002.

[8] D. W. Embley, Y.-K. Ng, "Recognizing Ontology-Applicable Multiple-Record Web Documents", 2001.

[9] Wang, "A Binary-Categorization Approach for Classifying Multiple-Record Web Documents Using a Probabilistic Retrieval Model", 2001.

[10] J. Handler and D. McGuiness, "Agents and the Semantic Web", 2001.

[11] D. Embley, D. Campbell, "Conceptual-Model-Based Data Extraction from Multiple-Record Web Pages", Data and Knowledge Engineering, 1999.

[12] David Embley, Norbert Fuhr, "Ontology Suitability for Uncertain Extraction of Information from Multi-Record Web Documents", June 22, 1999.

[13] John Neter, William Wasserman, "Applied Linear Regression Models", 1983.
