Generic Hybrid Semantic Search Approach

    JOURNAL OF COMPUTER SCIENCE AND ENGINEERING, VOLUME 17, ISSUE 2, FEBRUARY 2013

    Generic Hybrid Semantic Search Approach

    Reham H. El-Deeb, Abdel Fatah A. Hegazy and Aly A. Fahmy

    Abstract: Because information mining is a tedious, repetitive procedure, enhancing research in the area is essential. The main idea behind this line of research has been the formality or informality of the search interface. In previously published research [1], we proposed a Semantic Form-Based Guided Search System (SFBGSS), which combines the advantages of both natural language interfaces (NLIs) and guided semantic search engines. The purpose of that research was to enhance the precision of guided systems by adding a form-based interface. The precision and recall of the previously proposed SFBGSS were 98% and 93% respectively, outperforming the precision and recall of NLP-Reduce by 38% and 43% respectively. However, we concluded that it would be better to have a generic user interface that extracts the classes, instances, and properties from the data sets, i.e. one able to produce the list values at runtime from the data set hierarchy. In this new research, a Generic Hybrid Semantic Search Approach (GHSSA) is proposed that can be applied to any standard data set and that performs data profiling using the five-number summary statistical model. This in turn maximizes the benefits of our new search engine and extends its scope of use. The GHSSA was tested on the three Mooney data sets [2]. To set the bar a little higher, we also tested our approach on an Arabic data set, which adds a new dimension by making the approach multilingual and flexible.

    Index Terms: Semantic web, semantic guided search, form-based search engines, data profiling.

    1 INTRODUCTION

    The significance of semantic search has led to the emergence of natural language interfaces (NLIs) that let users convey their information needs in natural language, which substitutes for the user's formal knowledge of ontologies. Through knowledge sharing and exchange, ontologies act as a substantial pillar of the semantic web. Although NLIs have been through many evolutionary stages, they still cannot generate precise results because they cannot understand the whole query. Most natural language interfaces only recognize a part of the natural language query.

    In addition, NLIs do not give users any information about the resources available to search, which leaves users uncertain about which queries will be answered appropriately. This issue is tackled in [3]: "Users need knowledge of what it is possible to ask in a particular domain", and likewise in [4]: "Often, users would attempt to paraphrase a sentence many times when the reason for the system's lack of understanding was due to the fact that the system did not have data about the query being asked." This divergence is called the habitability problem. Guided systems are another type of semantic search engine. They are not as flexible as NLIs, but they have high precision rates.

    Semantic search tools can be divided into four groups depending on their user interface approach: keyword-based, view-based, natural language based and form-based systems [5]. Keyword-based systems accept several keywords as input and generate their equivalent semantic entities. They appear to be regular information retrieval systems, but they let users identify their information needs precisely by interpreting each query term as a semantic phrase. View-based systems support query creation and domain investigation through the presentation and navigation of the ontology structure. Natural language systems translate natural language sentences submitted by the user into ontological queries via various linguistic techniques. Form-based systems guide the user in constructing semantic queries by means of form structure and form controls, bearing in mind the ontology structures. The more difficult issue with this approach is its scalability to very large ontologies, given the limit on the number of items that a scrolling list can hold and on the number of form controls that a form can contain. These issues might therefore limit the usability of form-based interfaces.

    That is why we believe that data profiling is an essential milestone in overcoming the limitations of form-based systems, in addition to improving data quality and the understanding of data.

    One of the most valuable technologies for enhancing data accuracy is data profiling. It examines the data in an existing data source and gathers information and statistics about that data. Data profiling uses aggregates such as count and sum, in addition to various descriptive statistics such as minimum, maximum, mean, standard deviation, and variation. Different analyses are executed at different structural levels. For instance, each column can be profiled independently to understand the frequency distribution of different types and values as well as the use of columns.

    R. El-Deeb is with the Arab Academy for Science, Technology & Maritime Transport, Cairo, Egypt.

    A. Hegazy is with the Arab Academy for Science, Technology & Maritime Transport, Cairo, Egypt.

    A. Fahmy is with the Faculty of Computers and Information, Cairo University, Cairo, Egypt.

    Data profiling has various techniques. One of them is the five-number summary, a descriptive statistic that offers information about a set of observations. It is composed of the five most important percentiles: the sample minimum, the lower quartile, the median, the upper quartile, and the sample maximum. Compared with the mean and standard deviation, the five-number summary is, in most cases, better, especially for describing a skewed distribution or a distribution with excessive outliers. The mean and standard deviation are reasonable for outlier-free distributions that are considered symmetric. In real life we cannot always expect the data to be symmetric. It is common practice to report the number of observations (n), mean, median, standard deviation, and range when summarizing data.
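    The five-number summary described above can be illustrated with a short Python sketch. It is only an illustration, not the GHSSA implementation: the function name and the sample values are made up, and the "inclusive" quartile method is one of several common conventions, chosen here as an assumption.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two observations")
    # statistics.quantiles with n=4 returns the three quartile cut points.
    # The "inclusive" method treats the data as the whole population;
    # this choice of convention is an assumption of the sketch.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (data[0], q1, q2, q3, data[-1])


# Made-up sample values: one symmetric set and one with a single outlier.
symmetric = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]

print(five_number_summary(symmetric))
print(five_number_summary(with_outlier))
# The mean is pulled toward the outlier, while the quartiles are not.
print(statistics.mean(symmetric), statistics.mean(with_outlier))
```

    In the second sample only the maximum changes, while the mean moves from 3 to 22, which is the robustness to outliers that the paragraph above refers to.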

    2 STATE OF THE ART OF SEMANTIC SEARCH TOOLS

    In the coming sections, we illustrate the core features of the previously mentioned modes of interaction, concentrating on their strengths and weaknesses in terms of usability.

    2.1 KEYWORD-BASED SYSTEMS

    These systems exploit the availability of unambiguous semantics to improve the performance of conventional keyword search. The most important benefit of keyword-based tools is that they let end-users specify queries in a straightforward manner that is very familiar to them. End-users can use these systems without any prior knowledge of the ontology's exact vocabulary or structure, and without needing to master a special query language. The success of the search is determined by the way the search algorithm processes the queries and by its keyword selection method.

    The TAP search engine [6] is a keyword-based semantic search system and was one of the pioneers of such systems. It makes use of conventional keyword search algorithms. The keyword-matching mechanisms of present tools operate at the syntactic level, using string-matching techniques. This makes them domain independent, because they are not attached to domain ontologies, but unfortunately it also leaves them unable to recognize the information needs of end-users. Therefore, they do not always generate successful search results. As a result, a keyword semantic search routine should integrate both semantic and syntactic matching mechanisms, employing a domain-specific ontology and lexical resources like WordNet in order to match the user keyword with its semantic equivalents. This was demonstrated in ZOOM and by the distinctive features presented in [7] and [8], which generated interesting semantic matching results.
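    The idea of combining syntactic and semantic matching can be sketched as follows. This is a minimal illustration under stated assumptions, not the mechanism of TAP or of any cited system: the ontology labels are invented, and a small hand-written synonym table stands in for a lexical resource such as WordNet.

```python
# Hypothetical ontology labels and a tiny hand-written synonym table that
# stands in for a lexical resource such as WordNet (both are assumptions).
ONTOLOGY_LABELS = {"city", "river", "state", "population"}
SYNONYMS = {"town": "city", "stream": "river", "inhabitants": "population"}


def match_keyword(keyword):
    """Map a user keyword to an ontology label.

    Returns a (label, how) pair, where how is "syntactic", "semantic",
    or "unmatched".
    """
    term = keyword.strip().lower()
    # Syntactic level: plain string matching against the ontology labels.
    if term in ONTOLOGY_LABELS:
        return term, "syntactic"
    # Semantic level: fall back to the synonym table.
    if term in SYNONYMS:
        return SYNONYMS[term], "semantic"
    return None, "unmatched"


for word in ["River", "town", "mountain"]:
    print(word, "->", match_keyword(word))
```

    A purely syntactic matcher would fail on "town", whereas the semantic fallback resolves it to the "city" label.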

    2.2 FORM-BASED SYSTEMS

    All computer users use forms in their typical day-to-day interactions, which makes forms an accepted approach for semantic search interfaces. By making users select query values from lists of valid expressions, form-based interfaces can overcome mapping issues that arise in other interaction modes. They let the user envision what acceptable searches look like by showing the user what is in the domain, thereby supporting his understanding of it. Form-based interfaces are supported by the Corese library tool, a portable tool that can assemble and tune parameterized searches. It has been tested on more than ten diverse domains [9]. However, it does not support investigative browsing as effectively as Magnet [10], which is a module of Haystack. Among other approaches, mSpace [11] produces form-like interfaces straight from the domain structure, introducing the user to an alien information space whose structure he does not know. On the other hand, it is incapable of searching in a diverse environment.

    2.3 VIEW-BASED SYSTEMS

    Domain understanding, query building and query enhancement are the vital features of view-based search tools. Using the related domain ontology, view-based systems structure the view in a graphical or tree-based manner in order to illustrate the underlying semantic metadata. One benefit is that the content categorization scheme and the query vocabulary can be presented in intuitive formats, which gives the user a better understanding of the domain. Queries are often built by means of navigation. As an example of view-based systems, GRQL [12] lets end-users construct queries at runtime by visually browsing through the specified ontology domain.

    Nevertheless, the time needed to construct a query via navigation can be a limitation. Among other approaches, SEWASIE [13] supports end-users in query construction; however, when the related ontology gets complex, the number of steps needed to construct a query can get large. In Ontogator [14], a multi-facet search tool was created to speed up query formulation by combining keyword search with the view-based navigation routine.

    When it comes to scalability, view-based tools perform poorly because of their time-consuming interaction. Another limitation is their inability to present views of the domain effectively, which can leave end-users lost in the information space.

    2.4 NATURAL LANGUAGE SYSTEMS

    What draws end-users to natural language semantic tools is their simple interface and interaction. The main aim of earlier natural language question answering systems was to generate the query's answer from unprocessed text and to support query development in information retrieval [15]. Integrating semantic mark-up with question answering (QA) systems, on the other hand, opens the door to new possibilities.

    With this integration, the new QA systems can use semantic information to offer accurate answers to queries presented in natural language. What distinguishes these systems is that they offer a way to formulate queries with several parameters more flexibly than the other systems mentioned above, without obliging end-users to have any prior knowledge of a search language. An example of this kind of system is AquaLog [16], which is ontology-based and portable.

    However, if we review the constraints of semantic QA systems, we find that they do not give end-users any hints about the domain they are querying, which means that end-users must be acquainted with the domain in order to create valid questions. In other words, these systems do not help the user understand the domain, much like keyword search systems.

    The GINO system [17] tried to solve these limitations by presenting step-by-step directions to the end-user while a query is created in a quasi-English form, to guarantee that only suitable queries are passed.

    Another example is Orakel [18], which uses two distinct lexicons: the domain lexicon and the generic, domain-independent lexicon. The domain-independent lexicon consists of general English words such as the question pronouns "When" and "How". The domain lexicon is produced on the fly from each knowledge base and thus differs from one application to another.

    However, natural language systems do have their limitations. To gain the flexibility of constructing a natural language query with numerous query parts, one has to sacrifice the ability to interpret the whole query, because there is a constraint on the number of phrases that can be parsed correctly. Therefore, not all complex parameterized queries can be interpreted. What should be noted, however, is that these systems scale to large ontologies very well.

    3 GHSSA PROPOSED APPROACH

    3.1 APPROACH HYPOTHESIS

    A system that does not oblige end-users to have any prior knowledge of the components of the domain, to be skilled in any formal search language, or to be acquainted with the knowledge base's structure, unlike natural language interfaces, which serve end-users who already have an idea about the domain they question in the semantic knowledge base.

    The top-level classes must each be a subClassOf the top-most generic class Thing, which is the superclass of every OWL class. All these siblings in the class hierarchy must be at the same level of generality, where the main class is the one having the highest number of relations. [OWL Web Ontology Language Overview, W3C Recommendation, 10 February 2004]
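    The second hypothesis can be made concrete with a small sketch that picks the main class from a toy class hierarchy. Everything here is hypothetical: the triples are invented geography-style data, the ontology is represented as plain (subject, predicate, object) tuples rather than through an OWL library, and "number of relations" is interpreted as the number of properties whose domain is the class.

```python
from collections import Counter

# Invented toy ontology as (subject, predicate, object) triples.
TRIPLES = [
    ("State", "subClassOf", "Thing"),
    ("City", "subClassOf", "Thing"),
    ("River", "subClassOf", "Thing"),
    ("hasCapital", "domain", "State"),
    ("statePopulation", "domain", "State"),
    ("borders", "domain", "State"),
    ("cityPopulation", "domain", "City"),
    ("runsThrough", "domain", "River"),
]


def top_level_classes(triples):
    """Classes declared as direct subclasses of the generic class Thing."""
    return sorted(s for s, p, o in triples if p == "subClassOf" and o == "Thing")


def main_class(triples):
    """Top-level class with the highest number of relations.

    A relation is counted for a class when the class is the domain of a
    property (an assumption of this sketch). Ties go to the class that
    comes first alphabetically.
    """
    tops = top_level_classes(triples)
    if not tops:
        raise ValueError("no top-level classes found")
    counts = Counter(o for s, p, o in triples if p == "domain" and o in tops)
    return max(tops, key=lambda cls: counts[cls])


print(top_level_classes(TRIPLES))
print(main_class(TRIPLES))
```

    With this toy data, State is the domain of three properties and is therefore selected as the main class.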

    The main intention of this research is to offer a generic hybrid semantic search approach (GHSSA) that accomplishes the following goals:

    Enlarging the scope of use and maximizing the benefits of the SFBGSS so that it can be used with any standard English data set, independently of the data set.

    Displaying different comparative operands according to the data type of each relation.

    Performing data profiling on data values for a better user experience.

    Validating the robustness of this novel approach using an Arabic data set, which adds a new dimension by making the approach multilingual and flexible.
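    The goal of displaying comparative operands according to data type can be sketched as a lookup from an inferred type to a list of operators. The operator lists, the two type names, and the type-inference rule are all assumptions made for illustration; the paper does not specify which operands GHSSA displays.

```python
# Hypothetical operator lists per data type.
OPERATORS_BY_TYPE = {
    "numeric": ["=", "!=", "<", "<=", ">", ">="],
    "string": ["equals", "contains", "starts with"],
}


def infer_type(values):
    """Classify a relation's values as "numeric" or "string".

    A relation counts as numeric only if every value parses as a number.
    """
    if not values:
        return "string"
    try:
        for value in values:
            float(value)
    except (TypeError, ValueError):
        return "string"
    return "numeric"


def operators_for(values):
    """Comparative operators a form could offer for this relation."""
    return OPERATORS_BY_TYPE[infer_type(values)]


# Invented sample values for two relations of a geography-style data set.
print(operators_for(["12000", "845.5", "97"]))
print(operators_for(["Austin", "Dallas"]))
```

    A form built this way would offer range comparisons for a numeric relation such as a population and text comparisons for a relation such as a city name.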

    3.2 GHSSA ALGORITHM

    To achieve the previously mentioned objectives, the GHSSA algorithm is proposed as shown in Fig. 1.

    3.3 KNOWLEDGE BASE DATASETS

    Rather than using only one specific data set as in SFBGSS [1], GHSSA uses four data sets: the three data sets of the Mooney Natural Language Learning library [2] and an Arabic translation of one of them, as follows:

    Geo Query Data: Data for parsing queries about a simple U.S. geography database.

    Restaurant Query Data: Data for parsing queries about a database of restaurant information in Northern California, which was previously used in SFBGSS.

    Fig. 1. GHSSA Algorithm.

    Fig. 2. Geo Data Set Layout.

    http://www.w3.org/TR/owl-guide/#DefiningSimpleClasses

    Jobs Query Data: Data for parsing queries about job announcements posted in the newsgroup austin.jobs.

    Geo Query Data in Arabic: Data for parsing queries about a simple U.S. geography database, translated into Arabic.

    These data sets were chosen specifically because they have been used in most of the related work in this area, which is why they are considered a credible and standard source. Using them also facilitates comparison with other systems. Each data set contains a collection of English queries in addition to the domain knowledge base.

    4 GHSSA IMPLEMENTATION

    The following eight figures show screenshots of the GHSSA with different data sets. Each pair illustrates a user query and its corresponding system response.

    Fig. 3. Jobs Data Set Layout.

    Fig. 4. Restaurant Data Set Layout.

    Fig. 5. Geo Data Set Layout Arabization.

    Fig. 6. GHSSA Geo Data Set.

    Fig. 7. GHSSA Geo Data Set Result.

    Fig. 9. GHSSA Jobs Data Set Result.

    Fig. 10. GHSSA Restaurant Data Set.

    Fig. 11. GHSSA Restaurant Data Set Result.

    Fig. 12. GHSSA Geo Data Set in Arabic.

    Fig. 13. GHSSA Geo Data Set in Arabic Result.

    5 EXPERIMENTAL RESULTS

    The GHSSA was tested on the three Mooney data sets mentioned above and then compared with the NLP-Reduce NLI [19]. Fig. 14 shows the precision and recall performance measures for both systems.

    Fig. 15 shows the total precision and total recall of the SFBGSS (phase 1) [1] and the GHSSA (phase 2), compared with NLP-Reduce in both phases.
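    For readers unfamiliar with the two measures, the sketch below computes precision and recall using the standard set-based information retrieval definitions. The paper does not state the exact formulas used in its evaluation, so these definitions are an assumption, and the sample answer sets are invented.

```python
def precision_recall(retrieved, relevant):
    """Standard set-based precision and recall.

    precision = |retrieved and relevant| / |retrieved|
    recall    = |retrieved and relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: the system returns four answers, two of them correct,
# while three correct answers exist in total.
p, r = precision_recall({"a", "b", "c", "d"}, {"a", "b", "e"})
print(f"precision={p:.2f} recall={r:.2f}")
```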

    6 CONCLUSION

    In the past few years various semantic search engines have emerged, each implementing a different approach. The purpose of this research was to enhance the precision of guided systems by adding a form-based interface. The GHSSA succeeded in implementing a generic hybrid semantic search engine that overcomes two limitations of natural language interfaces: the habitability problem, by providing the user with the data values in the domain, and the limit on the number of query parts in a phrase that can be correctly interpreted, by displaying all the relations that exist so that the user can choose as many as required. Regarding the scalability limitation of form-based search engines, we provided data profiling using the five-number summary statistical model, which can be applied to any range of values no matter how large it gets. The GHSSA also encompasses some features that were implemented in previous semantic search engines, such as portability, in addition to some new features that, as far as we know, were not implemented in other semantic search engines, such as displaying different comparative operands according to the data type of each relation and querying an Arabic data set.

    REFERENCES

    [1] Reham Hesham El-Deeb, Abdel Fatah A. Hegazy, Aly Aly Fahmy. Semantic Form-Based Guided Search System. (2012). In the 22nd International Conference on Computer Theory and Applications. Alexandria, Egypt.

    [2] L.R. Tang, R.J. Mooney, Using multiple clause constructors in inductive logic programming for semantic parsing. In: 12th European Conf. on Machine Learning, Freiburg, Germany. 2001, pp. 466-477.

    [3] A. Bernstein, & E. Kaufmann, GINO - A Guided Input Natural Language Ontology Editor. In Proceedings of the 5th International Semantic Web Conference (ISWC 2006). Athens, Georgia, 2006, pp. 144-157.

    [4] A. Bernstein, E. Kaufmann, & C. Kaiser, Querying the Semantic Web with Ginseng: A Guided Input Natural Language Search Engine. In Proceedings of the 15th Workshop on Information Technology and Systems (WITS 2005). Las Vegas, NV, 2005, pp. 45-50.

    [5] Victoria Uren, Yuangui Lei, Vanessa Lopez, Haiming Liu, Enrico Motta, Marina Giordanino. (2007). The usability of semantic search tools: a review. The Knowledge Engineering Review. (pp. 361-377).

    [6] Guha, R., McCool, R. & Miller, E. 2003 Semantic search. In 12th International Conference on World Wide Web. pp. 700-709.

    [7] Mihalcea, R. & Moldovan, D. 2000 Semantic indexing using wordnet senses. In Proceedings of the ACL-2000 workshop on Recent advances in natural language processing and information retrieval: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics. pp. 35-45.

    [8] Buscaldi, D., Rosso, P. & Sanchis Arnal, E. 2005 A WordNet-based query expansion method for geographical information retrieval. In CLEF 2005 Workshop at GeoCLEF 2005. Vienna, Austria.

    Fig. 8. GHSSA Jobs Data Set.

    Fig. 14. Comparison between the precision and recall of the 3 Data Sets.

    Fig. 15. Comparison between the precision and recall of phase 1 and phase 2.

    [9] Corby, O., Dieng-Kuntz, R., Faron-Zucker, C. & Gandon, F. 2006 Searching the semantic web: approximate query processing based on ontologies. IEEE Intelligent Systems 21(1), 20-27.

    [10] Sinha, V., Karger, D. R. 2005 Magnet: supporting navigation in semistructured data environments. In 2005 ACM SIGMOD International Conference on Management of Data. Baltimore, Maryland, ACM Press, pp. 97-106.

    [11] schraefel, m.c., Wilson, M., Russell, A., & Smith, D.A., 2006 mSpace: improving information access to multimedia domains with multimodal exploratory search. Communications of the ACM 49(4), 47-49.

    [12] Athanasis, N., Christophides, V. & Kotzinos, D. 2004 Generating on the fly queries for the semantic web: the ICS-FORTH graphical RQL interface (GRQL). In 3rd International Semantic Web Conference (ISWC04). Hiroshima, Japan, pp. 486-501.

    [13] Catarci, T., Di Mascio, T., Franconi, E., Santucci, G. & Tessaris, S. An ontology based visual tool for query formulation support. In 16th European Conference on Artificial Intelligence (ECAI-04). 2004. Valencia, Spain, pp. 308-312.

    [14] Hyvonen, E., Saarela, S. & Viljanen, K. 2003 Ontogator: combining view- and ontology-based search with semantic browsing. In XML Finland 2003, Open Standards, XML, and the Public Sector. Kuopio, Finland, pp. 82-85.

    [15] McGuinness, D. 2004 Question answering on the semantic web. IEEE Intelligent Systems 19(1), 82-85.

    [16] Lopez, V., Pasin, M. & Motta, E. 2005 AquaLog: an ontology-portable question answering system for the semantic web. In 2nd European Semantic Web Conference (ESWC 2005). Heraklion, Crete, Greece, pp. 546-562.

    [17] Bernstein, A. & Kaufmann, E. 2006 GINO - a Guided Input Ontology Editor. In Proceedings of the International Semantic Web Conference. pp. 144-157.

    [18] Cimiano, P. 2004 ORAKEL: A natural language interface to an f-logic knowledge base. In 9th International Conference on Applications of Natural Language to Information Systems (NLDB). pp. 401-406.

    [19] E. Kaufmann and A. Bernstein, "How Useful Are Natural Language Interfaces to the Semantic Web for Casual End-Users?," Proceedings of the 6th International Semantic Web Conference (ISWC 2007), Busan, Korea: 2007, pp. 281-294.

    [20] A. Bernstein and E. Kaufmann. Making the semantic web accessible to the casual user: Empirical evidence on the usefulness of semiformal query languages. IEEE Transactions on Knowledge and Data Engineering, under review.

    [21] A. Bernstein, E. Kaufmann, C. Kaiser, and C. Kiefer, "Ginseng: A Guided Input Natural Language Search Engine for Querying Ontologies," Jena User Conference, Bristol, UK: 2006.

    [22] Esther Kaufmann, Abraham Bernstein, Renato Zumstein, Querix: A Natural Language Interface to Query Ontologies Based on Clarification Dialogs, In: 5th International Semantic Web Conference (ISWC 2006), Springer, November 2006.

    [23] C. W. Thompson, P. Pazandak, and H. R. Tennant. Talk to your semantic web. IEEE Internet Computing, 9(6):75-78, 2005.