
ICET2012 Camera 336


Dynamic Entity and Relationship Extraction from News Articles

Mazhar Ul Haq, Hasnat Ahmed, Ali Mustafa Qamar
Department of Computing,
School of Electrical Engineering and Computer Science (SEECS),
National University of Sciences And Technology (NUST),
Islamabad, Pakistan
[email protected], [email protected], [email protected]

Abstract—In structured as well as unstructured data, information extraction (IE) and information retrieval (IR) techniques are gaining popularity for producing realistic output. The number of Internet users is growing day by day, and the Internet has become a popular medium for spreading information through news, blogs, etc. A considerable amount of quality work has been done on monitoring this information. In the context of news monitoring, our proposed unsupervised machine learning approach fetches entities and relationships from a news document itself and, through comparison with other related news documents, forms clusters. In this paper, we propose a dynamic model for entity and relationship extraction in order to monitor the events reported in news articles.

Index Terms—Entity extraction, unsupervised classification, document grouping, relationship extraction.

    I. INTRODUCTION

News and blog monitoring has become very important for companies, celebrities, and political parties in order to decide their plan of action. They need to know what is being reported about them so that they can set their policies and see what their competitors are doing in the market. To serve these needs, a number of solutions have been proposed and are currently being used in the market. Most of these solutions require manual interaction: a large corpus for each particular domain, rules for that domain, and manual annotation as input for the system to work properly. However, for a system designed to work in a domain-independent paradigm, it is practically impossible to provide such resources; therefore, we have designed a system that can extract entities and find relationships among them using unsupervised machine learning.

Different researchers are now working in this area: Mohamed et al. [1] have worked on dynamically finding the relationship between two nouns using trained classifiers. This was an extension of their previous work on the Never-Ending Language Learning system. This area of research is very important for finding techniques that can handle a wide variety of documents without worrying about their domain. Therefore, our aim is to develop a technique that can be used in a news content monitoring system and would

ultimately help us find the relationships between those entities. It will also help in another way: one can analyze which entity is associated with which other entities and through what properties. In this way, it will serve as a framework for policy makers to find evidence for policies.

In order to model the above-mentioned needs, we need to extract the relationships between entities that have been reported in the selected corpus. However, we first have to identify the entities before finding relationships. Moreover, identifying entities using traditional approaches will not be effective because of challenges such as the need for a large, pre-annotated corpus, or the need to maintain a list of all possible entity types along with all the language ambiguities. Therefore, in this research, we propose an unsupervised way of learning entities from a given document. Afterwards, on the basis of the extracted entities and their percentages, we propose a way of classifying documents into classes and extracting relationships between the entities.

In this paper, we have focused on two main areas: entity extraction and relationship extraction. Much work has been done in the area of entity extraction: at the 6th Message Understanding Conference, this process was named named entity extraction, which includes identifying people, places, organizations, and numeric expressions. However, our approach differs from traditional NERC (Named Entity Recognition and Classification) since we want to model the entities based on their relationships without classifying them into categories (such as person, place, or organization). Therefore, we simply call them entities. Our approach aims at finding the maximum possible number of facts associated with any entity according to the events reported in the news articles. Therefore, documents are also grouped before the extraction of relationships. Document grouping is based on the entities that have been reported in the news articles. We will explain in detail how our technique groups documents without adding overhead to the system.

978-1-4673-4450-0/12/$31.00 ©2012 IEEE

II. RELATED WORK

Mohamed et al. [1] have also proposed a dynamic approach for discovering relations given a large corpus. They divided the output into two categories, valid and invalid relations, and weighted both of them. Their system has both the strengths and weaknesses of traditional relation extraction and open relation extraction. The Never-Ending Language Learner (NELL) starts from an initial ontology and outputs facts extracted from the web. In the future, their system's output will be fed back into NELL to gain more accuracy.

Different researchers have used different techniques to find entities in text. Diesner and Carley [2] have proposed the use of conditional random fields (CRF) for identifying entity classes in a socio-technical system. They used CRF along with machine learning techniques to extract relational information from a text corpus. Although this technique is quite useful, has shown good results, and classifies the data into a semantic model, such techniques are not very useful when the domains are diverse. Researchers have also worked with predefined lists of entities to be searched for in text documents, as in Agrawal et al. [3]. They proposed an ad-hoc entity extraction technique for text collections using an inverted index created on the documents and showed that their technique is faster than traditional entity extraction processes. Witten et al. [4] presented text compression techniques to find tokens which can be modeled as entities in the text. Their approach also requires training documents for each domain of news article, which might not be practical. All of these methods are based on supervised machine learning, in which the system analyzes the provided corpus and generates the list.
Using different techniques, such a system tries to find the entities in the provided text documents.

Similarly, researchers have also proposed semi-supervised and unsupervised ways of learning entities from text. The most popular semi-supervised technique is bootstrapping, which requires a few clues to start the learning process for identifying entities; Pasca et al. [5] have demonstrated this technique using pattern generalization. The unsupervised way of learning entities works by clustering the data based on different heuristics about entities, e.g., that an entity will be a noun, start with a capital letter, or share the same context.

However, there is relatively little research in the area of identifying relationships between entities automatically. The systems generally require that some information be specified about the relationships that need to be learned. For example, the systems developed by Agichtein and Gravano [6] as well as by Carlson et al. [7] are based on a bootstrapping process, i.e., one has to define the relationship name and the entity types to which a particular relation applies. Carlson et al. have also shown that mutual exclusion can be imposed between category predicates, which reduces semantic drift and yields a precision of 89%. Although these techniques are quite useful and have worked well, they are costly when there exist thousands of relationships and predicates. Banko et al. [8] have created positive and negative sets for extracting relationships. Although in their method one does not have to provide seed examples or the names of the relationships, training documents must still be provided before it starts working. Unsupervised machine learning techniques have also been used to learn relationships between entities. In order to remove the shortcomings of supervised and semi-supervised learning of relationships, Hasegawa et al. [9] have proposed a technique based on creating a feature vector for each entity pair and building a similarity matrix over these vectors. However, this similarity matrix introduces a lot of noise because of the writers' writing styles. Zhang et al. [10] have proposed a tree-based similarity matrix to cluster similar relationships among tuples. However, the same problem occurs with this technique as with that of Hasegawa et al.: the writers' style may introduce noise. Moreover, Zhang et al. have only focused on a single document. On the contrary, we are more focused on finding the facts that are reported about entities across different articles. We want to find the true relationships between entities so that one can use these facts for defining strategies.

III. METHODOLOGY

A. Entity Extraction

Named Entity Recognition, or simply entity recognition, is a process inspired by machine learning techniques. Work has been done in all three areas: supervised learning, semi-supervised learning, and unsupervised learning. Most of the work has been done in supervised machine learning, which includes rule-based systems and sequence labeling systems. Although supervised learning methods are quite successful, they require a large collection of annotated documents before they can work on an actual system. In our case, however, we have many diverse areas, including sports, politics, business, current affairs, etc. Therefore, it would be impractical for us to provide documents for all of these areas.

We use an unsupervised machine learning algorithm for dynamic entity modeling, in which we are more focused on the process of clustering the groups based on different rules. We first take a document and perform part-of-speech (POS) tagging on it. After tagging, we find the marked nouns in the document and then calculate the following features of the gathered nouns:

1) Case: Three parameters have been defined for this feature. The most important one is whether the word starts with a capital letter; if it does, it might be a candidate entity. The second parameter is whether all letters of the entity name are uppercase (organization names are usually all uppercase), and the third is whether it is a mixed-case word.

TABLE I
CANDIDATE ENTITIES EXTRACTED FROM ARTICLES

Candidate Entities                 Count   Percent
Punjab Assembly                    2       5.714286
Bahawalpur                         2       5.714286
South Punjab                       3       8.571429
Minister Rana Sanaullah Khan       2       5.714286
Speaker Rana Mohammad Iqbal Khan   1       2.857143
Pakistan Peoples Party (PPP)       5       14.28571
Usman Bhatti                       1       2.857143
PML-N                              4       11.42857

2) Punctuation: Whether some punctuation is used in the word, for example a hyphen, apostrophe, or ampersand (&).

3) Digits: Whether digits are used in the word, for example 3M or W3C.

4) Prefixes or suffixes: Whether some prefixes or suffixes are used, for example Mr. or Ms.
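The four surface features above can be sketched as a small feature extractor. The boolean encoding below is our assumption (the text names the features but not their representation), and the prefix list is an illustrative, non-exhaustive example.

```python
import re

# Sketch of the four candidate-entity features described above:
# case, punctuation, digits, and prefixes/suffixes. The exact boolean
# encoding and the prefix list are our assumptions, not the paper's.

TITLE_PREFIXES = ("Mr.", "Ms.", "Dr.")  # illustrative list only

def candidate_features(token):
    return {
        "starts_capital": token[:1].isupper(),
        "all_caps": token.isupper(),
        "mixed_case": any(c.isupper() for c in token[1:]) and not token.isupper(),
        "has_punct": bool(re.search(r"[-'&]", token)),   # hyphen, apostrophe, &
        "has_digit": any(c.isdigit() for c in token),
        "has_title_prefix": token.startswith(TITLE_PREFIXES),
    }

print(candidate_features("PML-N"))   # all_caps and has_punct are True
print(candidate_features("3M"))      # has_digit is True
```

A downstream filter could then keep only tokens whose feature vector matches an entity-like profile (e.g. capitalized and not purely numeric).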

After calculating the features, we filter out the words which are not candidate entities in the document. We then calculate the occurrence weight of each retained word in the document, because in news data, writers generally focus on reporting similar kinds of entities: they try to present facts related to one or more entities present in the reported event. In our experiments, we investigate which weights are appropriate for considering a word an entity.

We have also calculated the number of times the entities are repeated in the document, i.e., the frequency count of the entities along with their percentages.

Let N be the set of nouns present in a document D, such that N = {n1, n2, n3, ..., nj}, where j is the total number of nouns marked in document D. Each ni is selected such that it groups together nouns that are adjacent to each other. We perform this step because POS tagging marks each word with some grammatical relation, and there is a possibility that two NPs are marked one after another. In actual practice, however, two NPs that appear adjacent to each other form one noun, for example Pakistan Peoples Party (PPP): all four words correspond to one entity, but since POS tagging tags each word individually, they appear as four adjacent NPs. After creating this set N, we build an occurrence grid which records the count of each particular noun in the given document. It must be noted that we count all occurrences, so if a part of a noun is repeated, it is also counted toward the occurrences of that noun. For example, if Pakistan Peoples Party (PPP) is a noun and the rest of the document contains PPP 10 times, those occurrences will contribute to the count of the same noun Pakistan Peoples Party (PPP).
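The occurrence grid with partial-mention folding can be sketched as follows. Treating "is a fragment of" as a simple substring test, and folding fragments into the longest containing mention, are our simplifications of the rule described above.

```python
# Sketch of the occurrence grid: mentions are counted, and a shorter
# fragment (e.g. "PPP") contributes to the count of the longest noun
# containing it. The substring test is our simplification of the
# partial-match rule described in the text.

def build_occurrence_grid(mentions):
    """mentions: list of grouped noun phrases as they occur in the document."""
    canonical = sorted(set(mentions), key=len, reverse=True)
    counts = {}
    for m in mentions:
        # fold the mention into the longest canonical noun that contains it
        target = next((c for c in canonical if m in c), m)
        counts[target] = counts.get(target, 0) + 1
    total = sum(counts.values())
    return {e: (n, round(100.0 * n / total, 2)) for e, n in counts.items()}

mentions = ["Pakistan Peoples Party (PPP)", "PPP", "PPP", "Punjab Assembly"]
grid = build_occurrence_grid(mentions)
print(grid)  # 'Pakistan Peoples Party (PPP)' -> (3, 75.0)
```

The resulting counts and percentages correspond to the Count and Percent columns of Table I.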

TABLE II
SET G FOR GROUP J

No    Candidate Entities             Percent
n1    Punjab Assembly                5.714286
n2    South Punjab                   5.714286
n3    Minister Rana Sanaullah Khan   5.714286
n6    Pakistan Peoples Party (PPP)   5.714286
n8    PML-N                          5.714286
n13   Federal Government             5.714286
n16   Election Commission            5.714286

    B. Document Grouping

In order to move forward, we first have to group together documents reporting similar events so that we can extract the correct relationships between entities. The documents are grouped based on the candidate entity list generated in the previous step.

We therefore call this step the document grouping task, in which we assign a Boolean value to each pair (Dj, Gj), where G = {G1, G2, G3, ..., G|G|} is a set of dynamically generated groups. The value T is assigned to (Dj, Gj) to record the decision made by a function F that returns one value from {T, F}. Since for each document we have already created the set Ni = {n1, n2, n3, ..., nj}, we utilize these sets of candidate nouns to categorize the documents into groups.

Initially, the first document D1 is assigned to group G1, and its set N1 is assigned to G1. Then, for each document Dj, we take the intersection of Nj with group Gj (i.e., Nj ∩ Gj). This gives us a set Uj. There are two cases here: first, when Uj is the null set, the document Dj clearly does not belong to Gj. But if Uj is not null, we calculate the accumulated percentage of Dj in Gj. If the accumulated percentage is greater than the defined threshold, it is considered a match; otherwise Dj does not belong to Gj. If Dj belongs to Gj, we take the union of Gj and Nj to get a new set, which is assigned to Gj.

It must be noted that each element of Gk is a noun along with its corresponding percentage in the document. When we take the intersection of Gj with Dj, it always gives us the set Uj. We sum all of the percentages associated with the nouns in Uj to get the similarity percentage. Table II shows an example group, whereas Table III shows the set N for the jth

document. If we consider these two sets, we find that some entities are present in both, namely n1, n2, n3, n6, and n8. If we sum up their percentages, the total comes to 52, which means that document Dj is 52% similar to group Gj. Since we have defined a threshold of 50%, this document can be assigned to this group. We have also required a match between the most frequently occurring nouns on both sides: at least one of the document's most frequent nouns must match one of the nouns in the group, and similarly for the group's mostly

TABLE III
SET N FOR DOCUMENT J

Candidate Entities                            Count   Percent
Punjab Assembly                               3       7.317073
Pakistan Peoples Party                        1       2.439024
PPP                                           6       14.63415
South Punjab                                  5       12.19512
Minister Rana Sanaullah                       1       2.439024
PML-N                                         6       14.63415
Bahawalpur                                    3       7.317073
National Assembly                             1       2.439024
Fata                                          1       2.439024
Hazara                                        1       2.439024
Pakistan Muslim League Quaid-e-Azam (PML-Q)   1       2.439024
PA                                            3       7.317073
Deputy Parliamentary Leader                   1       2.439024
PA Shaukat Mehmood Basra                      1       2.439024
Article 239                                   1       2.439024
NA                                            5       12.19512

occurred noun. Since this holds for these two sets, we can assign the document to the group and take the union of the two sets to form a bigger set.
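The grouping rule can be sketched as below. Two points are our reading of the text rather than a stated specification: the similarity is computed by summing the group-side percentages of the shared nouns, and the union is realized as a dictionary merge; the 50% threshold and the top-noun matching requirement come from the text.

```python
# Sketch of the document grouping decision described above. A document's
# noun set is intersected with a group's; the shared nouns' percentages
# are summed and compared to the 50% threshold, and the most frequent
# noun on each side must be among the shared nouns.

THRESHOLD = 50.0  # similarity threshold from the text

def assign_to_group(doc_nouns, group_nouns):
    """doc_nouns / group_nouns: dict mapping noun -> percentage.
    Returns the merged (union) group if the document matches, else None."""
    shared = set(doc_nouns) & set(group_nouns)       # the set U_j
    if not shared:
        return None
    similarity = sum(group_nouns[n] for n in shared)  # accumulated percentage
    top_doc = max(doc_nouns, key=doc_nouns.get)
    top_group = max(group_nouns, key=group_nouns.get)
    if similarity > THRESHOLD and top_doc in shared and top_group in shared:
        return {**group_nouns, **doc_nouns}           # union of the two sets
    return None

group = {"Punjab Assembly": 20.0, "PPP": 35.0}        # hypothetical percentages
doc = {"Punjab Assembly": 10.0, "PPP": 25.0, "NA": 5.0}
print(assign_to_group(doc, group) is not None)  # True: 55% > 50% and tops match
```

A new group would be created whenever a document matches no existing group, mirroring how D1 seeds G1.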

    C. Relationship Extraction

Relationship extraction is the most important part of the whole system and must work automatically. Our key idea for extracting relationships between entities is to use the redundant information reported by different sources about the same event. Since different writers report the same relationship using different contexts, we use those contexts to extract the information. In the previous steps, we have already performed entity extraction and document classification. We now build a list L for each document containing entity / connecting-phrase / entity triples as its elements, such that each pattern is used more than once in the document and matches one of the forms E1 Verb E2, E1 NP Prep E2, E1 Verb Prep E2, or E1 to Verb E2 (where E1 and E2 are two entities); all patterns of these forms are considered part of the list L. We then have the list set L = {L1, L2, L3, ..., Lk} for the k documents belonging to the same event. From these documents, we build a context set C such that each Ci contains all the contexts of the same two entities.
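Collecting these triples can be sketched over POS-tagged sentences in which entities have already been marked. The "ENT" tag and the (token, tag) representation are our assumptions for illustration; the four patterns are the ones named above.

```python
# Sketch of collecting entity / connecting-phrase / entity triples using
# the four patterns named above (E1 Verb E2, E1 Verb Prep E2, E1 NP Prep E2,
# E1 to Verb E2). Entities are assumed pre-marked with an "ENT" tag; the
# tag names follow Penn Treebank conventions.

PATTERNS = [
    ("VB",),        # E1 Verb E2
    ("VB", "IN"),   # E1 Verb Prep E2
    ("NN", "IN"),   # E1 NP Prep E2
    ("TO", "VB"),   # E1 to Verb E2
]

def extract_triples(tagged):
    triples = []
    for i, (tok_i, tag_i) in enumerate(tagged):
        if tag_i != "ENT":
            continue
        for pat in PATTERNS:
            j = i + 1 + len(pat)  # index where the second entity must sit
            if j < len(tagged) and tagged[j][1] == "ENT" and all(
                tagged[i + 1 + k][1].startswith(p) for k, p in enumerate(pat)
            ):
                phrase = " ".join(t for t, _ in tagged[i + 1 : j])
                triples.append((tok_i, phrase, tagged[j][0]))
    return triples

sent = [("Punjab Assembly", "ENT"), ("resolution", "NN"),
        ("for", "IN"), ("South Punjab", "ENT")]
print(extract_triples(sent))  # [('Punjab Assembly', 'resolution for', 'South Punjab')]
```

The connecting phrases gathered this way form the contexts that populate each Ci.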

D. Algorithm / Pseudocode

Input: All context sets S = {C1, C2, ..., Cm}, where m is the total number of different entity pairs found in the articles reporting the same event.

    Output : Set of entities and their relationships.

TABLE IV
RELATION CALCULATION

          Has Approved   Passed   Adopted
Weight    0.167          0.667    0.166
Rank      2              1        3

Steps:

1) From the context sets S, build a context occurrence weight matrix for each pair Ei and Ej, and normalize the matrix.
2) Rank the relations in each column according to their co-occurrence in the documents.
3) Select the relation which has the highest score for the document.

To select the relationship between entities, we have built a context occurrence matrix. In this matrix, we put all the possible contexts into a list and calculate their occurrences in the documents. For example, some articles report that the Punjab Assembly has approved the resolution for Bahawalpur and South Punjab, while others say it passed the resolution, and some use adopted, as shown in Table IV. We count their occurrences and calculate the weights as follows:

Matrix(i, 1) = V_i / Σ_{k=1}^{m} V_k

Here V_i is the occurrence count of the ith context, and m is the total number of contexts used for entities E_i and E_j.
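The normalization and ranking can be sketched as below. The counts (1, 4, 1) are illustrative values chosen to reproduce the weights in Table IV (0.167, 0.667, 0.167 up to rounding); the paper does not state the underlying counts.

```python
# Sketch of the context-weight computation: each context's count is
# divided by the total count for the entity pair (the formula above),
# and contexts are ranked by weight. The counts below are illustrative.

def rank_contexts(counts):
    total = sum(counts.values())
    weights = {c: n / total for c, n in counts.items()}
    ranked = sorted(weights, key=weights.get, reverse=True)
    return weights, ranked

counts = {"has approved": 1, "passed": 4, "adopted": 1}
weights, ranked = rank_contexts(counts)
print(ranked[0], round(weights["passed"], 3))  # passed 0.667
```

The top-ranked context ("passed", rank 1 in Table IV) is then selected as the relation for the entity pair.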

    E. Extracting Valid Relations Among Entities

There are some invalid relations that need to be considered here:

Incomplete: In some cases, the relation is not complete without the addition of further nouns and entities. For example, consider Punjab Assembly has approved resolution for South Punjab and Bahawalpur. This relation involves three entities, i.e., Punjab Assembly, South Punjab, and Bahawalpur.

Ambiguous clustering: If the clustering is wrong, the extracted relations will also be wrong; therefore, the thresholds for clustering should be set carefully.

We address the above-mentioned issues by adding an additional step to our relationship extraction, i.e., re-validation of relationships. In this step, we monitor the existence of more than one entity in the same context. If more than one entity exists, we add them in order to complete our relations between entities.

    IV. EXPERIMENTAL RESULTS

To test our technique, we have developed web crawlers that download news articles from the web. These downloaded articles serve as input to the system. The input documents are first transformed into XML-formatted documents. Information such as the date and time of publishing and the headline of the news article

  • Fig. 1. Document view when it is transformed from web document

TABLE V
ENTITY EXTRACTION RESULTS FROM 5 RANDOMLY CHOSEN DOCUMENTS

Document   Found   Present   False Positive   Missing
Doc1       16      17        1                2
Doc2       10      10        1                1
Doc3       17      17        1                1
Doc4       10      11        0                1
Doc5       17      18        1                2

    Fig. 2. Document view when it is transformed from a web document

is then added. Each paragraph is separated into different XML tags, as shown in Fig. 1. We have also used the English Gigaword Corpus, which is a very comprehensive archive of news articles: it gathers 24 months of data, from January 2009 to December 2010, from 6 different sources. It must be noted that the news article formats gathered from the two sources are made identical, as shown in Fig. 1. We then randomly chose articles from both data sources for result calculation. The randomly chosen articles are marked both manually and by the system, followed by a comparison between the obtained results.

This algorithm has shown very good results for all three steps of the process.

TABLE VI
EXPERIMENTAL RESULTS

            Proposed Approach
Precision   72.20 %
Recall      68.00 %
F-Score     70.002 %

The results showed that the relations extracted with this algorithm have a precision of 71.6% and a recall of 72.2%, as shown in Table VI, and the entity extraction is also very accurate, as seen in Table V and Fig. 2.
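The F-score in Table VI is the standard harmonic mean of precision and recall; the helper below reproduces it from the table's precision and recall values. The small difference from the reported 70.002% presumably reflects rounding in the reported precision and recall.

```python
# The F-score is the harmonic mean of precision and recall:
# F = 2 * P * R / (P + R). Applied to Table VI's values this gives
# roughly 70.04, close to the reported 70.002% (rounding aside).

def f_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f_score(72.20, 68.00), 2))  # 70.04
```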

    V. CONCLUSION AND FUTURE WORK

Although this technique extracts relatively few entities and relationships, their validity will always be high, and they will help organize the news articles. The technique needs to be tested extensively to find possible improvements that would increase the recall of the proposed system. One possible extension of this work is the generation of a context matrix for the entities, explaining whether a selected entity is used as the object or the source of a relation. This would help the classification of the news articles.

REFERENCES

[1] T. P. Mohamed, E. R. Hruschka Jr., and T. M. Mitchell, "Discovering relations between noun categories," in EMNLP '11: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1447-1455, 2011.
[2] J. Diesner and K. M. Carley, "Conditional random fields for entity extraction and ontological text coding," Springer Science + Business Media LLC, 2008.
[3] S. Agrawal, K. Chakrabarti, S. Chaudhuri, and V. Ganti, "Scalable ad-hoc entity extraction from text collections," in Proceedings of the International Conference on Very Large Databases (VLDB), 2008.
[4] I. H. Witten, Z. Bray, M. Mahoui, and W. J. Teahan, "Using language models for generic entity extraction," in International Conference on Machine Learning (ICML) Workshop on Text Mining, 1999.
[5] M. Pasca, D. Lin, J. Bigham, A. Lifchits, and A. Jain, "Organizing and searching the World Wide Web of facts - step one: the one million fact extraction challenge," in National Conference on Artificial Intelligence, 2006.
[6] E. Agichtein and L. Gravano, "Snowball: extracting relations from large plain-text collections," in Fifth International Conference on Digital Libraries (ICDL), 2000.
[7] A. Carlson, J. Betteridge, E. R. Hruschka Jr., and T. M. Mitchell, "Coupling semi-supervised learning of categories and relations," in NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing, 2009.
[8] M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni, "Open information extraction from the Web," in International Joint Conference on Artificial Intelligence (IJCAI), 2007.
[9] T. Hasegawa, S. Sekine, and R. Grishman, "Discovering relations among named entities from large corpora," in Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), 2004.
[10] M. Zhang, J. Su, D. Wang, G. Zhou, and C. L. Tan, "Discovering relations among named entities from a large raw corpus using similarity-based clustering," in IJCNLP '05: Proceedings of the Second International Joint Conference on Natural Language Processing, pp. 378-389, 2005.
