A Summary Sentence Extraction Method for Web Based Mailing List Review Application and Its Effectiveness Study

Embed Size (px)

Citation preview

  • 7/30/2019 A Summary Sentence Extraction Method for Web Based Mailing List Review Application and Its Effectiveness Study

    1/8

    A Summary Sentence Extraction Method

    for Web-based Mailing List Review

    Application and Its Effectiveness Study

    Satoru FUJITANI, Kanji AKAHORI

    Graduate School of Decision Science and Technology, Dept. of Human System Science,

    Tokyo Institute of Technology.

    2-12-1, O-okayama, Meguro, Tokyo 152-8552, Japan,[email protected], [email protected]

    E-mail based communication is gradually making its way into the distant collaborative

    learning environment. But, compared with traditional lecture cum discussion learning

    environment in e-mail-based collaborative discussion, it is difficult to know the latest

    statuses of the learners for providing immediate feedback effectively due to limitedinformation resources.

    The authors propose an information retrieval method for mailing list review. Relevant

    nouns, as keywords in a message, were pulled out from e-mails, and summary was

    extracted by these keywords. Japanese natural language processing technology is

    employed in the proposed method. These extraction procedures are used to form the

    basis for the Web-based mailing list reference application. The authors claim that

    these features in the proposed method help to identify an outline of discussion topics

    in a mailing list, and improve the readability of e-mails.

    Statistical evaluation of this method has been proceeded and it has been found as an

    effective retrieval method for e-mail-based communication. Also, The comparative

    experiment between the extraction result of 46 undergraduate students and the

    algorithm result of extraction method suggests that the proposed method can detect

    major sentences in e-mail articles properly.Keywords: e-mail discussion, mailing list, web-based application, natural language

    processing

    1 IntroductionWithin the Internet, e-mail has the most widespread usage. E-mail is increasingly used as a meansfor storing the latest information and for interactive learning among the learners. In particular,mailing list help learners send messages to a whole group of people and receive back messagesfrom them. Messages sent through mailing list can be followed by all members in the group [1]. Inthe future, the use of mailing list is expected to increase even more as a part of collaborativelearning.

    In e-mail-based collaborative learning, it is difficult for teachers to provide spontaneous

    analysis and feedback. Therefore, teachers must try to facilitate the process of learning effectively.Even learners themselves have the same difficulties in e-mail communication, e.g. a deluge ofe-mail [2]. Learners need to exert more efforts to choose what is more important among thenumerous e-mail messages that they process without teachers. Our earlier research on anetwork-based collaborative learning project facilitated by the mailing list of overseas Japaneseschools [3] suggests that sorting and indexing support tools is crucial in order to apply appropriateideas for educational improvement. We further suggest that extracting according to learnersinterests from stored information can be helpful in keeping their focus on the discussion topic.

    Many improvements have been suggested for an easy-to-use network collaborative learningenvironment. But most of them are for graphical user interface [4-5]. So, in this paper the authorswill suggest a trial keyword and summary extracting method from the series of e-mail (calledthread) of mailing list. Japanese natural language processing technology will be employed forappropriate keyword extraction. From a thread, the keywords will clarify the topics that thesubscribers deal with.

  • 7/30/2019 A Summary Sentence Extraction Method for Web Based Mailing List Review Application and Its Effectiveness Study

    2/8

    The purpose of this research is to verify where indicators, important words for summaryextraction, are included in a series of e-mail text, as thread empirically, and to develop summaryextraction system implemented by the result of empirical fact-findings.

    Many researchers have discussed methods for content-based text extraction [6-12]. Previousresearchers equate the summary sentence with some important sentences selected from thedocument. This method assumes that there are certain indicators for importance of words, phrases,

    or complete sentences.We propose an e-mail discussion extraction model in order to acquire new information. We

    assume that there is some cohesion between the original e-mail and the reply e-mail. The sentenceextraction algorithm is based on empirical experiments. The keyword extraction algorithm assumesa continuity of keyword appearance in the e-mail thread that enables parsing the thread to extracttypical terms with a high frequency of occurrence in each message. This keyword extractionmethod is good for domain-specific knowledge. With the help of the e-mails original title andgiven keywords, learners and newcomers can learn about the message topics. A computer learningenvironment can easily archive the result of e-mail exchanges. With this method, learners activitiescan be inspected and reviewed.

    This keyword extraction method has formed the basis for a WWW-based mailing list

    reference environment [13-14].

    2 Extracting keywords from e-mail thread2.1 Some characteristics of Japanese Sentences for the extractionSpoken Japanese is characterized as follows:(1) Pronouns do not usually used replace proper nouns, original nouns are instead repeated.(2) Repeated nouns are not usually replaced with the pronouns, and are used with affixing the

    directives.For instance, in the context of e-mails, the other partys name tends to be repeated without

    replacing it with a pronoun. Japanese people are used to greeting and referring by surname. That iswhy the listener/receiver may not be able to distinguish the senders gender. Common nouns arealso used in this way.

    Because of this practice, we can expect to find the same noun with same symbol in variouscoreferenced documents.

    2.2 Characteristics of paragraph structure in E-mail threadsThe language style in electronic exchanges is informal, and it is located somewhere between thespoken language and the written language. The usage and the structure in electronic exchanges arevery different [15]. It is difficult to decide the stylistics of e-mail document. Also, the heuristics forthe location of the selected sentences in the document are seldom used. For instance, we canexpect to see a reference to a preceding document anywhere in the reply e-mail.

    2.3 Why Keywords Extraction?Take the case of a text-based discussion on network. When a person wants to participate in anongoing discussion on a certain topic, at first (s)he reads the series of e-mail or thread, from thebeginning one by one. Moreover, if the topics diverge variously and someone wants to pursue oneof these topics, it also needs to read all the relevant threads in detail.

    Although e-mails documents cannot find locational rules or structure, e-mails have acoreference between a preceding and a reply e-mail. Also, quite commonly, e-mail documents areemployed for references. In this paper, retrievals by keywords are based on these features ofe-mails, and a summary sentence extraction method is proposed later.

    In this paper, we assume that mailing list communication takes place as follows: (1) In amailing list, participants refer to the content of the preceding messages, and cite them occasionally.(2) Also in a mailing list, important words in the discussion appear many times, especially nounsthat correspond to the topic; (3) Therefore, important words in the discussion accumulate at a highfrequency in a discussion thread.

  • 7/30/2019 A Summary Sentence Extraction Method for Web Based Mailing List Review Application and Its Effectiveness Study

    3/8

    We propose the following keyword extraction method: first, keywords that are related to thecontent of the messages in the discussion are extracted and displayed automatically. Based on thekeywords, sentences thought to be necessary for the discussion are pulled out and are listedtogether. Using this method, the topics in the discussion are expected to emerge more clearly.Furthermore, the flow of the discussion could be summarized to some extent.

    Appending a subject is another practice among users of e-mail. Those who use e-mail can

    append a subject by writing a topic in the header of the e-mail, and the subject is considered thebest way to represent the topic in a message. But, in practice, users seldom change the subject inreply e-mail. The reply e-mails subject is then automatically changed by the computer byprefixing Re: before the original subject; e.g. Re: (original subject). With subsequent exchangeof reply e-mails, even the topic in the thread changes gradually. As such, we cannot refer to thesubject as a keyword of the message.

    2.4 Characteristics of Extracted Keywords in This ResearchSeveral methods of information retrieval have been used for extraction of a part of Japanese text.For example, Kurohashi et al. [17] assume that the most important description of a word in a textis the part where the word occurs with the highest frequency. They have tried their methods on 20

    books. This method uses the density of the individual word in a book.An earlier research by Fujitani & Akahori [13] shows that mailing lists have less continuity oftopics than Netnews or WWW-based bulletin board systems, even if participants set theirmessages as the response of the preceding message. In contrast, in this paper we identify thekeyword(s) in mailing list. Participants of the mailing list, who are members of a special interestgroup, have a tendency to refer to these keywords repeatedly in their e-mails.

    2.5 Algorithm for Keyword ExtractionIn this paper, the authors apply the morphemeanalysis system, JUMAN [18], for theextraction. Figure 1 shows a diagram of thiskeyword extraction method. The morpheme

    analysis method is as follows:(1) The morpheme is analyzed for the

    contributing message, and by judging thepart of speech for every morpheme, thenouns are pulled out. These are assumedto be keyword candidates in themessage. Authors set Japanese languageheuristics for this procedure (asestablished by the preceding researches).First, nouns written only in Hiragana areexcluded because many of the nouns are

    used in a different way from the originalliteral meaning. Second, nouns of oneKanji character are excluded [19]because most of the nouns are socommon that they are not suitable askeywords.

    (2) In same way, selection of keyword candidates from the preceding message and the responsemessage are carried out. For the response message, if there are two or more messages sentconsecutively, nouns that are pulled out from at least one message are all included in the list ofkeyword candidates.

    (3) In consecutive messages, the intersection of the keyword candidates is obtained. The authorsdefine this noun set as Pre noun sets of the target (response) message, which is concerned

    to the preceding message. On the other hand, another set of nouns can be obtained using theresponse message and target message similarly. That set of nouns will be referred to as thePost noun set of the target (preceding) message. In this paper, following noun sets in Figure

    response message

    Figure 1:diagram of the keyword extraction method

  • 7/30/2019 A Summary Sentence Extraction Method for Web Based Mailing List Review Application and Its Effectiveness Study

    4/8

    2 are defined. In a thread with more than two messages, intersection of Pre and Postnoun set can be obtained for each in both preceding and response messages. These noun setstake more specific characteristics into account than simple noun sets. For example,

    noun set has only nouns that newly appeared. These noun sets are used to lookfor a more appropriate keyword extraction method. An evaluation experiment of the keywordextraction method is described in next section.

    3 Important Sentence Extraction MethodGenerally, a summary sentence is a very simplified expression taken from the contents with a

    certain length, e.g., a research paper. In individual cases of learning too, summary sentences areused to understand meanings of the contents [16].Instead of making new sentences for the summary, we generate a list of important sentences.

    In order to identify the sentence that provides the context for the keywords, we used theimportant sentences extraction method. By important sentences of the target message, wemean sentences that contain the keywords that were already extracted from each target message.Also, we assumed that the summary sentences are generated by compilation of the importantsentences. Usually several sentences were extracted as the important sentences from eachmessage.

    4 Evaluation Experiment of Keywords and Summary Sentences ExtractionGiven the above theoretical framework, we conducted an evaluation experiment of thesekeywords and summary sentences extraction method.

    4.1 Purpose of the ExperimentThe purpose of this experiment is to verify an appropriate keywords extraction method foridentifying e-mail topics or for making summary sentences, by comparing summary sentenceextraction results where the keyword is extracted using the methods as defined in this research,and where the keyword is extracted through human cognition.

    In this research, the summary sentences are generated utilizing the result of the keywordsextraction method. Specifying what summary extraction method is most suitable to humancognition can therefore be replaced with specifying what keywords extraction method is most

    suitable.

    4.2 SubjectsSubjects were 46 female university undergraduate students who were taking a lecture on teacherpreparation course. All participants have their own e-mail accounts and they have receivedinstruction on e-mail usage and basic computer operations at the university.

    4.3 ProcedureIn this experiment, the authors used e-mail messages from mailing list concerning appliededucational use of the Internet in overseas Japanese schools. The mailing list was maintained by us.It consisted of about 300 people, out of which, about 90% were elementary or junior-high schoolteachers.

    During the experiment, the subjects were first given a brief explanation of mailing lists ingeneral. After that, they each received handouts consisting of a series of eleven e-mail messages.

    Figure 2: List of defined noun sets in this research

  • 7/30/2019 A Summary Sentence Extraction Method for Web Based Mailing List Review Application and Its Effectiveness Study

    5/8

    The subjects were instructed to read each message and perform extraction work. They wereinstructed to pick out the important sentences in each message as described below.(1) First, select the one sentence considered to be the most important, and enclose it with a

    square.(2) Then underline the sentence(s) considered to be important and necessary to understand the

    content.

    We collected the handouts one week later, and counted the adoption frequency of eachimportant sentence.

    4.4 Evaluation method and resultsAs mentioned before, summary sentences are defined as the sentences that the subjects identifiedwith either a square or with an underline.

    For the purpose of evaluation and comparison between the results of the experiment and theimportant sentences as selected by the extraction methods, we defined two index values. The firstindex value, the summary degree of each message to evaluate the summary as the result of theexperiment which applies when the subjects adopt and concentrate on a very small number ofsentences. Thus, the value of the summary degree grows.

    The second index value, Selection degree for each important sentence extraction method is

    defined as follows.

    If the sentence extracted by each method is the same as the important sentences, bothindexes becomeequal. Furthermore,when each methoddetects thesentence which thesubject does not

    adopt in theexperiment, onlythe denominator ofthe selection degreegrows and theselection degreebecomes small. Onthe other hand, ifonly the sentencesthat a considerablenumber of subjects

    adopted areextracted by eachmethod, theselection degreegrows conversely.

    Therefore,when a morepointed summarywas produced byeach method andthe selection degree by each method becomes larger than the summary degree, it is interpreted tomean that the important sentences extraction method can make a more refined summary.

    Table 1 and Figure 3 shows where the selection degree of each method and the summarydegree and the experiment were calculated for 11 messages used in the experiment. In Table 1and Figure 3, the selection degree becomes higher than the summary degree when we compare

    (Summary Degree) = (the total number of marks on sentences by all the subjects) / (the number of marked sentences)

    (Selection Degree) = (the total number of marks by all t he subject s on extracted sentences using the method) / (the number of extracted sentences)

    Table 1: Thecomparison betweensummary degree byexperiment andselection degree byimportant sentenceextraction methods.

    Bold-faced values arelarger than summaryde ree.

    Figure 3 : The comparison between the experiment result and

    important sentence extraction methods

    0

    5

    10

    15

    20

    25

    30

    35

    809 810 812 843 845 84 7 8 50 85 1 852 853 854

    Article No.

    Summary

    Degree/SelectionDegree

    not

    not

    not()

    Selection Degree

    Pre________

    Post

    Pre

    ________

    PrePost

    Post

    PrePost

    Summary Degree

    Experimental result

    Figure 3: The comparison between the experiment result and

    im ortant sentence extraction methods

  • 7/30/2019 A Summary Sentence Extraction Method for Web Based Mailing List Review Application and Its Effectiveness Study

    6/8

    the summary in the experiment and the extracted important sentences made by the set of nounsusing the response message. In particular, the result from the set of nouns isremarkably high. It shows that the subjects adopted summary sentences as the sentences whichcontain the nouns that newly appeared in the message and consecutive thereafter in a thread.Although we should note that this result is dependent on the topic in the e-mail, we may alsoexplain that the subjects have judged that it is suitable for a summary of mailing list messages to

    adopt newly appeared topics in e-mail messages.

    5 Statistical Evaluation of the Retrieval MethodSummary sentence extraction methods have already been discussed in Section 2. Therefore,statistical evaluation of our method with other keyword retrieval method is needed. The evaluationof the retrieval method has been proceeded as follows. It has been found as an effective retrievalmethod for e-mail-based communication.

    5.1 Purpose of the EvaluationThe purpose of this evaluation is to verify appropriateness of keywords extraction method for

    identifying e-mail topics or for making summary sentences, by comparing keyword extractionresults using tf*idf keyword segmentation rule. The authors proposed frequency-based keywordextraction method, but this method uses coreference between preceding and reply e-mail. Authorsexamine whether the proposed method based on coreference is better than the keywordextraction method.

    5.2 ProcedureTf*idf value for keyword segmentation calculates individual word weights [6]. For each word in adocument, following weighted measure [16] is used:

    frequency)(document*frequency)termw (=

    where

    log

    (

    termincludewhichdocumentsofoccurance

    corpusindocumentsofnumberfrequencydocument

    document)intermofoccurancefrequencyterm

    =

    =

    This identifies the words in a document thatare relatively unique (hence, signature) to thatdocument.The authors applied this weighted measure forall the appeared independent words in e-mailthread which are used in evaluationexperiment. The authors define importantsentences of the target message as sentences

    that contain the above-mentioned keywords.Important sentence definition on tf*idf valuerule is same also in the proposed method.Also, There is no limit on the number ofsignature words used for a given document.Significance threshold are determinedempirically which marks best selectiondegree in e-mail thread.

    Figure 4 shows the results of selectiondegree between the important sentence extraction methods. Tf*idf value based rule didnt markhigh score during the thread. However, by the present extraction method, high scores are markedin the medium part of e-mail thread. We suggest that this method doesnt follow up the desired

    keyword towards the end, because these documents of e-mail thread tend to disregardconnectivity which makes this method weak.

    0

    5

    10

    15

    20

    25

    30

    35

    809 810 812 843 845 847 850 851 852 853 854

    Article N o.

    Sum

    ma

    ry

    De

    gree

    /

    Se

    lec

    tion

    De

    gree

    not tf.idf>1

    Figure 4:The relationship of selection degree between methods

    Proposed Method tf*idf Method Experiment

  • 7/30/2019 A Summary Sentence Extraction Method for Web Based Mailing List Review Application and Its Effectiveness Study

    7/8

    6 Generation of Summary Sentence and DigestThe authors developed summary sentence and digest generation functions as a compilation ofextracted important sentences of the messages. We use the important sentences extraction method

    which uses a noun set according to the result of the experiment. In this procedure,the authors regard these important sentences as summary sentences. These functions are anaddition to the WWW-based mailing list reference interface.

    Figure 5 shows thesystem diagram for summaryand digest generation frome-mail messages. Members inthe mailing list use their MUA(Mail User Agent) and poste-mail to mailing list.Simultaneously the e-mail isdistributed by MTI (Mail

    Transfer Interface), summaryand digest generation serverextracts keywords andreconfigures an e-mailsummary database. After that,the summary and digestgeneration server producesthe following outputs.(1) A WWW page with tree hierarchy view of e-mail messages. The hierarchy of messages can be

    pursued corresponding with each e-mail headers. This WWW page contains the messagenumber, posted date and time, the senders E-mail address, title, and summary sentences.

    (2) A mailing lists digest using the summary sentences. First, the digest generation server puts

    e-mail messages in order. According to the hierarchy of every group of messages, called thethread, the server joins the summary of each message in the order of the message number.The authors call this the digest of a thread. This digest is distributed by the mailing list atperiodic intervals.

    However, the message must be sent consecutively back and forth in the above-mentionedextraction method. Thus, the authors defined the following rules. In the message of the head, theresult of an important sentence which uses the method using noun set Post is expediently assumedto be summary sentences. In the message of the bottom, noun set Pre is expediently used togenerate summary sentences. Moreover, when a new consecutive message arrives andconsecutive messages increase, the summary sentences are generated again.

    7 ConclusionsWe have described a summary sentence extraction method of e-mail discussion and its web-basedapplication to mailing list review. We have proceeded an empirical trial to identify the location ofimportant sentences, and developed summary sentence and digest generation functions as acompilation of extracted important sentences of the messages. The main conclusion is that theextracted summary sentences which include new keywords correspond better to human ones thansummaries based on extracting sentences which emphasize continuing thread keywords.

    The authors have felt the necessity of a more empirical verification of the appropriateness ofthis keyword and summary extracting method in practical use.

    AcknowledgementOur special thanks are due to Madhumita Bhattacharya, Ph.D., the National Institute ofMultimedia Education, Chiba Japan, and Dr. Tanmoy Bhattacharya, Department of Phonetics

    45

  • 7/30/2019 A Summary Sentence Extraction Method for Web Based Mailing List Review Application and Its Effectiveness Study

    8/8

    Linguistics, University College of London, U.K., for reading the manuscript and making a numberof helpful suggestions.

    References[1] Roerden, L. P., Net Lessons: Web-based projects for your classroom, CA: OReilly & Assoc. (1997)

    [2] Chandra, H., Electric Mail: An Examination of High-End Users, Proceeding of Selected Research and

    Development Presentations at the 1996 National Convention of the Association for Educational

    Communications and Technology (1996)

    [3] Akahori, K. & Fujitani, S., A Survey of Educational Use of Internet in Overseas Japanese Schools (in

    Japanese), Proceedings of the 12th annual conference of Japan Society for Educational Technology, pp.63-64

    (1996)

    [4] Ahern, T. C., The Effect of Interface on the Structure of Interaction in Computer-Mediated

    Small-Group Discussion, Journal of Educational Computing Research, vol.11, no.3, pp.235-250 (1994)

    [5] Shintani, T. & Uchimura, T., Adventure of Media-kids -Record of education practice on Internet-

    (in Japanese), Tokyo: NTT publishing (1996)

    [6] Salton,G. & McGill, M.J., Introduction to Modern Information Retrieval, N.Y.: McGraw-Hill (1983)

    [7] Brandow, R., Mitze, K., & Rau, L. F., Automatic Condensation of Electronic Publications by Sentence

    Selection, Information Processing & Management, vol.31, no.5, pp.675-685 (1995)

    [8] Edmundson, H. P., New methods in automatic abstracting, Journal of the Association for Computing

    Machinery, vol.16, no.2, pp.264-285 (1969)[9] Kupiec J., Pedersen J., & Chen F., A Trainable Document Summarizer, Proceedings of the 18th

    Annual International ACM/SIGIR Conference, pp.68-73 (1995)

    [10] Sato M., Sato S., & Shinoda Y., Automatic Digesting of the NetNews (in Japanese), Journal of IPSJ,

    vol.36, no.10, pp.2371-2378 (1995)

    [11] Ishihara, M. & Akahori, K., Development of a System to Generate Digests of Internet Articles for

    Supporting Discussions (in Japanese), Japan Journal of Educational Technology, vol.22, no.1, pp.1-13 (1998)

    [12] Hearst, M. A. & Plaunt, C., Subtopic Structuring for Full-Length Document Access, Proceedings of

    the 16th Annual International ACM/SIGIR Conference (1993)

    [13] Fujitani, S. & Akahori, K., Keyword Sampling and Interface Improvement for the Review of e-mail

    (in Japanese), Proceedings of the 21st annual conference of Japan Association of Science Education, pp.57-58

    (1997)

    [14] Fujitani, S. & Akahori, K., Protocol Analysis of Decision Making Strategies under Computer Network

    Environment, Proceedings of First International Conference on Cognitive Science (ICCS97), Seoul, SouthKorea, pp.140-145 (1997)

    [15] Robinson, B., Electronic Communication and Collaborative Writing (in Japanese), Global Commons

    Inc., Japan, Trans., An Environment for Collaborative Learning: Proceedings of 4th Computer Pals

    International Conference [On-line], Available WWW: http://kids-commons.net/cpaw/jirei02/jirei018. html (1991)

    [16] Sato, S., Information Retrieval, In Nagao, M., Kurohashi, S., Sato, S., Ikehara, S., & Nakano, H.

    (Eds.), Linguistic Sciences: vol. 9. Language Information Processing (in Japanese), Tokyo: Iwanami Book

    Publishing (1998)

    [17] Kurohashi, S., Shiraki, N., & Nagao. M., A Method for Detecting Important Descriptions of a Word

    based on its Density Distribution in Text (in Japanese), IPSJ SIG Notes, Tokyo: Information Processing

    Society of Japan (IPSJ), 96-NL-115, pp.43-50 (1996)

    [18] Matsumoto, Y., Japanese morpheme analysis system JUMAN ver.3.0 (in Japanese), Kyoto, Japan:

    Kyoto University (1996)

    [19] Kokogawa, T., Keywords Extraction Methods for Group Information Sharing (in Japanese),Proceedings of IEICE Technical Report, Tokyo: the Institute of Electronics, Information, and Communication

    Engineers (IEICE), AI96-15, pp.51-55 (1996)