DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2016

Using Morphological Analysis in an Information Retrieval System for Résumés

SARA NORRBY

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Master’s Thesis in Computer Science (30 ECTS credits)
at the School of Computer Science and Engineering
Royal Institute of Technology

Supervisor at CSC: Dilian Gurov
Examiner: Viggo Kann
Employer: Netlight Consulting AB

July 2016

Abstract

This thesis investigates the use of an information retrieval system for résumés in Swedish and how morphological methods, such as lemmatization, affect the results. To investigate this, a small information retrieval system was built using lemmatization and compound splitting. The thesis also discusses how the relevance of a résumé can be decided and evaluates the information retrieval system in terms of precision, recall and ranking ability. The results show that morphological analysis had a positive effect in some cases, especially when the query contained more Swedish words than names of skills. When the query consisted mostly of technical skills, it had a negative impact. Lemmatization had a small positive effect on ranking ability, whereas compound splitting had a negative impact regardless of the queries’ features.

Referat

Användning av morfologisk analys i ett informationssökningssystem för CV:n (Using Morphological Analysis in an Information Retrieval System for Résumés)

This degree project investigates how the use of morphological analysis, such as lemmatization, affects the performance of an information retrieval system for résumés in Swedish. It also addresses how the relevance of a résumé can be assessed, and the information retrieval system is evaluated in terms of precision and recall as well as discounted cumulative gain, which is a measure of ranking ability. The results show that morphological analysis has positive effects when the query to the search system contains many Swedish words. When the query contained many names of technologies, using morphology proved to be negative, particularly with regard to the splitting of compound words. Lemmatization was the method that had a positive effect in some cases, while the splitting of compound words only had a negative effect.

Acknowledgement

I would like to express my sincere gratitude to those who have helped me in any way during my thesis project. I would especially like to thank:

Dilian Gurov, my supervisor at KTH, for the guidance during the thesis and for making sure I was on track.
Viggo Kann, my examiner, for great advice at the beginning of this thesis.
Kristoffer Högberg and John-Oskar Ahlström, my supervisors at Netlight, for their support and feedback.
Marcus Rönnmark, for the great support and encouragement during all stages of this project.

Contents

1 Introduction
    1.1 Research Question
    1.2 Delimitations
    1.3 Outline

2 Background
    2.1 Résumé Structure and Usage
    2.2 Creating an Information Retrieval System
    2.3 Term Frequency
    2.4 Using an Information Retrieval System
    2.5 Morphology
        2.5.1 Lemmatization and Stemming
        2.5.2 Compound Splitting
    2.6 The Swedish Language

3 Related Research

4 Methods and Tools
    4.1 Interviews with Experts
    4.2 Lucene - A Fast Search Library
        4.2.1 SwedishAnalyzer
        4.2.2 Queries and Parsers
        4.2.3 Boosts
    4.3 Stava - A Swedish Spelling Corrector

5 Evaluation of an IR System
    5.1 Testing an Information Retrieval System
    5.2 Precision and Recall
    5.3 Non-Binary Relevance
    5.4 Evaluate Ranking Ability

6 The Interviews
    6.1 Interview Results
    6.2 The Interview Tasks

7 Creation of the IR System and Tests
    7.1 The Information Retrieval System
        7.1.1 The Index
        7.1.2 The Queries
    7.2 The Tests
        7.2.1 The Collection of Résumés
        7.2.2 The Information Needs and the Queries
        7.2.3 The Relevance Judgements
    7.3 Pre-Tests

8 Results
    8.1 Precision and Recall
    8.2 Discounted Cumulative Gain

9 Discussion
    9.1 How Can You Assign Relevance to a Résumé?
    9.2 Precision and Recall
    9.3 The IR System’s Ranking Ability
    9.4 How Did Lemmatization Affect the Result?
    9.5 How Did Compound Splitting Affect the Result?
    9.6 Should an IR System for Résumés be Used?
    9.7 Ethics and Sustainability

10 Conclusion
    10.1 Possible Problems and Suggested Improvements
    10.2 Future Work

Bibliography

Appendices

A Precision/Recall-Graphs

B Interview Questions

Chapter 1

Introduction

The résumé is of high importance in recruitment: recruiters use it to determine whether an applicant has the right qualifications for a position. It is also important when a consultant company is in the process of selling a consultant to a client. Finding a suitable candidate for a certain job or task can, however, be a time-consuming activity. A place to start is to go through the possible candidates’ résumés. Instead of manually looking through each and every résumé that the recruiter thinks might be suitable, one can search among the documents to try to pick out the best ones without reading them all. The use of search engines to find documents is well established, and modern search engines can also rank the results and present the most relevant documents first. Using such a system for résumés could decrease the amount of work and perhaps suggest résumés that would have been overlooked in an initial selection.

This thesis will examine whether an information retrieval (IR) system can help recruiters find suitable résumés for certain jobs or tasks. The study will take place at a Swedish IT consultant company, so all of the résumés belong to technical consultants and are written in Swedish. The study will also use computational linguistics to try to enhance the performance of the IR system. This is a field that has been researched, but mostly for the English language and not with résumés as a focus.

Morphology, the study of words and their structure, is the area of computational linguistics that this thesis will focus on. Words are made up of building blocks, which makes it possible to create variations of words that have similar meanings, for example by giving a word different inflections [20]. Take the Swedish word for developer, ”utvecklare”, and the word for the developers, ”utvecklarna”, as an example: they have similar meanings but they are not identical. This is an issue for IR systems, since they find only exact matches and therefore miss documents that could be relevant. Using morphology in the IR system could help tackle this problem. In this study the use of lemmatization, which converts words to their normal form [26], and of compound splitting will be studied. Morphology is especially interesting in this study because Swedish is a morphologically complex language [17]: there are many possible inflections and it has many compound words.

The contributions of this thesis are the investigation of whether an information retrieval system can find relevant résumés given a description of a job or task, and the study of the use of morphology on résumés and the Swedish language.

1.1 Research Question

The question that has been examined in this thesis is:

How can an information retrieval system find and rank résumés relevant to a certain task or job, and how does the use of morphological analysis, such as lemmatization, affect the performance of that system?

This question has been divided into smaller sub-questions, which will be answered in order to complete the thesis.

• How can the relevance of a résumé be evaluated?

• Does the information retrieval system retrieve relevant résumés?

• How does lemmatization affect the results?

• How does the splitting of compounds affect the result?

In order to answer the research question presented in this section, a small information retrieval system using morphology was implemented. This system was then tested and evaluated in terms of precision, recall and ranking ability. Interviews were conducted at the company where this study took place, to gain insight into what the process of finding consultants for a job or task looks like today and to find out how the relevance of a résumé could be evaluated.

1.2 Delimitations

This study focuses only on morphological analysis, not on semantic or other linguistic methods. The focus is also only on information retrieval on résumés in Swedish.

1.3 Outline

The goal of the thesis is to determine how morphological analysis affects the results and whether using an information retrieval system for the purpose of finding relevant résumés is a good idea. This report starts with the background to résumés, information retrieval and morphology. For this thesis, interviews have been held, a small information retrieval system has been created and tests have been made examining this system and how morphology affects the results. The system is evaluated in terms of precision, recall and ranking ability.


Chapter 2

Background

In this chapter we look into the background of information retrieval systems and the concept of morphology. The chapter describes how they relate to each other and to the Swedish language. It also covers research on the structure of résumés.

2.1 Résumé Structure and Usage

Since résumés will be used as the documents for the information retrieval system, this section looks into research on the contents of résumés.

Résumés are used in several different cases. The most common one is when a person is applying for a job and the résumé is sent as a first step in the process. There is also the case of consultant companies trying to sell their employees to other companies by sending their résumés with the proposal. Although résumés are commonly used for recruitment purposes, the research on their structure is limited. Thoms et al. [29] examined résumés of college graduates and whether the presence of certain characteristics affected the outcome in recruiting situations. They looked at measures such as the length of the résumé, the presence of a grade point average (GPA) and whether a statement should be specific or general. The résumés were examined by trained human resources (HR) personnel, HR students and people without any connection to HR. The results showed that certain key characteristics were advantageous: for example, a short résumé and the presence of an average GPA of 3.0 were better than a longer résumé or giving no GPA at all. The authors also discuss earlier studies on what is preferred in a résumé. Some of these studies draw conclusions about which features are perceived as important and which are not, while others give inconclusive results. For example, a study by Hutchinson [18] shows that a résumé should include simple and straightforward information but leave out information about high school education, excessive information on current employers and personal information such as a picture. Some content, like extracurricular activities, gave mixed results, which made it inappropriate to draw conclusions.


2.2 Creating an Information Retrieval System

Information retrieval is the process of retrieving any type of information, typically text, to satisfy an information need [26]. Today we typically associate it with a search engine that we use to get information from the Internet. The study of information retrieval systems is not a new field and there are several different ways of constructing such a system. We will go into detail on some of these, with a focus on the index, the different models and the query.

Manning et al. [26] write that an index is built from the documents that we want to be able to search among: books, articles, blog posts or any other kind of text. How we choose to create the index depends on the set of documents. Some collections are almost static, for example a collection of books by a late author. The documents of such a collection will probably not be updated, and additions or deletions are rare, so the index only has to be created once. Other collections are updated regularly, with documents being added or deleted. In these cases we might want to use dynamic indexing, which essentially means that the index is reconstructed every once in a while.

The authors mention that the text is often refined in the process of creating an index. This involves removing stop words, which are words that are not important for the content of either the text or the query, for example pronouns and conjunctions. Once these words are removed, additional linguistic methods, discussed in section 2.5, can be applied. According to Manning et al., the indexing process is time-consuming and the indexer is governed by hardware constraints. When indexing large collections, such as the Internet, the process cannot be performed by a single machine and needs to be distributed.
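As an illustration of the indexing step described above, the following is a minimal sketch of an inverted index that drops stop words while it is built. The function name `build_index`, the whitespace tokenization and the tiny stop word set are assumptions made for this example; they are not taken from the system built in this thesis.

```python
from collections import defaultdict

# Assumption: a tiny illustrative sample of Swedish stop words,
# not the list used by any particular library.
STOP_WORDS = {"och", "i", "att", "en", "som"}


def build_index(documents):
    """Map each remaining term to the set of document ids that contain it.

    `documents` is a dict of document id -> raw text. Tokenization is a
    plain lowercase whitespace split, which is a simplification.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            if token not in STOP_WORDS:
                index[token].add(doc_id)
    return dict(index)


documents = {
    1: "Utvecklare i Java och Python",
    2: "Projektledare som arbetar i Java",
}
index = build_index(documents)
print(index["java"])    # -> {1, 2}
print("och" in index)   # -> False
```

For a static collection this index would be built once; for a collection that changes, the same function could simply be rerun at intervals, which corresponds to the periodic reconstruction mentioned above.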

Manning et al. also explain that there are many different ways of retrieving information, which are described as different retrieval models. A Boolean retrieval model is a model where a query is formed from the search terms together with the operators AND, OR and NOT. Probabilistic models rely on probability theory and present results ordered by probable relevance. In a Vector Space Model (VSM) each document is represented as a vector and all documents in a collection belong to a common vector space, in which each term has its own axis. When a VSM is used in the IR system the query itself is also converted into a vector, which makes it possible to find documents by measuring which document vectors are closest to the query vector. It also makes it possible to calculate the similarity between documents based on how close their vectors are to each other.
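The vector space idea can be sketched in a few lines. The example below assumes raw term counts as vector weights and cosine similarity as the measure of closeness; both are common choices but are assumptions here, and the names `cosine_similarity` and `rank` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_tokens, documents):
    """Return document ids ordered from most to least similar to the query."""
    query_vector = Counter(query_tokens)
    scores = {
        doc_id: cosine_similarity(query_vector, Counter(tokens))
        for doc_id, tokens in documents.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


documents = {
    "a": ["java", "developer", "java"],
    "b": ["python", "developer"],
    "c": ["project", "manager"],
}
print(rank(["java", "developer"], documents))  # -> ['a', 'b', 'c']
```

Document "c" shares no term with the query, so its vector is orthogonal to the query vector and it ends up last.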

Bendersky et al. [9] say that when a user performs a query, the IR system usually alters the query in some way. There are two types of alteration:

• Query refinement

• Structural alterations to query

The authors explain that query refinement is the first step and alters the query on a morphological level. It is at this point that spelling errors are corrected and compound words are split. The second type is performed after the initial query refinement and alters the query structurally. This can include dividing the query into concepts and weighting these concepts.

To determine which documents are relevant for a certain query, the documents in a collection need to be comparable. This brings up the topic of term frequency.

2.3 Term Frequency

In information retrieval the relevance of a document for a certain query is based on an assigned score, and there are different approaches to calculating this score. Manning et al. [26] mention a method that uses term frequency, that is, how frequent certain words are in the document. Term frequency is written $\mathrm{tf}_{t,d}$, where $t$ stands for the term and $d$ for the document. The score of a document is calculated with respect to the search terms and the weights of those terms in the document, where the weight of a specific term is based on its number of occurrences in the document. By comparing the scores assigned to the documents, a ranking can be made. Using term frequency as the score in this way does, however, have the issue that all terms are considered equally important, which is not the case. To tackle this, the authors explain that one can use inverse document frequency, denoted $\mathrm{idf}_t$, instead of term frequency to assign the weight of a term. The $\mathrm{idf}_t$ is calculated as:

$$\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t} \tag{2.1}$$

where $\mathrm{df}_t$, the document frequency, is the number of documents in the collection that contain the term $t$, and $N$ is the total number of documents in the collection. If a word is present in all documents in a collection, it is not meaningful to give that term much weight in that collection [26]. For example, in a collection of résumés for technical consultants the word developer might be mentioned many times. When a search for a certain type of developer is performed, it might not be helpful to give the word developer much weight if it is present in all documents. Because the document frequency is inverted, a term that is uncommon in the collection receives a high value and a term that is common receives a low value. When idf and tf are combined you get $\mathrm{tf\text{-}idf}_{t,d}$, calculated as the product of $\mathrm{tf}_{t,d}$ and $\mathrm{idf}_t$, which is a score that takes the term frequency in both the document and the collection into consideration [26].
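The weighting described in this section can be made concrete with a small worked example. The sketch below assumes a base-10 logarithm (the text does not fix the base) and raw counts for term frequency; the function name `tf_idf_scores` and the three toy résumés are invented for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(documents):
    """Return one dict per document mapping term -> tf * idf.

    `documents` is a list of token lists. idf is log10(N / df), where the
    choice of base 10 is an assumption made for this sketch.
    """
    n_docs = len(documents)
    document_frequency = Counter()
    for tokens in documents:
        document_frequency.update(set(tokens))  # count each term once per document

    scores = []
    for tokens in documents:
        term_frequency = Counter(tokens)
        scores.append({
            term: count * math.log10(n_docs / document_frequency[term])
            for term, count in term_frequency.items()
        })
    return scores


documents = [
    ["java", "developer", "developer"],
    ["python", "developer"],
    ["java", "developer", "tester"],
]
scores = tf_idf_scores(documents)
print(scores[0]["developer"])  # -> 0.0
print(scores[1]["python"] > scores[0]["java"] > 0)  # -> True
```

Here "developer" occurs in every résumé, so its idf is log(3/3) = 0 and it contributes nothing to the score even where it occurs twice, whereas "python", which occurs in only one résumé, receives the highest weight.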

2.4 Using an Information Retrieval System

According to Keskustalo et al. [23], users of a search engine typically use short queries when they search, between one and three words per query. The paper also shows that users perform a series of queries within the same task, on average 2.5 queries per task. The study tests different types of test queries: one-word queries, where the user uses few words in a query; incremental query extension, where the user adds words one by one to a query; queries where two words are fixed and a third word is varied; and a single verbose query, which is a long query of about 17 words that form the title and description of a document. The authors conclude that using multiple short queries in a session was successful in finding the document that the user intended to find, but that a single verbose query yielded the best results in all stages of the query session. It managed to find a highly relevant document in more than 75 % of the topics while using any of the other shorter queries in the study.

There are many places where a user can encounter search functionality, especially on the Internet, and search engines are perhaps the most common. The most popular one, Google, handles 100 billion searches each month [27]. A search on a web search engine can generate many thousands of hits, and the user cannot be expected to look through all of them. Granka et al. [16] used eye-tracking to monitor users’ behavior with search engines. Their results show that users spend almost an equal amount of time reviewing the first and second results, but thereafter the time spent looking at each result declines. After ten documents the time per hit declined steeply, which was also affected by the fact that the search engine displayed ten hits per page.

2.5 Morphology

This section explains the topic of morphology and the two methods that will be used in the IR system.

Morphology is the study of the structure of a word and how it is built. Martin and Jurafsky [20] describe how the small building blocks of a word are called morphemes and how they can be divided into two groups, stems and affixes. A stem is the main part of a word and affixes are morphemes that are added to a stem to give it different meanings. In the word dogs, for example, dog is the stem and -s is the affix. Affixes allow a word to occur in different forms, with different inflections and derivations. How common these variations are differs between languages. According to Hedlund et al. [17], a language can be considered simple or complex with regard to morphology. English is considered to be simple, while Swedish is considered to be morphologically complex.

2.5.1 Lemmatization and Stemming

Lemmatization and stemming are two methods that can be used to handle the challenges that variations of words impose. Manning et al. [26] explain that this is done by reducing a word to a simpler form, and even though lemmatization and stemming aim for the same goal they achieve it in different ways. Stemming cuts off the end of a word to get rid of affixes and obtain the simple form of the word. For example, a stemmer would reduce the word writing to “writ”. This process is not flawless and can give incorrect results; according to Manning et al. it improves recall while lowering precision. Lemmatization does it in a more proper way, by performing a morphological analysis with a vocabulary to get the dictionary form of the given word. The word writing will in this case be normalized to “write”, which is the dictionary form. Stemming reduces precision for the English language, but Braschler and Ripplinger [10], who examined stemming and decompounding for German, show that stemming in combination with splitting of compound words gave not only an increase in recall but an increase in precision as well.
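The difference between the two approaches can be shown with a toy example. The sketch below is not a real stemmer or lemmatizer: the suffix list, the minimum stem length and the small lemma dictionary are all assumptions chosen so that the example from the text, writing, behaves as described.

```python
# Assumption: a toy suffix list, far simpler than a real stemming algorithm.
SUFFIXES = ("ing", "ed", "s")

# Assumption: a hand-written vocabulary standing in for a real lexicon.
LEMMAS = {"writing": "write", "wrote": "write", "written": "write", "dogs": "dog"}


def toy_stem(word):
    """Cut off the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the vocabulary; unknown words are left unchanged."""
    return LEMMAS.get(word, word)


print(toy_stem("writing"))       # -> writ
print(toy_lemmatize("writing"))  # -> write
print(toy_stem("wrote"))         # -> wrote
print(toy_lemmatize("wrote"))    # -> write
```

The irregular form wrote shows why a vocabulary helps: cutting off endings cannot relate it to write, while a dictionary lookup can.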

2.5.2 Compound Splitting

Swedish is not only morphologically complex, it is also rich in compound words [17]. Splitting compound words therefore seems particularly interesting for information retrieval in languages with that characteristic. There are different variations of the splitting method. Braschler and Ripplinger [10] explain that a splitting method can be regarded as aggressive, which means that it splits the compound into its smallest parts. Some words, such as the Swedish word “fotbollslag” (football team), will then be split into “fot”, “bolls” and “lag” instead of “fotbolls” and “lag”. Since documents containing the word foot or ball may have nothing to do with football, this may be a disadvantage that causes an IR system to find irrelevant documents. A more conservative option is to split the word into its smaller parts only if all of the parts have the same part of speech as the original compound. There is also an intermediate option, more relaxed than the conservative one but stricter than the aggressive one, where at least one of the parts must have the same part of speech as the original compound. Braschler and Ripplinger did, however, show for information retrieval in German that there is little or no difference between an aggressive and a conservative method for compound splitting when it is used in combination with stemming.
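To make the “fotbollslag” example concrete, here is a minimal dictionary-based splitter. It is a sketch under several assumptions: the four-word lexicon is invented, a linking “s” is allowed on non-final parts, and the less aggressive mode simply prefers the fewest parts, which is a simplification and not the part-of-speech rule described above.

```python
# Assumption: a tiny hand-picked lexicon, used only for this illustration.
LEXICON = {"fot", "boll", "fotboll", "lag"}


def segmentations(word, lexicon):
    """Yield every way of cutting `word` into known parts.

    A non-final part may carry a linking "s", as in "fotbolls".
    """
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        part = word[:end]
        is_last = end == len(word)
        known = part in lexicon
        linked = not is_last and part.endswith("s") and part[:-1] in lexicon
        if known or linked:
            for rest in segmentations(word[end:], lexicon):
                yield [part] + rest


def split_compound(word, lexicon, aggressive=True):
    """Split into the most parts (aggressive) or the fewest parts (otherwise)."""
    if not aggressive and word in lexicon:
        return [word]  # a known word is left intact in the cautious mode
    candidates = [s for s in segmentations(word, lexicon) if len(s) > 1]
    if not candidates:
        return [word]
    choose = max if aggressive else min
    return choose(candidates, key=len)


print(split_compound("fotbollslag", LEXICON, aggressive=True))   # -> ['fot', 'bolls', 'lag']
print(split_compound("fotbollslag", LEXICON, aggressive=False))  # -> ['fotbolls', 'lag']
print(split_compound("java", LEXICON))                           # -> ['java']
```

With the aggressive setting a query for “fotbollslag” would also match documents that only mention “fot”, which is the kind of irrelevant hit the paragraph above warns about.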

2.6 The Swedish Language

Hedlund et al. [17] discuss aspects of the Swedish language and how they can affect the performance of information retrieval systems. Previous research within the field of IR and computational linguistics has mainly involved the English language. Swedish and other northern languages have features that make them different, and results from research on the English language can therefore not simply be applied to them. The authors analyze the Swedish language both in documents and as a query language. An ideal search would retrieve all of the relevant documents and none of the irrelevant ones, but this is seldom the case. The difficulties are mainly due to linguistic problems, which Hedlund et al. [17] summarize as:



• Selection of search keys

• Morphological variation of search keys

• Referred and omitted search keys

• Search key ambiguity

• Multilinguality

The first problem involves the fact that documents describing the same topic may use different words and expressions. The second one has to do with morphology, and the third one is connected to references that depend on context. The fourth problem concerns words that have more than one meaning, where determining which meaning is of interest can help the IR system. The fifth problem concerns information retrieval where more than one language is involved.

Out of these problems the second one is of particular interest and is connected to the matching process. Variation in search keys is problematic in IR, since a search key needs to be identical to a term in order to get a match and retrieve the document. The authors explain that in the English language this is not a big problem, since English is morphologically simple. Swedish and other languages such as German and French are not simple in that sense [17].

Performing a morphological analysis provides benefits for both indexing and retrieval. To tackle the challenges of a morphologically complex language, lemmatization and stemming can be used when indexing and searching. A study on stemming of Swedish news articles by Carlberger et al. [12] concluded that it increases precision by at least 15 % and relative recall by 18 %, even though earlier studies show that stemming makes little or no difference for English, Slovenian and Dutch. The authors are also convinced that the cost of creating a stemmer is proportional to the gain. There are, however, cases where stemming in German IR is unfavorable, namely when the queries contain names or words that are rarely inflected [10].


Chapter 3

Related Research

There is much research on information retrieval and computational linguistics, as described in earlier chapters. There is, however, only a limited amount of research on information retrieval and computational linguistics where the focus has been on résumés and on finding relevant ones. There has been research on the topic of matching skills to tasks. Colucci et al. [13] proposed an ontology-based semantic match of skill descriptions. They formalize some of the issues of the process and focus on one-to-one skill matching. They also discuss what a semantic approach to the problem should take into account. The advantages of a logic-based approach are that it can use the ontological structure, distinguish between different types of matches and propose a ranking that is close to what a human would choose. Colucci et al. [14] also discuss a semantic-based approach, where the focus is on many-to-many matching, and use a suitability matrix in which the suitability of profiles for assignments and people’s skills are represented. To assign the suitability between assignment and person they use description logic to formulate descriptions. The problem is visualized as a bipartite graph and they apply an algorithm for solving that problem.

E-recruitment is an area that has been growing ever since it became possible to apply for a job through the Internet [11]. Since then there have been several attempts to create e-recruitment programs to make the process easier, and several studies have examined methods for this type of program. Kessler et al. [24] present a system called E-Gen which attempts to analyze and categorize job offers and the relevance of candidate responses. The responses from the candidates contain both a résumé and a cover letter, and the job offers are taken from e-mails and parsed into different parts such as location and contract. In their method they use computational linguistic methods, both lemmatization and the division of compound words. The processing of the text makes it possible for them to represent the documents as a bag-of-words. They measure the similarity between the candidate profile and the job description in several ways, for example with Entertex, cosine and Overlap. They conclude that processing the job information is a difficult problem, since the information is unstructured, and that cosine similarity seemed to be the best way to tell the difference between a résumé or cover letter and a job offer. Another study that examines similarity is one by Cabrera-Diego et al. [11]. The similarity examined is the one among résumés, particularly résumés that are chosen for a certain job. They infer that these résumés should have more in common with each other than with the rest of the résumés, which were not chosen. For their experiments they use five different methods and, like the study by Kessler et al. [24], they conclude that cosine similarity gives the best results and supports their hypothesis.


Chapter 4

Methods and Tools

This chapter covers the methods used in this thesis. Interviews have been held to get an understanding of what the work process looks like for a person who finds consultants for client projects. The interviews also tried to establish what is important to look for in a résumé.

An information retrieval system will be constructed, and the essential features of the library used, Lucene, will be explained. The tool for morphological analysis, Stava, will also be presented.

4.1 Interviews with Experts

In order to determine how to assign relevance to a résumé for a certain job or task, and to understand what the process of finding a match for an incoming assignment looks like, interviews were performed with people who work with that task on an everyday basis.

The questions, the tasks and the interviews were created together with Huy Tran from Linköping University. The purpose of the interviews was to understand how people currently work with finding consultants for an incoming project, if and how they use search tools in that process and how they choose their search terms. The contents of the interviews were adapted to this study, to questions more specific to the company and to the study of the collaborator, Huy Tran. The results from the interviews that are of interest for this thesis can be seen in chapter 6.

The interviews were semi-structured, since the set of questions was fixed but the questions themselves were open. This type of interview was chosen since we wanted to hear the interviewees’ own thoughts and be able to explore interesting topics if they were brought up. The interview consisted of two sets of questions and was conducted in Swedish; the translated sets can be seen in Appendix B. Which of the two sets of questions an interviewee got depended on whether the interviewee used an internal tool for searching or not. If it was not used, the interview skipped two questions. After the set of questions, a set of three tasks was presented, where each task was to find the résumé of a consultant who would fit an assignment description. The assignment descriptions were based on real requests and descriptions that the company had received from clients. Long descriptions had been shortened beforehand, some wordings were changed and some requirements were added to get variety among the tasks. There were three different sets of tasks and each interviewee got one set. The interviews were recorded so as not to miss any information.

4.2 Lucene - A Fast Search Library

Apache Lucene [1] is a library written in Java that is used for creating search engines and providing search functionality. Lucene is for example used in search servers such as Apache Solr and ElasticSearch. Lucene indexes the full text, which makes it possible to perform full text searches. Lucene uses a combination of two different retrieval models [7], the Vector Space Model and the Boolean retrieval model mentioned in section 2.2. To assign weights to terms Lucene uses tf-idf, but it is also possible to give boosts, see section 4.2.3. Lucene also provides numerous features for analyzing text, but mostly for English. The features for the Swedish language are a stemmer and the SwedishAnalyzer explained in the next section. Lucene was used for building the information retrieval system but was not used for the morphological analysis.

4.2.1 SwedishAnalyzer

Lucene has a Swedish analyzer [6]. The analyzer is used both when indexing the documents and when parsing a query. The Swedish analyzer was used for its functionality to remove stop words from the documents and queries. Stop words are words that are common in the language [26] and often belong to parts of speech such as pronouns, conjunctions and prepositions. The set of stop words used by Lucene for its SwedishAnalyzer contains 114 different words.

The analyzer also has a stemmer, but it was not used since lemmatization is instead provided by the program Stava, described later in this chapter. To avoid using the stemmer, a custom version of Lucene's Swedish analyzer was used. It has all the features and functionality of the original analyzer except for the stemmer.

4.2.2 Queries and Parsers

Lucene has many different ways to construct queries [4]. There are regular term queries, where the string with the query is parsed and perceived as one query with one or more terms. There is also the Boolean query, which consists of many smaller queries. The string with the query is broken into smaller pieces and each term is assigned a condition: it must be present, should be present or must not be present for a document to match.
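The must/should/must-not conditions of a Boolean query can be illustrated with a minimal sketch. This is plain Python, not Lucene's API; the function name `matches` and the rule that at least one optional term has to match when no required terms are given are assumptions made for the illustration.

```python
def matches(doc_terms, must=(), should=(), must_not=()):
    """Return True if a document satisfies the Boolean clauses (toy model)."""
    terms = set(doc_terms)
    # A single forbidden term rules the document out.
    if any(t in terms for t in must_not):
        return False
    # Every required term has to be present.
    if not all(t in terms for t in must):
        return False
    # Assumption: with no required terms, at least one optional term must match.
    if not must and should:
        return any(t in terms for t in should)
    return True


resume = ["java", "kanban", "projektledare"]
print(matches(resume, must=["java"], should=["scrum"]))
print(matches(resume, must=["java"], must_not=["kanban"]))
```

In this toy model the optional terms never exclude a document once the required terms are satisfied; in a real search engine they would typically only influence the ranking.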


After the initial choice of what query to use has been made, Lucene uses a query parser. As the name suggests, a query parser parses the query, and there are many different types. There is the standard version, called QueryParser, but there is also the MultiFieldQueryParser. The multifield query parser makes it possible to search different fields that were created when the index was built [3]. If we take a résumé as an example we could create different fields for the sections about education, technical skills or earlier projects and decide to search one or more of them with the query.

4.2.3 Boosts

Lucene also has the feature to assign a boost to documents, fields and query terms [2]. If a boost is assigned to documents or fields in a document, this is done when the index is created. If a query is to be boosted it is done when the query is created, where you choose to boost the terms of the query differently. When searching for a résumé this could for example be used to boost skills that are required without boosting the optional skills.
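The effect of boosting required skills over optional ones can be sketched with a toy scoring function. This is not Lucene's scoring formula; the function `boosted_score` and the boost values 2.0 and 1.0 are hypothetical and only show how a boost shifts the ranking.

```python
def boosted_score(doc_terms, query_boosts):
    """Sum the boosts of the query terms that occur in the document (toy model)."""
    terms = set(doc_terms)
    return sum(boost for term, boost in query_boosts.items() if term in terms)


# Hypothetical boosts: the required skill counts twice as much as the optional one.
query = {"projektledare": 2.0, "kanban": 1.0}

resume_a = ["projektledare", "java"]
resume_b = ["kanban", "java"]
print(boosted_score(resume_a, query), boosted_score(resume_b, query))
```

With these values a résumé that only contains the required skill scores higher than one that only contains the optional skill.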

4.3 Stava - A Swedish Spelling Corrector

Stava [15], [21] is a spelling corrector program created by Kann and Hollman that has many different functions. It can detect misspelled words and suggest corrections, and it can also split compounds, tag words with a part of speech and provide the dictionary form of a given word. Stava is run from the command line and examples of input and output can be seen below:

Compound Splitting
projektledare
projekt|ledare

Lemmatization
projektledaren
nn.utr.sin.def.nom=projektledare

The program takes one word as input and prints output for each input. If a word is not recognized or is misspelled it is defined as an unknown word; spelling corrections can also be shown. Stava is a tool used for grammar and spelling for Swedish. It also provides lemmatization and division of compound words, which are the features important for this study. Stava uses word lists that are based on SAOL, the Swedish dictionary that contains 126,000 words [5], for its lemmatization and compound splitting. Stava also has several lists with different types of words: three lists with computer related terms, a list with names et cetera. All the lists with computer terms and the list with names are used in this study, since the index and queries contain many computer terms and also names.
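The two output formats shown in the examples, `projekt|ledare` and `nn.utr.sin.def.nom=projektledare`, can be parsed with a few lines of Python. This is a minimal sketch that assumes the output always follows those two patterns; the function names are hypothetical, and how Stava reports unknown words is not covered by the examples, so a line without `=` is simply treated as "no lemma found".

```python
def parse_compound(output_line):
    """Split a compound-splitting result such as 'projekt|ledare' into its parts."""
    return output_line.strip().split("|")


def parse_lemma(output_line):
    """Split a lemmatization result 'tag=lemma' into (tag, lemma).

    Assumption: a line without '=' means that no lemma was produced.
    """
    tag, separator, lemma = output_line.strip().partition("=")
    if not separator:
        return None
    return tag, lemma


print(parse_compound("projekt|ledare"))
print(parse_lemma("nn.utr.sin.def.nom=projektledare"))
```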


Chapter 5

Evaluation of an IR System

In this chapter the measurements that are used to evaluate the information retrieval system are presented. There are measurements for precision and recall as well as a measurement of ranking ability. It is also explained how an information retrieval system is tested.

5.1 Testing an Information Retrieval System

To determine how well an IR system is performing, that is, how easy it would be for a user to find the relevant information, tests need to be performed. According to Manning et al. [26], in order to test an IR system you need:

1. A collection to test on

2. A set of information needs

3. The relevance judgement for each document-information need pair

There are sample test collections that are often used, for example TREC [8], which provides tests for both ad hoc retrieval and web retrieval. TREC also provides tests and relevance judgements. The set of information needs are the instances that are going to be tested; in the case of testing a collection of résumés it would be the description of a project or client need. To determine whether the result of a test is successful or not we need a judgement of how relevant a document is for each and every information need. It is important that the relevance of a document is based on the information need and not on the query itself. According to Manning et al. [26] a document is considered relevant if it contains the information that the user needs, not simply because the terms of the query are present. When initial tests are performed there might be a need to tweak certain parameters to get a better result. What is important in this situation is to perform the next set of tests on a different set of information needs and queries, since otherwise you will only know that you have enhanced the performance for the initial test queries and not the performance in general [26].


The most common measurements used for evaluating information retrieval systems are called precision and recall. They do however not account for the ability to rank documents of different relevance. This section will explain the common measurements and a way of measuring ranking ability.

5.2 Precision and Recall

Precision and recall are used to determine the performance of an IR system and are explained by Manning et al. [26]. Precision concerns whether the documents that were found are relevant, and recall indicates how many of the relevant documents in the collection were retrieved. The difficulty is not in achieving great precision or great recall on its own; that can easily be done by, for example, retrieving one correct document for great precision or all documents to achieve maximum recall. The difficulty is to have as great a recall as possible without losing precision. How precision and recall are calculated can be seen in equations 5.1 and 5.2.

$$ \text{Precision} = \frac{|\text{relevant items retrieved}|}{|\text{retrieved items}|} \tag{5.1} $$

$$ \text{Recall} = \frac{|\text{relevant items retrieved}|}{|\text{relevant items}|} \tag{5.2} $$
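Equations 5.1 and 5.2 translate directly into set operations. The following is a minimal sketch; the function name and the document identifiers are made up for the illustration, and the choice to return 0.0 for an empty set is an assumption rather than part of the definitions.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from collections of document identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items retrieved
    # Assumption: an empty denominator yields 0.0 instead of an error.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four résumés retrieved, three relevant in the collection, two in common.
precision, recall = precision_recall(["a", "b", "c", "d"], ["a", "c", "e"])
print(precision, recall)
```

Retrieving every document in the collection pushes recall to 1 while precision drops to the share of relevant documents in the collection, which is the trade-off described above.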

When it comes to evaluating ranked results, the precision and recall can be visualized in a precision-recall graph, with precision on one axis and recall on the other. The precision generally decreases as the recall increases.

One can also combine precision and recall into one measurement called F-measure or F1 score, which gives one measure that signifies how accurate a test is.

$$ F = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{5.3} $$
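As a small worked example of equation 5.3, the sketch below computes the F1 score from a precision and a recall value. The function name is hypothetical, and returning 0.0 when both inputs are zero is an assumption to avoid division by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (equation 5.3)."""
    if precision + recall == 0:
        return 0.0  # assumption: define F1 as 0 when both values are 0
    return 2 * precision * recall / (precision + recall)


# A precision of 0.55 with a recall of 1 gives roughly 0.71.
print(round(f1_score(0.55, 1.0), 2))
```

Because the F1 score is a harmonic mean, a perfect recall cannot compensate for a low precision: the result stays much closer to the lower of the two values than an arithmetic mean would.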

The authors also mention that a precision-recall graph has a sawtooth look. It is possible to disregard these small fluctuations, and one way to do this is to use interpolated precision p_interp. The interpolated precision at a recall level r is the highest precision found at any recall level r′ ≥ r; how it is calculated can be seen in equation 5.4.

$$ p_{interp}(r) = \max_{r' \geq r} \text{precision}(r') \tag{5.4} $$
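Equation 5.4 can be sketched in two steps: first compute the (recall, precision) point at every rank of a result list, then take the maximum precision over all points whose recall is at least r. The function names and the example ranking are made up for this illustration.

```python
def pr_points(ranked_relevance, total_relevant):
    """Return the (recall, precision) pair after each rank of a result list."""
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / total_relevant, hits / rank))
    return points


def interpolated_precision(points, r):
    """Highest precision among all points with recall >= r (equation 5.4)."""
    candidates = [precision for recall, precision in points if recall >= r]
    return max(candidates) if candidates else 0.0


# Relevance of a ranked list with four relevant documents in the collection.
points = pr_points([True, False, True, True, False, True], total_relevant=4)
print(interpolated_precision(points, 0.5))
```

In this ranking the raw precision at recall 0.5 is 2/3, but the interpolated value is 0.75 because a higher precision is reached further down the list at recall 0.75.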

5.3 Non-Binary Relevance

The different measurements presented in section 5.2 use a binary measure of the documents' relevance: relevant or not. However, relevance of documents is not always a binary matter. Documents can be considered to be more or less relevant or completely irrelevant. An alternative to evaluating relevance as a binary measure is to assign different levels of relevance to a document. Tang et al. [28] explore the concept of a multilevel scale for relevance. They got the best result with a seven-point scale but emphasize that their results need replication to evaluate whether the result can be generalized. Kekäläinen and Järvelin [19] propose a four-point concept for relevance where 0 is considered irrelevant and 1-3 is considered relevant, where 3 is highly relevant. These numbers, indicating relevance, will from now on be known as relevance levels. The scale they used was ordinal: it could not be stated that a document at relevance level 1 was three times less relevant than a document with relevance level 3. The authors concluded that using a multilevel scale for relevance can reveal interesting information that one would not detect otherwise. It can be that an IR system is great at retrieving documents that are highly relevant but worse at retrieving documents of low relevance.

5.4 Evaluate Ranking Ability

One additional type of measurement is discounted cumulative gain (DCG). Järvelin and Kekäläinen [22] mention that it is used to evaluate the ranking of the results of an IR system. If an IR system presents the results in a ranked order, it is interesting to evaluate the ranking with more than a precision-recall graph. To use DCG the documents in the search need to have different levels of relevance, for example irrelevant, relevant or highly relevant, and it is assumed that a highly relevant document is more valuable than a relevant one. It is also implied that a document with a high ranking number, meaning that it appears further down in the ranked list, should be considered less valuable to the user, since she is less inclined to look at documents with a high ranking number. The idea of DCG is that if an IR system gives a highly relevant document a high ranking number, this is negative and should be reflected in the evaluation of that IR system. The DCG vector for documents in a ranked list is calculated as:

$$ DCG[i] = \begin{cases} G[1], & \text{if } i = 1 \\ DCG[i-1] + \dfrac{G[i]}{\log_b(i)}, & \text{otherwise} \end{cases} $$

where G[i] is the relevance grade of the document at index i. It is possible to choose which logarithm to use in order to model the behavior of the user. If the user typically only examines the first ten results then the results should lose value more steeply than if a user usually examines the first 100 [19].
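The recursive definition of the DCG vector can be sketched as a short loop. The function name, the example grades and the default base b = 2 are assumptions for this illustration; with base 2 the discount for rank 2 is exactly 1, so the first two documents are not discounted.

```python
import math


def dcg(gains, base=2):
    """Return the DCG vector for relevance grades listed in rank order."""
    vector = []
    for i, gain in enumerate(gains, start=1):
        if i == 1:
            vector.append(float(gain))  # DCG[1] = G[1]
        else:
            # DCG[i] = DCG[i-1] + G[i] / log_b(i)
            vector.append(vector[-1] + gain / math.log(i, base))
    return vector


# Relevance levels (0-3) of the five top-ranked résumés.
print(dcg([3, 2, 3, 0, 1]))
```

Note that with a base larger than 2 the formula as written divides the gains at the earliest ranks by a value below 1; the original formulation by Järvelin and Kekäläinen applies the discount only from rank b onward.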

The DCG sum for a specific test is individual and not possible to compare to other tests, since they have different maximum values. By normalizing the DCG you get a value between 0 and 1 that is comparable between the tests. It is denoted as normalized discounted cumulative gain (NDCG) and it is calculated as:

$$ NDCG = \frac{DCG}{IDCG} $$

where IDCG is the ideal discounted cumulative gain, that is, the maximum possible value. The maximum value is calculated from an ideal ordering of the documents, from relevance level 3 down to 0.
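A minimal sketch of the normalization: the DCG of the actual ranking is divided by the DCG of the same relevance levels sorted in descending order. The function names and the base 2 logarithm are assumptions, and the ideal ordering is built only from the documents in the ranked list, which matches the ideal ordering of the collection when every document is retrieved.

```python
import math


def dcg_total(gains, base=2):
    """DCG of a whole ranked list: G[1] plus the discounted gains from rank 2."""
    total = 0.0
    for i, gain in enumerate(gains, start=1):
        total += gain if i == 1 else gain / math.log(i, base)
    return total


def ndcg(gains, base=2):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_total(sorted(gains, reverse=True), base)
    # Assumption: a list without any relevant document gets NDCG 0.
    return dcg_total(gains, base) / ideal if ideal > 0 else 0.0


print(ndcg([3, 2, 1, 0]))   # already in ideal order
print(ndcg([0, 1, 3]))      # highly relevant document ranked last
```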


Chapter 6

The Interviews

In this section the results of the interviews conducted with people working with sales of consultants are described and analysed. The results of the interviews are used for decision making in both implementation choices and test creation, which is why they are presented here.

In total seven interviews were conducted, each lasting between 15 and 20 minutes. All interviewees were people working with the task of finding client projects for the company's consultants. In section 4.1 the purpose of the interview is stated, and it is also explained that there were two sets of questions. Three of the seven interviewees got interview set number 1 and four got set number 2.

6.1 Interview Results

The first question was regarding how they proceed when they need to find a consultant for a project. Since the interviews were semi-structured, the results seen in figure 6.1 represent in how many of the answers to the first question the different concepts were mentioned. The order in which the concepts are presented in the graph signifies the order in which they were most often mentioned, from first at the left to last at the right side of the graph.

The interviewees explained their process and the concepts above were the ones mentioned. All interviewees mentioned checking availability, which means that they look at which consultants are available and could accept the assignment. Three of the interviewees mentioned that they think by themselves, off the top of their heads, about whether they know someone who could fit the required profile. Two of the interviewees said that they would ask co-workers if they knew anyone who might be a match. Four of the interviewees mentioned that they look at a tag line with keywords. The tag line is something that all the people working with this process can edit, and it contains a few keywords with skills and job titles. Six of the seven interviewees mentioned that they check the résumé and two mentioned that they call the consultant to talk about the position. One person mentioned having a close discussion with accounting.


Figure 6.1. Question about how the interviewees found a match for a client project.

The second question was regarding whether they used an internal search tool for finding résumés. The results can be seen in figure 6.2.

Four of the interviewees, 57%, said that they use the tool. Three interviewees answered that they did not use it. In this question some of the interviewees mentioned issues with the internal search tool, which made them use it less or not at all. The interviewees who answered that they do use the internal search tool got a couple of questions regarding their usage; those results are not of importance to this study. The other interviewees skipped directly to the following question. It is however interesting that a majority of the interviewees use the existing search tool, since it strengthens this thesis's focus on information retrieval systems.

The following question is particularly interesting; the results can be seen in figure 6.3. It shows the results for the question, ”If you were to tell how good a match a consultant is based on a résumé, what would you look at?”

Figure 6.2. Question about whether they used the internal search tool or not.

Figure 6.3. Question about what the interviewees looked for in a résumé.

Figure 6.3 shows which concepts were mentioned and how many of the interviewees mentioned each of them. Five people mentioned that they would look at the stated skills. Three mentioned that they would look at the previous projects and jobs that the person has had. Three people mentioned that they would look at the headlines, which describe what title the consultant has had during projects. Three people mentioned experience and another mentioned length of résumé. Two people mentioned requirements, which means any skills that are a must from the client's perspective. Three people mentioned that they would look at whether the consultant had worked in the same industry, and two people mentioned buzzwords. One person mentioned they would look at the complexity of the previous projects and another mentioned that they would look at whether the person had worked at the company before. Five of the interviewees mentioned personality, not as something they would look at in the résumé, but in the sense that you cannot tell a person's personality from a résumé. They said that the personality needs to fit the company and that the consultant's own feelings for a project are important.

It is possible to divide the concepts mentioned into two categories, concrete and abstract. Concrete concepts are things that are possible to read from the résumé directly, like technical skills and company names. Other concepts are more abstract, like experience and personality; in those cases it is the reader who interprets whether the résumé belongs to a person who is experienced or not.

6.2 The Interview Tasks

During the second half of the interview, the interviewees were given three different use-cases. These were descriptions of potential client projects and the interviewees were asked to use an internal search tool to find résumés that could match the project. The different query words used can be seen in table 6.1.

| Interviewee | Use-case 1 | Use-case 2 | Use-case 3 |
|---|---|---|---|
| 1 | javautvecklare | <name of company> esales | QA krav, management, testledning, PM3 |
| 2 | java systemutvecklare, datadriven, backendutvecklare | microservices, machine-learning | projektledare, kanban, test |
| 3 | java, devops, java, python, automatisering | java, angular | mobil, ux |
| 4 | SDK | esales | <name of a consultant> |
| 5 | - | dataprocessering, distribuerade system, hadoop, microservices | mocha+chai |
| 6 | java, infra | java mysql | iOS android |
| 7 | java, test, ”objective c” mobil | <name of company> | test, pm3, pejl |

Table 6.1. The terms chosen by the interviewees in the different use-cases.

Most of the queries were short, which is expected with Keskustalo's study [23] mentioned in section 2.4 in mind. Almost all of the terms chosen were present in the description text. While the tasks were being performed the interviewees were asked to explain how they were thinking when performing the search. Several mentioned choosing words that were not so common, for example not searching for Javascript in some cases, and noted that a search on Java would get a lot of matches. It is correct that typing a common term would result in many matches, but a common word also gets a lower scoring value, which makes documents appear far down in the results if they do not contain any of the other query terms. When choosing words for the queries in the tests they will be skills mentioned in the description. But instead of only choosing the words that are believed to be rare in the résumés, all the relevant skills and descriptions will be chosen, since if the skills are in the description they are relevant. We can also see that there is a great variety in what terms the interviewees chose to use; this is also to be expected with Hedlund et al.'s [17] first problem in section 2.6 in mind. That the interviewees select different search keys makes it difficult to draw any conclusions on how the choice of terms is made.
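Why a common term such as Java contributes less to the score can be illustrated with inverse document frequency, the idf part of tf-idf. The sketch below uses the textbook form log(N / df), which is an assumption and not necessarily the exact variant Lucene applies; the three toy résumés are made up.

```python
import math


def idf(term, documents):
    """Textbook inverse document frequency: log(N / df)."""
    df = sum(1 for doc in documents if term in doc)
    # Assumption: a term that occurs in no document gets weight 0.
    return math.log(len(documents) / df) if df else 0.0


resumes = [
    {"java", "hadoop"},
    {"java", "kanban"},
    {"java"},
]
print(idf("java", resumes))    # occurs in every résumé
print(idf("hadoop", resumes))  # occurs in one résumé
```

A term that occurs in every résumé gets the weight 0 in this form, so it cannot separate the documents, while a rare term gets a high weight.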


Chapter 7

Creation of the IR System and Tests

In this chapter the creation of the IR system and the test construction will be explained, along with the choices made. Pre-tests made to find any bugs in the implementation and to find optimal settings will also be presented.

The process to create the information retrieval system had three different steps:

1. Parse the résumés and use Stava and alter them morphologically

2. Create the index

3. Implement the search functionality and use Stava in the process

Parsing the résumés was not a big step, but using Stava to alter the content took more time. At this point it was noted that performing compound splitting on a résumé took a considerable amount of time. The lemmatization did however not take as much time. To create the index and implement the search functionality Lucene was used, and Stava was also connected to the IR system to provide morphological alterations to queries.

7.1 The Information Retrieval System

The last step of the implementation was to use Lucene to build an information retrieval system. This meant creating an index, using Stava in the process, and being able to take a query and perform a search on that index, again using Stava. The purpose of implementing an IR system was not to create a fully working system to be used, but rather to create a system where the tests could be performed.

7.1.1 The Index

Before creating the index the following procedure was followed. The Word documents were parsed into plain text files. These text files were then lemmatized with the use of Stava: every word that was not in its normal form was replaced with the dictionary form. The lemmatized plain text files were then used in the index. Lucene used the custom Swedish analyzer discussed in section 4.2.1. The indexed text is therefore the lemmatized text, without any stop words.

The compound splitting was not used during this stage and there are two reasons for this. The first reason is that the process was time consuming; it would have taken a lot of time and computer power to analyze all documents. The second reason is that compound splitting would then be handled either by adding the split words to the text or by replacing the compound with its parts in the document. Either way, performance is affected: in the first case compounds are matched an extra time, which boosts the score, and in the second case the actual compound, which should be more interesting than its parts, is no longer present. The benefits of compound splitting are tested by having it in the query instead, which will be explained in the following section. For lemmatization to be beneficial it has to be used in both index and query, otherwise the words will not match. Lemmatization was also a much faster process that could be done without any major issues.

The index was also created with three different fields: content, which contained all of the text from the résumés; special skills, which contains a section in the résumés that describes the skills the consultant has as a specialty; and title, which is the work title that the consultant uses. The index had fields to make it possible to test whether a boost to any of the fields yielded better results. Since we had three fields in the index, a multifield parser (see section 4.2.2) was used.

7.1.2 The Queries

Once the index was finished the process of constructing queries from user input began. The parsing of the query used the same analyzer as the parsing for the index. For the construction of the queries Stava was used with both lemmatization and splitting of compounds. The example below shows how a query string is parsed into a query. The queries used were Lucene's standard queries, explained in section 4.2.2. An initial idea was to use the Boolean queries explained in the same section, where certain skills can be specified as having to be present in order to get a match. Sometimes the requests from clients can specify some skills that are a must, and that needs to be taken into account. But if such a consultant is not available, it is still interesting to find the best possible match. Therefore the standard query is used instead of the Boolean queries.

Required: projektledare
Optional: javautvecklaren

The query above would in the end result in a query that looks like this:

Query: projektledare, javautvecklare, java, utvecklare, projekt, ledare

The input string is first put through the compound splitting process, and the terms resulting from the splitting are added to the end of the string. The string is then put through the lemmatization process where all words are converted to their normal form.


The query is constructed with the SwedishAnalyzer, so all stop words are removed. A query example where the query is a web developer with project leading skills can look like this:

Required: en webbutvecklare med projektledningsförmåga
Optional: -
Query: webbutvecklare, projektledningsförmåga, webb, utvecklare, projekt, ledning, förmåga

We can see from the example that the words en (a) and med (with) have been removed and that the compound words have been split and the parts added to the end of the query. The search then returns a specified number of matching documents. No score limit was set on what documents are retrieved; all the documents that were found were also retrieved. The reason for this was that it is difficult to choose a suitable limit, and we could still evaluate the results at different numbers of documents retrieved. The alternative would be to, for example, return all documents with a score higher than a certain threshold.
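The query construction described above (compound splitting, then lemmatization, then stop word removal by the analyzer) can be sketched as follows. The dictionaries `COMPOUND_PARTS` and `LEMMAS` are hypothetical stand-ins for calls to Stava, and `STOP_WORDS` is a tiny illustrative subset of the analyzer's stop word list; only the order of the steps is taken from the text.

```python
# Hypothetical stand-ins for Stava's compound splitting and lemmatization.
COMPOUND_PARTS = {
    "webbutvecklare": ["webb", "utvecklare"],
    "projektledningsförmåga": ["projekt", "ledning", "förmåga"],
}
LEMMAS = {"javautvecklaren": "javautvecklare"}

# Illustrative subset of the Swedish stop word list.
STOP_WORDS = {"en", "med"}


def build_query(text):
    """Turn an input string into a list of query terms."""
    words = text.lower().split()
    # Step 1: split compounds and append the parts to the end of the string.
    parts = [part for word in words for part in COMPOUND_PARTS.get(word, [])]
    expanded = words + parts
    # Step 2: replace every word with its dictionary form when one is known.
    lemmatized = [LEMMAS.get(word, word) for word in expanded]
    # Step 3: drop stop words, as the analyzer does when the query is parsed.
    return [word for word in lemmatized if word not in STOP_WORDS]


print(build_query("en webbutvecklare med projektledningsförmåga"))
```

With these toy dictionaries the call reproduces the seven query terms of the web developer example, in the same order.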

7.2 The Tests

Since a standard test collection with information needs and relevance judgements cannot be used, one needed to be created. Evaluating the relevance of a résumé with regard to a certain request from a client was a time consuming process. Because of this the size of the test collection is considerably smaller than what is common in testing of information retrieval systems. For the tests a total of 100 résumés were used. In a real-world instance a collection of 100 documents could represent the number of applications for a single job. Thus the use of 100 documents, although small in comparison to other tests of information retrieval systems, can be justified for this area of usage. The documents that were chosen were the most recently updated, complete résumés available. A summary of the five different tests can be seen in table 7.1.

| Test | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Average Relevance (0-3) | 0.85 | 0.63 | 0.73 | 0.99 | 1.18 |
| Relevant documents | 55 | 36 | 34 | 57 | 80 |

Table 7.1. A summary of the tests

The tests were performed for different settings in order to test how lemmatization and compound splitting affect the results and to be able to make a comparison. The tests were done with six different settings, which can be seen in table 7.2.

Standard-Standard is the setting with no alterations to either index or query, except for the parsing that Lucene performs with its Swedish analyzer. The Standard-Morpho setting has an unaltered index and a query that is both lemmatized and compound split. Morpho-Morpho has a lemmatized index and a query that is both lemmatized and compound split. The Morpho-Standard setting has a lemmatized index and an unaltered query. Morpho-Lemma has a lemmatized index and query, and Morpho-Split has a lemmatized index and a compound split query.

| Settings | Lemmatized Index | Lemmatized Query | Compound Split Query | Unaltered Index | Unaltered Query |
|---|---|---|---|---|---|
| Standard-Standard | | | | X | X |
| Standard-Morpho | | X | X | X | |
| Morpho-Morpho | X | X | X | | |
| Morpho-Standard | X | | | | X |
| Morpho-Lemma | X | X | | | |
| Morpho-Split | X | | X | | |

Table 7.2. A description of what characteristics the different settings have.
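The six settings can be written down as a small configuration table, which makes it easy to loop over them in a test run. The class and attribute names below are hypothetical; the flag values follow the descriptions of the settings in the text, where an unaltered index or query simply means that no flag is set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setting:
    lemmatized_index: bool
    lemmatized_query: bool
    split_query: bool


SETTINGS = {
    "Standard-Standard": Setting(False, False, False),
    "Standard-Morpho": Setting(False, True, True),
    "Morpho-Morpho": Setting(True, True, True),
    "Morpho-Standard": Setting(True, False, False),
    "Morpho-Lemma": Setting(True, True, False),
    "Morpho-Split": Setting(True, False, True),
}

for name, setting in SETTINGS.items():
    print(name, setting)
```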

7.2.1 The Collection of Résumés

Since the goal of this work is to determine if morphology gives any benefit when it comes to ranking of résumés, a standard test collection cannot be used. The test collection will instead consist of résumés of technical consultants at an IT consultant company. These résumés all follow the same template and are written in Swedish. The documents are about eleven pages long and contain both lists of technical skills and detailed descriptions of earlier experience.

7.2.2 The Information Needs and the Queries

The information needs consist of expressed needs for personnel that can fill a position at a potential client. These descriptions vary in length and in how detailed they are; one information need can be one sentence or a couple of pages long.

The information needs used for the pre-tests were the same ones as for the interview, see section 4.1.

The information needs for the tests were real requests that the company had received from clients. The length of the requests varied between 105 and 295 words. The average length was 197 words. The information needs were all written in Swedish. A summary of what positions were wanted can be seen in table 7.3.


| Role | Level | Special Skills |
|---|---|---|
| Fullstack developer | Senior | - |
| Frontend developer | - | UX |
| Backend developer | Senior | Java |
| Project Leader | - | Requirements analysis |
| Fullstack developer | - | Frontend focus |

Table 7.3. Information needs description

To test the IR system these information needs are used when the queries are to be given. The relevance judgements are made with the information needs but the queries consist of chosen words from the information needs. The queries used in the tests are verbose queries, long ones, since they yield the best results as mentioned in section 2.4.

7.2.3 The Relevance Judgements

Relevance judgements can be performed in different ways. Korenius et al. [25] had a test collection consisting of articles in Finnish. They had relevance judgements, with a four-point scale, for 16,539 articles and 30 queries, which had been evaluated by four different people. Some of the pairs were evaluated by more than one person, and when there was a disagreement on relevance the pair was reevaluated and settled. Of these 16,539 documents only a sample of 5000 was used due to lack of computational resources.

In this study the relevance judgements were performed with the use of guidelines, which were made with the interviews with experts as background. The final guidelines were also reviewed and approved by an expert working with the sales of consultants.

The relevance judgements were made in three steps. First the information need was reviewed and the key skills and important features were found. It was also decided how many skills were required for a certain relevance grade and whether experience was a factor. The next step was translating the skills and features into keywords, which were used by a small search program that summarized the résumés. The summary contained information about name and title, the required skills that the résumé contained, the number of projects the consultant had worked on and the number of quarters. It was also stated how many projects the consultant had worked on with each of the required skills, together with a summary of what optional skills the résumé contained. This summary was used to determine which résumés were completely irrelevant for the information need, and also to indicate how relevant the résumé seemed to be. The last step was to look through the résumés that were considered to be relevant. A relevance level from 1-3 was assigned based on the contents of the first summary, the contents of the first three pages of the résumé and the information about skills used in projects. The relevance was decided with the help of the guidelines in table 7.4. Like the relevance levels described in section 5.3 the scale of relevance is ordinal. This means that we cannot conclude that a résumé with relevance level 2, for a certain information need, is twice as relevant as a résumé with relevance level 1; we can only say it is more relevant.

| Projects with R-skill | ¬ R-Skill | R-Skill | R-Skill & O-skill | R-Skill & O-skills |
|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 |
| 1 | - | 2 | 2 | 3 |
| 1 ≥ | - | 2 | 3 | 3 |

Table 7.4. Relevance guidelines: where R-skill means required skill and O-skill means optional skill.

7.3 Pre-Tests

In order to test the implementation, to find any bugs and to be able to make improvements on the information retrieval system, pre-tests were performed. These tests used a small test collection of nine documents and three information needs. The relevance judgements were made in the manner explained in section 7.2.3. During the initial tests, several parameters were tweaked. This involved weighing different sections of the résumés differently, for example giving the special skills section more boost, and giving different boosts for required skills and optional skills. None of these changes had a positive impact on the results, so the settings ultimately used were no boosts at all.


Chapter 8

Results

In this chapter the results of the tests are presented. The results show the precision and recall of the information retrieval system, as well as its ranking ability in the form of precision/recall curves and normalized discounted cumulative gain. The results are presented for the six different settings explained in section 7.2.

8.1 Precision and Recall

The average precision, recall and F1 score of the tests, for each of the different settings, can be seen in table 8.1. Since no score threshold was set, and the maximum number of allowed retrievals was set to 100, all of the documents found were retrieved.

Settings            Precision   Recall   Precision’   F1 score   F1 score’
Standard-Standard   0.55        1        0.67         0.71       0.80
Standard-Morpho     0.53        1        0.66         0.69       0.80
Morpho-Morpho       0.53        1        0.68         0.69       0.81
Morpho-Standard     0.55        1        0.66         0.71       0.80
Morpho-Lemma        0.55        1        0.67         0.71       0.80
Morpho-Split        0.53        1        0.65         0.69       0.79

Table 8.1. The precision, recall and F1 score values for the different settings. The precision’ value indicates what the precision was when the last relevant document had been retrieved.

The values in table 8.1 are the precision, recall and F1 score values when all retrieved documents are taken into account, and the values marked with an ’ are the values when the last relevant document has been found. These numbers are averages over the five different tests. The average precision in the first column was 0.55 for Standard-Standard and 0.53 for Standard-Morpho. Morpho-Standard and Morpho-Lemma got a precision value of 0.55. Morpho-Morpho and


Morpho-Split got 0.53 in precision. That the recall is 1 indicates that all of the relevant documents were retrieved, which all of the settings managed. The precision varies depending on how many irrelevant documents were retrieved; the settings that had fewer false positives also had higher precision. The second column, precision’, indicates what the precision would have been if we had stopped looking when the last relevant document had been found. We can see that the precision overall would be higher and that Morpho-Morpho received the best result. If we look at the F1 scores, we can see that the F1 score calculated on the first precision column is 0.71 for Standard-Standard, Morpho-Standard and Morpho-Lemma. The score is 0.69 for Standard-Morpho, Morpho-Morpho and Morpho-Split. The second column, F1 score’, shows that Standard-Standard, Standard-Morpho, Morpho-Standard and Morpho-Lemma have a score of 0.80. Morpho-Morpho got 0.81 and Morpho-Split got 0.79.
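The F1 score is the harmonic mean of precision and recall, so the F1 columns of table 8.1 can be recomputed from the precision columns and a recall of 1. The following is a minimal sketch; the function and variable names are chosen here for illustration and are not taken from the thesis implementation, and the inputs are the rounded averages from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, precision') per setting, as printed in table 8.1.
# Recall is 1 for every setting.
table_8_1 = {
    "Standard-Standard": (0.55, 0.67),
    "Standard-Morpho": (0.53, 0.66),
    "Morpho-Morpho": (0.53, 0.68),
    "Morpho-Standard": (0.55, 0.66),
    "Morpho-Lemma": (0.55, 0.67),
    "Morpho-Split": (0.53, 0.65),
}

# F1 and F1' for each setting, rounded to two decimals like the table.
f1_rows = {
    name: (round(f1_score(p, 1.0), 2), round(f1_score(p_last, 1.0), 2))
    for name, (p, p_last) in table_8_1.items()
}

for name, (f1, f1_last) in f1_rows.items():
    print(f"{name:18s} F1={f1:.2f} F1'={f1_last:.2f}")
```

With recall fixed at 1, F1 is a monotone function of precision, which is why the F1 columns rank the settings exactly as the precision columns do.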

Settings            Documents retrieved   Relevance level 3   Relevance level 2   Relevance level 1
Standard-Standard   96%                   5.06                2.54                1.74
Standard-Morpho     100%                  5.48                2.59                1.77
Morpho-Morpho       100%                  5.73                2.58                1.74
Morpho-Standard     96%                   4.12                2.56                1.75
Morpho-Lemma        97%                   4.93                2.59                1.74
Morpho-Split        100%                  5.72                2.65                1.79

Table 8.2. This table shows how many of the documents were retrieved. The other values represent the factor of how many more documents had to be considered before all documents of the relevance level were found.

In table 8.2 we can see how many of the documents in the index were retrieved. The other values represent the factor by which we have to multiply the number of documents at a certain relevance level to find all documents at that relevance level. We can see that to find all of the most relevant documents we had to look at about five times as many documents as would have been necessary if the ranking had been perfect. The tests had between four and ten documents at relevance level 3. We can also see that we would have to look at almost twice as many documents as there are relevant ones to find all the relevant documents.
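One possible reading of the factor in table 8.2 can be sketched in code. The sketch assumes that the factor for a level is the rank of the last document at or above that level, divided by the number of such documents. The thesis does not state whether "at that relevance level" means exactly that level or at least that level, so this interpretation, the function name and the example ranking are all assumptions made for illustration.

```python
def depth_factor(ranked_levels, level):
    """How many times more documents must be read than under a perfect ranking.

    ranked_levels: relevance level (0-3) of each retrieved document,
                   best-ranked first.
    level:         the relevance level of interest (assumed to mean
                   "at least this level").
    """
    positions = [
        rank
        for rank, grade in enumerate(ranked_levels, start=1)
        if grade >= level
    ]
    if not positions:
        raise ValueError("no document at or above the requested level")
    # Rank of the last qualifying document / number of qualifying documents.
    return positions[-1] / len(positions)


# Invented ranking of ten retrieved documents (not data from the tests).
ranking = [2, 3, 0, 1, 3, 0, 2, 0, 1, 0]

print(depth_factor(ranking, 3))  # 2.5
print(depth_factor(ranking, 2))  # 1.75
print(depth_factor(ranking, 1))  # 1.5
```

A perfect ranking gives a factor of 1 at every level, so values around 5 for relevance level 3 mean the last highly relevant résumé appeared far down the list.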

In figure 8.1 we can see the precision/recall curves, which indicate the connection between the two measurements, for all of the different settings. All of the curves start where the precision is 1 and the recall is close to 0. As the recall increases, by looking at more and more retrieved documents, the precision declines. When the recall value is between 0.1 and 0.4 there is a greater difference between the settings than between 0.5 and 1. For the precision/recall graphs of each individual setting, including the interpolated precision/recall curves, see appendix A.
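The points of a precision/recall curve, and the interpolated precision used in the graphs of appendix A, can be computed from a ranked result list as sketched below. Interpolated precision follows the standard definition (the highest precision at any recall level greater than or equal to the requested one). The function names and the example ranking are illustrative assumptions, not taken from the thesis code.

```python
def precision_recall_points(ranked_relevant, total_relevant):
    """Return (recall, precision) after each relevant document in the ranking.

    ranked_relevant: booleans, best-ranked first; True means relevant.
    total_relevant:  number of relevant documents for the information need.
    """
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevant, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points


def interpolated_precision(points, recall_level):
    """Highest precision observed at any recall >= recall_level."""
    candidates = [p for r, p in points if r >= recall_level]
    return max(candidates, default=0.0)


# Invented ranking with four relevant documents among seven results.
ranking = [True, False, True, True, False, False, True]
points = precision_recall_points(ranking, total_relevant=4)

for recall, precision in points:
    print(f"recall={recall:.2f} precision={precision:.2f}")
print(interpolated_precision(points, 0.5))  # 0.75
```

When the first result is relevant, the curve starts at precision 1 with a recall of one divided by the number of relevant documents, which matches the starting point described for figure 8.1.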


Figure 8.1. In this graph we can see all of the precision/recall curves.

8.2 Discounted Cumulative Gain

In this section we can see the discounted cumulative gain for the different settings. Both the DCG and the NDCG are presented. The formula for calculating DCG can be seen in section 5.4. As explained in section 5.4, the logarithm base of the DCG can be chosen depending on how the users are expected to behave, that is, whether they are expected to look through many results or few. The logarithm base used here is 2, since the users are expected to look through the majority of the results.
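As a rough illustration of the calculation, the sketch below follows the DCG formulation of Järvelin and Kekäläinen [19], where the gain at rank i is divided by the base-b logarithm of i for ranks at or beyond b and left undiscounted before that. It assumes that this matches the formula in section 5.4 and that the relevance levels 0 to 3 are used directly as gains; both are assumptions made here, as are the function names and the example ranking. The ideal ranking is taken from the same result list, which is only valid when all relevant documents were retrieved (recall 1, as in these tests).

```python
import math


def dcg(gains, base=2.0):
    """Discounted cumulative gain of a ranked list of gains."""
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        # No discount before rank `base`; log_base(rank) from then on.
        discount = max(1.0, math.log(rank, base))
        total += gain / discount
    return total


def ndcg(gains, k=None, base=2.0):
    """DCG of the first k results divided by the DCG of an ideal ranking."""
    k = len(gains) if k is None else k
    ideal = sorted(gains, reverse=True)
    best = dcg(ideal[:k], base)
    return dcg(gains[:k], base) / best if best > 0 else 0.0


# Invented ranking: relevance levels of four retrieved résumés.
example_gains = [1, 3, 0, 2]
example_dcg = dcg(example_gains)
example_ndcg = ndcg(example_gains)

print(example_dcg)
print(round(example_ndcg, 3))
print(round(ndcg(example_gains, k=1), 3))
```

Calling `ndcg` with a cut-off `k` corresponds to evaluating the ranking after a given number of considered documents.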

Settings            DCG     Normalized DCG
Standard-Standard   25.96   0.91
Standard-Morpho     24.96   0.87
Morpho-Morpho       24.41   0.85
Morpho-Standard     24.69   0.86
Morpho-Lemma        24.63   0.86
Morpho-Split        24.25   0.85

Table 8.3. The discounted cumulative gain (DCG) and the NDCG for the different settings.

In table 8.3 we can see the DCG and the NDCG. The values are averages over the five different tests. Standard-Standard got 0.91 in NDCG and Standard-Morpho got 0.87. Both Morpho-Morpho and Morpho-Split got 0.85, and both Morpho-Standard and Morpho-Lemma got 0.86.


Figure 8.2. This graph shows the NDCG for the different settings at different numbers of considered documents.

Looking at figure 8.2, where the NDCG curves can be seen, the NDCG is low at just one retrieved document, indicating that the first result in the different tests is not one of the most relevant documents. The NDCG does however increase steeply among the first 20 documents and then levels off. Up to 20 retrieved documents, Standard-Standard has a slightly better result than the other settings. It is also visible that Morpho-Split has the lowest NDCG after 20 retrieved documents.

In figure 8.3 the NDCG is shown for each setting, grouped by test.

Figure 8.3. The NDCG value for each setting for the different tests.


Note that the Y-axis starts at 0.8. Standard-Standard together with Standard-Morpho got the better results in the first test, and Standard-Standard got the best result in the second test. In the third test Standard-Morpho and Morpho-Lemma received good results. In test 4 Morpho-Morpho got the best result, and in the last test Standard-Standard had the highest result. The last two tests got the highest results, and test 4 also had less dispersion than tests 1 and 2.


Chapter 9

Discussion

In this chapter the results will be analyzed and discussed. The performance of the IR system will be evaluated, the effect of using lemmatization and compound splitting will be discussed and the research questions will be answered. The results from the interviews will also be brought up.

9.1 How Can You Assign Relevance to a Résumé?

One of the challenges of this study was to come up with a method for assigning the relevance level of a résumé for a certain job description. In order to answer this question, interviews were performed with people working with the task on a daily basis. The results of the interviews showed that many of the interviewees had different methods, but some similarities were found. All but one of the interviewees mentioned that they do check the résumé, but at a quite late stage in the process. What was considered important in the résumés was technical skills, experience, previous titles and companies the person had experience with. These concepts should be considered in the process of assigning relevance to a résumé. It is also good to use a non-binary scale for the relevance judgements, since résumés are more or less relevant for different jobs. In this study, the graded scale also revealed interesting results which would not have been visible otherwise.

The research on the topic of résumés is limited, and the interviews made in this study are few. Since résumés are an important part of recruitment, more studies on the subject could be done. Such results can contribute both to people trying to write a good résumé and to any program that attempts to help in the recruitment process.

9.2 Precision and Recall

Looking at the results for precision and recall in table 8.1, when all retrieved documents are considered, we can see that the recall values for all the different settings are the same. This also makes the F1 measure less interesting, since it does


not show anything that is not already visible in the precision values. This is due to the fact that no limit was set on the number of documents retrieved and all the relevant documents were retrieved. Looking at the precision, we see that Standard-Standard has the best result. This is not surprising, since lemmatization and compound splitting make more documents match the query, which reduces precision. That is also why Morpho-Standard has a slightly better precision result. These two settings, Standard-Standard and Morpho-Standard, retrieved fewer documents, and the ones not retrieved were all irrelevant.

Since the information retrieval system had no score limit, but presented all of the retrieved documents, the precision was low and the recall high. If we look at the second value, precision’, where we do not look at all of the retrieved documents but only at the ones retrieved before the last relevant one, we can see that the precision values present slightly different results. The Morpho-Morpho setting has the highest precision. This is also visible in the precision/recall graph in figure 8.1, but it is slightly difficult to read.

If a limit had been set, so that the retrieved documents with the lowest scores were not presented, Morpho-Morpho would have gotten slightly better results than the other settings, and the risk of not presenting a relevant document would have been lower than for the other settings.

9.3 The IR System’s Ranking Ability

The ranking ability of the IR system is evaluated by looking at the precision/recall graphs and the NDCG values. If we start by looking at table 8.3, we can see that the Standard-Standard setting has the highest NDCG value, 0.91, and that Morpho-Split and Morpho-Morpho have the lowest value, 0.85. A perfect ranking, where all documents at relevance level 3 come first, followed by the relevance level 2 documents and then all relevance level 1 documents before any irrelevant ones, would have received a score of 1. If we look at the precision/recall curves in figure 8.1, we can see that the different settings are more dispersed when the recall is low and the precision is high. Thereafter there is little difference between the settings, which is not surprising since all the settings eventually reach recall 1, where the precision is about 0.6. Such a relatively high lowest precision is uncommon in the situations most users are used to, such as using a web search engine. But in this case, where the task is finding a relevant résumé among many similar résumés, it is not unexpected. Whether these ranking results are good or not depends on what the user expects from the information retrieval system and what the alternative is. The conclusion we can draw here is that if morphology were to be used with no score limit, it would have a negative impact on the ranking ability. Whether using an information retrieval system with this ranking ability is a good idea will be discussed in section 9.6.


9.4 How Did Lemmatization Affect the Result?

Lemmatization, converting words to their normal form, was used both in the creation of the index and in the queries. It was expected that the recall would increase and that the precision could be harmed.
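The effect of applying the same normalization to both index and query can be illustrated with a toy inverted index. This is only a sketch. The lemma dictionary is a hypothetical stand-in for a morphological tool, the function names are invented for the example, and the two miniature résumés are made up. It does not reproduce the Lucene and Stava setup used in the thesis.

```python
# Hypothetical lemma dictionary (inflected form -> normal form).
LEMMAS = {
    "projektledaren": "projektledare",
    "projektledarna": "projektledare",
    "kravanalysen": "kravanalys",
    "kravanalyser": "kravanalys",
}


def lemmatize(token):
    """Lowercase the token and map it to its normal form if known."""
    lowered = token.lower()
    return LEMMAS.get(lowered, lowered)


def build_index(docs, normalize):
    """Map each normalized token to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(normalize(token), set()).add(doc_id)
    return index


def search(index, query, normalize):
    """Return every document that matches at least one query token."""
    result = set()
    for token in query.split():
        result |= index.get(normalize(token), set())
    return result


# Invented miniature résumés.
docs = {
    "cv1": "Projektledaren ansvarade för kravanalysen",
    "cv2": "Utvecklare med Java och SQL",
}

standard_index = build_index(docs, str.lower)
lemma_index = build_index(docs, lemmatize)

print(search(standard_index, "projektledare", str.lower))  # set()
print(search(lemma_index, "projektledare", lemmatize))     # {'cv1'}
print(search(lemma_index, "Java", lemmatize))              # {'cv2'}
```

The inflected Swedish word only matches once both sides are lemmatized, while an uninflected skill name such as "Java" matches either way, which is in line with lemmatization helping mostly for queries with many Swedish words.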

If we look at the precision and recall values in table 8.1, we can see that when we look at all of the documents retrieved, the precision is lower when using lemmatization. In these tests this indicates that more irrelevant documents were retrieved, since the recall value is 1. Looking at precision’, where we only consider the first n documents until all relevant documents were found, we can see that using lemmatization gives small benefits if it is used in both index and query.

In terms of ranking ability it seems as though using lemmatization in the index has a negative impact, if we look at the NDCG values in table 8.3. Using lemmatization in the query has a negative impact when used with a non-lemmatized index and little impact when used together with a lemmatized index.

If we look at figure 8.3, we see that using lemmatization in the query with a non-lemmatized index gives slightly better results than the Standard-Standard setting in two tests and notably worse results in one case. In the other two cases it has slightly worse results than using no lemmatization. The results show that lemmatization is good to use in some cases, but seems to have a negative impact in most of the tests. The reason for this is probably that all of the job descriptions and the résumés contained several technical skills. Section 2.6 presented results from Braschler and Ripplinger’s study showing that stemming had a negative effect in German in cases where the query contained names or words for which inflections are rare. The results of this thesis indicate that this could also be the case for Swedish and lemmatization, since the technical skills rarely have inflections.

Lemmatization does however seem to have a mildly positive effect on the ranking ability if the retrieved documents are filtered with a threshold value on the score, and also a small positive effect on precision. If we look at table 8.2, we can see that in order to find the most relevant documents as soon as possible in the list of results, we should use the Morpho-Standard setting; the second best is Morpho-Lemma. Using lemmatization has a positive effect in this case. Using lemmatization should also have had a positive effect on the recall, as Braschler and Ripplinger showed for stemming (see section 2.5.1), but since the recall was 1 for all settings such results could not be seen.

9.5 How Did Compound Splitting Affect the Result?

Compound splitting was used in the query, not in the index, and while lemmatization seems to have had some positive effect, compound splitting seems to have had only a negative impact on the results, regardless of whether it was used with lemmatization or not. Looking at table 8.1, the three settings that got the lowest scores have queries with compound splitting in common. Looking at the precision


where we only consider the results until all relevant documents were found, Morpho-Split had the worst result and the two settings that got the highest results were the ones not using compound splitting.

In figure 8.1 we can see that it is the Morpho-Split curve that has the lowest values between the recall values 0.1 and 0.5. If we look at the NDCG values in table 8.3, it is Morpho-Morpho and Morpho-Split that got the lowest values. Standard-Morpho got a better result than the settings using a lemmatized index, but still a worse result than using no morphology.

If we continue by looking at figure 8.3, we can see that Morpho-Split has the worst result in the first three tests. In test 4 it does however have the second best result; this was the query with many Swedish words and some compounds. In test 5 it has the second worst result. The compound splitting had the issue that it recognized some names of technical skills as Swedish compounds. The splitting of these words did, understandably, have a negative impact.

Another reason can be that in order for compound splitting to provide a benefit, it should be used in both index and query. This was not done in these tests, so a suggestion for future work is to study compound splitting in both index and query.

9.6 Should an IR System for Résumés be Used?

The IR system used was able to find the relevant résumés in the tests. Using an IR system for the purpose of finding relevant résumés would present the user with documents where most or all of the relevant documents were present. There is however the question of how many documents the user will look through and how many irrelevant documents are acceptable in the result. In many of the tests all documents in the index were retrieved, and presenting all of the retrieved documents would not be a solution that performs well when the collection of documents is bigger. If a limit on the score was set to filter out the least relevant documents for the query, it would reduce the number of documents, but it would also affect the recall in a negative way. As mentioned in section 2.4, people are not inclined to look far down the list of results when using a search engine, but this is a different situation. People have the task of finding competent personnel, and it can be expected that more time can be put into the actual search process than in an ordinary information retrieval situation.
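The trade-off described above, where a score limit shortens the result list but can lower recall, can be sketched as follows. The scores, document identifiers and threshold values are invented for illustration, and the function name is an assumption. No score threshold was actually used in the tests.

```python
def evaluate_with_threshold(scored_docs, relevant_ids, threshold):
    """Precision and recall when only results scoring >= threshold are shown.

    scored_docs:  list of (document id, retrieval score) pairs.
    relevant_ids: set of document ids judged relevant.
    """
    shown = {doc_id for doc_id, score in scored_docs if score >= threshold}
    hits = len(shown & relevant_ids)
    precision = hits / len(shown) if shown else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Invented result list: five résumés, three of them relevant.
scored_docs = [("a", 0.9), ("b", 0.7), ("c", 0.5), ("d", 0.3), ("e", 0.1)]
relevant_ids = {"a", "c", "d"}

no_limit = evaluate_with_threshold(scored_docs, relevant_ids, 0.0)
with_limit = evaluate_with_threshold(scored_docs, relevant_ids, 0.4)

print(no_limit)    # precision 0.6, recall 1.0
print(with_limit)  # precision 2/3, recall 2/3
```

In this made-up case the threshold raises precision slightly but hides one relevant résumé, which is the kind of loss that may be unacceptable when the goal is to find every suitable consultant.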

In the tests the precision, when half of the relevant documents had been found, was above 0.8 for most of the settings, which is a high value, and the number of irrelevant documents retrieved at this point should be acceptable. The question of whether using an IR system for the purpose of finding relevant résumés is a good idea depends on what the expectations of it are and how it will be used. If the IR system is used at an early stage to do an initial filtering, it might be a good idea, especially if it is used by people like those in the interviews, who often start by thinking of people they know that would fit. The process of looking through


the results would both confirm that the people they initially thought of seem relevant and remind them of people who were not at the top of their minds. If you on the other hand were to rely on an IR system and choose the top ten results as the most relevant résumés, it would be a bad idea, since the ranking ability was not great, especially among the top results.

9.7 Ethics and Sustainability

As mentioned in chapter 3, there are systems that are developed with the purpose of making it easier to find the right people for a certain task. An IR system may not give as good results as specially tailored software, but if the alternative is to think by oneself and ask colleagues, an IR system could be a good complement. It might help to highlight people who would not have been considered otherwise, and it could also make the recruitment process fairer. An additional advantage is that an information retrieval system has no bias and will consider the résumés based on their contents. The IR system will not care about the name or picture on the résumé and cannot have any personal feelings that might affect the outcome, as a human being could have. Using any type of system to retrieve the most relevant résumés could be a tool for a more unbiased and fair recruitment process. From an ethical perspective it can be seen as positive to rely on a system designed to present the most relevant people, but this assumes that the creation of the program, and the decisions on what should be favored by the program, are fair. A program with this functionality could be designed to be biased, like the people designing it; it could for example sort out people with a certain background. If the program is designed to be fair and unbiased, it would make the recruitment process fairer.

Other than the potentially more unbiased recruitment process, the usage of such a system could make recruitment processes more effective. Making processes more effective is important for all companies. Such a program would be especially important for companies where recruitment is a big part of the business, and could contribute economic benefits. The usage of such a program would not have any larger impact on environmental or social sustainability than the systems that most companies use today.


Chapter 10

Conclusion

The aim of this report was to answer the question "To what extent can an information retrieval system find and rank résumés relevant to a certain task or job, and how does the use of morphological analysis, such as lemmatization, affect the performance of that system?". This was done by breaking the question down into smaller sub-questions. An information retrieval system using lemmatization and compound splitting was built using Lucene and Stava. The tests showed that using lemmatization can have a positive effect in terms of precision and ranking ability, especially in cases where the words used in a query are Swedish words such as "kravanalys" or "projektledare" rather than names of technical skills such as "Java" or "SQL", but only if a suitable threshold value for what score is acceptable for a result can be found and set. Using compound splitting had a negative effect. This was due to the fact that the résumés contained a lot of words that were not Swedish words but names of skills, which were either not recognized by Stava or recognized as Swedish words with no connection to the skill. It could also be due to the fact that compound splitting was not used in the index in any of the settings tested.

When assigning relevance to a résumé with respect to a certain job or task, skills, experience, previous projects, titles and previous companies should be considered important.

Whether using an information retrieval system is a good idea, and how well it can find and rank résumés, depends on how the user uses it. Given the results of this thesis, it seems that it can be helpful for providing an initial filtering, but only if you are prepared to look through many of the results, since it finds the highly relevant results relatively early in the list. It is a bad idea to use it and trust the ranking ability, especially among the top results, since irrelevant documents could get a top spot.

10.1 Possible Problems and Suggested Improvements

In hindsight there are a couple of improvements that could have been made that could have affected the results in a positive way. One of the main issues was the


compound splitting, which harmed the performance. This was due to the fact that it recognized some technical words as Swedish compound words and then split them. The queries then consisted of words that had no relevance to the initial term. This could have been solved by adding these specific words to the lists of computer terms. This issue is presumably the same for résumés regardless of their profile, since there are terms in all areas that would not be recognized or could be recognized as the wrong word. This recognition of a word as the wrong one is one of the problems mentioned by [17] in section 2.6, called search key ambiguity. In this case it is however not a Swedish word that has two different meanings, but the name of a technical skill that is recognized as a Swedish word.
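The problem and the suggested remedy can be illustrated with a toy splitter. This sketch is not how Stava works: the lexicon, the list of protected terms and the simple two-part splitting rule are assumptions made for the example, and "Matlab" is an invented illustration of a skill name that happens to consist of two Swedish words ("mat" and "lab"), not a case reported in the tests.

```python
def split_compound(word, lexicon, protected=frozenset()):
    """Split a word into two known parts, unless it is a protected term.

    Returns a list with the two parts, or the lowercased word itself
    if it is protected or no split into two lexicon words exists.
    """
    lowered = word.lower()
    if lowered in protected:
        return [lowered]
    # Try every split point that leaves at least two letters on each side.
    for i in range(2, len(lowered) - 1):
        head, tail = lowered[:i], lowered[i:]
        if head in lexicon and tail in lexicon:
            return [head, tail]
    return [lowered]


# Tiny hypothetical Swedish lexicon.
lexicon = {"krav", "analys", "projekt", "ledare", "mat", "lab"}

# Hypothetical list of computer terms that must never be split.
computer_terms = frozenset({"matlab"})

print(split_compound("kravanalys", lexicon))              # ['krav', 'analys']
print(split_compound("Matlab", lexicon))                  # ['mat', 'lab']
print(split_compound("Matlab", lexicon, computer_terms))  # ['matlab']
```

Without the protected list, a query for the skill turns into a search for two unrelated Swedish words, which is the kind of search key ambiguity described above.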

The relevance judgements in this thesis were made by only one person. If the resources had been available, it would have been beneficial if the people interviewed could have done the relevance judgements themselves, with each pair evaluated by more than one person. In that way any human error could have been reduced and the results would have been more reliable.

10.2 Future Work

To further explore this area, it could be of interest to test using more documents in the index and to use résumés from different areas of work. Since this study only used résumés from technical consultants, there were a lot of relevant résumés for each test in the index. By using different types of résumés, the ability to sort out completely irrelevant documents could be studied further.

If similar tests are to be done, the suggestion is to ask experts on the subject to perform the relevance judgements, and to have multiple people judge the relevance of each document to reduce errors. In that way the results will be more reliable.

It could also be interesting to use the feature of assigning different weights to query terms and to certain paragraphs in the résumé document, since résumés have distinct sections that cover different topics. Any type of boost to terms could also be investigated. In the pre-tests of this study no positive effects were detected, but that could be due to the fact that it was a very small test.


Bibliography

[1] Apache Lucene - Apache Lucene Core. https://lucene.apache.org/core/. (Accessed on 05/04/2016).

[2] Apache Lucene - Scoring. https://lucene.apache.org/core/3_5_0/scoring.html#ScoreBoosting. (Accessed on 05/04/2016).

[3] MultiFieldQueryParser (Lucene 5.4.1 API). https://lucene.apache.org/core/5_4_1/queryparser/org/apache/lucene/queryparser/classic/MultiFieldQueryParser.html. (Accessed on 05/04/2016).

[4] Query (Lucene 6.0.0 API). https://lucene.apache.org/core/6_0_0/core/org/apache/lucene/search/Query.html. (Accessed on 05/18/2016).

[5] Svenska Akademiens ordlista (SAOL) | Svenska Akademien. http://www.svenskaakademien.se/svenska-spraket/svenska-akademiens-ordlista-saol. (Accessed on 05/04/2016).

[6] SwedishAnalyzer (Lucene 6.0.0 API). https://lucene.apache.org/core/6_0_0/analyzers-common/org/apache/lucene/analysis/sv/SwedishAnalyzer.html. (Accessed on 05/04/2016).

[7] TFIDFSimilarity (Lucene 6.0.0 API). https://lucene.apache.org/core/6_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html. (Accessed on 05/04/2016).

[8] Text REtrieval Conference (TREC) English documents introduction. http://trec.nist.gov/data/intro_eng.html, 03 2004. (Accessed on 06/21/2016).

[9] Michael Bendersky, Donald Metzler, and W. Bruce Croft. Effective query formulation with multiple information sources. In Proceedings of the fifth ACM international conference on Web search and data mining, pages 443–452. ACM, 2012.

[10] Martin Braschler and Bärbel Ripplinger. How effective is stemming and decompounding for German text retrieval? Information Retrieval, 7(3-4):291–316, 2004.


[11] Luis Adrián Cabrera-Diego, Barthélémy Durette, Matthieu Lafon, Juan-Manuel Torres-Moreno, and Marc El-Bèze. How can we measure the similarity between résumés of selected candidates for a job? In Proceedings of the International Conference on Data Mining (DMIN), page 99. The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp), 2015.

[12] Johan Carlberger, Hercules Dalianis, Martin Hassel, Ola Knutsson, et al. Improving precision in information retrieval for Swedish using stemming. In the Proceedings of NODALIDA, pages 21–22, 2001.

[13] Simona Colucci, Tommaso Di Noia, Eugenio Di Sciascio, Francesco M. Donini, Marina Mongiello, and Marco Mottola. A formal approach to ontology-based semantic match of skills descriptions. J. UCS, 9(12):1437–1454, 2003.

[14] Simona Colucci, Tommaso Di Noia, Eugenio Di Sciascio, Francesco M. Donini, Marina Mongiello, and Giacomo Piscitelli. Semantic-based approach to task assignment of individual profiles. J. UCS, 10(6):723–730, 2004.

[15] Viggo Kann, Rickard Domeij, Joachim Hollman, and Mikael Tillenius. Implementation aspects and applications of a spelling correction algorithm. 1998.

[16] Laura A. Granka, Thorsten Joachims, and Geri Gay. Eye-tracking analysis of user behavior in WWW search. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 478–479. ACM, 2004.

[17] Turid Hedlund, Ari Pirkola, and Kalervo Järvelin. Aspects of Swedish morphology and semantics from the perspective of mono- and cross-language information retrieval. Information Processing & Management, 37(1):147–161, 2001.

[18] Kevin L. Hutchinson. Personnel administrators’ preferences for resume content: A survey and review of empirically based conclusions. Journal of Business Communication, 21(4):5–14, 1984.

[19] Kalervo Järvelin and Jaana Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pages 41–48. ACM, 2000.

[20] Dan Jurafsky and James H. Martin. Speech and language processing. Pearson Education International, second edition, 2000.

[21] Viggo Kann. KTHs morfologiska och lexikografiska verktyg och resurser. LexicoNordica, 17(17), 2010.


[22] Jaana Kekäläinen and Kalervo Järvelin. Using graded relevance assessments in IR evaluation. Journal of the American Society for Information Science and Technology, 53(13):1120–1129, 2002.

[23] Heikki Keskustalo, Kalervo Järvelin, Ari Pirkola, Tarun Sharma, and Marianne Lykke. Test collection-based IR evaluation needs extension toward sessions – a case of extremely short queries. In Information Retrieval Technology, pages 63–74. Springer, 2009.

[24] Rémy Kessler, Nicolas Béchet, Mathieu Roche, Marc El-Bèze, and Juan Torres-Moreno. Automatic profiling system for ranking candidates answers in human resources. In On the Move to Meaningful Internet Systems: OTM 2008 Workshops, pages 625–634. Springer, 2008.

[25] Tuomo Korenius, Jorma Laurikkala, Kalervo Järvelin, and Martti Juhola. Stemming and lemmatization in the clustering of Finnish text documents. In Proceedings of the thirteenth ACM international conference on Information and knowledge management, pages 625–633. ACM, 2004.

[26] Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, et al. Introduction to information retrieval, volume 1. Cambridge University Press, Cambridge, 2008.

[27] Associated Press. Google reveals more searches are made from mobiles than PCs for the first time | Daily Mail Online. http://www.dailymail.co.uk/sciencetech/article-3069322/Google-reveals-searches-mobile-devices-PCs-time.html, May 2015. (Accessed on 05/18/2016).

[28] Rong Tang, William M. Shaw Jr., and Jack L. Vevea. Towards the identification of the optimal number of relevance categories. Journal of the Association for Information Science and Technology, 50(3):254, 1999.

[29] Peg Thoms, Rosemary McMasters, Melissa R. Roberts, and Douglas A. Dombkowski. Resume characteristics as predictors of an invitation to interview. Journal of Business and Psychology, 13(3):339–356, 1999.


Appendix A

Precision/Recall-Graphs

Figure A.1. Precision/Recall-curve for Standard-Standard setting.

Figure A.2. Precision/Recall-curve for Standard-Morpho setting.


Figure A.3. Precision/Recall-curve for Morpho-Morpho setting.

Figure A.4. Precision/Recall-curve for Morpho-Standard setting.


Figure A.5. Precision/Recall-curve for Morpho-Lemma setting.

Figure A.6. Precision/Recall-curve for Morpho-Split setting.


Appendix B

Interview Questions

1. When you are to find a consultant for a request, how do you do it?

2. Do you use SearchLight?

If they use SearchLight:

1. How often does it happen that you find more than one potential résumé?

2. If you were to compare how suitable a consultant is to a certain project, based on their résumé, what would you look at?

3. Does it happen that you do not find a consultant after an initial search?

4. What do you do in those cases?

5. How do you choose your search terms?

6. In a scenario where there would have been 50 available consultants, do you think your method would have been any different? If so, in what way?

If they do not use SearchLight:

1. If you were to compare how suitable a consultant is to a certain project, based on their résumé, what would you look at?

2. In a scenario where there would have been 50 available consultants, do you think your method would have been any different? If so, in what way?
