
WHITE PAPER

A HYBRID RECOMMENDER SYSTEM FOR CITATION RECOMMENDATION

Written by

Snigdha Moitra, Project Manager, EXL Analytics

Swati Jain, Ph.D., Vice President, EXL Analytics

Sudipto Chaudhuri, Senior Engagement Manager, EXL Analytics

Dhiraj Singh, Senior Consultant, EXL Analytics

Aug 02, 2020

EXLSERVICE.COM 2

1. INTRODUCTION

Recommender Systems, which suggest items of potential interest to customers, have attracted a growing amount of attention. They have been successfully applied in numerous fields such as retail, e-commerce, movies, music, e-learning, and mobile services. While recommender systems for many areas are in various stages of development, limited work has been done in the field of citation recommendation for text documents.

Citation recommendation is a significant research area because it addresses information overload in academia and business by automatically suggesting relevant references for text documents. To help firms or researchers understand a document, relevant and important previous work must be cited as references for that document.

However, one or more cross-reference fields in a document frequently do not display the correct content, which prevents useful citations from being generated. This problem can be addressed by evaluating textual and topological similarity measures between citing documents, along with pairwise comparisons of their contents.

Currently, most search engines rely primarily on keyword-based search to recommend similar text content. This paper introduces a hybrid recommender system as an alternative to keyword-based search.

Three recommender algorithms, namely Content-Based Filtering, Market Basket Analysis, and Item-Based Collaborative Filtering, have been used [1]. Since an algorithm that works efficiently and accurately on one dataset may perform differently on others, a hybrid recommender system can provide robust recommendations by combining these techniques in situations where they may not work well in isolation.


2. OBJECTIVE AND DATA

This analysis aims at creating a recommender engine for text documents hosted on the web portal or digital repository of an organization. The data corpus consisted of legal cases [2] along with their citations. Every time a user accessed a particular legal document (parent case), the objective of the algorithm was to rank the documents cited by this parent document (citations) in order of their relevance to the parent document, thereby providing personalized suggestions.

For a robust analysis, only parent cases with at least five citations were considered. A total of 14K records were taken, covering approximately 2,000 parent cases. Note that a parent case ‘A’ can also be a citation for a different parent case ‘B’, and therefore the same case may appear in the data both as a parent and as a citation document. These documents contained details of legal case type (bankruptcy, tax deficiency, and others), time period, court type (federal, tax, supreme), as well as opinions, arguments used by the court, and other variables.

A sample of a parent case and a corresponding citation is shown below.

| Parent Case ID | Citation Case ID | Parent Case Description | Citation Case Description |
|---|---|---|---|
| 12475 | 50588 | Description (text data) | Description (text data) |

3. APPROACH

This analysis used a hybrid of three different recommender algorithms to rank the citations corresponding to any parent case. These algorithms are briefly explained in this section.

• Content Based Approach

• Collaborative Filtering Approach

• Market Basket Analysis


3. A. Content Based Approach

Content-Based Filtering, also referred to as Cognitive Filtering, compares the content of two documents and provides a similarity score.

The content of each document is represented as a set of descriptors or terms, typically the words that occur in the document. Similarity can be calculated using multiple approaches. A combination of two such algorithms, Cosine Similarity and Fuzzy Similarity, has been used in the hybrid model [3]. Similarity between each parent and citation document pair was calculated based on their text content: the higher the combined similarity score between a pair, the higher the relevance of the citation case to the parent case. Therefore, a citation document with a higher similarity score with the parent case was assigned a higher rank by the algorithm.

3. A.I. COSINE SIMILARITY

Cosine Similarity evaluates the similarity between two vectors based on the angle between them. The smaller the angle, the greater the similarity. A commonly used approach to match similar documents is to count the number of words the documents have in common. However, as the size of a document increases, the number of common words also tends to increase, even if the documents talk about different topics. Cosine Similarity helps to overcome this fundamental flaw in a ‘count-the-common-words’ or Euclidean distance approach and neutralizes the impact of document length [4]. Even if two similar documents are far apart by Euclidean distance because of their size, they can still have a small angle between them, and therefore a high similarity. Cosine Similarity thus determines how similar two documents are irrespective of their size.

Mathematically, the measurement is the cosine of the angle between two vectors projected in a multi-dimensional space. In this context, the two vectors are arrays containing the word counts of two documents. If the angle is zero, their similarity is one; the larger the angle, the smaller the similarity. The measurement is independent of vector magnitude (the two vectors can have different magnitudes, although they must share the same dimensions), which makes it a commonly used measure for high-dimensional spaces.

COSINE SIMILARITY CALCULATION FOR TWO VECTORS A AND B:

\[
\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
\]


To apply Cosine Similarity, sentences must first be converted into vectors. One way to do that is to use a bag-of-words representation with either TF (term frequency) or TF-IDF (term frequency-inverse document frequency). The choice of TF or TF-IDF depends on the application and does not affect how Cosine Similarity itself is computed. TF is good for text similarity in general, but TF-IDF is good for search query relevance.

Here is an example calculating Cosine Similarity for two sentences:

• Sentence I: Federal income tax has regulated subsidiaries

• Sentence II: State income tax is compulsory

STEP 1 - We will calculate Term Frequency using bag-of-words:

Term Frequency of the two sentences

| SENTENCE | FEDERAL | INCOME | TAX | HAS | REGULATED | SUBSIDIARIES | STATE | IS | COMPULSORY |
|---|---|---|---|---|---|---|---|---|---|
| I | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| II | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |

STEP 2 - The main issue with the term frequency counts shown above is that they favor documents or sentences that are longer. One way to solve this issue is to normalize the term frequencies by their respective magnitudes, or L2 norms. Summing the squares of each frequency and taking the square root, the L2 norm of Sentence I is √6 ≈ 2.45, and that of Sentence II is √5 ≈ 2.24. Dividing the above term frequencies by these norms produces the following results:

Normalization of term frequencies using L2 Norms

| SENTENCE | FEDERAL | INCOME | TAX | HAS | REGULATED | SUBSIDIARIES | STATE | IS | COMPULSORY |
|---|---|---|---|---|---|---|---|---|---|
| I | 0.41 | 0.41 | 0.41 | 0.41 | 0.41 | 0.41 | 0 | 0 | 0 |
| II | 0 | 0.45 | 0.45 | 0 | 0 | 0 | 0.45 | 0.45 | 0.45 |

STEP 3 - As the two vectors have already been normalized to have a length of 1, the Cosine Similarity can be calculated with a dot product. Only the terms shared by both sentences (INCOME and TAX) contribute:

COSINE SIMILARITY = INCOME + TAX = (0.41 × 0.45) + (0.41 × 0.45) = 0.37

Therefore, the Cosine Similarity of the two sentences is 0.37.
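The three steps of this worked example can be reproduced with a minimal Python sketch. It uses only the standard library rather than the sklearn implementation mentioned in the footnotes, and it assumes simple lowercase whitespace tokenization; the function name `cosine_similarity` is illustrative and not taken from the original analysis.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of two texts using raw term-frequency vectors."""
    # Step 1: bag-of-words term frequencies (assumed tokenization: lowercase + split).
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())

    # Step 2: L2 norm of each term-frequency vector.
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction, so treat it as dissimilar

    # Step 3: dot product over shared terms, divided by the product of the norms.
    shared_terms = tf_a.keys() & tf_b.keys()
    dot_product = sum(tf_a[term] * tf_b[term] for term in shared_terms)
    return dot_product / (norm_a * norm_b)


score = cosine_similarity(
    "Federal income tax has regulated subsidiaries",
    "State income tax is compulsory",
)
print(round(score, 2))  # 0.37
```

Dividing the dot product by the norms at the end is equivalent to normalizing each vector first, as in Step 2, and then taking the dot product.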


3. B. Collaborative Filtering Approach

As opposed to Content-Based Filtering, Collaborative Filtering provides recommendations by examining user-item relations and building a user-user or item-item co-occurrence matrix.

The recommendations are based on users’ past behavior. There are two categories of memory-based [5] Collaborative Filtering: user-based and item-based.

User-based Collaborative Filtering analyzes the behavior of users and predicts what a user will like based on that user’s similarity to other users. However, this method is not efficient in cases of data sparsity (a large number of items relative to users) or when user profiles change.

Item-based Collaborative Filtering looks for items that are similar to the items a user has already referred to and recommends the most similar ones. This method is more stable than user-based Collaborative Filtering, because the average item has many more co-occurring items than co-occurring users. In this analysis, parent documents and citation documents are regarded as users and items respectively. Document pairs are used to create user-item or item-item matrices, and the similarity between documents is computed to recommend relevant citations for parent cases. Such an approach can also score and recommend new citations that are not currently cited by the parent document.

3. A.II. FUZZY SIMILARITY

In computer science, fuzzy string matching (also known as approximate string matching) is a technique for finding strings that match a pattern approximately, rather than exactly. In other words, fuzzy string matching is a type of search that will find matches even when users misspell words or enter only partial words.

The fuzzy string matching process uses the Levenshtein distance to calculate the differences between sequences. The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.

Fuzzy string searches can be used in various applications, such as:

• Search queries that return results even if the user input contains additional or missing characters, or other types of spelling errors. For example, if a user types “Venizvela” into a search engine, a list of hits is returned along with “showing results for Venezuela”.

• Software that checks for duplicate records. For example, a customer is sometimes listed multiple times with different purchases in a database because of different spellings of their name (e.g. Judy Martin vs. Judy Martinez), a new address, or a mistakenly entered phone number.
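The Levenshtein distance behind fuzzy matching can be sketched with a standard dynamic-programming routine. This is a minimal standard-library illustration, not the scoring used by the fuzzywuzzy library mentioned in the footnotes; the helper `fuzzy_similarity`, which rescales the distance to a 0 to 1 score, is an assumption made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from the first i characters of a to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def fuzzy_similarity(a: str, b: str) -> float:
    """Hypothetical 0-1 score: 1 minus the distance scaled by the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("Venizvela", "Venezuela"))  # 2 (two substitutions)
print(levenshtein("Judy Martin", "Judy Martinez"))  # 2 (two insertions)
```

A small distance relative to string length is what lets a search engine treat “Venizvela” as a likely misspelling of “Venezuela”.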


3. B.I. ITEM-BASED COLLABORATIVE FILTERING

Item-based Collaborative Filtering is an algorithm for making recommendations in which similarities between items in a dataset are calculated using one of a number of similarity measures. These similarity values are then used to predict user-item pairs not present in the dataset. The following example shows how Item-Based Collaborative Filtering works:

STEP 1 - A user-item historical matrix is created to generate item recommendations for a user based on the items they have previously bought (documents cited). For simplicity, in this example only one criterion was used to create the matrix for determining similarity: whether the user purchased the item or not [6].

Sample User-Item Matrix (X marks a purchased item; the column positions of the marks are not preserved)

USER   ITEM 1   ITEM 2   ITEM 3   ITEM 4   ITEM 5   ITEM 6
A      X X X
B      X X
C      X X
D      X X X
E      X X
F      X X X X
G      X X
H      X
I      X

STEP 2 - Given the history matrix, a co-occurrence matrix covering all possible pairs of items is created. Item-item similarity is calculated using Jaccard Similarity, which is defined as the size of the intersection divided by the size of the union of two sets. The value lies between 0 and 1. A value of 1 means the items are perfectly similar; 0 means they are not similar. In this analysis, the Jaccard Similarity between two items (or citation documents) was calculated by taking common users (or parent documents) as a share of total users across both items.

Sample Item-Item Matrix

| | ITEM 1 | ITEM 2 | ITEM 3 | ITEM 4 | ITEM 5 | ITEM 6 |
|---|---|---|---|---|---|---|
| ITEM 1 | | 0.67 | 0.4 | 0.32 | 0.26 | 0 |
| ITEM 2 | 0.67 | | 0.45 | 0.35 | 0.29 | 0 |
| ITEM 3 | 0.4 | 0.45 | | 0.32 | 0.52 | 0 |
| ITEM 4 | 0.32 | 0.35 | 0.32 | | 0.82 | 0 |
| ITEM 5 | 0.26 | 0.29 | 0.52 | 0.82 | | 0 |
| ITEM 6 | 0 | 0 | 0 | 0 | 0 | |

For instance, if there are 100 parent cases for citation 1, 250 parent cases for citation 2, and 50 common parent cases between citation 1 and citation 2:

\[
\text{Jaccard Similarity} = \frac{\text{No. of common parent cases between citation 1 and citation 2}}{\text{Total no. of parent cases across both citations}} = \frac{50}{100 + 250 - 50} = 0.1667
\]


STEP 3 - To generate recommendations for a parent case, all documents already cited by that parent case must first be identified. The co-occurrence matrix is then used to recommend the items most similar to those already cited. In this example, the top recommendation for user B would be Item 3, because user B already cited Items 1 and 2, and among the items B has not yet cited, Item 3 is the most similar to both. For user C, Item 4 is the top recommendation, followed by Item 2.
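The three steps of item-based Collaborative Filtering can be sketched in Python as follows. The paper's own collaborative filtering code is not published, so this is only an illustration under stated assumptions: the `history` data is hypothetical (it is not the sample matrix above), and scoring a candidate by summing its Jaccard similarities to the already-cited items is one simple aggregation choice among several.

```python
from itertools import combinations

# Step 1: hypothetical history, parent case (user) -> cited documents (items).
history = {
    "A": {"doc1", "doc2", "doc3"},
    "B": {"doc1", "doc2"},
    "C": {"doc2", "doc3"},
    "D": {"doc3", "doc4"},
}


def jaccard(users_x: set, users_y: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = users_x | users_y
    return len(users_x & users_y) / len(union) if union else 0.0


def item_similarities(history: dict) -> dict:
    """Step 2: Jaccard similarity for every pair of items."""
    item_users = {}
    for user, items in history.items():
        for item in items:
            item_users.setdefault(item, set()).add(user)
    sims = {}
    for x, y in combinations(sorted(item_users), 2):
        sims[(x, y)] = sims[(y, x)] = jaccard(item_users[x], item_users[y])
    return sims


def recommend(user: str, history: dict, sims: dict, top_n: int = 3) -> list:
    """Step 3: rank items the user has not cited by similarity to cited items."""
    cited = history[user]
    all_items = set().union(*history.values())
    scores = {
        candidate: sum(sims.get((candidate, item), 0.0) for item in cited)
        for candidate in all_items - cited
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))[:top_n]


sims = item_similarities(history)
print(recommend("B", history, sims))  # [('doc3', 0.75), ('doc4', 0.0)]

# The 100 / 250 / 50 parent-case example from Step 2:
print(round(jaccard(set(range(100)), set(range(50, 300))), 4))  # 0.1667
```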

3. C. MARKET BASKET APPROACH

Market Basket Analysis (also called Product Association Analysis) is a technique widely used by marketers and retailers to identify the best possible combinations of products or services that customers buy together.

It is one of the key techniques used to uncover associations between items, and it works by looking for combinations of items that occur together frequently in transactions.

Association Rules are widely used to analyze retail basket or transaction data, where constraints on various measures of significance and interest are used. The best-known constraints are minimum thresholds on support and confidence.

• Support is an indication of how frequently an item set appears in the dataset. The support of an item set X with respect to a set of transactions T is defined as the proportion of transactions in T that contain X.

• The confidence of a rule X → Y is an indication of the conditional probability of Y being purchased given that X is already purchased.

For example, consider the association Wine → Cheese [Support - 9%, Confidence - 65%]. Few customers have this item set in their basket (only 9% of total transactions contain both wine and cheese). However, a customer who buys wine is fairly likely to buy cheese as well (65% of transactions that contain wine also contain cheese). This means that if there are any existing offers on cheese products, wine buyers should be informed, since they are the best potential buyers.

In this analysis, the Market Basket Analysis technique was used to identify the best possible combinations of parent and citation cases [7]. Individual parent and citation cases were considered as products, while combinations of cases occurring together were considered as transactions or baskets.

Association rules between cases were generated. Support was measured as the percentage of transactions containing both the parent case and the citation, while confidence was measured as the percentage of times that a citation occurred given that the parent case also occurred. The more frequently two documents were referenced together, the higher the support and confidence for them to be associated or similar.
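Support and confidence for a single association rule can be computed directly from a list of transactions. The sketch below is a standard-library illustration rather than the MLxtend-based implementation mentioned in the footnotes; the `baskets` data and the function names are hypothetical and chosen only to show the arithmetic.

```python
def support(transactions: list, itemset: set) -> float:
    """Proportion of transactions that contain every item in `itemset`."""
    if not transactions:
        return 0.0
    matches = sum(1 for basket in transactions if itemset <= basket)
    return matches / len(transactions)


def confidence(transactions: list, antecedent: set, consequent: set) -> float:
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        return 0.0  # the rule never fires, so there is nothing to estimate
    return support(transactions, antecedent | consequent) / antecedent_support


# Hypothetical baskets; in the citation setting each basket would instead hold
# the cases that occur together.
baskets = [
    {"wine", "cheese"},
    {"wine", "bread"},
    {"cheese", "bread"},
    {"wine", "cheese", "bread"},
    {"bread"},
]

rule_support = support(baskets, {"wine", "cheese"})
rule_confidence = confidence(baskets, {"wine"}, {"cheese"})
print(f"Wine -> Cheese: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Here wine and cheese appear together in 2 of 5 baskets, and 2 of the 3 baskets containing wine also contain cheese. Libraries that mine association rules apply the same two definitions to every frequent item set above the minimum thresholds.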


3. D. HYBRID MODEL

Each concept explained in this section has its own set of disadvantages, which limits its suitability for generating recommendations. For example, text-based analysis has to cope with unclear nomenclature, synonyms, and words whose meaning depends on context. Thus, the recommender system cannot identify related citations if different terms are used [8].

Collaborative Filtering is less effective in domains where there are more items than users. Another drawback is that a critical mass of cases and citations is required to receive useful recommendations.

Given that each individual method of generating recommendations has its own merits and drawbacks, this analysis used a combination of the three methods explained above: Content Based (Cosine and Fuzzy Similarity), Item-Based Collaborative Filtering, and Market Basket Analysis [9].

A hybrid recommender system overcomes the challenges of the individual methods, ensuring that no single approach governs the results. A combination of recommendations is more reliable and effective. For each parent case, citation documents were scored using each of the recommendation systems. A weighted average of the similarity scores generated by all three approaches gave the final score, based on which the citations were ranked.

The Cosine score was given the highest weightage (50%) because the cases consisted of textual documents, and Cosine Similarity provides a direct (more robust) relationship between parent cases and citations based on their content. A citation with a higher aggregate score was considered more similar to, or more strongly associated with, its parent case and was therefore ranked higher. In the case of an equal aggregate score, citations were sorted by Cosine Similarity.
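The weighted aggregation and tie-breaking rule can be sketched as follows. Only the 50% Cosine weight is stated in the text; the remaining weights (Fuzzy 0.1, Collaborative Filtering 0.2, Market Basket 0.2) are an assumption inferred from the Sample Output Data in the results section, whose aggregate scores they reproduce. The function and key names are illustrative.

```python
# Assumed weights: 0.5 for Cosine is stated in the text; the other three are
# inferred from the sample output table and are not confirmed by the authors.
WEIGHTS = {"cosine": 0.5, "fuzzy": 0.1, "collaborative": 0.2, "market_basket": 0.2}


def aggregate_score(scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of the individual similarity scores."""
    return sum(weight * scores[name] for name, weight in weights.items())


def rank_citations(candidates: dict, weights: dict = WEIGHTS) -> list:
    """Sort citations by aggregate score, breaking ties on the Cosine score."""
    scored = [(cid, aggregate_score(s, weights)) for cid, s in candidates.items()]
    return sorted(
        scored,
        key=lambda pair: (-pair[1], -candidates[pair[0]]["cosine"]),
    )


# Scores for parent case 12475, taken from the Sample Output Data.
candidates = {
    53888: {"cosine": 0.85, "fuzzy": 0.54, "collaborative": 0.40, "market_basket": 0.50},
    49180: {"cosine": 0.85, "fuzzy": 0.74, "collaborative": 0.85, "market_basket": 0.50},
    50588: {"cosine": 0.93, "fuzzy": 0.65, "collaborative": 0.85, "market_basket": 0.50},
    56417: {"cosine": 0.88, "fuzzy": 0.58, "collaborative": 0.85, "market_basket": 0.50},
    50559: {"cosine": 0.92, "fuzzy": 0.66, "collaborative": 0.85, "market_basket": 0.50},
}

ranked = rank_citations(candidates)
for rank, (citation_id, score) in enumerate(ranked, start=1):
    print(rank, citation_id, round(score, 3))
```

With these weights the sketch returns the five citations in the same order and with the same aggregate scores as the sample output (0.800, 0.796, 0.769, 0.768, 0.659).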


4. RESULTS

The sample case below shows how the final ranking was determined. Citations for each parent case were ranked on the basis of their similarity to the parent case.

As discussed in the approach, similarity scores were computed for parent-citation document pairs using each recommendation technique, and the weighted average score was used to rank the most relevant citations for each parent case.

Every time a search is made for a parent case on a given website, these rankings can be used to sort the relevant citation cases for the parent case, and the most relevant cases can accordingly be displayed on the website as recommendations for the case being searched. The hybrid approach ensures that the aggregate scores are robust and reliable, not giving undue weightage to any one method of classification.

Sample Input Data

| Parent Case ID | Citation Case ID | Parent Case Description | Citation Case Description |
|---|---|---|---|
| 12475 | 50588 | Description (text data) | Description (text data) |

The Recommender Engine takes the sample input data and produces the sample output data.

Sample Output Data

| Parent Case ID | Citation Case ID | Cosine Score | Fuzzy Score | Collaborative Filtering Score | Market Basket Analysis Score | Aggregate Score | Rank |
|---|---|---|---|---|---|---|---|
| 12475 | 50588 | 0.93 | 0.65 | 0.85 | 0.50 | 0.800 | 1 |
| 12475 | 50559 | 0.92 | 0.66 | 0.85 | 0.50 | 0.796 | 2 |
| 12475 | 49180 | 0.85 | 0.74 | 0.85 | 0.50 | 0.769 | 3 |
| 12475 | 56417 | 0.88 | 0.58 | 0.85 | 0.50 | 0.768 | 4 |
| 12475 | 53888 | 0.85 | 0.54 | 0.40 | 0.50 | 0.659 | 5 |


5. INDUSTRY APPLICATIONS AND WAY FORWARD

Companies across industries are beginning to implement recommendation systems in an attempt to enhance their customers’ purchasing experience, increase sales, and engage and retain customers.

These engines allow the collection of a large amount of information relating to users’ behavior, which can be stored systematically within user profiles for use in future interactions.

Besides improving the customer experience, the information gathered can also be used in conjunction with ad targeting tools, or to trigger email campaigns based on online interactions. In an ever-expanding digital world, as more products become available online, recommender engines are crucial to the future of e-commerce, not only because they help increase customer sales and interactions, but also because they help companies weed out their inventory and supply customers with products they really like. The type of recommendation algorithm that will work best in any given situation depends on the industry of application and the data type.

While Cosine Similarity is broadly used for text data (since it provides direct similarity between the content of documents), Collaborative Filtering is used for providing personalized recommendations of videos (YouTube, Netflix, etc.) and in e-commerce (Amazon, Flipkart). Traditional brick-and-mortar retail stores (Walmart, Costco) rely on the Market Basket approach to provide over-the-counter product recommendations at the time of checkout.

They also incorporate item profitability into their algorithms, so as to drive higher margins from a purchase by prioritizing products that are more profitable.

These algorithms are designed to capture customer preferences and reach customers across a range of channels including email, social media, off-site shopping widgets, mobile apps, and retail customer service centers. Over the past few years, deep learning [10] based recommender systems have also become popular in the industry, with more and more companies finding merit in their application.

More advanced forms of recommender systems include relevant predictive recommendations created by digging deeper into customers’ interests and preferences (for instance, based on users’ browsing history as well as demographics, and not just past purchases).

While this paper analyzed legal documents, the hybrid recommender system developed here can be applied to generate personalized recommendations of textual documents (articles, journals, books, etc.) belonging to any domain, as long as the data consists of pairs of parent documents and their citations or associated documents. The algorithm combines document content and the co-occurrence relations between these documents to generate recommendations.


Footnotes

[1] It is unusual to use Market Basket Analysis for recommending text documents. In our analysis, since pairs of parent and citation cases were available in the data to start with, we treated each pair as a transaction and each document in the pair as a product in that transaction, enabling us to mine association rules on our data (with the rationale that documents frequently cited together are similar).

[2] The term “case” has been used as a proxy for legal documents: each document in the corpus consisted of information related to legal cases.

[3] The Cosine Similarity measure is superior to alternatives like the Jaccard and Euclidean methods for documents of varying lengths or with repetitive words. Fuzzy string matching is also widely used in search engines to overcome the challenges of spelling errors and typos. Thus, a combination of both methods has been used to assess the content similarity of a parent document with its citation documents (content based approach).

[4] The Euclidean distance between two points is the length of the line segment connecting them.

[5] Based on the method of implementation, there are two types of collaborative filtering techniques: memory based and model based. Memory based CF utilizes the entire user-item data and uses statistical methods (like nearest neighbors) to search for a set of users who have a transaction history similar to that of the active user. Model based CF provides recommendations by developing a model from user ratings using ML techniques such as classification, clustering, and rule based approaches. Although model based approaches have better predictive power and handle sparsity and scalability better than memory based approaches, they require considerable resources, time, and memory to develop and may lose information in the process of dimensionality reduction. Memory based methods are easy to implement and accommodate new data. Keeping in mind our resource constraints, we leveraged the memory based item-item CF technique for our analysis.

[6] More complex systems could go into greater detail by using user profiles that represent their tastes, factoring in how highly they rate or like an item, weighing in how many times they purchased items similar to the potential recommended item, or making assumptions about their liking on the basis of whether they simply viewed an item, even though no purchase was made.

[7] Market basket analysis (association rules) and collaborative filtering answer fundamentally different questions. Collaborative Filtering answers the question “What items do users with interests similar to this user like?”, whereas association rules answer the question “What items frequently appear together?” The answer to the first question can be used to recommend content that a user has not seen previously but that has been purchased or rated by a group of other users with similar interests. The similarity of interests can be estimated from explicit indicators, for example, users having purchased the same products. Association rules can recommend products that a user is likely to purchase based on the set of products currently in their basket. Association rules are independent of personal preference profiles, and mining them requires a dataset of transactions from all users.

[8] Text embedding can be used to study the semantic similarity of text pairs and measure the degree of equivalence by studying the underlying context of documents. Since our data corpus consisted of legal cases whose language used technical legal terms, we chose cosine similarity so that documents containing the same legal terms were classified as more similar.

[9] The hybrid algorithm was designed in Python. For market basket analysis, we used the Python libraries MLxtend & Apriori to generate support, confidence, and association rules. For the content based technique (cosine similarity and fuzzy similarity) we used the libraries sklearn and fuzzywuzzy respectively. The source code for collaborative filtering was developed in-house for our purpose.

[10] Deep learning algorithms are very versatile and provide various advantages over other ranking and recommendation algorithms such as LambdaRank or matrix factorization, which has accelerated their adoption.


EXLSERVICE.COM

GLOBAL HEADQUARTERS
320 Park Avenue, 29th Floor
New York, New York 10022
T +1 212.277.7100 F +1 212.771.7111

United States • United Kingdom • Australia • Bulgaria • Colombia • Czech Republic • India • Philippines • Romania • South Africa

EXL (NASDAQ: EXLS) is a leading operations management and analytics company that designs and enables agile, customer-centric operating models to help businesses enhance revenue growth and profitability. Our delivery model provides market-leading business outcomes using EXL’s proprietary Digital EXLerator™ Framework, cutting-edge analytics, digital transformation and domain expertise. At EXL, we look deeper to help companies improve global operations, enhance data-driven insights, increase customer satisfaction, and manage risk and compliance. EXL serves the insurance, healthcare, banking and financial services, utilities, travel, transportation and logistics industries. Headquartered in New York, New York, EXL has more than 29,000 professionals in locations throughout the United States, Europe, Asia (primarily India and Philippines), Latin America, Australia and South Africa.

This article is not intended to constitute legal, compliance, regulatory, privacy or similar professional advice, and EXL does not provide services to clients in those areas. The reader is advised to engage experts to provide professional advice on any legal, compliance, regulatory or privacy topics covered herein. For more information, see www.exlservice.com/legal-disclaimer

© 2020 ExlService Holdings, Inc. All Rights Reserved.
