Author paper midterm

Author-Paper Identification Problem

Team:

Karthik Reddy Vakati

Nachammai C

Pooja Mishra

Guided by Prof. Duc Tran

Problem Statement

• Determine the correct author of a particular paper from the author dataset.

• Ambiguity in author names might cause a paper to be assigned to the wrong author, which leads to noisy author profiles.

• This KDD Cup task challenges participants to determine which papers in an author profile were truly written by a given author.

Type of data

Data provided by the KDD challenge is in CSV format.

• Paper (Id, Title, Year, ConferenceId, JournalId, Keywords)

• Author (Id, Name, Affiliation)

• Paper-Author (PaperId, AuthorId, Name, Affiliation)

• Conference (Id, ShortName, FullName, HomePage)

• Journal (Id, ShortName, FullName, HomePage)

• Train (AuthorId, ConfirmedPaperIds, DeletedPaperIds)

• Test (AuthorId, PaperIds)

• Validation (AuthorId, PaperIds, Usage)

Data Points

The data points include all papers written by an author and the author's affiliation (university, technical society, groups):

• Paper-Author (PaperId, AuthorId, Name, Affiliation)

The metadata includes the journals an author has published in and the conferences an author has attended:

• Paper (Id, Title, Year, ConferenceId, JournalId, Keywords)

• Author (Id, Name, Affiliation)

• Conference (Id, ShortName, FullName, HomePage)

• Journal (Id, ShortName, FullName, HomePage)

Issues with data

• The CSV files needed cleaning.

• A few records had attributes spilled over 3 rows.

• Some rows had more attributes than the required number of attributes.

• Special characters caused issues.

• We wrote a Perl script to clean and format the data.
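The cleaning steps above can be illustrated with a small sketch. This is not the Perl script mentioned in the slides (which is not shown); it is a Python approximation under stated assumptions: a record that spilled over several physical lines is rejoined until it has the expected number of fields, surplus fields are folded into the last column, and control characters are dropped. The function names `parse_fields` and `repair_rows` are hypothetical.

```python
import csv
import io
import re

# Control characters that commonly break CSV parsers (assumed to be the
# kind of "special characters" the slides refer to).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def parse_fields(text):
    """Parse one logical CSV record into a list of fields."""
    return next(csv.reader(io.StringIO(text)), [])


def repair_rows(lines, n_fields):
    """Yield rows with exactly n_fields fields.

    Physical lines are accumulated until the record is complete; any
    surplus fields are merged back into the last column.
    """
    buffer = ""
    for raw in lines:
        line = CONTROL_CHARS.sub("", raw.rstrip("\r\n"))
        if not line.strip() and not buffer:
            continue  # skip blank lines between records
        buffer = f"{buffer} {line}" if buffer else line
        fields = parse_fields(buffer)
        if len(fields) < n_fields:
            continue  # record is still incomplete: keep accumulating
        if len(fields) > n_fields:
            # Fold the extra fields into the last expected column.
            fields = fields[: n_fields - 1] + [",".join(fields[n_fields - 1:])]
        yield [field.strip() for field in fields]
        buffer = ""


# Made-up rows in the Author (Id, Name, Affiliation) layout.
sample = [
    "1,Alice Smith,MIT\n",
    "2,Bob\n", "Jones\n", ",Stanford\n",   # one record spilled over 3 rows
    "3,Carol,UCLA, Dept of CS\n",          # one attribute too many
    "4,Dan\x00,NYU\n",                     # stray control character
]
cleaned = list(repair_rows(sample, n_fields=3))
```

A limitation of this approach: a record that is genuinely short (a missing attribute rather than a spilled one) would absorb the following line, so real data would need an extra check such as requiring each record to start with a numeric Id.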

Issues with data-I

Issues with data-II

Predictions & Intuitions

Prediction: Given a paper and an author, one should be able to identify whether the given paper was written by the author.

Intuition: We initially identified this problem as a clustering problem. We chose clustering because the set of papers written by one author can be grouped together; then, for a given paper and author, we can check whether the paper belongs to the author's cluster.

The features PaperId, AuthorId, PaperTitle, and AuthorName play a significant role in the prediction.

Feature selection

We used the following features from the Train dataset while building the model:

• ConfirmedPaperIds

• DeletedPaperIds
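One way to use these two columns is to expand each Train row into labeled (author, paper) pairs. The sketch below is illustrative only: it assumes the paper ID lists are space-separated within a CSV cell, the helper name `labeled_pairs` is hypothetical, and the IDs are made up.

```python
import csv
import io


def labeled_pairs(train_csv):
    """Yield (author_id, paper_id, label) tuples from a Train-style CSV.

    label is 1 for a confirmed paper and 0 for a deleted paper.
    Assumes paper IDs inside a cell are separated by spaces.
    """
    for row in csv.DictReader(train_csv):
        author_id = int(row["AuthorId"])
        for paper_id in row["ConfirmedPaperIds"].split():
            yield author_id, int(paper_id), 1
        for paper_id in row["DeletedPaperIds"].split():
            yield author_id, int(paper_id), 0


# Made-up sample in the Train (AuthorId, ConfirmedPaperIds, DeletedPaperIds) layout.
sample = io.StringIO(
    "AuthorId,ConfirmedPaperIds,DeletedPaperIds\n"
    "10,101 102,103\n"
    "11,201,\n"
)
pairs = list(labeled_pairs(sample))
```

Each pair can then be joined with the Paper and Author tables to attach attributes such as the title or affiliation.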

Tools Used & Model Trained

Tools Used: Weka, R, Apache Mahout

Models Trained: Simple K-Means, J-48, ZeroR

K-means clustering using Weka: training the data

Visualization of k-means clustering result

Simple K-means clustering using R

Error in R for Clustering

> y=read.table("Paper_fixed.csv",header=TRUE,sep=',')

> y[1:10,]

> km3 <- kmeans(x,3)

Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

In addition: Warning message:

In kmeans(x, 3) : NAs introduced by coercion

Conclusion

Why does clustering not work for this problem?

• Handling a mixed set of attributes is an issue in R.

• Simple k-means clustering works by calculating distances from centroids, and thus needs numeric attributes. Hence clustering is not the best approach for our problem.

• To overcome this, we are trying to convert the data into numeric integer values so that numeric distance measures can be applied.

• However, this problem looks more like a classification problem: classifying whether a paper was written by an author.
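The conversion to numeric integer values mentioned above could look like the following sketch. It is an illustration under assumptions, not the team's actual transformation: each distinct string gets an integer code in order of first appearance, and the helper names `encode_column` and `euclidean` are hypothetical. It also shows why distances on such codes should be read with caution: the codes impose an arbitrary ordering on values that have none.

```python
import math


def encode_column(values):
    """Map each distinct string to an integer code (order of first appearance)."""
    codes = {}
    encoded = [codes.setdefault(value, len(codes)) for value in values]
    return encoded, codes


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up affiliation values for four papers published in the same year.
affiliations = ["MIT", "Stanford", "MIT", "UCLA"]
encoded, mapping = encode_column(affiliations)

# Hypothetical numeric rows: (Year, affiliation code).
rows = [[2009, code] for code in encoded]

d_mit_stanford = euclidean(rows[0], rows[1])
d_mit_ucla = euclidean(rows[0], rows[3])
```

Here UCLA ends up twice as far from MIT as Stanford does, purely because of the order in which the values were first seen. This is consistent with the conclusion above that a classifier is a better fit than distance-based clustering for these attributes.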

Moving on to Classification algorithms..

Tree J-48

Naïve Bayes

Results using Tree-J48 algorithm

Results using ZeroR algorithm

Visualization of ZeroR results for Precision
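For context on the ZeroR results: ZeroR ignores all features and always predicts the most frequent class in the training data, which is why it is normally used as a baseline for other classifiers. A minimal Python sketch of that idea follows; the labels are made up and are not the results shown in these slides, and the helper names `zero_r_fit` and `precision` are hypothetical.

```python
from collections import Counter


def zero_r_fit(labels):
    """Return the most frequent class label in the training data."""
    return Counter(labels).most_common(1)[0][0]


def precision(y_true, y_pred, positive):
    """Fraction of predicted positives that are truly positive."""
    hits = [truth for truth, pred in zip(y_true, y_pred) if pred == positive]
    if not hits:
        return 0.0
    return sum(truth == positive for truth in hits) / len(hits)


# Toy labels: 1 = paper confirmed for the author, 0 = paper deleted.
train_labels = [1, 1, 1, 0, 1, 0]
test_labels = [1, 0, 1, 1]

majority = zero_r_fit(train_labels)
predictions = [majority] * len(test_labels)

accuracy = sum(t == p for t, p in zip(test_labels, predictions)) / len(test_labels)
prec = precision(test_labels, predictions, positive=1)
```

Any classifier worth keeping, such as J48 or Naïve Bayes, should beat these baseline numbers on the same split.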

Next Steps

We are working on feature engineering (feature transformation): taking the Author name attribute and transforming it into a common format for all author names.

Once the feature engineering is done, we will work principally on Naïve Bayes and other classification algorithms that we think will suit our problem, and then fine-tune the model.
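The author-name transformation described above could be sketched as follows. The slides do not say what the common format is, so this is one possible choice, assumed for illustration: lowercase, accents removed, punctuation dropped, whitespace collapsed, and "Last, First" reordered to "First Last". The function name `normalize_name` and the sample names are hypothetical.

```python
import re
import unicodedata


def normalize_name(name):
    """Normalize an author name to a lowercase 'first last' form.

    Assumed rules: reorder 'Last, First', strip accents, replace
    punctuation and digits with spaces, collapse whitespace.
    """
    if "," in name:
        last, _, first = name.partition(",")
        name = f"{first} {last}"
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep only letters and whitespace, then collapse runs of spaces.
    letters_only = re.sub(r"[^a-z\s]", " ", unaccented.lower())
    return " ".join(letters_only.split())


variants = ["Smith, John A.", "John A. Smith", "  JOHN  A  SMITH "]
normalized = {normalize_name(v) for v in variants}
```

All three variants collapse to a single key, which could then feed a feature such as "the Paper-Author name matches the Author name after normalization". Matching initials against full first names (for example "J. Smith" against "John Smith") is not handled here.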

Thank you!!
