Using Full-text Academic Articles and Wikipedia to Find


Data and Tools

Abstract

Contact Information
Shutian Ma: mashutian0608@hotmail.com
Chengzhi Zhang: zhangcz@njust.edu.cn

Using Full-text Academic Articles and Wikipedia to Find Alternative Free Bioinformatics Software
Shutian Ma; Chengzhi Zhang

Department of Information Management, Nanjing University of Science and Technology, Nanjing, China, 210094

Experimental Result

❖ Scientific literature
   114,510 papers in XML format - PLOS ONE
   11,013 articles containing bioinformatics software

❖ Software list
   143 specific bioinformatics software packages - Wiki
   20 commercial software packages according to their licenses
   Only 97 software packages have infobox information

❖ Tools
   LSI and Doc2Vec - Gensim
   Python package of LDA
   Node2Vec - OpenNE
   Differ2Vec and struc2vec - GitHub

Taking bioinformatics software as a case study, this paper aims to find free software that is similar to commercial software and has the potential to serve as an alternative. Content and network information are applied to obtain preference-oriented results, which capture similarity in how people describe the software on Wikipedia and how they use it in research.

What do we want to do? Find Free Software!

Method

Represent Software Using Full-text and Wiki

Compute Similarity between Free and Commercial Software

Construct Ground Truth and Evaluate

Recommendation Result
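The "Compute Similarity between Free and Commercial Software" step can be sketched as follows. This is a minimal illustration, assuming cosine similarity over the software vectors; the poster does not name the similarity measure, and the function names, tool names, and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_free_alternatives(commercial_vec, free_vecs, top_k=3):
    """Rank free software by similarity to one commercial software vector."""
    scored = [(name, cosine(commercial_vec, vec)) for name, vec in free_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional vectors (hypothetical); the poster reports
# 100-dimensional content vectors and 128-dimensional graph vectors.
free_vecs = {
    "free_tool_a": [0.9, 0.1, 0.0],
    "free_tool_b": [0.0, 1.0, 0.0],
    "free_tool_c": [0.5, 0.5, 0.0],
}
commercial_vec = [1.0, 0.0, 0.0]

ranking = rank_free_alternatives(commercial_vec, free_vecs, top_k=2)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The same ranking function applies unchanged to content-based and graph-based vectors, since it only needs vectors of equal length.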

Categories whose names start with "List of" are excluded.

Find Bioinformatics Software Information that Tells us Which is Free/Commercial

Information in the Wiki infobox is converted into a “Knowledge Graph”

Use the number of returned results to calculate normalized pointwise mutual information
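Normalized pointwise mutual information (NPMI) from result counts can be sketched as below. The sketch assumes each probability is estimated as a result count divided by a total collection size; that total, the function name, and the example counts are assumptions, since the poster does not say how the counts are normalized.

```python
import math


def npmi(hits_x, hits_y, hits_xy, total):
    """NPMI of two terms from result counts.

    hits_x, hits_y: results returned for each term alone
    hits_xy:        results returned for both terms together
    total:          assumed size of the searched collection
    Returns a value in [-1, 1]: -1 never co-occur, 0 independent, 1 always co-occur.
    """
    if not (0 <= hits_xy <= min(hits_x, hits_y) and max(hits_x, hits_y) <= total):
        raise ValueError("inconsistent result counts")
    if hits_xy == 0:
        return -1.0  # limiting value when the terms never co-occur
    if hits_xy == total:
        raise ValueError("NPMI is undefined when every result contains both terms")
    p_x = hits_x / total
    p_y = hits_y / total
    p_xy = hits_xy / total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)


# Hypothetical counts for two software names.
print(npmi(hits_x=100, hits_y=100, hits_xy=100, total=1000))  # always co-occur
print(npmi(hits_x=100, hits_y=100, hits_xy=0, total=1000))    # never co-occur
```

Dividing by -log p(x, y) bounds the score, which makes values comparable across software pairs with very different result counts.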

Paper-based Software Recommendation Performance

Wiki-based Software Recommendation Performance

Conclusion

☑ Use Wiki content  ☐ Use full-text paper content
☑ Wiki content  ☐ Wiki graph
☑ Combine Wiki content & graph  ☐ Wiki content  ☐ Wiki graph
☑ Graph embedding can help to improve paper-based recommendation.
☑ When all information is combined, recommendation performance does not increase as much as expected.

✓ Wikipedia content and infobox can be balanced together for an efficient software recommendation technique.

✓ Graph-based information can help to enrich semantic information.

✓ It is not suitable to use this kind of full-text publication dataset to represent entities such as software, or other similar entities, in research.

When doing linear combinations, all weights are set to one.
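A linear combination with all weights equal to one can be sketched as below. This assumes the combination is applied to per-candidate similarity scores from different representations (for example content-based and graph-based); the poster does not specify whether scores or vectors are combined, and all names and values here are hypothetical.

```python
def combine_scores(score_tables, weights=None):
    """Weighted sum of similarity scores; weights default to 1.0 for every table."""
    if weights is None:
        weights = [1.0] * len(score_tables)
    if len(weights) != len(score_tables):
        raise ValueError("need exactly one weight per score table")
    combined = {}
    for table, weight in zip(score_tables, weights):
        for name, score in table.items():
            combined[name] = combined.get(name, 0.0) + weight * score
    return combined


# Hypothetical similarity scores of two free tools to one commercial tool.
content_scores = {"free_tool_a": 0.80, "free_tool_b": 0.40}
graph_scores = {"free_tool_a": 0.30, "free_tool_b": 0.60}

combined = combine_scores([content_scores, graph_scores])
best = max(combined, key=combined.get)
print(best, combined[best])
```

With unit weights, a representation whose scores span a wider range dominates the sum, so rescaling each score table before combining may be worth considering.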

❖ Software Representation Generation
   LSI, LDA and Doc2Vec – Represent software via vectors based on content in 100 dimensions
   Node2Vec, Differ2Vec and struc2vec – Represent software via vectors based on graph in 128 dimensions

❖ Software Graph Construction

Node Type                                      Value
developed by what kind of team                 university, company and person
year of stable release                         14 different years
written in what kind of programming language   17 languages
operating system                               Linux, Unix, Windows and MacOS
applied platform                               6 kinds
available language                             English or cross-language
software type                                  44 types
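One way to turn infobox fields with the node types listed above into a graph is sketched below. It assumes each infobox value becomes a typed attribute node linked to the software node by an undirected edge; the poster does not describe the exact graph schema, and the function name, tool names, and field values are hypothetical.

```python
from collections import defaultdict


def build_software_graph(infoboxes):
    """Return an undirected adjacency map linking software nodes to typed attribute nodes."""
    graph = defaultdict(set)
    for software, fields in infoboxes.items():
        software_node = ("software", software)
        for node_type, values in fields.items():
            for value in values:
                attribute_node = (node_type, value)
                graph[software_node].add(attribute_node)
                graph[attribute_node].add(software_node)
    return graph


# Hypothetical infobox data for two software packages.
infoboxes = {
    "tool_a": {"operating system": ["Linux", "Windows"], "written in": ["Python"]},
    "tool_b": {"operating system": ["Linux"], "written in": ["Java"]},
}

graph = build_software_graph(infoboxes)

# Attribute nodes shared by both tools connect them through two-hop paths.
shared = graph[("software", "tool_a")] & graph[("software", "tool_b")]
print(shared)
```

An adjacency map like this can be exported as an edge list, which is the usual input for graph embedding tools.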
