Using Lucene/Solr to Build CiteSeer X and Friends Dr. C. Lee Giles Information Sciences and Technology Computer Science and Engineering The Pennsylvania State University University Park, PA, USA [email protected] http://clgiles.ist.psu.edu

Using Lucene/Solr to Build CiteSeerX and Friends


DESCRIPTION

Presented by C. Lee Giles, Pennsylvania State University. See complete conference videos: http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012

Cyberinfrastructure, or e-science, has become crucial in many areas of science, as data access often defines scientific progress. Open source systems have greatly facilitated the design and implementation of supporting cyberinfrastructure. However, there exists no open source integrated system for building a search engine and digital library that covers all phases of information and knowledge extraction, such as citation extraction, automated indexing and ranking, chemical formulae search, and table indexing. We propose the open source SeerSuite architecture, a modular, extensible system built on successful open source projects such as Lucene/Solr, and discuss its uses in building enterprise search and cyberinfrastructure for the sciences and academia. We highlight application domains with examples of specialized search engines that we have built: CiteSeerX for computer science, ChemXSeer for chemistry, ArchSeer for archaeology, AckSeer for acknowledgements, RefSeer for reference recommendation, CollabSeer for collaboration recommendation, and others, all using Solr/Lucene. Because such enterprise systems require unique information extraction approaches, several machine learning methods, such as conditional random fields, support vector machines, mutual information based feature selection, and sequence mining, are critical for performance.

Citation preview

Page 1: Using Lucene/Solr to Build CiteSeerX and Friends

Using Lucene/Solr to Build CiteSeerX and Friends

Dr. C. Lee Giles Information Sciences and Technology Computer Science and Engineering The Pennsylvania State University

University Park, PA, USA [email protected]

http://clgiles.ist.psu.edu

Page 2: Using Lucene/Solr to Build CiteSeerX and Friends

Prof. C. Lee Giles
• Intelligent and specialty search engines; cyberinfrastructure for science, academia and government
  – Modular, scalable, robust, automatic cyberinfrastructure and search engine creation and maintenance
  – Large heterogeneous data and information systems
  – Specialty search engines and portals for knowledge integration
    • CiteSeerX (computer and information science)
    • ChemXSeer (e-chemistry portal)
    • GrantSeer (grant search)
    • RefSeer (recommendation of paper references)
• Scalable intelligent tools/agents/methods/algorithms
  – Information, knowledge and data integration
  – Information and metadata extraction; entity disambiguation
  – Unique search, knowledge discovery, information integration, data mining algorithms
  – Web 2.0 methods
    • Automated tagging for search and information retrieval
    • Social network analysis

http://clgiles.ist.psu.edu

Page 3: Using Lucene/Solr to Build CiteSeerX and Friends

SeerSuite Contributors/Collaborators: recent past and present (incomplete list)

Projects: CiteSeer, CiteSeerX, ChemXSeer, ArchSeer, CollabSeer, GrantSeer, SeerSeer, RefSeer, AlgoSeer, AckSeer, BotSeer, YouSeer, …

• P. Mitra, V. Bhatnagar, L. Bolelli, J. Carroll, I. Councill, F. Fonseca, J. Jansen, D. Lee, W-C. Lee, H. Li, J. Li, E. Manavoglu, A. Sivasubramaniam, P. Teregowda, H. Zha, S. Zheng, D. Zhou, Z. Zhuang, J. Stribling, D. Karger, S. Lawrence, J. Gray, G. Flake, S. Debnath, H. Han, D. Pavlov, E. Fox, M. Gori, E. Blanzieri, M. Marchese, N. Shadbolt, I. Cox, S. Gauch, A. Bernstein, L. Cassel, M-Y. Kan, X. Lu, Y. Liu, A. Jaiswal, K. Bai, B. Sun, Y. Sung, J. Z. Wang, K. Mueller, J. Kubicki, B. Garrison, J. Bandstra, Q. Tan, J. Fernandez, P. Treeratpituk, W. Brouwer, U. Farooq, J. Huang, M. Khabsa, M. Halm, B. Urgaonkar, Q. He, D. Kifer, J. Pei, S. Das, S. Kataria, D. Yuan, T. Suppawong, others.

• Current funding: NSF, Dow Chemical

Page 4: Using Lucene/Solr to Build CiteSeerX and Friends

Outline
• Motivation
  – Data science; Cyberinfrastructure
  – Vast growth in domain science data and documents
• SeerSuite
  – Tool for creating Seers
  – Specialized data and document search and recommendations
    • Tables, formulae, figures, references …
  – Use of Solr/Lucene
• Disciplinary sciences, indexes & information extraction (the Seers)
  – Computer science
  – Chemistry
  – Briefly other Seers
• Opportunities for Research
• Conclusions and Directions

Page 5: Using Lucene/Solr to Build CiteSeerX and Friends

The Evolution of Science - the 4th Paradigm
• Observational Science
  – Scientist gathers data by direct observation
  – Scientist analyzes data
• Analytical Science
  – Scientist builds analytical model
  – Makes predictions.
• Computational Science
  – Simulate analytical model
  – Validate model and make predictions
• Data Driven Science
  – Data captured from the web, by instruments, or from documents
  – Data generated by simulation
  – Placed in data structures / files
  – Scientist(s) analyze(s) data
  – Access & search crucial

Jim Gray’s paradigm

Page 6: Using Lucene/Solr to Build CiteSeerX and Friends

Data Access Varies with Discipline or Small vs Big Science
• Small vs Big science
  – “Data from Big Science is … easier to handle, understand and archive. Small Science is horribly heterogeneous and far more vast. In time Small Science will generate 2-3 times more data than Big Science.”
    • ‘Lost in a Sea of Science Data’, S. Carlson, The Chronicle of Higher Education (23/06/2006)
  – Data is local
  – Data will not be shared
• At some point there will be needed
  – indices to control search
  – parallel data search and analysis
• Cyberinfrastructure can help
  – If you can’t move the data around, take the analysis to the data!
  – Bandwidth of a van loaded with disks
  – Do all data manipulations locally
    • Build custom procedures and functions locally

Page 7: Using Lucene/Solr to Build CiteSeerX and Friends

SeerSuite
• Open source search engine and digital library tool kit used to build search engines and digital libraries
  – CiteSeerX, ChemXSeer, RefSeer, YouSeer, CollabSeer, etc.
• Supports research in
  – Indexing and search
  – Digital libraries
  – Data mining & structures
  – Information and knowledge extraction
  – Social networks
  – Scientometrics/informetrics
  – Systems engineering, User design
  – Software engineering and management
  – Web crawling
• Trains students in search and software systems
  – Educational tool for search engine creation
  – Students highly sought in industry and government

Page 8: Using Lucene/Solr to Build CiteSeerX and Friends

SeerSuite - properties
• Modular, scalable, extensible, robust design
  – Extensible to many problems and disciplines
• Integrated features
  – Focused crawler - Heritrix
  – Indexer - Solr/Lucene
  – Metadata extraction - modular
  – Ranked results
• Builds on experience with other domain engines and OS tools
  – Lucene and Solr
  – The MySQL Database and InnoDB Storage Engine
  – Apache Tomcat
  – Spring Framework
  – Acegi Security
  – ActiveMQ
  – ActiveBPEL Open Source Engine
  – Apache Commons Libraries
  – SVMlight support vector machine package
  – CRF++ conditional random field package
• Hardware independent; Linux
• Reuse not reinvent

Page 9: Using Lucene/Solr to Build CiteSeerX and Friends

Data Mining & Information Extraction in Seers •  Data acquisition

•  SeerSuite systems often crawl the public web for new data •  Many data types available

•  Richness of data offers unique data mining features •  CiteSeerX as testbed/sandbox

•  Large scale data resources •  Millions of documents, authors, etc. •  Some common features/metadata

•  Commercial grade indexer (Solr/Lucene)

•  Scalable to G’s of documents and M’s of users •  “Watson”

•  Modular design •  Cloudable

•  State of the art algorithms (machine learning) for large scale unique metadata (information) extraction & mining

•  Unique parsers and indexing •  Quality of extraction •  Precision/recall •  Ranking •  Architecture/integration

Page 10: Using Lucene/Solr to Build CiteSeerX and Friends

Seer Friends
• In various stages of the system lifecycle with various data resources and indexes:
  – Mature and developing, code released
    • CiteSeer, now CiteSeerX
    • ChemXSeer
    • TableSeer
    • YouSeer
  – New, future TBD, not all aspects public
    • ArchSeer
    • AlgoSeer
    • CollabSeer
    • RefSeer
    • SeerSeer
    • GrantSeer
  – Dead or limping by (could be revived)
    • AckSeer (acknowledgement indexing) (revived!)
    • BizSeer
    • BotSeer
  – Proposed, but do not exist
    • BrainSeer
    • CensorSeer
    • ArXivSeer

Page 11: Using Lucene/Solr to Build CiteSeerX and Friends

Why Solr/Lucene?
• Only open source considered – cost
• Competitors:
  – Indri
  – Wumpus
  – Terrier
  – Others?
• Must scale for both number of documents and users
• Easily integrable and customizable
  – Other indexes, crawlers, ingestion, metadata extractors
• Well used (Watson)
• Active community of support
  – Enterprise platform a plus
• Easy to transition to government/industry/academia
  – Apache license

Page 12: Using Lucene/Solr to Build CiteSeerX and Friends

http://citeseerx.ist.psu.edu

Next Generation CiteSeer, CiteSeerX

• 2 M documents
• 40 M citations
• 2 to 5 M authors
• 2 to 4 M hits per day
• 800K individual users
• Entire data shared
• Index - 50 G

Page 13: Using Lucene/Solr to Build CiteSeerX and Friends

History: CiteSeer (aka ResearchIndex)

C. Lee Giles, Kurt Bollacker, Steve Lawrence

• Project at NEC Research Institute, Princeton
  – 1st academic document search engine
  – Very popular with computer science
• Hosted at NEC from 1997 – 2004.
  – Moved to Penn State as collaborators left.
• Provided a broad range of unique services including automatic citation indexing, reference linking, full text indexing, similar documents listing, automated metadata extraction and several other pioneering features.
• Refactored and redesigned as CiteSeerX
  – Released 2008
  – Lucene based indexing

CiteSeer continuously running for 15 years!

Page 14: Using Lucene/Solr to Build CiteSeerX and Friends

SeerSuite/CiteSeerX Architecture

•  Web Application

•  Focused Crawler

•  Document Conversion and Extraction

•  Document Ingestion

•  Data Storage

•  Maintenance Services

•  Federated Services

Teregowda, USENIX ‘10

Page 15: Using Lucene/Solr to Build CiteSeerX and Friends

4 systems:

•  Production •  Crawling •  Staging •  Research

All or some can be cloudized

Teregowda, USENIX 2010

Page 16: Using Lucene/Solr to Build CiteSeerX and Friends

CiteSeerX Services
CiteSeerX is a very automated system:
• Full OAI metadata if available
• Full text indexing (many different indexes)
  – Documents
  – Citations
  – Tables
  – More forthcoming (Algorithms, Figures, Acknowledgements).
• Citation graph
  – Ranking based on citations.
  – Linking documents
    - Co-citations
    - Citing documents
• Author disambiguation
  – Distinguish between authors with similar names.
  – Profiles and publication information for author.
• Automatic crawling from list and submissions
• Personalization
    - Login based access to features on CiteSeerX.
    - Corrections to metadata.
    - Storage of queries.
    - Collection of papers
    - Follows document metadata changes.

Page 17: Using Lucene/Solr to Build CiteSeerX and Friends

Focused Crawling
• Maintain a list of parent URLs where documents were previously found
  – Parent URLs are usually academic homepages.
    • 300,000 unique parent URLs, as of summer 2011
  – Parent URLs are stored in a database table with two additional fields for scheduling:
    • Last time changed, get new documents from the page.
    • Estimated change rate according to previous crawls of this page.
• The crawling process starts with the scheduler selecting 1000 parent URLs which have the highest probability of having new documents available.
  – Assume Poisson process for the change behavior of a parent page.
    • Suppose a parent page P’s last observed change occurred at time t1, and its estimated change rate is R; then at time t2 (t2 = t1 + Δ), the probability that it has changed again since t1 is 1 – exp(-R*Δ)
    • Larger R or larger Δ will give larger probability.
    • After each crawl, the change rate of the scheduled parent URL should be recalculated.
• Crawling runs incrementally daily (invoked by a Linux cron job at 12 am)
  – Most discovered documents have been crawled before.
    • Use hash table comparison for detection of new documents
    • Normally retrieve a few thousand NEW documents per day, sometimes less than 1k.
• Moved to whitelist vs blacklist

Zheng, CIKM’09
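The scheduling rule on this slide can be illustrated with a minimal Python sketch. It assumes times are measured in days and change rates in changes per day; the `ParentUrl` record, the `schedule` helper and the example URLs are hypothetical names used only for illustration, not part of the CiteSeerX crawler.

```python
import math
from dataclasses import dataclass


@dataclass
class ParentUrl:
    url: str
    last_change: float   # t1: time of the last observed change (days, assumed unit)
    change_rate: float   # R: estimated change rate (changes per day, assumed unit)


def change_probability(parent: ParentUrl, now: float) -> float:
    """Poisson model from the slide: P(changed since t1) = 1 - exp(-R * delta)."""
    delta = max(0.0, now - parent.last_change)
    return 1.0 - math.exp(-parent.change_rate * delta)


def schedule(parents, now: float, k: int = 1000):
    """Pick the k parent URLs most likely to have new documents."""
    ranked = sorted(parents, key=lambda p: change_probability(p, now), reverse=True)
    return ranked[:k]


# Hypothetical parent pages, for illustration only.
parents = [
    ParentUrl("http://example.edu/~a", last_change=0.0, change_rate=0.5),
    ParentUrl("http://example.edu/~b", last_change=8.0, change_rate=0.1),
    ParentUrl("http://example.edu/~c", last_change=5.0, change_rate=0.05),
]
selected = schedule(parents, now=10.0, k=2)
```

A larger R or a larger Δ pushes a page up the ranking, as the slide notes; re-estimating `change_rate` after each crawl is left out of the sketch.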

Page 18: Using Lucene/Solr to Build CiteSeerX and Friends

Documents from crawled URLs

90% of all citations come from the first 550 sites

90% of all documents come from the first 1250 sites

Page 19: Using Lucene/Solr to Build CiteSeerX and Friends

How  will  we  get  metadata  for  fields?  

Now... that should clear up a few things around here

Page 20: Using Lucene/Solr to Build CiteSeerX and Friends

Metadata Extraction
• Documents are converted from PDF/PS to text using converters.
  – Converters include TET, pdfbox, pdftotext, gs.
• Documents are filtered by checking for the existence of references and for duplication (checksum).
• Use tools or build your own
  – Metadata extraction system uses machine learning methods like SVM (Header Parser), CRF (ParsCit) to extract various entities from the document.
• Rule based templates are applied before extraction.
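The filtering step (keep a converted document only if it contains references and is not a duplicate) could look roughly like the following Python sketch. The SHA-1 checksum, the heading-based reference test and the function names are assumptions made for illustration; the slide does not say which checksum or reference detector CiteSeerX uses.

```python
import hashlib
import re

# Assumed heuristic: a document "has references" if a line consists of a
# References/Bibliography heading. The real filter may be more elaborate.
REFERENCES_HEADING = re.compile(
    r"^\s*(references|bibliography)\s*$", re.IGNORECASE | re.MULTILINE
)


def checksum(text: str) -> str:
    """Checksum of the converted text (SHA-1 chosen here as an example)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def accept(text: str, seen: set) -> bool:
    """Return True if the document passes both filters, recording its checksum."""
    if not REFERENCES_HEADING.search(text):
        return False          # no reference section found
    digest = checksum(text)
    if digest in seen:
        return False          # exact duplicate of an earlier document
    seen.add(digest)
    return True


seen_checksums = set()
paper = "A Title\nSome body text.\nReferences\n[1] An earlier paper.\n"
first = accept(paper, seen_checksums)
second = accept(paper, seen_checksums)
homepage = accept("Welcome to my homepage.", seen_checksums)
```

An exact checksum only catches byte-identical text; near-duplicate detection would need a different technique.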

Page 21: Using Lucene/Solr to Build CiteSeerX and Friends

Automatically Created DB of paper in CSX

“Tensor Decompositions and Applications”, SIAM REVIEW, 2009, pp 455-500. Abstract: This …. Cited 34 times, 6 times by Author

Field          Value
id             10.1.1.130.782
version        2
cluster        9248987
title          Tensor Decompositions and Applications
abstract       This ..
year           2009
venue          SIAM REVIEW
venueType      JOURNAL
pages          455-500
publisher      SIAM
public         True
n-cites        34
selfCites      6
crawldate      12/30/2008
repositoryID   10

Assigned By: System; Extractor/User/Inference; Inference/User

Page 22: Using Lucene/Solr to Build CiteSeerX and Friends

3 Tier Architecture

[Figure: 3-tier architecture diagram. Labeled components: User Request, Load Balancer, Web 1, Web 2, Web Application, Queries, Requests, Load Balancer, Index, Index - Tables, Repository, Database, Storage, Crawler, Extraction, Ingestion]

Page 23: Using Lucene/Solr to Build CiteSeerX and Friends

CiteSeerX Software Overview
• Ingestion Process: Responsible for obtaining and preparing a document and the related metadata.
  – Process the document
    • Submitted by the user or crawler
  – Extract metadata
    • Header
    • Citations
    • Acknowledgements
  – Store the metadata and documents.
• Citation Matching
  – Identifying the underlying graph structure: documents citing this document and the relationship between documents and citations
    • Inference matching and graph generation
• User Corrections (Version Maintenance)
  – Determine and accept valid user corrections
• Regular Notification Mechanisms
  – Ensure that the user is notified when new documents are added to the collection
    • Linked to MyCiteSeer.
• Update and Maintenance
  – Update and make valid the full text index and various statistics.
  – Statistics
  – Index updates

Page 24: Using Lucene/Solr to Build CiteSeerX and Friends

CiteSeerX Search
• Enabling Search
  – Fulltext
  – Fields created
    - Title
    - Authors
    - Citations
    - Venue
    - Keywords
    - Abstract
    - Range (Publication)
    - Citations

Page 25: Using Lucene/Solr to Build CiteSeerX and Friends

Field Schema

Field                 Type       Indexed/Stored
DOI                   String     Y/Y - Unique
Citation/Document     String     Y/Y
Title                 Text       Y/Y
Author                A Text*    Y/Y
Authors Normalized    A Text*    Y/N
ncites (# cited by)   Integer    Y/Y
URL                   String     Y/Y
cites                 Tokens^    Y/N
citedby               Tokens^    Y/N
Timestamp             Date       Y/Y

* - A Text is a Text field which does not have a stopword filter or stemming
^ - Tokens are a Text field with only duplicate removal and whitespace tokenizer

Page 26: Using Lucene/Solr to Build CiteSeerX and Friends

CiteSeerX Search Results
• Results Sorting
  – Relevance (default)
    - Based on dismax query handling with boosting.
  – Citations
    - Citations received by the document in collection plus default
  – Year
    - Publication date.
  – Recency
    - Date of acquisition.

Sorting

Page 27: Using Lucene/Solr to Build CiteSeerX and Friends

CiteSeerX Citation Graph
• Relationships
  – Citation graph
    - Store Cited by and Cites in index
• Build
    - Build document graph by querying index for relationship.

[Figure: example citation graph over documents A, B, C, D and E, with edges labeled "Cites" and "Cited by"]
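As a rough illustration of storing "cites" and "cited by" relationships and then deriving graph relations from them, here is a small in-memory Python sketch. The dictionaries stand in for the index fields; the document ids and the specific edges are invented for the example and are not taken from the slide's figure.

```python
from collections import defaultdict

# Hypothetical "cites" field per document (document -> documents it cites).
cites = {
    "A": {"B", "C"},
    "B": set(),
    "C": set(),
    "D": {"A"},
    "E": {"A", "C"},
}


def build_citedby(cites_field):
    """Invert the cites relation to obtain the cited-by relation."""
    citedby = defaultdict(set)
    for doc, targets in cites_field.items():
        for target in targets:
            citedby[target].add(doc)
    return citedby


def co_cited(doc, cites_field, citedby_field):
    """Documents cited together with `doc` by at least one citing document."""
    result = set()
    for citing in citedby_field.get(doc, ()):
        result |= cites_field[citing]
    result.discard(doc)
    return result


citedby = build_citedby(cites)
co_cited_with_c = co_cited("C", cites, citedby)
```

In a Solr-backed deployment the two lookups would be queries against the `cites` and `citedby` fields rather than dictionary accesses.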

Page 28: Using Lucene/Solr to Build CiteSeerX and Friends

Adding documents
• Ingest documents for new crawls
    - Add metadata to collection
    - Add full text to system
    - Link metadata in collection
• Run maintenance scripts
    - Poll updates and post to Solr.
      • Fulltext
      • Metadata
      • Relationships
• Challenge: Maintain data freshness.

Page 29: Using Lucene/Solr to Build CiteSeerX and Friends

Query Response

[Figure: diagram with components Web Interface, Web, Index and Database]

• Query forwarded to Solr from the presentation layer (JSP)
• Solr generates ranked response in JSON
• Build each record in XML with the database (add fields: Abstract)
• Presentation layer (JSP) formats records based on ranking.

Page 30: Using Lucene/Solr to Build CiteSeerX and Friends

Ranking with Boosting (Relevance)
• Use of Boost Function, Minimum Match, Query Fields
• Boost Function – the effect of citations
    - Map number of citations > 1 to 500
• Minimum Match – 2
• Query Fields
    - Text (1)
    - Title (4)
    - Abstract (2)
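The query-field weights and minimum-match setting above map naturally onto Solr DisMax request parameters. The Python sketch below only builds such a request URL; the host, port, handler path and the lowercase field names `text`, `title` and `abstract` are assumptions, and the citation boost function is deliberately omitted because the slide does not give its exact expression.

```python
from urllib.parse import urlencode


def dismax_params(user_query: str) -> dict:
    """DisMax parameters mirroring the slide: Text (1), Title (4), Abstract (2), mm = 2."""
    return {
        "q": user_query,
        "defType": "dismax",
        "qf": "text^1 title^4 abstract^2",  # assumed field names
        "mm": "2",                           # minimum number of clauses that must match
        "wt": "json",                        # ask Solr for a JSON response
    }


# Hypothetical Solr endpoint, for illustration only.
SOLR_SELECT = "http://localhost:8983/solr/select"
url = SOLR_SELECT + "?" + urlencode(dismax_params("tensor decomposition"))
```

A citation-based boost would be supplied through the `bf` parameter as a Solr function query over the citation-count field.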

Page 31: Using Lucene/Solr to Build CiteSeerX and Friends

Query Response
• Query at Interface (JSP)
• Hand over to Web application (Java/Spring)
• Hand over to Solr
• Ranked response from Solr (JSON)
• Response unwrapped and more details included with information from DB
• Present response at Interface (JSP)

[Figure: query flow between Web Interface, Web Application, Index and DB. Arrow labels: Q, R, F. Data formats along the path: Text, JSON, HashMap]

Page 32: Using Lucene/Solr to Build CiteSeerX and Friends

Name Disambiguation
• Name disambiguation (NER)
  – A person can be referred to in different ways with different attributes in multiple records; the goal of name disambiguation is to resolve such ambiguities, linking and merging all the records of the same entity together
• Three types of name ambiguities:
  – Aliases - one person with multiple aliases, name variations, or name changed, e.g. CL Giles & Lee Giles, Superman & Clark Kent
  – Common Names - more than one person shares a common name, e.g. Jian Huang – 103 papers in DBLP
  – Typography Errors - resulting from human input or automatic extraction
• Goal: disambiguate, cluster and link names in a large digital library or bibliographic resource such as Medline, CiteSeerX, etc.

Page 33: Using Lucene/Solr to Build CiteSeerX and Friends

Efficient Large Scale Entity Disambiguation
Testbed: CiteSeerX and PubMedSeer
• Entity disambiguation problem
  – Determine the real identity of the authors using metadata of the research papers, including co-authors, affiliation, physical address, email address, information from crawling such as host server, etc.
  – Entity normalization
• Motivation
  – Enhance search functionalities for digital repositories
    • Fielded search by author name
  – Improve metadata quality
  – Improved social network analysis
  – Government and business intelligence
    • E.g. census data and credit records
• Challenges
  – Accuracy
  – Scalability
  – Expandability

[Figure: disambiguation pipeline from documents to actors/entities. Labeled components: Metadata Extraction Module; Blocking Module; Similarity Function (Jaccard Similarity, Soft-TFIDF Similarity); Annotator; Distance Learner (Online SVM with Active Learning); SVM Distance Function; Candidate Class (Author 1 Paper 3, Author 2 Paper 4); DBSCAN Clustering Module]

• Key features
  – LASVM distance function
    • Active learning
      – Simpler and more accurate model
      – Better generalization power
    • Online learning
      – Expandable to new training data
  – DBSCAN clustering
    • Ameliorate labeling inconsistency (transitivity problem)
    • Efficient solution to find name clusters
      – N log N scaling

Huang et al., PKDD 2006; Treeratpituk et al., JCDL 2009
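Two of the components named in the pipeline, blocking and Jaccard similarity, are simple enough to sketch in Python. The blocking key used here (last name plus first initial) and the author-mention records are assumptions for illustration; the slide does not specify the actual blocking scheme, and the SVM distance function and DBSCAN clustering are not reproduced.

```python
from collections import defaultdict


def jaccard(a, b) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections, 0.0 if both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def block_key(name: str):
    """Assumed blocking key: (last name, first initial), lowercased."""
    parts = name.lower().replace(".", " ").split()
    return (parts[-1], parts[0][0])


def make_blocks(mentions):
    """Group author mentions so that only mentions in the same block are compared."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[block_key(mention["name"])].append(mention)
    return groups


# Hypothetical author mentions, for illustration only.
mentions = [
    {"name": "C. Lee Giles", "coauthors": {"S. Lawrence", "K. Bollacker"}},
    {"name": "C. L. Giles", "coauthors": {"S. Lawrence", "I. Councill"}},
    {"name": "Jian Huang", "coauthors": {"P. Mitra"}},
]
groups = make_blocks(mentions)
giles = groups[("giles", "c")]
coauthor_similarity = jaccard(giles[0]["coauthors"], giles[1]["coauthors"])
```

Similarities such as this one would serve as input features to the learned distance function, which then feeds the clustering step.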

Page 34: Using Lucene/Solr to Build CiteSeerX and Friends

Author Disambiguation Field
• Currently uses author fields
  – For author search (both for author mentions and for disambiguated authors)
• Future direction
  – Use Lucene index for blocking in author disambiguation: creating candidate set of author mentions that could belong to the same cluster

Page 35: Using Lucene/Solr to Build CiteSeerX and Friends

Author Disambiguation
• Random Forest (RF)
  – Use random feature selection + bootstrap sampling to construct multiple decision trees from one training data set
  – Aggregate votes of a collection of decision trees as final decision
  – The more independent each tree is, the better the improvement over a single decision tree
• Author disambiguation with Random Forest
  – Various metadata is used as features in Random Forest to determine whether two author names from two papers refer to the same person
    • E.g. author names, affiliation, coauthors, keywords, journal information, year of publication, etc.
  – Multiple distance functions are used for each type of metadata
    • E.g. TFIDF, Jaccard distance, for comparing affiliations
• Compared with previous SVM-based approach
  – Shown to provide higher accuracy than SVM in pair-wise author disambiguation task
  – Easy parameterization in the training phase (only number of trees and randomness at each node, no decision on kernel function needed), and performance is not sensitive to parameters chosen
  – Provides a measurement for the importance of each individual feature (how informative each feature is, and how sensitive the decision is to noise in a particular feature), which is not trivial for SVM with non-linear kernel
  – Training time & classification time are linear in the number of trees and data size
• Also provides higher disambiguation accuracy when compared with other traditional methods (Logistic Regression, Naïve Bayes, Decision Tree)

Treeratpituk, Giles, JCDL09

Page 36: Using Lucene/Solr to Build CiteSeerX and Friends

Data and Publications in the Field of Chemistry

Chemistry
• Not physics (no arXiv) or computer science (no CiteSeer)
• Legacy of early information access - Chem Abstracts
• Cheminformatics is not bioinformatics

Chemistry has been up to recently a data poor field
• Data sharing tradition just being established
• Data creation is exploding - local (small science)

Journals and societies sensitive to their IP issues dominate the field
• Unsubstantiated IP claims such as data in the paper belongs to the publisher
• Discourage online versions of publications - ACS

Large powerful international companies have a vested interest in research
• Chemical information extraction tools are easily monetized
• Standards exist - CML, InChI

“Fixing the past so we can fix the future.” Jeremy Frey
• Chemistry is an old discipline with publications going back 100 years

Chemistry is compound centric, not algorithmic centric
• Search is about the compound!
• Compounds have a rich data environ
  – 3D graph structure, energies, etc.

Page 37: Using Lucene/Solr to Build CiteSeerX and Friends

ChemXSeer Architecture Integrate and implement well-used open source tools

Use CiteSeerX tools when possible Integrate into SeerSuite Search

Chemical formulae unique search Table search Figure search More data (grey literature) than documents

•  Automated information extraction modules based on machine learning methods •  Lucene/Solr indices for extracted fields, •  Relational databases for datasets,

Work closely with chemists to understand their needs Tools for data conversion

Provide a public portal and repository for easy use User access controls

Integrated visualization tools like JMOL for Gaussian data residing in our repository

APIs for users for extracted data

Data and documents standards de facto: xml, pdf, etc.

Page 38: Using Lucene/Solr to Build CiteSeerX and Friends

chemxseer.ist.psu.edu

Page 39: Using Lucene/Solr to Build CiteSeerX and Friends

ChemXSeer Formula Search

• Extraction and search of chemical formulae in scientific documents has been shown to be very useful.

• Intersection of two research areas: • Information retrieval • Chemoinformatics

•  Formulae cannot be treated as text. • Domain knowledge (formula identification) • Structural knowledge (substructure finding and search)

B. Sun, WWW’07, WWW’08, TOIS’11 D. Yuan, ICDE’12

Page 40: Using Lucene/Solr to Build CiteSeerX and Friends

Challenges in Formula Search

How to identify a formula in scientific documents?

Non-Formula “… This work was funded under NIH grants …” “ … YSI 5301, Yellow Springs, OH, USA …” “… action and disease. He has published over …”

Formula “… such as hydroxyl radical OH, superoxide O2- …” “ and the other He emissions scarcely changed …”

Machine learning algorithms (SVM + CRF) yield high accuracies for correct formula identification.

Page 41: Using Lucene/Solr to Build CiteSeerX and Friends

Segmenting chemical names
• Goal: to discover semantically meaningful sub-terms in chemical names
  – Methylethyl alcohol
  – methionylglutaminylarginyltyrosylglutamylserylleucyl

phenylalanylalanylglutaminylleucyllysylglutamylarginyl  lysylglutamylglycylalanylphenylalanylvalylprolylphenyl  alanylvalylthreonylleucylglycylaspartylprolylglycylisol  eucylglutamylglutaminylserylleucyllysylisoleucylaspartyl  threonylleucylisoleucylglutamylalanylglycylalanylaspartyl  alanylleucylglutamylleucylglycylisoleucylprolylphenyl  alanylserylaspartylprolylleucylalanylaspartylglycylprolyl  threonylisoleucylglutaminylasparaginylalanylthreonylleucyl  arginylalanylphenylalanylalanylalanylglycylvalylthreonyl  prolylalanylglutaminylcysteinylphenylalanylglutamyl  methionylleucylalanylleucylisoleucylarginylglutaminyllysyl  hisEdylprolylthreonylisoleucylprolylisoleucylglycylleucyl  leucylmethionyltyrosylalanylasparaginylleucylvalylphenyl  alanylasparaginyllysylglycylisoleucylaspartylglutamylphenyl  alanyltyrosylalanylglutaminylcysteinylglutamyllysylvalyl  glycylvalylaspartylserylvalylleucylvalylalanylaspartylvalyl  prolylvalylglutaminylglutamylserylalanylprolylphenylalanyl  arginylglutaminylalanylalanylleucylarginylhisEdylasparaginyl  valylalanylprolylisoleucylphenylalanylisoleucylcysteinyl  prolylprolylaspartylalanylaspartylaspartylaspartylleucyl  leucylarginylglutaminylisoleucylalanylseryltyrosylglycyl  arginylglycyltyrosylthreonyltyrosylleucylleucylserylarginyl  alanylglycylvalylthreonylglycylalanylglutamylasparaginyl  

Page 42: Using Lucene/Solr to Build CiteSeerX and Friends

Chemical Search Aspects
• Parsing
• Extraction and tagging
• Indexing
• Ranking

Page 43: Using Lucene/Solr to Build CiteSeerX and Friends

Chemical Entity Extraction and Tagging
• Name tagging
  – Each chemical name can be a phrase
  – Example
    • "... Determination of lactic acid and ..."
    • "... insecticide promecarb (3-isopropyl-5-methylphenyl methylcarbamate) acts against ..."
• Formula tagging
  – Each formula is a single term
  – Example
    • "... such as hydroxyl radical OH, superoxide ..."
  – Non-formula example
    • "... YSI 5301, Yellow Springs, OH, USA ..."
• Tagging examples
  – Name tagging:
    "... of <name-type>lactic acid</name-type> and ..."
  – Formula tagging:
    "... radical <formula-type>OH</formula-type> , superoxide ..."

Page 44: Using Lucene/Solr to Build CiteSeerX and Friends

Textual Chemical Molecule Information Indexing and Search
• Index Schemes:
  – Which tokens to index?
  – Indexing all subsequences generates a large size index
• Segmentation-based index scheme
  – Used for indexing chemical names
  – First segment a chemical name hierarchically and then index substrings at each node

[Figure: hierarchical segmentation tree for "methylethyl", with node labels methylethyl, ethylmethyl, meth, ethyl, yl, me, th]

• Frequency-and-discrimination-based index scheme
  – Used for indexing chemical formulas
  – Sequentially select frequent and discriminative subsequences of a formula from the shortest to the longest

Page 45: Using Lucene/Solr to Build CiteSeerX and Friends

Features for Formula Indexing
• Formula
  – A sequence of chemical elements or partial formulas with corresponding frequencies
  – E.g. CH3(CH2)2OH
• Partial formula
  – Partial formula: a subsequence of a formula
  – E.g. C, H, O, CH3, CH2, OH, CH3(CH)2, H3(CH)2, CH3(CH)2O, etc.
• Index construction
  – Partial formulas with frequencies: e.g. <C,3>, <H,6>, <CH2,2>, etc.
  – Too many partial formulas, need feature selection

Page 46: Using Lucene/Solr to Build CiteSeerX and Friends

Criteria of Feature Selection
• Criteria of feature selection
  – Frequent features ($Freq_s \ge Freq_{min}$)
  – Discriminative features ($\alpha_s \ge \alpha_{min}$)
    • If a sequence’s selected subsequences are enough to distinguish formulas containing them from other formulas, this sequence is redundant.
• Discrimination score

$$\alpha_s = \frac{\left|\bigcap_{s' \in F \,\wedge\, s' \preceq s} D_{s'}\right|}{|D_s|}$$

  where F is the selected feature set, and $D_s$ is the set of formulas containing s.

Page 47: Using Lucene/Solr to Build CiteSeerX and Friends

An Example for Formula Indexing

•  Data set:
   –  1. CH3COOH, 2. CH3(CH2)2OH, 3. CH3(CH2)3COOH
•  Parameters:
   –  $Freq_{min} = 2$, $\alpha_{min} = 1.1$
•  Steps:
   –  Length = 1: Candidates = {C, H, O}, F = {C, H, O}
   –  Length = 2: Candidates = {CH3, H3C, CO, OO, OH, CH2}
      Frequent candidates = {CH3, CO, OO, OH, CH2}
      Frequent and discriminative candidates = {CO, OO, CH2}
      F = {C, H, O, CO, OO, CH2}
   –  Length = 3, …

$$\alpha_{CH_3} = \frac{|D_C \cap D_H|}{|D_{CH_3}|} = \frac{|\{1,2,3\} \cap \{1,2,3\}|}{|\{1,2,3\}|} = 1$$

$$\alpha_{CO} = \frac{|D_C \cap D_O|}{|D_{CO}|} = \frac{|\{1,2,3\} \cap \{1,2,3\}|}{|\{1,3\}|} = 1.5$$
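The two discrimination scores in this example can be checked with a few lines of Python. This is a minimal sketch, not the authors' implementation: the containment sets `D` are copied by hand from the slide's data set, and the subsequence test is simplified to a plain substring test, which is only adequate for these short partial formulas. The function name `discrimination` is an illustrative assumption.

```python
# Formula ids containing each partial formula, copied from the slide's data set
# (1 = CH3COOH, 2 = CH3(CH2)2OH, 3 = CH3(CH2)3COOH).
D = {
    "C": {1, 2, 3},
    "H": {1, 2, 3},
    "O": {1, 2, 3},
    "CH3": {1, 2, 3},
    "CO": {1, 3},
}

def discrimination(s, selected, containing):
    """alpha_s = |intersection of D_s' over selected s' inside s| / |D_s|."""
    # Simplification (assumption): "s' is a subsequence of s" is tested
    # as a plain substring check on the formula strings.
    subs = [t for t in selected if t in s]
    if not subs:
        return float("inf")  # nothing selected yet, so s is maximally informative
    common = set.intersection(*(containing[t] for t in subs))
    return len(common) / len(containing[s])

F = ["C", "H", "O"]   # features selected at length 1
ALPHA_MIN = 1.1       # threshold from the slide

alpha_ch3 = discrimination("CH3", F, D)
alpha_co = discrimination("CO", F, D)
keep_ch3 = alpha_ch3 >= ALPHA_MIN
keep_co = alpha_co >= ALPHA_MIN
```

With the threshold of 1.1, CH3 is rejected because C and H already identify the same three formulas, while CO is kept because it narrows the set to formulas 1 and 3.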

Page 48: Using Lucene/Solr to Build CiteSeerX and Friends

Formula Search

•  SF.IEF: Subsequence Frequency & Inverse Entity Frequency

•  Exact formula search
   –  Search for exact representations. E.g. =C1-2H4-6 matches CH4 and C2H6, not H4C or H6C2.

•  Frequency formula search
   –  Full frequency search: search for formulas with specified chemical elements and frequency ranges, ignoring the order, with no unspecified elements. E.g. C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, not CH4O, C2H6O2.
   –  Partial frequency search: similar, but unspecified elements are allowed. E.g. *C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, and CH4O and C2H6O2 as well.
   –  Ranking function

$$SF(s,e) = \frac{Freq(s,e)}{|e|}, \qquad IEF(s) = \log\frac{|C|}{|\{e \mid s \preceq e\}|}$$

$$score(q,e) = \frac{\sum_{s \in q} SF(s,e)\, IFF(s)^2}{|e| \times \sum_{s \in q} IFF(s)^2}$$

      Here $IFF(s)$, the inverse formula frequency, is $IEF(s)$ with formulas as the entities.
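The matching rules for full and partial frequency search can be sketched as follows. This is an illustrative reading of the slide's examples, not the system's code: the function names are assumptions, formulas are limited to flat strings without parentheses, and the `=` prefix for exact search is not handled. A leading `*` selects partial frequency search, as in the slide's query syntax.

```python
import re
from collections import Counter

def element_counts(formula):
    """Total count per element for a flat formula such as 'CH3CH3'."""
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts

def parse_query(query):
    """Map each element in a query such as 'C1-2H4-6' to its (low, high) range."""
    ranges = {}
    for symbol, low, high in re.findall(r"([A-Z][a-z]?)(\d+)(?:-(\d+))?", query):
        ranges[symbol] = (int(low), int(high) if high else int(low))
    return ranges

def frequency_match(query, formula):
    """Full frequency search, or partial frequency search if the query starts with '*'."""
    partial = query.startswith("*")
    ranges = parse_query(query)
    counts = element_counts(formula)
    for symbol, (low, high) in ranges.items():
        if not low <= counts.get(symbol, 0) <= high:
            return False
    # Full search rejects formulas that contain elements the query did not name.
    return partial or set(counts) <= set(ranges)

candidates = ["CH4", "C2H6", "H6C2", "CH3CH3", "CH4O", "C2H6O2"]
full_hits = [f for f in candidates if frequency_match("C1-2H4-6", f)]
partial_hits = [f for f in candidates if frequency_match("*C1-2H4-6", f)]
```

Because matching works on element totals, H6C2 and CH3CH3 are treated the same as C2H6, which is the order-insensitive behavior the slide describes.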

Page 49: Using Lucene/Solr to Build CiteSeerX and Friends

Formula Search: Substructure

•  Substructure formula search
   –  Search for formulas that may have a substructure. E.g. -COOH matches CH3COOH (exact match: high score), HOOCCH3 (reverse match: medium score), and CH3CHO2 (parsed match: low score).
   –  Ranking function

$$score(s,e) = W_{match(s,f)}\, SF(s,e)\, IFF(s) / |e|$$

      where $W_{match(q,f)}$ is the weight for exact match, reverse match, and parsed match

•  Similarity formula search
   –  Search for formulas with a structure similar to that of the query formula. Feature-based approach using partial formula matching. E.g. ~CH3COOH matches CH3COOH, (CH3COO)2Co, CH3COO-, etc.
   –  Ranking function

$$score(q,e) = \sum_{s \preceq q} W_{match(q,e)}\, W(s)\, SF(s,q)\, SF(s,e)\, IFF(s) / |e|$$

•  Conjunctive search of the basic types of formula searches
   –  E.g. [*C2H4-6 -COOH] matches CH3COOH, not C2H4O or CH3CH2COOH.

•  Document query rewriting
   –  E.g. the document query "atom formula:=CH4" is rewritten to "atom (CH4 OR CD4)" if the formula search =CH4 matches CH4 and CD4.

Page 50: Using Lucene/Solr to Build CiteSeerX and Friends

Formula Search - Query Models

Many models are possible, from exact to semantic
Models discriminated by matching algorithms

•  Exact search
   –  Search for exact representations
   –  E.g. =C1-2H4-6 matches CH4 and C2H6, not H4C or H6C2

•  Frequency searches
   –  Full frequency search: search for formulae with specified chemical elements and frequency ranges, ignoring the order, with no unspecified elements
   –  E.g. C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, not CH4O, C2H6O2
   –  Partial frequency search: similar, but unspecified elements are allowed
   –  E.g. *C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, and CH4O and C2H6O2 as well

•  Substructure search
   –  Search for formulae that may have a substructure
   –  E.g. -COOH matches CH3COOH (exact match: high score), HOOCCH3 (reverse match: medium score), and CH3CHO2 (parsed match: low score).

•  Similarity search
   –  Search for formulae with a structure similar to that of the query formula. Feature-based approach using partial formulae matching.
   –  E.g. ~CH3COOH matches CH3COOH, (CH3COO)2Co, CH3COO-, etc.

Page 51: Using Lucene/Solr to Build CiteSeerX and Friends

Ranking formulae

•  Ranking of formulae has to depend on need and importance
•  Focus on structural methods and frequency
•  Importance can be introduced by citation rank, PageRank, or others
•  SF.IFF
   –  Substructure frequency and inverse formula frequency
•  Frequency searches

$$score(q,f) = \frac{\sum_{e \in q} SF(e,f)\, IFF(e)^2}{|f| \times \sum_{e \in q} IFF(e)^2}$$

   –  where $|f|$ is the total frequency of elements

•  Substructure search

$$score(q,f) = W_{match(q,f)}\, SF(q,f)\, IFF(q) / |f|$$

   –  where $W_{match(q,f)}$ is the weight for exact match, reverse match, and parsed match

•  Similarity search

$$score(q,f) = \sum_{s \preceq q} W_{match(q,f)}\, W(s)\, SF(s,q)\, SF(s,f)\, IFF(s) / |f|$$

Page 52: Using Lucene/Solr to Build CiteSeerX and Friends

Chemical compounds as graphs
•  Chemical compound modeled as a semantic graph with properties
   –  Atom: vertex/node in the graph
   –  Bond: edge in the graph
   –  Dimensions: 3 or 4

Figures are copied from eMolecules.com

Page 53: Using Lucene/Solr to Build CiteSeerX and Friends

What Is Chemical Structure Search?
•  Substructure search
   –  Given an input chemical structure sketch, find all the chemical compounds containing the input as a substructure.
•  Superstructure search
   –  Given an input chemical structure sketch, find all the important descriptors (substructures/functional groups) contained in the input.
•  Similarity search
   –  Given an input chemical structure sketch, find all the chemical compounds "similar" to the input.

Page 54: Using Lucene/Solr to Build CiteSeerX and Friends

Table Search

Tables are widely used to present experimental results or statistical data in scientific documents; some data only exists in these tables.

Current search engines treat tabular data as regular text
•  Structural information and semantics are not preserved.

Goal: automatically identify tables, extract table metadata from PDF documents into XML, and rank the data

Table Metadata Representation:
•  Environment metadata: (document specifics: type, title, …)
•  Frame metadata: (border left, right, top, bottom, …)
•  Affiliated metadata: (caption, footnote, …)
•  Layout metadata: (number of rows, columns, headers, …)
•  Cell content metadata: (values in cells)
•  Type metadata: (numeric, symbolic, hybrid, …)

Y. Liu AAAI'07, JCDL'07.

Page 55: Using Lucene/Solr to Build CiteSeerX and Friends

Tables
•  A history that pre-dates that of sentential text
   –  Cuneiform clay tablets
•  Have not received the same level of formal characterization enjoyed by sentential text
•  Varying and irregular formats
•  Different intuitive understandings of what a "table" is
   –  Is the Periodic Table of the Elements a table?
   –  Tables vs. lists?
   –  Tables vs. forms?
   –  Tables vs. figures?
   –  Genuine table vs. non-genuine table? [12]
•  Our definition: scientific genuine table
   –  Caption + tabular structure
   –  Ruling lines are not required

Page 56: Using Lucene/Solr to Build CiteSeerX and Friends

TableSeer  Beta design of a table search engine

Page 57: Using Lucene/Solr to Build CiteSeerX and Friends

TableSeer  System    

Architecture  

Page 58: Using Lucene/Solr to Build CiteSeerX and Friends

Page Box-Cutting Algorithm

•  Improves table detection performance by excluding more than 93.6% of the document content at the outset

Page 59: Using Lucene/Solr to Build CiteSeerX and Friends

Sample Table Metadata Extracted File

<Table>
  <DocumentOrigin>Analyst</DocumentOrigin>
  <DocumentName>b006011i.pdf</DocumentName>
  <Year>2001</Year>
  <DocumentTitle>Detection of chlorinated methanes by tin oxide gas sensors </DocumentTitle>
  <Author>Sang Hyun Park, a ? Young-Chan Son, a Brenda R . Shaw, a Kenneth E. Creasy,* b and Steven L. Suib* acd a Department of Chemistry, U-60, University of Connecticut, Storrs, C T 06269-3060</Author>
  <TheNumOfCiters></TheNumOfCiters>
  <Citers></Citers>
  <TableCaption>Table 1 Temperature effect o n r esistance change ( D R ) and response timeof tin oxide thin film with 1 % C Cl 4</TableCaption>
  <TableColumnHeading>D R Temperature/ ¡ã C D R a / W ( R ,O 2 ) (%) R esponse time Reproducibiliy </TableColumnHeading>
  <TableContent>100 223 5 ~ 22 min Yes 200 270 9 ~ 7-8 min Yes 300 1027 21 < 2 0 s Yes 400 993 31 ~ 1 0 s No </TableContent>
  <TableFootnote> a D R =( R , CCl 4 ) - ( R ,O 2 ). </TableFootnote>
  <ColumnNum>5</ColumnNum>
  <TableReferenceText>In page 3, line 11, … Film responses to 1% CCl4 at different temperatures are summarized in Table 1……</TableReferenceText>
  <PageNumOfTable>3</PageNumOfTable>
  <Snapshot>b006011i/b006011i_t1.jpg</Snapshot>
</Table>

Page 60: Using Lucene/Solr to Build CiteSeerX and Friends

TableRank  

•  Rank tables by rating <query, table> pairs instead of <query, document> pairs; this prevents many of the false-positive hits for table search that frequently occur in current web search engines
•  The similarity of a <table, query> pair is the cosine of the angle between their vectors
•  Tailored term vector space => table vectors
   –  Query vectors and table vectors, instead of document vectors
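The cosine measure named above can be illustrated with a small sketch. It uses raw term counts as vector weights, which is an assumption made for brevity; TableRank itself uses a tailored term weighting. The table texts and the names `vectorize` and `cosine` are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words vector with raw term counts (a stand-in for real term weights)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

query = vectorize("tin oxide resistance")
tables = {
    "t1": vectorize("temperature effect on resistance change of tin oxide thin film"),
    "t2": vectorize("number of citations per funding agency"),
}
# Rank tables, not documents, by similarity to the query vector.
ranked = sorted(tables, key=lambda name: cosine(query, tables[name]), reverse=True)
```

Scoring each table separately is what keeps a document that merely mentions the query terms in its body text from surfacing as a table hit.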

Page 61: Using Lucene/Solr to Build CiteSeerX and Friends

Table Index

•  Index: captions, footnotes, reference text
•  Boosting: captions (2)
•  Function: inversely (recip) proportional to #cites

Page 62: Using Lucene/Solr to Build CiteSeerX and Friends

Term Weighting for Tables
–  TTF-ITTF: Table Term Frequency-Inverse Table Term Frequency
–  TLB: Table Level Boost factors (e.g., table frequency)
–  DLB: Document Level Boost factors (e.g., journal/proceeding order, document citation)

Page 63: Using Lucene/Solr to Build CiteSeerX and Friends

Table  term  ranking  

•  A term occurring in a few tables is likely to be a better discriminator than a term appearing in most or all tables
•  Similar to a document abstract, table metadata and table queries should be treated as semi-structured text
   –  They are not complete sentences and express a summary
•  P = 0.5 (G. Salton 1988)
•  b is the total number of tables
•  IDF(ijk): the number of tables in which term t(i) occurs in the metadata m(k)
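The slide's formula did not survive extraction, so the following is only a sketch of one plausible form. It assumes the augmented term-frequency normalization commonly attributed to Salton and Buckley (1988), which the parameter P = 0.5 appears to reference, combined with a logarithmic inverse table frequency. The function names and the exact combination are assumptions, not the published TTF-ITTF definition.

```python
import math

def table_term_frequency(tf, max_tf, p=0.5):
    """Augmented term frequency: p + (1 - p) * tf / max_tf (assumed form)."""
    return p + (1.0 - p) * tf / max_tf

def inverse_table_term_frequency(total_tables, tables_with_term):
    """Log of b over the number of tables containing the term (assumed form)."""
    return math.log(total_tables / tables_with_term)

def term_weight(tf, max_tf, total_tables, tables_with_term, p=0.5):
    """Product of the two factors, in the spirit of TF-IDF."""
    return (table_term_frequency(tf, max_tf, p)
            * inverse_table_term_frequency(total_tables, tables_with_term))

rare = term_weight(tf=2, max_tf=4, total_tables=1000, tables_with_term=10)
common = term_weight(tf=2, max_tf=4, total_tables=1000, tables_with_term=900)
```

Under these assumptions a term found in 10 of 1000 tables gets a much larger weight than one found in 900 of them, matching the first bullet above.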

Page 64: Using Lucene/Solr to Build CiteSeerX and Friends

Table  Level  Boost  and  Document  Level  Boost  

•  B_tbf: the boost value of the table frequency
•  B_trt: the boost value of the table reference text (e.g., the normalized length)
•  B_tp: the boost value of the table position
•  r: a parameter that is 1 if the user specifies the table position in the query, and 0 otherwise

•  IV_j: document Importance Value (IV). If a table comes from a document with a high IV, all the table terms of this document should get a high document-level boost
•  IC_j: the inherited citation value
•  DO_j: source value (the rank of the journal or conference proceeding)
•  DF_j: document freshness

Page 65: Using Lucene/Solr to Build CiteSeerX and Friends

Table citation network
•  Similar to the PageRank network
   –  Documents construct a network from the citations
   –  The "incoming links": the documents that cite the document in which the table is located
   –  Exponential decay used to deal with the impact of the propagated importance
•  Unlike the PageRank network
   –  Directed acyclic graph
   –  Importance Value (IV) of a document is not decreased as the number of citations increases
   –  IV is not divided by the number of outbound links
•  A document may have multiple, one, or no tables
•  Each table consists of a set of metadata
•  The same keywords may appear in different metadata in different tables

Page 66: Using Lucene/Solr to Build CiteSeerX and Friends

Table Search Summary
•  A novel, first table ranking algorithm: TableRank
•  A tailored table term vector space
•  A table term weighting scheme: TTF-ITTF
   –  Aggregating impact factors from three levels: the term, the table, and the document
•  Index table referenced texts, term locations, and document backgrounds
•  Design and implement the first table search engine, TableSeer, to evaluate TableRank and compare it with popular web search engines
•  Code released
•  Currently implemented in CiteSeerX: millions of tables
•  Improving extraction: Dow Chemical support

Page 67: Using Lucene/Solr to Build CiteSeerX and Friends

Automated Figure Data Extraction and Search
•  A large amount of results in digital documents is recorded in figures, time series, and experimental results (e.g., NMR spectra, income growth), and this is the only record of the data
•  Extraction for purposes of:
   –  Further modeling using presented data
   –  Indexing and metadata creation for storage and search on figures for data reuse
•  Current extraction is done manually!

[Figure: workflow diagram with components Documents, Extracted Plot, Extracted Info., Plot Index, Document Index, Merged Index, Digital Library, User]

Page 68: Using Lucene/Solr to Build CiteSeerX and Friends

Seer Figure/Plot Data Extraction and Search

Numerical data in scientific publications are often found in figures.

Tools that automate the data extraction from figures provide the following: •  Increases our understanding of key concepts of papers •  Provides data for automatic comparative analyses. •  Enables regeneration of figures in different contexts. •  Enables search for documents with figures containing specific experiment results.

X. Lu JCDL’06 & IJDAR’09, Brouwer JCDL’08, Kataria AAAI’08

Page 69: Using Lucene/Solr to Build CiteSeerX and Friends

Metadata & data to extract: 2 Dimensional Plot

[Figure: snapshot of a document next to the extracted 2D plot, with callouts for X-Axis Label, Y-Axis Labels, Axis Units, Ticks, Legend, and Data Points]

Page 70: Using Lucene/Solr to Build CiteSeerX and Friends

Our Approach to Plot Data Extraction
•  Identify and extract figures from digital documents
   –  ASCII and image extraction (xpdf)
   –  OCR: bitmap, raster PDFs
•  Identify figures as images of 2D plots using SVM (only for bitmap images)
   –  Hough transform
   –  Wavelet coefficients of image
   –  Surrounding text features
•  Binarization of the identified 2D plots for preprocessing (not needed for vectorized images)
   –  Adaptive thresholding
•  Image segmentation to identify regions
   –  Profiling or image signature
•  Text block detection
   –  Nearest neighbor
•  Data point detection
   –  K-means filtering
•  Data point disambiguation for overlapping points
   –  Simulated annealing

Page 71: Using Lucene/Solr to Build CiteSeerX and Friends

Future Directions
•  System integration within ChemXSeer or CiteSeerX
   –  XML data generation
   –  Open source tool in Lucene/Solr
•  Extension to other figures (3D, …)

Page 72: Using Lucene/Solr to Build CiteSeerX and Friends

ChemXSeer Highlights
•  Portal for academic researchers in environmental chemistry which integrates the scientific literature with experimental, analytical and simulation results and tools
•  Provides unique metadata extraction, indexing and searching pertinent to the chemical literature by using heuristics combined with machine learning
   –  Chemical formulae and names
   –  Tables
   –  Figures
   –  Publication functions as in CiteSeerX
   –  Interoperability ORE-Chem development
   –  Novel ranking required
•  After extraction, data is stored as API-accessible XML for users
•  Hybrid repository (not fully open): serves as a federated information interoperational system
   –  Scientific papers crawled and indexed from the web
   –  User-submitted papers and datasets (e.g. Excel worksheets, Gaussian and CHARMM toolkit outputs)
   –  Scientific documents and metadata from publishers (e.g. Royal Society of Chemistry)
•  Access control for publisher-provided content and user-submitted experiment data
•  Takes advantage of developments in other funded cyberinfrastructure and open source projects
   –  CiteSeerX, PlanetLab, Lucene/Solr, ORE, others
   –  Some released as open source

Page 73: Using Lucene/Solr to Build CiteSeerX and Friends

•  CollabSeer currently supports 400k authors
•  http://collabseer.ist.psu.edu

Experimental Collaborator recommendation system

Page 74: Using Lucene/Solr to Build CiteSeerX and Friends

Collaboration recommendation

•  Metadata of authors and coauthors and topics of interest (similar to expert recommendation)
•  Use social network and topics to recommend collaborators of collaborators (FOF)
•  Devise SN index and ranking scheme
•  Explore models of vertex similarity
•  Built on SeerSuite
•  Other recommendations?
   –  Experimental methods
   –  Chemicals?

Gou JCDL'10, Gou MIR'10; Chen JCDL'11, SAC'12

Page 75: Using Lucene/Solr to Build CiteSeerX and Friends

Recommendation list and user's topic of interest

Page 76: Using Lucene/Solr to Build CiteSeerX and Friends

•  Users refine the recommendation list by clicking on their topic of interest. (left: refined by "query processing", right: default recommendation list)

Page 77: Using Lucene/Solr to Build CiteSeerX and Friends

•  How two potential collaborators are linked by common collaborators

Page 78: Using Lucene/Solr to Build CiteSeerX and Friends

CollabSeer  Framework  

Page 79: Using Lucene/Solr to Build CiteSeerX and Friends

Integration of Vertex Similarity and Textual Similarity

•  Terms of the combined measure:
   –  S: vertex similarity
   –  S_C.O.T.: collaborator's contribution to a specified topic
   –  Use the product of exponential functions so that a zero vertex similarity score or a zero contribution (textual similarity) score does not turn the whole measure into zero

•  Other measures?
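The effect of taking a product of exponentials can be shown with a short sketch. The slide's exact formula is not preserved here, so this assumes the simplest form, exp(S) multiplied by exp(S_C.O.T.), with no extra weights; the function name is hypothetical.

```python
import math

def collaboration_score(vertex_similarity, topic_contribution):
    """Product of exponentials of the two scores (assumed unweighted form)."""
    return math.exp(vertex_similarity) * math.exp(topic_contribution)

# A candidate with no shared neighbors but a strong topical contribution.
plain_product = 0.0 * 0.8
combined = collaboration_score(0.0, 0.8)
```

A plain product discards such a candidate entirely, while the exponential form still ranks the candidate by the nonzero component.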

Page 80: Using Lucene/Solr to Build CiteSeerX and Friends

•  RefSeerX: recommend citations for papers

•  Based on
   –  Existing citations
   –  Citation context
   –  Venue and importance
   –  Contemporary vs. seminal paper citations

The authors are unaware of related work → they do not know what they are looking for → RefSeerX recommends related citations

Use these

Page 81: Using Lucene/Solr to Build CiteSeerX and Friends

He, WWW ‘10, WSDM ’11; Kataria, CIKM ’10, IJCAI’11,

Page 82: Using Lucene/Solr to Build CiteSeerX and Friends
Page 83: Using Lucene/Solr to Build CiteSeerX and Friends

Expert  Search  

• Expert search for authors, currently in alpha

Page 84: Using Lucene/Solr to Build CiteSeerX and Friends

Expert  Search  

• Expert search for authors, currently in alpha

Page 85: Using Lucene/Solr to Build CiteSeerX and Friends

Keyphrase Extraction for experts

[Figure: pipeline diagram with components Text Document, Section Parser, Candidate Extractor, Random Forest, Top Keyphrases, DBLP data, Training Data]

•  Parse the document into sections with regular expressions
•  Use DBLP statistics to extract keyphrase candidates
•  Train a random forest to classify and rank whether a phrase is a keyphrase
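The first stage, splitting a document into sections with regular expressions, might look like the sketch below. The heading pattern and the list of section names are hypothetical; the slides do not give the expressions SEERLAB uses. Candidate extraction and the random forest stage are not shown.

```python
import re

# Hypothetical heading pattern: an optional section number followed by a
# known heading word alone on its line. The real patterns are not in the slides.
HEADING = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]+)?"
    r"(abstract|introduction|related work|methods|method|results|conclusions|conclusion|references)"
    r"[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

def split_sections(text):
    """Return a dict mapping each lower-cased heading to the text that follows it."""
    sections = {}
    matches = list(HEADING.finditer(text))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1).lower()] = text[match.end():end].strip()
    return sections

paper = (
    "Abstract\n"
    "We study keyphrases.\n"
    "1 Introduction\n"
    "Scholarly documents are long.\n"
    "2 Methods\n"
    "Random forest.\n"
)
sections = split_sections(paper)
```

Knowing which section a phrase comes from lets later stages treat, for example, abstract and title phrases differently from phrases in the references.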

Treeratpituk, P., Teregowda, P., Huang, J. and Giles, CL. SEERLAB: A System for Extracting Keyphrases from Scholarly Documents, Semeval-2010 task 5: Automatic keyphrase extraction from scientific article. ACL workshop on Semantic Evaluations (SemEval 2010), Sweden, July 2010.

Page 86: Using Lucene/Solr to Build CiteSeerX and Friends

GrantSeer
•  Prototype search engine for PI profiles and their grant information to assist funding agencies, deans of research, foundations
•  Link PIs with their
   –  Grants
   –  Publications
   –  Citations
   –  Organization
   –  Expertise
   –  Others?
•  Data that can be shared
   –  CiteSeerX or Google Scholar data
   –  Database of funded research

Funded by NSF – Julia Lane

Page 87: Using Lucene/Solr to Build CiteSeerX and Friends

Cover page NSF XML extraction

Page 88: Using Lucene/Solr to Build CiteSeerX and Friends

GrantSeer:  PI  profile  

grants awarded

publications + citations PI’s expertise

Page 89: Using Lucene/Solr to Build CiteSeerX and Friends

Algorithm  Search  

• Homepage search for authors, currently in alpha

Page 90: Using Lucene/Solr to Build CiteSeerX and Friends

AlgorithmSeer  

Algorithm Search

-  Extraction
-  Indexing
-  Ranking

Suite Workshop ICSE ‘11

Page 91: Using Lucene/Solr to Build CiteSeerX and Friends

Algorithm Search

Page 92: Using Lucene/Solr to Build CiteSeerX and Friends

Metadata extraction
•  Extract
   –  Pseudo-codes and their metadata
   –  Captions
   –  Reference sentences
   –  Synopsis
   –  Etc.
•  Index metadata using Solr to make the pseudo-codes searchable
•  Each search result has a pointer to the page in the document where the pseudo-code appears

Page 93: Using Lucene/Solr to Build CiteSeerX and Friends

Index Fields

id <string>
caption <text>
reftext <text> (Reference Sentences)
synopsis <text> (Summarizing Text)
page <sint> (Page Number)
paperid <string> (Document ID)
year <sint> (Year of Publication)
ncites <sint> (Number of Citations)
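A query against an index with these fields could be assembled as in the sketch below. The host, port, and core name are hypothetical, as is the choice to search three text fields and sort by citation count; only the field names come from the slide. The parameters `q`, `fq`, `fl`, `sort`, `rows`, and `wt` are standard Solr query parameters. The code only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint and core name; neither is given in the slides.
SOLR_SELECT = "http://localhost:8983/solr/algorithms/select"

def build_query_url(terms, min_year=None, rows=10):
    """Build a select URL that searches the text fields listed on the slide."""
    params = {
        "q": f"caption:({terms}) OR reftext:({terms}) OR synopsis:({terms})",
        "fl": "id,caption,page,paperid,year,ncites",
        "sort": "ncites desc",  # assumed ranking: most-cited papers first
        "rows": rows,
        "wt": "json",
    }
    if min_year is not None:
        params["fq"] = f"year:[{min_year} TO *]"
    return SOLR_SELECT + "?" + urlencode(params)

url = build_query_url("shortest path", min_year=2005)
```

Returning `page` and `paperid` with each hit is what allows a result to point at the page of the document where the pseudo-code appears.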

Page 94: Using Lucene/Solr to Build CiteSeerX and Friends

AckSeer  


Page 95: Using Lucene/Solr to Build CiteSeerX and Friends

AckSeer  


Page 96: Using Lucene/Solr to Build CiteSeerX and Friends

Funding Agencies

Name | Number of Acknowledgements | Total Citations | C/A Metric
National Science Foundation | 12287 | 144643 | 11.77
Defense Advanced Research Projects Agency | 4712 | 80659 | 17.12
Office of Naval Research | 3080 | 48873 | 15.87
Deutsche Forschungsgemeinschaft | 2780 | 9782 | 3.52
National Aeronautics and Space Administration | 2408 | 21242 | 8.82
Engineering and Physical Science Research Council | 2007 | 16582 | 8.26
Air Force Office of Scientific Research | 1657 | 16850 | 10.17
National Sciences and Engineering Research Council of Canada | 1422 | 12050 | 8.47
Department of Energy | 1054 | 5562 | 5.28
Australian Research Council | 1010 | 5464 | 5.41
European Union Information Technologies Program | 825 | 9594 | 11.63
National Institutes of Health | 709 | 7279 | 10.27
Army Research Office | 666 | 7709 | 11.58
Netherlands Organization for Scientific Research | 646 | 2843 | 4.4
Science and Engineering Research Council | 489 | 6976 | 14.27

Educational Institutions

Name | Number of Acknowledgements | Total Citations | C/A Metric
Carnegie Mellon University | 640 | 10840 | 16.94
Massachusetts Institute of Technology | 500 | 10509 | 21.02
California Institute of Technology | 464 | 4170 | 8.99
Santa Fe Institute | 368 | 3387 | 9.2
French National Institute for Research in Computer Science | 321 | 3399 | 10.59
Stanford University | 314 | 3693 | 11.76
University of California at Berkeley | 306 | 10439 | 34.11
National Center for Supercomputing Applications | 261 | 4777 | 18.3
International Computer Science Institute | 180 | 2078 | 11.54
Cornell University | 180 | 1656 | 9.2
University of Illinois at Urbana-Champaign | 177 | 5304 | 29.97
USC Information Sciences Institute | 176 | 3283 | 18.65
University of California Los Angeles | 176 | 2003 | 11.38
McGill University | 152 | 3001 | 19.74
Australian National University | 123 | 549 | 4.46

Companies

Name | Number of Acknowledgements | Total Citations | C/A Metric
International Business Machines | 1380 | 23948 | 17.35
Intel Corporation | 962 | 14441 | 15.01
Digital Equipment Corporation | 831 | 16390 | 19.72
Hewlett-Packard | 735 | 11186 | 15.22
Sun Microsystems | 651 | 12042 | 18.5
Microsoft Corporation | 368 | 6061 | 16.47
Silicon Graphics, Inc | 279 | 3898 | 13.97
Xerox Corporation | 265 | 4309 | 16.26
Siemens Corporation | 241 | 8395 | 34.83
Bellcore | 192 | 2393 | 12.46
Nippon Electric Company | 164 | 942 | 5.74
AT&T - Bell Labs | 146 | 1549 | 10.61
Apple Computer | 135 | 3159 | 23.4
Motorola | 122 | 1352 | 11.08
Texas Instruments | 92 | 1165 | 12.66

Individuals

Name | Number of Acknowledgements | Total Citations | C/A Metric
Olivier Danvy | 268 | 8000 | 29.85
Oded Goldreich | 259 | 4615 | 17.82
Luca Cardelli | 247 | 10846 | 43.91
Tom Mitchell | 226 | 5494 | 24.31
Martin Abadi | 222 | 9647 | 43.46
Phil Wadler | 181 | 7252 | 40.07
Moshe Vardi | 180 | 6094 | 33.86
Peter Lee | 167 | 8941 | 53.54
Avi Wigderson | 160 | 2901 | 18.13
Matthias Felleisen | 154 | 4705 | 30.55
Benjamin Pierce | 152 | 4641 | 30.53
Noga Alon | 152 | 2388 | 15.71
John Ousterhout | 152 | 6369 | 41.9
Frank Pfenning | 148 | 2049 | 13.84
Andrew Appel | 144 | 7630 | 52.99

Funding agency impact •  based on acknowledgement indexing •  # of acknowledgements •  total citations •  #Citation / #ack metric

Based on acknowledgment entities extracted from 150K acknowledgements in CiteSeer

Giles, PNAS, 2004

New system available this spring AckSeer

Funding Agency Impact
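The C/A metric on this slide is described as "#Citation / #ack", that is, total citations divided by the number of acknowledgements. The short sketch below recomputes it for three rows taken from the slide's table; the function name and the rounding to two decimals are assumptions chosen to match the printed values.

```python
def ca_metric(total_citations, acknowledgements):
    """Citations per acknowledgement, rounded to two decimals as on the slide."""
    return round(total_citations / acknowledgements, 2)

# (number of acknowledgements, total citations), copied from the slide's table.
rows = {
    "National Science Foundation": (12287, 144643),
    "Defense Advanced Research Projects Agency": (4712, 80659),
    "Olivier Danvy": (268, 8000),
}
metrics = {name: ca_metric(cites, acks) for name, (acks, cites) in rows.items()}
```

The recomputed values agree with the table, which confirms that the C/A column is the ratio of the two preceding columns.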

Page 97: Using Lucene/Solr to Build CiteSeerX and Friends

Most Acknowledged Authors and Impact Factor

Author | Citations | Acknowledgements | C/A Metric
Olivier Danvy | 847 | 268 | 29.85
Oded Goldreich | 3277 | 259 | 17.82
Luca Cardelli | 3847 | 247 | 43.91
Tom Mitchell | 3336 | 226 | 24.31
Martin Abadi | 3507 | 222 | 43.46
Phil Wadler | 3780 | 181 | 40.07
Moshe Vardi | 3786 | 180 | 33.86
Peter Lee | 1790 | 167 | 53.54
Avi Wigderson | 2566 | 160 | 18.13
Matthias Felleisen | 1622 | 154 | 30.55
Benjamin Pierce | 1484 | 152 | 30.53
Noga Alon | 2640 | 152 | 15.71
John Ousterhout | 3693 | 152 | 41.9
Frank Pfenning | 1639 | 148 | 13.84
Andrew Appel | 2064 | 144 | 52.99

Interviewed by Nature as to why he was the most acknowledged computer scientist

Who is most acknowledged?

Mom or dad Theorists or experimentalists

Who has a better metric?

Page 98: Using Lucene/Solr to Build CiteSeerX and Friends

Clouding CiteSeerX
•  Hosting cloud CiteSeerX instances
•  Economic issues
   –  Cost of hosting
   –  Cost of refactoring the source to be hosted in the cloud
•  Computational/technical issues
   –  What workflow to cloudize
   –  Component modification for efficient operation
   –  VM size: storage, memory and CPU sizing as a function of needs
   –  Establishing computational needs and availability clusters
   –  Appropriate load balancing across multiple sites
   –  Security of stored data, including metadata and user data
•  Policy issues
   –  Privacy of user data
   –  Copyright issues

Teregowda Cloud’10 USENIX’10

Page 99: Using Lucene/Solr to Build CiteSeerX and Friends

SeerSuite Research/Development Opportunities
•  Old Seers
   –  Improve or revive old systems and port them into the competitive SeerX space
      •  eBizSeer to eBizSeerX; BotSeer to BotSeerX; ArchSeer to ArchSeerX
•  New Seers
   –  New domains such as physics, neuroscience, biology, algorithms, TBD (build new indexes)
   –  MyCiteSeerX
•  Better features
   –  Parsing
   –  Entity disambiguation
   –  Citation analysis
   –  Ranking, ranking, ranking
•  New features
   –  New parsing, indexing, ranking
      •  Tables, figures, equations, algorithms, maps, carbon dating, chemical formulae, etc.
   –  Homepage linking
   –  ORE search and data integration
   –  Collaborative spaces
   –  API/web services
   –  Integration with DLs such as Fedora
   –  New clusters
      •  Topics, venues, affiliations
   –  Recommender systems
   –  SNA analysis
   –  Others

Collaborations welcomed! Data and software available

Page 100: Using Lucene/Solr to Build CiteSeerX and Friends

Research SeerSuite supports
•  Many uses as a research testbed and support structure
   –  Scaling of algorithms for IR, IE, data mining, social networks, ...
   –  NLP methods on large text collections
   –  ML methods to automatically extract data
   –  Novel indexing and ranking
   –  Federated search
   –  Collaborative and social networks
   –  Focused crawling: new data resources
   –  Interface design and integration
   –  Systems analysis
•  Many development and applied research issues
   –  Integration with other DLs
   –  Automated feature development
   –  Transfer to nontechnical use
   –  Cloud-based delivery

Page 101: Using Lucene/Solr to Build CiteSeerX and Friends

Summary
•  Propose an infrastructure for academic and scientific search engine/digital library creation: SeerSuite
   –  Modular, scalable, extensible, robust
   –  Based on commercial-grade open source (Solr/Lucene); easy to use
   –  Easy to apply to other domains (separable indexes and projects; integration)
•  Allows scalable data mining and information extraction for actual systems
   –  Unique information extraction plugins
   –  Focus on unique scalable extraction/data mining methods
      •  Most methods less than N^2 complexity
   –  Automatically populates databases or data structures
•  Demonstrate with beta systems in
   –  Computer science, Archaeology, Chemistry, Robots.txt, PubMed, YouSeer, Tables, Figures, Maps, References, Collaborations, Disambiguation
   –  Personal features
•  Systems are reasonably easy to build; issues are
   –  Data collection or data access
   –  Information extraction, indexing, ranking
•  Many uses as a research testbed
   –  Data sharing models
•  Want to find a Seer? Search Google or use my homepage.

Page 102: Using Lucene/Solr to Build CiteSeerX and Friends

Opportunities
•  Science is being flooded with data
   –  Simulations, sensors, web
•  Digital humanities is right behind
•  Needs in
   –  Large scale data management (tera to peta)
      •  NoSQL databases: graphs, documents, floating point
   –  Large scale
      •  data mining
      •  information extraction
      •  search
•  Domain expertise crucial
•  Reuse, not reinvent (much is out there)
•  Solr/Lucene is great for demos, production, and research.

Page 103: Using Lucene/Solr to Build CiteSeerX and Friends

For more information

•  clgiles.ist.psu.edu
•  [email protected]
•  SourceForge.com

“Human attention is the scarce resource, not information.” Herbert A. Simon, Nobel Laureate, 1997.