Edanz Journal Selector, A Prototype based on Solr/Nutch/Hadoop: Presented by Liang Shen, European Bioinformatics Institute

Edanz Journal Selector: A Prototype based on Solr/Nutch/Hadoop

Liang SHEN

@shenzhuxi
Web Developer, European Bioinformatics Institute
Drupal/Solr

Edanz Journal Selector (2011)

So many journals!

DEMO

Open Access

•  By National Center for Biotechnology Information, U.S. National Library of Medicine
•  Approximately 26,000 records are included in the PubMed journal lists

Feeds: Journal TOCs
•  21,498 journals from 1,677 publishers
•  Institute for Computer Based Learning
•  Heriot-Watt University
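Table-of-contents feeds of this kind are commonly published as RSS. The following is a minimal sketch of turning one feed into article records, assuming a simplified RSS 2.0 layout. The sample feed, its example.org URLs, and the `parse_toc_feed` name are hypothetical and are not taken from Journal TOCs or from the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; real TOC feeds carry more fields and namespaces.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Journal of Examples</title>
    <item>
      <title>First example article</title>
      <link>http://example.org/articles/1</link>
      <description>Abstract of the first article.</description>
    </item>
    <item>
      <title>Second example article</title>
      <link>http://example.org/articles/2</link>
      <description>Abstract of the second article.</description>
    </item>
  </channel>
</rss>"""


def parse_toc_feed(xml_text):
    """Return one dict per <item>, tagged with the journal (channel) title."""
    channel = ET.fromstring(xml_text).find("channel")
    journal = channel.findtext("title", default="")
    return [
        {
            "journal": journal,
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "abstract": item.findtext("description", default=""),
        }
        for item in channel.findall("item")
    ]


records = parse_toc_feed(SAMPLE_FEED)
```

Each record keeps the journal title next to the article text, which is the pairing a journal recommender would need later on.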

Springer
•  Springer Metadata API
   •  Provides metadata for over 5 million online documents
•  Springer Open Access API
   •  Provides metadata, full-text content, and images for over 80,000 open access articles

Open Source Stack

•  Infrastructure: Amazon Web Services
•  Data processing: Hadoop/Hive
•  Index: Solr/Lucene
•  Web service: Drupal
•  Analytics: Piwik

[Architecture diagram; labels: HDFS, Index, Feeds, API, Web]
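The components named above (feeds and APIs, HDFS, the index) suggest that harvested records end up as documents in Solr. Below is a minimal sketch of that last step, assuming records arrive as plain dicts. The field names, the `source` tag, and both function names are hypothetical; the talk does not show its schema. Solr's JSON update handler accepts a JSON array of documents, which is the shape produced here.

```python
import hashlib
import json


def to_solr_doc(record, source):
    """Map a harvested record onto a flat document with a stable id."""
    # Derive the id from the article link so that re-harvesting the same
    # article overwrites the existing document instead of duplicating it.
    doc_id = hashlib.sha1(record["link"].encode("utf-8")).hexdigest()
    return {
        "id": doc_id,
        "source": source,  # hypothetical tag, e.g. "feed" or "api"
        "journal": record["journal"],
        "title": record["title"],
        "abstract": record.get("abstract", ""),
    }


def to_update_payload(records, source):
    """Serialize records as a JSON array of documents."""
    return json.dumps([to_solr_doc(r, source) for r in records])


payload = to_update_payload(
    [{"journal": "Example Journal", "title": "An article",
      "link": "http://example.org/articles/1"}],
    source="feed",
)
```

The payload would then be sent to the index's update endpoint; the HTTP call is left out so the sketch runs without a Solr instance.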

Springer Journal Selector

Chinese

Japanese

Scalability
•  Shards
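In Solr's classic distributed search, a query is fanned out by passing a comma-separated `shards` parameter (each entry given as host:port/path, without the scheme) to one node, which merges the results. A minimal sketch of building such a request URL follows; the host names, the `abstract` field, and the `sharded_select_url` name are hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical shard locations (host:port/path, without the scheme).
SHARDS = [
    "solr1.example.org:8983/solr",
    "solr2.example.org:8983/solr",
]


def sharded_select_url(base_url, query, shards, rows=10):
    """Build a /select URL that asks one node to query every shard."""
    params = {
        "q": query,
        "rows": rows,
        "shards": ",".join(shards),
    }
    return base_url.rstrip("/") + "/select?" + urlencode(params)


url = sharded_select_url(
    "http://solr1.example.org:8983/solr", "abstract:genome", SHARDS
)
```

Adding capacity then means indexing into one more shard and appending it to the list, while the client keeps talking to a single node.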

Internet vs. Intranet

Re-think after 3 years

Don't use Hadoop for less than 5 TB of data
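For data well below that size, a single process streaming over files may be all that is needed. Below is a minimal sketch of a per-journal count, the kind of aggregation that might otherwise be written as a Hive query. The JSON-lines record layout and the `count_by_journal` name are assumptions for illustration, not the talk's actual pipeline.

```python
import io
import json
from collections import Counter


def count_by_journal(lines):
    """Stream JSON-lines records and count articles per journal."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        counts[json.loads(line)["journal"]] += 1
    return counts


# Hypothetical sample data; in practice `lines` would be an open file.
sample = io.StringIO(
    '{"journal": "Journal A", "title": "t1"}\n'
    '{"journal": "Journal B", "title": "t2"}\n'
    '{"journal": "Journal A", "title": "t3"}\n'
)
counts = count_by_journal(sample)
```

Because the function reads one line at a time, memory use depends on the number of distinct journals rather than on the size of the input.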

Thanks! Liang Shen