Invenio Technology
Selected Practical Software Development Lessons From A Large Digital Library System
Tibor Šimko <[email protected]>
Department of Information Technology, CERN
August 2013 / openlab talk
Tibor Šimko (CERN) Invenio Technology openlab 2013 1 / 62
Outline
1 Introduction
  Digital Library
  Invenio
2 Case Studies
  Episode 1: Python
  Episode 2: Git
  Episode 3: Testing
  Episode 4: Building Efficient Indexes
  Episode 5: NIH
  Episode 6: Scalability
3 Conclusions
What is a Digital Library?
“library in which collections are stored in digital formats (as opposed to print, microform, or other media) and accessible by computers”
(1) institutional document repositories
(2) world-wide subject-based information systems
Example #1: CERN Document Server
managing CERN and selected non-CERN high-energy physics and related documents since ∼1993
more than 1,000,000 records
articles, books, theses, photos, videos, and more
powered by Invenio since 2002
http://cdsweb.cern.ch/
CDS: Collection Tree
CDS: Search for Books
CDS: Search for Photos
CDS Features: Commenting
Invenio Features: Reviewing
CDS: Create Personal Alert
CDS: Add to Personal Basket
CDS: Display Personal Basket
CDS: Organize and Share Your Baskets
CDS: Journals and Bulletins
What is a Digital Library?
Example #2: INSPIRE
world-wide high-energy physics information system
run by CERN, DESY, FNAL, SLAC
metadata curation since 1960s, Invenio technology since 2007
citation analysis, author/affiliation analysis
close partnership with arXiv and ADS
http://inspirehep.net/
INSPIRE: full-text search
INSPIRE: cite summary
INSPIRE: citation history
INSPIRE: author pages
Invenio Key Features
navigable collection tree (regular, virtual)
powerful search engine
  – Google-like speed for up to 5M records
  – combined metadata, reference and fulltext search
flexible metadata (MARC, OA)
handling any kind of document (multimedia)
customizable input, formatting and linking
personalization and collaborative features: alerts, baskets, groups, reviews, comments
internationalisation (28 languages)
open source, GNU General Public License
co-developed by CERN (2002–), EPFL (2004–), DESY/FNAL/SLAC (2008–), CfA (2009–), Cornell (2011–)
installed at 30+ institutions world-wide
Invenio Modules: Overview
[diagram] actors: Author, Sources, Librarian, User
[diagram] components: Database, Ingestion Modules, Processing Modules, Dissemination Modules, Curation Modules
Invenio Modules: Ingestion
[diagram] actors and sources: Author, OAI Data Source, Non-OAI Data Source
[diagram] modules: WebSubmit, WebSession, WebAccess, BibHarvest, ElmSubmit, BibConvert, BibSched, BibUpload
[diagram] data: metadata, full-text document, MARCXML; stores: Metadata, Full-text
Invenio Modules: Processing
[diagram] stores: Metadata, Full-text, Clusters
[diagram] modules: RefExtract, BibClassify, BibDocFile, BibEncode, BibIndex, WebColl, BibRank, BibFormat, BibSort, BibAuthorID
Invenio Modules: Dissemination
[diagram] stores: Metadata, Full-text, Clusters; actors: User, OAI Harvester
[diagram] modules: WebSearch, WebBasket, BibAuthorID, WebAlert, BibHarvest, WebComment, WebMessage, WebJournal, BibCirculation, WebStat, WebHelp
Invenio Modules: Curation
[diagram] stores: Metadata, Full-text, Tasks, Knowledge Bases; actor: Librarian
[diagram] modules: BibEdit, MultiEdit, BatchUploader, BibCheck, BibCirculation, BibDocFile, BibClassify, RefExtract, BibCatalog, BibKnowledge, BibExport, BibMatch, BibMerge
Invenio Modules: Summary
∼40 modules
codebase
  – ∼350,000 lines of Python code
  – ∼15,000 lines of JavaScript code
  – ∼7,000 lines of XSL code
  – ∼8,000 lines of autotools code
∼120 authors and contributors since 2002
∼48 authors and contributors in 2012 (18 new)
many short-term students, importance of informal coding standards
∼10 years of development
started at CERN, first release in 2002
now co-developed world-wide (EU, US)
lego programming... but no silver bullet
Developer Community
[diagram] Invenio: EPFL, CDS, AUTH, UAB, ...
[diagram] years: 2002–, 2004–, 2006+, 2002–
[diagram] Invenio: ADS, INSPIRE, arXiv
[diagram] years: 2008–, 2009–, 2011–
[diagram] Invenio: BlogForever, CRISP, OpenAIRE, M9
[diagram] years: 2011–, 2012–, 2009–, 2012–
330k LOC - Invenio core sources
10k LOC - INSPIRE overlay sources
Why Python?
easy to read and understand (good for many temporary developers)
suitable for rapid prototyping (good for organic-growth software development model)
write code to throw it away
Art of Ikebana
Ikebana, “giving life to flowers”
Japanese art of flower arrangement, “way of flowers”
“disciplined art form in which nature and humanity are brought together”
natural shapes, graceful lines
minimalism
Art of Ikebana Programming
example of anonymous functions
Java?
    new Callable() {
        public Object call(Object x) {
            return x.times(k);
        }
    }
Python!
    lambda x: k * x
Speeding Up Python
bytecode interpreted language: what about speed?
Cython makes it easy to write C extensions
combining efficiency of C with high-levelness of Python
Example: intbitset.pyx
    ctypedef unsigned long long int word_t

    ctypedef struct IntBitSet:
        int size
        int allocated
        word_t trailing_bits
        int tot
        word_t *bitset
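To make the word-array layout of the struct more concrete, here is a minimal pure-Python sketch of a bitset that stores record ids as bits inside 64-bit words. It is not the intbitset implementation: the class name `WordBitSet` and its methods are hypothetical, and the 64-bit word width is an assumption based on `unsigned long long`.

```python
from array import array

WORD_BITS = 64  # assumed width of word_t (unsigned long long)


class WordBitSet:
    """Hypothetical bitset of non-negative ints stored in 64-bit words."""

    def __init__(self, items=()):
        self.words = array("Q")  # unsigned 64-bit words, like word_t *bitset
        for item in items:
            self.add(item)

    def add(self, n):
        index, bit = divmod(n, WORD_BITS)
        # Grow the word array on demand so that bit n is addressable.
        while len(self.words) <= index:
            self.words.append(0)
        self.words[index] |= 1 << bit

    def __contains__(self, n):
        index, bit = divmod(n, WORD_BITS)
        return index < len(self.words) and bool(self.words[index] >> bit & 1)

    def __and__(self, other):
        # Intersection is a word-by-word AND over the common prefix.
        result = WordBitSet()
        result.words = array(
            "Q", (a & b for a, b in zip(self.words, other.words)))
        return result

    def __iter__(self):
        # Yield the stored ints in increasing order.
        for index, word in enumerate(self.words):
            for bit in range(WORD_BITS):
                if word >> bit & 1:
                    yield index * WORD_BITS + bit
```

For example, `list(WordBitSet([1, 5, 64, 200]) & WordBitSet([5, 64, 300]))` walks four words instead of comparing individual ids. A C or Cython version does the same work on machine words, which is where the speed-up would come from.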
Why Git?
good for distributed teams
good for offline development
powerful branching/merging, first class citizenship
commit early, commit often (to private repositories)
rebase and clean when ready (before pushing for public consumption)
using pull-on-demand collaboration model (as opposed to shared-push collaboration model)
Git Collaboration
pull-on-demand collaboration model
inherent code review and QA processes before integration
module maintainers aka “integration lieutenants”
Git Branches
[diagram] commits C1–C7 (master), M1–M5 (maint-1.0), N1–N2 (next); tags v1.0.0, v1.0.1, v1.1.0, v1.0.2
maint-X.Y: release maintenance branches
master: new feature branch
next: things not yet release-ready
Git Development
[diagram] long-lived branches: master (C1–C4), maint-1.0 (M1–M3), next (N1–N4)
[diagram] topic branches: some-bugfix (B1–B2), some-new-feature (F1–F2), some-experimental-feature (E1–E2)
[diagram] merge commits: M3, C3, C4, N3, N4
Continuous integration
Unit testing
test-driven development when appropriate
e.g. before/while developing strip_accents(), write:
Example: search_engine_tests.py
    import unittest

    from invenio import search_engine


    class TestStripAccents(unittest.TestCase):
        """Test for handling of UTF-8 accents."""

        def test_strip_accents(self):
            """search engine - stripping of accented letters"""
            self.assertEqual("memememe",
                             search_engine.strip_accents('mémêmëmè'))
            self.assertEqual("MEMEMEME",
                             search_engine.strip_accents('MÉMÊMËMÈ'))
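Once such a test exists, one possible way to make it pass is sketched below. This is a hypothetical stand-in rather than Invenio's own `strip_accents()`: it assumes Python 3 text strings and relies only on the standard `unicodedata` module, decomposing each accented letter and dropping the combining marks.

```python
import unicodedata


def strip_accents(text):
    """Return text without accents (hypothetical stand-in, not Invenio's code)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep every character except the combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Letters that have no canonical decomposition, such as "ø" or "ß", pass through unchanged with this approach, so a production version would likely need extra mapping rules.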
Functional testing
functional/acceptance/regression testing
testbed site (Atlantis Institute of Fictive Science)
e.g. Python mechanize module to emulate browser
Example: websearch_regression_tests.py
    class WebSearchSearchEnginePythonAPITest(unittest.TestCase):
        "Check typical search engine Python API calls on the demo data."

        def test_search_engine_python_api_for_failed_query(self):
            "websearch - search engine Python API for failed query"
            self.assertEqual([],
                             perform_request_search(p='aoeuidhtns'))

        def test_search_engine_python_api_for_successful_query(self):
            "websearch - search engine Python API for successful query"
            self.assertEqual([8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
                              18, 47],
                             perform_request_search(p='ellis'))
Web testing
sometimes we need to run tests in real browser
  – e.g. pages with heavy JavaScript
using Selenium extension for Firefox
  – record and replay browser actions
  – test for text existence or non-existence on pages
  – test for link labels and targets
Example: websearch_web_tests.py
    class InvenioWebSearchWebTests(InvenioWebTestCase):

        def test_search_ellis(self):
            """websearch - web test search for ellis"""
            self.browser.get(CFG_SITE_URL)
            p = self.browser.find_element_by_name("p")
            p.send_keys("ellis")
            p.submit()
            self.page_source_test(expected_text=[
                'Thermal conductivity of dense quark matter ' + \
                'and cooling of stars'])
Designing A Search Engine
performance-driven design assumptions:
  – high number of selects, low number of updates
  – fast searching, slow indexation
  – cache everything cacheable
search functionality:
  – search for words, phrases, regular expressions
  – search in any field, authors, titles, etc
index design:
  – forward indexes: word1 −→ [rec1, rec2, . . . ]
                     word2 −→ [rec2, rec7, . . . ]
  – reverse indexes: rec1 −→ [word1, word8, . . . ]
                     rec2 −→ [word1, word2, . . . ]
Zipf’s law on word frequency:
  – few words occur very often (e.g. the)
  – most words are infrequent (even e.g. boson)
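The two index shapes can be illustrated with a toy in-memory sketch. The function names `build_indexes` and `search_all_words` and the whitespace tokenisation are assumptions made for illustration only; the slide's naming is kept, with forward indexes mapping words to records and reverse indexes mapping records to words.

```python
from collections import defaultdict


def build_indexes(records):
    """Build toy forward (word -> recids) and reverse (recid -> words) indexes."""
    forward = defaultdict(set)
    reverse = {}
    for recid, text in records.items():
        words = set(text.lower().split())
        reverse[recid] = words
        for word in words:
            forward[word].add(recid)
    return forward, reverse


def search_all_words(forward, query):
    """Return the ids of records that contain every word of the query."""
    hit_sets = [forward.get(word, set()) for word in query.lower().split()]
    return set.intersection(*hit_sets) if hit_sets else set()


# Hypothetical demo records, not taken from any real collection.
records = {
    1: "Higgs boson search",
    2: "the boson mass",
    3: "the search engine",
}
forward, reverse = build_indexes(records)
matches = search_all_words(forward, "boson search")
```

A multi-word query reduces to intersecting the record sets of its words, which is why the speed of set intersection matters so much for search. The reverse index makes it cheap to find which forward entries to update when a single record changes.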
Search Engine Under Cover
Measuring the Performance
three important speed factors to consider:
  – speed of finding sets (DB Server)
  – speed of demarshaling sets (DB↔Web App Server)
  – speed of intersecting sets (Web App Server)
Example: speed of various parts (2002, before optimization)
    action / query:       "CERN 2002"    "of the this"
    -----------------------------------------------
    fetching              0.28 sec       0.34 sec
    demarshaling          0.78 sec       1.10 sec
    adding colls          0.37 sec       0.63 sec
    intersecting          0.64 sec       1.19 sec
    -----------------------------------------------
    total search time     2.07 sec       3.22 sec
Optimizing Data Structures
data structures tested:
  – ‘sorted’ (lists, Patricia trees)
  – ‘unsorted’ (hashed sets, binary vectors)
fast prototyping: (Python, Lisp in 2002)
  – throw-away coding to test ideas
Example: lists vs dicts, 350K sets in 800K universe
    marshaling lists ..... 532616+532571 bytes in 1.33 sec
    demarshaling lists ... 350000+350000 items in 0.10 sec
    merging lists ........ 546965 items in 0.34 sec
    intersecting lists ... 153035 items in 0.35 sec
    marshaling dicts ..... 576491+576450 bytes in 0.87 sec
    demarshaling dicts ... 350000+350000 items in 0.36 sec
    merging dicts ........ 546965 items in 0.09 sec
    intersecting dicts ... 153035 items in 0.15 sec
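A throw-away benchmark in this spirit might look like the sketch below. It is not the 2002 script: the helper names are hypothetical, Python 3 sets stand in for the dicts used as sets at the time, and absolute timings on current hardware will differ from the figures above. The item counts on the slide are at least self-consistent, since 350000 + 350000 − 153035 = 546965.

```python
import marshal
import random
import time


def make_sets(universe=800_000, size=350_000, seed=0):
    """Draw two random sorted lists of record ids from the universe."""
    rng = random.Random(seed)
    return (sorted(rng.sample(range(universe), size)),
            sorted(rng.sample(range(universe), size)))


def timed(label, func):
    """Run func once, print its wall-clock time, and return its result."""
    start = time.perf_counter()
    result = func()
    print(f"{label:<20} {time.perf_counter() - start:.2f} sec")
    return result


def intersect_sorted(a, b):
    """Intersect two sorted lists with a two-pointer merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def compare(list_a, list_b):
    """Time (de)marshaling and intersection for lists versus sets."""
    set_a, set_b = set(list_a), set(list_b)
    blob = timed("marshaling list", lambda: marshal.dumps(list_a))
    timed("demarshaling list", lambda: marshal.loads(blob))
    common_list = timed("intersecting lists",
                        lambda: intersect_sorted(list_a, list_b))
    blob = timed("marshaling set", lambda: marshal.dumps(set_a))
    timed("demarshaling set", lambda: marshal.loads(blob))
    common_set = timed("intersecting sets", lambda: set_a & set_b)
    return common_list, common_set
```

Calling `compare(*make_sets())` runs the full-size experiment; smaller `universe` and `size` values give a quick sanity check that both representations return the same intersection.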
. . . and the winner is:
binary vectors found to be the best compromise!
using Numeric Python module (in 2002)
typical search time gain: 4.0 sec → 0.2 sec (in 2002)
typical indexing time loss: 7 hours → 4 days (in 2002)
mostly sparse data modelled via mostly dense data structure?
free your mind, think critically
further optimisation:
Numeric module not addressing real bits, only bytes
so home-made intbitset C extension (2007)
  – addressing real bits, saving factor of 8 already
  – saving space, saving (indexing) time
use of external information retrieval tools (2011)
  – Solr, Xapian
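The space side of this trade-off can be worked through with some back-of-the-envelope arithmetic. The sketch below reuses the 800K universe from the lists-vs-dicts example; the helper names and the 4 bytes per stored record id are assumptions chosen for illustration, not figures from the talk.

```python
def dense_bytes(universe, bits_per_slot=1):
    """Size of a dense vector with one slot per possible record id."""
    return (universe * bits_per_slot + 7) // 8


def sparse_bytes(hits, bytes_per_id=4):
    """Size of a plain array storing only the matching record ids."""
    return hits * bytes_per_id


UNIVERSE = 800_000  # universe size from the lists-vs-dicts example

# One byte per record (byte-addressed vector) versus one real bit per record.
byte_vector = dense_bytes(UNIVERSE, bits_per_slot=8)
bit_vector = dense_bytes(UNIVERSE, bits_per_slot=1)
saving_factor = byte_vector // bit_vector

# A frequent word (350K hits) versus a rare word (100 hits) as plain id lists.
frequent_word = sparse_bytes(350_000)
rare_word = sparse_bytes(100)

print(byte_vector, bit_vector, saving_factor)
print(frequent_word, rare_word)
```

Under these assumptions a bit vector costs the same 100,000 bytes for every word, which beats an id list for frequent words but is wasteful for the many rare words that Zipf's law predicts. That tension is what the question about sparse data in a dense structure points at.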
Not Invented Here
technology overview:
  – load balancing: HAProxy
  – web application: Apache, WSGI, Python, Flask, Jinja, Cython
  – database: SQLAlchemy, MySQL/PostgreSQL/SQLite, MongoDB
  – indexing: Solr, Xapian
  – caching: Memcached, Redis
  – UI: Twitter Bootstrap, jQuery
  – mobile app: Apache Cordova
  – tools: Git, Trac, Jenkins, Selenium
Example: Invenio “next” branch UI
Example: ZENODO
Splitting Web App Server and DB Server
load of CDS Web and DB servers at the split time:
split leads to efficient use of OS resources by lone, non-competing Web and DB daemon processes
Multi-Node Architecture
800 hits per sec on CDS during Higgs seminar July 4th
[diagram] components: Load Balancer, Apache 1, Apache 2, Workers 1/1–1/3, Workers 2/1–2/3, NODE 2, DB Master, DB Slave, Solr, Redis, shared fs
[diagram] connections: HTTP, WSGI, SQL R/W, SQL R/O, query, key,val, files, replication
Measuring Scalability
using siege and ab to simulate concurrent users and to measure throughput on a sample of typical URLs
Example: inspirehep.net under gentle siege
    $ siege -d 1 -c 20 -t 1m -f inspirehep_urls.txt
    Transactions:              1329 hits
    Availability:              100.00 %
    Elapsed time:              60.23 secs
    Data transferred:          37.12 MB
    Response time:             0.41 secs
    Transaction rate:          22.07 trans/sec
    Throughput:                0.62 MB/sec
    Concurrency:               8.96
    Successful transactions:   1329
    Failed transactions:       0
    Longest transaction:       3.05
    Shortest transaction:      0.01
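The derived figures in this report can be cross-checked from the raw ones. The sketch below is only arithmetic on the numbers printed above; treating concurrency as transaction rate times mean response time is an assumption based on Little's law, not a statement about how siege computes it internally.

```python
# Raw figures copied from the siege report.
transactions = 1329     # hits
elapsed = 60.23         # seconds
data_mb = 37.12         # MB transferred
response_time = 0.41    # seconds per transaction, as rounded in the report

transaction_rate = transactions / elapsed    # trans/sec
throughput = data_mb / elapsed               # MB/sec
# Little's law: mean requests in flight = arrival rate * mean response time.
concurrency = transaction_rate * response_time

print(f"transaction rate: {transaction_rate:.2f} trans/sec")
print(f"throughput:       {throughput:.2f} MB/sec")
print(f"concurrency:      {concurrency:.2f}")
```

The first two values agree with the report to two decimals. The concurrency estimate comes out slightly above the reported 8.96, which is consistent with the response time having been rounded up to 0.41 seconds. With 20 simulated users and a random delay of up to one second between requests, fewer than half of them are waiting on the server at any moment.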
Measuring Scalability: “ab” on top of “siege”
Conclusions
selected lessons from building a digital library system
  – 350,000+ LOC from 110+ authors over 10+ years
selected technology:
  – load balancing: HAProxy
  – web application: Apache, WSGI, Python, Flask, Jinja
  – database: SQLAlchemy, MySQL/PostgreSQL/SQLite, MongoDB
  – caching: Memcached, Redis
  – UI: Twitter Bootstrap, jQuery
  – project tools: Git, Trac, Jenkins, Selenium
moral from selected anecdotes?
  – value of rapid prototyping
  – value of organic-growth software development model
  – value of coding aesthetics and minimalism
  – “Never Lose A Holy Curiosity” (A. Einstein)