molly-amy-harvey
Medici 2.0
The story so far…
Timeline
• Work started October 2009
  • 3 years ago!
• 1.0 in June 2010
  • Took 9 months
• Since 1.0 no major changes
  • 2 years 3 months ago
Initially
• “Manage large collections of multimedia research artifacts”
• Focus on image data
• Had to use Tupelo to store everything
• Was split between web and desktop
Roadblocks
• Tupelo’s performance and scalability (or lack thereof)
• Difficult extensibility (previews & extractors)
• High technical debt
Rewriting vs. Refactoring
• Rewriting bad!
• But Medici’s technical debt is huge!
• Medici also focused on things that ended up not being as important as other features
  • Semantic web > previews/extractors
  • Users don’t understand semantic web…
  • They do understand all these previewers and data types!
Research and Development
• Research vs. Development
• What is the right amount of each?
• Medici should be an enabler of new research
• But we can’t just give researchers cloud storage
• Let’s keep in mind this tug of war
System Design
Original areas of focus
• Ingestion
• Retrieval
• Processing
• Scalability
• Social Annotation
• Attribution
• User Management
Short term projects
• Groupscope (video analytics, social science)
• MSC (clinical data, images, genomics)
• NARA (archival, integration)
• Sead (archival, curation, GIS)
• Seagrant (sensor data, GIS)
• XSEDE (HPC video analysis and retrieval, digital humanities)
Short term requirements
• Archival
• Clinical Data
• GIS
• HPC
• Video

• Archivists
• Biomedical / Bioinformatics
• GIS
• HPC
• Digital Humanities
Too many use cases?
Let’s take a step back. What’s at the core?
Research Data
Organize, Search, Analyze
New areas of focus
• Ingestion
• Retrieval
• Processing
• Scalability
• Social Annotation
• Attribution
• User Management
Yes, they are the same
So what’s new?
• Are we just fixing bugs?
• Are we just removing Tupelo?
• No, this is our chance to make it better by
  • Focusing
Fix what is broken
Improve the design
Involve the community
Fix what is broken
• Tupelo doesn’t scale
• No good free, open source RDF stores available
• A relational database solution would be fine for a small lab, but is that the max we want to target?
• Plenty of NoSQL options if scalability is a priority
• Sidenote: plenty of sites scale by manually sharding their data across RDBMS… but it is not an easy task
Fix what is broken
• Full-fledged Web API is missing
• Many projects want to use Medici as a resource but not as the frontend app
• One tool cannot fit all
Improve the design
• Improve user experience by doing fewer things but doing them really well
• Provide clear extension points for the community to extend the system
  • Adding a previewer should not require recompiling the app
• Top notch UI (it requires time and effort)
• Combine collections and datasets into one
  • A dataset is now a collection of files
  • Relationships between files in a dataset are used to organize them better (a folder hierarchy could be one example of this)
• Drop the desktop application
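One way to read the "adding a previewer should not require recompiling the app" extension point is a runtime registry keyed by file type. The following is only a minimal sketch under that assumption; `PREVIEWERS`, `register_previewer`, and `preview` are hypothetical names, not part of Medici.

```python
from typing import Callable, Dict

# Hypothetical registry: MIME type -> previewer function.
PREVIEWERS: Dict[str, Callable[[bytes], str]] = {}


def register_previewer(mime_type: str):
    """Decorator that registers a previewer for one MIME type at runtime."""
    def decorator(func: Callable[[bytes], str]) -> Callable[[bytes], str]:
        PREVIEWERS[mime_type] = func
        return func
    return decorator


def preview(mime_type: str, data: bytes) -> str:
    """Dispatch to the registered previewer, or fall back to a generic message."""
    previewer = PREVIEWERS.get(mime_type)
    if previewer is None:
        return f"No preview available ({len(data)} bytes)"
    return previewer(data)


# A community-contributed previewer only needs to register itself.
@register_previewer("text/plain")
def preview_text(data: bytes) -> str:
    # Show the first 80 characters of the decoded text.
    return data.decode("utf-8", errors="replace")[:80]
```

With this shape, the core application never references a concrete previewer; new ones could be loaded from separate modules at startup.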
Improve the design
• Add “Projects” to moderate access and provide groups
• Provide “activity streams” to keep the researcher in the loop
• Improve retrieval
  • Signals
  • Multimedia
• Enable on demand processing
• Support more complex visualizations
Involve the community
• Publications are not the primary goal, users are
  • Hopefully it will be used for finding new results and thus lead to publishing
• Establish an advisory board
  • High visibility people in communities of interest
  • Help drive design
• Make the code and project as visible as possible
  • Being open source is not enough
  • GitHub would be ideal
  • Opensource is a possibility but might not be enough
• Provide a free public instance
What about Semantic Web?
• Don’t know of an RDF store that scales. Do you?
  • Easy to use and mature?
• But Semantic Web is not just RDF
• Linked data does not require SPARQL
  • RESTful services that return RDF
• Can our Web API return RDF? Yes.
  • Mappings between internal representation and external services created by admin
  • Appropriate defaults ship with system
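The idea of an admin-defined mapping between the internal representation and RDF can be sketched as a lookup table from internal field names to predicate URIs. This is an illustration only: the record layout, `DEFAULT_MAPPING`, `to_ntriples`, and the `example.org` subject URI are assumptions, and the Dublin Core predicates merely stand in for whatever defaults would ship with the system.

```python
from typing import Dict

# Assumed default mapping: internal field name -> RDF predicate URI.
DEFAULT_MAPPING: Dict[str, str] = {
    "title": "http://purl.org/dc/terms/title",
    "creator": "http://purl.org/dc/terms/creator",
}


def escape_literal(value: str) -> str:
    """Escape backslashes, double quotes, and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_ntriples(subject_uri: str, record: Dict[str, object],
                mapping: Dict[str, str] = DEFAULT_MAPPING) -> str:
    """Serialize the mapped fields of one internal record as N-Triples lines."""
    lines = []
    for field_name, predicate in mapping.items():
        if field_name in record:
            literal = escape_literal(str(record[field_name]))
            lines.append(f'<{subject_uri}> <{predicate}> "{literal}" .')
    return "\n".join(lines)


# Fields without a mapping (here "size") are simply not exported.
triples = to_ntriples("http://example.org/files/1",
                      {"title": "Scan", "size": 3})
```

Because the mapping is data rather than code, an admin could swap vocabularies without touching the internal storage format.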
5 guiding principles
1. Simplify the code base to make it easier to extend the system and adapt to new use cases
2. Improve user experience and user retention
3. Scale horizontally at all levels
4. Focus on what we do well and let others take care of what they do well. Focus on areas in which we can play a strong role, specifically research and research data
5. Develop the service API to encourage writing of clients by the community and make Medici both an app and a service
Guiding Principles
• Simplify
• Improve user experience
• Scale
• Focus on what we do well
• Both an application and a service
Architecture
Current Architecture
[Diagram labels: App Server, DB, File System, Extractors, HTTP/HTML, User sessions, RDF, Blobs, Internal Queue]
Typical LAMP Setup
[Diagram labels: App Server, DB, File System, HTTP/HTML, Event Bus, Long Running Jobs, Cache]
[Diagram labels: Front End, Proxy, App Server (three instances), MongoDB Shard (two instances), DB, Services, HTTP/HTML]
Ingestion
• Right now everything needs to physically be moved into Medici
• But communities have existing large collections outside of Medici
  • Large disk farms
  • Web pages
  • Sensor feeds
• Add the ability to index existing data
  • Data doesn’t get pulled in, only a link to retrieve it
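Indexing existing data without pulling it in could look roughly like the sketch below, where the index stores only a link plus metadata and never the bytes. `IndexedFile`, `LinkIndex`, and the sample URIs are hypothetical and exist only to illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class IndexedFile:
    """An index entry: where the data lives, plus searchable metadata."""
    uri: str
    metadata: Dict[str, str] = field(default_factory=dict)


class LinkIndex:
    """Keeps links to external data; the content itself is never copied."""

    def __init__(self) -> None:
        self._entries: Dict[str, IndexedFile] = {}

    def add(self, uri: str, **metadata: str) -> IndexedFile:
        entry = IndexedFile(uri=uri, metadata=dict(metadata))
        self._entries[uri] = entry
        return entry

    def search(self, term: str) -> List[str]:
        """Return URIs whose metadata values contain the term (case-insensitive)."""
        needle = term.lower()
        return [
            entry.uri
            for entry in self._entries.values()
            if any(needle in value.lower() for value in entry.metadata.values())
        ]


index = LinkIndex()
index.add("file:///archive/scan-001.tif", title="Manuscript scan", source="disk farm")
index.add("http://example.org/feeds/buoy-7", title="Buoy temperature", source="sensor feed")
hits = index.search("sensor")
```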
Scaling Extraction
[Diagram labels: File Upload; Event Bus; Extractor (Java), Extractor (Python), Extractor (.NET), each labeled AMQP; Storage]
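The extraction flow on this slide (an upload event published to a bus, independent extractors consuming it) can be illustrated with an in-process stand-in. This sketch uses only Python's standard `queue` module instead of a real AMQP broker; `EventBus`, `drain`, and the `file.uploaded` topic name are assumptions made for the example.

```python
import queue
from collections import defaultdict
from typing import Any, Callable, Dict, List


class EventBus:
    """In-process stand-in for a message broker: one queue per subscriber."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[queue.Queue]] = defaultdict(list)

    def subscribe(self, topic: str) -> queue.Queue:
        q: queue.Queue = queue.Queue()
        self._subscribers[topic].append(q)
        return q

    def publish(self, topic: str, message: Dict[str, Any]) -> None:
        # Fan out: every subscriber gets its own copy of the event.
        for q in self._subscribers[topic]:
            q.put(message)


def drain(q: queue.Queue, extractor: Callable[[Dict[str, Any]], Any]) -> List[Any]:
    """Run one extractor over every message currently waiting in its queue."""
    results = []
    while not q.empty():
        results.append(extractor(q.get()))
    return results


bus = EventBus()
words_queue = bus.subscribe("file.uploaded")
size_queue = bus.subscribe("file.uploaded")

bus.publish("file.uploaded", {"id": "f1", "content": b"hello world"})

# Two independent extractors process the same upload event.
word_counts = drain(words_queue, lambda m: (m["id"], len(m["content"].split())))
sizes = drain(size_queue, lambda m: (m["id"], len(m["content"])))
```

With a real broker, each extractor would run in its own process and language, and the upload handler would not need to know which extractors exist.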
Retrieval
• Lucene has served us well
  • Index needs to be on one machine
  • Low level semantics
• Solr and Elasticsearch provide higher level semantics
• Multimedia retrieval will have to be ad hoc
  • Implemented as an external service
On demand processing
• Tried it in the desktop using Cyberintegrator tools
• Do something similar with web
  • External services providing the implementation register with Medici instance
  • Medici launches individual processing by submitting to external service
  • This could be managed over an event bus
• Contract should be very simple
  • A tool can be anything that takes in a set of files and text parameters and outputs files
  • Cyberintegrator workflow, software service, etc.
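The tool contract described above (a set of files and text parameters in, files out) might be expressed as a single-method interface. The sketch below is hypothetical: the `Tool` protocol, its `run` signature, and `LineCountTool` are illustrative names, not an existing Medici or Cyberintegrator API.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Protocol


class Tool(Protocol):
    """Assumed contract: input files + text parameters -> output files."""

    def run(self, inputs: List[Path], params: Dict[str, str],
            out_dir: Path) -> List[Path]:
        ...


class LineCountTool:
    """Example tool: writes one output file per input with its line count."""

    def run(self, inputs: List[Path], params: Dict[str, str],
            out_dir: Path) -> List[Path]:
        suffix = params.get("suffix", ".count")
        outputs = []
        for path in inputs:
            out_path = out_dir / (path.name + suffix)
            out_path.write_text(str(len(path.read_text().splitlines())))
            outputs.append(out_path)
        return outputs


# Usage: run the tool on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    work_dir = Path(tmp)
    sample = work_dir / "notes.txt"
    sample.write_text("a\nb\n")
    tool: Tool = LineCountTool()
    outputs = tool.run([sample], {"suffix": ".count"}, work_dir)
    output_name = outputs[0].name
    result = outputs[0].read_text()
```

Keeping the contract this small means a workflow engine, a script wrapper, or a remote service could all sit behind the same interface.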
On demand processing
• A tool can include a link to the external service for more advanced interaction
  • For example, edit a workflow
• If we do single sign on right, the user will not have to log in again
Technology
How should we pick technology?
• Developer productivity over features
• Prior knowledge by developer
• Smaller feature set better than over-engineered
• Try not to reinvent the wheel
Layers
• Client (browser)
• Application Server (server side app)
• Database (structured data, blobs, sessions)
• Index (information retrieval)
• Eventbus (pubsub queues)
• Services (domain specific)
Keep them separate -> Easily replaceable
Client (browser)
• GWT only client side (all JavaScript)
• Search engines have a hard time with this
• Move to a mix between server side rendering and client side rendering
• Or pick a solution that can do both
  • Only know of Google Closure
• Responsive layout
  • Works on screens of any size
Client (browser)
• jQuery (ajax, selectors)
• underscore.js (functional programming)
• backbone.js (mvc)
  • Popular alternatives: ember.js, angular.js
• Modernizr (checks features supported in browser)
• mustache templates (hogan.js)
• CoffeeScript (better JavaScript?)
• less (dynamic stylesheets)
Application Server
• Potential solutions
  • Ruby + (Rails, Sinatra)
  • Python + (Django, Flask)
  • JavaScript + Node.js + express + npm
  • Scala + (Play, Scalatra)
• No matter the solution, session state should not be kept in the app server
• Node.js > Scala > Python > Ruby
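Keeping session state out of the app server means any server instance can handle any request. A minimal sketch of that idea follows; `SessionStore` is an in-memory stand-in for an external database or cache, and all class and method names are hypothetical.

```python
import secrets
from typing import Dict, Optional


class SessionStore:
    """Stand-in for an external store (database or cache) shared by all servers."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Dict[str, str]] = {}

    def create(self, user: str) -> str:
        token = secrets.token_hex(16)
        self._sessions[token] = {"user": user}
        return token

    def get(self, token: str) -> Optional[Dict[str, str]]:
        return self._sessions.get(token)


class AppServer:
    """Holds no per-user state; everything is looked up in the shared store."""

    def __init__(self, store: SessionStore) -> None:
        self.store = store

    def login(self, user: str) -> str:
        return self.store.create(user)

    def whoami(self, token: str) -> Optional[str]:
        session = self.store.get(token)
        return session["user"] if session else None


# A session created on one server is valid on another.
store = SessionStore()
server_a, server_b = AppServer(store), AppServer(store)
token = server_a.login("alice")
user_seen_by_b = server_b.whoami(token)
```

This is what allows app servers to be added or removed behind a proxy without logging users out.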
Node.js
• Good
  • Write JavaScript like in the client
  • High performance (event driven, non-blocking io)
  • socket.io (web sockets)
• Bad
  • Nested async calls
    • There are libraries that get around this
  • Proven but new and in flux
Database
• Files in distributed file system
  • Hadoop
  • GPFS
  • MongoDB GridFS
• RDBMS + Gizzard
• NoSQL
  • Document based (json)
    • MongoDB
    • CouchDB (Couchbase)
    • Developer friendly
  • Big table like
    • Cassandra
    • HBase
Replication = High Availability
• Disaster recovery
• Machine fails, slave is available
• Increase read performance
• Everything still needs to fit on one machine
• RDBMS built with this scenario in mind
• You buy specialized big machines to grow
Sharding = Scaling
• Split data across nodes
• When data does not fit on one machine
• NoSQL is good at this (but no joins)
• Binary data can grow very large
• You need a lot of “metadata” to fill a modern disk
MongoDB Sharding
Gizzard
• You provide the hash function
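"You provide the hash function" means the application decides which shard a key lands on. The sketch below shows the general technique only, not Gizzard's actual API; `shard_for` and the sample keys are hypothetical. It uses `hashlib` rather than Python's built-in `hash()`, because string hashes from `hash()` are randomized per process and would route the same key differently after a restart.

```python
import hashlib
from collections import Counter


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a hash that is stable across processes."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce modulo N.
    return int.from_bytes(digest[:8], "big") % num_shards


# Route some hypothetical dataset ids across 4 shards.
keys = [f"dataset-{i}" for i in range(1000)]
placement = Counter(shard_for(k, 4) for k in keys)
```

Simple modulo placement like this moves most keys when the number of shards changes, which is one reason manual sharding is "not an easy task".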
Eventbus
• RabbitMQ
  • Stateful, AMQP, official clients in many languages
  • Lower level
  • Exceptional for many sources to few slow sinks
  • How Reddit writes to RDBMS
• Gearman
  • “job” oriented
  • Higher level
Integration with other systems
• Cyberintegrator
  • Tool provider of data-driven workflows
• Software Services
  • Tool provider of scripted software
• Seagrant Geostreaming
  • Extractors, visualizations, user management
• MSC Clinical Data
  • Extractors, visualizations, user management
• Versus Multimedia indexing
  • Tool provider for comparing files
Potential Architecture
[Diagram labels: Nginx, varnish; Node.js (three instances); MongoDB; Hadoop Filesystem; Elasticsearch; Event Bus (RabbitMQ); Extractor (Java); Extractor (Python); Internal API (Scala); Services; HTTP/HTML]
Software Engineering
Source
• Few people with write permission to main branch
• Most developers can develop on their own branch and request a pull
• Help keep the quality of the code high by
  • Having more than one person look at the code
  • Puts responsibility on reviewer to actually review
  • Written to be read by others
Releases
• One internal release a month
• Doesn’t matter how much new stuff is in it
• It needs to work
• We did this at the beginning
Continuous Builds
• Of course!
• But also continuous testing
  • Stress testing
  • Unit testing
  • Automatic web scripting
• Most of this is high cost up front
• Pays for itself in the long term
• Oh, and of course manual testing
Testing
• Automatic web testing
  • PhantomJS – headless WebKit browser
  • SeleniumHQ – automatic tests in the browser
• Manual
  • Use it internally
  • Students
• Unit testing
  • Jasmine (JavaScript)
Prioritize
• We can’t do it all… right away
• Focus on
  • User experience
  • Scalability
  • Extensibility
• Let specific projects drive the development of specific plugins (previewers, extractors, tools)
Timeline
Monthly Releases
1. Setup stack, upload/index files, extractions
2. Dataset creation, file relationships, previewers
3. User management, look and feel
4. Processing, services
5. Social annotation, projects
Is it still Medici?
• Should it be called something different?
• Is there really a Medici brand? Is it a strong brand?
• There is also a data intensive project called Medici at PNNL