Medici 2.0


The story so far…


Timeline
• Work started October 2009
  • 3 years ago!
• 1.0 in June 2010
  • Took 9 months
• Since 1.0 no major changes
  • 2 years 3 months ago


Initially
• “Manage large collections of multimedia research artifacts”
• Focus on image data
• Had to use Tupelo to store everything
• Was split between web and desktop


Roadblocks
• Tupelo’s performance and scalability (or lack thereof)
• Difficult extensibility (previews & extractors)
• High technical debt


Rewriting vs. Refactoring
• Rewriting bad!
• But Medici’s technical debt is huge!
• Medici also focused on things that ended up not being as important as other features
  • Semantic web > previews/extractors
  • Users don’t understand semantic web…
  • They do understand all these previewers and data types!


Research and Development
• Research vs. Development
• What is the right amount of each?
• Medici should be an enabler of new research
• But we can’t just give researchers cloud storage
• Let’s keep in mind this tug of war


System Design


Original areas of focus
• Ingestion
• Retrieval
• Processing
• Scalability
• Social Annotation
• Attribution
• User Management


Short term projects
• Groupscope (video analytics, social science)
• MSC (clinical data, images, genomics)
• NARA (archival, integration)
• Sead (archival, curation, gis)
• Seagrant (sensor data, gis)
• XSEDE (hpc video analysis and retrieval, digital humanities)


Short term requirements
• Archival
• Clinical Data
• GIS
• HPC
• Video

• Archivists
• Biomedical / Bioinformatics
• GIS
• HPC
• Digital Humanities


Too many use cases?

Let’s take a step back. What’s at the core?

Research Data
Organize, Search, Analyze


New areas of focus
• Ingestion
• Retrieval
• Processing
• Scalability
• Social Annotation
• Attribution
• User Management

Yes, they are the same


So what’s new?
• Are we just fixing bugs?
• Are we just removing Tupelo?
• No, this is our chance to make it better by focusing


Fix what is broken

Improve the design

Involve the community


Fix what is broken
• Tupelo doesn’t scale
• No good free open source RDF stores available
• A relational database solution would be fine for a small lab, but is that the max we want to target?
• Plenty of NoSQL options if scalability is a priority
• Sidenote: plenty of sites scale by manually sharding their data across RDBMSs… but that is not an easy task


Fix what is broken
• Full fledged Web API is missing
• Many projects want to use Medici as a resource but not as the frontend app
• One tool cannot fit all


Improve the design
• Improve user experience by doing fewer things but doing them really well
• Provide clear extension points for the community to extend the system
  • Adding a previewer should not require recompiling the app
• Top notch UI (it requires time and effort)
• Combine collections and datasets into one
  • A dataset is now a collection of files
  • Relationships between files in a dataset are used to better organize it (a folder hierarchy could be one example of this)
• Drop the desktop application
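
One way to picture the extension point for previewers: each previewer registers itself against a content type at runtime, so adding one means loading a new module rather than rebuilding the app. The sketch below is only an illustration; `PreviewerRegistry`, `register`, and `preview` are hypothetical names, not part of Medici.

```python
import html
from typing import Callable, Dict

# Hypothetical extension point: a previewer is any callable that turns
# raw file bytes into an HTML fragment. None of these names come from Medici.
Previewer = Callable[[bytes], str]


class PreviewerRegistry:
    def __init__(self) -> None:
        self._by_type: Dict[str, Previewer] = {}

    def register(self, content_type: str) -> Callable[[Previewer], Previewer]:
        """Decorator that plugs a previewer in for one content type."""
        def decorator(fn: Previewer) -> Previewer:
            self._by_type[content_type] = fn
            return fn
        return decorator

    def preview(self, content_type: str, data: bytes) -> str:
        """Dispatch to the registered previewer, or fall back to a notice."""
        previewer = self._by_type.get(content_type)
        if previewer is None:
            return "<p>No preview available</p>"
        return previewer(data)


registry = PreviewerRegistry()


@registry.register("text/plain")
def preview_text(data: bytes) -> str:
    # Show the first 80 characters of a text file, HTML-escaped.
    text = data.decode("utf-8", errors="replace")[:80]
    return "<pre>" + html.escape(text) + "</pre>"
```

A project-specific previewer would live in its own module and call `registry.register(...)` when imported; the core application only ever calls `registry.preview(...)`.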


Improve the design
• Add “Projects” to moderate access and provide groups
• Provide “activity streams” to keep the researcher in the loop
• Improve retrieval
  • Signals
  • Multimedia
• Enable on demand processing
• Support more complex visualizations


Involve the community
• Publications are not the primary goal, users are
  • Hopefully it will be used for finding new results and thus lead to publishing
• Establish an advisory board
  • High visibility people in communities of interest
  • Help drive design
• Make the code and project as visible as possible
  • Being open source is not enough
  • Github would be ideal
  • Opensource is a possibility but might not be enough
• Provide a free public instance


What about Semantic Web?
• Don’t know of an RDF store that scales. Do you?
  • Easy to use and mature?
• But Semantic Web is not just RDF
• Linked data does not require SPARQL
  • RESTful services that return RDF
• Can our Web API return RDF? Yes.
  • Mappings between internal representation and external services created by admin
  • Appropriate defaults ship with system
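
To make the mapping idea concrete, here is a minimal sketch of how an internal record could be rendered as N-Triples through an admin-editable table with shipped defaults. The field names (`title`, `author`), the function names, and the choice of Dublin Core terms as defaults are assumptions made for illustration; the slides do not specify any of them.

```python
from typing import Dict, List

# Assumed defaults: internal field names mapped to Dublin Core terms.
# An admin could override this table; the field names are hypothetical.
DEFAULT_MAPPING: Dict[str, str] = {
    "title": "http://purl.org/dc/terms/title",
    "author": "http://purl.org/dc/terms/creator",
}


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
            .replace("\n", "\\n").replace("\r", "\\r"))


def to_ntriples(subject_uri: str, record: Dict[str, str],
                mapping: Dict[str, str] = DEFAULT_MAPPING) -> List[str]:
    """Render the mapped fields of one internal record as N-Triples lines."""
    lines = []
    for field, value in record.items():
        predicate = mapping.get(field)
        if predicate is None:
            continue  # unmapped internal fields are not exposed
        lines.append(
            f'<{subject_uri}> <{predicate}> "{escape_literal(value)}" .'
        )
    return lines
```

A REST endpoint could return these lines for a dataset URL, which is linked data without any SPARQL endpoint or RDF store behind it.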


5 guiding principles


1. Simplify the code base to make it easier to extend the system and adapt to new use cases


2. Improve user experience and user retention


3. Scale horizontally at all levels


4. Focus on what we do well and let others take care of what they do well. Focus on areas in which we can play a strong role, specifically research and research data


5. Develop the service API to encourage writing of clients by the community and make Medici both an app and a service


Guiding Principles
• Simplify
• Improve user experience
• Scale
• Focus on what we do well
• Both an application and a service


Architecture


Current Architecture
[Diagram. Labels: App Server, DB, File System, Extractors, HTTP, HTML, User sessions, RDF, Blobs, Internal Queue]


Typical LAMP Setup
[Diagram. Labels: App Server, DB, File System, HTTP, HTML, Event Bus, Long Running Jobs, Cache]


[Diagram. Labels: Front End, Proxy, App Server (×3), DB, MongoDB Shard (×2), Services, HTTP, HTML]


Ingestion
• Right now everything needs to physically be moved into Medici
• But communities have existing large collections outside of Medici
  • Large disk farms
  • Web pages
  • Sensor feeds
• Add the ability to index existing data
  • Data doesn’t get pulled in, only a link to retrieve it
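
A rough sketch of what "index, don’t ingest" could look like as a data structure: the catalog stores a link to where the bytes live and only optionally a local copy. `FileEntry`, `Catalog`, and their fields are hypothetical names chosen for this example, not Medici’s schema.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class FileEntry:
    """Index entry for a file; names are illustrative, not Medici's schema."""
    file_id: str
    source_uri: str            # where the bytes actually live
    local_blob: Optional[str]  # set only when the bytes were copied in

    @property
    def is_external(self) -> bool:
        return self.local_blob is None


class Catalog:
    def __init__(self) -> None:
        self._entries: Dict[str, FileEntry] = {}

    def index_external(self, file_id: str, source_uri: str) -> FileEntry:
        """Record a retrieval link without moving the data."""
        entry = FileEntry(file_id, source_uri, None)
        self._entries[file_id] = entry
        return entry

    def locate(self, file_id: str) -> str:
        """Return the blob key if ingested, otherwise the link to follow."""
        entry = self._entries[file_id]
        return entry.local_blob or entry.source_uri
```

Search and metadata extraction can then work from the catalog entry, while retrieval follows `source_uri` back to the disk farm, web page, or sensor feed.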


Scaling Extraction
[Diagram. Labels: File Upload, Event Bus, Extractor (Java), Extractor (Python), Extractor (.NET), AMQP (×3), Storage]
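
The flow in this diagram (an upload publishes an event, independent extractors consume it and write results to storage) can be sketched in a few lines. This is an in-process stand-in only: there is no AMQP broker here, and `EventBus`, the `file.uploaded` topic, and both extractors are hypothetical. In the proposed design the extractors would be separate processes, possibly in different languages, attached to a real broker.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict[str, str]], None]


class EventBus:
    """In-process stand-in for a message broker (no AMQP involved)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: Dict[str, str]) -> int:
        """Deliver the message to every subscriber; return how many got it."""
        handlers = self._subscribers[topic]
        for handler in handlers:
            handler(message)
        return len(handlers)


# Stand-in for the Storage box: extracted metadata keyed by file id.
storage: DefaultDict[str, Dict[str, str]] = defaultdict(dict)


def size_extractor(message: Dict[str, str]) -> None:
    # Hypothetical extractor: records the length of the uploaded content.
    storage[message["file_id"]]["size"] = str(len(message["content"]))


def word_count_extractor(message: Dict[str, str]) -> None:
    # Hypothetical extractor: records a word count for text uploads.
    storage[message["file_id"]]["words"] = str(len(message["content"].split()))


bus = EventBus()
bus.subscribe("file.uploaded", size_extractor)
bus.subscribe("file.uploaded", word_count_extractor)
```

The upload handler only needs to call `bus.publish("file.uploaded", {...})`; it does not know which extractors exist, which is what lets new ones be added without touching the app server.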


Retrieval
• Lucene has served us well
  • Index needs to be on one machine
  • Low level semantics
• Solr and Elastic Search provide higher level semantics
• Multimedia retrieval will have to be ad-hoc
  • Implemented as an external service


On demand processing
• Tried it in the desktop using cyberintegrator tools
• Do something similar with web
  • External services providing the implementation register with Medici instance
  • Medici launches individual processing by submitting to external service
  • This could be managed over an event bus
• Contract should be very simple
  • A tool can be anything that takes in a set of files, text parameters and outputs files
  • cyberintegrator workflow, software service, etc.
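
The "very simple contract" can be written down almost literally: a tool takes a set of files plus text parameters and returns files. The sketch below models that with plain dictionaries; `register_tool`, `run_tool`, and `uppercase_tool` are hypothetical names, and a real tool would be a remote service rather than a local function.

```python
from typing import Callable, Dict

# Hypothetical contract: files are modeled as name -> bytes, parameters as
# name -> text, and a tool returns its output files in the same shape.
Files = Dict[str, bytes]
Params = Dict[str, str]
Tool = Callable[[Files, Params], Files]

tools: Dict[str, Tool] = {}


def register_tool(name: str, tool: Tool) -> None:
    """An external service would call this to announce what it provides."""
    tools[name] = tool


def run_tool(name: str, files: Files, params: Params) -> Files:
    """Launch one processing step by handing the inputs to the named tool."""
    return tools[name](files, params)


def uppercase_tool(files: Files, params: Params) -> Files:
    # Toy tool: upper-cases each input and appends a suffix to its name.
    suffix = params.get("suffix", ".out")
    return {name + suffix: data.upper() for name, data in files.items()}


register_tool("uppercase", uppercase_tool)
```

Because the contract says nothing about how the work is done, a cyberintegrator workflow and a scripted software service can sit behind the same `Tool` signature.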


On demand processing
• A tool can include a link to the external service for more advanced interaction
  • For example edit a workflow
• If we do single sign on right, the user will not have to log in again


Technology


How should we pick technology?
• Developer productivity over features
• Prior knowledge by developer
• Smaller feature set better than over-engineered
• Try not to reinvent the wheel


Layers
• Client (browser)
• Application Server (server side app)
• Database (structured data, blobs, sessions)
• Index (information retrieval)
• Eventbus (pubsub queues)
• Services (domain specific)

Keep them separate -> Easily replaceable


Client (browser)
• GWT only client side (all javascript)
• Search engines have a hard time with this
• Move to a mix between server side rendering and client side rendering
• Or pick a solution that can do both
  • Only know of Google Closure
• Responsive layout
  • Works on screens of any size


Client (browser)
• jquery (ajax, selectors)
• underscore.js (functional programming)
• backbone.js (mvc)
  • Popular alternatives: ember.js, angular.js
• modernizr (checks features supported in browser)
• mustache templates (hogan.js)
• coffeescript (better javascript?)
• less (dynamic stylesheets)


Application Server
• Potential solutions
  • Ruby + (Rails, Sinatra)
  • Python + (Django, Flask)
  • JavaScript + Node.js + express + npm
  • Scala + (Play, Scalatra)
• No matter the solution, session state should not be kept in app server
• Node.js > Scala > Python > Ruby
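
The point about session state is what makes it possible to run several app servers behind a proxy: any instance can serve any request if sessions live in a shared store. A minimal sketch, with a dictionary standing in for that shared database or cache; `SessionStore` and `AppServer` are hypothetical names for illustration.

```python
import uuid
from typing import Dict, Optional


class SessionStore:
    """Stand-in for a shared store (database or cache) outside the app servers."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, str]] = {}

    def create(self, values: Dict[str, str]) -> str:
        session_id = uuid.uuid4().hex
        self._data[session_id] = dict(values)
        return session_id

    def get(self, session_id: str) -> Optional[Dict[str, str]]:
        return self._data.get(session_id)


class AppServer:
    """Holds no per-user state; every request looks the session up."""

    def __init__(self, name: str, sessions: SessionStore) -> None:
        self.name = name
        self.sessions = sessions

    def login(self, user: str) -> str:
        return self.sessions.create({"user": user})

    def whoami(self, session_id: str) -> str:
        session = self.sessions.get(session_id)
        return session["user"] if session else "anonymous"
```

A user who logs in through one instance is recognized by another, so instances can be added or restarted without logging anyone out.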


Node.js
• Good
  • Write javascript like in the client
  • High performance (event driven, non-blocking io)
  • socket.io (web sockets)
• Bad
  • Nested async calls
    • There are libraries that get around this
  • Proven but new and in flux


Database
• Files in distributed file system
  • Hadoop
  • GPFS
  • MongoDB Gridfs
• RDBMS + Gizzard
• NoSQL
  • Document based (json)
    • MongoDB
    • CouchDB (Couchbase)
    • Developer friendly
  • Big table like
    • Cassandra
    • HBase


Replication = High Availability
• Disaster recovery
• Machine fails, slave is available
• Increase read performance
• Everything still needs to fit on one machine
• RDBMS built with this scenario in mind
• You buy specialized big machines to grow


Sharding = Scaling
• Split data across nodes
• When data does not fit on one machine
• NoSQL is good at this (but no joins)
• Binary data can grow very large
• You need a lot of “metadata” to fill a modern disk


MongoDB Sharding


Gizzard
• You provide the hash function
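
To illustrate what "you provide the hash function" means in a manually sharded setup, the sketch below routes a key to a shard by hashing it and looking the result up in a table of ranges. It is not Gizzard’s API; `ShardRouter`, `md5_hash`, and the shard names are hypothetical, and the MD5-based hash is just one stable choice.

```python
import bisect
import hashlib
from typing import Callable, List, Tuple


def md5_hash(key: str) -> int:
    """Application-supplied hash: a stable 32-bit value derived from MD5."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:4], "big")


class ShardRouter:
    """Maps hash ranges to shards; each range starts at its lower bound."""

    def __init__(self, hash_fn: Callable[[str], int],
                 ranges: List[Tuple[int, str]]) -> None:
        self._hash_fn = hash_fn
        self._ranges = sorted(ranges)
        self._bounds = [lower for lower, _ in self._ranges]

    def shard_for(self, key: str) -> str:
        # Find the last range whose lower bound is <= the hashed key.
        index = bisect.bisect_right(self._bounds, self._hash_fn(key)) - 1
        return self._ranges[index][1]


# Two shards splitting the 32-bit hash space in half (assumes hashes >= 0).
router = ShardRouter(md5_hash, [(0, "db-a"), (2**31, "db-b")])
```

The choice of hash function decides how evenly data spreads and which records end up together, which is why it is left to the application.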


Eventbus
• RabbitMQ
  • Stateful, AMQP, official clients in many languages
  • Lower level
  • Exceptional for many sources to few slow sinks
  • How Reddit writes to RDBMS
• Gearman
  • “job” oriented
  • Higher level


Integration with other systems
• Cyberintegrator
  • Tool provider of data-driven workflows
• Software Services
  • Tool provider of scripted software
• Seagrant Geostreaming
  • Extractors, visualizations, user management
• MSC Clinical Data
  • Extractors, visualizations, user management
• Versus Multimedia indexing
  • Tool provider for comparing files


Potential Architecture
[Diagram. Labels: Nginx, varnish; Node.js (×3); MongoDB; Hadoop Filesystem; Event Bus (rabbitMQ); Extractor (Java); Extractor (Python); Internal API (Scala); Elasticsearch; Services; HTTP; HTML]


Software Engineering


Source
• Few people with write permission to main branch
• Most developers can develop on their own branch and request a pull
• Help keep the quality of the code high by
  • Having more than one person look at the code
  • Puts responsibility on reviewer to actually review
  • Written to be read by others


Releases
• One internal release a month
• Doesn’t matter how much new stuff is in it
• It needs to work
• We did this at the beginning


Continuous Builds
• Of course!
• But also continuous testing
  • Stress testing
  • Unit testing
  • Automatic web scripting
• Most of this is high cost up front
• Pays for itself in the long term
• Oh, and of course manual testing


Testing
• Automatic web testing
  • Phantom.js – headless webkit browser
  • SeleniumHQ – automatic tests in the browser
• Manual
  • Use it internally
  • Students
• Unit testing
  • Jasmine (javascript)


Prioritize
• We can’t do it all… right away
• Focus on
  • User experience
  • Scalability
  • Extensibility
• Let specific projects drive the development of specific plugins (previewers, extractors, tools)


Timeline
Monthly Releases
1. Setup stack, upload/index files, extractions
2. Dataset creation, file relationships, previewers
3. User management, look and feel
4. Processing, services
5. Social annotation, projects


Is it still Medici?
• Should it be called something different?
• Is there really a Medici brand? Is it a strong brand?
• There is also a data intensive project called Medici at PNNL