
Medici 2.0

The story so far…

Timeline
• Work started October 2009 (3 years ago!)
• 1.0 in June 2010 (took 9 months)
• Since 1.0, no major changes (2 years 3 months ago)

Initially
• “Manage large collections of multimedia research artifacts”
• Focus on image data
• Had to use Tupelo to store everything
• Was split between web and desktop

Roadblocks
• Tupelo’s performance and scalability (or lack thereof)
• Difficult extensibility (previews & extractors)
• High technical debt

Rewriting vs. Refactoring
• Rewriting bad!
• But Medici’s technical debt is huge!
• Medici also focused on things that ended up not being as important as other features
  • Semantic web > previews/extractors
  • Users don’t understand the semantic web…
  • They do understand all these previewers and data types!

Research and Development
• Research vs. development: what is the right amount of each?
• Medici should be an enabler of new research
• But we can’t just give researchers cloud storage
• Let’s keep this tug of war in mind

System Design

Original areas of focus
• Ingestion
• Retrieval
• Processing
• Scalability
• Social Annotation
• Attribution
• User Management

Short term projects
• Groupscope (video analytics, social science)
• MSC (clinical data, images, genomics)
• NARA (archival, integration)
• SEAD (archival, curation, GIS)
• Seagrant (sensor data, GIS)
• XSEDE (HPC video analysis and retrieval, digital humanities)

Short term requirements
• Data: Archival • Clinical Data • GIS • HPC • Video
• Communities: Archivists • Biomedical/Bioinformatics • GIS • HPC • Digital Humanities

Too many use cases?

Let’s take a step back. What’s at the core?

Research Data: Organize, Search, Analyze

New areas of focus
• Ingestion
• Retrieval
• Processing
• Scalability
• Social Annotation
• Attribution
• User Management

Yes, they are the same

So what’s new?
• Are we just fixing bugs? Are we just removing Tupelo?
• No, this is our chance to make it better by focusing:
  • Fix what is broken
  • Improve the design
  • Involve the community

Fix what is broken
• Tupelo doesn’t scale
• No good, free, open source RDF stores available
• A relational database solution would be fine for a small lab, but is that the max we want to target?
• Plenty of NoSQL options if scalability is a priority
• Sidenote: plenty of sites scale by manually sharding their data across RDBMSs… but that is not an easy task

Fix what is broken
• A full-fledged Web API is missing
• Many projects want to use Medici as a resource, but not as the frontend app
• One tool cannot fit all

Improve the design
• Improve user experience by doing fewer things, but doing them really well
• Provide clear extension points for the community to extend the system
  • Adding a previewer should not require recompiling the app
• Top notch UI (it requires time and effort)
• Combine collections and datasets into one
  • A dataset is now a collection of files
  • Relationships between files in a dataset used to better organize it (a folder hierarchy could be one example of this)
• Drop the desktop application
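One way to get previewers without recompiling is a runtime registry that plugins call at startup. A minimal sketch in Python; the registry, `register_previewer`, and the content-type keys are illustrative, not Medici’s actual API:

```python
# Minimal previewer registry: a plugin registers a handler for a MIME
# type when it is loaded (e.g. discovered from a plugins/ directory),
# so adding a previewer means dropping in a module, not rebuilding.

_previewers = {}

def register_previewer(mime_type, handler):
    """Associate a preview handler with a MIME type."""
    _previewers[mime_type] = handler

def preview(mime_type, data):
    """Dispatch to the registered previewer, or fall back to a default."""
    handler = _previewers.get(mime_type, lambda d: "no preview available")
    return handler(data)

# A plugin module would only need to call register_previewer:
register_previewer("text/plain", lambda data: data[:100])
register_previewer("image/png", lambda data: "<img ...>")  # hypothetical markup
```

The app core never changes; it only looks types up in the registry.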

Improve the design
• Add “Projects” to moderate access and provide groups
• Provide “activity streams” to keep the researcher in the loop
• Improve retrieval
  • Signals
  • Multimedia
• Enable on demand processing
• Support more complex visualizations

Involve the community
• Publications are not the primary goal; users are
  • Hopefully it will be used for finding new results and thus lead to publishing
• Establish an advisory board
  • High visibility people in communities of interest
  • Help drive design
• Make the code and project as visible as possible
  • Being open source is not enough
  • GitHub would be ideal; Opensource is a possibility but might not be enough
• Provide a free public instance

What about Semantic Web?
• Don’t know of an RDF store that scales. Do you? Easy to use and mature?
• But Semantic Web is not just RDF
  • Linked data does not require SPARQL
  • RESTful services that return RDF
• Can our Web API return RDF? Yes.
  • Mappings between the internal representation and external services created by the admin
  • Appropriate defaults ship with the system
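The admin-editable mapping from internal fields to RDF predicates could be as simple as a dictionary. A sketch (the field names are illustrative; the predicates are standard Dublin Core terms):

```python
# Sketch: serialize an internal dataset record as N-Triples using an
# admin-editable field -> predicate mapping. Defaults like these could
# ship with the system; admins override them per deployment.

DEFAULT_MAPPING = {
    "title": "http://purl.org/dc/terms/title",
    "creator": "http://purl.org/dc/terms/creator",
}

def to_ntriples(subject_uri, record, mapping=DEFAULT_MAPPING):
    triples = []
    for field, value in record.items():
        predicate = mapping.get(field)
        if predicate:  # unmapped internal fields are simply not exported
            triples.append('<%s> <%s> "%s" .' % (subject_uri, predicate, value))
    return "\n".join(triples)
```

A RESTful endpoint would run the stored record through this mapping when the client asks for an RDF content type.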

5 guiding principles

1. Simplify the code base to make it easier to extend the system and adapt to new use cases
2. Improve user experience and user retention
3. Scale horizontally at all levels
4. Focus on what we do well and let others take care of what they do well. Focus on areas in which we can play a strong role, specifically research and research data
5. Develop the service API to encourage the community to write clients, and make Medici both an app and a service

Guiding Principles
• Simplify
• Improve user experience
• Scale
• Focus on what we do well
• Both an application and a service

Architecture

Current Architecture

[diagram: browser ↔ HTTP/HTML ↔ app server ↔ DB holding RDF, blobs, and user sessions; file system; extractors fed by an internal queue]

Typical LAMP Setup

[diagram: browser ↔ HTTP/HTML ↔ app server ↔ DB and file system, with a cache and an event bus driving long running jobs behind the front end; scaled out: a proxy in front of multiple app servers and a sharded MongoDB, serving HTTP/HTML to both browsers and services]

Ingestion
• Right now everything needs to be physically moved into Medici
• But communities have existing large collections outside of Medici
  • Large disk farms
  • Web pages
  • Sensor feeds
• Add the ability to index existing data
  • Data doesn’t get pulled in; only a link to retrieve it is stored
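Indexing-in-place can be sketched as storing a reference instead of the bytes; retrieval resolves the link lazily. All names here (`FileEntry`, `retrieve`) are illustrative:

```python
# Sketch: a file entry holds either bytes kept by Medici or a link to
# data that stays where it lives (disk farm, web page, sensor feed).
# Retrieval resolves the link lazily instead of copying at ingest time.

class FileEntry:
    def __init__(self, name, data=None, url=None):
        self.name = name
        self.data = data  # bytes held by Medici, or None
        self.url = url    # external location: indexed, not copied

    def retrieve(self, fetch):
        """fetch is any callable url -> bytes (e.g. an HTTP GET)."""
        if self.data is not None:
            return self.data
        return fetch(self.url)

# Stored upload vs. indexed external collection:
stored = FileEntry("a.txt", data=b"local bytes")
linked = FileEntry("b.txt", url="http://sensors.example.org/feed/b")
```

Extractors and previewers can treat both kinds of entry identically, since both answer `retrieve`.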

Scaling Extraction

[diagram: file upload → event bus → extractors (Java, Python, .NET) over AMQP → storage]
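With an AMQP topic exchange, each extractor binds a queue to the content types it handles, so an upload event reaches only the relevant extractors. AMQP’s routing-key matching (`*` = exactly one word, `#` = zero or more) can be sketched as; the binding keys shown are illustrative:

```python
# Sketch of AMQP topic-exchange matching, the mechanism that lets a
# file-upload event reach only the extractors bound to that type
# (e.g. an image extractor binds "*.file.image.#").

def topic_match(binding, routing_key):
    """AMQP-style match: '*' = exactly one word, '#' = zero or more."""
    def match(pat, key):
        if not pat:
            return not key
        if pat[0] == "#":  # '#' may absorb any suffix of the key
            return any(match(pat[1:], key[i:]) for i in range(len(key) + 1))
        if key and (pat[0] == "*" or pat[0] == key[0]):
            return match(pat[1:], key[1:])
        return False
    return match(binding.split("."), routing_key.split("."))
```

In a real deployment, RabbitMQ does this matching; extractors just declare their bindings and consume.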

Retrieval
• Lucene has served us well
  • Index needs to be on one machine
  • Low level semantics
• Solr and Elasticsearch provide higher level semantics
• Multimedia retrieval will have to be ad hoc
  • Implemented as an external service

On demand processing
• Tried it in the desktop app using Cyberintegrator tools; do something similar on the web
• External services providing the implementation register with a Medici instance
• Medici launches individual processing runs by submitting to the external service
  • This could be managed over an event bus
• The contract should be very simple
  • A tool can be anything that takes in a set of files and text parameters and outputs files
  • Cyberintegrator workflow, software service, etc.
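The proposed contract (files and text parameters in, files out) is small enough to write down. A sketch with illustrative names; `uppercase_tool` is just a toy tool obeying the contract:

```python
# Sketch of the on-demand processing contract: a "tool" is any callable
# taking a set of files and text parameters and returning files. A
# Cyberintegrator workflow or a software service only has to honor this.

def run_tool(tool, files, params):
    """files: dict name -> bytes; params: dict name -> str."""
    outputs = tool(files, params)
    assert isinstance(outputs, dict), "a tool must return output files"
    return outputs

# A trivial tool obeying the contract: uppercase every text file.
def uppercase_tool(files, params):
    return {name: data.upper() for name, data in files.items()}
```

Because the contract is just "files + params in, files out", registering a new external service does not require any change to Medici itself.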

On demand processing
• A tool can include a link to the external service for more advanced interaction
  • For example, editing a workflow
• If we do single sign-on right, the user will not have to log in again

Technology

How should we pick technology?
• Developer productivity over features
• Prior knowledge by the developers
• A smaller feature set is better than over-engineered
• Try not to reinvent the wheel

Layers
• Client (browser)
• Application server (server side app)
• Database (structured data, blobs, sessions)
• Index (information retrieval)
• Eventbus (pub/sub queues)
• Services (domain specific)

Keep them separate -> easily replaceable
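Keeping a layer replaceable mostly means coding against a narrow interface. A sketch; the `BlobStore` interface and the backends are illustrative, not an existing API:

```python
# Sketch: the app talks to an abstract store, so swapping MongoDB GridFS
# for HDFS (or an in-memory fake in tests) touches one class, not the app.

class BlobStore:
    def put(self, key, data): raise NotImplementedError
    def get(self, key): raise NotImplementedError

class MemoryStore(BlobStore):
    """Trivial backend; a GridFS or HDFS backend would expose the same two methods."""
    def __init__(self):
        self._blobs = {}
    def put(self, key, data):
        self._blobs[key] = data
    def get(self, key):
        return self._blobs[key]

def save_upload(store, key, data):
    # Application code depends only on the BlobStore interface.
    store.put(key, data)
    return store.get(key)
```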

Client (browser)
• GWT is client side only (all JavaScript)
  • Search engines have a hard time with this
• Move to a mix between server side rendering and client side rendering
  • Or pick a solution that can do both; only know of Google Closure
• Responsive layout
  • Works on screens of any size

Client (browser)
• jQuery (AJAX, selectors)
• Underscore.js (functional programming)
• Backbone.js (MVC)
  • Popular alternatives: Ember.js, Angular.js
• Modernizr (checks features supported in the browser)
• Mustache templates (Hogan.js)
• CoffeeScript (better JavaScript?)
• LESS (dynamic stylesheets)

Application Server
• Potential solutions
  • Ruby + (Rails, Sinatra)
  • Python + (Django, Flask)
  • JavaScript + Node.js + Express + npm
  • Scala + (Play, Scalatra)
• No matter the solution, session state should not be kept in the app server
• Node.js > Scala > Python > Ruby
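Keeping session state out of the app server can be done with signed tokens that any server behind the proxy can verify. A minimal HMAC sketch; the secret and payload format are illustrative, and a real deployment would also add an expiry:

```python
import hmac, hashlib

# Sketch: sign the session payload so any app server can verify it
# without shared in-process session state (servers stay stateless).
SECRET = b"illustrative-shared-secret"

def make_token(user_id):
    sig = hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()
    return "%s.%s" % (user_id, sig)

def verify_token(token):
    user_id, _, sig = token.partition(".")
    expected = hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()
    return user_id if hmac.compare_digest(sig, expected) else None
```

An alternative with the same effect is an external session store (e.g. a cache layer) shared by all app servers.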

Node.js
• Good
  • Write JavaScript like in the client
  • High performance (event driven, non-blocking I/O)
  • socket.io (WebSockets)
• Bad
  • Nested async calls (there are libraries that get around this)
  • Proven, but new and in flux

Database
• Files in a distributed file system
  • Hadoop, GPFS, MongoDB GridFS
• RDBMS + Gizzard
• NoSQL
  • Document based (JSON): MongoDB, CouchDB (Couchbase); developer friendly
  • Bigtable-like: Cassandra, HBase

Replication = High Availability
• Disaster recovery: a machine fails, a slave is available
• Increases read performance
• Everything still needs to fit on one machine
• RDBMSs were built with this scenario in mind: you buy specialized big machines to grow

Sharding = Scaling
• Split data across nodes when it does not fit on one machine
• NoSQL is good at this (but no joins)
• Binary data can grow very large; you need a lot of “metadata” to fill a modern disk

MongoDB Sharding

Gizzard
• You provide the hash function
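With Gizzard-style sharding, the application supplies the hash function mapping an id to a shard. A minimal sketch (the function name and shard count are illustrative):

```python
import hashlib

# Sketch: deterministic shard selection. The same id always lands on
# the same shard, so any app server routes reads and writes identically.

def shard_for(item_id, num_shards):
    digest = hashlib.md5(item_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Note: simple modulo reshuffles almost every key when num_shards
# changes; consistent hashing avoids that at the cost of complexity.
```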

Eventbus
• RabbitMQ
  • Stateful, AMQP, official clients in many languages
  • Lower level
  • Exceptional for many sources feeding a few slow sinks; how Reddit writes to its RDBMS
• Gearman
  • “Job” oriented
  • Higher level

Integration with other systems
• Cyberintegrator: tool provider of data-driven workflows
• Software Services: tool provider of scripted software
• Seagrant: geostreaming; extractors, visualizations, user management
• MSC: clinical data; extractors, visualizations, user management
• Versus: multimedia indexing; tool provider for comparing files

Potential Architecture

[diagram: Nginx/Varnish in front of multiple Node.js app servers; MongoDB, Hadoop filesystem, and Elasticsearch for storage and indexing; an event bus (RabbitMQ) feeding Java and Python extractors; an internal API in Scala; HTTP/HTML serving browsers and external services]

Software Engineering

Source
• Few people with write permission to the main branch
• Most developers develop on their own branch and request a pull
• Helps keep the quality of the code high
  • More than one person looks at the code
  • Puts responsibility on the reviewer to actually review
  • Code written to be read by others

Releases
• One internal release a month
• Doesn’t matter how much new stuff is in it; it needs to work
• We did this at the beginning

Continuous Builds
• Of course!
• But also continuous testing
  • Stress testing
  • Unit testing
  • Automated web scripting
• Most of this is high cost up front, but pays for itself in the long term
• And of course, manual testing

Testing
• Automated web testing
  • PhantomJS – headless WebKit browser
  • SeleniumHQ – automated tests in the browser
• Manual
  • Use it internally
  • Students
• Unit testing
  • Jasmine (JavaScript)

Prioritize
• We can’t do it all… right away
• Focus on
  • User experience
  • Scalability
  • Extensibility
• Let specific projects drive the development of specific plugins (previewers, extractors, tools)

Timeline: Monthly Releases
1. Set up stack, upload/index files, extractions
2. Dataset creation, file relationships, previewers
3. User management, look and feel
4. Processing, services
5. Social annotation, projects

Is it still Medici?
• Should it be called something different?
• Is there really a Medici brand? Is it a strong brand?
• There is also a data intensive project called Medici at PNNL