Mapping Brain Connectivity through Large-scale Segmentation and Analysis
Stephen Plaza, Stuart Berg
@janeliaflyem / @janelia-flyem / @stephenplaza
https://www.janelia.org/project-team/fly-em
Outline
• Connectomics Background
• Image Segmentation and Challenges
• Large-scale Segmentation Framework
• Spark Architectural Details
• Results and Discussion
What is a (Structural) Connectome?
• A map of brain connectivity
• A list of neurons (graph nodes) and how they are connected through synapses (graph edges)
(Figure: neurons → graph with connection strengths)
Why a Connectome?
• Better understand how the brain works
• However: anatomy often provides just clues (like a map, often necessary but not sufficient to get somewhere)
Example Problem: How do animals detect motion?
(Figure: Hassenstein & Reichardt (1956) theory: photoreceptors, time delay (Δ), multiplication (X))
Connectome Helps Uncover Answer
(Takemura, FlyEM, et al., Nature '13)
How to Obtain a Connectome?
• Extract animal brain
• Image brain (electron microscopy) to generate images
• Find neurons (cell membrane)
• Find synapses
Problem: Datasets are Very Large
• fly: ~10^5 neurons (~100 TB of image data)
• rodents: ~10^9 neurons
• human: ~10^11 neurons
Our Group: FlyEM
• FlyEM mission: perform cutting-edge connectome reconstruction using Electron Microscopy (EM) in the Drosophila (fruit fly)
(Figure: EM images → cell library, synapses, connectivity graph for cell types such as L1, T4, Mi1, Tm3; the team combines imaging, compute/algorithms, bio expertise, and theorists)
EM Reconstruction Pipeline
Example: Fly Optic Lobe
(Video courtesy of Ting Zhao)
• ~5 years of total human effort
• 315,421 synaptic connections
• ~842 reconstructed cells
• ~27,000 cubic microns (~27 GB of data) << whole fly brain
Bottlenecks in Generating Connectomes
Imaging Challenges
• Years to image something like a mouse brain (even with the latest advances)
• Fly brain is already 100 TB of data
Proofreading the Dataset
• Extensive manual component (depends on segmentation quality)
• Worse than imaging (e.g., 1 week of imaging → 1 year of proofreading)
Goal: Improve Segmentation (better segmentation → less manual proofreading)
Outline
• Connectomics Background
• Image Segmentation and Challenges
• Large-scale Segmentation Framework
• Spark Architectural Details
• Results and Discussion
Segmentation Pipeline
image stack → Boundary Prediction → Watershed (over-segmentation, conservative) → Agglomeration (merge regions) → segmentation
(Figure: voxel probabilities → superpixels → merge regions 1 and 2 → segments (neurons?))
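To make the three stages concrete, here is a minimal, illustrative sketch on a single 2D image using scikit-image stand-ins. The real pipeline uses trained voxel classifiers (e.g., Ilastik) and learned agglomeration (e.g., NeuroProof), so the edge filter, seeding rule, and merge threshold below are placeholder assumptions, not the production method.

# Illustrative only: an edge filter stands in for a trained boundary classifier,
# and a fixed threshold stands in for a learned agglomeration decision.
import numpy as np
from scipy import ndimage as ndi
from skimage import data, filters, segmentation

image = data.coins().astype(float)            # stand-in for one EM section

# 1) Boundary prediction: produce a per-voxel "membrane" probability map.
boundary_prob = filters.sobel(image)
boundary_prob /= boundary_prob.max()

# 2) Watershed: seed from low-boundary regions to get a conservative
#    over-segmentation (superpixels).
markers, _ = ndi.label(boundary_prob < 0.05)
superpixels = segmentation.watershed(boundary_prob, markers)

# 3) Agglomeration: merge adjacent superpixels whose shared border has a low
#    mean boundary probability (real systems learn this merge decision).
def border_means(labels, prob):
    sums, counts = {}, {}
    for ax in (0, 1):
        a = labels.take(range(labels.shape[ax] - 1), axis=ax)
        b = labels.take(range(1, labels.shape[ax]), axis=ax)
        p = prob.take(range(labels.shape[ax] - 1), axis=ax)
        mask = a != b
        for la, lb, pv in zip(a[mask], b[mask], p[mask]):
            key = (min(int(la), int(lb)), max(int(la), int(lb)))
            sums[key] = sums.get(key, 0.0) + float(pv)
            counts[key] = counts.get(key, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

parent = {int(l): int(l) for l in np.unique(superpixels)}
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

for (a, b), m in border_means(superpixels, boundary_prob).items():
    if m < 0.1:                                # weak boundary -> merge regions
        parent[find(b)] = find(a)

segments = np.vectorize(lambda l: find(int(l)))(superpixels)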
Manual Components
(Same pipeline: image stack → Boundary Prediction → Watershed → Agglomeration → segmentation)
• Boundary training: small dataset; label background/foreground/etc.
• Superpixel training: small dataset; yes/no questions
• Manual revision: whole dataset; time consuming
Segmentation Approaches
• Boundary Prediction: random forest, CNN (e.g., Ilastik [Fred Hamprecht lab])
• Agglomeration: greedy agglomeration, multi-cut, etc. (e.g., NeuroProof [Plaza, Parag])
Ideal Segmentation
• A neuron can span 1000s of images
• Overall quality is susceptible to small errors (99% correct boundary prediction might not be enough); a small error causes a big connectome change

Segmentation Algorithmic Challenges
• Poor classifier generalizability (untrained areas can perform poorly)
• Imaging artifacts (e.g., membrane holes)
• Small neurites (10-40 nanometers) stress the resolution of imaging (hard to segment manually, traditionally ignored in evaluation)
• How to evaluate algorithms
– Need large ground truth (but hard to produce; humans aren't perfect but do much better)
– Small things cause big errors
Practical Considerations for Large-Scale Segmentation
• Dataset too large to fit in memory
• Complexity of distributed, large-scale compute: a barrier to entry for algorithm developers
• Robustness: greater risk that a long-running operation dies
• Flexibility: ability to partially rerun segmentation with better algorithms (segment → new algorithm → proofread)
Outline
• Connectomics Background
• Image Segmentation and Challenges
• Large-scale Segmentation Framework
• Spark Architectural Details
• Results and Discussion
Our Solutions
• Distributed, scalable segmentation framework
• Robustness: implement checkpoints for failure recovery
• Infrastructure and tools to enable community contributions
– Plugin architecture to allow custom algorithm drop-in
– Segmentation evaluation tools to focus on relevant errors
Scalable Segmentation Framework
• Mostly local computation (easy to shard)
• Pretty scalable (not compute limited currently)
Dataset (e.g., >200 GB-2 TB, >100,000 cubic microns)
→ Map (overlapping subvolumes): boundary prediction, watershed, agglomeration
→ "Reduce": stitch local volumes (consistent labeling)
→ Write: commit segmentation
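The shape of this map/"reduce" flow can be sketched in a few lines of PySpark. This toy, runnable example is not the DVIDSparkServices implementation: connected-component labeling stands in for the segmentation plugins, a small in-memory array stands in for the EM volume, and only the seam rows are shuffled, mirroring the boundary-only "reduce" described above.

# Toy sketch of the framework: segment overlapping subvolumes independently
# (map, no shuffle), then shuffle only the shared seam to decide which labels
# to merge (stitch). Connected components stand in for the real plugins.
import numpy as np
from scipy import ndimage as ndi
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

volume = (np.random.rand(64, 64) > 0.6).astype(np.uint8)   # stand-in dataset
subvols = [(0, volume[:34, :]), (1, volume[30:, :])]        # 4-row overlap

def segment(item):
    # Local "plugin" work (boundary prediction/watershed/agglomeration in the
    # real pipeline); embarrassingly parallel, so no shuffle is needed here.
    idx, grayscale = item
    labels, _ = ndi.label(grayscale)
    return idx, labels

local_segs = sc.parallelize(subvols, 2).map(segment).persist()

def seam_rows(item):
    # Emit only the 4 overlapping rows from each side of the shared face,
    # keyed by the face id, so the shuffle stays small.
    idx, labels = item
    rows = labels[-4:, :] if idx == 0 else labels[:4, :]
    return [("face-0-1", (idx, rows))]

def match(sides):
    # Labels from the two subvolumes covering the same foreground pixels in
    # the overlap should become one global body.
    (ia, a), (ib, b) = sorted(sides, key=lambda s: s[0])
    mask = (a > 0) & (b > 0)
    return sorted({((ia, int(la)), (ib, int(lb))) for la, lb in zip(a[mask], b[mask])})

merge_pairs = (local_segs.flatMap(seam_rows)
                         .groupByKey()
                         .flatMap(lambda kv: match(list(kv[1])))
                         .collect())

# A union-find over merge_pairs would yield the global label mapping; each
# subvolume is then remapped and written back ("Write: commit segmentation").
print(len(merge_pairs), "label pairs to merge across the seam")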
Check-Point and Rollback
• Group subvolumes into separate iterations
• Serialize each iteration (subvolume segmentation) to distributed disk
• Error in iteration N: allow rollback to the N-1 completed stages
(Figure: iterations 1, 2, 3, each running boundary prediction/watershed/agglomeration and stitching of local volumes, each checkpointed to disk, then combined by union)
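A hedged sketch of the checkpoint idea, using only standard PySpark calls (saveAsPickleFile, pickleFile, union). The per-iteration segmentation work is replaced by a trivial stand-in and the checkpoint directory is a placeholder; the real framework serializes subvolume segmentation in its own format.

# Sketch: run one group of subvolumes per iteration, persist each iteration's
# result to disk, and rebuild from whatever iterations completed.
import shutil
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

groups = [list(range(0, 10)), list(range(10, 20)), list(range(20, 30))]
checkpoint_dir = "/tmp/seg-checkpoints"                    # placeholder location
shutil.rmtree(checkpoint_dir, ignore_errors=True)

completed = []
for i, group in enumerate(groups):
    path = "%s/iteration-%d" % (checkpoint_dir, i)
    seg = sc.parallelize(group).map(lambda sv: (sv, sv * sv))   # "segment" group i
    seg.saveAsPickleFile(path)                                  # serialize to disk
    completed.append(path)
    # If iteration i+1 later dies, iterations 0..i are already safe on disk.

# Restart/rollback: union the serialized results of the completed iterations.
restored = sc.union([sc.pickleFile(p) for p in completed])
print(restored.count())   # 30 records if all three iterations completed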
Rerun Segmentation
Goal: reuse previously proofread neurons
Subvolume Segmentation Task: Boundary Prediction → Watershed → Agglomeration
(Figure: voxel probabilities → superpixels → merge regions 1 and 2 → segments (neurons?))
Rerun Segmentation
Goal: reuse previously proofread neurons
Subvolume Segmentation Task: Boundary Prediction → Watershed → Agglomeration
(Figure: same diagram, now with a previously proofread neuron (regions 4 and 5) preserved through the rerun)
Stitching Subvolume Segmentation
Goal: create a global segmentation (do not propagate 'small' segmentation errors)
• Stitch by overlap (ideal case): labels that overlap across the subvolume boundary are merged
• Stitch by overlap (pathological case): bad segmentation in the overlap produces a false merge
• Stitch by conservative overlap (avoid branching): don't merge when the overlap is ambiguous
• How to avoid being too conservative?
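One way to read "stitch by conservative overlap" is: merge a pair of labels only when each maps exclusively to the other in the overlap region, so a badly segmented overlap (one body mapping onto several) never forces a false merge. The small numpy sketch below encodes that reading; it is an interpretation of the slide, not the framework's exact rule.

import numpy as np

def conservative_stitch(overlap_a, overlap_b):
    # Conservative rule: merge (a, b) only when a and b overlap exclusively
    # with each other (no branching), so a pathological overlap cannot cause
    # a false merge. Background label 0 is ignored.
    partners_a, partners_b = {}, {}
    for a, b in zip(overlap_a.ravel().tolist(), overlap_b.ravel().tolist()):
        if a == 0 or b == 0:
            continue
        partners_a.setdefault(a, set()).add(b)
        partners_b.setdefault(b, set()).add(a)
    return [(a, next(iter(bs))) for a, bs in partners_a.items()
            if len(bs) == 1 and partners_b[next(iter(bs))] == {a}]

# Ideal case: each label pairs with exactly one partner, so both are merged.
a = np.array([[1, 1, 2, 2]])
b = np.array([[7, 7, 8, 8]])
print(conservative_stitch(a, b))        # [(1, 7), (2, 8)]

# Pathological case: label 3 spans two bodies in the other subvolume
# (bad segmentation), so no merge is proposed for it.
a = np.array([[3, 3, 3, 3]])
print(conservative_stitch(a, b))        # []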
Large-scale Segmentation Evaluation
Goal: allow large-scale evaluation of different algorithms
map (subvolumes) → subvolume comparison of seg 1 vs. seg 2 (or ground truth) → combine statistics → final report
Example metrics
• Large-scale similarity
• Small-process similarity
• Edit distance
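The evaluation flow above is itself a small map/reduce: compare the two segmentations per subvolume, then combine the per-subvolume statistics into one report. The sketch below uses simple label co-occurrence counts as the per-subvolume statistic and random arrays as stand-in subvolumes; the actual metrics (large-scale similarity, small-process similarity, edit distance) are more involved.

# Sketch: per-subvolume contingency counts between seg1 and seg2, combined
# across subvolumes by summation.
import numpy as np
from collections import Counter
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def fake_subvolume(seed):
    rng = np.random.RandomState(seed)
    return (rng.randint(0, 5, (16, 16)), rng.randint(0, 5, (16, 16)))

def compare(pair):
    seg1, seg2 = pair
    # How often does each (seg1 label, seg2 label) pair co-occur?
    return Counter(zip(seg1.ravel().tolist(), seg2.ravel().tolist()))

subvolumes = sc.parallelize([fake_subvolume(s) for s in range(8)], 4)
combined = subvolumes.map(compare).reduce(lambda a, b: a + b)   # combine stats

# A final report (similarity scores, edit distance, ...) would be derived from
# the combined contingency table.
print(combined.most_common(5))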
Accessing Large Image Data
• Goals:
– Simple image-oriented API to GET/POST subvolume data
– Abstract the storage layer from the client
– Previous segmentations versioned and saved
• Use the Distributed, Versioned, Image-oriented Dataservice (DVID)
Bill Katz
https://github.com/janelia-flyem/dvid
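As a rough illustration of the "simple image-oriented API", a subvolume can be fetched from and posted to DVID over HTTP. The endpoint shape below follows DVID's raw-data API as I understand it, but the server address, UUID, instance name, and exact path format are assumptions; the DVID repository documents the authoritative API.

# Hedged sketch: GET and POST a grayscale subvolume through DVID's HTTP API.
import numpy as np
import requests

server   = "http://localhost:8000"        # assumed DVID server
uuid     = "abc123"                       # assumed version node UUID
instance = "grayscale"                    # assumed uint8 data instance
size     = (64, 64, 64)                   # x, y, z extent of the subvolume
offset   = (0, 0, 0)                      # x, y, z offset into the dataset

url = "%s/api/node/%s/%s/raw/0_1_2/%d_%d_%d/%d_%d_%d" % (
    (server, uuid, instance) + size + offset)

# GET: fetch the subvolume as raw bytes and view it as a numpy array
# (axis order assumed z, y, x).
r = requests.get(url)
r.raise_for_status()
subvol = np.frombuffer(r.content, dtype=np.uint8).reshape(size[::-1])

# POST: write (possibly modified) voxels back to the same location.
requests.post(url, data=subvol.tobytes()).raise_for_status()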
Outline
• Connectomics Background
• Image Segmentation and Challenges
• Large-scale Segmentation Framework
• Spark Architectural Details
• Results and Discussion
Why Implement in Spark?
• Portability between cluster environments (e.g., AWS, Google Compute, in-house SGE)
• Simple model for distributed computing
– Encourages greater community involvement
– Easier to maintain and extend
• Store the entire segmentation for large volumes in memory (enables future work requiring global, distributed memory access to the segmentation)
Design Features
• Written in Python (PySpark)
• Primarily long-running, disjoint tasks on large subvolumes => need compression and robustness to crashes
• Allow customizable plugins (a plugin can be an executable called by Python); see the sketch after this list
– Plugin 1: Boundary Prediction (input: grayscale; output: voxel probabilities)
– Plugin 2: Watershed (input: voxel probabilities; output: labels)
– Plugin 3: Agglomeration (input: voxel probabilities; output: labels)
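A minimal sketch of what a drop-in plugin can look like under this contract: a Python callable that takes voxel probabilities and returns labels, optionally by shelling out to an external executable. The function names, the executable name, and its command-line flags are hypothetical, not part of the framework's API.

# Hypothetical watershed-style plugin honoring the contract above:
#   input: voxel probabilities, output: labels.
import os
import subprocess
import tempfile
import numpy as np
from scipy import ndimage as ndi

def watershed_plugin(voxel_probs):
    """Pure-Python plugin: probabilities in, labels out."""
    from skimage.segmentation import watershed
    markers, _ = ndi.label(voxel_probs < 0.1)
    return watershed(voxel_probs, markers)

def external_plugin(voxel_probs, binary="./my_segmenter"):
    """A plugin may also be an executable called by Python.

    'my_segmenter' and its flags are placeholders for whatever tool a
    contributor drops in; data is exchanged through temporary .npy files.
    """
    with tempfile.TemporaryDirectory() as tmp:
        probs_path = os.path.join(tmp, "probs.npy")
        labels_path = os.path.join(tmp, "labels.npy")
        np.save(probs_path, voxel_probs)
        subprocess.run([binary, "--input", probs_path, "--output", labels_path],
                       check=True)
        return np.load(labels_path)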
Design Features
• Fast lz4 compression for PySpark serialization
• Fast lz4 compression directly on numpy arrays (cPickle performs slowly on large datasets)
(Figure: lz4 on a numpy label volume is faster than cPickle; a ~1 GB label volume compresses to ~30 MB)
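The idea can be shown with the python lz4 bindings: compress the raw bytes of a numpy label volume directly rather than letting the default pickler handle the array. The compression ratio depends on the data, so the round trip below only demonstrates the mechanism; in the framework this kind of compression wraps subvolume payloads before they enter PySpark serialization.

# Compress a numpy label volume with lz4 directly on its raw bytes, then
# restore it. This is the mechanism, not a benchmark.
import numpy as np
import lz4.frame

labels = np.zeros((256, 256, 256), dtype=np.uint64)    # stand-in label volume
labels[64:192, 64:192, 64:192] = 7                      # one "neuron"

compressed = lz4.frame.compress(labels.tobytes())
print(labels.nbytes, "->", len(compressed), "bytes")

restored = np.frombuffer(lz4.frame.decompress(compressed),
                         dtype=labels.dtype).reshape(labels.shape)
assert np.array_equal(labels, restored)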
High-level Architecture
• Spark application "CreateSegmentation" talks to DVID (which contains the dataset) through sparkdvid
• Subvolume segmentation is backed up to disk (checkpoints)

Main Tasks
• Partition dataset
• Segment subvolumes: 1 subvolume per partition; time-consuming; no shuffling
• Match boundaries: extract substack boundary region; shuffle/reduce
• Stitch: several stitches per subvolume; very fast
• Write subvolumes: remap labels; foreach write on subvolumes
Minimize large shuffles (mostly overlap boundaries)
Outline
• Connectomics Background
• Image Segmentation and Challenges
• Large-scale Segmentation Framework
• Spark Architectural Details
• Results and Discussion
Experimental Setup
• Goal: re-segment a partially proofread region
• Dataset: portion of the fly optic lobe
– 232,000 cubic microns
– 453 GB
– 3,375 subvolumes
• Each worker: 16 cores, 90 GB memory
• Cluster size: 32 workers, 512 cores, 2,880 GB memory
• Only a single-server DB behind DVID (for now)
Results
• Runtimes
– Subvolume segmentation: 42 hours (depends greatly on the plugins used; some Spark recomputation due to executor failure)
– ~4-5 hours per iteration (7 iterations)
– Shuffling and stitching: 58 minutes
– Writing segmentation: 20 hours
• Fast restart from checkpoint: ~30 sec; only 95 GB of serialized segmentation
• >25 hours due to serial reads/writes through the single-server backend (will be fixed soon)
Conclusions
• Open-source, large-scale segmentation in Spark: DVIDSparkServices (https://github.com/janelia-flyem/DVIDSparkServices)
• Fast checkpointing and rollback capabilities
• Robust stitching
• Flexible plugin architecture
• Enables in-memory manipulation of segmented data
Spark Challenges
• Centralized system for custom task-level logging (monitoring/debugging)
• Dynamic cluster sizing/settings (e.g., some tasks require more memory)
• Serialization of large (over 2 GB) RDD elements
Future Work
• Improve throughput of the backend data store
• Test and deploy on the cloud (Google, AWS)
• Increase robustness/flexibility by allowing partial segmentation write-out (stitch using overlap with previously written results)
Questions?
Walkthrough: DVIDSparkServices
Creating a custom workflow (a minimal sketch follows below):
1. Define a JSON schema
2. Inherit from "Workflow", e.g., CustomWorkflow(Workflow)
3. Implement "dumpschema" (returns the JSON schema string)
4. Implement "execute" (runs the actual Spark application)
(Module layout: the DVIDSparkServices python module contains "workflows" (the plugins: IngestGrayscale, ComputeGraph, EvaluateSeg, CreateSegmentation) plus reconutils and sparkdvid; each workflow exposes a JSON schema and is run with a JSON config)
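A minimal sketch of steps 1-4, assuming the Workflow base class provides the Spark context and parsed config to subclasses. The import path, base-class attributes (self.sc, self.config_data), and schema fields are illustrative assumptions, not the exact DVIDSparkServices API.

# Sketch of a custom workflow: define a JSON schema, inherit from Workflow,
# implement dumpschema() and execute().
import json
from DVIDSparkServices.workflow.workflow import Workflow   # module path is an assumption

SCHEMA = json.dumps({
    "type": "object",
    "properties": {
        "dvid-server": {"type": "string"},
        "uuid": {"type": "string"},
        "message": {"type": "string", "default": "hello"}
    },
    "required": ["dvid-server", "uuid"]
})

class CustomWorkflow(Workflow):
    @staticmethod
    def dumpschema():
        # Step 3: return the JSON schema string used to validate config.json.
        return SCHEMA

    def execute(self):
        # Step 4: the actual Spark application; a trivial job for illustration.
        # self.sc and self.config_data are assumed to come from the base class.
        rdd = self.sc.parallelize(range(1000), 8)
        total = rdd.map(lambda x: x * x).sum()
        print(self.config_data.get("message", ""), total)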
Walkthrough: DVIDSparkServices
Running a workflow (locally):
1. Install dvidsparkservices (with conda)
2. Download the Spark binary
3. Add Spark to the path
4. spark-submit --master local[8] workflows/launchworkflow.py CustomWorkflow -c config.json
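For the CustomWorkflow sketched earlier, the config.json passed with -c might be generated like this; the fields mirror that sketch's illustrative schema, not the schema of any shipped workflow.

# Write an illustrative config.json for the hypothetical CustomWorkflow above.
import json

config = {
    "dvid-server": "http://localhost:8000",
    "uuid": "abc123",
    "message": "hello from CustomWorkflow",
}
with open("config.json", "w") as f:
    json.dump(config, f, indent=4)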
Walkthrough: DVIDSparkServices
Launching an application on the cluster with DVIDServicesServer:
1. Install DVIDServicesServer
2. Modify config.json as necessary
3. Modify the SparkLaunch/* config as necessary
4. Launch the server (DVIDServicesServer -port X config.json)
5. Navigate to the web front-end and launch the job
6. Use the web page to monitor the job

DVIDServicesServer: what it does
• Scripts to launch Spark on the cluster
• Provides an intuitive web interface
• Simple API for job tracking
(Uses config.json and SparkLaunch/*)
Demo