Mapping Brain Connectivity through Large-scale Segmentation and Analysis
Stephen Plaza, Stuart Berg
@janeliaflyem / @janelia-flyem / @stephenplaza
https://www.janelia.org/project-team/fly-em
Outline
• Connectomics Background
• Image Segmentation and Challenges
• Large-scale Segmentation Framework
• Spark Architectural Details
• Results and Discussion
What is a (Structural) Connectome?
• A map of brain connectivity
• A list of neurons (graph nodes) and how they are connected through synapses (graph edges)
(Figure: neurons → graph with connection strengths)
Why a Connectome?
• Better understand how the brain works
• However: anatomy often provides just clues (like a map, often necessary but not sufficient to get somewhere)
Example Problem: How do animals detect motion?
(Figure: Hassenstein & Reichardt (1956) theory: photoreceptors, time delay (Δ), multiplication (X))
Connectome Helps Uncover Answer
(Takemura, FlyEM, et al., Nature '13)
How to Obtain a Connectome?
• Extract animal brain
• Image brain (electron microscopy) to generate images
• Find neurons (cell membrane)
• Find synapses
Problem: Datasets are Very Large
• fly: ~10^5 neurons (~100 TB of image data)
• rodents: ~10^9 neurons
• human: ~10^11 neurons
Our Group: FlyEM
• FlyEM mission: perform cutting-edge connectome reconstruction using Electron Microscopy (EM) in the Drosophila (fruit fly)
(Figure: EM images → cell library, synapses, connectivity graph for cell types such as L1, T4, Mi1, Tm3; the team combines imaging, compute/algorithms, bio expertise, and theorists)
EM Reconstruction Pipeline
Example: Fly Optic Lobe
(Video courtesy of Ting Zhao)
• ~5 years of total human effort
• 315,421 synaptic connections
• ~842 reconstructed cells
• ~27,000 cubic microns (~27 GB of data) << whole fly brain
Bottlenecks in Generating Connectomes
Imaging Challenges
• Years to image something like a mouse brain (even with the latest advances)
• Fly brain is already 100 TB of data
Proofreading the Dataset
• Extensive manual component (depends on segmentation quality)
• Worse than imaging (e.g., 1 week of imaging → 1 year of proofreading)
Goal: Improve Segmentation (better segmentation → less manual proofreading)
Outline
• Connectomics Background
• Image Segmentation and Challenges
• Large-scale Segmentation Framework
• Spark Architectural Details
• Results and Discussion
Segmentation Pipeline
image stack → Boundary Prediction → Watershed (over-segmentation, conservative) → Agglomeration (merge regions) → segmentation
(Figure: voxel probabilities → superpixels → merge regions 1 and 2 → segments (neurons?))
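To make the three stages concrete, here is a minimal, illustrative sketch on a single 2D image using scikit-image stand-ins. The real pipeline uses trained voxel classifiers (e.g., Ilastik) and learned agglomeration (e.g., NeuroProof), so the edge filter, seeding rule, and merge threshold below are placeholder assumptions, not the production method.

# Illustrative only: an edge filter stands in for a trained boundary classifier,
# and a fixed threshold stands in for a learned agglomeration decision.
import numpy as np
from scipy import ndimage as ndi
from skimage import data, filters, segmentation

image = data.coins().astype(float)            # stand-in for one EM section

# 1) Boundary prediction: produce a per-voxel "membrane" probability map.
boundary_prob = filters.sobel(image)
boundary_prob /= boundary_prob.max()

# 2) Watershed: seed from low-boundary regions to get a conservative
#    over-segmentation (superpixels).
markers, _ = ndi.label(boundary_prob < 0.05)
superpixels = segmentation.watershed(boundary_prob, markers)

# 3) Agglomeration: merge adjacent superpixels whose shared border has a low
#    mean boundary probability (real systems learn this merge decision).
def border_means(labels, prob):
    sums, counts = {}, {}
    for ax in (0, 1):
        a = labels.take(range(labels.shape[ax] - 1), axis=ax)
        b = labels.take(range(1, labels.shape[ax]), axis=ax)
        p = prob.take(range(labels.shape[ax] - 1), axis=ax)
        mask = a != b
        for la, lb, pv in zip(a[mask], b[mask], p[mask]):
            key = (min(int(la), int(lb)), max(int(la), int(lb)))
            sums[key] = sums.get(key, 0.0) + float(pv)
            counts[key] = counts.get(key, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

parent = {int(l): int(l) for l in np.unique(superpixels)}
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

for (a, b), m in border_means(superpixels, boundary_prob).items():
    if m < 0.1:                                # weak boundary -> merge regions
        parent[find(b)] = find(a)

segments = np.vectorize(lambda l: find(int(l)))(superpixels)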
Manual Components
(Same pipeline: image stack → Boundary Prediction → Watershed → Agglomeration → segmentation)
• Boundary training: small dataset; label background/foreground/etc.
• Superpixel training: small dataset; yes/no questions
• Manual revision: whole dataset; time consuming
Segmentation Approaches
• Boundary Prediction: random forest, CNN (e.g., Ilastik [Fred Hamprecht lab])
• Agglomeration: greedy agglomeration, multi-cut, etc. (e.g., NeuroProof [Plaza, Parag])
Ideal Segmentation
• A neuron can span 1000s of images
• Overall quality is susceptible to small errors (99% correct boundary prediction might not be enough); a small error causes a big connectome change

Segmentation Algorithmic Challenges
• Poor classifier generalizability (untrained areas can perform poorly)
• Imaging artifacts (e.g., membrane holes)
• Small neurites (10-40 nanometers) stress the resolution of imaging (hard to segment manually, traditionally ignored in evaluation)
• How to evaluate algorithms
– Need large ground truth (but hard to produce; humans aren't perfect but do much better)
– Small things cause big errors
Practical Considerations for Large-Scale Segmentation
• Dataset too large to fit in memory
• Complexity of distributed, large-scale compute: a barrier to entry for algorithm developers
• Robustness: greater risk that a long-running operation dies
• Flexibility: ability to partially rerun segmentation with better algorithms (segment → new algorithm → proofread)
Outline
• Connectomics Background
• Image Segmentation and Challenges
• Large-scale Segmentation Framework
• Spark Architectural Details
• Results and Discussion
Our Solutions
• Distributed, scalable segmentation framework
• Robustness: implement checkpoints for failure recovery
• Infrastructure and tools to enable community contributions
– Plugin architecture to allow custom algorithm drop-in
– Segmentation evaluation tools to focus on relevant errors
Scalable Segmentation Framework
• Mostly local computation (easy to shard)
• Pretty scalable (not compute limited currently)
Dataset (e.g., >200 GB-2 TB, >100,000 cubic microns)
→ Map (overlapping subvolumes): boundary prediction, watershed, agglomeration
→ "Reduce": stitch local volumes (consistent labeling)
→ Write: commit segmentation
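The shape of this map/"reduce" flow can be sketched in a few lines of PySpark. This toy, runnable example is not the DVIDSparkServices implementation: connected-component labeling stands in for the segmentation plugins, a small in-memory array stands in for the EM volume, and only the seam rows are shuffled, mirroring the boundary-only "reduce" described above.

# Toy sketch of the framework: segment overlapping subvolumes independently
# (map, no shuffle), then shuffle only the shared seam to decide which labels
# to merge (stitch). Connected components stand in for the real plugins.
import numpy as np
from scipy import ndimage as ndi
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

volume = (np.random.rand(64, 64) > 0.6).astype(np.uint8)   # stand-in dataset
subvols = [(0, volume[:34, :]), (1, volume[30:, :])]        # 4-row overlap

def segment(item):
    # Local "plugin" work (boundary prediction/watershed/agglomeration in the
    # real pipeline); embarrassingly parallel, so no shuffle is needed here.
    idx, grayscale = item
    labels, _ = ndi.label(grayscale)
    return idx, labels

local_segs = sc.parallelize(subvols, 2).map(segment).persist()

def seam_rows(item):
    # Emit only the 4 overlapping rows from each side of the shared face,
    # keyed by the face id, so the shuffle stays small.
    idx, labels = item
    rows = labels[-4:, :] if idx == 0 else labels[:4, :]
    return [("face-0-1", (idx, rows))]

def match(sides):
    # Labels from the two subvolumes covering the same foreground pixels in
    # the overlap should become one global body.
    (ia, a), (ib, b) = sorted(sides, key=lambda s: s[0])
    mask = (a > 0) & (b > 0)
    return sorted({((ia, int(la)), (ib, int(lb))) for la, lb in zip(a[mask], b[mask])})

merge_pairs = (local_segs.flatMap(seam_rows)
                         .groupByKey()
                         .flatMap(lambda kv: match(list(kv[1])))
                         .collect())

# A union-find over merge_pairs would yield the global label mapping; each
# subvolume is then remapped and written back ("Write: commit segmentation").
print(len(merge_pairs), "label pairs to merge across the seam")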
Check-Point and Rollback
• Group subvolumes into separate iterations
• Serialize each iteration (subvolume segmentation) to distributed disk
• Error in iteration N: allow rollback to the N-1 completed stages
(Figure: iterations 1, 2, 3, each running boundary prediction/watershed/agglomeration and stitching of local volumes, each checkpointed to disk, then combined by union)
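A hedged sketch of the checkpoint idea, using only standard PySpark calls (saveAsPickleFile, pickleFile, union). The per-iteration segmentation work is replaced by a trivial stand-in and the checkpoint directory is a placeholder; the real framework serializes subvolume segmentation in its own format.

# Sketch: run one group of subvolumes per iteration, persist each iteration's
# result to disk, and rebuild from whatever iterations completed.
import shutil
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

groups = [list(range(0, 10)), list(range(10, 20)), list(range(20, 30))]
checkpoint_dir = "/tmp/seg-checkpoints"                    # placeholder location
shutil.rmtree(checkpoint_dir, ignore_errors=True)

completed = []
for i, group in enumerate(groups):
    path = "%s/iteration-%d" % (checkpoint_dir, i)
    seg = sc.parallelize(group).map(lambda sv: (sv, sv * sv))   # "segment" group i
    seg.saveAsPickleFile(path)                                  # serialize to disk
    completed.append(path)
    # If iteration i+1 later dies, iterations 0..i are already safe on disk.

# Restart/rollback: union the serialized results of the completed iterations.
restored = sc.union([sc.pickleFile(p) for p in completed])
print(restored.count())   # 30 records if all three iterations completed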
Rerun Segmentation
Goal: reuse previously proofread neurons
Subvolume Segmentation Task: Boundary Prediction → Watershed → Agglomeration
(Figure: voxel probabilities → superpixels → merge regions 1 and 2 → segments (neurons?))
Rerun Segmentation
Goal: reuse previously proofread neurons
Subvolume Segmentation Task: Boundary Prediction → Watershed → Agglomeration
(Figure: same diagram, now with a previously proofread neuron (regions 4 and 5) preserved through the rerun)
Stitching Subvolume Segmentation
Goal: create a global segmentation (do not propagate 'small' segmentation errors)
• Stitch by overlap (ideal case): labels that overlap across the subvolume boundary are merged
• Stitch by overlap (pathological case): bad segmentation in the overlap produces a false merge
• Stitch by conservative overlap (avoid branching): don't merge when the overlap is ambiguous
• How to avoid being too conservative?
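One way to read "stitch by conservative overlap" is: merge a pair of labels only when each maps exclusively to the other in the overlap region, so a badly segmented overlap (one body mapping onto several) never forces a false merge. The small numpy sketch below encodes that reading; it is an interpretation of the slide, not the framework's exact rule.

import numpy as np

def conservative_stitch(overlap_a, overlap_b):
    # Conservative rule: merge (a, b) only when a and b overlap exclusively
    # with each other (no branching), so a pathological overlap cannot cause
    # a false merge. Background label 0 is ignored.
    partners_a, partners_b = {}, {}
    for a, b in zip(overlap_a.ravel().tolist(), overlap_b.ravel().tolist()):
        if a == 0 or b == 0:
            continue
        partners_a.setdefault(a, set()).add(b)
        partners_b.setdefault(b, set()).add(a)
    return [(a, next(iter(bs))) for a, bs in partners_a.items()
            if len(bs) == 1 and partners_b[next(iter(bs))] == {a}]

# Ideal case: each label pairs with exactly one partner, so both are merged.
a = np.array([[1, 1, 2, 2]])
b = np.array([[7, 7, 8, 8]])
print(conservative_stitch(a, b))        # [(1, 7), (2, 8)]

# Pathological case: label 3 spans two bodies in the other subvolume
# (bad segmentation), so no merge is proposed for it.
a = np.array([[3, 3, 3, 3]])
print(conservative_stitch(a, b))        # []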
Large-scale Segmentation Evaluation
Goal: allow large-scale evaluation of different algorithms
map (subvolumes) → subvolume comparison of seg 1 vs. seg 2 (or ground truth) → combine statistics → final report
Example metrics
• Large-scale similarity
• Small-process similarity
• Edit distance
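The evaluation flow above is itself a small map/reduce: compare the two segmentations per subvolume, then combine the per-subvolume statistics into one report. The sketch below uses simple label co-occurrence counts as the per-subvolume statistic and random arrays as stand-in subvolumes; the actual metrics (large-scale similarity, small-process similarity, edit distance) are more involved.

# Sketch: per-subvolume contingency counts between seg1 and seg2, combined
# across subvolumes by summation.
import numpy as np
from collections import Counter
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def fake_subvolume(seed):
    rng = np.random.RandomState(seed)
    return (rng.randint(0, 5, (16, 16)), rng.randint(0, 5, (16, 16)))

def compare(pair):
    seg1, seg2 = pair
    # How often does each (seg1 label, seg2 label) pair co-occur?
    return Counter(zip(seg1.ravel().tolist(), seg2.ravel().tolist()))

subvolumes = sc.parallelize([fake_subvolume(s) for s in range(8)], 4)
combined = subvolumes.map(compare).reduce(lambda a, b: a + b)   # combine stats

# A final report (similarity scores, edit distance, ...) would be derived from
# the combined contingency table.
print(combined.most_common(5))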
Accessing Large Image Data
• Goals:
– Simple image-oriented API to GET/POST subvolume data
– Abstract the storage layer from the client
– Previous segmentations versioned and saved
• Use the Distributed, Versioned, Image-oriented Dataservice (DVID)
Bill Katz
https://github.com/janelia-flyem/dvid
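As a rough illustration of the "simple image-oriented API", a subvolume can be fetched from and posted to DVID over HTTP. The endpoint shape below follows DVID's raw-data API as I understand it, but the server address, UUID, instance name, and exact path format are assumptions; the DVID repository documents the authoritative API.

# Hedged sketch: GET and POST a grayscale subvolume through DVID's HTTP API.
import numpy as np
import requests

server   = "http://localhost:8000"        # assumed DVID server
uuid     = "abc123"                       # assumed version node UUID
instance = "grayscale"                    # assumed uint8 data instance
size     = (64, 64, 64)                   # x, y, z extent of the subvolume
offset   = (0, 0, 0)                      # x, y, z offset into the dataset

url = "%s/api/node/%s/%s/raw/0_1_2/%d_%d_%d/%d_%d_%d" % (
    (server, uuid, instance) + size + offset)

# GET: fetch the subvolume as raw bytes and view it as a numpy array
# (axis order assumed z, y, x).
r = requests.get(url)
r.raise_for_status()
subvol = np.frombuffer(r.content, dtype=np.uint8).reshape(size[::-1])

# POST: write (possibly modified) voxels back to the same location.
requests.post(url, data=subvol.tobytes()).raise_for_status()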
Outline
• Connectomics Background
• Image Segmentation and Challenges
• Large-scale Segmentation Framework
• Spark Architectural Details
• Results and Discussion
Why Implement in Spark?
• Portability between cluster environments (e.g., AWS, Google Compute, in-house SGE)
• Simple model for distributed computing
– Encourages greater community involvement
– Easier to maintain and extend
• Store the entire segmentation for large volumes in memory (enables future work requiring global, distributed memory access to the segmentation)
Design Features
• Written in Python (PySpark)
• Primarily long-running, disjoint tasks on large subvolumes => need compression and robustness to crashes
• Allow customizable plugins (a plugin can be an executable called by Python); see the sketch after this list
– Plugin 1: Boundary Prediction (input: grayscale; output: voxel probabilities)
– Plugin 2: Watershed (input: voxel probabilities; output: labels)
– Plugin 3: Agglomeration (input: voxel probabilities; output: labels)
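A minimal sketch of what a drop-in plugin can look like under this contract: a Python callable that takes voxel probabilities and returns labels, optionally by shelling out to an external executable. The function names, the executable name, and its command-line flags are hypothetical, not part of the framework's API.

# Hypothetical watershed-style plugin honoring the contract above:
#   input: voxel probabilities, output: labels.
import os
import subprocess
import tempfile
import numpy as np
from scipy import ndimage as ndi

def watershed_plugin(voxel_probs):
    """Pure-Python plugin: probabilities in, labels out."""
    from skimage.segmentation import watershed
    markers, _ = ndi.label(voxel_probs < 0.1)
    return watershed(voxel_probs, markers)

def external_plugin(voxel_probs, binary="./my_segmenter"):
    """A plugin may also be an executable called by Python.

    'my_segmenter' and its flags are placeholders for whatever tool a
    contributor drops in; data is exchanged through temporary .npy files.
    """
    with tempfile.TemporaryDirectory() as tmp:
        probs_path = os.path.join(tmp, "probs.npy")
        labels_path = os.path.join(tmp, "labels.npy")
        np.save(probs_path, voxel_probs)
        subprocess.run([binary, "--input", probs_path, "--output", labels_path],
                       check=True)
        return np.load(labels_path)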
Design Features
• Fast lz4 compression for PySpark serialization
• Fast lz4 compression directly on numpy arrays (cPickle performs slowly on large datasets)
(Figure: lz4 on a numpy label volume is faster than cPickle; a ~1 GB label volume compresses to ~30 MB)
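The idea can be shown with the python lz4 bindings: compress the raw bytes of a numpy label volume directly rather than letting the default pickler handle the array. The compression ratio depends on the data, so the round trip below only demonstrates the mechanism; in the framework this kind of compression wraps subvolume payloads before they enter PySpark serialization.

# Compress a numpy label volume with lz4 directly on its raw bytes, then
# restore it. This is the mechanism, not a benchmark.
import numpy as np
import lz4.frame

labels = np.zeros((256, 256, 256), dtype=np.uint64)    # stand-in label volume
labels[64:192, 64:192, 64:192] = 7                      # one "neuron"

compressed = lz4.frame.compress(labels.tobytes())
print(labels.nbytes, "->", len(compressed), "bytes")

restored = np.frombuffer(lz4.frame.decompress(compressed),
                         dtype=labels.dtype).reshape(labels.shape)
assert np.array_equal(labels, restored)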
High-level Architecture
• Spark application "CreateSegmentation" talks to DVID (which contains the dataset) through sparkdvid
• Subvolume segmentation is backed up to disk (checkpoints)

Main Tasks
• Partition dataset
• Segment subvolumes: 1 subvolume per partition; time-consuming; no shuffling
• Match boundaries: extract substack boundary region; shuffle/reduce
• Stitch: several stitches per subvolume; very fast
• Write subvolumes: remap labels; foreach write on subvolumes
Minimize large shuffles (mostly overlap boundaries)
Outline
• Connectomics Background
• Image Segmentation and Challenges
• Large-scale Segmentation Framework
• Spark Architectural Details
• Results and Discussion
Experimental Setup
• Goal: re-segment a partially proofread region
• Dataset: portion of the fly optic lobe
– 232,000 cubic microns
– 453 GB
– 3,375 subvolumes
• Each worker: 16 cores, 90 GB memory
• Cluster size: 32 workers, 512 cores, 2,880 GB memory
• Only a single-server DB behind DVID (for now)
Results
• Runtimes
– Subvolume segmentation: 42 hours (depends greatly on the plugins used; some Spark recomputation due to executor failure)
– ~4-5 hours per iteration (7 iterations)
– Shuffling and stitching: 58 minutes
– Writing segmentation: 20 hours
• Fast restart from checkpoint: ~30 sec; only 95 GB of serialized segmentation
• >25 hours due to serial reads/writes through the single-server backend (will be fixed soon)
Conclusions
• Open-source, large-scale segmentation in Spark: DVIDSparkServices (https://github.com/janelia-flyem/DVIDSparkServices)
• Fast checkpointing and rollback capabilities
• Robust stitching
• Flexible plugin architecture
• Enables in-memory manipulation of segmented data
Spark Challenges
• Centralized system for custom task-level logging (monitoring/debugging)
• Dynamic cluster sizing/settings (e.g., some tasks require more memory)
• Serialization of large (over 2 GB) RDD elements
Future Work
• Improve throughput of the backend data store
• Test and deploy on the cloud (Google, AWS)
• Increase robustness/flexibility by allowing partial segmentation write-out (stitch using overlap with previously written results)
Questions?
Walkthrough: DVIDSparkServices
Creating a custom workflow (a minimal sketch follows below):
1. Define a JSON schema
2. Inherit from "Workflow", e.g., CustomWorkflow(Workflow)
3. Implement "dumpschema" (returns the JSON schema string)
4. Implement "execute" (runs the actual Spark application)
(Module layout: the DVIDSparkServices python module contains "workflows" (the plugins: IngestGrayscale, ComputeGraph, EvaluateSeg, CreateSegmentation) plus reconutils and sparkdvid; each workflow exposes a JSON schema and is run with a JSON config)
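A minimal sketch of steps 1-4, assuming the Workflow base class provides the Spark context and parsed config to subclasses. The import path, base-class attributes (self.sc, self.config_data), and schema fields are illustrative assumptions, not the exact DVIDSparkServices API.

# Sketch of a custom workflow: define a JSON schema, inherit from Workflow,
# implement dumpschema() and execute().
import json
from DVIDSparkServices.workflow.workflow import Workflow   # module path is an assumption

SCHEMA = json.dumps({
    "type": "object",
    "properties": {
        "dvid-server": {"type": "string"},
        "uuid": {"type": "string"},
        "message": {"type": "string", "default": "hello"}
    },
    "required": ["dvid-server", "uuid"]
})

class CustomWorkflow(Workflow):
    @staticmethod
    def dumpschema():
        # Step 3: return the JSON schema string used to validate config.json.
        return SCHEMA

    def execute(self):
        # Step 4: the actual Spark application; a trivial job for illustration.
        # self.sc and self.config_data are assumed to come from the base class.
        rdd = self.sc.parallelize(range(1000), 8)
        total = rdd.map(lambda x: x * x).sum()
        print(self.config_data.get("message", ""), total)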
Walkthrough: DVIDSparkServices
Running a workflow (locally):
1. Install dvidsparkservices (with conda)
2. Download the Spark binary
3. Add Spark to the path
4. spark-submit --master local[8] workflows/launchworkflow.py CustomWorkflow -c config.json
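For the CustomWorkflow sketched earlier, the config.json passed with -c might be generated like this; the fields mirror that sketch's illustrative schema, not the schema of any shipped workflow.

# Write an illustrative config.json for the hypothetical CustomWorkflow above.
import json

config = {
    "dvid-server": "http://localhost:8000",
    "uuid": "abc123",
    "message": "hello from CustomWorkflow",
}
with open("config.json", "w") as f:
    json.dump(config, f, indent=4)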
Walkthrough: DVIDSparkServices
Launching an application on the cluster with DVIDServicesServer:
1. Install DVIDServicesServer
2. Modify config.json as necessary
3. Modify the SparkLaunch/* config as necessary
4. Launch the server (DVIDServicesServer -port X config.json)
5. Navigate to the web front-end and launch the job
6. Use the web page to monitor the job

DVIDServicesServer: what it does
• Scripts to launch Spark on the cluster
• Provides an intuitive web interface
• Simple API for job tracking
(Uses config.json and SparkLaunch/*)
Demo