
Online Learning for Semi-Automatic Tracing of Linear Structures

Agata Mosińska
I&C, EPFL, CVLab

Abstract—Although the reconstruction of curvilinear structures in 2D images and 3D image stacks has gained much attention in recent years, automated algorithms still do not match human performance, and the reconstruction is usually followed by time-consuming proof-editing. Moreover, methods relying on supervised machine learning require laborious and time-consuming manual annotation by a human expert. There is therefore a great demand for tools that can reduce the time spent both on labelling and on error correction.

In this report, we present an approach that can considerably accelerate both of those steps. By applying Active Learning (AL) and using the fact that our problem is represented as a graph, we reduced the required size of the training set by 70%. We also describe two graph-based semi-supervised algorithms from which AL can benefit, as they propagate information from labelled data. Additionally, we propose an interactive editing method that propagates user feedback and actively suggests possible mistakes. Ultimately, the error detection and the corresponding correction can be learned and become an automatic post-processing step in the reconstruction pipeline.

Index Terms—delineation, active learning, semi-supervised learning on graphs

Proposal submitted to committee: June 15th, 2015; Candidacy exam date: June 22nd, 2015; Candidacy exam committee: Exam president: Prof. Sabine Süsstrunk, thesis director: Prof. Pascal Fua, co-examiner: Dr. Graham Knott.

This research plan has been approved:

Date: ————————————

Doctoral candidate: ———————————— (name and signature)

Thesis director: ———————————— (name and signature)

Thesis co-director: ———————————— (if applicable) (name and signature)

Doct. prog. director: ———————————— (B. Falsafi) (signature)

EDIC-ru/05.05.2009

Fig. 1. An example of the variety of curvilinear networks. Top left: aerial road image. Top right: brainbow image. Bottom left: darkfield neuron. Bottom right: blood vessels and neurons in confocal microscopy.

I. INTRODUCTION

CURVILINEAR tree-like structures are abundant in nature on a range of scales, from neurons in microscopic 3D stacks, through roads in aerial images, to solar filaments in telescopic images (Fig. 1). Recent years have seen considerable efforts to create a universal and automatic reconstruction algorithm for them, as the growing amount of data makes manual tracing infeasible. The large variety of appearances and scales, as well as imaging artefacts, makes this problem particularly difficult from the computer vision point of view. Despite substantial efforts, automated reconstruction algorithms do not match the performance of a human annotator. It therefore seems that, at the moment, user interaction is indispensable to make such methods fully applicable. The desired solution should be reliable, fast, and require as little user input as possible.

Currently, the most successful reconstruction methods rely on supervised machine learning techniques, which conventionally require a significant amount of labelled data. This underlines the importance of ground-truth data collection and annotation. Unfortunately, this process is not only tedious and time-consuming, but also prone to human error due to inconsistency and ambiguity of the data. For that reason, there is a great need for tools that would make annotation faster and more convenient for the user.


Fig. 2. Steps in the reconstruction algorithm. a) Original image. b) An overcomplete graph of network edges that are then classified by the path classifier. c) Final reconstruction using IP.

In this report we propose a method that has the potential to considerably reduce the effort and inconvenience involved in this activity.

We use the idea of Active Learning, which aims at achieving greater accuracy with fewer examples by implementing smart querying strategies. An active learner poses queries about the instances it finds most informative, in contrast to random sampling of training examples. An AL approach can thus learn an accurate classifier using only a fraction of the available data, greatly decreasing annotation time and effort. We also describe two graph-based semi-supervised algorithms that propagate information from labelled instances to their neighbours. As we show, they can be used to facilitate both AL and interactive proof-editing of reconstruction algorithms.

In this report we first present the state-of-the-art approach to the reconstruction of linear networks and its existing limitations. Subsequently, we describe related work in the area of AL and two semi-supervised methods which can be combined with it. Finally, we present preliminary results of our experiments and outline ideas for future research.

II. CURVILINEAR NETWORK RECONSTRUCTION

The delineation of linear structures has been of interest since the advent of computer vision. Nevertheless, due to the complexity and often poor quality of the data, the full automation of this task remains an unsolved problem. On the other hand, semi-automated methods require extensive and tedious human interaction, which was identified in the DIADEM challenge as the current bottleneck of neuron reconstruction [14]. In applications such as neuroscience, this dramatically slows down the analysis process and, as a result, a vast amount of data remains unseen.

The shortcomings of existing local reconstruction methods are often associated with the use of greedy heuristics, which are susceptible to noise and imaging artefacts. On the other hand, global methods aim at optimising an objective function and often get trapped in local minima. Below we present a state-of-the-art method that overcomes these problems.

A. Approach

Turetken et al. [23] present a novel approach to tracing linear structures by formulating the reconstruction problem as an Integer Program (IP) on a graph of potential tubular paths. The algorithm can be summarised in the following steps:

1) A tubularity score is computed at each image location and over a range of radii. It quantifies the likelihood of a tubular structure existing at a given location and radius.

2) Regularly spaced high-tubularity seed points are chosen by non-maximum suppression. The resulting directed graph is an overcomplete representation of the curvilinear structure.

3) A probabilistic weight is assigned to pairs of consecutive edges by a path classifier. This requires prior training on annotated data, which consists of manually traced paths.

4) The weights are used to solve an integer program and compute the maximum-likelihood subgraph of the overcomplete graph. This subgraph represents the final result.

There are three key differences that distinguish the described algorithm from its competitors. The first is its guarantee of optimality (within a small tolerance). Secondly, it introduces a new way of scoring individual segments with a path classifier instead of integrating pixel values, which is prone to noise. Finally, it does not assume that the graph is a tree, therefore allowing for cycles while at the same time penalising early branch terminations and spurious junctions.

B. Limitations of the existing solution

Despite the remarkable contributions of the described work, there are still some limitations that should be addressed. Firstly, in order to give desirable results, the path classifier requires a significant amount of labelled data for training. Collecting it is a tedious task that is prone to human error. There is therefore a need for tools that would decrease the training set size without compromising the quality of classification.

One of the failure cases occurs with faint branches, which results in premature termination of the branch or a discontinuity within it. Poor spacing in the z-dimension sometimes creates artificial intersections between branches. The algorithm introduces geometric consistency by assigning probabilistic weights to pairs of edges. However, its impact remains relatively local and does not account for global consistency. Lastly, the computation time is relatively long, which prevents the algorithm from being used in large-scale studies.


III. ACTIVE LEARNING

AL is based on the idea that the learning algorithm can actively choose the most informative instances to be labelled next. This in turn can drastically reduce the need for training examples. Its applications range from natural language processing [21], through computer vision [20], to bioinformatics [10]. There are several scenarios in which queries might be posed and several ways of detecting informative instances. For a detailed summary of AL we refer the reader to the survey by Settles [16].

The three main querying scenarios are membership query synthesis, stream-based selective sampling and pool-based sampling. In the first one [1], the learner can query any instance, including artificially generated ones. This scenario can be applied when a limited amount of data is available but, in practice, the synthesised instances can often be difficult for a human annotator to interpret. In the stream-based scenario [2], typically one instance is drawn at a time and the learner decides whether to query its label or discard it. This is particularly useful when there is a high cost associated with obtaining the data itself. In the delineation problem we do not face any of the above-mentioned challenges, hence the most suitable method is pool-based learning [9], where the learner samples from a large pool of unlabelled data. Below we present different query strategies which can be employed in pool-based AL.

A. Query strategies

In the remainder of this section we will use the following notation: suppose we are given a small labelled set of instances $L = \{x_1, \ldots, x_L\}$ with assigned classes $Y = \{y_1, \ldots, y_L\}$, which can take values in $C = \{c_1, \ldots, c_K\}$, and a much larger unlabelled set $U = \{x_1, \ldots, x_U\}$.

Uncertainty sampling Among the informativeness evaluation methods, the most popular one is uncertainty sampling [9]. In this setting the learner chooses the instances about which it is least certain. The uncertainty measures include least confident sampling [15], defined as:

$$x^* = \arg\max_{x \in U} \left(1 - P(y \mid x)\right) \qquad (1)$$

where y is the class label with the highest posterior probability. Another popular measure, originating from information theory, is entropy:

$$x^* = \arg\max_{x \in U} \; -\sum_k P(y_k \mid x) \log P(y_k \mid x) \qquad (2)$$

While the two measures behave in the same way in the case of binary classification, their behaviour differs in multi-class settings. A third common measure, margin sampling, queries the instance with the smallest difference between the posterior probabilities of the two most likely labels. If we consider a three-class problem, entropy will favour instances for which all three classes are uncertain. In contrast, for margin sampling, samples with two uncertain labels will be as informative as those with three uncertain labels. Generally, entropy extends well to multi-class problems [17] and is appropriate when the goal is to minimise log-loss, while margin sampling is more appropriate when the aim is to reduce classification errors.
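For illustration, the three measures discussed above can be computed from an array of class posteriors as follows; this is a generic sketch assuming a classifier with a scikit-learn-style predict_proba output, not the implementation used in our experiments:

```python
import numpy as np

def uncertainty_scores(proba):
    """Uncertainty measures from an (n_samples, n_classes) array of
    posterior probabilities P(y_k | x)."""
    # Least confident (Eq. 1): 1 - P(y | x) for the most probable label y.
    least_confident = 1.0 - proba.max(axis=1)
    # Margin: gap between the two most probable labels (smaller = more uncertain).
    part = np.sort(proba, axis=1)
    margin = part[:, -1] - part[:, -2]
    # Entropy (Eq. 2): -sum_k P(y_k | x) log P(y_k | x).
    entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)
    return least_confident, margin, entropy

# Querying the unlabelled instance maximising entropy:
#   proba = clf.predict_proba(X_unlabelled)
#   x_star = np.argmax(uncertainty_scores(proba)[2])
```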

The above formulations require a probabilistic output of the classifier, but strategies also exist for non-probabilistic methods such as SVMs [21], Boosting [7], Neural Networks [3] and CRFs [4]. Uncertainty sampling may suffer from so-called sampling bias, in which the learner chooses instances only from a certain subspace that is not representative of the rest of the data.

Query-by-committee Another popular approach is the query-by-committee (QBC) algorithm [18], which minimises the version space of competing hypotheses. It involves a committee of models which are consistent with the already-labelled data. As the goal is to minimise the size of the version space, the next instance to choose should be the one causing the biggest disagreement between the models. This can be measured by vote entropy [5]:

$$x^* = \arg\max_{x \in U} \; -\sum_k \frac{V(y_k)}{C} \log \frac{V(y_k)}{C} \qquad (3)$$

where $V(y_k)$ is the number of classifiers “voting” for a given label and C is the committee size. The hard vote count can be relaxed to the posterior probability computed by each classifier. When the marginal probability is given, the Kullback-Leibler (KL) divergence can also be computed [12]. Here, it indicates the average difference between the label distributions of each committee member and their consensus:

$$x^* = \arg\max_{x \in U} \; \frac{1}{C} \sum_{c=1}^{C} \sum_k P_c(y_k \mid x) \log \frac{P_c(y_k \mid x)}{P_{\mathcal{C}}(y_k \mid x)} \qquad (4)$$

where the “consensus” is $P_{\mathcal{C}}(y_k \mid x) = \frac{1}{C} \sum_c P_c(y_k \mid x)$.

The main disadvantage of the QBC method is its computational complexity, as at each iteration all models have to be recomputed for each unlabelled instance in order to find the disagreement.
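Both disagreement measures can be sketched as follows, assuming the committee members' predictions have already been collected into arrays (the shapes noted in the docstrings are our own convention):

```python
import numpy as np

def vote_entropy(votes, n_classes):
    """Eq. 3: votes is a (C, n_samples) array of hard label predictions,
    one row per committee member."""
    C = votes.shape[0]
    scores = np.zeros(votes.shape[1])
    for k in range(n_classes):
        v = (votes == k).sum(axis=0) / C          # V(y_k) / C
        scores -= v * np.log(np.maximum(v, 1e-12))
    return scores

def kl_disagreement(probas):
    """Eq. 4: probas is a (C, n_samples, n_classes) array of soft
    predictions; returns the mean KL divergence to the consensus."""
    consensus = probas.mean(axis=0)               # P_C(y_k | x)
    kl = np.sum(probas * np.log((probas + 1e-12) / (consensus + 1e-12)),
                axis=2)
    return kl.mean(axis=0)

# The committee queries x* = argmax of either score over the unlabelled pool.
```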

Density-weighted methods Both uncertainty and query-by-committee sampling often suffer from querying outliers and redundant instances, which are not representative of the whole input space. To combat this problem, density-weighted methods take the entire input space into account, considering not only the informativeness but also the representativeness [24] of each unlabelled sample. This can be accomplished by introducing an additional term which measures the similarity between instances:

$$x^* = \arg\max_{x \in U} \; I(x) \cdot \frac{1}{U} \sum_{u=1}^{U} \mathrm{sim}(x, x_u) \qquad (5)$$

Here, $I(x)$ represents the informativeness of x, which can be computed using one of the above methods. Existing approaches to quantifying similarity involve clustering the input space [13], k-nearest neighbours [6], and the KL divergence [24] between the distributions of the labelled and unlabelled sets. Another technique, by Nguyen [13], propagates label information to instances in the same cluster to avoid querying outliers.

Provided the similarity can be computed efficiently, density-weighted methods have been shown to outperform uncertainty sampling and QBC [17].
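A minimal sketch of Eq. 5, using an RBF kernel as one possible choice of similarity (the cited works use clustering, k-NN or KL divergence instead; the kernel and its bandwidth are our assumptions):

```python
import numpy as np

def density_weighted_scores(informativeness, X_unlab, sigma=1.0):
    """Eq. 5: multiply an informativeness score I(x) by the average
    similarity of x to all unlabelled samples."""
    d2 = np.sum((X_unlab[:, None, :] - X_unlab[None, :, :]) ** 2, axis=2)
    sim = np.exp(-d2 / (2 * sigma ** 2))       # RBF similarity sim(x, x_u)
    density = sim.mean(axis=1)                 # (1/U) * sum_u sim(x, x_u)
    return informativeness * density

# x_star = np.argmax(density_weighted_scores(entropy_scores, X_unlab))
```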


Expected error reduction The idea behind this query strategy is to minimise the future generalisation error. To do so, the new error has to be estimated for every unlabelled sample by adding it to the labelled set. A new model is then re-trained and the future generalisation error is recomputed. For example, the best sample may be the one minimising the 0/1-loss:

$$x^* = \arg\min_{x \in U} \sum_k P_\theta(y_k \mid x) \sum_{u=1}^{U} \left(1 - P_{\theta^+}(\hat{y} \mid x_u)\right) \qquad (6)$$

where $\hat{y}$ is the most probable class of $x_u$ and $\theta^+$ is the model re-trained with the candidate sample added. Note that when estimating the error for each added sample we do not yet know which label we would receive; we therefore average over all possible labels $y_k$ under the current model $\theta$. This strategy is one of the most computationally expensive, and thus it is only practical when there are fast ways of recomputing the model (for an example see the next section).
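As a concrete, if brute-force, sketch of this strategy with a probabilistic scikit-learn classifier (retraining the model for every candidate and label, exactly as the formula prescribes; function and variable names are ours):

```python
import numpy as np
from sklearn.base import clone

def expected_error_reduction(clf, X_lab, y_lab, X_unlab):
    """Eq. 6 with 0/1-loss: for each candidate x and each possible label,
    retrain the model and accumulate the expected future error over the
    pool. Impractical without fast model updates."""
    proba = clf.predict_proba(X_unlab)
    risks = np.zeros(len(X_unlab))
    for i, x in enumerate(X_unlab):
        for k, cls in enumerate(clf.classes_):
            model = clone(clf).fit(np.vstack([X_lab, x]),
                                   np.append(y_lab, cls))
            p_new = model.predict_proba(X_unlab)
            # Expected 0/1-loss of the retrained model theta+ over the pool.
            risks[i] += proba[i, k] * np.sum(1.0 - p_new.max(axis=1))
    return np.argmin(risks)      # x* minimising the expected future error
```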

Batch-mode AL In conventional AL, instances are chosen one at a time. By contrast, in batch-mode AL a group of samples is chosen for labelling. It is particularly useful in limited-time settings and when re-training the model is computationally expensive. However, the challenge with this kind of strategy is creating a batch of instances that does not contain redundant examples, as may happen when myopically choosing the N best instances. To ensure this, one can employ the density methods described above, which introduce representativeness. Moreover, some authors [24] suggest measures ensuring that the selected instances within a batch are not similar to each other or to already labelled examples, thus incorporating diversity. This can be accomplished, e.g., by clustering the unlabelled instances and choosing one example from each cluster using uncertainty sampling, as sketched below.
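The clustering-based batch construction mentioned above could look as follows, with k-means as one possible clustering choice (a sketch, not a prescribed method):

```python
import numpy as np
from sklearn.cluster import KMeans

def diverse_batch(X_unlab, entropy_scores, batch_size):
    """Select a diverse batch: cluster the unlabelled pool and take the
    most uncertain sample from each cluster."""
    labels = KMeans(n_clusters=batch_size, n_init=10).fit_predict(X_unlab)
    batch = []
    for c in range(batch_size):
        members = np.flatnonzero(labels == c)
        batch.append(members[np.argmax(entropy_scores[members])])
    return batch
```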

IV. SEMI-SUPERVISED LEARNING ON GRAPHS

Active and semi-supervised learning share a similar problem setting: a small set of labelled instances and a large pool of unlabelled data. Both can be employed when little annotated data is available, and both try to make the most of it. It therefore makes sense to combine techniques from these two fields to improve performance over conventional techniques. Below we present two semi-supervised methods that can be combined with an AL strategy, especially when the problem can be represented as a graph.

A. Propagation algorithms

Many real-life problems can be represented in the form of a graph in which the instances are related to each other. The key to exploiting this property is a prior assumption of consistency: nearby points are likely to have the same label, and points in the same cluster are likely to have the same label. When little annotated data is available, exploiting these assumptions may significantly improve the outcome of classification, and such methods have been shown to outperform some popular algorithms. They also prove useful for enforcing the consistency of a computed result and correcting errors [25].

Suppose we have $l$ labelled examples in a set $L = \{x_1, \ldots, x_l\}$ and $u$ unlabelled points $U = \{x_{l+1}, \ldots, x_{l+u}\}$, in total $n = l + u$ samples. Additionally, we have a connected graph $G = (V, E)$, where the vertices $V$ can be thought of as data points and the edges $E$ as relations between them. The affinity matrix $W$ can be considered a similarity matrix between instances, e.g. a radial basis function with entries $w_{ij} = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$ if $x_i$ and $x_j$ are connected and 0 otherwise. Lastly, define $D$, a diagonal matrix with $d_{ii}$ equal to the sum of the $i$-th row of $W$.
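For concreteness, the construction of W and D reads, for instance (edge list and bandwidth σ assumed given):

```python
import numpy as np

def affinity_and_degree(X, edges, sigma=1.0):
    """Build the affinity matrix W (RBF weights on connected pairs only)
    and the diagonal degree matrix D used throughout this section."""
    n = len(X)
    W = np.zeros((n, n))
    for i, j in edges:                      # edges of the graph G = (V, E)
        w = np.exp(-np.sum((X[i] - X[j]) ** 2) / (2 * sigma ** 2))
        W[i, j] = W[j, i] = w
    D = np.diag(W.sum(axis=1))              # d_ii = sum of i-th row of W
    return W, D
```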

Gaussian random field and harmonic functions Zhu et al. [26] formulated the semi-supervised problem on a graph in terms of a Gaussian Random Field. In this approach the labels of the labelled instances are binary, while those of the unlabelled examples are continuous; this relaxation makes the computation much easier. The cost function, here called the “energy”, is defined as:

$$E(Y) = \frac{1}{2} \sum_{i,j} w_{ij} \left(y(i) - y(j)\right)^2 \qquad (7)$$

This formulation ensures that low energy corresponds to labels that vary slowly over the graph. Consider the Gaussian field:

$$p(Y) = \frac{1}{Z} \exp\left(-\beta E(Y)\right) \qquad (8)$$

where β is called an “inverse temperature” parameter and Z is a normalisation constant. There is a clear analogy between Gaussian and Markov Random Fields, with the difference that the former allows a continuous state space.

The minimum energy function $f^*$ is harmonic, which means that its value at each unlabelled vertex equals the weighted average of its neighbouring nodes. By definition $f^*$ is the mode of the field, but since we deal with a joint Gaussian distribution it is equivalent to the field's mean.

Let us introduce the $n \times n$ matrix called the combinatorial Laplacian, $\Delta = D - W$. It can then be partitioned into blocks corresponding to labelled and unlabelled data:

$$\Delta = \begin{bmatrix} \Delta_{ll} & \Delta_{lu} \\ \Delta_{ul} & \Delta_{uu} \end{bmatrix}$$

If we let

$$f = \begin{bmatrix} f_l \\ f_u \end{bmatrix}$$

where $f_l = y_L$ and $f_u$ are the mean values at the unlabelled data points, which we want to infer, then the prediction on the unlabelled nodes is given by

$$f_u = -\Delta_{uu}^{-1} \Delta_{ul} f_l \qquad (9)$$

Finally, to classify an unlabelled example we set its label to 1 if $f_i > 0.5$ and to 0 otherwise.
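Eq. 9 translates directly into a few lines of numpy. The sketch below assumes the labelled nodes are ordered first and uses a linear solve rather than an explicit matrix inverse, which is both faster and numerically safer:

```python
import numpy as np

def harmonic_solution(W, f_l):
    """Solve Eq. 9. Assumes the first l nodes are labelled (f_l in {0, 1})
    and the remaining u nodes are unlabelled."""
    l = len(f_l)
    D = np.diag(W.sum(axis=1))
    Lap = D - W                                 # combinatorial Laplacian
    L_uu, L_ul = Lap[l:, l:], Lap[l:, :l]
    f_u = -np.linalg.solve(L_uu, L_ul @ f_l)    # f_u = -Δ_uu^{-1} Δ_ul f_l
    return f_u, (f_u > 0.5).astype(int)         # continuous values, labels
```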

The harmonic function method was evaluated on handwritten character recognition and text classification datasets. In each experiment the classification results were computed as the labelled training set size was gradually increased. The authors used SVM classification with different kernels as a baseline. In most cases the graph was not given, so different structures were considered.


Fig. 3. a) A toy example showing the effect of label propagation along the data manifold. After 400 iterations the right result emerges. b) The classification function starts revealing the true data structure as the algorithm progresses [25].

It was found that the harmonic function outperformed the SVM. It was also observed that sparse nearest-neighbour graphs outperformed fully connected ones, most probably because in the latter the edges between different classes, even with small weights, may create strong connections between the classes. Finally, it was shown that the benefits of the harmonic function over the SVM are most pronounced with small training sets.

The algorithm has a few interpretations, a random walk on the graph being the most relevant one. If we define the transition matrix as $P = D^{-1}W$ and start the walk at some unlabelled node $i$, then the probability of the particle moving to node $j$ after one step is given by $P_{ij}$. If we continue the walk until the first labelled vertex is reached, then $f_i$ is equivalent to the probability of reaching a node with label 1.

Local and global consistency Zhou et al. [25] proposed an algorithm that lets every point iteratively spread information to its direct neighbours until convergence is achieved. It computes a classification function that is smooth with respect to the local and global structure exhibited by the whole dataset.

Given a matrix $Y_0$ consisting of the initial probability distributions estimated by the classifier (or the class assignments for the labelled set), the probabilities are propagated using Algorithm 1.

Construct the symmetrically normalised matrix $S = D^{-1/2} W D^{-1/2}$
while $\|F_t - F_{t-1}\| > \varepsilon$ do
    $F_{t+1} = \alpha S F_t + (1 - \alpha) Y_0$
end
$y_i = \arg\max_k F^*_{ik}$

Algorithm 1: Label propagation algorithm [25].

A symmetric matrix S ensures that the graph is undirected and that information is spread symmetrically. During each iteration every point receives a fraction of the information from its neighbours while retaining its initial assignment. The proportion of neighbour vs. self-information is controlled by the parameter α. It was shown that the solution converges to $F^* = (1 - \alpha)(I - \alpha S)^{-1} Y_0$.
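Algorithm 1 and its closed-form limit can be sketched as follows (a generic implementation; $Y_0$ holds one row per node and one column per class):

```python
import numpy as np

def label_propagation(W, Y0, alpha=0.99, eps=1e-6, max_iter=1000):
    """Algorithm 1 of [25]: iterate F_{t+1} = alpha*S*F_t + (1-alpha)*Y0
    on the symmetrically normalised graph until convergence."""
    d = W.sum(axis=1)
    D_is = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D_is @ W @ D_is                     # S = D^{-1/2} W D^{-1/2}
    F = Y0.copy()
    for _ in range(max_iter):
        F_next = alpha * S @ F + (1 - alpha) * Y0
        if np.abs(F_next - F).max() < eps:
            break
        F = F_next
    return F.argmax(axis=1)                 # y_i = arg max_k F*_ik

# The closed form F* = (1 - alpha) * inv(I - alpha*S) @ Y0 gives the
# same fixed point without iterating.
```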

Additionally, Zhou proposed a regularisation framework for the above iterative algorithm, defining the cost function as:

$$Q(F) = \frac{1}{2} \left( \sum_{i,j=1}^{N} W_{ij} \left\| \frac{F_i}{\sqrt{D_{ii}}} - \frac{F_j}{\sqrt{D_{jj}}} \right\|^2 + \mu \sum_{i=1}^{N} \| F_i - Y_i \|^2 \right) \qquad (10)$$

where µ is a positive regularisation parameter. Two terms can be distinguished in the cost function. The first imposes a smoothness constraint, preventing the classification function from changing too much between neighbours. Note that the local difference between two nearby points is weighted by the similarity between them and the inverse of the relative influence of each node; instead of taking the simple absolute difference between two labels, their values are spread along all neighbouring edges. The second term can be regarded as a fitting constraint, which does not let the function deviate too much from the initial label assignment (both labelled and unlabelled). The relative importance of the two terms is captured by µ. The optimal function $F^*$ is the one that minimises this cost. By taking the derivative of Q with respect to F, setting it to 0 and introducing new variables, it can be shown that the optimum recovers the closed-form solution of the iterative algorithm.

The method was verified on a toy dataset (Fig. 3) and two real-life examples: digit recognition and text classification. The results were compared to two baselines, k-NN and SVM with an RBF kernel, while varying the labelled set size. The proposed method turned out to be superior to both baselines. It also outperformed the Gaussian harmonic function described above. The effect of information propagation can be seen in the toy example: as the algorithm progresses, the labels diffuse along the moons while the classification function reveals the true structure of the data. The two moons finally emerge after 400 iterations.

B. Combining active and semi-supervised learning

As mentioned in the previous sections, the semi-supervised methods were evaluated by randomly selecting several labelled nodes and testing the classification results on the unlabelled instances. In practice, we could instead employ one of the described AL techniques to smartly choose the most informative examples. The method proposed in [26] is an extension of the harmonic function approach. It greedily queries instances that minimise the risk of the harmonic energy function, defined as:

$$R(f) = \sum_{i=1}^{n} \sum_{y_i = 0,1} \left[\operatorname{sgn}(f_i) \neq y_i\right] p(y_i \mid L) \qquad (11)$$

where $\operatorname{sgn}(f_i)$ is the decision rule described in the harmonic function subsection. Note that this is equivalent to the expected error reduction query strategy.


Fig. 4. Accuracy vs. number of queried labels for conventional, probability-propagated (PP) and diversity strategies; the plotted curves are Full dataset, Random, Uncertainty, Pair and PP, Pair Diversity and PP, and Pair Diversity PP start point. a) Roads. b) Neurons. c) An example of an active query. Thanks to employing diversity measures we can query long segments of consecutive edges at once.

In order to approximate the unknown distribution $p(y_i \mid L)$, we assume that $p(y_i \mid L) \approx f_i$. Under this assumption the estimated risk becomes:

$$R(f) = \sum_{i=1}^{n} \min(f_i, 1 - f_i) \qquad (12)$$

After querying $x_k$ and receiving a new label $y_k$, the harmonic function changes to $f^{+(x_k, y_k)}$ and the estimated risk changes accordingly. Since at selection time we do not know which label we will receive, we compute the expected estimated risk after querying $x_k$ as a weighted average of the two possible estimated risks:

$$R(f^{+x_k}) = (1 - f_k)\, R(f^{+(x_k,0)}) + f_k\, R(f^{+(x_k,1)}) \qquad (13)$$

where once again we exploit the assumption that $p(y_i \mid L) \approx f_i$. The next unlabelled node to query satisfies:

$$x^* = \arg\min_{x \in U} R(f^{+x}) \qquad (14)$$

The described method requires recomputing the harmonic function for each unlabelled instance, which is computationally intensive due to the inversion of the Laplacian. However, for a Gaussian Random Field there exists an efficient way of retraining, and the solution after fixing $x_k$ becomes:

$$f_u^{+(x_k, y_k)} = f_u + (y_k - f_k)\, \frac{(\Delta_{uu}^{-1})_{\cdot k}}{(\Delta_{uu}^{-1})_{kk}} \qquad (15)$$

where $(\Delta_{uu}^{-1})_{\cdot k}$ is the $k$-th column of the inverse of the unlabelled block of the graph Laplacian and $(\Delta_{uu}^{-1})_{kk}$ is its $k$-th diagonal element.
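Putting Eqs. 12-15 together, one query step of the risk-minimisation strategy can be sketched as follows (an unoptimised illustration under our own conventions: binary labels, labelled nodes ordered first):

```python
import numpy as np

def query_by_risk(W, y_l):
    """Risk-minimisation query of [26]: pick the unlabelled node whose
    labelling minimises the expected estimated risk (Eqs. 12-15)."""
    l = len(y_l)
    Lap = np.diag(W.sum(axis=1)) - W
    L_uu_inv = np.linalg.inv(Lap[l:, l:])
    f = -L_uu_inv @ (Lap[l:, :l] @ y_l)         # harmonic function f_u
    best, best_risk = None, np.inf
    for k in range(W.shape[0] - l):
        col = L_uu_inv[:, k] / L_uu_inv[k, k]   # (Δ_uu^{-1})_k / (Δ_uu^{-1})_kk
        exp_risk = 0.0
        for y_k, p in ((0, 1 - f[k]), (1, f[k])):   # Eq. 13 expectation
            f_plus = f + (y_k - f[k]) * col          # Eq. 15 efficient update
            exp_risk += p * np.sum(np.minimum(f_plus, 1 - f_plus))  # Eq. 12
        if exp_risk < best_risk:
            best, best_risk = k, exp_risk
    return l + best       # index of the query in the original node ordering
```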

The risk-estimation AL with the harmonic function was evaluated on a synthetic dataset as well as on a few text classification tasks. In the first experiment it was shown that the risk minimisation approach is superior to conventional uncertainty sampling when it comes to finding the global structure of the problem and estimating the informativeness of a sample. For text classification, AL risk minimisation was contrasted with random querying and SVM classification. It outperformed both, in some simpler cases achieving remarkable accuracy with as few as 5 labelled nodes.

V. RESEARCH PROPOSAL

As discussed in the previous sections, annotating the delineation data is the first and one of the most crucial steps in obtaining accurate results. We therefore propose a framework for efficient and convenient image annotation using AL and semi-supervised techniques. We also propose a few ideas that have the potential to increase the performance and convenience of graph-based reconstruction algorithms.

A. Preliminary results

Most conventional AL techniques exploit neither the unlabelled data nor the relations between instances. While the assumption of independent variables is justifiable in many applications, when the data can be represented as a graph one can expect most neighbouring instances to share the same label. We exploit this relatedness and use the probability propagation algorithm proposed by Zhou et al. [25] to estimate the local uncertainty. We show that it is a more effective estimator of informativeness than simple entropy. The advantages of this approach are twofold: first, it spreads information about already labelled instances around the graph and decreases the entropy of examples that are similar and related to them; second, it directs the queries towards instances in “controversial” areas, e.g. on the interface between two classes. Note that this approach is similar to the one described in the previous section, with the difference that we exploit probability propagation to improve AL rather than using AL selection for better classification. As baselines, we used random querying and uncertainty querying. Our approach was validated on 2D and 3D datasets representing different tree-like structures, namely roads and neurons. The queries corresponded to edges of an overcomplete graph as described in Section II.
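The core idea can be illustrated schematically as follows; this is a sketch of the principle (propagate classifier probabilities with the algorithm of [25], then query the edge with the highest propagated entropy), not our exact implementation:

```python
import numpy as np

def propagated_entropy_query(W, Y0, alpha=0.99):
    """Sketch: propagate initial class probabilities Y0 over the edge
    graph, then pick the instance with the highest propagated entropy."""
    d = W.sum(axis=1)
    D_is = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D_is @ W @ D_is
    n = W.shape[0]
    # Closed-form propagation: F* = (1 - alpha) (I - alpha S)^{-1} Y0.
    F = (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, Y0)
    P = F / F.sum(axis=1, keepdims=True)        # renormalise rows
    H = -np.sum(P * np.log(P + 1e-12), axis=1)  # propagated entropy
    return np.argmax(H)
```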

In order to make the annotation process faster and more convenient, we can construct queries consisting of multiple consecutive edges. Contrary to querying instances at random locations, this does not require a tedious search through the whole stack to find the queried samples, and it also gives the user some context. However, some data within a batch can be redundant, e.g. when the edges are uncertain but very similar to each other. This problem can be tackled by


introducing the notion of batch diversity, which captures how similar the batch is to the unlabelled and labelled samples. It was shown that choosing the most diverse, but at the same time representative, batch allows the desired performance to be reached more quickly.

Last but not least, we investigated whether the choice of “influential” instances as starting points further improves performance. Once again we used the fact that our problem can be represented as a weighted graph and that the most influential nodes can be assumed to be the ones with the highest centrality. This choice was expected to maximally spread the information around the network from the very beginning. The most reliable measures turned out to be the ones that take into account the graph structure in both Euclidean and feature space (e.g. betweenness and related centrality measures). As shown in Fig. 4, the influential start points approach, combined with probability-propagated diverse querying, turned out to yield the best results. The algorithm will eventually be plugged into an interactive annotation tool.

B. Future work

Although automated reconstruction algorithms have recently seen considerable improvements, the complexity of the data and imaging artefacts pose challenges that may not be overcome in the foreseeable future. Moreover, some researchers claim that full automation is not even desirable, as users want to keep control over the results. For that reason it is important to develop tools that can gracefully combine automated algorithms with direct user manipulation.

In the reconstruction problem this can be achieved by introducing a UI in which the user can see preliminary results and correct mistakes. Conventionally, the corrections do not influence the underlying model, meaning that the user has to correct all errors manually. However, it should be beneficial to learn from such mistakes by evolving the initial model as more data is obtained. In order to implement such a method, two questions have to be answered:

1) How can new knowledge be incorporated into an existing model in order to fix analogous errors and achieve better accuracy faster?

2) How should the samples to be examined by the user be chosen in order to improve accuracy as quickly as possible?

We will now discuss some of the potential solutions.

Learning from corrections This can be achieved in several ways. Firstly, we can add additional constraints to the IP procedure. However, as the user continues correcting, the system might become over-constrained, making the problem infeasible to solve. Moreover, it may take a considerable amount of time to recompute a new model, which is inconvenient for the user. A possible solution would be to partition the original graph and take the subgraph that contains the edges in a close neighbourhood of the corrected region. The IP can then be solved only for this small subgraph and the result combined with the rest of the graph. This would speed up the computation and prevent uncontrolled changes in distant regions of the graph.

Fig. 5. Examples of common errors. Top left: faint structure not detected; early branch termination. Top right: gap in a branch. Bottom left: incorrect crossing of two branches. Bottom right: spurious structure detected.

Alternatively, we could exploit one of the semi-supervised propagation algorithms described in Section IV, which can quickly propagate the corrections. Their effect is similar to a constrained Q-MIP and has the potential to fix errors in the neighbourhood of the corrected example. Correction propagation can be applied on top of any reconstruction method that uses a graph representation, and therefore constitutes a very general approach.

Another idea would be to group similar errors into clusters, as proposed in [11]. Then it would be sufficient for the user to correct one error per cluster and apply the same transformation to the rest of the examples in the corresponding group. A potential danger of this approach is that it may induce big global changes if the potential errors are not detected and classified accurately. A possible solution would be to compute the probability of a given structure being an error and then set the acceptance threshold depending on the user's preferences.

Detecting mistakes Another important issue is assisting the user in detecting mistakes. This can be done in two ways. To start with, the user can be presented with a visualisation representing the informativeness of each edge and asked to correct the mistake with the highest informativeness. Such a method was exploited in [8], where the user was prompted to correct the mistakes with the highest entropy. The main disadvantage of this approach is that it requires a visual inspection of the whole image, which may be particularly cumbersome in the case of 3D stacks.

An alternative solution would be to detect potential mistakes automatically and present them to the user for feedback. This, however, may lead to labelling samples that were correctly classified and would in the end require more interaction. It is therefore crucial to develop a reliable method for automatic error suggestion. In [19] this is achieved by minimising an expected prediction error, approximated using transductive Rademacher complexity.


The risk minimisation method, described in Section IV in the context of AL, could also be used in this situation.

However, the above methods do not take into account the geometric appearance of the reconstruction, which has been shown to be an important aspect. By combining neuroscientists' observations with rules developed for the automatic generation of neural structures, we can learn which features are associated with correct and erroneous delineations. These can include the consistency of radius widths, radius tapering away from the root, consistency of orientations of consecutive segments, tortuosity and branch length, to name just a few. We propose combining the path classification response, which carries image information, with a geometric term to obtain a Bayesian formulation that can reveal possible mistakes and the level of confidence. To our knowledge this would be the first automatic reconstruction error detector. Ultimately, we would like to construct a “dictionary” of common mistakes, such as those shown in Fig. 5, and learn from interactive editing the best transformations to apply in order to obtain the best possible reconstruction.
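By way of illustration only, such geometric features could be computed along a traced branch as follows; this is a hypothetical sketch, since the actual feature set is future work:

```python
import numpy as np

def branch_geometry_features(points):
    """Hypothetical examples of geometric features that could enter the
    proposed error detector; points is an (n, 3) polyline of one branch."""
    seg = np.diff(points, axis=0)                # consecutive segment vectors
    seg_len = np.linalg.norm(seg, axis=1)
    length = seg_len.sum()
    chord = np.linalg.norm(points[-1] - points[0])
    tortuosity = length / max(chord, 1e-12)      # 1.0 for a straight branch
    # Consistency of orientations of consecutive segments (mean cosine).
    cos = np.sum(seg[:-1] * seg[1:], axis=1) / np.maximum(
        seg_len[:-1] * seg_len[1:], 1e-12)
    return {"length": length, "tortuosity": tortuosity,
            "mean_direction_cosine": cos.mean()}
```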

VI. CONCLUSION

The increasing amount of imaging data containing curvilinear structures such as roads, neurons and blood vessels has revived the interest in automated delineation. Despite sustained efforts, fully-automated approaches are still susceptible to noise and cannot cope with low-quality data. As these challenges might not be overcome in the foreseeable future, it is important to devise methods that can combine the advantages of automated methods with the benefits of interactive learning.

Our work in the area of AL and semi-supervised learning has shown that it is possible to considerably reduce the need for human intervention during the annotation process. This is an important step in the reconstruction pipeline, as most state-of-the-art methods use supervised learning and require labelled training data. We showed that it is beneficial to consider neighbourhood information in the query strategy and to introduce diverse queries that are informative for all the data.

We would like to further apply the ideas of information propagation and geometric consistency to improve the error correction process. Eventually, the rules used for finding the samples that should be investigated by the user can be used for automated error detection and constitute a post-processing step.

REFERENCES

[1] D. Angluin. Queries and concept learning. Mach. Learn., 2(4):319–342, Apr. 1988.

[2] L. Atlas, D. Cohn, R. Ladner, M. A. El-Sharkawi, and R. J. Marks II. Training connectionist networks with queries and selective sampling. In Advances in Neural Information Processing Systems 2, pages 566–573. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990.

[3] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, 1996.

[4] A. Culotta and A. McCallum. Reducing labeling effort for structured prediction tasks. In M. M. Veloso and S. Kambhampati, editors, AAAI, pages 746–751. AAAI Press / The MIT Press, 2005.

[5] I. Dagan and S. P. Engelson. Committee-based sampling for training probabilistic classifiers. In Proceedings of the Twelfth International Conference on Machine Learning, pages 150–157. Morgan Kaufmann, 1995.

[6] A. Fujii, T. Tokunaga, K. Inui, and H. Tanaka. Selective sampling for example-based word sense disambiguation. Comput. Linguist., 24(4):573–597, Dec. 1998.

[7] J. Huang, S. Erekia, Y. Song, H. Zha, and C. L. Giles. Efficient multiclass boosting classification with active learning. In Seventh SIAM International Conference on Data Mining (SDM 2007). Society for Industrial and Applied Mathematics, September 2007.

[8] T. Kristjansson, A. Culotta, P. Viola, and A. McCallum. Interactive information extraction with constrained conditional random fields. In Proceedings of the 19th National Conference on Artificial Intelligence, AAAI'04, pages 412–418. AAAI Press, 2004.

[9] D. D. Lewis and W. A. Gale. A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '94, pages 3–12, New York, NY, USA, 1994. Springer-Verlag New York, Inc.

[10] Y. Liu. Active learning with support vector machine applied to gene expression data for cancer classification. Journal of Chemical Information and Computer Sciences, 44:1936–1941, 2004.

[11] J. Luisi, A. Narayanaswamy, Z. Galbreath, and B. Roysam. The FARSIGHT trace editor: An open source tool for 3-D inspection and efficient pattern analysis aided editing of automated neuronal reconstructions. Neuroinformatics, 9(2-3):305–315, 2011.

[12] A. McCallum and K. Nigam. Employing EM and pool-based active learning for text classification. In Proceedings of the Fifteenth International Conference on Machine Learning, ICML '98, pages 350–358, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.

[13] H. T. Nguyen and A. W. M. Smeulders. Active learning using pre-clustering. In International Conference on Machine Learning, pages 623–630, 2004.

[14] H. Peng, F. Long, T. Zhao, and E. W. Myers. Proof-editing is the bottleneck of 3D neuron reconstruction: The problem and solutions. Neuroinformatics, 9(2-3):103–105, 2011.

[15] T. Scheffer, C. Decomain, and S. Wrobel. Active hidden Markov models for information extraction, 2001.

[16] B. Settles. Active learning literature survey. Technical report, 2010.

[17] B. Settles and M. Craven. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1070–1079. Association for Computational Linguistics, 2008.

[18] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT '92, pages 287–294, New York, NY, USA, 1992. ACM.

[19] H. Su, Z. Yin, T. Kanade, and S. Huh. Active sample selection and correction propagation on a gradually-augmented graph. In Conference on Computer Vision and Pattern Recognition, June 2015.

[20] S. Tong and E. Chang. Support vector machine active learning for image retrieval. In Proceedings of the Ninth ACM International Conference on Multimedia, MULTIMEDIA '01, pages 107–118, New York, NY, USA, 2001. ACM.

[21] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. J. Mach. Learn. Res., 2:45–66, Mar. 2002.

[22] E. Turetken, F. Benmansour, B. Andres, H. Pfister, and P. Fua. Reconstructing curvilinear networks using path classifiers and integer programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.

[23] E. Turetken, F. Benmansour, and P. Fua. Automated reconstruction of tree structures using path classifiers and mixed integer programming. In Conference on Computer Vision and Pattern Recognition, June 2012.

[24] Z. Xu, R. Akella, and Y. Zhang. Incorporating diversity and density in active learning for relevance feedback. In European Conference on Information Retrieval, pages 246–257, 2007.

[25] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems 16, pages 321–328. MIT Press, 2004.

[26] X. Zhu, J. Lafferty, and Z. Ghahramani. Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. In ICML 2003 Workshop on The Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, pages 58–65, 2003.