The Twelfth SIAM International Conference on Data Mining, April 27, 2012
Center for Evolutionary Medicine and Informatics
Multi-Task Learning: Theory, Algorithms, and Applications
Jiayu Zhou1,2, Jianhui Chen3, Jieping Ye1,2
1Computer Science and Engineering, Arizona State University, AZ 2The Biodesign Institute, Arizona State University, AZ
3GE Global Research, NY
Tutorial Goals
• Understand the basic concepts in multi-task learning
• Understand different approaches to model task relatedness
• Get familiar with different types of multi-task learning techniques
• Introduce multi-task learning applications
• Introduce the multi-task learning package: MALSAR
Tutorial Road Map
• Part I: Multi-task Learning (MTL) background and motivations
• Part II: MTL formulations
• Part III: Case study of real-world applications
– Incomplete Multi-Source Fusion
– Drosophila Gene Expression Image Analysis
• Part IV: An MTL Package (MALSAR)
• Current and future directions
Multiple Tasks: Examination Score Prediction (Argyriou et al., 2008)
School 1 - Alverno High School; School 138 - Jefferson Intermediate School; School 139 - Rosemead High School (data from the Inner London Education Authority, ILEA).

Each record contains student-dependent features (birth year, previous score, …) and school-dependent features (school ranking, …); the goal is to predict each student's exam score:

School | Student id | Birth year | Previous score | … | School ranking | … | Exam score
1      | 72981      | 1985       | 95             | … | 83%            | … | ?
138    | 31256      | 1986       | 87             | … | 72%            | … | ?
139    | 12381      | 1986       | 83             | … | 77%            | … | ?
Learning Multiple Tasks: Learning Each Task Independently
Each school is treated as a separate task (task 1, …, task 138, task 139). Under single-task learning, an independent model is trained on each school's own records to predict its students' exam scores; each model can fit its own data well, but no information is shared across schools.
Learning Multiple Tasks: Learning the Tasks Simultaneously
Under multi-task learning, the prediction tasks for all schools (task 1, …, task 138, task 139) are learned simultaneously, and the relationships among the tasks are modeled explicitly, so that the limited data of each school can be shared.
Performance of MTL: Evaluation on the School Data
• Predict exam scores for 15362 students from 139 schools
• Describe each student by 27 attributes
• Compare single task learning approaches (Ridge Regression, Lasso) and one multi-task learning approach (trace-norm regularized learning)
Performance measure: N-MSE = mean squared error / variance of the target.
[Figure: N-MSE (roughly 0.7 to 1.05) vs. index of training ratio (1-8) for Ridge Regression, Lasso, and Trace Norm.]
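As a sanity check, the measure above can be computed directly (a minimal sketch; the helper name is ours):

```python
import numpy as np

def n_mse(y_true, y_pred):
    """Normalized MSE: mean squared error divided by the variance of the target."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)
```

A constant predictor at the target mean gives N-MSE = 1, so values below 1 indicate that a model beats the mean baseline.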
Multi-Task Learning
• Multi-task Learning is different from single task learning in the training (induction) process.
• Inductions of multiple tasks are performed simultaneously to capture intrinsic relatedness.
Single-task learning: for each of the tasks 1, …, t, a model is trained on that task's own data, and the training processes are independent.
Multi-task learning: the training data of all t tasks enter a single joint training process that produces the t models; generalization is still evaluated per task.
Web Page Categorization (Chen et al., ICML 2009)
• Classify documents into categories
• The classification of each category is a task
• The tasks of predicting different categories may be latently related
[Figure: classifiers for categories such as Health, Travel, World, Politics, and US; the classifiers' parameters (the models of different categories) are latently related.]
MTL for HIV Therapy Screening (Bickel et al., ICML 2008)
• Hundreds of possible combinations of drugs exist, some of which use similar biochemical mechanisms.
• The samples available for each combination are limited.
• For a patient, predicting the response to one drug combination is a task.
• Multi-task learning exploits the similarity information across combinations.
[Figure: patient treatment records over drugs such as ETV, ABC, AZT, D4T, DDI, and FTC; the response to a new combination (e.g., AZT + ETV + FTC) is to be predicted.]
Other Applications
• Portfolio selection [Ghosn and Bengio, NIPS 1997]
• Collaborative ordinal regression [Yu et al., NIPS 2006]
• Web image and video search [Wang et al., CVPR 2009]
• Disease progression modeling [Zhou et al., KDD 2011]
• Disease prediction [Zhang et al., NeuroImage 2012]
Tutorial Road Map
• Part I: Multi-task Learning (MTL) background and motivations
• Part II: MTL formulations
• Part III: Case study of real-world applications
– Incomplete Multi-Source Fusion
– Drosophila Gene Expression Image Analysis
• Part IV: An MTL Package (MALSAR)
• Current and future directions
How Tasks Are Related
Assumption: all tasks are related.
Methods:
• Mean-regularized MTL
• Joint feature learning
• Trace-norm regularized MTL
• Alternating structure optimization (ASO)
• Shared-parameter Gaussian process
How Tasks Are Related
• Assuming that all tasks are related may be too strong for practical applications.
• There may be some irrelevant (outlier) tasks.
Assumption: there are outlier tasks.
How Tasks Are Related
Assumption: tasks have group structures.
Assumption: tasks have tree structures.
Assumption: tasks have graph/network structures.
Methods: clustered MTL, tree MTL, network MTL.
Multi-Task Learning Methods
• Regularization-based MTL
– All tasks are related: regularized MTL, joint feature learning, low-rank MTL, ASO
– Learning with outlier tasks: robust MTL
– Tasks form groups/graphs/trees: clustered MTL, network MTL, tree MTL
• Other methods
– Shared hidden nodes in neural networks
– Shared-parameter Gaussian process
The Twelfth SIAM Information Conference on Data Mining, April 27, 2012
Center for Evolutionary Medicine and Informatics
Regularization-based Multi-Task Learning
• All tasks are related
– Mean-Regularized MTL
– MTL in high-dimensional feature space
• Embedded Feature Selection
• Low-Rank Subspace Learning
• Clustered MTL
• MTL with Tree/Graph structure
Notation
• We focus on linear models: Yᵢ = Xᵢ Wᵢ, where Xᵢ ∈ ℝ^(nᵢ×d) is the feature matrix of task i (nᵢ samples, d features), Yᵢ ∈ ℝ^(nᵢ×1) is its target vector, and W = [W₁, W₂, …, W_m] ∈ ℝ^(d×m) is the model matrix collecting the weight vectors of the m tasks.
Mean-Regularized Multi-Task Learning (Evgeniou & Pontil, KDD 2004)
• Assumption: the parameter vectors of all tasks are close to each other.
– Advantage: simple, intuitive, easy to implement
– Disadvantage: the assumption may not hold in real applications
The regularization penalizes the deviation of each task from the mean:

min_W ½‖XW − Y‖_F² + λ Σᵢ₌₁^m ‖Wᵢ − (1/m) Σₛ₌₁^m Wₛ‖₂²
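The formulation above can be sketched with plain gradient descent (a minimal sketch: we assume for simplicity that all tasks share one design matrix X, so Y is n×m; hyperparameters are illustrative):

```python
import numpy as np

def mean_regularized_mtl(X, Y, lam=1.0, lr=1e-3, n_iter=2000):
    """Gradient descent for min_W 0.5*||XW - Y||_F^2 + lam * sum_i ||W_i - mean(W)||^2.

    Assumes all tasks share the same design matrix X (n x d); Y is n x m,
    one column of targets per task.
    """
    d = X.shape[1]
    m = Y.shape[1]
    W = np.zeros((d, m))
    # Centering matrix: W @ M subtracts the column (task) mean from each column.
    M = np.eye(m) - np.ones((m, m)) / m
    for _ in range(n_iter):
        # M is symmetric and idempotent, so d/dW ||W M||_F^2 = 2 W M.
        grad = X.T @ (X @ W - Y) + 2.0 * lam * (W @ M)
        W -= lr * grad
    return W
```

Larger λ pulls the task weight vectors toward their common mean; λ = 0 recovers independent least squares per task.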
Multi-Task Learning with High Dimensional Data
• In practical applications, we may deal with high dimensional data.
– Gene expression data, biomedical image data
• Curse of Dimensionality
• Dealing with high-dimensional data in multi-task learning:
– Embedded feature selection: ℓ1/ℓq-norm group Lasso
– Low-rank subspace learning (low-rank assumption): ASO, trace-norm regularization
Multi-Task Learning with Joint Feature Learning
• One way to capture the task relatedness from multiple related tasks is to constrain all models to share a common set of features.
• For example, in school data, the scores from different schools may be determined by a similar set of features.
[Figure: tasks 1-5 jointly select a common subset of features 1-10.]
Multi-Task Learning with Joint Feature Learning (Obozinski et al., Stat Comput 2009; Liu et al., 2010, technical report)
• Using group sparsity: ℓ1/ℓq-norm regularization. When q > 1 we obtain group sparsity.

min_W ½‖XW − Y‖_F² + λ‖W‖₁,q, where ‖W‖₁,q = Σᵢ₌₁^d ‖wⁱ‖_q and wⁱ is the i-th row of W.

Here the output Y (n×m) is approximated by the input X (n×d) times the model W (d×m).
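For q = 2, the regularizer's proximal operator is row-wise soft-thresholding, which plugs into a standard proximal-gradient (ISTA) loop. A sketch under that assumption (function names are ours):

```python
import numpy as np

def prox_l21(W, t):
    """Proximal operator of t * ||W||_{1,2} (sum of row 2-norms):
    row-wise soft-thresholding. Rows with small norm are zeroed
    jointly across all tasks."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
    return W * scale

def joint_feature_learning(X, Y, lam=0.1, n_iter=500):
    """Proximal gradient (ISTA) for min_W 0.5*||XW - Y||_F^2 + lam*||W||_{1,2}."""
    L = np.linalg.norm(X, 2) ** 2      # Lipschitz constant of the smooth part
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        G = X.T @ (X @ W - Y)          # gradient of the least-squares term
        W = prox_l21(W - G / L, lam / L)
    return W
```

The row-wise thresholding is what makes entire features drop out of all tasks at once, rather than entry by entry as in the plain Lasso.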
Writer-Specific Character Recognition (Obozinski, Taskar, and Jordan, 2006)
• Each task is a binary classification between two letters for one writer.
Dirty Model for Multi-Task Learning (Jalali et al., NIPS 2010)
• In practical applications, it is too restrictive to constrain all tasks to share a single structure.
• The model is decomposed as W = P + Q, with a group-sparse component P and an element-wise sparse component Q:

min_{P,Q} ‖Y − X(P + Q)‖_F² + λ₁‖P‖₁,q + λ₂‖Q‖₁
Robust Multi-Task Learning
• Most existing MTL approaches assume all tasks are relevant (assumption: all tasks are related).
• Robust MTL approaches allow irrelevant tasks (assumption: there are outlier tasks).
Robust Multi-Task Feature Learning (Gong et al., 2012, submitted)
• Simultaneously captures a common set of features among the relevant tasks and identifies outlier tasks: W = P + Q, with a row-wise group-sparse component P (jointly selected features) and a column-wise group-sparse component Q (outlier tasks):

min_{P,Q} ‖Y − X(P + Q)‖_F² + λ₁‖P‖₁,q + λ₂‖Qᵀ‖₁,q
Trace-Norm Regularized MTL
• Captures task relatedness via a shared low-rank structure: the models of all tasks are trained jointly so that they lie near a shared low-rank subspace, while generalization is still measured per task.
Low-Rank Structure for MTL
• Assume we have a rank-2 model matrix. Then each task's weight vector is a linear combination of the same two basis vectors b₁ and b₂:
w₁ = α₁b₁ + α₂b₂,  w₂ = β₁b₁ + β₂b₂,  w₃ = γ₁b₁ + γ₂b₂,
and for each task, target ≈ training data × weight vector.
Low-Rank Structure for MTL
• In general, a low-rank model matrix factorizes as

W = [b₁, …, b_p] × T,  T = (τ₁₁ ⋯ τ₁ₘ; ⋮ ⋱ ⋮; τ_p1 ⋯ τ_pm),

where the bᵢ are p basis vectors and T holds the coefficients of the m tasks (m > p).
Low-Rank Structure for MTL (Ji et al., ICML 2009)
• Rank minimization formulation:
– min_W Loss(W) + λ·Rank(W)
– Rank minimization is NP-hard for general loss functions.
• Convex relaxation: trace-norm minimization:
– min_W Loss(W) + λ‖W‖₊, where ‖W‖₊ is the trace norm: the sum of the singular values of W.
– The trace norm has been shown to be a good convex approximation of the rank function (Fazel et al., 2001).
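Proximal-gradient solvers for this relaxation rely on singular value thresholding, the proximal operator of the trace norm. A minimal sketch (the function name is ours):

```python
import numpy as np

def prox_trace_norm(W, t):
    """Singular value thresholding: proximal operator of t * ||W||_*.
    Shrinks each singular value by t; small singular values vanish,
    so the rank of the result drops."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s = np.maximum(s - t, 0.0)
    return (U * s) @ Vt
```

Iterating a gradient step on the loss followed by this shrinkage step yields a low-rank model matrix, in the same way ISTA with soft-thresholding yields a sparse one.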
Low-Rank Structure for MTL: Evaluation on the School Data
• Predict exam scores for 15,362 students from 139 schools; each student is described by 27 attributes.
• Compare Ridge Regression, Lasso, and trace-norm regularization (for inducing a low-rank structure), using N-MSE as the performance measure.
• The low-rank structure (induced via the trace norm) leads to the smallest N-MSE.
[Figure: N-MSE (roughly 0.7 to 1.05) vs. index of training ratio (1-8) for Ridge Regression, Lasso, and Trace Norm.]
Alternating Structure Optimization (ASO) (Ando and Zhang, JMLR 2005)
• ASO assumes that the model of each task is the sum of two components: a task-specific component wᵢ and a component Θvᵢ lying in a low-dimensional subspace Θ shared by all m tasks, so that Xᵢ(Θvᵢ + wᵢ) ≈ yᵢ.
Alternating Structure Optimization (ASO) (Ando and Zhang, JMLR 2005)
• Model for the i-th task: uᵢ = Θvᵢ + wᵢ, with empirical loss

ℒᵢ(Xᵢ(Θvᵢ + wᵢ), yᵢ) = ‖Xᵢ(Θvᵢ + wᵢ) − yᵢ‖²

• ASO simultaneously learns the task models and the shared structure:

min_{Θ,{vᵢ,wᵢ}} Σᵢ₌₁^m [ℒᵢ(Xᵢ(Θvᵢ + wᵢ), yᵢ) + α‖wᵢ‖²]  subject to ΘᵀΘ = I
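A sketch of the alternating optimization under the slide's notation uᵢ = Θvᵢ + wᵢ: for fixed Θ, the optimal vᵢ is Θᵀuᵢ, so each task reduces to a regularized least-squares problem in uᵢ; for fixed {uᵢ}, the optimal Θ consists of the top left singular vectors of U = [u₁, …, u_m] (Ando & Zhang, 2005). Initialization and hyperparameters below are illustrative:

```python
import numpy as np

def aso(Xs, ys, h=2, alpha=1.0, n_iter=20):
    """Alternating optimization sketch for ASO with squared loss.
    Xs: list of (n_i x d) design matrices; ys: list of target vectors;
    h: dimension of the shared subspace Theta (d x h, Theta^T Theta = I)."""
    d = Xs[0].shape[1]
    m = len(Xs)
    Theta = np.linalg.qr(np.random.default_rng(0).normal(size=(d, h)))[0]
    U = np.zeros((d, m))
    for _ in range(n_iter):
        # Step 1: for fixed Theta, v_i = Theta^T u_i is optimal, so the penalty
        # alpha*||w_i||^2 becomes alpha*||(I - Theta Theta^T) u_i||^2 and each
        # task is a regularized least-squares problem in u_i.
        P = np.eye(d) - Theta @ Theta.T      # projector onto the complement; idempotent
        for i, (X, y) in enumerate(zip(Xs, ys)):
            U[:, i] = np.linalg.solve(X.T @ X + alpha * P, X.T @ y)
        # Step 2: for fixed {u_i}, the best shared subspace is spanned by the
        # top-h left singular vectors of U.
        Theta = np.linalg.svd(U, full_matrices=False)[0][:, :h]
    return U, Theta
```

When the true task weights do lie near a common subspace, Θ recovers it after a few alternations.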
iASO Formulation (Chen et al., ICML 2009)

min_{Θ,{vᵢ,wᵢ}} Σᵢ₌₁^m [ℒᵢ(Xᵢ(Θvᵢ + wᵢ), yᵢ) + α‖Θvᵢ + wᵢ‖² + β‖wᵢ‖²]  subject to ΘᵀΘ = I

• Controls both model complexity and task relatedness
• Subsumes ASO (Ando et al., 2005) as a special case
• Naturally leads to a convex relaxation (Chen et al., ICML 2009); the convex relaxed ASO is equivalent to iASO under certain mild conditions
Incoherent Low-Rank and Sparse Structures (Chen et al., KDD 2010)
• ASO uses an ℓ2-norm on the task-specific component; an ℓ1-norm can be used instead to learn task-specific features.
• W = P + Q, where the low-rank component P captures task relatedness and the element-wise sparse component Q captures task-specific features:

min_{P,Q} Σᵢ₌₁^m ℒᵢ(Xᵢ(Pᵢ + Qᵢ), yᵢ) + λ‖Q‖₁  subject to ‖P‖₊ ⩽ η  (a convex formulation)
Robust Low-Rank MTL (Chen et al., KDD 2011)
• Simultaneously performs low-rank MTL and identifies outlier tasks: W = P + Q, where the low-rank component P captures task relatedness and the column-wise group-sparse component Q identifies irrelevant (outlier) tasks:

min_{P,Q} Σᵢ₌₁^m ℒᵢ(Xᵢ(Pᵢ + Qᵢ), yᵢ) + α‖P‖₊ + β‖Qᵀ‖₁,q
Summary
• All multi-task learning formulations discussed above fit into the W = P + Q schema.
– Component P: shared structure
– Component Q: information not captured by the shared structure

Embedded feature selection:
• L1/Lq: P = feature selection (ℓ1/ℓq norm); Q = 0
• Dirty: P = feature selection (ℓ1/ℓq norm); Q = ℓ1-norm (element-wise sparse)
• rMTFL: P = feature selection (ℓ1/ℓq norm); Q = outlier (column-wise ℓ1/ℓq norm)
Low-rank subspace learning:
• Trace norm: P = low-rank (trace norm); Q = 0
• ISLR: P = low-rank (trace norm); Q = ℓ1-norm
• ASO: P = low-rank (shared subspace); Q = ℓ2-norm on the task-specific component
• RMTL: P = low-rank (trace norm); Q = outlier (column-wise ℓ1/ℓq norm)
Multi-Task Learning with Clustered Structures
• Most MTL techniques assume all tasks are related, which is not true in many applications.
• Clustered multi-task learning assumes the tasks have a group structure: the models of tasks from the same group are closer to each other than to those from a different group.
• Assumption: tasks have group structures (e.g., one group of tasks predicts heart-related diseases and another predicts brain-related diseases).
Clustered Multi-Task Learning (Jacob et al., NIPS 2008; Zhou et al., NIPS 2011)
• Use regularization to capture the clustered structure: each task's model is fitted to its own training data, while the models are encouraged to fall into clusters 1, …, k.
Clustered Multi-Task Learning (Zhou et al., NIPS 2011)
• Capture the cluster structure by minimizing the sum-of-squared-error (SSE) objective of k-means clustering over the task models:

min Σⱼ₌₁^k Σ_{v∈Iⱼ} ‖w_v − w̄ⱼ‖₂², where Iⱼ is the index set of the j-th cluster and w̄ⱼ its mean.

• Equivalently: min_F tr(WᵀW) − tr(FᵀWᵀWF), where F is the m×k orthogonal cluster indicator matrix with Fᵢⱼ = 1/√nⱼ if i ∈ Iⱼ and 0 otherwise (task number m > cluster number k).
Clustered Multi-Task Learning (Zhou et al., NIPS 2011)
• Directly minimizing the SSE is hard because of the nonlinear constraint on the indicator matrix F, so the constraint is relaxed to orthogonality (Zha et al., NIPS 2001):

min_{F: FᵀF = I_k} tr(WᵀW) − tr(FᵀWᵀWF)
Clustered Multi-Task Learning (Zhou et al., NIPS 2011)
• CMTL formulation: the α-term captures the cluster structure, and the β-term improves generalization performance:

min_{W, F: FᵀF = I_k} Loss(W) + α[tr(WᵀW) − tr(FᵀWᵀWF)] + β tr(WᵀW)

• CMTL has been shown to be equivalent to ASO, given that the dimension of the shared low-rank subspace in ASO and the cluster number in CMTL are the same.
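The identity behind the clustering term can be checked numerically: for a hard cluster assignment, tr(WᵀW) − tr(FᵀWᵀWF) equals the k-means SSE over the task models (a small sketch; function and variable names are ours):

```python
import numpy as np

def cmtl_penalty(W, labels):
    """Return the k-means SSE over the task columns of W computed two ways:
    directly, and as tr(W^T W) - tr(F^T W^T W F) using the normalized
    cluster indicator F (F[i, j] = 1/sqrt(n_j) if task i is in cluster j)."""
    m = W.shape[1]
    ks = np.unique(labels)
    # Direct sum of squared distances to each cluster's mean model.
    sse = sum(
        np.sum((W[:, labels == j] - W[:, labels == j].mean(axis=1, keepdims=True)) ** 2)
        for j in ks
    )
    # Trace form with the normalized indicator matrix.
    F = np.zeros((m, len(ks)))
    for col, j in enumerate(ks):
        idx = labels == j
        F[idx, col] = 1.0 / np.sqrt(idx.sum())
    trace_form = np.trace(W.T @ W) - np.trace(F.T @ W.T @ W @ F)
    return sse, trace_form
```

Relaxing F from this indicator form to any orthonormal matrix is what turns the combinatorial clustering objective into the tractable trace problem above.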
Convex Clustered Multi-Task Learning (Zhou et al., NIPS 2011)
[Figure: recovered model matrices on synthetic data — ground truth, mean-regularized MTL, trace-norm regularized MTL, and convex relaxed CMTL. The relaxation introduces some noise, and low-rank regularization can also capture the cluster structure well.]
Multi-Task Learning with Tree Structures
• In some applications, the tasks may be equipped with a tree structure:
– Tasks belonging to the same node are similar to each other.
– The similarity between two tasks is related to the depth of their common tree node.
• Assumption: tasks have tree structures (e.g., task a is more similar to task b than to task c).
Multi-Task Learning with Tree Structures (Kim and Xing, ICML 2010)
• Tree-guided group Lasso:

min_W Loss(W) + λ Σⱼ₌₁^d Σ_{v∈V} w_v ‖W_{G_v}ʲ‖₂

where each node v of the tree defines a group of tasks G_v with weight w_v. For example, with three output tasks, the leaves give G_{v1} = {W₁}, G_{v2} = {W₂}, G_{v3} = {W₃}, and the internal nodes give G_{v4} = {W₁, W₂} and G_{v5} = {W₁, W₂, W₃}.
Multi-Task Learning with Graph Structures
• In real applications, the tasks involved in MTL may have a graph structure:
– Two tasks are related if they are connected in the graph.
– The similarity of two related tasks is represented by the weight of the connecting edge.
• Assumption: tasks have graph/network structures.
Multi-Task Learning with Graph Structures
• A simple way to encode the graph structure is to penalize the difference between the models of every pair of tasks connected by an edge. Given the edge set E, the penalty is

Σᵢ₌₁^|E| ‖W_{eᵢ(1)} − W_{eᵢ(2)}‖₂² = ‖WRᵀ‖_F², where R ∈ ℝ^(|E|×m) is the edge-incidence matrix.

• The graph regularization term can also be written in Laplacian form:

‖WRᵀ‖_F² = tr((WRᵀ)ᵀWRᵀ) = tr(WRᵀRWᵀ) = tr(WℒWᵀ), with ℒ = RᵀR the graph Laplacian.
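The equivalence between the edge-wise penalty and the Laplacian form can be verified directly (a small sketch; the incidence-matrix sign convention is ours):

```python
import numpy as np

def graph_penalty(W, edges):
    """Graph regularizer for MTL, computed two ways:
    (1) the sum of squared differences between the weight vectors (columns
    of W) of connected tasks, and (2) the Laplacian form tr(W L W^T) with
    L = R^T R for the edge-incidence matrix R."""
    m = W.shape[1]
    R = np.zeros((len(edges), m))
    for i, (a, b) in enumerate(edges):
        R[i, a], R[i, b] = 1.0, -1.0     # each row of R encodes one edge
    direct = sum(np.sum((W[:, a] - W[:, b]) ** 2) for a, b in edges)
    lap = np.trace(W @ (R.T @ R) @ W.T)  # equals ||W R^T||_F^2
    return direct, lap
```

The Laplacian form is convenient in closed-form solvers, since tr(WℒWᵀ) is a quadratic in W.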
Multi-Task Learning with Graph Structures
• How to obtain the graph information:
– External domain knowledge, e.g., protein-protein interaction (PPI) networks for microarray data
– Discovering task relatedness from data: pairwise correlation coefficients, or sparse inverse covariance estimation (Friedman et al., Biostatistics 2008)
Multi-Task Learning with Graph Structures (Chen et al., UAI 2011; Kim et al., Bioinformatics 2009)
• Graph-guided fused Lasso (e.g., the inputs are gene SNPs and the outputs are phenotypes):

min_W Loss(W) + λ‖W‖₁ + Ω(W)

with the graph-guided fusion penalty

Ω(W) = γ Σ_{e=(m,l)∈E} Σⱼ₌₁^J |W_{jm} − sign(r_{ml}) W_{jl}|
Multi-Task Learning with Graph Structures (Kim et al., Bioinformatics 2009)
• In some applications, we know not only which pairs of tasks are related, but also how strongly they are related.
• Graph-weighted fused Lasso adds this weight information to the fusion penalty:

min_W Loss(W) + λ‖W‖₁ + γ Σ_{e=(m,l)∈E} τ(r_{ml}) Σⱼ₌₁^J |W_{jm} − sign(r_{ml}) W_{jl}|
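The weighted fusion penalty is straightforward to evaluate; a sketch (here the weight function τ defaults to the absolute value, which is one common choice and not necessarily the paper's):

```python
import numpy as np

def gw_fusion_penalty(W, edges, weights, tau=abs):
    """Graph-weighted fusion penalty:
    sum over edges e = (a, b) of tau(r_ab) * sum_j |W[j, a] - sign(r_ab) * W[j, b]|.
    `edges` pairs task (column) indices; `weights` holds the correlations r_ab,
    whose signs flip the fusion direction for negatively correlated tasks."""
    total = 0.0
    for (a, b), r in zip(edges, weights):
        total += tau(r) * np.sum(np.abs(W[:, a] - np.sign(r) * W[:, b]))
    return total
```

Note how a negative correlation fuses W[:, a] toward −W[:, b] rather than toward W[:, b].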
Practical Guideline
• MTL versus STL:
– MTL is preferred when dealing with multiple related tasks, each with a small number of training samples.
• Shared features versus shared subspace:
– Identifying shared features is preferred when the data dimensionality is large.
– Identifying a shared subspace is preferred when the number of tasks is large.
Tutorial Road Map
• Part I: Multi-task Learning (MTL) background and motivations
• Part II: MTL formulations
• Part III: Case study of real-world applications
– Incomplete Multi-Source Fusion
– Drosophila Gene Expression Image Analysis
• Part IV: An MTL Package (MALSAR)
• Current and future directions
Case Study I: Incomplete Multi-Source Data Fusion (Yuan et al., NeuroImage 2012)
• In many applications, multiple data sources contain a considerable amount of missing data.
• In ADNI, over half of the subjects lack CSF measurements, and an independent half of the subjects do not have FDG-PET.
[Figure: block-wise missing pattern over 245 subjects across three sources — PET (features P1-P116), MRI (M1-M305), and CSF (C1-C5); for each subject, some sources are fully observed and others entirely missing.]
Overview of iMSF (Yuan et al., NeuroImage 2012)
• Subjects are partitioned by their pattern of available sources (e.g., Task I through Task IV correspond to different combinations of MRI, PET, and CSF availability), and one model (Model I-IV) is learned per pattern.
• The models are coupled through group-sparsity over the tasks that share each feature:

min_W (1/m) Σᵢ₌₁^m Loss(Xᵢ, Yᵢ, Wᵢ) + λ Σₖ₌₁^d ‖W_{G_k}‖₂
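The task construction can be sketched as grouping subjects by their missingness pattern (here patterns are read off NaN entries; this is our simplification of the iMSF setup):

```python
import numpy as np

def imsf_tasks(X):
    """Sketch of the iMSF task construction: subjects are partitioned by their
    pattern of available measurements (here, which entries are non-NaN), and
    each distinct pattern defines one learning task restricted to the features
    observed under that pattern."""
    observed = ~np.isnan(X)
    patterns = {}
    for i, row in enumerate(observed):
        patterns.setdefault(tuple(row), []).append(i)
    # Each task: (subject indices, indices of the features available to that task).
    return [(idx, np.flatnonzero(np.array(p))) for p, idx in patterns.items()]
```

No imputation is performed; every model only ever sees features that were actually measured for its subjects, which is the point of the formulation.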
iMSF: Performance (Yuan et al., NeuroImage 2012)
[Figure: classification accuracy (roughly 0.79-0.84), sensitivity, and specificity of iMSF versus imputation-based baselines (Zero, EM, KNN, SVD) under three settings (50.0%, 66.7%, 75.0%).]
Case Study II: Drosophila Gene Expression Image Analysis
• Drosophila (the fruit fly) is a favorite model system for geneticists and developmental biologists studying embryogenesis: its small size and short generation time make it ideal for genetic studies.
• In situ hybridization generates images showing when and where individual genes are active; the analysis of such images can potentially reveal gene functions and gene-gene interactions.
Drosophila gene expression pattern images come from the Berkeley Drosophila Genome Project (BDGP, http://www.fruitfly.org/) and Fly-FISH (http://fly-fish.ccbr.utoronto.ca/) [Tomancak et al. (2002) Genome Biology; Lécuyer et al. (2007) Cell].
[Figure: expression images of the genes bgm, en, hb, and runt across developmental stage ranges (stages 1-3 through 13-).]
Comparative image analysis of twist, heartless, and stumps [Tomancak et al. (2002) Genome Biology; Sandmann et al. (2007) Genes & Dev.]
[Figure: stage 4-6 and stage 7-8 images annotated with terms such as anterior endoderm AISN, trunk mesoderm AISN/PR, head mesoderm AISN/PR, subset cellular blastoderm mesoderm AISN, dorsal ectoderm AISN, procephalic ectoderm AISN, anterior endoderm anlage, and yolk nuclei.]
• Such comparative analysis requires the spatial and temporal annotation of expression patterns.
Challenges of Manual Annotation
[Figure: the cumulative number of in situ images grew from near zero in 1992 to roughly 160,000 by 2010, making manual annotation impractical.]
Method Outline (Ji et al., Bioinformatics 2008; Ji et al., BMC Bioinformatics 2009; Ji et al., NIPS 2009)
• Images → feature extraction (sparse coding) → model construction (low-rank multi-task and graph-based multi-task learning)
Low-Rank Multi-Task Learning Model (Ji et al., BMC Bioinformatics 2009)
• A loss term plus a trace-norm penalty (the trace norm is the sum of the singular values) induces a low-rank model matrix W across the annotation tasks:

min_W Σᵢ₌₁^k Σⱼ₌₁^{nᵢ} L(wᵢᵀ xᵢⱼ, Yᵢⱼ) + λ‖W‖₊
Graph-Based Multi-Task Learning Model (Ji et al., SIGKDD 2009)
• A squared (2-norm) loss plus a graph-based fusion penalty over related annotation terms, which admits a closed-form solution:

min_W Σᵢ₌₁^k Σⱼ₌₁^{nᵢ} L(wᵢᵀ xᵢⱼ, Yᵢⱼ) + λ₁ Σ_{(p,q)∈G} g(C_{pq}) ‖w_p − sgn(C_{pq}) w_q‖² + λ₂‖W‖_F²

[Figure: an example weighted term graph over sensory system head, ventral sensory complex, dorsal/lateral sensory complexes, embryonic maxillary sensory complex, embryonic antennal sense organ, and sensory nervous system, with edge weights between 0.31 and 0.79.]
Spatial Annotation Performance
• 50% of the data is used for training and 50% for testing, over 30 random trials.
• Multi-task approaches outperform single-task approaches.
[Figure: AUC (roughly 65-85) across stage ranges 4-6 through 13-16 for SC+LR, SC+Graph, BoW+Graph, and Kernel+SVM.]
Tutorial Road Map
• Part I: Multi-task Learning (MTL) background and motivations
• Part II: MTL formulations
• Part III: Case study of real-world applications
– Incomplete Multi-Source Fusion
– Drosophila Gene Expression Image Analysis
• Part IV: An MTL Package (MALSAR)
• Current and future directions
MALSAR (Multi-tAsk Learning via StructurAl Regularization)
• A multi-task learning package that encodes task relationships via structural regularization
• www.public.asu.edu/~jye02/Software/MALSAR/
MTL Algorithms in MALSAR 1.0
• Mean-regularized multi-task learning
• MTL with embedded feature selection
– Joint feature learning
– Dirty multi-task learning
– Robust multi-task feature learning
• MTL with low-rank subspace learning
– Trace-norm regularized learning
– Alternating structure optimization
– Incoherent sparse and low-rank learning
– Robust low-rank multi-task learning
• Clustered multi-task learning
• Graph-regularized multi-task learning
Tutorial Road Map
• Part I: Multi-task Learning (MTL) background and motivations
• Part II: MTL formulations
• Part III: Case study of real-world applications
– Incomplete Multi-Source Fusion
– Drosophila Gene Expression Image Analysis
• Part IV: An MTL Package (MALSAR)
• Current and future directions
Trends in Multi-Task Learning
• Developing efficient algorithms for large-scale multi-task learning
• Semi-supervised and unsupervised MTL
• Learning task structures automatically in MTL
• Asymmetric MTL, where the task relationship is not mutual
• Cross-domain MTL, where the features may differ across tasks
Acknowledgement
• National Science Foundation
• National Institutes of Health