DESCRIPTION
The TBCP-funded global signals in genomic data project is developing methods and software to view and characterise large correlated signals in genomic data and, in the case of batch effects, to correct for them. In the last year, we have developed Python-based software to quickly identify genomic signals; the next phase is the characterisation of these signals. In parallel, we have finished developing a method to identify and remove batch effects, which outperforms existing methods. While we have several bodies of work in development, this talk focuses on the performance and importance of the new batch effect removal algorithm. The technique maximises the removal of the structured technical noise known as batch effects, under the constraint that the probability of overcorrection is kept to a fraction set by the end user. This tunability allows control of overcorrection, defined as the removal of genuine biological variance along with batch noise. Overcorrection should be minimised because it can lead to false positive results through the artificial deflation of within-group variances. Benchmarking across four datasets against ComBat, the leading technique in current use, we show that the new method is far superior in balancing the removal of batch noise against the preservation of biological signal. Additionally, the new method leaves largely unchanged the one dataset that has no significant batch effect, whereas ComBat reduces the variance of that dataset by over 45%. For noise removal, we use "guided PCA" (gPCA), a recently published quantifier of batch effects, to show the probability that batch effects remain in the data after correction. For signal preservation, we calculate in each case the proportion of the original variance that remains in the dataset after correction.
Batch effect correction: How do we compare against ComBat?
Yalchin Oytam* & Fariborz Sobhanmanesh
Synopsis
Batch Effects:
• Uncorrected (or under-corrected): detrimental reduction in power of test; distortion to multiplicity correction
• Over-corrected: false positives; distortion to multiplicity correction
Novel method, which:
• Quantifies the probability of under/over-correction
• Enables the experimenter to choose confidence/risk (p-value) as the constraint for batch removal
AIM: Benchmark the novel method against ComBat
Summary:
• Discuss batch effects
• Introduce performance criteria
• Compare the two methods
Batch Effects?
•Definition
• Structured technical noise / distortion common to all replicates in a processing batch, varying markedly from batch to batch.
• Pervasive and persistent, even under best practice.
• Not remediable by normalisation techniques.
• Typically account for 20-45% of the power in the measurement data!
Impact of batch effects
          Rep1       Rep2       Rep3       Rep4
Treat1    t11 + B1   t12 + B2   t13 + B3   t14 + B4
Treat2    t21 + B1   t22 + B2   t23 + B3   t24 + B4
Treat3    t31 + B1   t32 + B2   t33 + B3   t34 + B4
Treat4    t41 + B1   t42 + B2   t43 + B3   t44 + B4
Treat5    t51 + B1   t52 + B2   t53 + B3   t54 + B4
Treat6    t61 + B1   t62 + B2   t63 + B3   t64 + B4
Control   c1 + B1    c2 + B2    c3 + B3    c4 + B4
•Differences between B1, B2, B3, and B4 inflate within-treatment variances, diminishing power of any between-treatment comparison test.
•Different genes are affected differently, distorting rank of p-values, and hence distorting multiplicity correction (FDR).
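The first point can be illustrated with a small simulation of the design above: one gene, seven groups (six treatments plus control), four replicates, each replicate processed in its own batch. This is only a sketch under assumed values. The batch offsets, noise level and function names are hypothetical and are not taken from the talk's data or software.

```python
import numpy as np

def within_treatment_variance(data):
    """Mean of the per-treatment (row-wise) sample variances."""
    return float(np.mean(np.var(data, axis=1, ddof=1)))

def simulate_design(batch_offsets, n_treatments=7, noise_sd=0.3, seed=0):
    """Return (clean, observed) arrays of shape (n_treatments, n_batches).

    Row i is one treatment (the last row plays the role of the control);
    column j is replicate j, processed in batch j and shifted by B_j.
    All numeric values here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    offsets = np.asarray(batch_offsets, dtype=float)
    treatment_means = rng.normal(0.0, 1.0, size=(n_treatments, 1))
    noise = rng.normal(0.0, noise_sd, size=(n_treatments, offsets.size))
    clean = treatment_means + noise      # t_ij without any batch effect
    observed = clean + offsets           # broadcasting adds B_j to column j
    return clean, observed

# Hypothetical batch offsets B1..B4 for a single gene.
clean, observed = simulate_design(batch_offsets=[-1.5, 0.5, 2.0, -1.0])
print("within-treatment variance, no batch effect:  ", within_treatment_variance(clean))
print("within-treatment variance, with batch effect:", within_treatment_variance(observed))
```

Because the same offsets B1..B4 are added to every row, the between-treatment differences are unchanged while the within-treatment variance grows, which is what reduces the power of a between-treatment test.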
“What if treatments are not distributed across batches?”
Method: Principal Component Analysis
CSIRO Overcoming the challenges of multiplicity and batch effects
A snapshot of batch correction software
Benchmarking – ComBat vs Our Method
• Two dimensions: Noise Rejection and Signal Preservation
• Noise Rejection: guided PCA (gPCA), a third-party quantification of batch noise in data (Reese et al. 2013)
• Signal Preservation: (data variance after batch correction) / (raw data variance)
•Ideal: Reject all batch noise, without removing any biological variance.
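The two benchmarking dimensions can be sketched in code as follows. This is a simplified illustration, not the software used for the results in this talk. The gPCA part follows a plain reading of the statistic in Reese et al. 2013 (variance along the first batch-guided principal component divided by variance along the first unguided one, with a permutation p-value) and is not the authors' reference implementation. The function names, the samples-by-features layout and the toy data are assumptions.

```python
import numpy as np

def variance_retained(raw, corrected):
    """Signal preservation: total variance after correction / raw total variance.

    Both inputs are (samples x features) arrays.
    """
    def total_variance(x):
        return np.sum(np.var(np.asarray(x, dtype=float), axis=0, ddof=1))
    return float(total_variance(corrected) / total_variance(raw))

def gpca_delta(data, batch):
    """Simplified gPCA statistic: guided PC1 variance / unguided PC1 variance."""
    x = np.asarray(data, dtype=float)
    x = x - x.mean(axis=0)                           # centre each feature
    batch = np.asarray(batch)
    labels = np.unique(batch)
    y = (batch[:, None] == labels[None, :]).astype(float)  # batch indicator matrix
    _, _, vt_unguided = np.linalg.svd(x, full_matrices=False)
    _, _, vt_guided = np.linalg.svd(y.T @ x, full_matrices=False)
    var_guided = np.var(x @ vt_guided[0], ddof=1)
    var_unguided = np.var(x @ vt_unguided[0], ddof=1)
    return float(var_guided / var_unguided)

def gpca_pvalue(data, batch, n_perm=200, seed=0):
    """Permutation p-value: share of label shuffles with delta >= observed."""
    rng = np.random.default_rng(seed)
    batch = np.asarray(batch)
    observed = gpca_delta(data, batch)
    null = np.array([gpca_delta(data, rng.permutation(batch)) for _ in range(n_perm)])
    return observed, float(np.mean(null >= observed))

# Toy data (assumed): 12 samples, 20 features, two batches with a large shift.
rng = np.random.default_rng(1)
batch = np.array([0] * 6 + [1] * 6)
raw = rng.normal(size=(12, 20)) + 10.0 * batch[:, None]

# Naive correction used only for illustration: subtract each batch's mean.
corrected = raw.copy()
for label in np.unique(batch):
    corrected[batch == label] -= raw[batch == label].mean(axis=0)

print("gPCA delta, p-value (raw):", gpca_pvalue(raw, batch))
print("proportion of variance retained:", variance_retained(raw, corrected))
```

Batch-mean subtraction stands in here for a correction method only to exercise the two metrics. It is neither ComBat nor the novel method described in this talk.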
Benchmarking – Cell Data
gPCA p-value for batch effect presence in raw data = 0.008
Benchmarking – Animal Data
gPCA p-value for batch effect presence in raw data = 0.037
Benchmarking – ComBat’s “Native” Dataset
gPCA p-value for batch effect presence in raw data = 0.225
Thank you
CAFHS/Genomics
Yalchin Oytam, Research Scientist
Phone: +61 2 9490 5077
Email: [email protected]
Contact Us
Phone: 1300 363 400 or +61 3 9545 2176
Email: [email protected]
Web: www.csiro.au
Acknowledgements: Konsta Duesing, Mike Buckley, Bill Wilson, Maxine McCall