3/9/2015 Joydeep Ghosh UT-ECE
Approaches to Mining Large-Scale Heterogeneous Data:
Old and New
Prof. Joydeep Ghosh
Schlumberger Centennial Chaired Professor; Fellow, IEEE
Director, IDEAL (Intelligent Data Exploration and Analysis Lab)
University of Texas at Austin
What we do
• Data-Driven Modeling & Knowledge Discovery: “Big Data Predictive and Prescriptive Analytics”
– Data Types:
  • relational databases, distributed sensors, signals, images, web-logs, key-value….
  • data (continuous + symbolic) + domain knowledge
– Tools:
  • Data mining/stats; web mining; machine learning, neural nets, signal/image processing….
– Large Scale System issues
– Speciality: Multi-learner systems
  • Use multiple, complementary approaches for more robust modeling of complex engineering problems
  • Custom models, where “canned solutions” are inadequate.
Multi-sensor Fusion (80s, 90s)
• Blackboards (KBS)
• Multiple Hypothesis Tracking
• Basic Tracking (Kalman filters, Gauss-Markov,..)
• Detection/Identification
Rationale
• The usual ones +
“Important applications can be found in time-critical situations or in situation with a high decision risk, where human deficiencies are to be compensated for by automatically or interactively working fusion techniques (compensating for decreasing attention in routine situations; focusing the attention on anomalous or rare events; complementing limited memory, reaction, or combination capabilities of human beings)” Koch, 2010.
Combining Multiple Classifiers
J. Ghosh, S. Beck and L. Deuser, IEEE Jl. of Ocean Engineering, Vol 17, No. 4, October 1992, pp. 351-363.
[Figure: Pre-processed Data from Observed Phenomenon → feature sets (FFT, Gabor Wavelets, ..., Feature Set M) → classifiers (MLP, RBF, ..., Classifier N) → combiner (Ave/median/..)]
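The "Ave/median/.." combiner can be illustrated with a minimal sketch. It assumes each classifier outputs per-class posterior estimates for the same samples; the function name `combine_posteriors` and the toy numbers are hypothetical and are not taken from the cited paper.

```python
import numpy as np

def combine_posteriors(posteriors, rule="mean"):
    """Combine class-posterior estimates from several classifiers.

    posteriors: array-like of shape (n_classifiers, n_samples, n_classes).
    rule: "mean" (average combiner) or "median" (median combiner).
    Returns (predicted_labels, combined_scores).
    """
    stacked = np.asarray(posteriors, dtype=float)
    if rule == "mean":
        combined = stacked.mean(axis=0)
    elif rule == "median":
        combined = np.median(stacked, axis=0)
    else:
        raise ValueError("rule must be 'mean' or 'median'")
    return combined.argmax(axis=1), combined

# Toy outputs (made up): three classifiers, two samples, two classes.
mlp_out = [[0.9, 0.1], [0.4, 0.6]]
rbf_out = [[0.6, 0.4], [0.3, 0.7]]
third_out = [[0.2, 0.8], [0.9, 0.1]]  # disagrees strongly on both samples

mean_labels, _ = combine_posteriors([mlp_out, rbf_out, third_out], rule="mean")
median_labels, _ = combine_posteriors([mlp_out, rbf_out, third_out], rule="median")
```

On the second toy sample the two rules disagree: the average is pulled toward the dissenting classifier, while the median follows the majority.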
Combining Multiple Clusterings (2002)
• Given a set of provisional partitionings, we want to aggregate them
into a single consensus partitioning, even without access to original
features.
[Figure: Clusterer #1, ... → (individual cluster labels) → (consensus labels)]
Provides Improved Accuracy + Robustness + Knowledge Re-use
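A minimal sketch of label-only consensus follows. It assumes a simple co-association scheme: count how often two objects share a cluster across the provisional partitionings, then group objects that co-occur in more than half of them. This is an illustrative stand-in, not necessarily the consensus function used in the 2002 work; the names `co_association` and `consensus_labels` and the 0.5 threshold are hypothetical choices.

```python
import numpy as np

def co_association(labelings):
    """Fraction of partitionings in which each pair of objects shares a cluster."""
    labelings = np.asarray(labelings)
    n_partitions, n_objects = labelings.shape
    similarity = np.zeros((n_objects, n_objects))
    for labels in labelings:
        similarity += labels[:, None] == labels[None, :]
    return similarity / n_partitions

def consensus_labels(labelings, threshold=0.5):
    """Connected components of the thresholded co-association graph."""
    adjacency = co_association(labelings) > threshold
    n_objects = adjacency.shape[0]
    labels = -np.ones(n_objects, dtype=int)
    current = 0
    for start in range(n_objects):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first traversal of one component
            i = stack.pop()
            for j in np.flatnonzero(adjacency[i]):
                if labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Three provisional partitionings of six objects; only labels are used.
partitionings = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same grouping, different label names
    [0, 0, 1, 1, 1, 1],  # object 2 placed differently
]
consensus = consensus_labels(partitionings)
```

Because only pairwise co-membership is counted, the arbitrary label names used by each clusterer do not matter, and the original features are never needed.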
Factorization of Heterogeneous Data
[Figure: heterogeneous data linking Patients, Diagnoses, Procedures, Medications, Demographics, and Physicians; blocks labeled W, X, Y, Z]
High-throughput Phenotyping on Electronic Health Records using
Multi-Tensor Factorization ($2.2 Mil grant from NSF)
[Figure: phenotyping pipeline. EHR (Site A) and EHR (Site B) → Tensor Construction + Generation: Tensor ≈ λ1 · Phenotype 1 + … + λR · Phenotype R → Refinement → Applications (GWAS, Predictive Models, Cohort Construction); Adaptation]
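The "λ1 · Phenotype 1 + … + λR · Phenotype R" approximation can be sketched as a weighted sum of rank-one tensors. The sketch below assumes a three-mode tensor and a CP-style model; the function name `cp_reconstruct`, the choice of three modes, and the random factors are illustrative assumptions rather than details from the slides.

```python
import numpy as np

def cp_reconstruct(weights, factors):
    """Rebuild a 3-mode tensor as sum_r weights[r] * a_r (outer) b_r (outer) c_r.

    weights: shape (R,), one weight (lambda_r) per candidate phenotype.
    factors: three matrices of shapes (I, R), (J, R), (K, R); column r of
             each matrix describes component r along that mode.
    """
    A, B, C = factors
    return np.einsum("r,ir,jr,kr->ijk", weights, A, B, C)

# Hypothetical sizes: 4 x 3 x 5 tensor, R = 2 components.
rng = np.random.default_rng(0)
A = rng.random((4, 2))
B = rng.random((3, 2))
C = rng.random((5, 2))
lam = np.array([2.0, 0.5])

tensor_hat = cp_reconstruct(lam, (A, B, C))

# The same tensor written as an explicit sum of rank-one terms.
explicit = sum(
    lam[r] * np.multiply.outer(np.multiply.outer(A[:, r], B[:, r]), C[:, r])
    for r in range(2)
)
```

Fitting the weights and factors to observed data requires an optimization procedure, which is outside the scope of this sketch.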