benjamin-taylor
Outline • Introduction • Case study objective • Big data landscape • Problem setup • Results/Conclusion • Future work
@bentaylordata
Introduction • Chemical Engineering (BS/MS/PhD Candidate)
• 5 years Intel/Micron – Photolithography, process control, yield modeling
• AIQ Hedge fund – 600 GPU chip cluster, algorithmic stock modeling, distributed metaheuristic algorithms
• HireVue, Chief Data Scientist – HR analytics, interview modeling
Case Study Objective • Given 400 recorded video interviews for sales positions and post-hire performance data, can improved sorting efficiency be demonstrated out-of-sample?
Input Data Set, V=400
Target Data Set, n=400
Personal Email    Perf
[email protected]  Exceeds
[email protected]  Meets
[email protected]  Below
[email protected]  Meets
[Image: "big data", "hadoop"]
Big data landscape • Big data platforms have motivated innovations in unstructured data handling, including new algorithms and better methods for wrangling unstructured data.
Big data landscape • Unstructured data
– Data that does not have a predefined data model or schema, e.g. tool logs, resumes, cover letters, images, audio, video, Twitter, LinkedIn
• Structured data – Data that fits within a predefined data model. The most common structured formats use a column/row architecture; the most familiar example is spreadsheet software such as Excel.
Problem setup • Unstructured data challenge
– How do we convert the video into a manageable machine ready format? AKA unstructured > structured data.
1D vector representation: 0.23,0.15,0.98,0.63,0.45,0.36…
Method?
Problem Setup • What is done for text modeling?
UNSTRUCTURED
Name: Sally Taylor
GPA: 3.95
Previous Job: Data Scientist
School: Yale
Hobbies: Sky diving
STRUCTURED
F  3.95  Data Scientist  Yale     Sky diving
M  2.93  HR Analyst      SLCC     Poetry
F  3.41  Data Munger     Harvard  Cycling
TOKENIZED
1  3.95  5  310  56
0  2.93  7  520  91
1  3.41  6  240  56
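The unstructured > structured > tokenized progression for text can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names `parse_record` and `tokenize` are hypothetical, the integer IDs are assigned in order of first appearance (so they will not match the token values on the slide), and the name field is dropped as a non-feature.

```python
def parse_record(text):
    """Unstructured -> structured: split 'Key: value' lines into a dict."""
    record = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip()
    return record


def tokenize(records, categorical):
    """Structured -> tokenized: map each categorical value to an integer ID.

    IDs are assigned per column in order of first appearance (an assumption;
    the slides do not say how tokens were chosen).
    """
    vocab = {col: {} for col in categorical}
    rows = []
    for rec in records:
        row = []
        for col, value in rec.items():
            if col in categorical:
                ids = vocab[col]
                row.append(ids.setdefault(value, len(ids)))
            else:
                row.append(float(value))  # numeric columns pass through
        rows.append(row)
    return rows, vocab


raw = """Name: Sally Taylor
GPA: 3.95
Previous Job: Data Scientist
School: Yale
Hobbies: Sky diving"""

sally = parse_record(raw)
sally.pop("Name")  # assumption: identifiers are not used as features

records = [
    sally,
    {"GPA": "2.93", "Previous Job": "HR Analyst", "School": "SLCC", "Hobbies": "Poetry"},
    {"GPA": "3.41", "Previous Job": "Data Munger", "School": "Harvard", "Hobbies": "Cycling"},
]
rows, vocab = tokenize(records, categorical={"Previous Job", "School", "Hobbies"})
for row in rows:
    print(row)
```

Each printed row is a purely numeric vector, which is the form a model can consume directly.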
Problem Setup • Piecemeal the structuring: final outputs are scalars
[Diagram labels: Audio, Video, Text; Signal Processing, Personality, Expression, Signal Processing; us, ts, s]
us = unstructured data
ts = time series data
s = scalar data
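One way to picture the final step, where a time series (ts) is collapsed into scalars (s), is with simple summary statistics. The sketch below is an assumption about what such a reduction could look like, not the method used in the talk; the function name `summarize` and the sample values are made up for illustration.

```python
import statistics


def summarize(ts):
    """Collapse one time series into a handful of scalar features."""
    return {
        "mean": statistics.fmean(ts),
        "std": statistics.pstdev(ts),  # population standard deviation
        "min": min(ts),
        "max": max(ts),
        "range": max(ts) - min(ts),
    }


# Hypothetical signal, e.g. a per-frame measurement extracted from audio.
signal = [0.0, 1.0, 2.0, 3.0, 4.0]
features = summarize(signal)
print(features)
```

Every value in `features` is a scalar, so the output of several such reductions can be placed side by side as columns of one row per interview.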
Combining All Features
Feature Mapping: As the features are produced, they are stored in a matrix where each column represents a feature and each row represents an interview.
X 56.341 -200.45 0 1
2 4 60.71 12 52.15 -350.12 1 1
2 4 60.71 12 52.15 -350.12 1 0
2 3 16.16 21 25.51 -105.21 0 0
NA NA NA NA NA
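A minimal sketch of such a feature matrix follows, with one row per interview and one column per feature. The feature names are hypothetical, missing values (the NA entries) are represented as `None`, and filling them with the column mean is an assumption chosen for illustration; the slides do not state how missing features were handled.

```python
def build_matrix(interviews, feature_names):
    """One row per interview, one column per feature; None marks a missing value."""
    return [[iv.get(name) for name in feature_names] for iv in interviews]


def impute_column_means(matrix):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(matrix[0])
    means = []
    for j in range(n_cols):
        present = [row[j] for row in matrix if row[j] is not None]
        means.append(sum(present) / len(present) if present else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in matrix]


# Hypothetical per-interview features; the third interview produced none.
interviews = [
    {"audio_mean": 60.0, "expression_score": 12.0},
    {"audio_mean": 20.0},
    {},
]
names = ["audio_mean", "expression_score"]

X = build_matrix(interviews, names)
X_filled = impute_column_means(X)
for row in X_filled:
    print(row)
```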
Results:
Model         AUC score
Bernoulli NB  0.75
Other         0.79
67.50% reduction in interview evaluation
>300% increase in concentration
Conclusion: Using structured features from audio and video, we are able to show predictive sorting value in our out-of-sample interviews.
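For readers unfamiliar with the AUC score, it can be read as the probability that a randomly chosen positive example is ranked above a randomly chosen negative one. The sketch below computes it by direct pairwise comparison; the labels and scores are made-up values, not data from the case study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the positive
    example receives the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = strong performer) and model scores.
labels = [1, 1, 0, 0, 1]
scores = [0.9, 0.7, 0.6, 0.2, 0.4]
print(auc(labels, scores))
```

A score of 0.5 corresponds to random ordering and 1.0 to a perfect ordering, which is why AUC is a natural measure of sorting value.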
Future Work:
Feature Engineering
Auto Feature Engineering
Future work involves offloading the feature engineering tasks to a more automated process, such as deep learning or more advanced ensemble modeling methods.
My Contact Info: Twitter: @bentaylordata Email: [email protected] LinkedIn: bentaylordata