Big Data Analytics

Special Topics for Computer Science CSE 4095-001 CSE 5095-005

Fei Wang, Associate Professor
Department of Computer Science and Engineering
[email protected]

Jan 27

Data Representation and Feature Construction

Various Forms of Data

Text Representation

http://www.python-course.eu/text_classification_python.php

Text Representation

Vocabulary

TF-IDF

Term Frequency TF(t,d): the number of times term t appears in document d

Inverse Document Frequency IDF(t,D): the logarithmically scaled inverse fraction of the documents in D that contain term t, i.e. log(# documents in D / # documents containing term t)

TF-IDF(t,d,D) = TF(t,d) * IDF(t,D)
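The definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming raw counts for TF and the natural logarithm for IDF; the function name `tf_idf` and the toy documents are made up for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: TF-IDF weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy corpus (illustrative only).
docs = [
    "smoking is bad for health".split(),
    "e-cigarettes are emerging tobacco products".split(),
    "tobacco smoking harms health".split(),
]
weights = tf_idf(docs)
print(weights[0])
```

A term that occurs in every document gets IDF = log(1) = 0, so it carries no weight, while a term found in a single document gets the largest IDF.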

Text Representation

Myslín, Mark, et al. "Using Twitter to examine smoking behavior and perceptions of emerging tobacco products." Journal of Medical Internet Research 15.8 (2013).

Image Representation

Color (RGB, HSV), Intensity, Location, Texture, SIFT, Wavelet Transform

Image Set

Network Representation

0 0 1 0 0 0
0 0 1 1 0 0
1 1 0 0 1 1
0 1 0 0 0 1
0 0 1 0 0 0
0 0 1 1 0 0

Network Representation

0 0 0 0 0 0
0 0 0 0 0 0
1 1 0 0 0 0
0 1 0 0 0 0
0 0 1 0 0 0
0 0 1 1 0 0
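As a rough sketch of how such a matrix is built, the snippet below constructs an adjacency matrix from an edge list. The edge list is read off the nonzero entries of the lower-triangular matrix above, under the assumption that row i, column j means an edge from node i to node j; the helper name `adjacency` is hypothetical.

```python
# Directed edges (source, target), 1-indexed, taken from the nonzero
# entries of the lower-triangular matrix (assumed: row = source, column = target).
edges = [(3, 1), (3, 2), (4, 2), (5, 3), (6, 3), (6, 4)]
n_nodes = 6

def adjacency(edges, n, directed=True):
    """Build an n x n 0/1 adjacency matrix from 1-indexed edges."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i - 1][j - 1] = 1
        if not directed:
            a[j - 1][i - 1] = 1  # mirror the edge for an undirected graph
    return a

directed = adjacency(edges, n_nodes)
undirected = adjacency(edges, n_nodes, directed=False)

for row in undirected:
    print(*row)
```

The undirected version is always symmetric, whereas the directed version need not be.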

Network Representation

EEG

http://www.dsrc.rpi.edu/epilepsy/

Electronic Health Records

An Electronic Health Record (EHR) is an evolving concept defined as a systematic collection of electronic health information about individual patients or populations

Jensen, Peter B., Lars J. Jensen, and Søren Brunak. "Mining electronic health records: towards better research applications and clinical care." Nature Reviews Genetics (2012).

Vector Based Representation

A vector is a collection of numbers.

Row vector: $v = [v_1, v_2, \ldots, v_d]$

Column vector (the transpose of the row vector): $v' = [v_1, v_2, \ldots, v_d]^{\top}$

Dimensionality: d

Patient Diagnosis Vector: d is the number of distinct diagnosis codes, and v_i is the frequency of the i-th diagnosis code in the patient's historical records
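A minimal sketch of building such a vector, assuming a fixed, ordered code vocabulary. The codes and the patient history below are illustrative values, not data from the lecture, and `diagnosis_vector` is a hypothetical helper name.

```python
from collections import Counter

def diagnosis_vector(history, vocabulary):
    """Count how often each code in `vocabulary` occurs in `history`."""
    counts = Counter(history)
    return [counts[code] for code in vocabulary]

# Illustrative vocabulary of d = 4 diagnosis codes and one patient's history.
vocabulary = ["250.00", "401.9", "428.0", "585.9"]
history = ["401.9", "250.00", "401.9", "428.0", "401.9"]

v = diagnosis_vector(history, vocabulary)
print(v)  # [1, 3, 1, 0]
```

The vector length equals the vocabulary size d regardless of how many records the patient has, which is what makes patients directly comparable.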

Matrix Based Representation

Observation window

Patient EHR Matrix

Time

Raw Medical Features Patient Feature Vector

Patient Similarity

Predictive Modeling

Risk Stratification

. . .

• Jimeng Sun, Fei Wang, Jianying Hu, Shahram Ebadollahi: Supervised patient similarity measure of heterogeneous patient records. SIGKDD Explorations 14(1): 16-24 (2012)

• Fei Wang, Jimeng Sun, Shahram Ebadollahi: Composite distance metric integration by leveraging multiple experts' inputs and its application in patient similarity assessment. Statistical Analysis and Data Mining 5(1): 54-69 (2012)

• J. Wu, J. Roy, W. F. Stewart: Prediction modeling using EHR data: challenges, strategies, and a comparison of machine learning approaches. Medical Care 48: S106-S113 (2010)

Temporal Matrices Aggregated Vectors
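One simple way to go from a temporal matrix to an aggregated vector is to summarize each feature row over the observation window. The sketch below assumes a features-by-time layout and offers sum, mean, and max as aggregation choices; the layout, the function name `aggregate`, and the toy matrix are all assumptions for illustration.

```python
def aggregate(matrix, start, end, how="sum"):
    """Collapse matrix[f][t] over time steps start..end-1 into one value per feature."""
    if not 0 <= start < end:
        raise ValueError("observation window must be non-empty")
    out = []
    for row in matrix:
        window = row[start:end]
        if how == "sum":
            out.append(sum(window))
        elif how == "mean":
            out.append(sum(window) / len(window))
        elif how == "max":
            out.append(max(window))
        else:
            raise ValueError(f"unknown aggregation: {how}")
    return out

# Toy patient matrix: 3 raw medical features observed over 5 time steps.
ehr = [
    [0, 1, 0, 2, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 3, 0],
]
print(aggregate(ehr, 1, 4))          # counts inside the window
print(aggregate(ehr, 1, 4, "max"))   # peak value inside the window
```

Aggregation discards the ordering of events inside the window, which is the information the sequence-based representation tries to keep.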

Sequence Based Representation

Diagnosis A Medication B Lab Test C

Diagnosis D Medication B . . .

The sequentiality of those events may indicate some impending disease conditions

How to interpret and make use of the sequentiality of the events?

Feature Distribution

http://en.wikipedia.org/wiki/Uniform_distribution_%28continuous%29

Feature Distribution

http://www.mathsisfun.com/data/standard-normal-distribution.html

Feature Distribution

https://www.linkedin.com/pulse/20140215200145-131079-the-myth-of-the-bell-curve

Power-law distributions in empirical data

Box 1: Recipe for analyzing power-law distributed data

This paper contains much technical detail. In broad outline, however, the recipe we propose for the analysis of power-law data is straightforward and goes as follows.

1. Estimate the parameters xmin and α of the power-law model using the methods described in Section 3.

2. Calculate the goodness-of-fit between the data and the power law using the method described in Section 4. If the resulting p-value is greater than 0.1 the power law is a plausible hypothesis for the data, otherwise it is rejected.

3. Compare the power law with alternative hypotheses via a likelihood ratio test, as described in Section 5. For each alternative, if the calculated likelihood ratio is significantly different from zero, then its sign indicates whether the alternative is favored over the power-law model or not.

Step 3, the likelihood ratio test for alternative hypotheses, could in principle be replaced with any of several other established and statistically principled approaches for model comparison, such as a fully Bayesian approach [32], a cross-validation approach [59], or a minimum description length approach [20], although none of these methods are described here.
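Step 1 of the recipe can be illustrated for the continuous case, where the maximum-likelihood estimate of the exponent has the closed form α̂ = 1 + n / Σ ln(x_i / xmin). The excerpt does not spell this formula out, so treat the sketch below as an assumption-laden illustration: xmin is taken as known rather than estimated, and the function names are hypothetical.

```python
import math
import random

def fit_alpha(data, xmin):
    """Continuous power-law MLE for alpha, using only observations >= xmin."""
    tail = [x for x in data if x >= xmin]
    n = len(tail)
    alpha = 1.0 + n / sum(math.log(x / xmin) for x in tail)
    stderr = (alpha - 1.0) / math.sqrt(n)  # leading-order standard error
    return alpha, stderr

def sample_power_law(alpha, xmin, size, rng):
    """Draw samples by inverting the CCDF P(x) = (x / xmin)^(-alpha + 1)."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(size)]

rng = random.Random(0)  # fixed seed so the run is repeatable
data = sample_power_law(alpha=2.5, xmin=1.0, size=20_000, rng=rng)
alpha_hat, stderr = fit_alpha(data, xmin=1.0)
print(f"alpha_hat = {alpha_hat:.3f} +/- {stderr:.3f}")
```

Steps 2 and 3 (goodness-of-fit and likelihood ratio tests) are not covered here; a good estimate of α alone does not show that the data actually follow a power law.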

In the discrete case, x can take only a discrete set of values. In this paper we consider only the case of integer values with a probability distribution of the form

$$p(x) = \Pr(X = x) = C x^{-\alpha}. \tag{2.3}$$

Again this distribution diverges at zero, so there must be a lower bound $x_{\min} > 0$ on the power-law behavior. Calculating the normalizing constant, we then find that

$$p(x) = \frac{x^{-\alpha}}{\zeta(\alpha, x_{\min})}, \tag{2.4}$$

where

$$\zeta(\alpha, x_{\min}) = \sum_{n=0}^{\infty} (n + x_{\min})^{-\alpha} \tag{2.5}$$

is the generalized or Hurwitz zeta function. Table 2.1 summarizes the basic functional forms and normalization constants for these and several other distributions that will be useful.

In many cases it is useful to consider also the complementary cumulative distribution function or CDF of a power-law distributed variable, which we denote $P(x)$ and which for both continuous and discrete cases is defined to be $P(x) = \Pr(X \ge x)$. For instance, in the continuous case

$$P(x) = \int_{x}^{\infty} p(x')\,dx' = \left(\frac{x}{x_{\min}}\right)^{-\alpha+1}. \tag{2.6}$$

In the discrete case

$$P(x) = \frac{\zeta(\alpha, x)}{\zeta(\alpha, x_{\min})}. \tag{2.7}$$
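Equations (2.4), (2.5) and (2.7) can be checked numerically with a small sketch. The Hurwitz zeta function is approximated here by a truncated sum plus an integral tail correction, which is an implementation choice made for this example and not something the paper prescribes; the function names are hypothetical.

```python
import math

def hurwitz_zeta(alpha, q, terms=10_000):
    """Approximate sum_{n>=0} (n + q)^(-alpha) for alpha > 1 and q > 0."""
    head = sum((n + q) ** -alpha for n in range(terms))
    m = terms + q
    # Tail estimate: integral of t^(-alpha) from m to infinity,
    # plus half of the first omitted term.
    tail = m ** (1.0 - alpha) / (alpha - 1.0) + 0.5 * m ** -alpha
    return head + tail

def pmf(x, alpha, xmin):
    """Discrete power-law probability p(x), Eq. (2.4)."""
    return x ** -alpha / hurwitz_zeta(alpha, xmin)

def ccdf(x, alpha, xmin):
    """Complementary CDF P(x) = Pr(X >= x), Eq. (2.7)."""
    return hurwitz_zeta(alpha, x) / hurwitz_zeta(alpha, xmin)

alpha, xmin = 2.5, 2
print(pmf(3, alpha, xmin))
print(ccdf(3, alpha, xmin))
```

Two quick sanity checks follow from the definitions: `ccdf(xmin, alpha, xmin)` equals 1, and `ccdf(x) - ccdf(x + 1)` equals `pmf(x)`.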

Feature Normalization

the others according to the empirical retrieval results that will be presented in Section 5.

3.1. Linear scaling to unit range

Given a lower bound $l$ and an upper bound $u$ for a feature component $x$,

$$\tilde{x} = \frac{x - l}{u - l} \tag{1}$$

results in $\tilde{x}$ being in the [0,1] range.

3.2. Linear scaling to unit variance

Another normalization procedure is to transform the feature component $x$ to a random variable with zero mean and unit variance as

$$\tilde{x} = \frac{x - \mu}{\sigma} \tag{2}$$

where $\mu$ and $\sigma$ are the sample mean and the sample standard deviation of that feature, respectively (Jain and Dubes, 1988).

If we assume that each feature is normally distributed, the probability of $\tilde{x}$ being in the [−1,1] range is 68%. An additional shift and rescaling as

$$\tilde{x} = \frac{(x - \mu)/(3\sigma) + 1}{2} \tag{3}$$

guarantees 99% of $\tilde{x}$ to be in the [0,1] range. We can then truncate the out-of-range components to either 0 or 1.

3.3. Transformation to a Uniform [0,1] random variable

Given a random variable $x$ with cumulative distribution function $F_x(x)$, the random variable $\tilde{x}$ resulting from the transformation $\tilde{x} = F_x(x)$ is uniformly distributed in the [0,1] range (Papoulis, 1991).

3.4. Rank normalization

Given the sample for a feature component for all images as $x_1, \ldots, x_n$, first we find the order statistics $x_{(1)}, \ldots, x_{(n)}$ and then replace each image's feature value by its corresponding normalized rank, as

$$\tilde{x}_i = \frac{\operatorname{rank}_{x_1, \ldots, x_n}(x_i) - 1}{n - 1} \tag{4}$$

where $x_i$ is the feature value for the $i$th image. This procedure uniformly maps all feature values to the [0,1] range. When there are more than one image with the same feature value, for example after quantization, they are assigned the average rank for that value.

3.5. Normalization after fitting distributions

The transformations in Section 3.2 assume that a feature has a Normal $(\mu, \sigma^2)$ distribution. The sample values can be used to find better estimates for the feature distributions. Then, these estimates can be used to find normalization methods based particularly on these distributions.

The following sections describe how to fit Normal, Lognormal, Exponential and Gamma densities to a random sample. We also give the difference distributions because the image similarity measures use feature differences. After estimating the parameters of a distribution, the cut-off value that includes 99% of the feature values is found and the sample values are scaled and truncated so that each feature component have the same range.

Since the original feature values are positive, we use only the positive section of the Normal density after fitting. Lognormal, Exponential and Gamma densities are defined for random variables with only positive values. Other distributions that are commonly encountered in the statistics literature are the Uniform, $\chi^2$ and Weibull (which are special cases of Gamma), Beta (which is defined only for [0,1]) and Cauchy (whose moments do not exist). Although these distributions can also be used by first estimating their parameters and then finding the cut-off values, we will show that the distributions used in this paper can quite generally model features from different feature extraction algorithms.

S. Aksoy, R.M. Haralick / Pattern Recognition Letters 22 (2001) 563–582
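Rank normalization (Section 3.4 of the excerpt) replaces each value by (rank − 1) / (n − 1), with tied values sharing their average rank. The following is a minimal sketch of that rule in plain Python; the function name `rank_normalize` and the sample values are made up for illustration, and the input is assumed to hold at least two values.

```python
def rank_normalize(values):
    """Map each value to (rank - 1) / (n - 1), averaging ranks over ties."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indices in ascending order
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        # Extend `end` over the run of values tied with values[order[pos]].
        end = pos
        while end + 1 < n and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2.0 + 1.0  # 1-based average rank of the run
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return [(r - 1.0) / (n - 1.0) for r in ranks]

print(rank_normalize([10, 30, 20]))  # [0.0, 1.0, 0.5]
print(rank_normalize([5, 5, 7]))     # [0.25, 0.25, 1.0]
```

Because only the ordering matters, the result is unaffected by outliers or by any monotone rescaling of the raw feature.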

Linear Scaling to [0,1]

Scaling to standard normal distribution with zero mean and unit variance

99% of the data in [0,1] range
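The three scalings listed above can be sketched with the standard library alone. This is an illustrative sketch: the function names are hypothetical, the bounds default to the sample minimum and maximum when not given, and the "99%" figure holds only under the assumption that the feature is normally distributed.

```python
import statistics

def linear_scale(xs, lower=None, upper=None):
    """Linear scaling to [0,1]: (x - l) / (u - l)."""
    lower = min(xs) if lower is None else lower
    upper = max(xs) if upper is None else upper
    return [(x - lower) / (upper - lower) for x in xs]

def unit_variance(xs):
    """Zero mean and unit variance: (x - mu) / sigma."""
    mu = statistics.mean(xs)
    sigma = statistics.stdev(xs)  # sample standard deviation
    return [(x - mu) / sigma for x in xs]

def three_sigma_scale(xs):
    """Shift and rescale as ((x - mu) / (3 sigma) + 1) / 2, then truncate to [0,1]."""
    mu = statistics.mean(xs)
    sigma = statistics.stdev(xs)
    scaled = [((x - mu) / (3.0 * sigma) + 1.0) / 2.0 for x in xs]
    return [min(1.0, max(0.0, s)) for s in scaled]

feature = [2, 4, 6]
print(linear_scale(feature))       # [0.0, 0.5, 1.0]
print(unit_variance(feature))      # [-1.0, 0.0, 1.0]
print(three_sigma_scale(feature))
```

Linear scaling is sensitive to a single extreme value, since that value alone sets one of the bounds; the three-sigma variant instead clips such values to 0 or 1.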

Feature Normalization

features