Dear Students – welcome to the second lecture of our course!
1WS 2013/14
A. Holzinger LV 444.152
In this second lecture we start with a look at data sources, review some data structures, discuss standardization versus structurization, review the differences between data, information and knowledge, and close with an overview of information entropy.
A recommendable reference is: Scheinerman, E. R. 2011. Mathematical Notation: A Guide for Engineers and Scientists, which also includes the most important LaTeX commands for producing math symbols.
components through p and w (Golan, Judge, and Miller, 1996);
John von Neumann and his high-speed computer, approx. 1952
Our first question is: Where does the data come from? The second question: What kind of data is this? The third question: How big is this data? So, let us look at some biomedical data sources (see Slide 2-1):
Due to the increasing trend towards personalized and molecular medicine, biomedical data results from various sources in different structural dimensions, ranging from the microscopic world (e.g. genomics, epigenomics, metagenomics, proteomics, metabolomics) to the macroscopic world (e.g. disease spreading data of populations in public health informatics). Just for orientation: the glucose molecule has a size of $900\,\text{pm} = 900 \times 10^{-12}\,\text{m}$ and the carbon atom approx. $300\,\text{pm}$. A hepatitis virus is relatively large with $45\,\text{nm} = 45 \times 10^{-9}\,\text{m}$ and the X chromosome much bigger with $7\,\mu\text{m} = 7 \times 10^{-6}\,\text{m}$.
Here a lot of “big data” is produced, e.g. genomics, metabolomics and proteomics data. This is really “big data”: the data sets are enormously large. Whereas in each individual we estimate many Terabytes ($1\,\text{TB} = 1 \times 10^{12}$ Byte = 1000 GByte) of genomics data, we are confronted with Petabytes of proteomics data, and the fusion of those for personalized medicine results in Exabytes of data ($1\,\text{EB} = 1 \times 10^{18}$ Byte).
Of course these amounts are for each human individual; however, we have a current world population of 7 billion people ($= 7 \times 10^{9}$ people; 1 billion in English corresponds to 1 milliard in many other European languages). So you can see that this is really “big data”. This “natural” data is then fused with “produced” data, e.g. the unstructured data (text) in the patient records, or data from physiological sensors etc.; these data are also rapidly increasing in size and complexity. You can imagine that without computational intelligence we have no chance to survive in these complex big data sets.
Source: http://learn.genetics.utah.edu/content/begin/cells/scale/
Slide labels: C atom 340 pm = 340 × 10^-12 m; glucose molecule 900 pm; hepatitis virus 45 nm = 45 × 10^-9 m; DNA 2 × 10^-9 m; X chromosome 7 × 10^-6 m; microscope 200 × 10^-9 m; confocal microscopy 20 × 10^-6 m; electron microscopy 0.1 × 10^-9 m; Enzyme = Metabolomics
Most of our computers are Von Neumann machines (see chapter 1); consequently, at the lowest physical layer, data is represented as patterns of electrical on/off states (1/0, H/L, high/low). We speak of a bit, which is also known as Bit, the Basic indissoluble information unit (Shannon, 1948). Do not confuse this Bit with the IEC 60027-2 symbol bit (in small letters), which is used as a unit symbol together with binary prefixes (e.g. 1 Kibit = 1024 bit, 1 Byte = 8 bit); refer to http://physics.nist.gov/cuu/Units/binary.html. Beginning with the physical level of data we can determine various levels of data structures (see Slide 2-2):
1) Physical level: in a Von Neumann system: bit; in a quantum system: qubit.
Note: Regardless of its physical realization (e.g. voltage, mechanical state, black/white etc.), a bit is always logically either 0 or 1 (analogous to a light switch). A qubit has similarities to a classical bit, but is overall very different: a classical bit is a scalar variable with the single value of either 0 or 1, so the value is unique, deterministic and unambiguous. A qubit is more general in the sense that it represents a state defined by a pair of complex numbers $(a, b)$, which express the probability that a reading of the value of the qubit will give a value of 0 or 1. Thus, a qubit can be in the state of 0, 1, or some mixture of the 0 and 1 states, referred to as a superposition. The weights of 0 and 1 in this superposition are determined by $(a, b)$ in the following way: $\text{qubit} \triangleq (a, b) \triangleq a \cdot |0\rangle + b \cdot |1\rangle$. Please be aware that this model of quantum computation is not the only one (Lanzagorta & Uhlmann, 2008).
2) Logical level: 1) primitive data types, including: a) Boolean data type (true/false); b) numerical data type (e.g. integer, floating-point numbers (“reals”), etc.); 2) composite data types, including: a) array, b) record, c) union, d) set (stores values without any particular order, and no repeated values), e) object (contains others); 3) string and text types, including: a) alphanumeric characters, b) alphanumeric strings (= sequence of characters to represent words and text).
3) Abstract level: including abstract data structures, e.g. queue (FIFO), stack (LIFO), set (no order, no repeated values), lists, hash table, arrays, trees, graphs, …
4) Technical level: application data formats, e.g. text, vector graphics, pixel images, audio signals, video sequences, multimedia, …
5) Hospital level: narrative (textual, natural language) patient record data (structured/unstructured and standardized/non-standardized), Omics data (genomics, proteomics, metabolomics, microarray data, fluxomics, phenomics), numerical measurements (physiological data, time series, lab results, vital signs, blood pressure, CO2 partial pressure, temperature, …), recorded signals (ECG, EEG, ENG, EMG, EOG, EP, …), graphics (sketches, drawings, handwriting, …), audio signals, images (cams, x-ray, MR, CT, PET, …), etc.
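The difference between a bit and a qubit at the physical level can be illustrated with a minimal sketch. It assumes the standard rule that the probabilities of reading 0 and 1 are the squared magnitudes of the two complex weights, rescaled so they sum to 1; the function name `measurement_probabilities` is hypothetical and only serves this illustration.

```python
import math

def measurement_probabilities(a, b):
    """Return (P(read 0), P(read 1)) for a qubit with complex weights (a, b).

    Assumption: probabilities are |a|^2 and |b|^2, rescaled to sum to 1.
    """
    norm = abs(a) ** 2 + abs(b) ** 2
    if norm == 0:
        raise ValueError("at least one weight must be non-zero")
    return abs(a) ** 2 / norm, abs(b) ** 2 / norm

# A classical bit is the special case where one weight is zero.
bit_zero = measurement_probabilities(1, 0)

# An equal superposition: both readings are equally likely,
# even though the second weight is purely imaginary.
superposition = measurement_probabilities(1 / math.sqrt(2), 1j / math.sqrt(2))
```

The sketch only models a single reading of one qubit; it says nothing about other models of quantum computation.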
In biomedical informatics we have a lot to do with abstract data types (ADT); consequently we briefly review the most important ones here. For details please refer to a course on Algorithms & Data Structures, or to a classic textbook such as (Aho, Hopcroft & Ullman, 1983), (Cormen et al., 2009), or in German (Ottmann & Widmayer, 2012), (Holzinger, 2003), and please take into consideration that data structures and algorithms go hand in hand (Cormen, 2013).
List is a sequential collection of items $a_1, a_2, \ldots, a_n$ accessible one after another, beginning at the head and ending at the tail $z$. In a Von Neumann machine it is a widely used data structure for applications which do not need random access. It differs from the stack (last-in-first-out, LIFO) and queue (first-in-first-out, FIFO) data structures insofar as additions and removals can be made at any position in the list. In contrast to a simple set, the order is important. A typical example for the use of a list is a DNA sequence. The combination GGGTTTAAA is such a list; the elements of the list are the nucleotide bases. Nucleotides are the joined molecules which form the structural units of RNA and DNA and play a central role in metabolism.
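The contrast between list, set, stack and queue can be made concrete with a short sketch using Python's built-in types. The DNA string is the one from the text; the variable names and the inserted base are illustrative assumptions.

```python
from collections import deque

# A list keeps order and repeated elements; edits are allowed at any position.
sequence = list("GGGTTTAAA")
sequence.insert(3, "C")        # add a base in the middle
removed = sequence.pop(0)      # remove the base at the head
edited = "".join(sequence)

# A set forgets both order and repetition.
distinct_bases = set("GGGTTTAAA")

# A stack (LIFO) only gives access to the most recently added element.
stack = ["G", "T", "A"]
last_in = stack.pop()

# A queue (FIFO) only gives access to the earliest added element.
queue = deque(["G", "T", "A"])
first_in = queue.popleft()
```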
Graph is a pair $G = (V, E)$, where $V$ is a finite, non-empty set of vertices (nodes) and $E$ is a set of edges (lines, arcs), which are 2-element subsets of $V$. If $E$ is a set of ordered pairs of vertices (arcs, directed edges, arrows), then it is a directed graph (digraph). The distances between the edges can be represented within a distance matrix (two-dimensional array). The edges in a graph can be multidimensional objects, e.g. vectors containing the results of multiple gene expression measures. For this purpose the distance of two edges can be measured by various distance metrics. Graphs are ideally suited for representing networks in medicine and biology, e.g. metabolic pathways. In bioinformatics, distance matrices are used to represent protein structures in a coordinate-independent manner, as well as the pairwise distances between two sequences in sequence space. They are used in structural and sequential alignment, and for the determination of protein structures from NMR or X-ray crystallography.
Evolutionary dynamics act on populations. Neither genes, nor cells, nor individuals evolve; only populations evolve. The so-called Moran process describes the stochastic evolution of a finite population of constant size: in each time step, an individual is chosen for reproduction with a probability proportional to its fitness; a second individual is chosen for death. The offspring of the first individual replaces the second. On a graph, individuals occupy the vertices. In each time step, an individual is selected with a probability proportional to its fitness; the weights of the outgoing edges determine the probabilities that the corresponding neighbor will be replaced by the offspring. The process is described by a stochastic matrix $W$, where $w_{ij}$ denotes the probability that an offspring of individual $i$ will replace individual $j$. At each time step, an edge $(i, j)$ is selected with a probability proportional to its weight and the fitness of the individual at its tail. The Moran process corresponds to a complete graph with identical weights (Lieberman, Hauert & Nowak, 2005).
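The graph-based update rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of Lieberman, Hauert & Nowak: the function names, the two types "mutant" and "resident", and the fitness values are all hypothetical.

```python
import random

def complete_graph_weights(n):
    """Stochastic matrix W for a complete graph with identical weights."""
    return [[0.0 if i == j else 1.0 / (n - 1) for j in range(n)]
            for i in range(n)]

def moran_step(types, fitness, W, rng):
    """One time step: pick a reproducer i by fitness, then a neighbor j by W[i]."""
    n = len(types)
    i = rng.choices(range(n), weights=[fitness[t] for t in types])[0]
    j = rng.choices(range(n), weights=W[i])[0]
    updated = list(types)
    updated[j] = types[i]          # the offspring of i replaces j
    return updated

def run_until_fixation(types, fitness, W, rng, max_steps=100_000):
    """Iterate until one type has taken over, or max_steps is reached."""
    for _ in range(max_steps):
        if len(set(types)) == 1:
            break
        types = moran_step(types, fitness, W, rng)
    return types

W = complete_graph_weights(5)
fitness = {"mutant": 1.5, "resident": 1.0}      # hypothetical values
population = ["mutant"] + ["resident"] * 4
final = run_until_fixation(population, fitness, W, random.Random(0))
```

The population size stays constant because every step replaces exactly one individual. Other graphs are obtained by supplying a different row-stochastic matrix `W`.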
Tree is a collection of elements called nodes, one of which is distinguished as a root, along with a relation ("parenthood") that places a hierarchical structure on the nodes. A node, like an element of a list, can be of whatever type we wish. We often depict a node as a letter, a string, or a number with a circle around it. Formally, a tree can be defined recursively in the following manner:
1. A single node by itself is a tree. This node is also the root of the tree.
2. Suppose $n$ is a node and $T_1, T_2, \ldots, T_k$ are trees with roots $n_1, n_2, \ldots, n_k$, respectively. We can construct a new tree by making $n$ be the parent of nodes $n_1, n_2, \ldots, n_k$. In this tree $n$ is the root and $T_1, T_2, \ldots, T_k$ are the subtrees of the root. Nodes $n_1, n_2, \ldots, n_k$ are called the children of node $n$.
Dendrogram (from Greek dendron "tree", -gramma "drawing") is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. Dendrograms are often used in computational biology to illustrate the clustering of genes or samples. The origin of such dendrograms can be found in (Darwin, 1859). The example by (Hufford et al., 2012) shows a neighbor-joining tree and the changing morphology of domesticated maize and its wild relatives. Taxa in the neighbor-joining tree are represented by different colors: parviglumis (green), landraces (red), improved lines (blue), mexicana (yellow) and Tripsacum (brown). The morphological changes are shown for female inflorescences and plant architecture during domestication and improvement.
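The recursive definition of a tree translates almost directly into code. The following is a minimal sketch; the class name `Tree`, its methods and the labels are illustrative assumptions.

```python
class Tree:
    """A node together with the subtrees rooted at its children."""

    def __init__(self, label, subtrees=()):
        self.label = label                 # rule 1: a single node is a tree
        self.children = list(subtrees)     # rule 2: the node becomes the parent

    def size(self):
        """Number of nodes in the tree."""
        return 1 + sum(child.size() for child in self.children)

    def height(self):
        """Number of edges on the longest path from the root to a leaf."""
        if not self.children:
            return 0
        return 1 + max(child.height() for child in self.children)

# A small dendrogram-like tree with hypothetical sample labels.
cluster_ab = Tree("cluster AB", [Tree("sample A"), Tree("sample B")])
root = Tree("all samples", [cluster_ab, Tree("sample C")])
```

Both methods recurse in the same way the definition does: they handle the root and delegate the rest to the subtrees.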
Key problems in dealing with data include:
1) Heterogeneous data sources (need for data fusion and data integration)
2) Complexity of the data (high dimensionality)
3) Noisy, uncertain data (challenge of pre-processing)
4) The discrepancy between data, information and knowledge (various definitions)
5) Big data sets (manual handling of the data is impossible)
Now that we have seen some examples of data from the biomedical domain, we can look at the “big picture”. Manyika et al. (2011) localized four major data pools in US health care and describe that the data are highly fragmented, with little overlap and low integration. Moreover, they report that approx. 30% of clinical text/numerical data in the United States, including medical records, bills, laboratory and surgery reports, is still not generated electronically. Even when clinical data are in digital form, they are usually held by an individual provider and rarely shared (see Slide 2-4). Biomedical research data, e.g. clinical trials, predictive modeling etc., is produced by academia and pharmaceutical companies and stored in databases and libraries. Clinical data is produced in the hospital and is stored in hospital information systems (HIS), picture archiving and communication systems (PACS) or in laboratory databases, etc. Much data is health business data produced by payors, providers, insurances, etc. Finally, there is an increasing pool of patient behavior and sentiment data, produced by various customers and stakeholders outside the typical clinical context, including the growing data from the wellness and ambient assisted living domain.
A major challenge in our networked world is the increasing amount of data, today called “big data”. The trend towards personalized medicine has resulted in a sheer mass of generated (-omics) data (see Slide 2-7). In the life sciences domain, most data models are characterized by complexity, which makes manual analysis very time-consuming and frequently practically impossible (Holzinger, 2013).
More and more Omics data are generated, including:
1) Genomics data (e.g. sequence annotation).
2) Transcriptomics data (e.g. microarray data); the transcriptome is the set of all RNA molecules, including mRNA, rRNA, tRNA and non-coding RNA, produced in the cells.
3) Proteomics data: Proteomic studies generate large volumes of raw experimental data and inferred biological results stored in data repositories, mostly openly available; an overview can be found in (Riffle & Eng, 2009). The outcome of proteomics experiments is a list of proteins differentially modified or abundant in a certain phenotype. The large size of proteomics data sets requires specialized analytical tools, which deal with large lists of objects (Bessarabova et al., 2012).
4) Metabolomics (e.g. enzyme annotation); the metabolome represents the collection of all metabolites in a cell, tissue, organ or organism.
5) Protein-DNA interactions.
6) Protein-protein interactions; PPI are at the core of the entire interactomics system of any living cell.
7) Fluxomics (isotopic tracing, metabolic pathways).
8) Phenomics (biomarkers).
9) Epigenetics: the study of changes in gene expression caused by something other than the DNA sequence, hence the prefix “epi-”.
10) Microbiomics.
11) Lipidomics.
Omics data integration helps to address interesting biological questions on the biological systems level towards personalized medicine (Joyce & Palsson, 2006).
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2908408/
A further challenge is to integrate the data and to make it accessible to the clinician. While there is much research on the integration of heterogeneous information systems, a shortcoming is in the integration of available data. Data fusion is the process of merging multiple records representing the same real-world object into a single, consistent, accurate, and useful representation (Bleiholder & Naumann, 2008). An example for the mix of different data for solving a medical problem can be seen in Slide 2-8.
A good example for complex medical data is RCQM, which is an application that manages the flow of data and information in the rheumatology outpatient clinic (50 patients per day, 5 days per week) of Graz University Hospital, on the basis of a quality management process model. Each examination produces 100+ clinical and functional parameters per patient. These amassed data are morphed into more usable information by applying scoring algorithms (e.g. Disease Activity Score, DAS) and are convoluted over time. Together with previous findings, physiological laboratory data, patient record data and Omics data from the Pathology department, these data constitute the information basis for analysis and evaluation of the disease activity. The challenge is in the increasing quantities of such highly complex, multi-dimensional and time series data (Simonic et al., 2011).
Do not confuse structure with standardization (see Slide 2-9). Data can be standardized (e.g. numerical entries in laboratory reports) and non-standardized. A typical example is non-standardized text, imprecisely called “Free-Text” or “unstructured data”, in an electronic patient record (Kreuzthaler et al., 2011).
Standardized data is the basis for accurate communication. In the medical domain, many different people work at different times in various locations. Data standards can ensure that information is interpreted by all users with the same understanding. Moreover, standardized data facilitate comparability of data and interoperability of systems. Standardization supports the reusability of the data, improves the efficiency of healthcare services and avoids errors by reducing duplicated efforts in data entry.
Data standardization refers to a) the data content; b) the terminologies that are used to represent the data; c) how data is exchanged; and d) how knowledge, e.g. clinical guidelines, protocols, decision support rules, checklists, standard operating procedures, is represented in the health information system (refer to IOM). Technical elements for data sharing require standardization of identification, record structure, terminology, messaging, privacy etc. The most used standardized data set to date is the International Classification of Diseases (ICD), which was first adopted in 1900 for collecting statistics (Ahmadian et al., 2011) and which we will discuss in → Lecture 3.
Non-standardized data is the majority of data and inhibits data quality, data exchange and interoperability.
Well-structured data is the minority of data and an idealistic case in which each data element has an associated defined structure, e.g. relational tables, the Resource Description Framework (RDF), or the Web Ontology Language (OWL) (see → Lecture 3). Note: Ill-structured is a term often used for the opposite of well-structured, although this term originally was used in the context of problem solving (Simon, 1973).
Semi-structured data is a form of structured data that does not conform with the strict formal structure of tables and data models associated with relational databases but contains tags or markers to separate structure and content, i.e. it is schema-less or self-describing; a typical example is a markup language such as XML (see → Lecture 3 and 4).
Weakly-structured data is most of our data in the whole universe, whether it is in macroscopic (astronomy) or microscopic structures (biology), see → Lecture 5.
Non-structured data or unstructured data is an imprecise term used for information expressed in natural language, when no specific structure has been defined. This is an issue for debate: text also has some structure (words, sentences, paragraphs). If we are very precise, unstructured data would mean that the data is completely randomized, which is usually called noise and is defined by (Duda, Hart & Stork, 2000) as any property of data which is not due to the underlying model but instead to randomness (either in the real world, from the sensors or the measurement procedure).
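The idea of semi-structured, self-describing data can be illustrated with a small XML fragment parsed by Python's standard library. The record itself (element names, attribute names and values) is entirely hypothetical and does not follow any real clinical standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical patient record: tags and attributes carry the structure,
# while the <note> element still holds narrative free text.
record_xml = """
<patient id="p-001">
  <lab name="haemoglobin" unit="g/dL">13.2</lab>
  <lab name="temperature" unit="C">37.1</lab>
  <note>Patient reports fatigue since last week.</note>
</patient>
"""

root = ET.fromstring(record_xml)

# The markers let a program pull out the structured part without a fixed schema.
labs = {lab.get("name"): (float(lab.text), lab.get("unit"))
        for lab in root.iter("lab")}

# The narrative part stays a plain string that a human has to interpret.
note = root.findtext("note")
```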
“Multivariate” and “multidimensional” are modern words and consequently overused in the literature. Each item of data is composed of variables, and if such a data item is defined by more than one variable it is called a multivariable data item. Variables are frequently classified into two categories: dependent or independent.
In Physics, Engineering and Statistics a variable is a physical property of a subject, whose quantity can be measured, e.g. mass, length, time, temperature, etc.
SMILES data (.smi) consists of a string obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph. The graph is first trimmed to remove hydrogen atoms, and cycles are broken to turn it into a spanning tree. Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes.
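The traversal described above can be sketched for a strongly simplified case. The sketch assumes a hydrogen-suppressed, connected graph with single bonds only and fewer than ten rings; it ignores aromaticity, charges, stereochemistry and canonical ordering. The function name `to_smiles` and the input format are assumptions made for this illustration.

```python
def to_smiles(atoms, bonds, start=0):
    """Write a simplified SMILES string via depth-first traversal.

    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    """
    adjacent = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adjacent[a].append(b)
        adjacent[b].append(a)

    parent = {start: None}
    order = {}                                   # visiting order of each atom
    children = {i: [] for i in adjacent}         # spanning tree edges
    closures = {i: [] for i in adjacent}         # ring labels per atom
    ring_count = 0

    def visit(u):
        nonlocal ring_count
        order[u] = len(order)
        for v in adjacent[u]:
            if v == parent[u]:
                continue
            if v not in order:
                parent[v] = u
                children[u].append(v)
                visit(v)
            elif order[v] < order[u]:
                # Edge back to an ancestor: break the cycle and label both ends.
                ring_count += 1
                closures[u].append(ring_count)
                closures[v].append(ring_count)

    def emit(u):
        text = atoms[u] + "".join(str(r) for r in closures[u])
        kids = children[u]
        for kid in kids[:-1]:
            text += "(" + emit(kid) + ")"        # side branches in parentheses
        if kids:
            text += emit(kids[-1])               # main chain continues
        return text

    visit(start)
    return emit(start)

ethanol = to_smiles(["C", "C", "O"], [(0, 1), (1, 2)])
cyclohexane = to_smiles(["C"] * 6, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)])
isobutane = to_smiles(["C"] * 4, [(0, 1), (0, 2), (0, 3)])
```

The first pass builds the spanning tree and records where cycles were broken; the second pass prints the symbols in depth-first order and attaches the numeric labels.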
Proteomic analysis of mesenchymal stem cells (MSCs). Two-dimensional gel electrophoresis was performed using whole protein cell extracts from P2 MSC cultures of patients with rheumatoid arthritis (RA) (n = 10) (A) and healthy controls (n = 6) (B). After scanning, spot detection, quantification and normalisation, gels were compared using Hierarchical Clustering Software and Pearson test (C). No cluster could be detected using these proteomic profiles.
Proteomic analysis: Two-dimensional electrophoresis was performed using P2 MSCs in patients with RA (n = 10) and healthy controls (n = 6) (fig 4A, B). By using the Hierarchical Clustering method, we could not define any cluster that might discriminate patient and control cells (fig 4C). The Pearson correlation coefficient was not significantly different between patient and control cells (r = 0.933 (0.022) and r = 0.929 (0.020), respectively). These data corroborate the lack of significant changes in cytokine production between patients and controls.
http://www.rcsb.org/pdb/images/3ond_bio_r_500.jpg
The PDB is a large repository containing 3-D structural information, established in 1971.
Data are stored in 2D but can in fact represent biological entities in three or more dimensions.
Transaxial (left), coronal (middle), and sagittal (right) images of a patient who was scanned for 30 min in list-mode with the BrainPET scanner; the recording was started 20 min after injection of about 300 MBq fluor-deoxy-glucose.
In Mathematics, hence in Informatics, however, a variable is associated with a space, often an n-dimensional Euclidean space $\mathbb{R}^n$, in which an entity (e.g. a function) or a phenomenon of continuous nature is defined. The data location within this space can be referenced by using a range of coordinate systems (e.g. Cartesian, polar coordinates, etc.): the dependent variables are those used to describe the entity (for example the function value) whilst the independent variables are those that represent the coordinate system used to describe the space in which the entity is defined. If a data set is composed of variables whose interpretation fits this definition, our goal is to understand how the ‘entity’ is defined within the n-dimensional Euclidean space $\mathbb{R}^n$. Sometimes we may distinguish between variables meaning measurement of property and variables meaning a coordinate system by referring to the former as variate and to the latter as dimension (dos Santos & Brodlie, 2002), (dos Santos & Brodlie, 2004).
A space is a set of points. A metric space has an associated metric, which enables us to measure distances between points in that space and, in turn, implicitly define their neighborhoods. Consequently, a metric provides a space with a topology, and a metric space is a topological one. Topological spaces feel alien to us because we are accustomed to having a metric.
Biomedical example: A protein is a single chain of amino acids, which folds into a globular structure. The Thermodynamics Hypothesis states that a protein always folds into a state of minimum energy. To predict protein structure, we would like to model the folding of a protein computationally. As such, the protein folding problem becomes an optimization problem: we are looking for a path to the global minimum in a very high-dimensional energy landscape (Zomorodian, 2005).
Let us collect $n$-dimensional observations in the Euclidean vector space $\mathbb{R}^n$ and we get (Eq. 2-1):
$x_i = [x_{i1}, \ldots, x_{in}]$
A cloud of points sampled from any source (e.g. medical data, sensor network data, a solid 3-D object, surface etc.). Those data points can be coordinated as an unordered sequence in an arbitrarily high-dimensional Euclidean space, where methods of algebraic topology can be applied. The main challenge is in mapping the data back into $\mathbb{R}^3$ or, to be more precise, into $\mathbb{R}^2$, because our retina is inherently perceiving data in $\mathbb{R}^2$. The cloud of such data points can be used as a computational representation of the respective data object. A temporal version can be found in motion-capture data, where geometric points are recorded as time series. Now you will ask an obvious question: “How do we visualize a four-dimensional object?” The obvious answer is: “How do we visualize a three-dimensional object?” Humans do not see in three spatial dimensions directly, but via sequences of planar projections integrated in a manner that is sensed if not comprehended. Little children spend a significant time of their first year of life learning how to infer three-dimensional spatial data from paired planar projections, and many years of practice have tuned a remarkable ability to extract global structure from representations in a strictly lower dimension (Ghrist, 2008). Because we have the same problem here in this book, we must stay in $\mathbb{R}^2$, hence the example in Slide 2-12 (Zomorodian, 2005). In Einstein's theory of Special Relativity, Euclidean 3-space plus time (the "4th dimension") are unified into the Minkowski space.
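A metric space has an associated metric, which enables us to measure the distances between points in that space and, implicitly, to define their neighborhoods. Consequently, a metric provides a space with a topology, hence a metric space is a topological space. A set $X$ with a metric function $d$ is called a metric space. We give it the metric topology of $d$, where the set of open balls $B(x, r) = \{y \in X \mid d(x, y) < r\}$, for $x \in X$ and $r > 0$, serves as the basis neighborhoods. Most of our “natural” spaces are a particular type of metric space, the Euclidean spaces: the Cartesian product of $n$ copies of $\mathbb{R}$, the set of real numbers, along with the Euclidean metric (Eq. 2-2):
$d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$
is the $n$-dimensional Euclidean space $\mathbb{R}^n$. We may induce a topology on subsets of metric spaces as follows: if $Y \subseteq X$, where $X$ has topology $T$, then we get the relative or induced topology $T_Y$ by defining $T_Y = \{S \cap Y \mid S \in T\}$. For more information refer to (Zomorodian, 2005) or (Edelsbrunner & Harer, 2010).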
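How a metric implicitly defines neighborhoods can be shown with a short sketch of the Euclidean metric and an open ball over a finite point cloud. The function names `euclidean` and `open_ball` and the sample points are assumptions made for this illustration.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two points of equal dimension."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

def open_ball(points, center, radius):
    """All sampled points whose distance to center is strictly below radius."""
    return [p for p in points if euclidean(p, center) < radius]

cloud = [(0, 0), (1, 1), (3, 4)]
neighborhood = open_ball(cloud, center=(0, 0), radius=2)
```

The strict inequality is what makes the ball open: a point at distance exactly equal to the radius is not part of the neighborhood.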
Knowledge Discovery from Data: By getting insight into the data, the gained information can be used to build up knowledge. The grand challenge is to map higher-dimensional data into lower dimensions, hence make it interactively accessible to the end user (Holzinger, 2012), (Holzinger, 2013). This mapping from $\mathbb{R}^N \to \mathbb{R}^2$ is the core task of visualization and a major component of knowledge discovery: enabling effective interactive human control over powerful machine algorithms to support human sensemaking (Holzinger, 2012), (Holzinger, 2013).
Holzinger, A. 2013. Human-Computer Interaction & Knowledge Discovery (HCI-KDD): What is the benefit of bringing those two fields to work together? In: Alfredo Cuzzocrea, C. K., Dimitris E. Simos, Edgar Weippl, Lida Xu (eds.) Multidisciplinary Research and Practice for Information Systems, Springer Lecture Notes in Computer Science LNCS 8127. Heidelberg, Berlin, New York: Springer, pp. 319-328.
Multivariate data set is a data set that has many dependent variables, and they might be correlated to each other to varying degrees. Usually this type of data set is associated with discrete data models.
Multidimensional data set is a data set that has many independent variables clearly identified, and one or more dependent variables associated to them. Usually this type of data set is associated with continuous data models.
In other words, every data item (or object) in a computer is represented (stored) as a set of features. Instead of the term features we may use the term dimensions, because an object with $n$ features can also be represented as a multidimensional point in an $n$-dimensional space. Dimensionality reduction is the process of mapping an $n$-dimensional point into a lower $k$-dimensional space; this is the main challenge in visualization, see → Lecture 9.
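One of the simplest ways to map a point with many features into a lower-dimensional space is a linear projection onto a few chosen directions. The sketch below is an illustration of that general idea only, not a method taken from the lecture; the function name `project` and both sets of directions are assumptions.

```python
def project(point, directions):
    """Map a point to len(directions) coordinates via dot products."""
    return tuple(sum(d_k * x_k for d_k, x_k in zip(direction, point))
                 for direction in directions)

# A data item with four features, i.e. a point in 4-dimensional space.
item = (1, 2, 3, 4)

# Keep only the first two features (axis-aligned projection).
keep_first_two = [(1, 0, 0, 0), (0, 1, 0, 0)]
flat = project(item, keep_first_two)

# Two orthonormal directions that mix all four features.
mixed_directions = [(0.5, 0.5, 0.5, 0.5), (0.5, -0.5, 0.5, -0.5)]
mixed = project(item, mixed_directions)
```

Any such projection discards information; choosing directions that preserve the structure of interest is the hard part of dimensionality reduction.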
The number of dimensions can sometimes be small, ranging from simple 1D data such as temperature measured at different times to 3D applications such as medical imaging, where data is captured within a volume. Standard techniques (contouring in 2D; isosurfacing and volume rendering in 3D) have emerged over the years to handle this sort of data. There is no dimension reduction issue in these applications, since the data and display dimensions essentially match.
Data can be categorized into qualitative (nominal and ordinal) and quantitative (interval and ratio): Interval and ratio data are parametric, and are used with parametric tools in which distributions are predictable (and often Normal). Nominal and ordinal data are non-parametric, and do not assume any particular distribution. They are used with non-parametric tools such as the histogram. The classic paper on the theory of scales of measurement is (Stevens, 1946).
We can summarize what we learned so far about data: Data can be numeric, non-numeric, or both. Non-numeric data can include anything from language data (text) to categorical, image, or video data. Data may range from completely structured, such as categorical data, to semi-structured, such as an XML file containing meta information, to unstructured, such as a narrative “free-text”. Note that the term unstructured does not mean that the data are without any pattern, which would mean complete randomness and uncertainty, but rather that “unstructured data” are expressed in such a way that only humans can meaningfully interpret them. Structure provides information that can be interpreted to determine data organization and meaning, hence it provides a context for the information. The inherent structure in the data can form a basis for data representation. An important, yet often neglected issue is the temporal characteristics of data: data of all types may have a temporal (time) association, and this association may be either discrete or continuous (Thomas & Cook, 2005). In Medical Informatics we have a permanent interaction between data, information and knowledge, with different definitions (Bemmel & Musen, 1997), see Slide 2-16:
Data are the physical entities at the lowest abstraction level, which are, e.g., generated by a patient (patient data) or a biological process (e.g. Omics data). According to (Bemmel & Musen, 1997) data contain no meaning.
Information is derived by interpretation of the data by a clinician (human intelligence).
Knowledge is obtained by inductive reasoning with previously interpreted data, collected from many similar patients or processes, which is added to the so-called body of knowledge in medicine, the explicit knowledge. This knowledge is used for the interpretation of other data and to gain implicit knowledge, which guides the clinician in taking further action.
For hypothesis generation and testing, four types of inferences exist (Peirce, 1955): abstraction, abduction, deduction, and induction. The first two drive hypothesis generation while the latter two drive hypothesis testing, see Slide 2-17. Abstraction means that data are filtered according to their relevance for the problem solution and chunked in schemas representing an abstract description of the problem (e.g., abstracting that an adult male with haemoglobin concentration less than 14 g/dL is an anaemic patient). Following this, hypotheses that could account for the current situation are related through a process of abduction, characterized by a "backward flow" of inferences across a chain of directed relations which identify those initial conditions from which the current abstract representation of the problem originates. This provides tentative solutions to the problem at hand by way of hypotheses. For example, knowing that disease A will cause symptom B, abduction will try to identify the explanation for B, while deduction will forecast that a patient affected by disease A will manifest symptom B: both inferences are using the same relation along two different directions (Patel & Ramoni, 1997). Abduction is characterized by a cyclical process of generating possible explanations (i.e., identification of a set of hypotheses that are able to account for the clinical case on the basis of the available data) and testing those explanations (i.e., evaluation of each generated hypothesis on the basis of its expected consequences) for the abnormal state of the patient at hand (Patel, Arocha & Zhang, 2004).
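The remark that abduction and deduction use the same relation in two directions can be sketched with a small lookup table. The table, its entries and the function names are purely hypothetical placeholders, not medical knowledge.

```python
# Hypothetical causal relation: disease -> set of symptoms it causes.
CAUSES = {
    "disease A": {"symptom B", "symptom C"},
    "disease D": {"symptom B"},
}

def deduce(disease):
    """Forward direction: which symptoms do we expect for this disease?"""
    return CAUSES.get(disease, set())

def abduce(symptom):
    """Backward direction: which diseases could explain this symptom?"""
    return sorted(d for d, symptoms in CAUSES.items() if symptom in symptoms)

expected = deduce("disease A")
candidates = abduce("symptom B")
```

Deduction yields a definite forecast, whereas abduction yields a list of candidate hypotheses that still have to be tested against their expected consequences.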
The hypothesis testing procedures can be inferred from Slide 2-17: General knowledge is gained from many patients, and this general knowledge is then applied to an individual patient. We have to distinguish between: Reasoning is the process by which a clinician reaches a conclusion after thinking about all the facts; Deduction consists of deriving a particular valid conclusion from a set of general premises; Induction consists of deriving a likely general conclusion from a set of particular statements. Reasoning in the “real world” does not appear to fit neatly into any of these basic types. Therefore, a third form of reasoning has been recognized by Peirce (1955), where deduction and induction are intermixed.
The question “what is information?” is still an open question in basic research, and any definition depends on the view taken. For example, the definition given by Carl-Friedrich von Weizsäcker, “Information is what is understood,” implies that information has both a sender and a receiver who have a common understanding of the representation and the means to convey information using some properties of the physical systems, and his addendum, “Information has no absolute meaning; it exists relatively between two semantic levels”, implies the importance of context (Marinescu, 2011). Without doubt information is a fundamentally important concept within our world, and life is complex information, see Slide 2-14:
Many systems, e.g. in the quantum world, do not obey the classical view of information. In the quantum world and in the life sciences, traditional information theory often fails to accurately describe reality, for example in the complexity of a living cell: All complex life is composed of eukaryotic (nucleated) cells (Lane & Martin, 2010). A good example of such a cell is the protist Euglena gracilis (in German “Augentierchen”) with a length of approx. 30 µm. Life can be seen as a delicate interplay of energy, entropy and information; essential functions of living beings correspond to the generation, consumption, processing, preservation and duplication of information.
P: Complexity <> Information <> Energy <> Entropy
The etymological origin of the word information can be traced back to the Greek “forma” and the Latin “informatio” and “informare”, to bring something into a shape (“in-a-form”). Consequently, the naive definition in computer science is “information is data in context”, and it is therefore different from data or knowledge. However, we follow the notion of (Boisot & Canals, 2004) and define that information is an extraction from data that, by modifying the relevant probability distributions, has direct influence on an agent’s knowledge base. For a better understanding of this concept, we first review the model of human information processing by Wickens (1984): The model by Wickens (1984) beautifully emphasizes our view on data, information and knowledge: the physical data from the real world are perceived as information through perceptual filters, controlled by selective attention, and form hypotheses within the working memory. These hypotheses are the expectations depending on our previous knowledge available in our mental model, stored in the long-term memory. The subjectively best alternative hypothesis will be selected and processed further and may be taken as outcome for an action. Due to the fact that this system is a closed loop, we get feedback through new data perceived as new information and the process goes on.
The incoming stimuli from the physical world must pass both a perceptual filter and a conceptual filter. The perceptual filter orientates the senses (e.g. the visual sense) to certain types of stimuli within a certain physical range (e.g. visual signal range, pre-knowledge, attention etc.). Only the stimuli which pass through this filter get registered as incoming data; everything else is filtered out.

At this point it is important to follow our physical principle of data and to differentiate between two notions that are frequently confused: an experiment's (raw, hard, measured, factual) data and its (meaningful, subjective) interpreted information results. Data are properties concerning only the instrument; they are the expression of a fact. The result concerns a property of the world.

The subsequent conceptual filters extract information-bearing data from what has been previously registered. Both types of filters are influenced by the agents' cognitive and affective expectations, stored in their mental models. The enormous utility of data resides in the fact that it can carry information about the physical world. This information may modify set expectations or the state of knowledge. These principles allow an agent to act in adaptive ways in the physical world (Boisot & Canals, 2004). Compare this process with the human information processing model by Wickens (1984), seen in Slide 2-19 and discussed in → Lecture 7.
Entropy has many different definitions and applications. It originated in statistical physics and is most often used as a measure of disorder. In information theory, entropy can be used as a measure of the uncertainty in a data set.
To see how useful entropy can be, you can have a look at this paper: Holzinger, A., Stocker, C., Peischl, B. & Simonic, K.-M. 2012. On Using Entropy for Enhancing Handwriting Preprocessing. Entropy, 14, (11), 2324-2350. http://www.mdpi.com/1099-4300/14/11/2324
The concept of entropy was first introduced in thermodynamics (Clausius, 1850), where it was used to provide a statement of the second law of thermodynamics. Later, statistical mechanics, through the work of Boltzmann, provided a connection between the macroscopic property of entropy and the microscopic state of a system. Shannon was the first to define information entropy and mutual information.
Shannon (1948) used a Gedankenexperiment (thought experiment) to propose a measure of uncertainty in a discrete distribution based on the Boltzmann entropy of classical statistical mechanics, see Slide 2-22:
An example shall demonstrate the usefulness of this approach:

1) Let $X$ be a discrete data set with associated probabilities $p_i$ (Eq. 2-5):

$$X = \{x_1, \ldots, x_n\}, \qquad P = \{p_1, \ldots, p_n\}$$

2) Now we apply Shannon's equation Eq. 2-4 (Eq. 2-6):

$$H = -\sum_{i=1}^{n} p_i \log_2 p_i$$

3) We assume that our source has two values (ball = white, ball = black). Let us do the famous simple Gedankenexperiment (thought experiment): Imagine a box which can contain two colored balls: black and white. This is our set of discrete symbols with associated probabilities. If we grab blindly into this box to get a ball, we are dealing with uncertainty, because we do not know which ball we touch. We can ask: Is the ball black? NO. THEN it must be white, so we need one question to surely provide the right answer. Because it is a binary decision (YES/NO), the maximum number of (binary) questions required to reduce the uncertainty is $\log_2 N$, where $N$ is the number of possible outcomes. If there are $N$ events with equal probability $p$, then $p = 1/N$. If you have only 1 black ball, then $\log_2 1 = 0$, which means there is no uncertainty (Eq. 2-7):

$$X = \{x_1, x_2\} \quad \text{with} \quad P = \{p, 1-p\}$$

4) Now we solve Eq. 2-6 numerically (Eq. 2-8):

$$H = p \cdot \log_2 \frac{1}{p} + (1-p) \cdot \log_2 \frac{1}{1-p}$$

Since $p$ ranges from 0 (for impossible events) to 1 (for certain events), the surprise $\log_2 (1/p)$ of a single outcome ranges from infinity (for impossible events) to 0 (for certain events). So, we can summarize that the entropy is the weighted average of the surprise for all possible outcomes. For our example with the two balls we can draw the following function: the entropy value is 1 for $p = 0.5$ and it is 0 for either $p = 0$ or $p = 1$. This example might seem trivial, but the entropy principle has been developed a lot since Shannon, and there are many different methods which are very useful for dealing with data.
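The two-ball example can be checked numerically. The following is a minimal sketch, not part of the original slides; the function name `binary_entropy` is chosen here for illustration, and the convention that an outcome with probability 0 contributes nothing to the sum is an assumption made explicit in the code.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a two-outcome source with probabilities p and 1 - p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    h = 0.0
    for q in (p, 1.0 - p):
        if q > 0.0:  # assumed convention: an impossible outcome contributes 0
            h += q * math.log2(1.0 / q)  # probability times surprise
    return h

# Tabulate the curve described in the text: zero at both ends, maximum at p = 0.5.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f}  H = {binary_entropy(p):.4f} bit")
```

The loop prints the entropy for a few values of p, so the shape of the curve (symmetric around p = 0.5) can be read off directly.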
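Shannon called it the information entropy (aka Shannon entropy) and defined (Eq. 2-9):

$$\log_2 \frac{1}{p} = -\log_2 p$$

where $p$ is the probability of the event occurring. If $p$ is not identical for all events, then the entropy $H$ is a weighted average over all events, with the probabilities as weights, which Shannon defined as (Eq. 2-10):

$$H = -\sum_{i=1}^{n} p_i \log_2 p_i$$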
Basically, the entropy $H$ approaches zero if we have a maximum of structure and, conversely, reaches high values if there is no structure. Hence, if the entropy is at its maximum, we have complete randomness and total uncertainty. Low entropy means differences, structure, individuality; high entropy means no differences, no structure, no individuality. Consequently, life needs low entropy.
In principle, what we can infer from entropy values is:

1) Low entropy values mean high probability and high certainty, hence a high degree of structurization in the data.
2) High entropy values mean low probability and low certainty (≅ high uncertainty ;-), hence a low degree of structurization in the data. Maximum entropy would mean complete randomness and total uncertainty.

Highly structured data contain low entropy; ideally, if everything is in order and there is no surprise (no uncertainty), the entropy is low (Eq. 2-11):

$$H = H_{\min} = 0$$

On the other hand, if the data are weakly structured, as for example in biological data, and there is no ability to guess (all data are equally likely), the entropy is high (Eq. 2-12):

$$H = H_{\max} = \log_2 N$$

If we follow this approach, “unstructured data” would mean complete randomness. Let us look at the history of entropy to understand what we can do in future, see Slide 2-25.
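The two limiting cases (zero entropy for fully ordered data, log2 N for N equally likely symbols) can be illustrated with an empirical entropy estimate. This is a sketch under stated assumptions: the function name `shannon_entropy` and the two short example strings are made up for illustration, and probabilities are estimated simply as relative frequencies.

```python
import math
from collections import Counter

def shannon_entropy(symbols) -> float:
    """Empirical Shannon entropy (bits) of a sequence of discrete symbols."""
    counts = Counter(symbols)
    n = sum(counts.values())
    # Each symbol contributes p * log2(1/p) with p = count / n.
    return sum((c / n) * math.log2(n / c) for c in counts.values())

ordered = "AAAAAAAA"   # a single symbol: no surprise at all
uniform = "ACGTACGT"   # four symbols, all equally frequent

print(shannon_entropy(ordered))  # lower bound of the entropy
print(shannon_entropy(uniform))  # upper bound for N = 4 symbols
print(math.log2(4))              # log2 N for comparison
```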
You might ask what the practical purpose of this approach is: the applications are manifold!
The origin may be found in the work of Jakob Bernoulli, describing the principle of insufficient reason: if we are ignorant of the ways an event can occur, the event will occur equally likely in any way. Thomas Bayes (1763) and Pierre-Simon Laplace (1774) carried on, and Harold Jeffreys and David Cox solidified it in Bayesian statistics, aka statistical inference.

The second path leading to the classical Maximum Entropy, en route with the Shannon entropy, can be identified with the work of James Clerk Maxwell and Ludwig Boltzmann, continued by Willard Gibbs and finally Claude Elwood Shannon. This work is geared toward developing the mathematical tools for statistical modeling of problems in information.

These two independent lines of research are very similar. The objective of the first line of research is to formulate a theory/methodology that allows understanding of the general characteristics (distribution) of a system from partial and incomplete information. In the second line of research, the same objective is expressed as determining how to assign (initial) numerical values of probabilities when only some (theoretical) limited global quantities of the investigated system are known. Recognizing the common basic objectives of these two lines of research aided Jaynes in the development of his classical work, the Maximum Entropy formalism. This formalism is based on the first line of research and the mathematics of the second line of research.

The interrelationship between information theory, statistics and inference, and the Maximum Entropy (MaxEnt) principle became clear in the 1950s, and many different methods arose from these principles (Golan, 2008), see next slide.
Maximum Entropy (MaxEn), described by Jaynes (1957), is used to estimate unknown parameters of a multinomial discrete choice problem, whereas the Generalized Maximum Entropy (GME) model includes noise terms in the multinomial information constraints. Each noise term is modeled as the mean of a finite set of a priori known points in the interval $[-1, 1]$ with unknown probabilities, where no parametric assumptions about the error distribution are made. A GME model for the multinomial probabilities and for the distributions associated with the noise terms is derived by maximizing the joint entropy of multinomial and noise distributions, under the assumption of independence (Jaynes, 1957).

Topological Entropy (TopEn) was introduced by Adler, Konheim & McAndrew (1965) with the purpose of introducing the notion of entropy as an invariant for continuous mappings: let $(X, T)$ be a topological dynamical system, i.e., let $X$ be a nonempty compact Hausdorff space and $T: X \to X$ a continuous map; the TopEn is a nonnegative number which measures the complexity of the system (Adler, Downarowicz & Misiurewicz, 2008).

Graph Entropy was described by Mowshowitz (1968) to measure the structural information content of graphs, and a different definition, more focused on problems in information and coding theory, was introduced by Körner (1973). Graph entropy is often used for the characterization of the structure of graph-based systems, e.g. in mathematical biochemistry. In these applications the entropy of a graph is interpreted as its structural information content and serves as a complexity measure. Such a measure is associated with an equivalence relation defined on a finite graph; by application of Shannon's Eq. 2-4 with the probability distribution we get a numerical value that serves as an index of the structural feature captured by the equivalence relation (Dehmer & Mowshowitz, 2011).
Minimum Entropy (MinEn), described by Posner (1975), provides us with the least random and the least uniform probability distribution of a data set, i.e. the minimum uncertainty, which is the limit of our knowledge and of the structure of the system. Often, classical pattern recognition is described as a quest for minimum entropy. Mathematically, it is more difficult to determine a minimum entropy probability distribution than a maximum entropy probability distribution; while the latter has a global maximum due to the concavity of the entropy, the former has to be obtained by calculating all local minima. Consequently, the minimum entropy probability distribution may not exist in many cases (Yuan & Kesavan, 1998).

Cross Entropy (CE), discussed by Rubinstein (1997), was motivated by an adaptive algorithm for estimating probabilities of rare events in complex stochastic networks, which involves variance minimization. CE can also be used for combinatorial optimization problems (COP). This is done by translating the “deterministic” optimization problem into a related “stochastic” optimization problem and then using rare event simulation techniques (De Boer et al., 2005).

Rényi entropy is a generalization of the Shannon entropy (information theory), and Tsallis entropy is a generalization of the standard Boltzmann–Gibbs entropy (statistical physics).

For us more important are:

Approximate Entropy (ApEn), described by Pincus (1991), can be used to quantify regularity in data without any a priori knowledge about the system, see an example in Slide 2-20.

Sample Entropy (SampEn) was used by Richman & Moorman (2000) as a new, related measure of time series regularity. SampEn was designed to reduce the bias of ApEn and is better suited for data sets with known probabilistic content.
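The remark that Rényi and Tsallis entropy are generalizations can be made concrete numerically: both approach the Shannon value as their order parameter approaches 1. The sketch below uses the standard textbook definitions, which are not given in the slides and are therefore an assumption of this example; entropies are in nats, the Tsallis constant is set to 1, and the distribution `p` is made up.

```python
import math

def shannon(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def renyi(p, alpha):
    """Rényi entropy of order alpha (alpha > 0, alpha != 1), in nats."""
    return math.log(sum(pi ** alpha for pi in p if pi > 0)) / (1.0 - alpha)

def tsallis(p, q):
    """Tsallis entropy of index q (q != 1), with the constant k set to 1."""
    return (1.0 - sum(pi ** q for pi in p if pi > 0)) / (q - 1.0)

p = [0.5, 0.25, 0.125, 0.125]  # made-up distribution for illustration

# As the order approaches 1 from either side, both values approach Shannon.
for order in (0.5, 0.9, 0.999, 1.001, 2.0):
    print(order, renyi(p, order), tsallis(p, order))
print("Shannon:", shannon(p))
```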
Problem: Monitoring body movements along with vital parameters during sleep provides important medical information regarding general health, and can therefore be used to detect trends (large epidemiology studies) and to discover severe illnesses including hypertension (which is increasing enormously in our society). This seemingly simple data, from only one night period, demonstrates the complexity and the boundaries of standard methods (for example the Fast Fourier Transformation) to discover knowledge (for example deviations, similarities etc.). Due to the complexity and uncertainty of such data sets, standard methods (such as FFT) carry the danger of modeling artifacts. Since the knowledge of interest for medical purposes is in anomalies (alterations, differences, a-typicalities, irregularities), the application of entropic methods provides benefits. Photograph taken during the EU Project EMERGE and used with permission.
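1) We have a given data set $\langle x_n \rangle$, where capital $N$ is the number of data points (Eq. 2-13):

$$\langle x_n \rangle = \{x_1, x_2, \ldots, x_N\}$$

2) Now we form $m$-dimensional vectors (Eq. 2-14):

$$\vec{X}_i = (x_i, x_{i+1}, \ldots, x_{i+m-1})$$

3) We measure the distance between every pair of vectors, i.e. the maximum absolute difference between their scalar components (Eq. 2-15):

$$\lVert \vec{X}_i, \vec{X}_j \rVert = \max_{k=1,2,\ldots,m} \lvert x_{i+k-1} - x_{j+k-1} \rvert$$

4) We look, so to say, in which dimension the biggest difference is; as a result we get the Approximate Entropy (if there is no difference we have zero relative entropy) (Eq. 2-16):

$$\mathrm{ApEn}(m, r) = \lim_{N \to \infty} \left[ \phi^{m}(r) - \phi^{m+1}(r) \right]$$

where $m$ is the run length and $r$ is the tolerance window (let us assume that … is equal to …); ApEn(m, r) could also be written as $H(m, r)$.

5) $\phi^{m}(r)$ is computed by (Eq. 2-17):

$$\phi^{m}(r) = \frac{1}{N-m+1} \sum_{i=1}^{N-m+1} \ln C_r^{m}(i)$$

with (Eq. 2-18):

$$C_r^{m}(i) = \frac{N^{m}(i)}{N-m+1}$$

where $N^{m}(i)$ is the number of vectors $\vec{X}_j$ whose distance to $\vec{X}_i$ does not exceed $r$.

6) $C_r^{m}(i)$ measures, within the tolerance $r$, the regularity of patterns similar to a given one of window length $m$.

7) Finally we increase the dimension to $m+1$, repeat the steps before, and get as a result the approximate entropy ApEn(m, r).

ApEn(m, r, N) is approximately the negative natural logarithm of the conditional probability (CP) that a data set of length $N$, having repeated itself within a tolerance $r$ for $m$ points, will also repeat itself for $m+1$ points. An important point to keep in mind about the parameter $r$ is that it is commonly expressed as a fraction of the standard deviation (SD) of the data and in this way makes ApEn a scale-invariant measure. A low value arises from a high probability of repeated template sequences in the data (Hornero et al., 2006).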
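The seven steps above can be sketched in a few lines of plain Python. This is an illustrative implementation for a finite series, not the code used for the slides: the function name `approximate_entropy`, the defaults m = 2 and r = 0.2 × SD, and the two test signals are assumptions chosen for this example. Self-matches are counted, so the logarithm is always defined.

```python
import math
import random

def approximate_entropy(x, m=2, r=None):
    """Approximate entropy ApEn(m, r) of a finite series x (self-matches included)."""
    n = len(x)
    if n <= m + 1:
        raise ValueError("series too short for the chosen m")
    if r is None:
        mean = sum(x) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
        r = 0.2 * sd  # assumed default: tolerance as a fraction of the SD

    def phi(dim):
        # Step 2: all overlapping template vectors of length dim.
        vectors = [x[i:i + dim] for i in range(n - dim + 1)]
        total = 0.0
        for vi in vectors:
            # Step 3: maximum absolute component difference; count vectors within r.
            matches = sum(
                1 for vj in vectors
                if max(abs(a - b) for a, b in zip(vi, vj)) <= r
            )
            # Step 5: logarithm of the fraction of matching templates.
            total += math.log(matches / len(vectors))
        return total / len(vectors)

    # Steps 4 and 7: difference between dimension m and dimension m + 1.
    return phi(m) - phi(m + 1)

random.seed(0)
regular = [math.sin(0.5 * i) for i in range(200)]        # smooth, repetitive signal
noisy = [random.gauss(0.0, 1.0) for _ in range(200)]     # irregular signal

print("regular:", approximate_entropy(regular))
print("noisy:  ", approximate_entropy(noisy))
```

The regular signal should yield a clearly lower value than the noisy one, in line with the statement that a low value arises from a high probability of repeated template sequences. The double loop is quadratic in the series length, which is acceptable for a sketch but slow for long recordings.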
In this slide we can see the plot of the normalized approximate entropy for each of the episodes and the median across all the episodes. From this figure we can see that the entropy is at a minimum where we have no alterations and that it increases when irregularities occur. If we have no differences we get zero entropy.
A final example should make the advantage of such an entropy method totally clear: in the right diagram it is hard for a medical professional to discover irregularities, especially over a longer period, but an anomaly can easily be detected by displaying the measured relative ApEn. What can we learn from this experiment? Approximate entropy is relatively unaffected by noise; it can be applied to complex time series with good reproducibility; it is finite for stochastic, noisy, composite processes; the values correspond directly to irregularities; and it is applicable to many other areas, for example the classification of large sets of texts: the ability to guess algorithmically the subject of a text collection without having to read it would permit automated classification.
My DEDICATION is to make data valuable … Thank you!
MFC = Minimum Foot Clearance; stride = step. You can see clearly what you can measure with entropy: you can determine anomalies, i.e. the balance problems of elderly gait.
MFC Poincaré plots. Top panels show the MFC time series from a healthy elderly subject (A) and its corresponding Poincaré plot (B). Bottom panels show the MFC time series from an elderly subject with a balance problem (C) and its corresponding Poincaré plot (D).
Significant relationships of mean MFC with Poincaré plot indexes (SD1, SD2) and ApEn (r = 0.70, p < 0.05; r = 0.86, p < 0.01; r = 0.74, p < 0.05) were found in the falls-risk elderly group. On the other hand, such relationships were absent in the healthy elderly group. In contrast, the ApEn values of the MFC data series were significantly (p < 0.05) correlated with Poincaré plot indexes of MFC in the healthy elderly group, whereas correlations were absent in the falls-risk group. The ApEn values in the falls-risk group (mean ApEn = 0.18 ± 0.03) were significantly (p < 0.05) higher than those in the healthy group (mean ApEn = 0.13 ± 0.13). The higher ApEn values in the falls-risk group might indicate increased irregularities and randomness in their gait patterns and a loss of gait control mechanism. ApEn values of randomly shuffled MFC data of falls-risk subjects did not show any significant relationship with mean MFC.
Surrogate data records. A and B show the major components. A: the mean process, which has set point and spike modes. B: the baseline process, here meaning the heart rate variability, modeled as Gaussian random numbers. C: their sum, a surrogate data record. D-F: a more realistic surrogate with the same frequency content as the observed data. D: a clinically observed data record of 4,096 R-R intervals. The left-hand ordinate is labeled in ms and the right-hand ordinate in SD. E: a 4,096-point isospectral surrogate data set formed using the inverse Fourier transform of the periodogram of the data in D. F: the surrogate data after addition of a clinically observed deceleration lasting 50 points and scaled so that the variance of the record is increased from 1 to 2.
http://support.sas.com/documentation/cdl/en/etsug/60372/HTML/default/viewer.htm#etsug_entropy_sect018.htm
Where many other languages refer to tables, rows, and columns/fields, SAS uses the terms data sets, observations, and variables. There are only two kinds of variables in SAS: numeric and character (string). By default all numeric variables are stored as (8 byte) real. It is possible to reduce precision in external storage only. Date and datetime variables are numeric variables that inherit the C tradition and are stored as either the number of days (for date variables) or seconds (for datetime variables).
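To illustrate the storage of dates as plain numbers, the following sketch converts such values into Python date objects. It assumes the SAS reference point of 1 January 1960, which is not stated in the text above; the helper names are hypothetical and only serve this example.

```python
from datetime import date, datetime, timedelta

# Assumption: SAS counts date and datetime values from 1 January 1960.
SAS_EPOCH = datetime(1960, 1, 1)

def sas_date_to_python(days: float) -> date:
    """Convert a SAS date value (number of days) to a Python date."""
    return (SAS_EPOCH + timedelta(days=days)).date()

def sas_datetime_to_python(seconds: float) -> datetime:
    """Convert a SAS datetime value (number of seconds) to a Python datetime."""
    return SAS_EPOCH + timedelta(seconds=seconds)

print(sas_date_to_python(0))           # the reference date itself
print(sas_date_to_python(366))         # one (leap) year later
print(sas_datetime_to_python(86400))   # one day later, expressed in seconds
```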
http://www.sas.com/technologies/analytics/statistics/stat/index.html
Hadoop and the MapReduce programming paradigm already have a substantial base in the bioinformatics community, in particular in the field of high-throughput next-generation sequencing analysis. This is due to the cost-effectiveness of Hadoop-based analysis on commodity Linux clusters, and in the cloud via data upload to cloud vendors who have implemented Hadoop/HBase; and due to the effectiveness and ease of use of the MapReduce method in the parallelization of many data analysis algorithms.
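The MapReduce idea itself can be shown without any cluster. The sketch below mimics the three phases (map, shuffle, reduce) in plain single-process Python for counting k-mers in sequencing reads. It does not use the Hadoop API; all function names and the two example reads are made up for illustration.

```python
from collections import defaultdict
from itertools import chain

def map_kmers(read: str, k: int = 3):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]

def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(groups):
    """Reduce step: sum the values of each key."""
    return {key: sum(values) for key, values in groups.items()}

reads = ["ACGTAC", "GTACGT"]  # made-up reads for illustration
mapped = chain.from_iterable(map_kmers(r) for r in reads)
counts = reduce_counts(shuffle(mapped))
print(counts)
```

Because each read is mapped independently and each key is reduced independently, both phases could be distributed over many machines, which is what makes the paradigm attractive for large sequencing data sets.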
Nanomedicine opens new avenues for much research in biomedical informatics methods and tools. Certainly, the future challenges are new topics such as “big data” in bioinformatics, novel methods for the use of “omics” data etc. Future research is needed on algorithmic and methodological issues. This requires the willingness to cooperate with different disciplines.
Two areas offer ideal conditions for solving these challenges: Human-Computer Interaction (HCI) and Knowledge Discovery and Data Mining (KDD), with the goal of supporting human intelligence with machine intelligence, to discover new, previously unknown insights into the data.
Holzinger, A. 2013. Human–Computer Interaction & Knowledge Discovery (HCI-KDD): What is the benefit of bringing those two fields to work together? In: Alfredo Cuzzocrea, C. K., Dimitris E. Simos, Edgar Weippl, Lida Xu (ed.) Multidisciplinary Research and Practice for Information Systems, Springer Lecture Notes in Computer Science LNCS 8127. Heidelberg, Berlin, New York: Springer, pp. 319-328.
The challenge we face is that an estimated average of 5% of data are structured; the rest is either semi-structured or weakly structured, and most of our data is unstructured.
Maybe the most important field for the future is data mining, especially novel techniques of data mining that include both time and space (e.g. graph-based, entropy-based and topology-based data mining approaches).
http://minnesotafuturist.pbworks.com/w/page/21441129/DIKW
A funny description of data, information and knowledge.
A very striking image. Nice to look at, but its usefulness is questionable.
All these models are very questionable. Please remember that in our lecture we follow the notion of Boisot & Canals.
The interesting aspect of this graphic is that it includes a time axis, which is important for decision making and predictive analytics.