Dear Students – welcome to the second lecture of our course! 1 WS 2013/14 A. Holzinger LV 444.152


In this second lecture we start with a look at data sources, review some data structures, discuss standardization versus structurization, review the differences between data, information and knowledge, and close with an overview of information entropy.


A recommendable reference is: Scheinerman, E. R. 2011. Mathematical Notation: A Guide for Engineers and Scientists, which also includes the most important LaTeX commands for producing maths symbols.


components through p and w (Golan, Judge, and Miller, 1996);


John von Neumann and his high-speed computer, approx. 1952

Our first question is: Where does the data come from? The second question: What kind of data is this? The third question: How big is this data? So, let us look at some biomedical data sources (see Slide 2-1):


Due to the increasing trend towards personalized and molecular medicine, biomedical data results from various sources in different structural dimensions, ranging from the microscopic world (e.g. genomics, epigenomics, metagenomics, proteomics, metabolomics) to the macroscopic world (e.g. disease spreading data of populations in public health informatics). Just for orientation: the glucose molecule has a size of 900 pm = $900 \times 10^{-12}$ m and the carbon atom approx. 300 pm. A hepatitis virus is relatively large with 45 nm = $45 \times 10^{-9}$ m and the X-chromosome much bigger with 7 µm = $7 \times 10^{-6}$ m.

Here a lot of “big data” is produced, e.g. genomics, metabolomics and proteomics data. This is really “big data”: the data sets are enormously large. Whereas in each individual we estimate many Terabytes (1 TB = $1 \times 10^{12}$ Byte = 1000 GByte) of genomics data, we are confronted with Petabytes of proteomics data, and the fusion of those for personalized medicine results in Exabytes of data (1 EB = $1 \times 10^{18}$ Byte).

Of course these amounts are for each human individual; however, we have a current world population of 7 billion people (= $7 \times 10^{9}$ people; 1 billion in English is 1 milliard in many other European languages). So you can see that this is really “big data”. This “natural” data is then fused with “produced” data, e.g. the unstructured data (text) in the patient records, or data from physiological sensors etc. These data are also rapidly increasing in size and complexity. You can imagine that without computational intelligence we have no chance to survive in these complex big data sets.

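Source of the scale illustration: http://learn.genetics.utah.edu/content/begin/cells/scale/
Figure labels: C atom 340 pm = $340 \times 10^{-12}$ m; glucose molecule 900 pm; hepatitis virus 45 nm = $45 \times 10^{-9}$ m; microscope $200 \times 10^{-9}$ m; confocal microscopy $20 \times 10^{-6}$ m; electron microscopy $0.1 \times 10^{-9}$ m; X-chromosome $7 \times 10^{-6}$ m; DNA $2 \times 10^{-9}$ m; enzyme = metabolomics.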


Most of our computers are Von Neumann machines (see chapter 1); consequently, at the lowest physical layer, data is represented as patterns of electrical on/off states (1/0, H/L, high/low). We speak of a bit, which is also known as Bit, the Basic indissoluble information unit (Shannon, 1948). Do not confuse this Bit with the IEC 60027-2 unit symbol bit (in small letters), which is combined with binary prefixes to state data sizes (e.g. 1 Kibit = 1024 bit; 1 Byte = 8 bit). Beginning with the physical level of data we can determine various levels of data structures (see Slide 2-2). Refer to: http://physics.nist.gov/cuu/Units/binary.html

1) Physical level: in a Von Neumann system: bit; in a quantum system: qubit.
Note: Regardless of its physical realization (e.g. voltage, or mechanical state, or black/white etc.), a bit is always logically either 0 or 1 (analogous to a light switch). A qubit has similarities to a classical bit, but is overall very different: A classical bit is a scalar variable with the single value of either 0 or 1, so the value is unique, deterministic and unambiguous. A qubit is more general in the sense that it represents a state defined by a pair of complex numbers $(a, b)$, whose squared magnitudes $|a|^2$ and $|b|^2$ express the probability that a reading of the value of the qubit will give a value of 0 or 1, respectively. Thus, a qubit can be in the state of 0, 1, or some mixture, referred to as a superposition, of the 0 and 1 states. The weights of 0 and 1 in this superposition are determined by $(a, b)$ in the following way: $\text{qubit} \triangleq (a, b) \triangleq a\,|0\rangle + b\,|1\rangle$. Please be aware that this model of quantum computation is not the only one (Lanzagorta & Uhlmann, 2008).

2) Logical level:
   1) Primitive data types, including: a) Boolean data type (true/false); b) numerical data type (e.g. integer, floating-point numbers (“reals”), etc.);
   2) Composite data types, including: a) array, b) record, c) union, d) set (stores values without any particular order, and no repeated values), e) object (contains others);
   3) String and text types, including: a) alphanumeric characters, b) alphanumeric strings (= sequence of characters to represent words and text).

3) Abstract level: including abstract data structures, e.g. queue (FIFO), stack (LIFO), set (no order, no repeated values), lists, hash table, arrays, trees, graphs, …

4) Technical level: Application data formats, e.g. text, vector graphics, pixel images, audio signals, video sequences, multimedia, …

5) Hospital level: Narrative (textual, natural language) patient record data (structured/unstructured and standardized/non-standardized), Omics data (genomics, proteomics, metabolomics, microarray data, fluxomics, phenomics), numerical measurements (physiological data, time series, lab results, vital signs, blood pressure, CO2 partial pressure, temperature, …), recorded signals (ECG, EEG, ENG, EMG, EOG, EP, …), graphics (sketches, drawings, handwriting, …), audio signals, images (cams, x-ray, MR, CT, PET, …), etc.
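The relation between the amplitude pair $(a, b)$ of a qubit and the probabilities of reading 0 or 1 can be illustrated with a few lines of Python. This is only a minimal sketch under the amplitude model described above; the function name `measurement_probabilities` is a hypothetical helper, and the normalisation step is an assumption added so that unnormalised inputs are handled as well.

```python
import math

def measurement_probabilities(a: complex, b: complex) -> tuple[float, float]:
    """Return (P(0), P(1)) for the qubit state a|0> + b|1>."""
    norm_sq = abs(a) ** 2 + abs(b) ** 2
    if norm_sq == 0:
        raise ValueError("the amplitudes (a, b) must not both be zero")
    # Normalise so that the two probabilities sum to 1.
    return abs(a) ** 2 / norm_sq, abs(b) ** 2 / norm_sq

# Equal superposition of the 0 and 1 states (the amplitudes may be complex).
p0, p1 = measurement_probabilities(1 / math.sqrt(2), 1j / math.sqrt(2))

# A classical bit is the special case in which one amplitude is zero.
classical_zero = measurement_probabilities(1, 0)
```

A classical bit corresponds to the two special cases $(1, 0)$ and $(0, 1)$, where the reading is certain.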


In biomedical informatics we have a lot to do with abstract data types (ADT); consequently we briefly review the most important ones here. For details please refer to a course on Algorithms & Data Structures, or to a classic textbook such as (Aho, Hopcroft & Ullman, 1983), (Cormen et al., 2009), or in German (Ottmann & Widmayer, 2012), (Holzinger, 2003), and please take into consideration that data structures and algorithms go hand in hand (Cormen, 2013).

List is a sequential collection of items $a_1, a_2, \ldots, a_n$ accessible one after another, beginning at the head and ending at the tail z. In a Von Neumann machine it is a widely used data structure for applications which do not need random access. It differs from the stack (last-in-first-out, LIFO) and queue (first-in-first-out, FIFO) data structures insofar as additions and removals can be made at any position in the list. In contrast to a simple set, the order is important. A typical example for the use of a list is a DNA sequence. The combination GGGTTTAAA is such a list; the elements of the list are the nucleotide bases. Nucleotides are the joined molecules which form the structural units of RNA and DNA and play a central role in metabolism.
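The differences between list, set, stack and queue can be made concrete with the DNA example. The following is a small sketch using Python's built-in containers; the variable names are chosen for illustration only.

```python
from collections import deque

# A DNA sequence as a list: order matters and repeated values are allowed.
dna = list("GGGTTTAAA")
dna.insert(3, "C")        # addition at an arbitrary position
removed = dna.pop(0)      # removal at an arbitrary position (here: the head)
sequence = "".join(dna)

# A set keeps neither the order nor repeated values.
bases = set("GGGTTTAAA")

# Stack (LIFO): additions and removals happen at the same end.
stack = []
for base in "GTA":
    stack.append(base)
last_in = stack.pop()

# Queue (FIFO): additions at the tail, removals at the head.
queue = deque("GTA")
first_in = queue.popleft()
```

The set loses exactly the information that makes a sequence a sequence, which is why a list (or string) is the natural representation for nucleotide data.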


Graph is a pair $G = (V, E)$, where $V$ is a finite, non-empty set of vertices (nodes) and $E$ is a set of edges (lines, arcs), which are 2-element subsets of $V$. If $E$ is a set of ordered pairs of vertices (arcs, directed edges, arrows), then it is a directed graph (digraph). The distances between the vertices can be represented within a distance matrix (two-dimensional array). The vertices in a graph can be multidimensional objects, e.g. vectors containing the results of multiple gene-expression measures. For this purpose the distance of two vertices can be measured by various distance metrics. Graphs are ideally suited for representing networks in medicine and biology, e.g. metabolic pathways, etc.

In bioinformatics, distance matrices are used to represent protein structures in a coordinate-independent manner, as well as the pairwise distances between two sequences in sequence space. They are used in structural and sequential alignment, and for the determination of protein structures from NMR or X-ray crystallography.

Evolutionary dynamics act on populations. Neither genes, nor cells, nor individuals evolve; only populations evolve. The so-called Moran process describes the stochastic evolution of a finite population of constant size: In each time step, an individual is chosen for reproduction with a probability proportional to its fitness; a second individual is chosen for death. The offspring of the first individual replaces the second. In the generalization to graphs, individuals occupy the vertices of a graph. In each time step, an individual is selected with a probability proportional to its fitness; the weights of the outgoing edges determine the probabilities that the corresponding neighbor will be replaced by the offspring. The process is described by a stochastic matrix $W$, where $w_{ij}$ denotes the probability that an offspring of individual $i$ will replace individual $j$. At each time step, an edge $(i, j)$ is selected with a probability proportional to its weight and the fitness of the individual at its tail. The Moran process corresponds to a complete graph with identical weights (Lieberman, Hauert & Nowak, 2005).
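The update rule of the Moran process on a graph can be sketched in a few lines. This is an illustrative simulation, not the implementation used by Lieberman, Hauert & Nowak; the function names, the type labels "mutant" and "resident", the fitness values and the population size are all assumptions made for the example.

```python
import random

def moran_step(types, fitness, W, rng):
    """One update of the Moran process on a weighted graph.

    types[i]   : type label of the individual on vertex i
    fitness[t] : reproductive fitness of type t
    W[i][j]    : probability that the offspring of i replaces j (rows sum to 1)
    """
    n = len(types)
    # Choose the reproducing individual with probability proportional to fitness.
    parent = rng.choices(range(n), weights=[fitness[t] for t in types])[0]
    # Choose the replaced neighbor according to the outgoing edge weights.
    victim = rng.choices(range(n), weights=W[parent])[0]
    types[victim] = types[parent]
    return types

def complete_graph_weights(n):
    """Identical weights on a complete graph: the classical Moran process."""
    return [[0.0 if i == j else 1.0 / (n - 1) for j in range(n)] for i in range(n)]

def run_until_fixation(types, fitness, W, seed=0, max_steps=100_000):
    """Iterate until one type occupies all vertices (or max_steps is reached)."""
    rng = random.Random(seed)
    types = list(types)
    for step in range(max_steps):
        if len(set(types)) == 1:
            return types[0], step
        moran_step(types, fitness, W, rng)
    return None, max_steps

# One fitter mutant in a resident population of constant size 10 (invented values).
winner, steps = run_until_fixation(
    ["mutant"] + ["resident"] * 9,
    {"mutant": 1.5, "resident": 1.0},
    complete_graph_weights(10),
)
```

Replacing `complete_graph_weights` with another row-stochastic matrix $W$ gives the process on a different graph structure, while the population size stays constant in every step.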


Tree is a collection of elements called nodes, one of which is distinguished as a root, along with a relation ("parenthood") that places a hierarchical structure on the nodes. A node, like an element of a list, can be of whatever type we wish. We often depict a node as a letter, a string, or a number with a circle around it. Formally, a tree can be defined recursively in the following manner:
1. A single node by itself is a tree. This node is also the root of the tree.
2. Suppose $n$ is a node and $T_1, T_2, \ldots, T_k$ are trees with roots $n_1, n_2, \ldots, n_k$, respectively. We can construct a new tree by making $n$ be the parent of nodes $n_1, n_2, \ldots, n_k$. In this tree $n$ is the root and $T_1, T_2, \ldots, T_k$ are the subtrees of the root. Nodes $n_1, n_2, \ldots, n_k$ are called the children of node $n$.

Dendrogram (from Greek dendron "tree", -gramma "drawing") is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. Dendrograms are often used in computational biology to illustrate the clustering of genes or samples. The origin of such dendrograms can be found in (Darwin, 1859). The example by (Hufford et al., 2012) shows a neighbor-joining tree and the changing morphology of domesticated maize and its wild relatives. Taxa in the neighbor-joining tree are represented by different colors: parviglumis (green), landraces (red), improved lines (blue), mexicana (yellow) and Tripsacum (brown). The morphological changes are shown for female inflorescences and plant architecture during domestication and improvement.
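The recursive definition translates almost literally into code. The sketch below assumes a hypothetical `Tree` class and helper names; the labels mimic how a dendrogram joins clusters step by step.

```python
from dataclasses import dataclass, field

@dataclass
class Tree:
    label: str
    children: list["Tree"] = field(default_factory=list)

def make_tree(label, *subtrees):
    """Rule 2: a new node becomes the parent of the roots of the given subtrees."""
    return Tree(label, list(subtrees))

def size(tree):
    """Number of nodes, computed along the recursive definition."""
    return 1 + sum(size(child) for child in tree.children)

def height(tree):
    """Length of the longest path from the root to a leaf."""
    return 0 if not tree.children else 1 + max(height(c) for c in tree.children)

# Rule 1: single nodes are trees (the leaves of a small dendrogram).
leaf_a, leaf_b, leaf_c = Tree("a"), Tree("b"), Tree("c")
# Rule 2: join subtrees under a new root, as hierarchical clustering does.
cluster_ab = make_tree("ab", leaf_a, leaf_b)
root = make_tree("abc", cluster_ab, leaf_c)
```

Functions on trees, such as `size` and `height`, follow the same recursion as the definition itself.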


Key problems in dealing with data include:
1) Heterogeneous data sources (need for data fusion and data integration)
2) Complexity of the data (high-dimensionality)
3) Noisy, uncertain data (challenge of pre-processing)
4) The discrepancy between data, information and knowledge (various definitions)
5) Big data sets (manual handling of the data is impossible)


Now that we have seen some examples of data from the biomedical domain, we can look at the “big picture”. Manyika et al. (2011) localized four major data pools in US health care and describe that the data are highly fragmented, with little overlap and low integration. Moreover, they report that approx. 30% of clinical text/numerical data in the United States, including medical records, bills, laboratory and surgery reports, is still not generated electronically. Even when clinical data are in digital form, they are usually held by an individual provider and rarely shared (see Slide 2-4).

Biomedical research data, e.g. clinical trials, predictive modeling etc., is produced by academia and pharmaceutical companies and stored in databases and libraries. Clinical data is produced in the hospital and stored in hospital information systems (HIS), picture archiving and communication systems (PACS) or in laboratory databases, etc. Much data is health business data produced by payors, providers, insurances, etc. Finally, there is an increasing pool of patient behavior and sentiment data, produced by various customers and stakeholders outside the typical clinical context, including the growing data from the wellness and ambient assisted living domain.


A major challenge in our networked world is the increasing amount of data, today called “big data”. The trend towards personalized medicine has resulted in a sheer mass of generated (-omics) data (see Slide 2-7). In the life sciences domain, most data models are characterized by complexity, which makes manual analysis very time-consuming and frequently practically impossible (Holzinger, 2013).


More and more Omics data are generated, including:
1) Genomics data (e.g. sequence annotation).
2) Transcriptomics data (e.g. microarray data); the transcriptome is the set of all RNA molecules, including mRNA, rRNA, tRNA and non-coding RNA produced in the cells.
3) Proteomics data: Proteomic studies generate large volumes of raw experimental data and inferred biological results stored in data repositories, mostly openly available; an overview can be found in (Riffle & Eng, 2009). The outcome of proteomics experiments is a list of proteins differentially modified or abundant in a certain phenotype. The large size of proteomics data sets requires specialized analytical tools, which deal with large lists of objects (Bessarabova et al., 2012).
4) Metabolomics (e.g. enzyme annotation); the metabolome represents the collection of all metabolites in a cell, tissue, organ or organism.
5) Protein-DNA interactions.
6) Protein-protein interactions; PPI are at the core of the entire interactomics system of any living cell.
7) Fluxomics (isotopic tracing, metabolic pathways).
8) Phenomics (biomarkers).
9) Epigenetics: the study of changes in gene expression caused by mechanisms other than changes in the DNA sequence, hence the prefix “epi-”.
10) Microbiomics.
11) Lipidomics.

Omics data integration helps to address interesting biological questions on the biological systems level towards personalized medicine (Joyce & Palsson, 2006).

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2908408/


A further challenge is to integrate the data and to make it accessible to the clinician. While there is much research on the integration of heterogeneous information systems, a shortcoming is in the integration of available data. Data fusion is the process of merging multiple records representing the same real-world object into a single, consistent, accurate, and useful representation (Bleiholder & Naumann, 2008). An example for the mix of different data for solving a medical problem can be seen in Slide 2-8.
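The idea of data fusion can be sketched with a few invented patient records. The conflict rule used here (prefer the non-missing value from the most trusted source) is only one of many possible strategies and is an assumption of this example, as are the source names and field names.

```python
def fuse_records(records, source_priority):
    """Merge records describing the same real-world object into one representation.

    Conflict rule (an assumption of this sketch): for each attribute, keep the
    non-missing value from the most trusted source.
    """
    ranked = sorted(records, key=lambda r: source_priority.index(r["source"]))
    fused = {}
    for record in ranked:
        for key, value in record.items():
            if key == "source" or value is None:
                continue
            # setdefault keeps the value already taken from a higher-ranked source.
            fused.setdefault(key, value)
    return fused

# Three invented records for the same patient from different systems.
records = [
    {"source": "HIS", "patient_id": "P1", "name": "Doe, Jane", "weight_kg": None},
    {"source": "LAB", "patient_id": "P1", "name": "J. Doe", "crp_mg_l": 12.0},
    {"source": "WARD", "patient_id": "P1", "weight_kg": 64.5},
]
fused = fuse_records(records, source_priority=["HIS", "LAB", "WARD"])
```

In practice the hard part is deciding that the records refer to the same patient in the first place and choosing a defensible conflict rule; the merge itself is the simple step.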

A good example for complex medical data is RCQM, an application that manages the flow of data and information in the rheumatology outpatient clinic (50 patients per day, 5 days per week) of Graz University Hospital, on the basis of a quality management process model. Each examination produces 100+ clinical and functional parameters per patient. These amassed data are morphed into more usable information by applying scoring algorithms (e.g. Disease Activity Score, DAS) and are convoluted over time. Together with previous findings, physiological laboratory data, patient record data and Omics data from the Pathology department, these data constitute the information basis for analysis and evaluation of the disease activity. The challenge is in the increasing quantities of such highly complex, multi-dimensional and time series data (Simonic et al., 2011).


Do not confuse structure with standardization (see Slide 2-9). Data can be standardized (e.g. numerical entries in laboratory reports) and non-standardized. A typical example is non-standardized text, imprecisely called “free text” or “unstructured data”, in an electronic patient record (Kreuzthaler et al., 2011).

Standardized data is the basis for accurate communication. In the medical domain, many different people work at different times in various locations. Data standards can ensure that information is interpreted by all users with the same understanding. Moreover, standardized data facilitate comparability of data and interoperability of systems. They support the reusability of the data, improve the efficiency of healthcare services and avoid errors by reducing duplicated efforts in data entry.

Data standardization refers to a) the data content; b) the terminologies that are used to represent the data; c) how data is exchanged; and d) how knowledge, e.g. clinical guidelines, protocols, decision support rules, checklists, standard operating procedures, is represented in the health information system (refer to IOM). Technical elements for data sharing require standardization of identification, record structure, terminology, messaging, privacy etc. The most used standardized data set to date is the International Classification of Diseases (ICD), which was first adopted in 1900 for collecting statistics (Ahmadian et al., 2011), and which we will discuss in → Lecture 3.

Non-standardized data is the majority of data and inhibits data quality, data exchange and interoperability.

Well-structured data is the minority of data and an idealistic case in which each data element has an associated defined structure, e.g. relational tables, the Resource Description Framework (RDF), or the Web Ontology Language (OWL) (see → Lecture 3). Note: Ill-structured is a term often used for the opposite of well-structured, although this term originally was used in the context of problem solving (Simon, 1973).

Semi-structured data is a form of structured data that does not conform with the strict formal structure of tables and data models associated with relational databases but contains tags or markers to separate structure and content, i.e. it is schema-less or self-describing; a typical example is a markup language such as XML (see → Lecture 3 and 4).

Weakly-structured data is most of our data in the whole universe, whether it is in macroscopic (astronomy) or microscopic structures (biology); see → Lecture 5.

Non-structured data or unstructured data is an imprecise definition used for information expressed in natural language, when no specific structure has been defined. This is an issue for debate: text also has some structure: words, sentences, paragraphs. If we are very precise, unstructured data would mean that the data is completely randomized, which is usually called noise and is defined by (Duda, Hart & Stork, 2000) as any property of data which is not due to the underlying model but instead to randomness (either in the real world, from the sensors or the measurement procedure).
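What “self-describing” means for semi-structured data can be seen in a tiny XML fragment. The record below is invented for illustration; the tag and attribute names are assumptions, and only Python's standard `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET

# Invented record: the tags separate structure from content, but no fixed
# schema forces every <patient> element to carry the same fields.
xml_text = """
<patients>
  <patient id="P1">
    <name>Jane Doe</name>
    <lab test="CRP" unit="mg/l">12.0</lab>
  </patient>
  <patient id="P2">
    <name>John Roe</name>
    <note>Complains about joint pain in the morning.</note>
  </patient>
</patients>
"""

root = ET.fromstring(xml_text)
rows = []
for patient in root.findall("patient"):
    row = {"id": patient.get("id")}
    for child in patient:
        # The tag name describes the content; different patients may differ.
        row[child.tag] = child.text
    rows.append(row)
```

Each row ends up with a different set of keys, which is exactly what a relational table with a fixed set of columns would not allow. The free-text `note` is in turn an example of content that is tagged but not standardized.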


“Multivariate” and “multidimensional” are modern words and consequently overused in literature. Each item of data is composed of variables, and if such a data item is defined by more than one variable it is called a multivariable data item. Variables are frequently classified into two categories: dependent or independent.


In Physics, Engineering and Statistics a variable is a physical property of a subject, whose quantity can be measured, e.g. mass, length, time, temperature, etc.


SMILES data (.smi) consists of a string obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph. The graph is first trimmed to remove hydrogen atoms, and cycles are broken to turn it into a spanning tree. Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes.
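The traversal described above can be sketched for very simple molecules. This is a deliberately reduced illustration and not a full SMILES writer: it assumes a hydrogen-suppressed graph with single bonds only, plain element symbols, and fewer than ten ring closures; the function name `to_smiles` and the input format are hypothetical.

```python
def to_smiles(atoms, bonds, start=0):
    """Write a SMILES-like string for a hydrogen-suppressed molecular graph.

    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    """
    adjacency = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        adjacency[i].append(j)
        adjacency[j].append(i)

    children = {i: [] for i in adjacency}   # edges of the spanning tree
    closures = {i: [] for i in adjacency}   # labels of the broken cycle edges
    visited = {start}
    seen_edges = set()
    next_label = 1

    def explore(u):
        # Pass 1: depth-first search that records tree edges and broken cycles.
        nonlocal next_label
        for v in adjacency[u]:
            edge = frozenset((u, v))
            if edge in seen_edges:
                continue
            seen_edges.add(edge)
            if v in visited:
                if next_label > 9:
                    raise ValueError("this sketch supports at most 9 ring closures")
                # The edge closes a cycle: break it and label both end points.
                closures[u].append(next_label)
                closures[v].append(next_label)
                next_label += 1
            else:
                visited.add(v)
                children[u].append(v)
                explore(v)

    def emit(u):
        # Pass 2: print symbols in depth-first order; branches go in parentheses.
        text = atoms[u] + "".join(str(label) for label in sorted(closures[u]))
        for child in children[u][:-1]:
            text += "(" + emit(child) + ")"
        if children[u]:
            text += emit(children[u][-1])
        return text

    explore(start)
    return emit(start)

# Cyclohexane: a ring of six carbon atoms (hydrogens already removed).
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
cyclohexane = to_smiles(["C"] * 6, ring)
```

For the ring, the bond between the last and the first carbon is the one that gets broken, and the label `1` appears on both atoms it connected.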


Proteomic analysis of mesenchymal stem cells (MSCs). Two-dimensional gel electrophoresis was performed using whole protein cell extracts from P2 MSC cultures of patients with rheumatoid arthritis (RA) (n=10) (A) and healthy controls (n=6) (B). After scanning, spot detection, quantification and normalisation, gels were compared using Hierarchical Clustering Software and Pearson test (C). No cluster could be detected using these proteomic profiles.

Proteomic analysis: Two-dimensional electrophoresis was performed using P2 MSCs in patients with RA (n=10) and healthy controls (n=6) (fig 4A, B). By using the Hierarchical Clustering method, we could not define any cluster that might discriminate patient and control cells (fig 4C). The Pearson correlation coefficient was not significantly different between patient and control cells (r=0.933 (0.022) and r=0.929 (0.020), respectively). These data corroborate the lack of significant changes in cytokine production between patients and controls.
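The Pearson correlation coefficient used to compare the gels can be computed directly from its definition. The sketch below uses invented spot intensities, not the data of the study, and the function name `pearson_r` is chosen for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined for constant sequences")
    return s_xy / math.sqrt(s_xx * s_yy)

# Invented normalised spot intensities of two gels (same spots, same order).
gel_a = [1.0, 2.0, 3.0, 4.0]
gel_b = [1.1, 1.9, 3.2, 3.9]
r = pearson_r(gel_a, gel_b)
```

Values of r close to 1 for every pair of gels, as reported above, mean that the profiles are too similar for a clustering method to separate patients from controls.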


http://www.rcsb.org/pdb/images/3ond_bio_r_500.jpg
The PDB is a large repository containing 3-D structural information, established in 1971. Data are stored in 2D but can in fact represent biological entities in three or more dimensions.


Transaxial (left), coronal (middle), and sagittal (right) images of a patient who was scanned for 30 min in list-mode with the BrainPET scanner; the recording was started 20 min after injection of about 300 MBq fluor-deoxy-glucose.


In Mathematics, hence in Informatics, however, a variable is associated with a space, often an n-dimensional Euclidean space $\mathbb{R}^n$, in which an entity (e.g. a function) or a phenomenon of continuous nature is defined. The data location within this space can be referenced by using a range of coordinate systems (e.g. Cartesian, polar coordinates, etc.): The dependent variables are those used to describe the entity (for example the function value) whilst the independent variables are those that represent the coordinate system used to describe the space in which the entity is defined. If a data set is composed of variables whose interpretation fits this definition, our goal is to understand how the ‘entity’ is defined within the n-dimensional Euclidean space $\mathbb{R}^n$. Sometimes we may distinguish between variables meaning measurement of a property and variables meaning a coordinate system, by referring to the former as variate and to the latter as dimension (Dos Santos & Brodlie, 2002), (dos Santos & Brodlie, 2004).

A space is a set of points. A metric space has an associated metric, which enables us to measure distances between points in that space and, in turn, implicitly define their neighborhoods. Consequently, a metric provides a space with a topology, and a metric space is a topological one. Topological spaces feel alien to us because we are accustomed to having a metric.

Biomedical example: A protein is a single chain of amino acids, which folds into a globular structure. The Thermodynamics Hypothesis states that a protein always folds into a state of minimum energy. To predict protein structure, we would like to model the folding of a protein computationally. As such, the protein folding problem becomes an optimization problem: We are looking for a path to the global minimum in a very high-dimensional energy landscape (Zomorodian, 2005).


Let us collect $n$-dimensional observations $x_i$ in the Euclidean vector space $\mathbb{R}^n$ and we get (Eq. 2-1):

$$x_i = [x_{i1}, \ldots, x_{in}]$$

A cloud of points sampled from any source (e.g. medical data, sensor network data, a solid 3-D object, surface etc.). Those data points can be coordinated as an unordered sequence in an arbitrarily high dimensional Euclidean space, where methods of algebraic topology can be applied. The main challenge is in mapping the data back into $\mathbb{R}^3$ or, to be more precise, into $\mathbb{R}^2$, because our retina is inherently perceiving data in $\mathbb{R}^2$. The cloud of such data points can be used as a computational representation of the respective data object. A temporal version can be found in motion-capture data, where geometric points are recorded as time series.

Now you will ask an obvious question: “How do we visualize a four-dimensional object?” The obvious answer is: “How do we visualize a three-dimensional object?” Humans do not see in three spatial dimensions directly, but via sequences of planar projections integrated in a manner that is sensed if not comprehended. Little children spend a significant time of their first year of life learning how to infer three-dimensional spatial data from paired planar projections, and many years of practice have tuned a remarkable ability to extract global structure from representations in a strictly lower dimension (Ghrist, 2008). Because we have the same problem here in this book, we must stay in $\mathbb{R}^2$ and therefore use the example in Slide 2-12 (Zomorodian, 2005).

In Einstein's theory of Special Relativity, Euclidean 3-space plus time (the "4th dimension") are unified into the Minkowski space.


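A metric space has an associated metric, which enables us to measure the distances between points in that space and, in turn, implicitly define their neighborhoods. Consequently, a metric provides a space with a topology, hence a metric space is a topological space. A set $X$ with a metric function $d$ is called a metric space. We give it the metric topology of $d$, where the set of open balls $B(x, r) = \{y \in X \mid d(x, y) < r\}$, with $x \in X$ and $r > 0$, defines the basis neighborhoods. Most of our “natural” spaces are a particular type of metric spaces, the Euclidean spaces: The Cartesian product of $n$ copies of $\mathbb{R}$, the set of real numbers, along with the Euclidean metric (Eq. 2-2)

$$d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$$

where $x_k$ and $y_k$ are the $k$-th coordinates of the points $x$ and $y$, is the $n$-dimensional Euclidean space $\mathbb{R}^n$. We may induce a topology on subsets of metric spaces as follows: If $A \subseteq X$ with topology $T$, then we get the relative or induced topology $T_A$ by defining $T_A = \{S \cap A \mid S \in T\}$. For more information refer to (Zomorodian, 2005) or (Edelsbrunner & Harer, 2010).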
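The Euclidean metric and the neighborhoods it induces are easy to compute for a small point cloud. The following sketch uses invented points in $\mathbb{R}^3$; the function names are chosen for illustration.

```python
import math

def euclidean(x, y):
    """Euclidean metric between two points of the same dimension."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distance_matrix(points):
    """Pairwise distances stored as a two-dimensional array."""
    return [[euclidean(p, q) for q in points] for p in points]

def open_ball(center, radius, points):
    """All sampled points with distance strictly less than radius from center."""
    return [p for p in points if euclidean(center, p) < radius]

# A tiny invented point cloud in three dimensions.
points = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 1.0)]
D = distance_matrix(points)
neighborhood = open_ball(points[0], 2.0, points)
```

The resulting matrix is symmetric with a zero diagonal and satisfies the triangle inequality, which are the defining properties of a metric. The open ball around the first point contains only the points closer than the chosen radius.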


Knowledge Discovery from Data: By getting insight into the data, the gained information can be used to build up knowledge. The grand challenge is to map higher dimensional data into lower dimensions, hence make it interactively accessible to the end-user (Holzinger, 2012), (Holzinger, 2013). This mapping from $\mathbb{R}^N \to \mathbb{R}^2$ is the core task of visualization and a major component for knowledge discovery: Enabling effective interactive human control over powerful machine algorithms to support human sensemaking (Holzinger, 2012), (Holzinger, 2013).

Holzinger, A. 2013. Human–Computer Interaction & Knowledge Discovery (HCI-KDD): What is the benefit of bringing those two fields to work together? In: Alfredo Cuzzocrea, C. K., Dimitris E. Simos, Edgar Weippl, Lida Xu (ed.) Multidisciplinary Research and Practice for Information Systems, Springer Lecture Notes in Computer Science LNCS 8127. Heidelberg, Berlin, New York: Springer, pp. 319-328.


Multivariate dataset is a data set that has many dependent variables, and they might be correlated to each other to varying degrees. Usually this type of dataset is associated with discrete data models.

Multidimensional dataset is a data set that has many independent variables clearly identified, and one or more dependent variables associated to them. Usually this type of dataset is associated with continuous data models.

In other words, every data item (or object) in a computer is represented (stored) as a set of features. Instead of the term features we may use the term dimensions, because an object with $n$ features can also be represented as a multidimensional point in an $n$-dimensional space. Dimensionality reduction is the process of mapping an $n$-dimensional point into a lower $k$-dimensional space; this is the main challenge in visualization, see → Lecture 9.
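One common way to perform such a mapping is principal component analysis (PCA). The lecture text does not prescribe a method, so PCA is chosen here only as an example. The sketch below is a pure-Python illustration using power iteration; the function names, the iteration count and the sample data are assumptions, and a numerical library would normally be used for real data.

```python
import math
import random

def center(points):
    """Subtract the mean of every feature."""
    n, d = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    return [[p[k] - mean[k] for k in range(d)] for p in points]

def covariance(centered):
    n, d = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def top_eigenvector(matrix, iterations=500, seed=0):
    """Power iteration: repeated multiplication converges to the dominant direction."""
    rng = random.Random(seed)
    d = len(matrix)
    v = [rng.random() + 0.1 for _ in range(d)]
    for _ in range(iterations):
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        w = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
        if all(x == 0 for x in w):
            break
        v = w
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def pca_project(points, k=2):
    """Map n-dimensional points onto their first k principal components."""
    centered = center(points)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(k):
        v = top_eigenvector(cov)
        eigenvalue = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
        components.append(v)
        # Deflation: remove the direction just found before searching the next one.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    return [[sum(r[j] * c[j] for j in range(d)) for c in components] for r in centered]

# Invented 3-D points that all lie in the plane z = x + y.
points = [[float(x), float(y), float(x + y)] for x in range(4) for y in range(3)]
projected = pca_project(points, k=2)
```

Because the sample points lie in a plane, the two-dimensional projection keeps all pairwise distances; for general data some information is lost, which is why the choice of the mapping matters for visualization.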

The number of dimensions can sometimes be small, ranging from simple 1D data, such as temperature measured at different times, to 3D applications such as medical imaging, where data is captured within a volume. Standard techniques (contouring in 2D; isosurfacing and volume rendering in 3D) have emerged over the years to handle this sort of data. There is no dimension reduction issue in these applications, since the data and display dimensions essentially match.


Data can be categorized into qualitative (nominal and ordinal) and quantitative (interval and ratio): Interval and ratio data are parametric, and are used with parametric tools in which distributions are predictable (and often Normal). Nominal and ordinal data are non-parametric, and do not assume any particular distribution. They are used with non-parametric tools such as the histogram. The classic paper on the theory of scales of measurement is (Stevens, 1946).

We can summarize what we have learned so far about data: Data can be numeric, non-numeric, or both. Non-numeric data can include anything from language data (text) to categorical, image, or video data. Data may range from completely structured, such as categorical data, to semi-structured, such as an XML file containing meta information, to unstructured, such as narrative “free text”. Note that the term unstructured does not mean that the data are without any pattern, which would mean complete randomness and uncertainty, but rather that “unstructured data” are expressed in such a way that only humans can meaningfully interpret them. Structure provides information that can be interpreted to determine data organization and meaning; hence it provides a context for the information. The inherent structure in the data can form a basis for data representation. An important, yet often neglected, issue is the temporal characteristics of data: data of all types may have a temporal (time) association, and this association may be either discrete or continuous (Thomas & Cook, 2005). In medical informatics we have a permanent interaction between data, information and knowledge, with different definitions (Bemmel & Musen, 1997), see Slide 2-16:

Data are the physical entities at the lowest abstraction level, which are generated, e.g., by a patient (patient data) or a biological process (e.g. omics data). According to (Bemmel & Musen, 1997) data contain no meaning.

Information is derived by interpretation of the data by a clinician (human intelligence).

Knowledge is obtained by inductive reasoning with previously interpreted data, collected from many similar patients or processes, which is added to the so-called body of knowledge in medicine, the explicit knowledge. This knowledge is used for the interpretation of other data and to gain implicit knowledge, which guides the clinician in taking further action.

For hypothesis generation and testing, four types of inference exist (Peirce, 1955): abstraction, abduction, deduction, and induction. The first two drive hypothesis generation while the latter two drive hypothesis testing, see Slide 2-17. Abstraction means that data are filtered according to their relevance for the problem solution and chunked into schemas representing an abstract description of the problem (e.g., abstracting that an adult male with a haemoglobin concentration of less than 14 g/dL is an anaemic patient). Following this, hypotheses that could account for the current situation are related through a process of abduction, characterized by a "backward flow" of inferences across a chain of directed relations which identify those initial conditions from which the current abstract representation of the problem originates. This provides tentative solutions to the problem at hand by way of hypotheses. For example, knowing that disease A will cause symptom B, abduction will try to identify the explanation for B, while deduction will forecast that a patient affected by disease A will manifest symptom B: both inferences use the same relation along two different directions (Patel & Ramoni, 1997). Abduction is characterized by a cyclical process of generating possible explanations (i.e., identification of a set of hypotheses that are able to account for the clinical case on the basis of the available data) and testing those explanations (i.e., evaluation of each generated hypothesis on the basis of its expected consequences) for the abnormal state of the patient at hand (Patel, Arocha & Zhang, 2004).

The hypothesis testing procedures can be inferred from Slide 2-17: General knowledge is gained from many patients, and this general knowledge is then applied to an individual patient. We have to distinguish between: Reasoning is the process by which a clinician reaches a conclusion after thinking about all the facts; Deduction consists of deriving a particular valid conclusion from a set of general premises; Induction consists of deriving a likely general conclusion from a set of particular statements. Reasoning in the “real world” does not appear to fit neatly into any of these basic types. Therefore, a third form of reasoning has been recognized by Peirce (1955), where deduction and induction are inter-mixed.

The question “what is information?” is still an open question in basic research, and any definition depends on the view taken. For example, the definition given by Carl-Friedrich von Weizsäcker, “Information is what is understood,” implies that information has both a sender and a receiver who have a common understanding of the representation and the means to convey information using some properties of the physical systems; and his addendum, “Information has no absolute meaning; it exists relatively between two semantic levels,” implies the importance of context (Marinescu, 2011). Without doubt, information is a fundamentally important concept within our world, and life is complex information, see Slide 2-14:

Many systems, e.g. in the quantum world, do not obey the classical view of information. In the quantum world and in the life sciences, traditional information theory often fails to accurately describe reality, for example in the complexity of a living cell: all complex life is composed of eukaryotic (nucleated) cells (Lane & Martin, 2010). A good example of such a cell is the protist Euglena gracilis (in German “Augentierchen”) with a length of approx. 30 µm. Life can be seen as a delicate interplay of energy, entropy and information; essential functions of living beings correspond to the generation, consumption, processing, preservation and duplication of information.

P: Complexity <> Information <> Energy <> Entropy

The etymological origin of the word information can be traced back to the Latin “informatio” and “informare” (from “forma”, shape): to bring something into a shape (“in-a-form”). Consequently, the naive definition in computer science is “information is data in context”, and it is therefore different from data or knowledge. However, we follow the notion of (Boisot & Canals, 2004) and define information as an extraction from data that, by modifying the relevant probability distributions, has a direct influence on an agent's knowledge base. For a better understanding of this concept, we first review the model of human information processing by Wickens (1984). The model by Wickens (1984) beautifully emphasizes our view on data, information and knowledge: the physical data from the real world are perceived as information through perceptual filters, controlled by selective attention, and form hypotheses within the working memory. These hypotheses are the expectations depending on our previous knowledge available in our mental model, stored in the long-term memory. The subjectively best alternative hypothesis will be selected and processed further and may be taken as the outcome for an action. Because this system is a closed loop, we get feedback through new data perceived as new information, and the process goes on.

The incoming stimuli from the physical world must pass both a perceptual filter and a conceptual filter. The perceptual filter orientates the senses (e.g. the visual sense) to certain types of stimuli within a certain physical range (e.g. visual signal range, pre-knowledge, attention etc.). Only the stimuli which pass through this filter get registered as incoming data; everything else is filtered out. At this point it is important to follow our physical principle of data and to differentiate between two notions that are frequently confused: an experiment's (raw, hard, measured, factual) data and its (meaningful, subjective) interpreted information results. Data are properties concerning only the instrument; they are the expression of a fact. The result concerns a property of the world. The subsequent conceptual filters extract information-bearing data from what has been previously registered. Both types of filters are influenced by the agents' cognitive and affective expectations, stored in their mental models. The enormous utility of data resides in the fact that they can carry information about the physical world. This information may modify set expectations or the state of knowledge. These principles allow an agent to act in adaptive ways in the physical world (Boisot & Canals, 2004). Compare this process with the human information processing model by (Wickens, 1984), seen in Slide 2-19 and discussed in → Lecture 7.

Entropy has many different definitions and applications. It originated in statistical physics, where it is most often used as a measure of disorder. In information theory, entropy can be used as a measure of the uncertainty in a data set.

To demonstrate how useful entropy can be, you can have a look at this paper: Holzinger, A., Stocker, C., Peischl, B. & Simonic, K.-M. 2012. On Using Entropy for Enhancing Handwriting Preprocessing. Entropy, 14, (11), 2324-2350. http://www.mdpi.com/1099-4300/14/11/2324

The concept of entropy was first introduced in thermodynamics (Clausius, 1850), where it was used to provide a statement of the second law of thermodynamics. Later, statistical mechanics, through the work of Boltzmann, provided a connection between the macroscopic property of entropy and the microscopic state of a system. Shannon was the first to define entropy and mutual information in the context of information theory.

Shannon (1948) used a Gedankenexperiment (thought experiment) to propose a measure of uncertainty in a discrete distribution based on the Boltzmann entropy of classical statistical mechanics, see Slide 2-22:

An example shall demonstrate the usefulness of this approach:

1) Let $X$ be a discrete data set with associated probabilities $p_i$ (Eq. 2-5):

$X = \{x_1, \ldots, x_N\}, \quad P = \{p_1, \ldots, p_N\}$

2) Now we apply Shannon's equation Eq. 2-4 (Eq. 2-6):

$H(X) = -\sum_{i=1}^{N} p_i \log_2 p_i$

3) We assume that our source has two values (ball = white, ball = black). Let us do the famous simple Gedankenexperiment (thought experiment): Imagine a box which can contain two colored balls: black and white. This is our set of discrete symbols with associated probabilities. If we grab blindly into this box to get a ball, we are dealing with uncertainty, because we do not know which ball we touch. We can ask: Is the ball black? NO. THEN it must be white, so we need one question to surely provide the right answer. Because it is a binary decision (YES/NO), the maximum number of (binary) questions required to reduce the uncertainty is $\log_2 N$, where $N$ is the number of possible outcomes. If there are $N$ events with equal probability $p$, then $p = 1/N$. If you have only 1 black ball, then $\log_2 1 = 0$, which means there is no uncertainty (Eq. 2-7):

$X = \{x_1, x_2\}$ with $P = \{p, 1 - p\}$

4) Now we solve Eq. 2-6 numerically (Eq. 2-8):

$H = p \cdot \log_2 \frac{1}{p} + (1 - p) \cdot \log_2 \frac{1}{1 - p}$

Since $p$ ranges from 0 (for impossible events) to 1 (for certain events), the surprise $\log_2 \frac{1}{p}$ ranges from infinity (for impossible events) to 0 (for certain events). So we can summarize that the entropy is the weighted average of the surprise over all possible outcomes. For our example with the two balls we can draw the following function: the entropy value is 1 for $p = 0.5$, and it is 0 for either $p = 0$ or $p = 1$. This example might seem trivial, but the entropy principle has been developed a lot since Shannon, and there are many different methods which are very useful for dealing with data.
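The two-ball example can be checked numerically with a few lines of Python. This is only a minimal sketch of Eq. 2-8; the function name `binary_entropy` is hypothetical, and the convention that the terms for $p = 0$ and $p = 1$ contribute zero is the usual one, stated here as an assumption.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a source with two outcomes of probability p and 1 - p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if p in (0.0, 1.0):
        # Convention: an impossible outcome contributes nothing to the average
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))


# Evaluate the curve at a few points between certainty and maximal uncertainty
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, binary_entropy(p))
```

The curve is symmetric around $p = 0.5$, where one full binary question is needed, and falls to zero at both ends, where the colour of the ball is known in advance.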

Shannon called it the information entropy (aka Shannon entropy) and defined (Eq. 2-9):

$\log_2 \frac{1}{p} = -\log_2 p$

where $p$ is the probability of the event occurring. If $p$ is not identical for all events, then the entropy $H$ is a weighted average of the surprises of all events, which Shannon defined as (Eq. 2-10):

$H = -\sum_{i=1}^{N} p_i \log_2 p_i$

Basically, the entropy $H$ approaches zero if we have a maximum of structure; conversely, the entropy $H$ reaches high values if there is no structure. Hence, ideally, if the entropy is at a maximum, we have complete randomness and total uncertainty. Low entropy means differences, structure, individuality; high entropy means no differences, no structure, no individuality. Consequently, life needs low entropy.

The principle of what we can infer from entropy values is:
1) Low entropy values mean high probability, high certainty, and hence a high degree of structurization in the data.
2) High entropy values mean low probability, low certainty (≅ high uncertainty ;-), and hence a low degree of structurization in the data.

Maximum entropy would mean complete randomness and total uncertainty. Highly structured data contain low entropy; ideally, if everything is in order and there is no surprise (no uncertainty), the entropy is low (Eq. 2-11):

$H = H_{\min} = 0$

On the other hand, if the data are weakly structured, as for example in biological data, and there is no ability to guess (all data are equally likely), the entropy is high (Eq. 2-12):

$H = H_{\max} = \log_2 N$

If we follow this approach, “unstructured data” would mean complete randomness. Let us look at the history of entropy to understand what we can do in the future, see Slide 2-25.
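The two limiting cases, zero entropy for fully ordered data and $\log_2 N$ for $N$ equally likely symbols, can be illustrated with a short sketch. It estimates the probabilities from relative symbol frequencies, which is an assumption made here for illustration; the function name `entropy_of_symbols` and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def entropy_of_symbols(symbols):
    """Shannon entropy in bits, with probabilities estimated from frequencies."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


ordered = "AAAAAAAA"   # one symbol only: no surprise at all
uniform = "ACGTACGT"   # four equally frequent symbols: maximal uncertainty
skewed = "AAAAAACG"    # some structure: entropy lies between the two limits

for name, data in (("ordered", ordered), ("uniform", uniform), ("skewed", skewed)):
    print(name, entropy_of_symbols(data))
```

Note that this frequency-based estimate ignores the order of the symbols; a sequence such as "ACGTACGT" is perfectly regular in time, which is one motivation for the time-series measures discussed later in this lecture.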

You might ask what the practical purpose of this approach is: there are manifold applications!

The origin may be found in the work of Jakob Bernoulli, describing the principle of insufficient reason: if we are ignorant of the ways an event can occur, the event will occur equally likely in any way. Thomas Bayes (1763) and Pierre-Simon Laplace (1774) carried on, and Harold Jeffreys and David Cox solidified it in Bayesian statistics, aka statistical inference.

The second path leading to the classical Maximum Entropy, en route with the Shannon entropy, can be identified with the work of James Clerk Maxwell and Ludwig Boltzmann, continued by Willard Gibbs and finally Claude Elwood Shannon. This work is geared toward developing the mathematical tools for statistical modeling of problems in information.

These two independent lines of research are very similar. The objective of the first line of research is to formulate a theory/methodology that allows understanding of the general characteristics (distribution) of a system from partial and incomplete information. In the second line of research, the same objective is expressed as determining how to assign (initial) numerical values of probabilities when only some (theoretical) limited global quantities of the investigated system are known. Recognizing the common basic objectives of these two lines of research aided Jaynes in the development of his classical work, the Maximum Entropy formalism. This formalism is based on the first line of research and the mathematics of the second line of research. The interrelationship between information theory, statistics and inference, and the Maximum Entropy (MaxEnt) principle became clear in the 1950s, and many different methods arose from these principles (Golan, 2008), see next slide.

Maximum Entropy (MaxEn), described by (Jaynes, 1957), is used to estimate unknown parameters of a multinomial discrete choice problem, whereas the Generalized Maximum Entropy (GME) model includes noise terms in the multinomial information constraints. Each noise term is modeled as the mean of a finite set of a priori known points in the interval $[-1, 1]$ with unknown probabilities, where no parametric assumptions about the error distribution are made. A GME model for the multinomial probabilities and for the distributions associated with the noise terms is derived by maximizing the joint entropy of multinomial and noise distributions, under the assumption of independence (Jaynes, 1957).

Topological Entropy (TopEn) was introduced by (Adler, Konheim & McAndrew, 1965) with the purpose of introducing the notion of entropy as an invariant for continuous mappings: Let $(X, T)$ be a topological dynamical system, i.e., let $X$ be a nonempty compact Hausdorff space and $T: X \to X$ a continuous map; the TopEn is a nonnegative number which measures the complexity of the system (Adler, Downarowicz & Misiurewicz, 2008).

Graph Entropy was described by (Mowshowitz, 1968) to measure the structural information content of graphs, and a different definition, more focused on problems in information and coding theory, was introduced by (Körner, 1973). Graph entropy is often used for the characterization of the structure of graph-based systems, e.g. in mathematical biochemistry. In these applications the entropy of a graph is interpreted as its structural information content and serves as a complexity measure; such a measure is associated with an equivalence relation defined on a finite graph. By applying Shannon's Eq. 2-4 to the resulting probability distribution we get a numerical value that serves as an index of the structural feature captured by the equivalence relation (Dehmer & Mowshowitz, 2011).

Minimum Entropy (MinEn), described by (Posner, 1975), provides us with the least random and the least uniform probability distribution of a data set, i.e. the minimum uncertainty, which is the limit of our knowledge and of the structure of the system. Often, classical pattern recognition is described as a quest for minimum entropy. Mathematically, it is more difficult to determine a minimum entropy probability distribution than a maximum entropy probability distribution; while the latter has a global maximum due to the concavity of the entropy, the former has to be obtained by calculating all local minima. Consequently, the minimum entropy probability distribution may not exist in many cases (Yuan & Kesavan, 1998).

Cross Entropy (CE), discussed by (Rubinstein, 1997), was motivated by an adaptive algorithm for estimating probabilities of rare events in complex stochastic networks, which involves variance minimization. CE can also be used for combinatorial optimization problems (COP). This is done by translating the “deterministic” optimization problem into a related “stochastic” optimization problem and then using rare event simulation techniques (De Boer et al., 2005).

Rényi entropy is a generalization of the Shannon entropy (information theory), and Tsallis entropy is a generalization of the standard Boltzmann–Gibbs entropy (statistical physics).

For us more important are:

Approximate Entropy (ApEn), described by (Pincus, 1991), can be used to quantify regularity in data without any a priori knowledge about the system, see an example in Slide 2-20.

Sample Entropy (SampEn) was used by (Richman & Moorman, 2000) as a new related measure of time series regularity. SampEn was designed to reduce the bias of ApEn and is better suited for data sets with known probabilistic content.
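The statement that the Rényi entropy generalizes the Shannon entropy can be made tangible with a small sketch. It assumes the standard definition $H_\alpha = \frac{1}{1 - \alpha} \log_2 \sum_i p_i^\alpha$ for $\alpha \ge 0$, $\alpha \ne 1$, which is not written out in the lecture text; the function names are hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def renyi_entropy(probs, alpha):
    """Renyi entropy of order alpha in bits (assumed standard definition)."""
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    if math.isclose(alpha, 1.0):
        # The limit alpha -> 1 is the Shannon entropy
        return shannon_entropy(probs)
    return math.log2(sum(p ** alpha for p in probs if p > 0)) / (1.0 - alpha)


probs = [0.5, 0.25, 0.25]
for alpha in (0.0, 0.5, 0.999, 1.0, 2.0):
    print(alpha, renyi_entropy(probs, alpha))
```

As the order approaches 1 the values approach the Shannon entropy, and for a uniform distribution every order gives the same value $\log_2 N$.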

Problem: Monitoring body movements along with vital parameters during sleep provides important medical information regarding general health, and can therefore be used to detect trends (large epidemiological studies) and to discover severe illnesses, including hypertension (which is increasing enormously in our society). This seemingly simple data, from only one night period, demonstrates the complexity and the boundaries of standard methods (for example the Fast Fourier Transform) to discover knowledge (for example deviations, similarities etc.). Due to the complexity and uncertainty of such data sets, standard methods (such as FFT) carry the danger of modeling artifacts. Since the knowledge of interest for medical purposes is in anomalies (alterations, differences, atypicalities, irregularities), the application of entropic methods provides benefits. Photograph taken during the EU Project EMERGE and used with permission.

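1) We have a given data set $\langle x_n \rangle$, where capital $N$ is the number of data points (Eq. 2-13):

$\langle x_n \rangle = \{x_1, x_2, \ldots, x_N\}$

2) Now we form $m$-dimensional vectors (Eq. 2-14):

$\vec{X}_i = (x_i, x_{i+1}, \ldots, x_{i+m-1})$

3) We measure the distance between every pair of vectors, i.e. the maximum absolute difference between their scalar components (Eq. 2-15):

$d[\vec{X}_i, \vec{X}_j] = \max_{k = 1, 2, \ldots, m} |x_{i+k-1} - x_{j+k-1}|$

4) We look, so to say, in which dimension the biggest difference is; as a result we get the Approximate Entropy (if there is no difference we have zero relative entropy) (Eq. 2-16):

$\mathrm{ApEn}(m, r) = \lim_{N \to \infty} [\phi^m(r) - \phi^{m+1}(r)]$

where $m$ is the run length and $r$ is the tolerance window (let us assume that … is equal to …); ApEn(m, r) could also be written as $H_N(m, r)$.

5) $\phi^m(r)$ is computed by (Eq. 2-17):

$\phi^m(r) = \frac{1}{N - m + 1} \sum_{i=1}^{N-m+1} \ln C_i^m(r)$

with (Eq. 2-18):

$C_i^m(r) = \frac{N^m(i)}{N - m + 1}$

where $N^m(i)$ is the number of vectors $\vec{X}_j$ with $d[\vec{X}_i, \vec{X}_j] \le r$.

6) $C_i^m(r)$ measures, within the tolerance $r$, the regularity of patterns similar to a given one of window length $m$.

7) Finally we increase the dimension to $m + 1$, repeat the steps before, and get as a result the approximate entropy ApEn(m, r, N).

ApEn(m, r, N) is approximately the negative natural logarithm of the conditional probability (CP) that a data set of length $N$, having repeated itself within a tolerance $r$ for $m$ points, will also repeat itself for $m + 1$ points. An important point to keep in mind about the parameter $r$ is that it is commonly expressed as a fraction of the standard deviation (SD) of the data, which makes ApEn a scale-invariant measure. A low value arises from a high probability of repeated template sequences in the data (Hornero et al., 2006).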
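The seven steps above can be sketched in plain Python. This is a minimal, unoptimized illustration for finite $N$ (no limit is taken), not the implementation used for the figures in this lecture. The defaults $m = 2$ and $r = 0.2 \cdot \mathrm{SD}$ are common choices in the literature and are assumptions here; the function name `approximate_entropy` and the test signals are hypothetical.

```python
import math
import random
import statistics


def approximate_entropy(x, m=2, r=None):
    """ApEn(m, r, N) of a list of numbers x, following steps 1 to 7."""
    n = len(x)
    if r is None:
        # Tolerance as a fraction of the standard deviation (assumed default)
        r = 0.2 * statistics.pstdev(x)

    def phi(dim):
        # Step 2: all overlapping vectors of length dim
        vectors = [x[i:i + dim] for i in range(n - dim + 1)]
        total = 0.0
        for a in vectors:
            # Steps 3 and 5: count vectors within maximum-norm distance r
            # (the vector itself is included, so the count is never zero)
            matches = sum(
                1 for b in vectors
                if max(abs(u - v) for u, v in zip(a, b)) <= r
            )
            total += math.log(matches / len(vectors))
        return total / len(vectors)

    # Steps 4 and 7: compare regularity at dimension m and m + 1
    return phi(m) - phi(m + 1)


regular = [1.0, 2.0] * 50                        # strictly alternating signal
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(200)]  # irregular signal

print("regular:", approximate_entropy(regular))
print("noise:  ", approximate_entropy(noise))
```

The running time grows quadratically with the number of data points, which is acceptable for a single night of sampled data but would call for a vectorized or windowed implementation on long recordings.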

In this slide we can see the plot of the normalized approximate entropy for each of the episodes and the median across all the episodes. From this figure we can see that the entropy is at a minimum where we have no alterations and increases when there are irregularities. If we have no differences, we get zero entropy.

A final example should make the advantage of such an entropy method totally clear: In the right diagram it is hard for a medical professional to discover irregularities, especially over a longer period, but an anomaly can easily be detected by displaying the measured relative ApEn. What can we learn from this experiment? Approximate entropy is relatively unaffected by noise; it can be applied to complex time series with good reproducibility; it is finite for stochastic, noisy, composite processes; the values correspond directly to irregularities; and it is applicable to many other areas, for example the classification of large sets of texts: the ability to guess algorithmically the subject of a text collection without having to read it would permit automated classification.

My DEDICATION is to make data valuable … Thank you!

MFC = Minimum Foot Clearance; stride = step. You can see brilliantly what you can measure with entropy: you can determine anomalies, i.e. the balance problems of elderly gait.

MFC Poincaré plots. Top panels show the MFC time series from a healthy elderly subject (A) and its corresponding Poincaré plot (B). Bottom panels show the MFC time series from an elderly subject with a balance problem (C) and its corresponding Poincaré plot (D).

Significant relationships of mean MFC with Poincaré plot indexes (SD1, SD2) and ApEn (r = 0.70, p < 0.05; r = 0.86, p < 0.01; r = 0.74, p < 0.05) were found in the falls-risk elderly group. On the other hand, such relationships were absent in the healthy elderly group. In contrast, the ApEn values of the MFC data series were significantly (p < 0.05) correlated with Poincaré plot indexes of MFC in the healthy elderly group, whereas correlations were absent in the falls-risk group. The ApEn values in the falls-risk group (mean ApEn = 0.18 ± 0.03) were significantly (p < 0.05) higher than those in the healthy group (mean ApEn = 0.13 ± 0.13). The higher ApEn values in the falls-risk group might indicate increased irregularities and randomness in their gait patterns and a loss of the gait control mechanism. ApEn values of randomly shuffled MFC data of falls-risk subjects did not show any significant relationship with mean MFC.
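The Poincaré plot indexes SD1 and SD2 mentioned above are not defined on the slide. The following sketch uses the commonly cited definitions, in which each point $(x_n, x_{n+1})$ is rotated by 45 degrees and the spread across and along the line of identity is measured; that the cited study used exactly these definitions is an assumption. The function name `poincare_sd1_sd2` and the toy series are hypothetical.

```python
import math
import statistics


def poincare_sd1_sd2(series):
    """Return (SD1, SD2) of the Poincare plot of consecutive values."""
    current = series[:-1]
    following = series[1:]
    # Spread perpendicular to the line of identity: short-term variability
    across = [(b - a) / math.sqrt(2) for a, b in zip(current, following)]
    # Spread along the line of identity: long-term variability
    along = [(b + a) / math.sqrt(2) for a, b in zip(current, following)]
    return statistics.stdev(across), statistics.stdev(along)


steady_trend = [1.0, 2.0, 3.0, 4.0, 5.0]   # identical increments from step to step
alternating = [1.0, 2.0, 1.0, 2.0, 1.0]    # large step-to-step changes

print(poincare_sd1_sd2(steady_trend))
print(poincare_sd1_sd2(alternating))
```

A series with constant increments has no spread across the line of identity, while a strictly alternating series has no spread along it, so the two indexes capture different aspects of variability than a single entropy value does.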

Surrogate data records. A and B show the major components. A: the mean process, which has set point and spike modes. B: the baseline process, here meaning the heart rate variability, modeled as Gaussian random numbers. C: their sum, a surrogate data record. D–F: a more realistic surrogate with the same frequency content as the observed data. D: a clinically observed data record of 4,096 R-R intervals. The left-hand ordinate is labeled in ms and the right-hand ordinate in SD. E: a 4,096-point isospectral surrogate data set formed using the inverse Fourier transform of the periodogram of the data in D. F: the surrogate data after addition of a clinically observed deceleration lasting 50 points and scaled so that the variance of the record is increased from 1 to 2.

http://support.sas.com/documentation/cdl/en/etsug/60372/HTML/default/viewer.htm#etsug_entropy_sect018.htm

Where many other languages refer to tables, rows, and columns/fields, SAS uses the terms data sets, observations, and variables. There are only two kinds of variables in SAS: numeric and character (string). By default all numeric variables are stored as (8-byte) real. It is possible to reduce precision in external storage only. Date and datetime variables are numeric variables that inherit the C tradition and are stored as either the number of days (for date variables) or seconds (for datetime variables).
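When SAS data are exported to other tools, these numeric date values have to be converted back into calendar dates. The sketch below assumes the SAS reference point of 1 January 1960, which is documented by SAS but not stated in the text above; the function names are hypothetical.

```python
from datetime import date, datetime, timedelta

# Assumed reference point: SAS counts from 1 January 1960
SAS_EPOCH_DATE = date(1960, 1, 1)
SAS_EPOCH_DATETIME = datetime(1960, 1, 1)


def sas_date_to_python(days):
    """Convert a SAS date value (days since the reference date)."""
    return SAS_EPOCH_DATE + timedelta(days=days)


def sas_datetime_to_python(seconds):
    """Convert a SAS datetime value (seconds since the reference midnight)."""
    return SAS_EPOCH_DATETIME + timedelta(seconds=seconds)


print(sas_date_to_python(0))
print(sas_date_to_python(19631))
print(sas_datetime_to_python(86400))
```

Because both kinds of values are plain numbers, confusing a date with a datetime variable silently produces a date that is off by a factor of 86,400, so the variable type should be checked before converting.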

http://www.sas.com/technologies/analytics/statistics/stat/index.html

Hadoop and the MapReduce programming paradigm already have a substantial base in the bioinformatics community, in particular in the field of high-throughput next-generation sequencing analysis. This is due to the cost-effectiveness of Hadoop-based analysis on commodity Linux clusters, and in the cloud via data upload to cloud vendors who have implemented Hadoop/HBase; and due to the effectiveness and ease of use of the MapReduce method in the parallelization of many data analysis algorithms.
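The MapReduce idea itself can be shown without a cluster. The following single-process sketch counts k-mers (substrings of length k) in sequencing reads using separate map, shuffle and reduce phases. It does not use the Hadoop API; all function names and the toy reads are hypothetical, and k-mer counting is chosen only as a typical sequencing-related example.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(read, k=3):
    """Map phase: emit a (k-mer, 1) pair for every position in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle phase: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: combine all values of one key into a single count."""
    return key, sum(values)


def count_kmers(reads, k=3):
    # Each read is mapped independently, which is what makes the map phase
    # easy to distribute over many machines in a real framework
    mapped = chain.from_iterable(map_kmers(read, k) for read in reads)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


reads = ["ACGT", "CGTA"]
print(count_kmers(reads, k=3))
```

In a real Hadoop job the map and reduce functions would run on different nodes and the framework would perform the shuffle; the structure of the computation stays the same.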

Nanomedicine opens new avenues for much research in biomedical informatics methods and tools. Certainly, the future challenges are new topics such as “big data” in bioinformatics, novel methods for the use of “omics” data, etc. Future research is needed on algorithmic and methodological issues. This requires the willingness to cooperate across different disciplines.

Two areas offer ideal conditions for solving these challenges: Human-Computer Interaction (HCI) and Knowledge Discovery and Data Mining (KDD), with the goal of supporting human intelligence with machine intelligence, in order to discover new, previously unknown insights into the data.

Holzinger, A. 2013. Human–Computer Interaction & Knowledge Discovery (HCI-KDD): What is the benefit of bringing those two fields to work together? In: Alfredo Cuzzocrea, C. K., Dimitris E. Simos, Edgar Weippl, Lida Xu (ed.) Multidisciplinary Research and Practice for Information Systems, Springer Lecture Notes in Computer Science LNCS 8127. Heidelberg, Berlin, New York: Springer, pp. 319-328.

The challenge we face is that an estimated average of 5% of data are structured; the rest is either semi-structured or weakly structured, and most of our data is unstructured.

Maybe the most important field for the future is data mining, especially novel techniques of data mining that include both time and space (e.g. graph-based, entropy-based, and topology-based data mining approaches).

http://minnesotafuturist.pbworks.com/w/page/21441129/DIKW

A funny description of data, information and knowledge.

A very striking image. Nice to look at, but its usefulness is questionable.

All these models are very questionable. Please remember that in our lecture we follow the notion of Boisot & Canals.

The interesting aspect of this graphic is that it includes a time axis, which is important for decision making and predictive analytics.
