I’m Grace Peng, one of the data specialists at the NCAR ... · Standardizaon of data is not a new...

Preview:

Citation preview

I’mGracePeng,oneofthedataspecialistsattheNCARResearchDataArchiveakaastheRDA.ManywithinNCARknowusasthepeoplewhostagedataon/gladeforyou.Somepeopleinthisroommayhaveneverheardofus.Formorethan40years,theRDAhasbeencollecInganddisseminaIngweatherandclimatedatatotheresearchcommunity.WehostagrowingcollecIonofover600datasets.Dataspecialistsarepartdatacurators,partdataengineers,partsoLwaredevelopers,andpartsubjectmaNerexperts.Amongourdataspecialists,wehaveexpertsinmeteorology,oceanography,mathemaIcs,physics,chemistryandengineering.We’rewitnessingalargeincreaseintheuseofwxandclimatedatainthecommercial,governmentandeducaIonalsector.Ourgrowinguserbaseincludesmorenon-expertssowefindourselvesbecomingeducatorsaboutwxandclimatedataaswell.Dataremainsaliveaslongaspeopleareusingit.Workingwithstudentshelpsuslearnwhatweneedtodototoday’sdatatomakeitusefulforfutureusers.BythepeoplecometomewiththeirquesIons,theyhaveusuallybeguntheir

1

I’lltrytogetthroughthesefivefundamentalquesIonsin25minutes.IfIdon’t,pleasecometotheBeNerthanFreeworkshoptomorrow.ThefirstQ,Canthis…soundsobvious,butit’sactuallydifficult.ThisisthepartwhereyoumakesureyouhaveascienceprojectandnotanOp-Edpiece.Thesecond(andfirst)Qtogetherhelpsyouformulateyourresearchplan.Datadiscovery,findingthebestdataforyourproject,isadifficulttechnicalproblem.Datasearchenginesfoundedonmetadatacanhelpyou.Ifthedatadoesn’tmatchwhatyouexpect(fromthedocumentaIon),youcouldbeexperiencingtechnicaldifficulIesreadingthedataincorrectly,oryoumayhavediscoveredanerrorinthedata.Scienceshouldbereproducible.You’veheardothertalksthismorningaboutdatacuraIonanddigitalobjectidenIfiersfordatasets.I’llonlyneedtocoverthemechanicsofhowtocitedatasothatothersunderstandwhatyoudid.

2

Canthisproblembeansweredwithdata?isametaquesIonthatyouneedtothinkthroughbeforestarIngonanyendeavor.Isitmeasurable?

Ifnotdirectlymeasureable,isitrelatedtosomethingmeasurable?Hasthedatabeencollectedalready?MayhavetowaitunIldataexists.

Shouldthisproblembetackled?Don’tcollectsensiIveinformaIonifyoucan’tsafeguardit.

Wethinkofweatherasbenigndata.But,asweatherappsturnpeopleintomobileweatherstaIons,whataretheethicalandsafetyconsideraIons?Ifweknowwheresomeonelives(theirnight-Imeclosestcelltower),andalsowheretheyworkorgotoschoolduringtheday,wecannarrowitdowntoafewpeople.IfthatinformaIonendsupinthehandsofadverIsers,itisannoyingandintrusive.Ifitendsupinthehandsofaviolentexorarepressivegovernment,itcouldbedeadly.Inmyownwork,IhelpresearchersobtaindataforpolluIonstudiesincountrieswherequesIoningtheenvironmentalcostsofdevelopmenthaslandedpeopleinjail.WhatiftheirgovernmentdemandsIhandovertheirdatarequestrecords?

3

Dataavailabilityinformsourapproachestoresearchproblems.WhatkindofspaIo-temporalcoverageandresoluIondoyouneed?Precisionandaccuracy(P&A)ofmeasurements?Weallwantthe“best”data,withthemostP&A.But,workthroughyourerrorbudgettofindtheminimumacceptableP&A.ThiswillopenupmoredatapossibiliIesforyourresearch.Doesitrequiredatafusion;doyouneedtocombinedatafrommulIplesources?Thisisusuallythecaseorelseitwouldn’tberesearch.Newdataenablesnewinquiries.But,donotoverlookthevalueoflookingatolddatainnewways.MyfavoriteexampleisthepaperthatexaminedtheprevalenceofcirruscloudsintheSLCareabeforeandaLeritbecameaDeltaAirlineshubairport.TheycorrelatedjetfueltaxreceiptsandcirrusprevalenceandmadeaveryconvincingcasethatjetcontrailsarethecauseoftheincreasingprevalenceofcirruscloudsnearSLC.

4

ThepracIceofDatadiscoveryissIllinthejuvenilephase.Thereisn’tonekillerappordominantsearchenginethatcansearchalldatarepositoriesyet.Thisisamixedbagwithupsides.ConsidertheImespentindatadiscoveryalearningopportunity.Readresearchrelatedtoyourproject.Whatkindsofdatadotheyuseandwheredidtheyobtainit?Datacanbeciteddirectly,asanenItyinitsownright,oLenwithaDOI.Thisisthemodern,preferredmethod.Aside:[DOIstandsforDigitalObjectIdenIfiersareonetoonemappingsofdatasets(DS)toDOInumber(unlikeDSnames).SomeplacesgivenewDOIstonewversionsofaDS,whileothersgivenewDOIsformajornewversionsofDSs.However,thepracIceofciIngdatapapers,papersdescribingthepreparaIonofadataset,isalsoinwidespreaduseandhasbeenusedformuchlonger.Beforetheadventofdatapapers,peoplewouldcitethefirstsciencepaperthatused

5

Asanexample,takealookattheRDAfrontdoor,rda.ucar.eduYoucandoafreetextsearchfordataatthetop.Or,youcanperformafacetedsearch,successivelynarrowingselecIonsbaseduponmetadatacriteria.ThesearejustafewofthemanyopIonsenabledattheRDA.Checkoutournewdatasetsandotherannouncementsonourblog.Takeavideotourofourhomepagetolearnhowtousethemyriadfeaturesofoursite.

6

TheRDA’ssearchfeaturesarepoweredbystandardizedmetadata,mostcommonlyGlobalChangeMasterDirectorykeywords.NetCDFiscloselyassociatedwithCFstandardsandNCAR.Itisself-describingandself-contained.Thereareseveralotherstandardsinwideuse.SomeolderstandardssuchasFGDCmaybesupercededbynewerones.However,someoldstandardssuchasWMOandGRIBremainindailyuseworldwide.GRIBandBUFRareself-describing,butnotself-contained.Theyrequireexternaltabletounpackthebinarydatacorrectly.Ifyouapplythewrongtable,youreadthevaluesincorrectly.GRIBandBUFRarewidelyusedinoperaIonaltransmission,butmustbecarefulifusedlater.IfyouareunsureaboutthecorrectGRIBtabletoapply,ASK!Adatafilecanbemappedtomorethanonedatastandardforinteroperability.Infact,olderbutsIllusefuldatasetsshouldbemappedtonewmetadatastandardstomaintaintheirdiscoverabilitybybothhumansandmachines.

7

GCMDkeywordsmapoutexplicitlythecontentsofdatafiles.Thisshouldeliminateguesswork,whichcanleadtomisinterpreIngdata.Thislooksexcessivelydetailed,butthisishowmachine-to-machinecommunicaIonworks.Thisalsoenablesdistributedsearch,whereoneportalcancrawlotherdatasitesforyou.Dataproducerscanobtainhelpfromdatacurators/metadataspecialiststomapouttheirdata.

8

StandardizaIonofdataisnotanewidea.Once,whalingshiplogbookslookedliketheoneatleL.In1853,MauryandJansencreatedandtestedanuniversallogbookformatthatissimilartowhat’susedtoday.ThentheyorganizedaninternaIonalconferencetopromotetheuseoftheiruniversallogbookformat.SeafarerscouldsIllmaintainidiosyncraIclogbookslikethoseatleL,buttheywererequiredtoalsologintheirinformaIonintheuniversallogbookwithstandardizedunits.BaumanRareBookshNp://sappingaNenIon.blogspot.com/2012/11/reading-digital-sources-case-study-in.htmlTheBrusselsConferenceanditslegacyLp://Lp.wmo.int/Documents/PublicWeb/amp/mmop/documents/JCOMM-TR/J-TR-27-BRU150-Proceedings/DOCUMENTS_JCOMM_27/Session_4/4_2_Konnen.pdf

9

ThisdatastandardizaIonenablesustocreatemapslikethis.WeknowglobalshippingroutesandtrafficoverImefromtheselogbooks.Weagreedtocommonunitsofmeasurement(laItudeandlongitude,GreenwichIme,Celsiusscale,frequencyofrecordedmeasurements)hNp://sappingaNenIon.blogspot.com/2012/11/reading-digital-sources-case-study-in.htmlThedataofthepastisavailabletoustodaybecauseofpreservaIoneffortsofdatacompilersandcuratorsofyesterday.It’sunderstandabletousonlybecausetheyhandeddownnotjustthedata,buttheknowledgeaboutthedataintheformofmetadataandadherencetowell-describeddataconvenIons.Theseareknownasdataandmetadatastandards.ThecorollaryisthatouradherencetodatastandardstodaywillenableyourdataoftodaytoImetravelandaidfuturescienIsts.

10

ThisiscurrentlytheRDA’smostpopulardataset;itisusedtoiniIalizeNumericalWeatherPredicIon(NWP)modelssuchasWRF.Ifwearchiveadataset,wealsoarchivedocumentaIonabouthowitwaspreparedandsoLwaretoreadit.Ifthedatasetisarchivedinoneofthemajorstandards,suchasNetCDForGRIB,wedon’tneedtosaymuchmoreaboutit.But,ifitisinamoreidiosyncraIcformat,wewillprovidespecializedreadersandotherhelp.NotethateachdatasetisassignedtoadataspecialistwhosecontactinformaIonisatthetopright.Dataspecialistsserveasaconduitbetweendatausersanddataproducers.

11

Usersaccessdatathroughtools.One-off(ad-hoc)datarequiresone-offtools.DataandmetadatastandardizaIonallowsustocreatereusabletools.Thisecosystemofdatastandards,tools,searchanddeliverytoolsmakesupourshareddatainfrastructure.Wheninfrastructureworkssmoothly,themagicbecomesmundaneandinvisible.Usersarenotchargedthefullcostofdevelopingormaintaininginfrastructure.Butpleasekeepinmindthatinfrastructureisnotcheapanddelayingmaintenancecancausecatastrophicfailuresinthefuture.

12

Speakingofmagic.Whenyouopenupthosedatafiles,takesomeImetointerrogateitscontents.IfyouhaveIme,explorebeyondthevaluesthatyouoriginallywantedtousetoseeifsomethingelsemightopenupnewpossibiliIesforyourresearch.Ifyoudon’tgetwhatyouexpecttosee,eitherinthefilecontentsorvaluesthatdon’tmakephysicalsense,stop.Consultwiththedataprovider.Ifyougotthedatafromus,sendamessagetordahelp@ucar.eduandccittothedataspecialistforthatdataset.

13

Forthesakeofreproduciblescience,usesounddatacitaIonpracIces.DifferentjournalshavedifferentdatacitaIonstyles.Theysharecommonrequiredelements.Whocreatedthedataset?Howisitcalled?.GivetheDOIifithasone.Wheredidyougetit?DatacanbereplicatedinmulIpleplacesthatservedifferentsubsetsofthedata.CiIngthedatasourceisimportant.DataissomeImesreprocessedorcorrected.Putdowntheversionnumberanddataaccessdate.IfyoumakeadatarequestthroughtheRDAwebinterface,therequestisautomaIcallyloggedinourdatabase.Shouldanythinghappen,wecanregeneratethedatarequestforyouoranyoneelsewhowantstorepeatyourwork.Thiscansaveyoustoragespaceasyouneedonlysavelocallywhatyouarecurrentlyusing.OurdatabaserecordsalsoallowustomakecustomdatacitaIonsforyouwiththiswidgetfoundonallofourdatasethomepages.ThewidgetgeneratescitaIonsinAMS,AGUandseveralotherpopularstyles.YoucanalsocreateRISandBibTeX

14

Ifyouareinterestedinthinkingmetaaboutdata,Iputupalistofmyfavoritedatabooks.Wealsomaintainablogwherewemakeannouncements,giveIpsandposttutorials.Thank-you.

15

Recommended