Data Smack Down (Exploratory Data Analysis)
Institute for Tribal Environmental Professionals
Tribal Air Monitoring Support Center
Melinda Ronca-Battista
Brenda Sakizzie Jarrell
Southern Ute Indian Tribe


Tribal Environmental Data Management

Why ITEP does this:
- Assisting tribes all over the country
- Similar questions are asked
- All evaluations begin the same
- Need for free (MS Office) tools
- Applicable to any data
- Supplements and uses existing EPA tools, helping tribes use AMTIC and have confidence in the interpretation of their data

All materials at:

AAA Data Analysis and Interpretation folders

7 steps to data domination:
1. Clean up your data
2. Verify QC limits met
3. Aggregate data into sets
4. Find patterns
5. Ask your question
6. Evaluate the shapes of the distributions
7. Apply the test

Step 1: Clean Your Data
1. Initial cleaning (checking links, hidden values, finding repeated rows, tracking data so none is lost)
2. Normalization (separating information into separate fields, using data validation to limit entries to drop-down lists)
3. Documenting your clean data
4. General cleaning (macros, values-only, documentation, eliminating hidden characters)
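The deck does this cleaning in Excel. As a rough sketch of two of the ideas above (eliminating hidden characters and finding repeated rows while tracking every row so none is lost), the same checks might look like this in Python. The function names and the sample rows are made up for illustration and are not part of the ITEP materials.

```python
import unicodedata

def strip_hidden(text):
    """Replace non-breaking spaces, drop control/format characters, trim ends."""
    text = text.replace("\u00a0", " ")
    visible = "".join(ch for ch in text if unicodedata.category(ch)[0] != "C")
    return visible.strip()

def clean_rows(rows):
    """Return (unique_rows, duplicate_rows), preserving input order.

    Every input row ends up in exactly one of the two lists,
    so the row count can be checked afterwards.
    """
    seen = set()
    unique, duplicates = [], []
    for row in rows:
        cleaned = tuple(strip_hidden(cell) for cell in row)
        if cleaned in seen:
            duplicates.append(cleaned)
        else:
            seen.add(cleaned)
            unique.append(cleaned)
    return unique, duplicates

# Made-up rows: the second repeats the first but carries hidden characters.
raw = [
    ("2006-07-01 13:00", "Ute 1", "0.061"),
    ("2006-07-01 13:00", "Ute 1\u200b", "0.061 "),
    ("2006-07-01 14:00", "Ute 1", "0.063"),
]
unique, duplicates = clean_rows(raw)

# Tracking check: nothing was lost.
assert len(unique) + len(duplicates) == len(raw)
```

Keeping the duplicates in their own list, rather than discarding them, is one way to document what was removed.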

http://www4.nau.edu/itep/resources/ Data Analysis, Step 1-Data Clean Up folder

Data Smack Down Step 2: QC
Don't exert any effort on bad data; start with a quick review of QC
- Review logbooks, audits
- Generate AQS report of QC data, and
- Use EPA's DASC tool; enter data into the spreadsheet and review PLOTS

http://www4.nau.edu/itep/resources/ Data Analysis, Step 2-QC folder

AQS report:

Plots generated using EPA's DASC tool:

Part of QC is completeness:
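Completeness is the share of expected observations that are actually present. A minimal sketch, assuming missing hours are recorded as None; the values are made up, and no particular regulatory completeness threshold is implied:

```python
def completeness(values, expected_count):
    """Percent of expected observations that are present (None = missing)."""
    valid = sum(1 for v in values if v is not None)
    return 100.0 * valid / expected_count

# Made-up hourly O3 values (ppm) with two missing hours.
hourly = [0.041, 0.043, None, 0.047, 0.050, None, 0.052, 0.049]
pct = completeness(hourly, expected_count=len(hourly))
print(f"{pct:.1f}% complete")
```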

After Part 2-QC, Step 3: Aggregate Data
- Hourly values slow down the computer
- Use the MS Access Quick Start guide to aggregate data into chunks of daily or weekly averages
- MS Access is easier than Excel for handling missing data, excluding codes

http://www4.nau.edu/itep/resources/ Data Analysis, Step 3-Aggregate Data folder, Tribal Data Access Quick Start subfolder
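The deck performs this aggregation in MS Access. As an illustration only, here is a sketch of the same idea in Python: hourly values are grouped by day, and hours that are missing or carry an exclusion code are skipped before averaging. The record layout and the "CAL" code are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def daily_averages(records):
    """records: iterable of (timestamp, value, flag).

    Hours with a missing value (None) or a non-empty flag are excluded.
    Returns {date: mean of the remaining hourly values}.
    """
    by_day = defaultdict(list)
    for stamp, value, flag in records:
        if value is None or flag:
            continue
        by_day[stamp.date()].append(value)
    return {day: sum(vals) / len(vals) for day, vals in sorted(by_day.items())}

# Made-up hourly O3 values (ppm); "CAL" is a hypothetical exclusion code.
hourly = [
    (datetime(2006, 7, 1, 12), 0.040, ""),
    (datetime(2006, 7, 1, 13), None, ""),      # missing hour
    (datetime(2006, 7, 1, 14), 0.060, ""),
    (datetime(2006, 7, 2, 12), 0.030, ""),
    (datetime(2006, 7, 2, 13), 0.090, "CAL"),  # flagged hour, excluded
]
daily = daily_averages(hourly)
for day, avg in daily.items():
    print(day, round(avg, 3))
```

Swapping the mean for `max(vals)` would give daily maximum values instead of daily averages.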

Aggregation into:
- 5-day averages of daily max O3 values
- Reduces number of data points from >10,000s of hourly values to ~1000s of daily values to ~100s of 5-day averages
- Enables the next step of graphing against relevant parameters (temp, solar radiation, NO2, etc.)

Step 4-Find Patterns:
- Apply common sense to the data
- How does it vary with met parameters?
- How does it vary with other pollutants?

Step 4A-Use Excel Tools

Dynamic Named Ranges
- As your data increase, plots and summary statistics are automatically recalculated
- See pg. 7 of doc "using ranges in excel and graphing.doc"

Autofilter
- Filter out or in data
- 1-click recalculation of plots of different subsets

Step 4B-plot vs time on x-y plot:

- Use x-y plot (ALWAYS SCATTERPLOT, NEVER DATE, because that produces category plots)
- Use secondary axis for 2nd parameter so it can have its own units
- Start with all data, then use Excel Autofilter to find subsets of data where both parameters have values and that show a pattern, clicking on different values that are immediately graphed
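The Autofilter step above amounts to keeping only the rows that match a chosen value and in which both plotted parameters have data. A small sketch of that filtering, with hypothetical column names and made-up values:

```python
def paired_subset(rows, x_key, y_key, **filters):
    """Keep rows matching all filters where both x and y have values."""
    pairs = []
    for row in rows:
        if any(row.get(k) != v for k, v in filters.items()):
            continue
        x, y = row.get(x_key), row.get(y_key)
        if x is not None and y is not None:
            pairs.append((x, y))
    return pairs

# Made-up rows: temperature (deg C) and O3 (ppm).
rows = [
    {"year": 2005, "temp": 25.0, "o3": 0.050},
    {"year": 2006, "temp": 28.0, "o3": 0.058},
    {"year": 2006, "temp": None, "o3": 0.061},  # no temperature, dropped
    {"year": 2006, "temp": 31.0, "o3": 0.064},
]
only_2006 = paired_subset(rows, "temp", "o3", year=2006)
print(only_2006)
```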

Time plot of temp and O3, 2006 only:
Plot 2nd parameter on its own axis

With click of Autofilter, see filtered/more data & look for patterns:

ALWAYS plot on x-y scatterplots, never dates

Step 4C-quantify the pattern
- Use linear regression between parameters
- Perfect 1-to-1 relationship, with one rise on the y-axis to every one run on the x-axis, shows slope ~ 1 and RSQ (r2) ~ 1
- Can calculate in plot or using functions
- Calculate correlations: =slope(Ys,Xs) and =RSQ(Ys, Xs)
  OR
  Scatterplot, show trendline, show equation and R2 on chart
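For readers who want to see what the two Excel functions compute, here is a sketch of the standard least-squares slope and squared Pearson correlation. The argument order (Ys first, then Xs) mirrors the Excel calls; the sample numbers are made up to show the perfect 1-to-1 case.

```python
def slope(ys, xs):
    """Least-squares slope of y on x, like Excel's =SLOPE(Ys, Xs)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def rsq(ys, xs):
    """Squared Pearson correlation, like Excel's =RSQ(Ys, Xs)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

# Perfect 1-to-1 relationship: one rise for every one run.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 4.0, 5.0]
print(slope(ys, xs), rsq(ys, xs))
```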

For the case of our correlation between O3 and temp, how well do they compare?

pretty well for all data:
Insert regression line on x-y scatterplots, or use slope= and RSQ= functions

But with less data:

=slope(Ys,Xs) and =RSQ(Ys, Xs)

Step 5: Ask your question
Ex: When analyzing this data, we saw a shift in how the 2 sites' O3 tracked
During one time period, one site had markedly higher O3 levels than the other site, but the rest of the time the 2 sites agreed well
Is this significant?

[Plot legend: Ute 1, Ute 3]

In this case, the (Ute 3 - Ute 1)/Ute 3 ratio:
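The ratio is the difference between the two sites relative to Ute 3, so values near zero mean the sites agree. A sketch of the calculation follows. It is applied here to the annual 4th max values listed in this document's Ute 1 and Ute 3 data table, purely as an illustration; the slide's own plot may well have used finer-grained data.

```python
def relative_difference(ute3, ute1):
    """(Ute 3 - Ute 1) / Ute 3 for each paired observation."""
    return [(b - a) / b for a, b in zip(ute1, ute3)]

# Annual 4th max values (ppm), 1999-2007, from the Ute 1 / Ute 3 table.
years = list(range(1999, 2008))
ute1 = [0.071, 0.063, 0.052, 0.060, 0.062, 0.063, 0.060, 0.048, 0.058]
ute3 = [0.065, 0.061, 0.051, 0.055, 0.060, 0.060, 0.066, 0.063, 0.071]

ratios = relative_difference(ute3, ute1)
for year, ratio in zip(years, ratios):
    print(year, round(ratio, 3))
```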

Step 5: Get specific with your question:

Step 6-Evaluate Distributions
Is this helpful? Sort of...

            VALID    INVALID
Mean        -0.01    0.2
Variance     0.03    0.02

Pictures more useful:
See BoxCharter.zip for Excel Add-In to generate box and whisker plots
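A box and whisker plot is drawn from five numbers: minimum, first quartile, median, third quartile, and maximum. The sketch below computes them with Python's standard library, using the "inclusive" quantile method, which should correspond to Excel's QUARTILE.INC. It is not the BoxCharter add-in's own code, and the values are made up.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a box and whisker plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Made-up O3 values (ppm).
sample = [0.050, 0.055, 0.060, 0.065, 0.070]
summary = five_number_summary(sample)
print(summary)
```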

[Box plot categories: INVALID, VALID]

Is there a difference? How much, and how sure are we?
- Decide on the assumption that the data must be used to prove wrong: the null hypothesis
- If you think the data from 2 sites are different, then assume that the difference between them is zero; the data must then prove that wrong, at some level of confidence
- The null hypothesis is that there is zero difference between the means of the datasets

Step 6: Evaluate distributions:
- Are they normal? Approximately?
- If so, the tests are easier

Hmmmmm.

[Plot panels: VALID, INVALID]

Use Excel Q-Q plot:
Straight line is perfectly normal
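A Q-Q plot pairs each sorted data value with the value a normal distribution would put at the same rank. A sketch of how those pairs could be computed is below. The plotting position (i - 0.5)/n is one common convention and is an assumption here, not necessarily what the deck's Excel sheet uses; the sample values are made up.

```python
from statistics import NormalDist, mean, stdev

def qq_points(values):
    """Return (theoretical normal quantile, standardized data value) pairs.

    Plotted on an x-y scatterplot, roughly normal data fall close to
    a straight line.
    """
    data = sorted(values)
    n = len(data)
    m, s = mean(data), stdev(data)
    standard_normal = NormalDist()
    points = []
    for i, value in enumerate(data, start=1):
        theoretical = standard_normal.inv_cdf((i - 0.5) / n)
        points.append((theoretical, (value - m) / s))
    return points

# Made-up ratio values.
sample = [-0.05, 0.02, -0.01, 0.04, 0.00]
points = qq_points(sample)
for theoretical, observed in points:
    print(round(theoretical, 3), round(observed, 3))
```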

Step 7: Use tests in Excel's Data Analysis ToolPak:
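The slide text does not say which test was run. As one plausible example, the ToolPak offers "t-Test: Two-Sample Assuming Unequal Variances" (Welch's t-test), which tests the null hypothesis of zero difference between two means. The sketch below computes its t statistic and degrees of freedom; the sample values are made up. Turning t into a p-value needs a t-distribution table or a statistics package, which is left out here.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic and approximate degrees of freedom."""
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)   # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Made-up ratio values for a "valid" and an "invalid" period.
valid = [-0.03, 0.01, 0.00, -0.02, 0.02, -0.01]
invalid = [0.18, 0.22, 0.25, 0.17, 0.21, 0.19]
t, df = welch_t(valid, invalid)
print(round(t, 2), round(df, 1))
```

A large absolute t, compared against the critical value for the computed degrees of freedom, is evidence against the null hypothesis.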

Conclusion:
- The invalid dataset was removed from AQS
- New audits were conducted, a 2nd analyzer was collocated with Ute 1, and now data from that site are deemed good
- Southern Ute Indian Tribe is very careful with all data, and this story shows how good QC and careful data analysis yield confidence in decisions

Knowledge is power
please [email protected]

AAA Data Analysis and Interpretation folders

[Chart: 8-hour Running Average Ozone 4th Max Values; series: Ute 1, Ute 3; y-axis: Concentration (PPM)]

Substation

4th Max Values for Ozone 8hr Running Averages
Requested Summary Year(s): '1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007', Monitors: '35-045-1005'
Substation, SLAMS, 16 mi. NW of Farmington, NM

4th Max (PPM)
Summary Year                             1999   2000   2001   2002   2003   2004   2005   2006   2007
Tribal Monitor Id 35-045-1005-44201-1    0.065  0.080  0.074  0.075  0.075  0.069  0.072  0.071  0.073

3-year Averages    2001   2002   2003   2004   2005   2006   2007
Substation         0.073  0.076  0.075  0.073  0.072  0.071  0.072
NAAQS              0.08   0.08   0.08   0.08   0.08   0.08   0.08

Chart trendline labels:
Ute 1, Ute 3: y = -0.0007x + 0.062, R2 = 0.3203; y = -0.0016x + 0.0537, R2 = 0.6897
Bloomfield, Mesa Verde: y = -0.0016x + 0.0789, R2 = 0.9626; y = -0.0008x + 0.067, R2 = 0.625
Substation, Shamrock: y = 0.0008x + 0.0761, R2 = 0.6497; y = -0.0017x + 0.0612, R2 = 0.7991

[Chart: 8-hour Ozone 4th Max Values; series: Substation, NM; Bloomfield, NM; Mesa Verde NP; Shamrock Station; Ute 1 Ignacio, CO; Ute 3 Bondad, CO; y-axis: Concentration (PPM)]

[Chart: 8-hour Ozone--3 Year Averages of Annual 4th Daily Maximum Concentration Values in the Four Corners Region; series: Substation, NM; Bloomfield, NM; Mesa Verde National Park; Shamrock Station, CO; Ute 1 Ignacio, CO; Ute 3 Bondad, CO; y-axis: Concentration (PPM)]

[Chart: 8-hr Ozone--3-yr Avgs of Annual 4th Daily Max Concentration Values in Four Corners Region; series: Substation, NM; Bloomfield, NM; Ute 1; y-axis: Concentration (PPM)]

[Chart: 8-hr Ozone--3-yr Avgs of Annual 4th Daily Max Concentration Values in the Four Corners Region; series: Mesa Verde National Park; Shamrock Station, CO; Ute 3 Bondad, CO; y-axis: Concentration (PPM)]

[Chart: 8-hour Ozone--3 Year Averages of the Annual 4th Daily Maximum Concentration Values in Southwestern Colorado (2001-2007); series: Mesa Verde National Park; Shamrock Station, CO; Ute 1 Ignacio, CO; Ute 3 Bondad, CO; y-axis: Concentration (PPM)]

[Chart: 8-hour Ozone 4th Maximum Values in Southwestern Colorado (2001-2007); series: Mesa Verde NP; Shamrock Station, CO; Ute 1 Ignacio, CO; Ute 3 Bondad, CO; y-axis: Concentration (PPM)]

4th Max Values for Ozone 8hr Running Averages
Requested Summary Year(s): '1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007', Monitors: '35-045-0009'
Bloomfield, SLAMS, 162 Highway 550; Bloomfield, NM

4th Max (PPM)
Summary Year                             1999   2000   2001   2002   2003   2004   2005   2006   2007
Tribal Monitor Id 35-045-0009-44201-1           0.079  0.074  0.076  0.073  0.068  0.075  0.063  0.069

3-year Averages    2001   2002   2003   2004   2005   2006   2007
Bloomfield         0.077  0.076  0.074  0.072  0.072  0.069  0.069
NAAQS              0.08   0.08   0.08   0.08   0.08   0.08   0.08

Num Obs: 7210
Num Obs: 7242

4th Max Values for Ozone 8-hr Running Averages
Requested Summary Year(s): '1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007', Monitors: '08-083-0101'

4th Max (PPM)
Summary Year                             1999   2000   2001   2002   2003   2004   2005   2006   2007
Tribal Monitor Id 08-083-0101-44201-1    0.069  0.073  0.065  0.070  0.067  0.069  0.076  0.074  0.070

3-year Averages    2001   2002   2003   2004   2005   2006   2007
Mesa Verde NP      0.069  0.069  0.067  0.069  0.071  0.073  0.073
NAAQS              0.08   0.08   0.08   0.08   0.08   0.08   0.08

Num obs: 8611

Ozone Daily Maximum 8-hour Averages
Start for O3 at Shamrock Station began 4/18/2004

4th Max (PPM), July
Summary Year        1999   2000   2001   2002   2003   2004   2005   2006   2007
Shamrock Station                                       0.067  0.075  0.074  0.068

3-year Averages     2001   2002   2003   2004   2005   2006   2007
Shamrock Station                         0.067  0.071  0.072  0.072
NAAQS               0.08   0.08   0.08   0.08   0.08   0.08   0.08

as of July 2007 data
only acct up to July 2007

4th Max Values for Ozone 8hr Running Averages

4th Max (PPM), Dec
Summary Year                             1999   2000   2001   2002   2003   2004   2005   2006   2007
Tribal Monitor Id TT-750-7001-44201-1    0.071  0.063  0.052  0.060  0.062  0.063  0.060  0.048  0.058
Tribal Monitor Id TT-750-7003-44201-1    0.065  0.061  0.051  0.055  0.060  0.060  0.066  0.063  0.071

3-year averages    2001   2002   2003   2004   2005   2006   2007
Ute 1              0.062  0.058  0.058  0.062  0.062  0.057  0.055
Ute 3              0.059  0.056  0.055  0.058  0.062  0.063  0.067
NAAQS              0.08   0.08   0.08   0.08   0.08   0.08   0.08

Ute 1 Trendline (linear regression): y = -0.0007x + 0.062, R2 = 0.3203
Ute 3 Trendline (linear regression): y = 0.0016x + 0.0537, R2 = 0.6897
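The 3-year averages in this table can be reproduced from the annual 4th max values: each entry is the mean of that year and the two preceding years, rounded to three decimals. A short sketch using the Ute 1 and Ute 3 values above, with the 0.08 ppm NAAQS figure taken from the same table:

```python
YEARS = list(range(1999, 2008))
UTE1 = [0.071, 0.063, 0.052, 0.060, 0.062, 0.063, 0.060, 0.048, 0.058]
UTE3 = [0.065, 0.061, 0.051, 0.055, 0.060, 0.060, 0.066, 0.063, 0.071]
NAAQS = 0.08  # ppm, as listed in the table

def three_year_averages(years, values):
    """Mean of each year and the two years before it, rounded to 3 decimals."""
    return {
        years[i]: round(sum(values[i - 2:i + 1]) / 3, 3)
        for i in range(2, len(values))
    }

ute1_avg = three_year_averages(YEARS, UTE1)
ute3_avg = three_year_averages(YEARS, UTE3)

for year in ute1_avg:
    below = ute1_avg[year] < NAAQS and ute3_avg[year] < NAAQS
    print(year, ute1_avg[year], ute3_avg[year], "below NAAQS" if below else "check")
```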

[Chart: 8-hour Running Average Ozone 4th Max Values; series: Ute 1, Ute 3; y-axis: Concentration (PPM)]

[Chart: 8-hour Ozone--3 year Averages of 4th Daily Maximum Values at Ute 1 and Ute 3 Sites; series: Ute 1, Ute 3; y-axis: Concentration (PPM)]
