Airline Data Analysis


    Analyzing Airline Data

    In this project, I analyze over 6 GB of airline data and answer some questions that I think are important when looking at data like this.

    data=load(url("http://eeyore.ucdavis.edu/stat141/Data/winterDelays.rda"))

    Number 1

    How many flights are there in the data set?

    > nrow(winterDelays)
    [1] 1961489

    So there are 1,961,489 flights in the data set because that is the number of rows in the set.

    Number 2

    Which airline has the most flights?

    > sort(table(winterDelays$UNIQUE_CARRIER))

        VX     HA     F9     YV     9E     AS     FL     B6
     17216  23468  23699  41305  44967  46737  61988  75816
        US     MQ     UA     AA     OO     DL     EV     WN
    130453 145487 161857 172342 201171 227539 232481 354963

    So WN (Southwest Airlines) has the most flights (354,963) in this data set.

    Number 3

    Compute the number of flights for each originating airport and airline carrier. Show only the rows and columns for the 20 airports with the largest number of flights, and the 10 airline carriers with the most flights.

    > tab = table(winterDelays$ORIGIN, winterDelays$UNIQUE_CARRIER)
    > m = margin.table(tab,1)
    > ord = order(m, decreasing = TRUE)[1:20]
    > m2 = margin.table(tab,2)
    > ord2 = order(m2, decreasing = TRUE)[1:10]
    > tab[ord,ord2]

           WN    EV    DL    OO    AA    UA    MQ    US    B6    FL
    ATL  3369 29393 63957   558  1656   260  1992  1734     0 17897
    ORD     0 17570  1852  8740 15653 18113 25434  2319   479     0
    DFW     0  1633  1638  2005 50114  1326 29051  2163   337     0
    DEN 17944  4975  2094 15603  1636 14383   756  1498   324   285
    LAX 11797     0  6104 19803  9582  9268  2756  1921  1032   293
    IAH     0 25998   772  6463  1363 21311   793  1809     0     0
    PHX 18698    15  2249  7006  1906  2065    56 18814   229    64
    SFO  5063     0  2490 16838  3457 14453     0  1658  1276   295
    CLT     0  1732  1600   193   664    99  1713 28284   479   644
    LAS 23477     0  3731  2446  3115  3931     0  2081  1152   421
    DTW  1874  7449 14912  1104   743   231  1561  1098     0   676
    EWR  2045 15783  1312     0  1136 14783   848  1413  2232     0
    MSP  2334  2350 16429  8173  1037   882  1094  1334     0   482
    MCO  9471    50  5465     0  3168  3869     0  2942  6148  5506
    SLC  3572   123  9700 18149   579   343   492   664   360     0
    JFK     0   426  6375     0  4674  1429  2279   901 13363     0
    BOS  2175   860  3061     0  3376  3697     0  5693 11041  1172
    BWI 18762   920  2167     0   934  1081   574  1510   491  4518
    LGA  1834   876  7885     1  4918  2446  5571  3985  2039  1284
    SEA  3442     0  2636  1863  1523  2980     0  1022   451     0

    This table shows the number of flights for the 20 originating airports with the most flights (rows) and the 10 airline carriers with the most flights (columns).

    Number 4

    Is the mean delay in November different from the mean delay in December?

    > mean(winterDelays[winterDelays$MONTH == 11, 'ARR_DELAY'], na.rm=TRUE)
    [1] -0.1246967
    > mean(winterDelays[winterDelays$MONTH == 12, 'ARR_DELAY'], na.rm=TRUE)
    [1] 6.892993

    Yes, the mean delays for November and December are different: the mean arrival delay for November is -0.125 minutes (flights arrived slightly early on average), while the mean arrival delay for December is 6.893 minutes.

    Number 5

    Which is a better measure for characterizing the center of the distribution for overall delay - mean, median or mode? Why?

    hist(winterDelays$ARR_DELAY, main="Histogram of Delays", xlab= "Delay time")

    I believe the median is the best measure because the distribution is not normal; it is skewed right. With data this heavily skewed, the few very long delays pull the mean upward, so the mean would not represent a typical flight. For heavily skewed data the median or mode usually characterizes the center better, and in this case I am going with the median.
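    The effect of skew on the three measures can be seen on a small example. The sketch below uses a made-up vector of ten delays (not values from winterDelays) with a single very long delay; the variable names are illustrative only.

```r
# Hypothetical right-skewed sample of arrival delays in minutes (not real data)
delays <- c(-12, -8, -5, -5, -3, 0, 2, 4, 15, 240)

mean_delay   <- mean(delays)    # pulled upward by the single 240-minute delay
median_delay <- median(delays)  # middle of the sorted values, unaffected by it

# Most frequent value; base R has no built-in function for the statistical mode
mode_delay <- as.numeric(names(which.max(table(delays))))

cat("mean:", mean_delay, " median:", median_delay, " mode:", mode_delay, "\n")
```

    Here one extreme value moves the mean to 22.8 minutes even though nine of the ten flights had delays of 15 minutes or less, while the median stays at -1.5.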

    Number 6

    What are the mean and standard deviation of the arrival delays for all United Airlines (UA) flights on the weekend out of SFO?

    > delays = subset(winterDelays, UNIQUE_CARRIER == 'UA' & ORIGIN == 'SFO' & DAY_OF_WEEK %in% c(6,7), ARR_DELAY)
    > mean(delays$ARR_DELAY, na.rm=TRUE)
    [1] 0.9390957

    So the mean arrival delay for United Airlines weekend flights out of SFO is about 1 minute.

    > sd(delays$ARR_DELAY, na.rm=TRUE)
    [1] 36.76253

    So the standard deviation of the arrival delay for these flights is about 37 minutes.

    Number 7

    Plot the distributions of overall delay for each month. What is the best way to display this?

    Delay1 = subset(winterDelays, winterDelays$MONTH=="1")
    Delay2 = subset(winterDelays, winterDelays$MONTH=="2")
    Delay3 = subset(winterDelays, winterDelays$MONTH=="11")
    Delay4 = subset(winterDelays, winterDelays$MONTH=="12")
    # xaxt="n" suppresses the default x axis so the month labels can be drawn with axis()
    boxplot(list(Delay1$ARR_DELAY, Delay2$ARR_DELAY, Delay3$ARR_DELAY, Delay4$ARR_DELAY), xaxt="n", xlab="Month", ylab="Arrival Delay (minutes)", outline=TRUE)
    axis(1, at=1:4, labels=c("Jan", "Feb", "Nov", "Dec"))

    This plot shows the distribution of delays for each month, including all the outliers. The graph is hard to read because of the outliers, but it shows the true data and distributions with no manipulation.

    boxplot(list(Delay1$ARR_DELAY, Delay2$ARR_DELAY, Delay3$ARR_DELAY, Delay4$ARR_DELAY), xaxt="n", xlab="Month", ylab="Arrival Delay (minutes)", outline=FALSE)
    axis(1, at=1:4, labels=c("Jan", "Feb", "Nov", "Dec"))

    This plot shows the distribution of delays for each month without the outliers (which is misleading in a sense), but it is much clearer and easier to read.

    Number 8

    Display the number of flights for each airport on a single plot so we can quickly compare them.

    flight = table(winterDelays$ORIGIN)
    # dotchart() expects a numeric vector or matrix; passing the table itself triggers a coercion warning
    dotchart(as.numeric(flight), labels=names(flight), cex=.3, main="Number of Flights per Airport", xlab="Number of Flights", ylab="Airport")

    Above is the number of flights for each airport on a single plot. The y axis is impossible to read because I have plotted all airports; it could be made readable by splitting the airports across several plots.

    Number 9

    Are there many more flights on weekdays relative to Saturday and Sunday?

    > compare = table(winterDelays$DAY_OF_WEEK)
    > compare

         1      2      3      4      5      6      7
    286929 275181 288557 308239 292745 235456 274382

    # 1 = Monday, 2 = Tuesday, 3 = Wednesday, 4 = Thursday, 5 = Friday, 6 = Saturday, 7 = Sunday

    Weekdays generally have more flights than weekend days, especially compared with Saturday, which has the lowest number of flights (235,456). Sunday (274,382) is nearly level with Tuesday (275,181), so most of the gap comes from Saturday.
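    To put a number on the comparison, the counts in the table can be averaged per weekday and per weekend day. This is a small sketch using only the seven counts printed above; the names flights, weekday_avg and weekend_avg are illustrative.

```r
# Flight counts by DAY_OF_WEEK, copied from the table above (1 = Monday ... 7 = Sunday)
flights <- c("1" = 286929, "2" = 275181, "3" = 288557, "4" = 308239,
             "5" = 292745, "6" = 235456, "7" = 274382)

weekday_avg <- mean(flights[as.character(1:5)])  # average flights per weekday
weekend_avg <- mean(flights[c("6", "7")])        # average flights per weekend day

round(c(weekday = weekday_avg, weekend = weekend_avg))
weekend_avg / weekday_avg  # weekend-day traffic as a fraction of weekday traffic
```

    On these counts an average weekend day has roughly 12% fewer flights than an average weekday.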

    Number 10

    What day of the week has the most number of delayed flights?

    > table(winterDelays$DAY_OF_WEEK, winterDelays$ARR_DELAY > 0)

         FALSE   TRUE
      1 176174 106163
      2 175396  94570
      3 177613 104861
      4 186625 115142
      5 178447 108652
      6 155173  76932
      7 172365  97596

    Here, the TRUE column shows the number of delayed flights (arrival delay greater than 0 minutes). The rows are the days of the week (1 being Monday, 2 being Tuesday, and so on). We can see that Thursday has the most delayed flights (115,142).
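    Because the days carry different numbers of flights, it can also help to look at the share of flights delayed on each day rather than the raw count. The sketch below re-enters the counts from the table above as a matrix (the object names are illustrative) and uses prop.table() to get row proportions.

```r
# Counts from the table above: rows are DAY_OF_WEEK, columns are ARR_DELAY > 0
counts <- matrix(
  c(176174, 106163,
    175396,  94570,
    177613, 104861,
    186625, 115142,
    178447, 108652,
    155173,  76932,
    172365,  97596),
  ncol = 2, byrow = TRUE,
  dimnames = list(day = as.character(1:7), delayed = c("FALSE", "TRUE"))
)

# Proportion of each day's flights (with a recorded delay) that arrived late
rate <- prop.table(counts, margin = 1)[, "TRUE"]
round(rate, 3)

names(which.max(counts[, "TRUE"]))  # day with the most delayed flights
names(which.max(rate))              # day with the highest share of delayed flights
```

    On these counts both measures point to day 4 (Thursday), so the conclusion does not depend on Thursday simply having the most flights.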

    Number 11

    What day of the week has the largest median overall delay? 90th quantile for overall delay?

    > daymedian = with(winterDelays, tapply(ARR_DELAY, list(DAY_OF_WEEK), median, na.rm=TRUE))
    > daymedian
     1  2  3  4  5  6  7
    -4 -5 -5 -4 -4 -6 -5

    There is a three-way tie between Monday, Thursday, and Friday for the largest median overall delay, at -4 minutes. It is important to remember that this is the median overall delay and not the mean overall delay.

    > with(winterDelays, tapply(ARR_DELAY, list(DAY_OF_WEEK), quantile, prob=.9, na.rm=TRUE))
     1  2  3  4  5  6  7
    32 27 34 32 32 26 31

    Wednesday has the highest 90th quantile for overall delay with 34 minutes.

    Number 12

    Consider the 10 airports with the most number of flights. For this subset of the data, which routes (origin-destination pair) have the worst median delay.

    # The Professor said to acknowledge when we had outside help. On this problem I was directed by Nick Ulle.
    tt = table(winterDelays$ORIGIN)
    mtt = margin.table(tt,1)
    order.mtt = order(mtt,decreasing=TRUE)[1:10]
    # This will show the top ten airports and their corresponding number of flights.
    tt[order.mtt]
    tbest
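    The remaining steps for this question are not shown here. As a rough illustration only, one way to rank routes by median delay is sketched below on a made-up data frame that borrows the ORIGIN, DEST and ARR_DELAY column names; the data, the top_n value and the object names are assumptions, and restricting by origin airport alone is one possible reading of the question.

```r
# Toy stand-in for winterDelays with the same column names (values are made up)
flights <- data.frame(
  ORIGIN    = c("SFO", "SFO", "SFO", "LAX", "LAX", "LAX", "SEA", "SEA"),
  DEST      = c("LAX", "LAX", "SEA", "SFO", "SFO", "SEA", "SFO", "PDX"),
  ARR_DELAY = c(10, 20, -5, 3, NA, 40, -2, 7),
  stringsAsFactors = FALSE
)

top_n <- 2  # would be 10 for the real question

# Busiest origin airports by number of flights
busiest <- names(sort(table(flights$ORIGIN), decreasing = TRUE))[seq_len(top_n)]
busy <- flights[flights$ORIGIN %in% busiest, ]

# Median arrival delay per origin-destination pair
# (the formula interface of aggregate() drops rows with NA by default)
route_median <- aggregate(ARR_DELAY ~ ORIGIN + DEST, data = busy, FUN = median)

# Worst routes first
route_median <- route_median[order(route_median$ARR_DELAY, decreasing = TRUE), ]
head(route_median)
```

    With real data it would probably be worth also requiring a minimum number of flights per route, since a route flown only a handful of times can have an extreme median.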

    Number 13

    Graphically display any relationship between the distance of the flight (between the two origin and destination) airports and the overall delay. Interpret the display.

    smoothScatter(winterDelays$ARR_DELAY, winterDelays$DISTANCE, xlab="Delay", ylab="Distance(miles)")

    ## Warning: Binning grid too coarse for current (small) bandwidth: consider ## increasing 'gridsize'

    It is hard to tell from this plot, because there are many more short flights than long ones, but it seems safe to say that delay times do not differ much between shorter and longer flights. That said, there are a few more extreme outliers among the shorter flights than among the longer ones, but this is likely because the vast majority of flights are short. I used smoothScatter on this problem, as the professor had recommended, because it shows the graph in a much neater fashion and is easier to read.

    Number 14

    What are the worst hours to fly in terms of experiencing delays?

    > tapply(winterDelays$ARR_DELAY, winterDelays$ARR_TIME_BLK, mean, na.rm=TRUE)
    0001-0559 0600-0659 0700-0759 0800-0859 0900-0959 1000-1059 1100-1159 1200-1259 1300-1359 1400-1459
       2.5888   -1.4700   -1.3306   -0.9237   -0.4361   -0.2953    0.2934    1.3193    1.7777    3.3557
    1500-1559 1600-1659 1700-1759 1800-1859 1900-1959 2000-2059 2100-2159 2200-2259 2300-2359
       3.8163    4.7015    5.9254    6.1220    6.5939    6.3258    6.5707    6.5587    5.1105

    The times above are stated in 24-hour (military) time, as this is what the airlines use as a way of keeping time.

    So the worst hours to fly, in terms of experiencing delays, are the arrival blocks between 6:00 p.m. and 11:00 p.m. (1800-1859 through 2200-2259), where the mean delay is above 6 minutes. The single worst hour is between 7:00 and 8:00 p.m., with a mean delay of about 6.6 minutes.

    Number 15

    Are the delays worse on December 25th than other days? Thanksgiving? Provide evidence to support your conclusions.

    > tapply(winterDelays$ARR_DELAY, list(winterDelays$MONTH, winterDelays$DAY_OF_MONTH), mean, na.rm=TRUE)

                1          2          3          4          5          6          7          8          9         10
     1  6.8945763  5.9187169  4.6276680  -1.253096  0.7645577   1.391044 -3.5615892 -0.6202718  -1.571971  -1.633439
     2  3.7956690  3.0728452 -0.7300243  10.649105 -2.2401363  -2.808513  4.9996482  2.7355633  -2.238645  10.826545
    11  1.4928729 -0.6600911 -4.6630724  -2.244804 -2.6255178  -2.084708  0.1688204  0.8479014   1.293546  -2.655190
    12 -0.7579075  2.4712065 -2.2010932  -3.697318 -3.3443914  -4.176649 -0.5729222 -2.8824730   5.661869  16.335838

               11         12         13         14         15         16         17         18         19         20
     1   2.401867  -1.844984   7.201963  2.5351881  -1.541786  6.9567796  0.1099049  -0.705212  -6.126114  1.2997991
     2  11.548020  -2.928880  -1.979298 -0.3131504  -1.050239  1.4220683  0.1417358   0.942776   6.703037  2.4348809
    11   3.966439  10.177729  -1.115996 -2.1576971   2.156633  0.9675679 -2.9009901  -3.221705  -2.974814  0.1846513
    12  -1.402158  -3.225194  -2.745070 -0.9295866  -0.358311  8.5672392  9.6133976   3.798906  10.656995 18.5276957

               21         22         23         24         25         26         27         28         29         30
     1   2.112755 -0.3910833   1.182579   6.407166  13.318865  0.1107449   5.841744  5.5110657   5.766958 19.3347537
     2  10.141076 12.1875036   5.183197   2.091021   3.569462  9.8617102   8.933590  0.7131929         NA         NA
    11   7.154388 -8.8592179  -6.118507  -2.850553   1.918476  3.0102443   2.688409 -1.8931600  -1.034617  0.4279986
    12  25.894130 11.0176950   8.934984   2.568117  14.125010 33.2713518  22.885562 13.9638004  16.154251  8.5911889

    Above shows a table of month (rows corresponding to 1= Jan, 2=Feb, 11=Nov, 12=Dec) vs. day of month and their delay times in minutes.

    For November, the worst day for delays was November 12th, with a mean delay of about 10 minutes; November 21st, the day before Thanksgiving, was second worst at about 7 minutes. Thanksgiving fell on November 22 in 2012, and the mean delay for Thanksgiving is about -9 minutes, the lowest of any November day. So Thanksgiving Day delays are not 'bad' compared to other days of the year, but the day before Thanksgiving is.

    For December, the worst day for delays was December 26th, with a mean delay of about 33 minutes. Christmas Day delays were still pretty bad, though, with a mean delay of about 14 minutes. That being said, several days in December (the 10th, 20th, 21st, 26th, 27th, and 29th) have worse delays than Christmas Day.

    So Thanksgiving Day is not a bad day to travel in terms of delay times, while Christmas Day is much worse.

    Number 16

    How many missing values are there in each variable?

    > colSums(is.na(winterDelays))
    YEAR                    QUARTER                 MONTH                   DAY_OF_MONTH            DAY_OF_WEEK
    0                       0                       0                       0                       0
    FL_DATE                 UNIQUE_CARRIER          AIRLINE_ID              CARRIER                 TAIL_NUM
    0                       0                       0                       0                       0
    FL_NUM                  ORIGIN_AIRPORT_ID       ORIGIN_AIRPORT_SEQ_ID   ORIGIN_CITY_MARKET_ID   ORIGIN
    0                       0                       0                       0                       0
    ORIGIN_CITY_NAME        ORIGIN_STATE_ABR        ORIGIN_STATE_FIPS       ORIGIN_STATE_NM         ORIGIN_WAC
    0                       0                       0                       0                       0
    DEST_AIRPORT_ID         DEST_AIRPORT_SEQ_ID     DEST_CITY_MARKET_ID     DEST                    DEST_CITY_NAME
    0                       0                       0                       0                       0
    DEST_STATE_ABR          DEST_STATE_FIPS         DEST_STATE_NM           DEST_WAC                CRS_DEP_TIME
    0                       0                       0                       0                       0
    DEP_TIME                DEP_DELAY               DEP_DELAY_NEW           DEP_DEL15               DEP_DELAY_GROUP
    30721                   30721                   30721                   30721                   30721
    DEP_TIME_BLK            TAXI_OUT                WHEELS_OFF              WHEELS_ON               TAXI_IN
    0                       31540                   31540                   32950                   32950
    CRS_ARR_TIME            ARR_TIME                ARR_DELAY               ARR_DELAY_NEW           ARR_DEL15
    0                       32950                   35780                   35780                   35780
    ARR_DELAY_GROUP         ARR_TIME_BLK            CANCELLED               CANCELLATION_CODE       DIVERTED
    35780                   0                       0                       0                       0
    CRS_ELAPSED_TIME        ACTUAL_ELAPSED_TIME     AIR_TIME                FLIGHTS                 DISTANCE
    0                       35780                   35780                   0                       0
    DISTANCE_GROUP          CARRIER_DELAY           WEATHER_DELAY           NAS_DELAY               SECURITY_DELAY
    0                       1619153                 1619153                 1619153                 1619153
    LATE_AIRCRAFT_DELAY     FIRST_DEP_TIME          TOTAL_ADD_GTIME         LONGEST_ADD_GTIME       DIV_AIRPORT_LANDINGS
    1619153                 1950485                 1950485                 1950485                 0
    DIV_REACHED_DEST        DIV_ACTUAL_ELAPSED_TIME DIV_ARR_DELAY           DIV_DISTANCE            DIV1_AIRPORT
    1957674                 1958659                 1958659                 1957755                 0
    DIV1_AIRPORT_ID         DIV1_AIRPORT_SEQ_ID     DIV1_WHEELS_ON          DIV1_TOTAL_GTIME        DIV1_LONGEST_GTIME
    1957250                 1957250                 1957249                 1957249                 1957249
    DIV1_WHEELS_OFF         DIV1_TAIL_NUM           DIV2_AIRPORT            DIV2_AIRPORT_ID         DIV2_AIRPORT_SEQ_ID
    1958597                 0                       0                       1961419                 1961419
    DIV2_WHEELS_ON          DIV2_TOTAL_GTIME        DIV2_LONGEST_GTIME      DIV2_WHEELS_OFF         DIV2_TAIL_NUM
    1961419                 1961419                 1961419                 1961479                 0
    DIV3_AIRPORT            DIV3_AIRPORT_ID         DIV3_AIRPORT_SEQ_ID     DIV3_WHEELS_ON          DIV3_TOTAL_GTIME
    0                       1961487                 1961487                 1961487                 1961487
    DIV3_LONGEST_GTIME      DIV3_WHEELS_OFF         DIV3_TAIL_NUM           DIV4_AIRPORT            DIV4_AIRPORT_ID
    1961487                 1961489                 1961489                 1961489                 1961489
    DIV4_AIRPORT_SEQ_ID     DIV4_WHEELS_ON          DIV4_TOTAL_GTIME        DIV4_LONGEST_GTIME      DIV4_WHEELS_OFF
    1961489                 1961489                 1961489                 1961489                 1961489
    DIV4_TAIL_NUM           DIV5_AIRPORT            DIV5_AIRPORT_ID         DIV5_AIRPORT_SEQ_ID     DIV5_WHEELS_ON
    1961489                 1961489                 1961489                 1961489                 1961489
    DIV5_TOTAL_GTIME        DIV5_LONGEST_GTIME      DIV5_WHEELS_OFF         DIV5_TAIL_NUM           X
    1961489                 1961489                 1961489                 1961489                 1961489

    The table above shows the number of missing values (NAs) for each variable in the data set.

    Number 17

    Each of the variables DEP_TIME, DEP_DELAY, DEP_DELAY_NEW have the same number of missing values. Do these missing values correspond to the same records for each of these variables?

    > length(which(is.na(winterDelays$DEP_TIME)))
    [1] 30721
    > length(which(is.na(winterDelays$DEP_DELAY)))
    [1] 30721
    > length(which(is.na(winterDelays$DEP_DELAY_NEW)))
    [1] 30721

    Above shows that each of the three variables has the exact same number of missing values.

    > identical(which(is.na(winterDelays$DEP_DELAY_NEW)), which(is.na(winterDelays$DEP_TIME)))
    [1] TRUE

    The output TRUE shows that the missing values of DEP_DELAY_NEW and DEP_TIME fall on exactly the same records. Note that identical() compares only its first two arguments, so DEP_DELAY needs its own call, for example identical(which(is.na(winterDelays$DEP_DELAY)), which(is.na(winterDelays$DEP_TIME))), before concluding that all three variables are missing together.
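    Since identical() takes only two objects to compare (its further arguments are options, not more objects), checking three columns needs pairwise comparisons. A minimal sketch on a made-up data frame, with the column names borrowed from the data set and everything else assumed for illustration:

```r
# Toy data with the three column names from the data set (values are made up)
dep <- data.frame(
  DEP_TIME      = c(900, NA, 1215, NA, 1730),
  DEP_DELAY     = c(5,   NA, -3,   NA, 12),
  DEP_DELAY_NEW = c(5,   NA, 0,    NA, 12)
)

# Logical matrix: TRUE where a value is missing
na_mask <- is.na(dep[, c("DEP_TIME", "DEP_DELAY", "DEP_DELAY_NEW")])

# Pairwise comparison of the missingness patterns against the first column
same_records <- all(na_mask[, 1] == na_mask[, 2]) &&
                all(na_mask[, 1] == na_mask[, 3])
same_records

# Equivalent check: every row has either none or all three values missing
all(rowSums(na_mask) %in% c(0, 3))
```

    The rowSums() form scales to any number of columns: replace 3 with the number of columns being compared.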

    Number 18

    Does the distribution of delays depend on the time of day? Provide evidence for your conclusion.

    library(lattice)
    bwplot(ARR_DELAY ~ ARR_TIME_BLK, data = winterDelays, scales=list(rot=45), ylim=c(-80,80), main="Delays by Time of Day", xlab="Time of Day", ylab="Delay (in minutes)")

    In this plot the median delay increases as the day goes on, and the spread of the delays increases as well. This is likely because some flights depend on the ones before them, as there are only a certain number of gates at each airport. So, yes, the distribution of delays does depend on the time of day.

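    Number 19

    What proportion of flights took off late?

    > nrow(winterDelays[winterDelays$DEP_DELAY > 0, ]) / nrow(winterDelays)
    [1] 0.3814918

    By this count about 38% of flights took off late. Note that indexing rows with a logical vector that contains NA returns a row for each NA, so the numerator also includes the 30,721 flights with a missing DEP_DELAY. Counting only flights known to have departed late, with sum(winterDelays$DEP_DELAY > 0, na.rm=TRUE) / nrow(winterDelays), lowers the figure by 30721/1961489, to roughly 36.6%.

    Number 20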

    What proportion of flights arrived late? What proportion arrived early?

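    > nrow(winterDelays[winterDelays$ARR_DELAY > 0, ]) / nrow(winterDelays)
    [1] 0.3771094

    By this count the proportion of flights that arrived late is about 38%. However, the row indexing also counts the 35,780 flights with a missing ARR_DELAY. The table in Number 10 gives 703,916 flights with ARR_DELAY > 0, which is about 35.9% of all 1,961,489 flights, or about 36.6% of the 1,925,709 flights with a recorded arrival delay.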
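    Missing values matter for counts like this one, because indexing the rows of a data frame with a logical vector that contains NA returns an extra all-NA row for each NA. The sketch below shows the effect on a made-up six-row data frame (the values and object names are illustrative, not taken from winterDelays).

```r
# Toy data: two of the six flights have no recorded arrival delay
toy <- data.frame(
  FL_NUM    = 1:6,
  ARR_DELAY = c(10, -5, NA, 3, NA, -1)
)

# Row indexing keeps one all-NA row per NA in the condition
naive_count <- nrow(toy[toy$ARR_DELAY > 0, ])

# sum() with na.rm = TRUE counts only flights known to be late
safe_count <- sum(toy$ARR_DELAY > 0, na.rm = TRUE)

share_of_all      <- safe_count / nrow(toy)                # late / all flights
share_of_recorded <- mean(toy$ARR_DELAY > 0, na.rm = TRUE) # late / flights with a delay value

c(naive = naive_count, safe = safe_count)
c(all = share_of_all, recorded = share_of_recorded)
```

    Which denominator to use depends on the question: share_of_all treats flights with a missing delay as not late, while share_of_recorded leaves them out entirely.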

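    > nrow (winterDelays [winterDelays$

    depart=nrow(subset(winterDelays,DEP_DELAY>0))

    arrive=nrow(subset(winterDelays,ARR_DELAY>0))

    depart/(arrive+depart)

    So late departures make up about 51% of the combined count of late departures and late arrivals. Note that this ratio is not the proportion of flights that both took off late and arrived late; that would require counting the rows where DEP_DELAY > 0 and ARR_DELAY > 0 hold together.

    Number 22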
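    A proportion of flights that were late on both ends needs a joint condition on each row rather than a ratio of two separate counts. A minimal sketch on made-up data, with the DEP_DELAY and ARR_DELAY column names borrowed from the data set and all values and object names assumed for illustration:

```r
# Toy data (made up); the last flight has no recorded delays
toy <- data.frame(
  DEP_DELAY = c(12,  0, 30, -4,  8, NA),
  ARR_DELAY = c( 5, -3, 41,  2, -6, NA)
)

late_dep <- toy$DEP_DELAY > 0
late_arr <- toy$ARR_DELAY > 0

# Flights that departed late AND arrived late
both <- sum(late_dep & late_arr, na.rm = TRUE)

p_both_all      <- both / nrow(toy)                   # share of all flights
p_arr_given_dep <- both / sum(late_dep, na.rm = TRUE) # share of late departures that also arrived late

c(both = both, p_both_all = p_both_all, p_arr_given_dep = p_arr_given_dep)
```

    The two proportions answer different questions, so it is worth stating which denominator is meant when reporting the result.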

    Do planes leaving late tend to make up time?

    depart2 = subset(winterDelays, DEP_DELAY > 0)

    > nrow(subset(depart2, (DEP_DELAY - ARR_DELAY) > 0))
    [1] 497560

    > nrow(subset(depart2, (DEP_DELAY - ARR_DELAY) < 0))
    [1] 196348

    The numbers above tell us that flights that leave late do tend to make up time. A flight that leaves late would be expected to arrive late by at least the same amount, but 497,560 of the flights that left late arrived with a smaller delay than they departed with, meaning they made up time, while only 196,348 lost further time. A very impressive result, in my opinion.
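    Beyond counting how many late flights gained or lost time, the size of the recovery can be summarized directly. The sketch below does this on a small made-up data frame; the values and the names late and recovered are assumptions for illustration.

```r
# Toy data (made up), using the DEP_DELAY and ARR_DELAY column names
toy <- data.frame(
  DEP_DELAY = c(15, 40,  5, 60,  -3, 22),
  ARR_DELAY = c( 4, 45, -2, 38, -10, 22)
)

# Keep only flights that departed late
late <- subset(toy, DEP_DELAY > 0)

# Minutes recovered between departure and arrival (negative = lost more time)
recovered <- late$DEP_DELAY - late$ARR_DELAY

summary(recovered)   # distribution of minutes made up
mean(recovered > 0)  # share of late departures that made up time
```

    Applied to the counts above, the same share would be 497560 / (497560 + 196348), about 72% of the late departures whose delay changed in either direction.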

    Number 23

    Do flights that take off late fly faster to make up time?

    tapply(winterDelays/AIR_TIME, winterDelays$ARR_DELAY

    So the average distance from SFO to LAS is 414 miles.

    > tapply(winterDelays$DISTANCE, winterDelays$ORIGIN=="SFO" & winterDelays$DEST=="LAX", mean)
       FALSE     TRUE
    762.8049 337.0000

    So the average distance from SFO to LAX is 337 miles.

    Yes, this can be done for any pair of airports in the data set with a simple substitution of the origin and destination airports.

    Number 26

    For flights from SFO to these 5 most popular destinations, compute and display the distribution of the average speed of the airplane on each flight?

    subset = winterDelays[winterDelays$ORIGIN=="SFO" & as.character(winterDelays$DEST) %in% c("ORD", "SAN", "LAX", "JFK", "LAS"),]
    subset[,"DEST"] = as.character(subset[,"DEST"])
    # DISTANCE is in miles and AIR_TIME is in minutes, so speed is in miles per minute
    speed = (subset$DISTANCE / subset$AIR_TIME)
    splt = split(speed, list(subset$DEST))
    boxplot(splt)

    Above shows the distribution of the average speed (in miles per minute) of the flights from SFO to each of its top 5 destination airports.

    Number 27

    Are the distributions of delays for "commuter" flights between SFO and LAX similar to those for SFO to JFK, EWR?

    delay = (winterDelays$ARR_DELAY)
    splt2 = split(delay, list(subset$DEST))

    ## Warning: data length is not a multiple of split variable

    boxplot(splt2, outline=FALSE)

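    The above plot suggests that the distributions for flights from SFO to LAS and LAX are similar to those for flights to JFK, despite the JFK flights covering a much greater distance. However, the warning indicates a problem: delay holds the arrival delays of all flights, while subset$DEST covers only the SFO flights to the five destinations, so split() recycles the destination labels and the groups do not actually correspond to routes. Splitting subset$ARR_DELAY instead would give the intended comparison. This plot does not show the outliers like the one below, but it gives a better view of the distributions.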

    boxplot(splt2)

    This is the same box plot as the one above, but it includes the outliers and still shows that the distributions are similar.
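    For a route-by-route comparison, the delay vector and the grouping vector passed to split() need to come from the same rows. The following is a minimal sketch on made-up data (column names borrowed from the data set; the values, the routes object and the choice of LAX, JFK and EWR as destinations are assumptions for illustration).

```r
# Toy data (made up), using the ORIGIN, DEST and ARR_DELAY column names
toy <- data.frame(
  ORIGIN    = c("SFO", "SFO", "SFO", "SFO", "LAX", "SFO"),
  DEST      = c("LAX", "JFK", "LAX", "EWR", "SFO", "JFK"),
  ARR_DELAY = c(5, -10, 12, 30, 0, 8),
  stringsAsFactors = FALSE
)

# Restrict to SFO departures for the destinations being compared
routes <- toy[toy$ORIGIN == "SFO" & toy$DEST %in% c("LAX", "JFK", "EWR"), ]

# Both vectors come from `routes`, so every delay is paired with its own destination
by_dest <- split(routes$ARR_DELAY, routes$DEST)

sapply(by_dest, median)  # one summary per destination
# boxplot(by_dest, outline = FALSE) would draw the side-by-side distributions
```

    Because both arguments have the same length, split() raises no recycling warning and each group contains only the delays for that destination.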

    Number 28

    What other questions could we address with this data?

    We could ask which days of the winter months are the best days to fly in terms of delay time. We could find which airline carrier had the highest average delay time for the winter months. We could also see which airline carrier experienced the most cancellations.

    Number 29

    What other data/variables would allow us to address additional interesting questions?

    If we had data on the weather conditions per flight, we could find what percentage of the delays were a result of weather conditions.

    We could also gather data on passengers, and see if flights with more passengers had a higher average delay time.

    We could gather delay data from the entire year to find the worst and best times to fly in terms of delay.