Data analysis and statistical inference project

Data Analysis and Statistical Inference Project - Coursera

Association between confidence in banks and social class

Marușa Beca

TITLE: "Association between confidence in banks and social class"

DATE: "20.10.2014"

OUTPUT: html_document:

THEME: cerulean

Introduction:

RESEARCH QUESTION: Is the social class a person identifies with associated with their level of confidence in banks and financial institutions? I am a financial researcher, so this question relates to my field of interest. I consider it pertinent because financial institutions could use the results of this study to develop marketing strategies aimed at improving confidence in banks among people in particular social classes.

Data:

DATA-citation: Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut / Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. http://doi.org/10.3886/ICPSR34802.v1

DATA-collection: The data were collected between 1972 and 2012 in the United States. The modes of data collection are computer-assisted personal interview (CAPI), face-to-face interview and telephone interview. The unit of observation is the individual; the universe is all non-institutionalized, English- and Spanish-speaking persons 18 years of age or older living in the United States. The data type is survey data.

DATA - Cases (observational/experimental units): The data set contains 57,061 cases and 114 variables. Each case is one survey respondent.

DATA - Variables: The response variable is CONFINAN - CONFIDENCE IN BANKS & FINANCIAL INSTITUTIONS (categorical, ordinal), with 3 levels: A Great Deal, Only Some and Hardly Any. The explanatory variable is CLASS - SUBJECTIVE CLASS IDENTIFICATION (categorical, ordinal), with 5 levels: Lower Class, Working Class, Middle Class, Upper Class and No Class.

DATA - Type of study: This is an observational study: the data were collected in a way that does not directly interfere with how the data arise, and the analysis uses data that had already been collected. Because there was no random assignment, the study can only establish an association, not causation. Because the sampling was random, the results are generalizable to the population of interest.

DATA - Scope of inference - generalizability: The population of interest is all non-institutionalized, English- and Spanish-speaking persons 18 years of age or older living in the United States. Because the respondents were randomly sampled, the results can be generalized to this population. Potential sources of bias:

• Confounding variables, which could include: educ (level of education), paeduc (HIGHEST YEAR SCHOOL COMPLETED, FATHER), maeduc (HIGHEST YEAR SCHOOL COMPLETED, MOTHER), degree (RS HIGHEST DEGREE), race (RACE OF RESPONDENT), sex (RESPONDENTS SEX), unemp (EVER UNEMPLOYED IN LAST TEN YRS)

• Convenience sample

• Non-response

• Voluntary response

DATA - Scope of inference - causality: We cannot establish causal links between the variables of interest because this is an observational study with no random assignment, not an experiment. We can only establish an association between them.

A first look at the GSS data:

names(gss)

##   [1] "caseid"   "year"     "age"      "sex"      "race"     "hispanic"
##   [7] "uscitzn"  "educ"     "paeduc"   "maeduc"   "speduc"   "degree"
##  [13] "vetyears" "sei"      "wrkstat"  "wrkslf"   "marital"  "spwrksta"
##  [19] "sibs"     "childs"   "agekdbrn" "incom16"  "born"     "parborn"
##  [25] "granborn" "income06" "coninc"   "region"   "partyid"  "polviews"
##  [31] "relig"    "attend"   "natspac"  "natenvir" "natheal"  "natcity"
##  [37] "natcrime" "natdrug"  "nateduc"  "natrace"  "natarms"  "nataid"
##  [43] "natfare"  "natroad"  "natsoc"   "natmass"  "natpark"  "confinan"
##  [49] "conbus"   "conclerg" "coneduc"  "confed"   "conlabor" "conpress"
##  [55] "conmedic" "contv"    "conjudge" "consci"   "conlegis" "conarmy"
##  [61] "joblose"  "jobfind"  "satjob"   "richwork" "jobinc"   "jobsec"
##  [67] "jobhour"  "jobpromo" "jobmeans" "class"    "rank"     "satfin"
##  [73] "finalter" "finrela"  "unemp"    "govaid"   "getaid"   "union"
##  [79] "getahead" "parsol"   "kidssol"  "abdefect" "abnomore" "abhlth"
##  [85] "abpoor"   "abrape"   "absingle" "abany"    "pillok"   "sexeduc"
##  [91] "divlaw"   "premarsx" "teensex"  "xmarsex"  "homosex"  "suicide1"
##  [97] "suicide2" "suicide3" "suicide4" "fear"     "owngun"   "pistol"
## [103] "shotgun"  "rifle"    "news"     "tvhours"  "racdif1"  "racdif2"
## [109] "racdif3"  "racdif4"  "helppoor" "helpnot"  "helpsick" "helpblk"

head(gss$class)

## [1] Middle Class  Middle Class  Working Class Middle Class  Working Class
## [6] Middle Class
## Levels: Lower Class Working Class Middle Class Upper Class No Class

head(gss$confinan)

## [1] <NA> <NA> <NA> <NA> <NA> <NA>
## Levels: A Great Deal Only Some Hardly Any

Summary statistics

summary(gss$class)

##   Lower Class Working Class  Middle Class   Upper Class      No Class
##          3147         24458         24289          1741             1
##          NA's
##          3425

summary(gss$confinan)

## A Great Deal    Only Some   Hardly Any         NA's
##         9015        19659         6379        22008

I created the data frame my_data, which contains only the two variables of interest, and removed the rows with NA values.

my_data <- gss[, c("class", "confinan")]
my_data <- my_data[complete.cases(my_data), ]
head(my_data)

##              class     confinan
## 4602 Working Class A Great Deal
## 4603  Middle Class    Only Some
## 4604 Working Class A Great Deal
## 4605  Middle Class    Only Some
## 4606  Middle Class    Only Some
## 4607  Middle Class    Only Some
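To make the filtering step concrete, here is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not GSS data). `complete.cases()` returns `TRUE` only for rows with no missing value in any column, so indexing with it keeps the respondents who answered both questions.

```r
# Hypothetical toy data: four respondents, two of them with a missing answer.
toy <- data.frame(
  class    = factor(c("Middle Class", NA, "Working Class", "Lower Class")),
  confinan = factor(c("Only Some", "A Great Deal", NA, "Hardly Any"))
)

# TRUE for rows in which every column is non-missing.
keep <- complete.cases(toy)

# Keep only the fully observed rows (rows 1 and 4 here).
toy_complete <- toy[keep, ]
print(toy_complete)
```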

summary(my_data$class)

##   Lower Class Working Class  Middle Class   Upper Class      No Class
##          2007         15634         15329          1113             1

summary(my_data$confinan)

## A Great Deal    Only Some   Hardly Any
##         8775        19102         6207

I created the data frame newdata by removing the single respondent in the No Class category.

newdata <- subset(my_data, class != "No Class")
table(droplevels(newdata)$class)

##
##   Lower Class Working Class  Middle Class   Upper Class
##          2007         15634         15329          1113

n1 <- length(newdata$class)
n2 <- length(newdata$confinan)
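A note on why `droplevels()` appears in these calls: `subset()` removes the rows, but a factor keeps its full set of levels, so an empty "No Class" category would still show up in tables and plots. A small sketch with made-up values (`toy_class` is hypothetical):

```r
# Hypothetical factor with one "No Class" respondent.
toy_class <- factor(
  c("Lower Class", "Middle Class", "No Class", "Middle Class"),
  levels = c("Lower Class", "Middle Class", "No Class")
)

# Removing the respondent does not remove the level itself.
kept <- toy_class[toy_class != "No Class"]
print(table(kept))              # still lists "No Class", with a count of 0
print(table(droplevels(kept)))  # the unused level is gone

# One possible simplification (an assumption, not the author's code):
# drop the levels once on the whole data frame, so later calls do not
# need to repeat droplevels().
# newdata <- droplevels(subset(my_data, class != "No Class"))
```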

Exploratory data analysis

Social class structure - The majority of the respondents belong to the working class and the middle class (91%).

round(table(droplevels(newdata)$class)*100/n1,2)

##
##   Lower Class Working Class  Middle Class   Upper Class
##          5.89         45.87         44.98          3.27
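The same percentages can be obtained with `prop.table()`, which divides each count by the total and so removes the need for a separately stored sample size. A minimal sketch, with the class counts typed in by hand from the output above (the names `class_counts` and `class_pct` are illustrative):

```r
# Counts per social class, copied from the table of newdata$class.
class_counts <- c("Lower Class" = 2007, "Working Class" = 15634,
                  "Middle Class" = 15329, "Upper Class" = 1113)

# Percentage of respondents in each class.
class_pct <- round(100 * prop.table(class_counts), 2)
print(class_pct)

# Combined share of the working and middle classes.
print(sum(class_pct[c("Working Class", "Middle Class")]))
```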

barplot(round(table(droplevels(newdata)$class)*100/n1,2), ylim=c(0,100), main="Social Class Structure (%)", col=rainbow(4))

Level of confidence structure - The majority of respondents stated that they have only some confidence in the financial system (56%).

round(table(newdata$confinan)*100/n2,2)

##
## A Great Deal    Only Some   Hardly Any
##        25.75        56.05        18.21

barplot(round(table(newdata$confinan)*100/n2,2), ylim=c(0,100), main="Level of Confidence Structure (%)", col=rainbow(3))

Some visualization of both variables

barplot(table(droplevels(newdata)$class, newdata$confinan), main="Social class by level of confidence in banks", col=rainbow(4), legend=TRUE)

(table(droplevels(newdata)$class, newdata$confinan)*100)/n1

##
##                 A Great Deal Only Some Hardly Any
##   Lower Class         1.2968    3.0396     1.5521
##   Working Class      10.8676   25.7724     9.2304
##   Middle Class       12.5282   25.5024     6.9448
##   Upper Class         1.0533    1.7311     0.4812

table(droplevels(newdata)$class, newdata$confinan)

##
##                 A Great Deal Only Some Hardly Any
##   Lower Class            442      1036        529
##   Working Class         3704      8784       3146
##   Middle Class          4270      8692       2367
##   Upper Class            359       590        164

mosaicplot(table(droplevels(newdata)$class, newdata$confinan), main="Confidence in banks according to Social Class", cex.axis=0.8, col=rainbow(3))

From the contingency table we observe a positive relationship between social class and confidence in banks and financial institutions: the higher the social class, the higher the confidence. People in the lower class are the least confident (22.0% answered A Great Deal and 26.4% Hardly Any), while people in the upper class have the most confidence in the financial system (32.3% A Great Deal and 14.7% Hardly Any).
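Because the classes differ greatly in size, the comparison is easier to read as row percentages, that is, the distribution of confidence within each class. A minimal sketch, with the counts from the contingency table typed in by hand (the object names `counts` and `row_pct` are assumptions made for this illustration):

```r
# Observed counts: rows are social classes, columns are confidence levels.
counts <- matrix(
  c( 442, 1036,  529,
    3704, 8784, 3146,
    4270, 8692, 2367,
     359,  590,  164),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    class    = c("Lower Class", "Working Class", "Middle Class", "Upper Class"),
    confinan = c("A Great Deal", "Only Some", "Hardly Any")
  )
)

# margin = 1 divides each cell by its row total, so every row sums to 100%.
row_pct <- round(100 * prop.table(counts, margin = 1), 1)
print(row_pct)
```

Read down the columns: the share answering A Great Deal rises from the lower to the upper class, while the share answering Hardly Any falls.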

Inference

The two hypotheses are the following:

H_0: Confidence in banks and social class are independent.

H_A: Confidence in banks and social class are dependent.

List of conditions necessary for performing a chi-square test of independence:

• the observations should be independent - the respondents were randomly sampled, and the 34,083 cases analyzed are far fewer than 10% of the adult population of the United States, so it is reasonable to treat the observations as independent.

• expected counts for each cell should be at least 5 - all 12 cells meet this condition (the smallest expected count is 202.7), as can be seen in the output of the inference function. Expected counts are calculated as: (row total * column total) / grand total.

• degrees of freedom should be at least 2 - in this case there are 6 degrees of freedom: df = (R-1)(C-1) = (4-1)(3-1) = 6, where R is the number of rows in the two-way table and C is the number of columns.
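The two numerical conditions in this list can be checked directly from the observed counts. A minimal sketch, with the counts typed in by hand from the contingency table shown earlier (the names `observed` and `expected` are illustrative, and `deg_freedom` is a hypothetical variable name):

```r
# Observed counts: rows are social classes, columns are confidence levels.
observed <- matrix(
  c( 442, 1036,  529,
    3704, 8784, 3146,
    4270, 8692, 2367,
     359,  590,  164),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    class    = c("Lower Class", "Working Class", "Middle Class", "Upper Class"),
    confinan = c("A Great Deal", "Only Some", "Hardly Any")
  )
)

# Expected count per cell: (row total * column total) / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(round(expected, 1))

# Sample-size condition: every expected count should be at least 5.
print(all(expected >= 5))

# Degrees of freedom: (R - 1) * (C - 1).
deg_freedom <- (nrow(observed) - 1) * (ncol(observed) - 1)
print(deg_freedom)
```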

The method of inference is a hypothesis test only, because I have two categorical variables, both with more than 2 levels (4 levels for social class and 3 levels for confidence in banks). No confidence interval is associated with this test, so there is no other method to compare the result with. I will use the chi-square test of independence with the theoretical method, because the expected counts for each cell are at least 5.

source("http://bit.ly/dasi_inference")
inference(y = newdata$confinan, x = droplevels(newdata)$class, est = "proportion", type = "ht", method = "theoretical", alternative = "greater")

## Response variable: categorical, Explanatory variable: categorical
## Chi-square test of independence
##
## Summary statistics:
##               x
## y              Lower Class Working Class Middle Class Upper Class   Sum
##   A Great Deal         442          3704         4270         359  8775
##   Only Some           1036          8784         8692         590 19102
##   Hardly Any           529          3146         2367         164  6206
##   Sum                 2007         15634        15329        1113 34083

## H_0: Response and explanatory variable are independent.
## H_A: Response and explanatory variable are dependent.
## Check conditions: expected counts
##               x
## y              Lower Class Working Class Middle Class Upper Class
##   A Great Deal       516.7          4025         3947       286.6
##   Only Some         1124.8          8762         8591       623.8
##   Hardly Any         365.4          2847         2791       202.7
##
##  Pearson's Chi-squared test
##
## data:  y_table
## X-squared = 267.8, df = 6, p-value < 2.2e-16
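The `inference()` helper is loaded from an external course URL, which may not always be reachable. As a cross-check, base R's `chisq.test()` runs Pearson's chi-square test on a table of counts. In this sketch the counts are typed in by hand from the summary statistics above, and the names `observed` and `result` are illustrative:

```r
# Observed counts: rows are confidence levels, columns are social classes,
# matching the orientation of the summary table printed by inference().
observed <- matrix(
  c( 442, 3704, 4270, 359,
    1036, 8784, 8692, 590,
     529, 3146, 2367, 164),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    confinan = c("A Great Deal", "Only Some", "Hardly Any"),
    class    = c("Lower Class", "Working Class", "Middle Class", "Upper Class")
  )
)

result <- chisq.test(observed)
print(result$statistic)  # Pearson chi-square statistic
print(result$parameter)  # degrees of freedom
print(result$p.value)

# The p-value is the upper-tail area of the chi-square distribution.
print(pchisq(result$statistic, df = result$parameter, lower.tail = FALSE))
```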

Conclusion

Since the p-value is very small (below 2.2e-16) for a chi-square statistic of 267.8 with 6 degrees of freedom, we reject the null hypothesis: the data provide convincing evidence that confidence in banks and financial institutions and the social class a person identifies with are dependent. Because the study is observational, this is an association, not a causal link. For future research, the confounding variables should be included in the design, and we should also take into account that the Upper Class could include bankers.

References

Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut / Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. http://doi.org/10.3886/ICPSR34802.v1

Appendix

head(newdata)

##              class     confinan
## 4602 Working Class A Great Deal
## 4603  Middle Class    Only Some
## 4604 Working Class A Great Deal
## 4605  Middle Class    Only Some
## 4606  Middle Class    Only Some
## 4607  Middle Class    Only Some

tail(newdata)

##               class     confinan
## 57054   Lower Class A Great Deal
## 57055 Working Class    Only Some
## 57057 Working Class    Only Some
## 57058   Lower Class    Only Some
## 57060   Lower Class    Only Some
## 57061 Working Class A Great Deal