Fellows Training
FINCA Client Assessment Tool (FCAT): Data Cleaning
Slides Incorporate Important Information from ORC Macro. 2006. Demographic and Health Survey Interviewer’s Manual.
FCAT 2009 Data Cleaning
1. Data Integrity
2. Data Formats in FCAT
3. Data Challenges in FCAT
Data Integrity
If data is acceptable to use for statistical analysis, it has INTEGRITY.
Test: Will researchers question the results of a study simply based upon the data set that was used?
Data Integrity (continued)
Data has integrity if it is valid and reliable
Internal validity
• The concept you are trying to capture should be accurately measured
External validity (also known as “generalizability”)
• What populations do your findings apply to?
• Does your sample represent the population?
Statistical validity
• Will statistical models yield valid results?
Reliability
• Can the results be replicated or repeated?
Good Data
Importance of good data:
• Accuracy in findings
• Helps direct policy and operations
• Contributes to development of products and services
Examples of Integrity: Recall
High Validity, Low Reliability (Measurement Error)
Example: Expenditure recall over long periods
Solution: Shorten periods, verify responses, reframe questions (health is better or worse than average?)
Examples of Integrity: Self-Reporting
Low Validity, High Reliability (Systematic Bias)
Example: 85% of motorists self-report that they are above-average drivers
Solution: Ask their friends to rate them
Data Formats in FCAT
Data is recorded in 5 different formats:
1. Categorical
Non-overlapping, exclusive, and finite
Ex. Home Ownership
   1. Owned
   2. Leased
   3. Privately rented
   4. Government rented
   5. Rent free
   6. Squatted
   7. Other, please write-in
2. Ordinal/Scaled
Rated according to a given scale
Ex. Rate the loan application process:
   1. Very difficult
   2. Difficult
   3. Easy
   4. Very easy
3. Binary
Yes or No
Data Formats in FCAT (Continued)
Data is recorded in 5 different formats:
4. Write-ins
Text write-ins
Ex. Others please write in response: ______
*Be aware of the type of response expected to avoid inconsistencies and outliers.
5. Open-Ended
Number write-in
Ex. Food expenditures for the week: __ (in local currency)
Time to gather water: __ (minutes)
*Note: Always record units of measure
Data Challenges in FCAT
Inconsistent values
Outliers
Missing values
Calculated values
Others
Cleaning data
Data Cleaning
Data is only useable if it is properly cleaned
As the interviewer and the one familiar with the data, it is your job to ensure that the data is correct
Inconsistent Values
1. Definition: When a second response is made invalid (either impossible or simply inaccurate) by an earlier given answer
2. Examples (shaded cells show inconsistencies):
Continue w/ FINCA? (1=Yes, 2=No) | Who made the decision to leave? | Why did FINCA or Village Bank ask you to leave? | Do you plan to return in the future? (1=Yes, 2=No)
2 | Village Bank | Client defaulted | 1
2 | Client | N/A | N/A
1 | N/A | Client defaulted | N/A
3. Treatment:
a. Filter
b. Annotate
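The filter-and-annotate treatment can be sketched in a few lines of Python. This is only an illustration: the field names (`continue`, `who_decided`, `why_asked`, `plan_return`) are hypothetical, and the two skip rules are assumptions read off the example table, not the official FCAT skip logic.

```python
NA = "N/A"

def find_inconsistencies(row):
    """Return notes for answers that contradict the 'Continue w/ FINCA?' answer."""
    notes = []
    if row["continue"] == 1:
        # Assumed rule: a continuing client should skip every exit question.
        for field in ("who_decided", "why_asked", "plan_return"):
            if row[field] != NA:
                notes.append(f"{field} answered although client continues")
    elif row["continue"] == 2:
        # Assumed rule: a client who left should answer the return question.
        if row["plan_return"] not in (1, 2):
            notes.append("plan_return not answered although client left")
    return notes

# The three example rows from the slide.
rows = [
    {"continue": 2, "who_decided": "Village Bank",
     "why_asked": "Client defaulted", "plan_return": 1},
    {"continue": 2, "who_decided": "Client", "why_asked": NA, "plan_return": NA},
    {"continue": 1, "who_decided": NA,
     "why_asked": "Client defaulted", "plan_return": NA},
]

# Filter: keep only the rows that need attention, with their annotations.
flagged = {}
for number, row in enumerate(rows, start=1):
    notes = find_inconsistencies(row)
    if notes:
        flagged[number] = notes
```

The raw rows are never changed here; the notes are meant to be copied into the log or tracking document.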
Outliers
1. Definition: Response outside the range of values
Outliers (continued)
2. Examples:
1) In general how is your health at this time?
   1. Excellent
   2. Good
   3. Poor
   4. Very Poor
• Answer: 7
Outlier: Response is out of answer range
2) How much does your household spend per week for food?
• Answer in Ecuador: $10,000
Outlier: Response amount is very unlikely
3. Treatment:
a. Filter
b. Annotate
c. Correct value, if possible (e.g. mean of positive values)
Special mention: Inliers. These occur when a question calls for integers and the recorded answer is a decimal, e.g. recording a child’s age as .5 if he is yet to complete a year.
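The checks described on these slides can be sketched as three small Python functions. The function names are hypothetical, and the plausible maximum for weekly food spending is an assumed figure chosen for illustration; a real threshold would have to come from local knowledge.

```python
def out_of_range(answer, valid_codes):
    """True when a coded answer is not one of the codes on the questionnaire."""
    return answer not in valid_codes

def unlikely_amount(amount, plausible_max):
    """True when an amount is negative or above an assumed plausible maximum."""
    return amount < 0 or amount > plausible_max

def is_inlier(value):
    """True when a question calls for a whole number but a decimal was recorded."""
    return not float(value).is_integer()

HEALTH_CODES = {1, 2, 3, 4}   # Excellent, Good, Poor, Very Poor
WEEKLY_FOOD_MAX = 500         # assumed ceiling in local currency, illustration only

health_flag = out_of_range(7, HEALTH_CODES)            # the "Answer: 7" example
food_flag = unlikely_amount(10_000, WEEKLY_FOOD_MAX)   # the $10,000 example
age_flag = is_inlier(0.5)                              # child's age recorded as .5
```

Flagged values should still be annotated and justified by a person; the functions only point to where to look.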
Missing Values
1. Definition:
a. Stated information not recorded, not legitimate skips
b. _____
2. Examples (in shaded cells):
Continue w/ FINCA? (1=Yes, 2=No) | Who made the decision to leave? | Why did FINCA or Village Bank ask you to leave? | Do you plan to return in the future? (1=Yes, 2=No)
2 Village Bank Group dissolved 1
2 Client defaulted 2
1 N/A N/A
3. Treatment:
a. Filter
b. Annotate
c. Correct value, if possible
Ex. If you can distinguish between missing values and legitimate skips, replace the missing values with the mean over a defined sample (e.g. branch or region).
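A minimal Python sketch of that mean replacement follows. It assumes missing values are stored as `None` and legitimate skips as the text "N/A"; the function name, the field names `branch` and `food`, and the sample figures are made up for illustration.

```python
from statistics import mean

SKIP = "N/A"  # assumed marker for a legitimate skip

def impute_by_group(records, group_key, value_key):
    """Return cleaned copies; missing values (None) get the mean of their group."""
    observed = {}
    for rec in records:
        value = rec[value_key]
        if value is not None and value != SKIP:
            observed.setdefault(rec[group_key], []).append(value)

    cleaned = []
    for rec in records:
        new_rec = dict(rec)  # copy, so the raw data stays untouched
        group = rec[group_key]
        if rec[value_key] is None and group in observed:
            new_rec[value_key] = mean(observed[group])
            new_rec["note"] = "imputed with group mean"
        cleaned.append(new_rec)
    return cleaned

raw = [
    {"branch": "A", "food": 10},
    {"branch": "A", "food": 20},
    {"branch": "A", "food": None},   # missing value
    {"branch": "B", "food": SKIP},   # legitimate skip, left alone
]
cleaned = impute_by_group(raw, "branch", "food")
```

Working on copies keeps the raw file and the cleaned file separate, and the added `note` field records which values were replaced.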
Calculated Values and Other Challenges
Calculated Values
1. Definition: Data derived from sub-aggregated variables
2. Examples: DPCE, PPP converted from local currency unit
3. Treatment:
• Record units of measure
• Check formulas
Others
Text is text; numbers are numbers. Do not write in text responses for columns that accept only numbers. Please use the “Other” or “Notes” columns for this purpose.
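One way to enforce the "text is text; numbers are numbers" rule is sketched below in Python. The helper name is hypothetical; it simply separates an entry into a number for the numeric column and leftover text for the "Notes" column.

```python
def split_number_and_note(raw_entry):
    """Return (number, note): a float when the entry is numeric, else None plus the text."""
    text = str(raw_entry).strip()
    try:
        return float(text), ""
    except ValueError:
        # Not a number: keep the numeric column empty and move the text to Notes.
        return None, text

minutes, note = split_number_and_note("45")
missing_minutes, text_note = split_number_and_note("about an hour")
```

An entry that ends up with an empty numeric column still has to be followed up with the interviewer rather than guessed.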
Cleaning Data – Do’s
Frequent and periodic
• End of the day
• Much easier to clean 20 interviews than 80 or 320!
Smaller samples are easier to manage
• Avoids locality effects on false identification
• Avoids contamination of derived variables (e.g. DPCE)
Keep two files:
• Raw data
• Cleaned data
• Always keep a back-up as well
Record and annotate all data issues in a log or tracking document
Techniques:
• Filtering
• Histograms
• Pivot tables
In other words, do not let data problems snowball
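Outside a spreadsheet, the filtering and histogram techniques can be approximated with a frequency tally. The Python sketch below uses made-up answers to a question coded 1 to 4; the variable names are illustrative only.

```python
from collections import Counter

VALID_CODES = {1, 2, 3, 4}

# Hypothetical end-of-day answers to a question coded 1 to 4.
answers = [1, 2, 2, 3, 2, 7, 4, 1, 2]

# Histogram: how often each recorded code appears.
counts = Counter(answers)

# Filter: codes that appear in the data but not on the questionnaire.
unexpected = sorted(code for code in counts if code not in VALID_CODES)

# Plain-text histogram for a quick visual check.
histogram_lines = [f"{code}: {'#' * counts[code]}" for code in sorted(counts)]
```

Running a tally like this at the end of each day keeps the number of interviews to check small.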
Client ID Collection
Please collect Client ID information from EACH client interviewed.
Collecting it is not a violation of privacy. You can assure clients that sharing their personal information will not harm them in any way and that their responses will be used to help make decisions that improve loan products and services.
SurveyID
For Entry into the Data Warehouse, we need to create a PRIMARY KEY for the Main Form to link to the cleaned Subform. The code appears like this when finished:
DC20083101 (2 letter country code, the year collected, and an overall interview number from one fellow)
Fellows should give each other a number (1, 2, or 3), and then should add a column in BOTH the main form AND the Household Subform.
SurveyID (cont’d)
Fellow 1 should take his/her overall individual interview number and add 1000 to it, fellow 2 should add 2000, and fellow 3 should add 3000.
Ecuador=EC
Zambia=ZM
Therefore, the 14th interview performed by Mexico Fellow #2 would be MX20092014. It would read the same in the main form AND the HHSubform. Please maintain this convention throughout the Fellowship.
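The convention can be written as a small Python function. The function name is hypothetical, and the limit of 999 interviews per fellow is an assumption: a higher interview number would run into the next fellow's range.

```python
def make_survey_id(country_code, year, fellow_number, interview_number):
    """Build a SurveyID: country code + year + (fellow number * 1000 + interview number)."""
    if len(country_code) != 2 or not country_code.isalpha():
        raise ValueError("country code must be two letters, e.g. EC or ZM")
    if fellow_number not in (1, 2, 3):
        raise ValueError("fellow number must be 1, 2, or 3")
    if not 1 <= interview_number <= 999:
        # Assumed limit: 1000 or more would overlap the next fellow's numbers.
        raise ValueError("interview number must be between 1 and 999")
    return f"{country_code.upper()}{year}{fellow_number * 1000 + interview_number}"

# The 14th interview performed by Mexico Fellow #2 in 2009.
example_id = make_survey_id("MX", 2009, 2, 14)
```

The same value goes into the SurveyID column of both the main form and the Household Subform so that the two can be linked.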
Clean Data
Data is “clean” if:
• All categorical codes match those in the survey design sheet
* Ex.: Match drinking water sources with codes 1-15
• All ordinal data are represented as whole numbers
* Ex.: Do not have 3.4 years of education
• Outliers have been justified
• Missing data have been correctly annotated
Questions?