
World Bank Group Finances

finances.worldbank.org

@WBOpenFinances

DC Big Data Exploration Final Report

March 15-17, 2013


TABLE OF CONTENTS

ACKNOWLEDGEMENTS
EXECUTIVE SUMMARY
DETAILED PROJECT REPORTS
• PREDICTING SMALL-SCALE POVERTY MEASURES FROM NIGHT ILLUMINATION
• SCRAPING WEBSITES TO COLLECT CONSUMPTION AND PRICE DATA
• LATIN AMERICA POVERTY ANALYSIS FROM MOBILE SURVEYS
• MEASURING SOCIOECONOMIC INDICATORS IN ARABIC TWEETS
• ANALYZING WORLD BANK DATA FOR SIGNS OF FRAUD AND CORRUPTION
• UNDP RESOURCE ALLOCATION
• OTHER PROJECTS: A HEURISTIC TOOL FOR AUDITING AND SOCIAL NETWORK ANALYSIS FOR RISK MEASUREMENT
NEXT STEPS

(Each project report contains the subsections Background and Problem Statement, Datasets Available, Key Findings, Methods and Analysis, Recommendations and Next Steps, and Additional Resources.)


ACKNOWLEDGEMENTS
The World Bank Group Finances team is grateful to all the partners that supported the DC Data Dive on March 15-17:

• UN Global Pulse
• Qatar Computing Research Institute
• UNDB
• UNDP

The following groups from inside the Bank also contributed:

• AFR
• CTR
• DEC
• EXT
• IEG
• INT
• LAC
• OPCS
• PREM
• TWICT
• WBI

We are also grateful to DataKind, the data ambassadors that it assembled, and the volunteers who participated. The DC Big Data Exploration would not have occurred without them. The DC Big Data Exploration was preceded by a similar event that the UNDP organized in Vienna in February.


EXECUTIVE SUMMARY
In a recent speech, Dr. Jim Kim asked 'what will it take for the World Bank Group to be at its best on every project, for every client, every day?' His own prescription was that 'We must…support our clients in applying evidence-based, non-ideological solutions to development challenges… This is the next frontier for the World Bank Group…we need to continue investing in data and analytic tools, building on the success of the Open Data initiative. Data are crucial to setting priorities, making sound policy, and tracking results.' The age of big data carries the tantalizing promise of reshaping international development. There is already overwhelming evidence from the private sector that big data can be transformative. UPS used sensor data to shave 30 million miles off its drivers' routes. Reports claim that predictive analytics has been worth about $23 billion to Target over eight years. Services like Farecast use vast amounts of seemingly unconnected data to create information services that would not have been possible a few years ago. The question, then, is whether and how big data has a role in international development. The UN-led Global Pulse initiative seeks to harness 'today's new world of digital data and real-time analytics to gain a better understanding of changes in human well-being'. The World Bank too sees big data as a promising area, but one that needs further exploration. On March 15-17, DataKind, in partnership with the World Bank and its partners from UNDP, UNDB, UN Global Pulse, and the Qatar Computing Research Institute, held the DC Big Data Exploration to explore new ways of using big data to fight poverty and corruption. The event drew more than 120 pro bono data scientists from Washington DC and across the nation to the World Bank's Preston Auditorium.
Working alongside Bank experts on the Poverty and Fraud & Corruption teams, the data scientists uncovered new ways of collecting, exploring, and visualizing data to maximize their impact. The collaboration between the two communities yielded new insights from World Bank data, devised new ways of using existing big data sources for monitoring poverty and corruption, and created entirely new streams of data that the Bank and its partners can use in future research. Prior to the event, DataKind and the World Bank's Poverty and Fraud & Corruption teams identified six key projects to tackle over the weekend. The projects were designed to address the Bank's needs and generate tangible insights within a 24- to 48-hour period:

Predicting Small-Scale Poverty Measures from Night Illumination: The team explored whether nighttime light imagery could be used to estimate sub-national poverty levels. Over the weekend, the team created software to overlay lighting information with other geospatial indicators (e.g., population, change in poverty) and performed a statistical analysis showing that lighting levels in satellite images were predictive of 2001 poverty levels in Bangladesh. The Bank can use these findings to carry out more sophisticated experiments relating nighttime lighting to poverty and to build software to monitor poverty in real time from remote sensing.


Scraping Websites to Collect Consumption and Price Data: To combat the lack of price data, this team wrote software to scrape food prices from supermarket and cost-of-living websites and other sources of food data. The results yielded real-time food price monitoring data for early alerts to food crises, better information for battling inflation, and a richer perspective on food data.

Measuring Socioeconomic Indicators in Arabic Tweets: This project analyzed more than 25 GB of Arabic tweets to see if socioeconomic conditions could be estimated from what people were saying on social media. The team wrote code to track key socioeconomic terms over time (e.g., "bankrupt" or "food") and estimated the time zones, locations, and gender of the authors from their messages alone. These findings could be used in future work to design proper experiments testing for socioeconomic differences across regions or demographics based on passively collected social media data.

Latin America Poverty Analysis from Mobile Surveys: This project analyzed Listening to Latin America (L2L) mobile survey data to understand socioeconomic conditions in Peru. The team analyzed basic survey results and discovered patterns in the survey response rates: for example, mobile surveys yield very different answers to socially sensitive questions, and economic incentives do not seem to affect response rates. The Bank can use these findings to plan future surveys that collect more accurate information.

Analyzing World Bank Data for Signs of Fraud and Corruption: This project combined the World Bank's internal supplier, contractor, and bidder data with external data to gain a richer perspective on how firms bidding on Bank-financed contracts behaved. The team created new unified databases that make analysis easier for the Bank, and identified interesting patterns in relationships between debarred and non-debarred organizations, co-bidder patterns, and Bank lending patterns over time.
UNDP Resource Allocation: The UNDP provided capacity and project data in order to understand what mix of skills and budgets produced the best program results. The team explored expenditures by project and identified the types of projects and regions that hit their budget goals compared to those that did not.

Heuristic Auditing Tool and Supplier Social Network Analysis: Two other projects were developed over the weekend outside of DataKind: the first was a tool for automated auditing of bids; the second was a social network analysis tool for understanding the relationships between suppliers.

The weekend in mid-March was just the start. The World Bank Group Open Finances team is currently working on the following steps for the program:

• Holding an online competition to address an operational question designed to improve the delivery of Bank projects;


• Considering a partnership with DataKind to:

o Work with DataKind’s vetted DataCorps consultants on larger Bank efforts with tangible deliverables;

o Engage the data science community to add innovation capacity and data science expertise to the Bank's ongoing efforts.

The DC community was invigorated by the DC Big Data Exploration; the Bank and DataKind will continue to engage the groups that participated to collaborate on these issues.

• Collaborating with partners inside and outside the Bank to create an analytics program based on big data techniques and tools like those that were used so successfully during the DataDive weekend.

Finally, it is worth noting that the promise of big data comes with numerous challenges, especially those related to privacy, data quality, attribution, and legal frameworks. The findings from the DataDive are provisional; a number of methodological issues still need to be addressed, e.g. sample size, selection bias, and validity of sources. The promise, however, may outweigh the perils, and the Bank and its partners need to build quickly on the momentum achieved over the DataDive weekend.


DETAILED PROJECT REPORTS

PREDICTING SMALL-SCALE POVERTY MEASURES FROM NIGHT ILLUMINATION

Background and Problem Statement
Poverty data collection is expensive and slow. To complement it, it would be helpful to find cheaper and more frequent, if potentially less accurate, ways to measure poverty. The team looked for patterns that could help build leading indicators of poverty by comparing existing national poverty maps to other geospatial indicators to see if they were correlated or, even better, predictive. Nighttime illumination recorded from satellite imagery was used as the indicator for this project. The team sought to identify whether light levels correlated with poverty levels and, more importantly, whether changes in light intensity could predict changes in poverty level. If so, then light maps could potentially be used as proxies for poverty data.

Datasets Available

• The 2001 and 2005 poverty levels of Bangladesh at every Upazila (county administration level), of which there are about 500.

• Average nighttime illumination levels for every year from 2001 to 2005 from NOAA.

Both datasets were available as raw GIS data. The team also built accessible CSV files containing histograms of light per region that could be used for statistical analysis. See the Additional Resources section for other datasets the team collected.

Key Findings

• Using regression models, the team found that lighting and census data from 2001 was predictive of poverty levels that year. This finding holds promise for being able to predict poverty levels using satellite imagery of nighttime lighting.

• The team did not find similar results for 2005 data, nor for predicting the change in poverty between 2001 and 2005. No statistically significant relationships were found over the weekend.

• Further exploration should be done of the 2005 data before ruling out any predictive ability from lighting maps.

• Geospatial poverty data can be combined with other geospatial data and image-based data (such as remote imagery) to explore the relationships between different variables. The team wrote code to aid in extracting shapefile (.shp) regions from TIFF files (.tiff) as well as interactive maps for researchers to use.

• Census data combined with light intensity data does a slightly better job at predicting poverty levels than using light data alone.
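The per-region extraction step mentioned above can be sketched as follows. This is an illustrative reconstruction (not the team's actual code): a tiny hand-built array stands in for a real nighttime-lights GeoTIFF, and a boolean mask stands in for a shapefile region.

```python
import numpy as np

def region_histogram(raster, region_mask, n_levels=256):
    """Histogram of light-intensity values (0..n_levels-1) for the pixels
    belonging to one region of a nighttime-lights raster."""
    pixels = raster[region_mask]                              # 1-D array of the region's pixels
    counts, _ = np.histogram(pixels, bins=np.arange(n_levels + 1))
    return counts

# Toy 4x4 "satellite image" and a mask selecting one region's pixels.
raster = np.array([[0, 0, 10, 10],
                   [0, 5, 10, 20],
                   [5, 5, 20, 20],
                   [5, 0, 20, 255]])
mask = raster >= 10   # pretend the bright pixels form our region of interest
hist = region_histogram(raster, mask)
print(hist[10], hist[20], hist[255])   # 3 4 1
```

Per-region histograms like this are exactly the CSV rows the team produced for the statistical analysis that follows.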

Methods and Analysis
The team took two approaches to exploring the relationship of light intensity to poverty levels.


The first involved creating interactive maps that would allow researchers to overlay geographic data with existing poverty maps. With these maps, researchers could visually understand which indicators correlate with poverty. The second approach entailed building statistical models of the relationships between light intensity and poverty levels.

Interactive Poverty Maps
The team collected poverty maps from Bangladesh from 2001 to 2005 as well as maps of the change in poverty, light intensity, literacy, and total population. After converting all of the data into a common format (not a simple task, as some of the data was graphical and other data was in geo format), the team created three major interactive visualizations, available at this site:

1. Descriptive information: This interactive displays descriptive information about each region of Bangladesh for each of the datasets described above. For example, one can view the change in poverty from 2001 to 2005 by region on the map. Figure 1 shows an example of these descriptive maps.

2. Overlaid information: This interactive allows the user to overlay various indicators on top of illumination data from satellite imagery. For example, one could compare light intensity to a map of access to electricity or to the change in poverty levels between 2001 and 2005. Figure 2 shows an example of an overlaid map.

3. Timeline maps: The final interactive shows illumination changes for every year from 2001 to 2005. By navigating across this timeline, one can see where regions have increased or decreased in illumination levels over time. Figure 3 shows a screen shot of the Timeline maps tool.

With these tools, researchers can visually inspect the relationships between poverty and other geospatial datasets. The code used here could be adapted to take in other geospatial or raster image data to add to the tool.

Figure 1: A descriptive map highlighting change in poverty in Bangladesh, with the upazila "Dharmapasha" selected.


Statistical Analysis of Light Intensity Levels and Poverty Data

Using the standardized data from both poverty maps of Bangladesh and satellite imagery of nighttime illumination, the team set out to answer the following questions:

• Is poverty correlated with light intensity? • Are changes in poverty correlated with changes in light intensity?

Figure 2: An overlaid map of poverty in 2001 and light intensity.


To explore the relationship between poverty levels and light intensity, the team first needed to get a measure of light intensity at the same scale as the poverty data. Because the poverty data is on a regional level, the team created an "average light intensity" measure for each region. This measure was computed for each region by first getting an "average" intensity score using the equation:

    score = Σ_{i=1}^{255} (% of pixels at intensity i) · i

Once this score was computed for each region, a value was assigned to the region based on the percentage of other regions that had a lower score during the same year. For example, if a region's average lighting was higher than that of 80% of the other regions in 2001, it would be assigned an "average light intensity" of 80 for 2001.
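In code, the score and percentile assignment described above might look like the sketch below. The three-region histograms are toy stand-ins for the roughly 500 upazila histograms the team built, not actual data.

```python
def light_score(histogram):
    """Average light intensity for a region: sum over intensities i of
    (share of pixels at intensity i) * i, per the equation above.
    (The i = 0 term contributes nothing, so enumerating from 0 is harmless.)"""
    total = sum(histogram)
    return sum(count / total * i for i, count in enumerate(histogram))

def percentile_rank(score, all_scores):
    """Share (0-100) of regions with a strictly lower score the same year."""
    lower = sum(1 for s in all_scores if s < score)
    return 100.0 * lower / len(all_scores)

# Toy histograms (index = intensity 0..255, value = pixel count).
regions = {
    "A": [90, 0, 10] + [0] * 253,   # mostly dark
    "B": [50, 0, 50] + [0] * 253,   # half the pixels at intensity 2
    "C": [10, 0, 90] + [0] * 253,   # mostly lit
}
scores = {name: light_score(h) for name, h in regions.items()}
ranks = {name: percentile_rank(s, list(scores.values())) for name, s in scores.items()}
print(ranks)   # A: 0.0, B: ~33.3, C: ~66.7
```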

Figure 3: An interactive timeline map of light intensity from 2001 to 2005.


Looking just at the plot of light intensity vs. poverty in 2001 for each upazila in Figure 4, we can see a strong correlation between light intensity and poverty levels. The trend is less pronounced in the average light intensity plot for 2005, but still apparent. The team next built a number of models in an attempt to predict actual poverty levels from light intensity and census data.

Figure 4: Poverty vs. light intensity in 2001 and poverty vs. average light intensity in 2005.


Poverty vs. Light Intensity in 2001
The first model the team built was a linear regression predicting poverty level from light intensity alone using 2001 data. Figure 5 (top) shows a plot of the predicted poverty levels in 2001 against the actual poverty levels using light data alone. All hyperparameters of the model were selected via cross-validation on 80% of the data; the model was then fit to the same 80% and used to predict the remaining 20%. The team set the regularization parameter alpha to 10.0 for these computations.

The Root Mean Squared Error (RMSE) for the model using only an intercept term and the 2001 average light intensity is 0.076982. The RMSE is a measure of the model's accuracy; lower values indicate better-performing models. This low RMSE is indicative of a good fit, making this model a promising sign that 2001 light intensities predicted 2001 poverty levels. Figure 5 (top) shows actual poverty levels vs. the poverty levels predicted by this model. The team also included features from the 2001 census to see if the fit improved. In this case, the RMSE on the test data was 0.067650. This slightly lower score means that census data combined with light intensity data does a slightly better job of predicting poverty levels than light data alone. Figure 5 (bottom) shows the predicted vs. true poverty levels using this model.

Figure 5 (top and bottom): Predicted poverty vs. true poverty in 2001 for the models fit using light intensity alone (top: RMSE = 0.076982) and light intensity plus census data from 2001 (bottom: RMSE = 0.067650).

The team also looked at how well these features (lights in 2001, lights in 2005, and poverty in 2001), with and without the 2001 census data, could predict the change in poverty between 2001 and 2005. The RMSE was 0.129603 for the model without 2001 census data (Figure 9) and 0.143962 for the model with 2001 census data (Figure 8). Based on the results from these models, the team concluded that the combination of lighting and census data from 2001 was predictive of poverty that same year.
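The report does not include the modeling code itself. A minimal sketch of this kind of regularized regression, assuming a ridge-style penalty with alpha = 10.0 (the report does not name the exact estimator), an 80/20 split, and RMSE on the held-out 20%, might look like the following. The data is synthetic stand-in data, not the Bangladesh datasets.

```python
import numpy as np

def ridge_fit(X, y, alpha=10.0):
    """Closed-form ridge regression with an intercept term
    (a stand-in for the team's model; alpha matches the report)."""
    Xb = np.column_stack([np.ones(len(X)), X])     # prepend intercept column
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0                            # do not penalize the intercept
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)

def rmse(w, X, y):
    """Root mean squared error of the fitted weights on (X, y)."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return float(np.sqrt(np.mean((Xb @ w - y) ** 2)))

# Synthetic stand-in data: poverty rate loosely decreasing with light score.
rng = np.random.default_rng(0)
light = rng.uniform(0, 100, 500)
poverty = 0.6 - 0.004 * light + rng.normal(0, 0.05, 500)

fit_idx = slice(0, 400)        # 80% used for fitting
held_idx = slice(400, 500)     # 20% held out for evaluation
w = ridge_fit(light[fit_idx, None], poverty[fit_idx])
print(round(rmse(w, light[held_idx, None], poverty[held_idx]), 3))
```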

Figure 6: Predicted poverty vs. true poverty in 2005 for the model fit using light intensity alone. (RMSE = 0.129964)


Figure 7: Predicted poverty vs. true poverty in 2005 for the model fit using light intensity data from 2005 and census data from 2001. (RMSE = 0.144068)

Figure 8: Change in poverty from 2001 to 2005 predicted against the actual change in poverty, using lights in 2001, lights in 2005, poverty in 2001, and census data in 2001. (RMSE = 0.143962)


Recommendations and Next Steps
The World Bank team should re-run these analyses using the most recent and detailed poverty maps available, incorporating any additional geospatial indicators. Based on the preliminary results and findings of the weekend, there are two promising areas for further exploration:

• The World Bank can reproduce the analysis prototyped at the event using the most recent and detailed poverty data available in different countries. Testing the approach on richer poverty maps, as well as using supplemental geographic and census data could help uncover a deeper connection between light and poverty.

• The second area of research could be to identify additional geospatial data sources to incorporate into future versions. Initial sources identified, like available roads and night illumination, could contribute to a leading proxy for poverty.

Follow-on projects could add more granular data and new modeling techniques to this prototype. Additional geospatial data, along with traditional indicators, could strengthen the observed correlations.

Figure 9: Change in poverty from 2001 to 2005 predicted against the actual change in poverty, using lights in 2001, lights in 2005, and poverty in 2001. (RMSE = 0.129603)


By identifying new sources, the World Bank has an opportunity to build on this and other geospatial-data research efforts to provide timely and granular poverty measurements.

ADDITIONAL RESOURCES
Available on the team's Hackpad:

• A 2006 NOAA paper on using night lights and satellites to measure national poverty levels, which cites and addresses the measurement challenges the World Bank faces in particular: http://www.ngdc.noaa.gov/dmsp/pubs/Poverty_index_20061227_a.pdf
• Additional geospatial data and external resources (a detailed list).
• UN studies on parking lot density and cellphone coverage predicting poverty levels:
  o Parking Lot Study
  o Cellphone Coverage Study
• Code to convert GeoTIFFs to shapefiles
• Two academic papers created by the team:
  o Luminosity
  o Lighting Up Poverty
• Interactive Visualizations


SCRAPING WEBSITES TO COLLECT CONSUMPTION AND PRICE DATA

Background and Problem Statement
Collecting detailed food price data is not just important for poverty monitoring but also critical for the economic management of a country. In 2009, Kenya had an official inflation rate of 25% but a lending rate below 20%. If those numbers were accurate, banks lending below the rate of inflation would have gone bankrupt; banks thus had to "guess" inflation levels to set their interest rates. What's more, existing food price data often takes a huge amount of time to collect and exists only at the national level; differences at the regional level are difficult to come by. In most developing countries, food makes up the largest share of inflation (often up to half). This project team, dubbed "Team Ndizi," decided to supplement the World Bank's current food price data by scraping new, more real-time food price data from other sources. Their goal was to identify whether:

• Other data sources could be scraped to create more real-time estimates of food prices and therefore estimate inflation rates and poverty.

• There existed data sources that captured price data at regional and sub-national levels so that prices could be compared across the country.

Datasets Available
The team did not begin with any data but instead scraped a multitude of websites to create detailed datasets of price information.

Key Findings
Team Ndizi demonstrated how useful scraping can be for the World Bank, collecting food prices from a number of helpful sources. As with all other projects from the DC Big Data Exploration, the findings should be considered provisional, as a number of methodological issues still need to be addressed, e.g. sample size, selection bias, and validity of sources.

• Global food prices could be scraped from humuch.com.
• Daily crop prices in Kenya could be scraped from mobile price providers like mFarm going back 1,000 days. This could be extended to other countries.
• Prices for multiple common goods could be scraped from grocery store sites to create a healthy "food basket." The team found prices at the national and sub-national level and verified them against cost-of-living sites.
• The team used the Wayback Machine to scrape historical data, and demonstrated that pulling historical rice prices showed evidence of the Indonesian rice crisis before global food prices did.


Methods and Analysis
Team Ndizi identified dozens of websites to be scraped and dozens of sources for official price data. They proved that it was feasible to scrape global and regional data from existing sources that the World Bank could then use to track price changes and monitor inflation.

Scraping Global Banana Prices
Team Ndizi first turned their sights on http://www.humuch.com, a global price repository. While the site allows users to look up specific commodity prices and map them, the data is not in a raw machine-readable format. Team Ndizi trained four of its members to scrape banana prices from the site and converted the data to a machine-readable format. Figure 11 shows a comparison of banana prices by continent as a proof of concept.
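A scraper of this kind boils down to fetching a page and parsing its price table into machine-readable rows. The sketch below, using only Python's standard library, parses a made-up two-column table; humuch.com's real markup would need its own parsing rules, and a `urllib.request` call would supply the live HTML.

```python
from html.parser import HTMLParser

class PriceTableParser(HTMLParser):
    """Pull (item, price) pairs out of a simple two-column HTML table.
    The markup fed in below is invented for illustration."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
        elif tag == "tr":
            self.cells = []          # start collecting a fresh row

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr" and len(self.cells) == 2:
            item, price = self.cells
            self.rows.append((item, float(price.lstrip("$"))))

    def handle_data(self, data):
        if self.in_cell:
            self.cells.append(data.strip())

# Hypothetical snippet standing in for a fetched page.
html = """
<table>
  <tr><td>Bananas (1 kg)</td><td>$1.20</td></tr>
  <tr><td>Rice (1 kg)</td><td>$0.95</td></tr>
</table>
"""
parser = PriceTableParser()
parser.feed(html)
print(parser.rows)   # [('Bananas (1 kg)', 1.2), ('Rice (1 kg)', 0.95)]
```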

Figure 11: A bar graph of banana prices by continent, created using data scraped from humuch.com.

Figure 12: A line plot of 2012 monthly average dry maize prices in five different regions of Kenya, created from data the team scraped from mFarm.


Scraping Farm Data in Kenya
The team next looked to the site mFarm, which forwards crop price information to farmers in Kenya via cellphone so they can make more informed decisions. By harvesting this data "exhaust" from the site, the team gleaned monthly price data for a range of crops across regions of Kenya. This data could be used to follow sudden changes in food prices or as a comparison to prices in other parts of the world. Figure 12 shows a graph of dry maize prices for five regions in Kenya across all of 2012. The team also began working on interactive line graphs and maps of this data so that the World Bank could easily access it in a readable format.

Scraping Grocery Prices from South Africa
The team also scraped prices of a wide range of goods from Pick 'n Pay's grocery store websites in South Africa. These sites provide food prices for all of their products but, again, the data is locked in the website and not available for download and analysis. Team Ndizi freed this data using web scraping, focusing on 11 essential food types that contribute to a balanced daily diet.

With this data, the team could price a typical food basket for someone surviving on a 2,000-calorie diet. They compared these prices across different countries and looked at breakdowns of how much each good contributed to the total cost of the basket. Figure 13 shows the daily cost per person for a balanced 2,000-calorie diet across four African countries and the US. Figure 14 shows the cost of a balanced 2,000-calorie diet in South Africa and what proportion of that cost is attributable to each good. The team also replicated this data for a few other countries in Africa, as well as sub-regions within each country, as shown in Figure 15. These results are significant: they show that, with regular scraping, the World Bank can create real-time measures of food basket prices around the world, even at the sub-national level.
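The basket-pricing arithmetic is straightforward: for each item, compute the kilograms needed to supply its share of a 2,000-calorie day, then multiply by its scraped price per kilogram. The item list, prices, calorie densities, and shares below are illustrative placeholders, not the team's scraped values.

```python
# Hypothetical per-item data: scraped price per kg (USD), calories per kg,
# and each item's share of a balanced 2,000-calorie day.
basket = {
    # item:        (price_per_kg, kcal_per_kg, share_of_daily_calories)
    "maize meal":  (0.80, 3600, 0.40),
    "beans":       (1.50, 3400, 0.20),
    "vegetables":  (1.10,  400, 0.15),
    "cooking oil": (2.40, 8800, 0.15),
    "milk":        (0.90,  600, 0.10),
}

DAILY_KCAL = 2000

def daily_basket_cost(basket, daily_kcal=DAILY_KCAL):
    """Cost of meeting the daily calorie target with the given item shares."""
    cost = 0.0
    for price_per_kg, kcal_per_kg, share in basket.values():
        kg_needed = share * daily_kcal / kcal_per_kg   # kg to cover this item's calorie share
        cost += kg_needed * price_per_kg
    return cost

print(f"${daily_basket_cost(basket):.2f} per person per day")
```

The same function, fed per-country (or per-region) scraped prices, yields the cross-country comparisons shown in the figures.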

Figure 13: The daily cost per person for a balanced 2,000-calorie diet across five countries (four African, one US). The prices of 11 staple foods are included and visualized proportionally.


Perhaps most interestingly, Team Ndizi also validated their price scraping by comparing their estimates against known cost-of-living calculators on the Web. Figure 15 shows Team Ndizi's scraped prices for 11 basic food items alongside the prices reported by three well-known cost-of-living websites: Numbeo, Xpatulator, and Expatistan. The plot shows two compelling findings: first, the prices Team Ndizi found by scraping Pick 'n Pay grocery store websites in South Africa are never more than a few cents away from the average estimate from the cost-of-living sites; second, only Numbeo actually measures all of these items, while each of the other sites lacked at least one product that Ndizi was able to measure. This data could be used in a number of ways: prices could be tracked over time, or the cost of living could be computed using different food products. A full view of food prices across a country (depending on the coverage of the grocery websites) could be provided to the World Bank.

Figure 14: Proportional breakdown of costs for a healthy, balanced diet in South Africa.


Indonesian Rice Crisis Tracking
Team Ndizi's final project looked at rice prices in Indonesia over time. The data was scraped from Carrefour Indonesia, a popular supermarket chain. The team also used the Wayback Machine to go back to historical versions of the website to collect data. Some experimentation was done with pulling prices from Twitter data as well, but there was not enough time to create a full-fledged "universal" scraper covering all sources.
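Scraping historical pages generally means querying the Wayback Machine's availability API (archive.org/wayback/available) for the snapshot closest to a target date, then downloading the archived URL it returns. The sketch below parses a response of that shape; the Carrefour snapshot URL and timestamp are invented for illustration, and a live run would fetch the JSON over HTTP first.

```python
import json

def snapshot_url(availability_json):
    """Extract (archived URL, timestamp) from a Wayback Machine
    'availability' API response, or None if no snapshot exists."""
    snap = availability_json.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"], snap["timestamp"]
    return None

# A response of the shape the availability API returns; the URL and
# timestamp here are made up for illustration.
sample = json.loads("""{
  "archived_snapshots": {
    "closest": {
      "available": true,
      "status": "200",
      "timestamp": "20120115000000",
      "url": "http://web.archive.org/web/20120115000000/http://www.carrefour.co.id/"
    }
  }
}""")

url, ts = snapshot_url(sample)
print(ts[:8])   # date of the closest snapshot: 20120115
```

Iterating this over a grid of monthly timestamps, then feeding each archived page through the same price parser used for live pages, reconstructs a historical price series.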

Figure 15: Food prices for 11 items that Team Ndizi scraped (orange) compared to prices from known cost-of-living websites.

Figure 16: World food prices, as reported by major monitoring agencies (green and yellow), vs. prices of two brands of rice in Indonesia (blue and red).


What was most interesting about the Indonesian price-scraping project was that it proved the importance of more real-time food data in anticipating and managing crises. The team scraped the price of rice per kg for two different brands (Si Pulen Crystal and Topi Koki Setra Ramos) in Indonesia from January 2010 to March 2013. Figure 16 shows these prices plotted over time against world food prices and the Food and Agriculture Organization's (FAO) data. What is striking from this plot is that rice prices in Indonesia rose roughly US$1 per kg above world food prices, suggesting that early signs of the rice crisis could have been seen in this data but not in global aggregates. Moreover, the team's scraped data extends beyond the estimates the FAO was able to provide: FAO data stopped in October 2012.

Recommendations and Next Steps
Team Ndizi showed how easily food price data can be scraped and collected and has provided code and data to replicate their work. Areas for future work and more careful analysis include:

• Using web-scraping techniques, as prototyped at the event, to create granular and near real-time measures of food prices at sub-national levels;

• Further examining the difference between scraped and official data as a tool for filling in gaps where current measures of price data are aggregated or infrequent. The Indonesian rice price exploration from the DataDive may serve as an example of the useful perspective that could be gained from granular, local, and near real-time data;

• More robust comparisons and correlations of the scraped price data against known economic metrics and historical data. During the event, the team conducted some basic validation of the scraped prices by comparing their estimates against known cost-of-living calculators across the Web. A deeper dive would help the Bank determine how useful this granular view of price data can be;

• If the methods outlined prove to be useful, a more complete ‘universal’ scraper could be built to automate the sampling of price data from sources around the Web.

Additional Resources The team’s HackPad contains an in-depth list of examples of related projects, the datasets they collected, and the code needed to recreate their work.


LATIN AMERICA POVERTY ANALYSIS FROM MOBILE SURVEYS

Problem Statement
The World Bank’s poverty team is interested in finding new ways to measure welfare in Latin American countries. Two pilot surveys of household well-being, called Listening to Latin America (L2L), were conducted using cell phones. In this project, teams used the L2L data from the Peru survey to address the following question:

• Is it possible to draw inferences about changes in welfare at the national level using data collected with cellphones, or a combination of this data and the national household survey (ENAHO) data?

Datasets Available

• L2L Survey Data (SPSS format; also CSV, currently)
• Peru: Initial F2F Questionnaire
• Honduras F2F Questionnaire*
• Peru: Mobile Questionnaire
• Honduras Mobile Questionnaire*
• L2L Final Report
• ENAHO data

*This data was not analyzed at the event.

Key Findings

• More “Yes” answers were given to personal and negative questions when follow-up surveys were delivered by SMS or phone call than in person. This may indicate that people answer more honestly through impersonal channels than face-to-face. Mobile data collection may therefore be more accurate, in some contexts, than face-to-face surveys.

• Monetary incentives did not appear to influence response rates, regardless of the technologies involved.

• The data from these surveys provides very rich detail about the Peruvian people and could be used at a broad level to learn more about socioeconomic conditions in this country.

As with all other projects from the DC Big Data Exploration, the findings should be considered provisional, as there are a number of methodological issues that still need to be addressed, e.g. sample size, selection bias and validity of sources.

Methods and Analysis

Initial Exploration of the Survey Data
The team first dove into understanding the questions asked in each survey before looking at the basic statistics of Peruvian respondents in aggregate. The data on the Peruvian people was so rich that it could (and perhaps should) be the subject of a research project in its own right. From just a quick look, the team found the following:

• Family sizes of approximately three to six members;
• In about half of the cases, neither of the respondent's parents was educated;
• The number of hours of work each respondent reported was normally distributed around a mean of 40 hours with a standard deviation of about 15 hours;
• Most respondents worked in farming and related fields, with small business and housework being next most popular; the remaining occupations comprised the long tail of the data;
• 40% of respondents have soil floors;
• 8% of respondents had someone in the household lose their job in the last month;
• 8% of respondents had someone in their household find a new job in the last month;
• 7% of respondents had someone in their household miss school for lack of money in the last month;
• 13% of respondents had someone miss school due to sickness in the last month;
• 11% of respondents had someone robbed in their household in the last month;
• 9% of respondents had moved in the last three years.

These findings begin to paint a picture of socioeconomic conditions in Peru. They may already be known to the World Bank Poverty team, but the DataKind team felt it was worth raising these figures in case any were new or surprising. In either case, it is quite easy to repeat this analysis for future survey data.

Response Rate Analysis
The team’s next analysis was designed to confirm the results of the published study, which focused primarily on response rates in the follow-up studies, and to analyze which factors of the survey correlated with higher and lower response rates. The team first looked at response rates regardless of follow-up technology. Of the 1,600 people surveyed, about 1,000 didn't respond to any follow-ups, while the other 600 responded to about half of the follow-ups. Given that households had agreed to participate in the six-month survey beforehand, the response rate seemed particularly low to the team. The team wondered whether phone network connectivity was playing a role here and recommends the World Bank explore rural versus urban response rates.

Effects of Survey Technology
The team next looked at the effects of follow-up technology on response rates. The overall goal of this study was to evaluate the effectiveness of mobile-phone-based surveying. Answering this question could have far-reaching implications for the Bank’s ability to collect accurate information with the global reach and ease of mobile technology. The data the team was working with contained responses to an initial survey and six follow-up surveys. The first survey was conducted in the traditional way: face-to-face and in person. The subsequent surveys were performed using one of three methods: human telephone conversation (CATI), pre-recorded voice mobile phone interview (IVR), and text-based (SMS). The breakdown of response rates by technology is shown in the table below:

Method   Response Rate
CATI     50%
SMS      30%
IVR      25%
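Rates like these are straightforward to recompute from raw follow-up records. A minimal sketch follows; the (respondent, method, responded) record layout is hypothetical, not the survey’s actual schema:

```python
from collections import defaultdict

def response_rates(records):
    """Compute the response rate per follow-up method from
    (respondent_id, method, responded) records."""
    attempts = defaultdict(int)
    responses = defaultdict(int)
    for _, method, responded in records:
        attempts[method] += 1
        if responded:
            responses[method] += 1
    return {m: responses[m] / attempts[m] for m in attempts}

records = [
    (1, "CATI", True), (2, "CATI", False),
    (3, "SMS", True), (4, "SMS", False), (5, "SMS", False),
    (6, "IVR", True), (7, "IVR", False), (8, "IVR", False), (9, "IVR", False),
]
print(response_rates(records))  # CATI 0.50, SMS ~0.33, IVR 0.25
```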

Here again we see fairly low response rates. The aforementioned question of cellphone reliability would be important to address. Interestingly, person-to-person interviews yielded the highest response rates, which may be due to a feeling of responsibility toward another person that respondents do not feel when ignoring a pre-recorded phone call or an SMS.

Incentives for Responses

The team also explored whether monetary incentives affected response rates (see Figure 17). The incentives, in this case, seemed to have little or no impact: the difference between no incentive and a small incentive showed no major increase, and raising the incentive from $1 to $5 appeared to have no effect on the response rate.

Figure 17: Response rates over six months broken out by technology type and monetary

incentive offered (none, $1, $5).


The lack of incentive motivation may be particular to this survey, so it is recommended that the same analysis be performed on the Honduras data to see if the Peru findings are repeated. Moreover, a more rigorous analysis should be performed to determine whether there are truly no statistically significant increases in response based on incentives; a larger dataset may be needed to answer this question confidently. While the financial incentives did not appear to increase response rates on average, the team also thought it would be interesting to explore whether responses to the questions about financial hardship affect the impact of the monetary incentive.

Effects of CATI vs. Technology-Intermediated Follow-ups
The team next looked to see if the presence of humans in the follow-up process (CATI) affected responses in ways that purely technological follow-ups (SMS) didn’t. To address this question, the team looked at the ratio of “Yes” to “No” answers to very personal questions about negative outcomes, i.e. “did something bad happen to you in the last X months?” In CATI follow-up surveys, the Yes:No ratio varies between 1:7 and 1:16, indicating that a fairly low proportion of people report bad incidents happening to them (5.9% to 12.5% “Yes” responses). When looking at responses to the identical questions on text-based mobile surveys, however, the Yes:No ratio rises to between 1:3 and 1:6 (14% to 25% “Yes” responses). In other words, about twice as many people report something bad happening to them via an impersonal SMS follow-up survey than report it in a CATI follow-up. There are many theories as to why this could be happening, not least of which is that the group surveyed via mobile may in fact have had more bad things happen to them. The experiment should be repeated with other respondents to see if the trend is observed again. Other theories include:

• Selection bias: people who have had a bad event happen to them are more likely to respond to the surveys in the first place.

• Shame/Privacy concern: People under-report bad events happening to them in a face-to-face interview. There is precedent for this behavior: a body of literature on physician illness shows serious under-reporting of depression in in-person interviews vs. anonymous interviews (~1% vs. 15% prevalence).

Researchers should begin by exploring not just the ratio of answers but also the response rates for different types of questions, to determine whether selection bias is at work. The team was interested in knowing whether respondents who had something bad happen to them were more likely to respond because they had something to report. It would also be interesting to repeat this study to see if the same effect is observed for questions considered “neutral” or “good”, e.g. “did you find a job recently?” A related question is whether certain question topics (e.g. education, finance, household) were affected by the type of technology used in the surveys.
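For reference, an a:b answer ratio converts to an answer share as a / (a + b), which is how the percentages quoted above follow from the ratios:

```python
def ratio_share(a, b):
    """Share of responses held by the first answer in an a:b ratio."""
    return a / (a + b)

# A 1:7 ratio implies the first answer makes up 12.5% of responses;
# a 1:3 ratio implies 25%.
print(ratio_share(1, 7), ratio_share(1, 3))  # 0.125 0.25
```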


The team explored these ideas graphically as well. Figure 18 shows the percentage of “Yes” answers to each of seven questions in the Peru Mobile Questionnaire, where each line represents the responses for the face-to-face, first follow-up, and second follow-up phases of the interviews. From this figure, we see a general upward trend on almost all questions over time; the percentage of “Yes” answers clearly increases from the face-to-face interview to follow-up 1. The team wondered what could account for such a pronounced increase in “Yes” answers during the follow-up sessions, and dug in to see whether talking to a human correlates with under-reporting. The team compared the percentage of “Yes” answers given to a very personal and negative question – whether the respondent had been robbed in the last month – across respondents that received follow-ups via SMS, IVR, and CATI.

Figure 18: Percentage of “Yes” answers at each stage of the process for seven different questions.

Figure 19: Percentage of “Yes” responses given at each stage of the interview process, broken out by technology.

The results in Figure 19, which graphs the percentage of “Yes” responses given during the face-to-face, first follow-up, and second follow-up broken down by technology, could indicate that the more impersonal the mechanism, the higher the reported incidence of robbery. We can see that the IVR recipients reported the highest rate of robbery during their first follow-up, while they seemed about equal to the SMS group during the face-to-face and second follow-up. What is striking is that the CATI (human-voice interview) responses are consistently lower than those of the other two technologies. Again, these results could be due to legitimate differences in the groups (e.g. the SMS group, by chance, really did get robbed more often than the CATI group).

Recommendations and Next Steps
As the data is so rich, the team felt it would be useful to analyze differences in responses to questions broken out by demographic groups, e.g. wealth, geographic location, educational attainment, age, etc. Presuming that the target population is the marginalized and rural poor, such analyses could shed light on ways to modify future mobile surveys to target this population. Some ideas for deeper analysis of the L2L survey:

• On monetary incentives: the Bank could do a deeper dive into the data to determine whether or not there is a relationship between the financial questions of the survey and the impact of the monetary incentive as well as retention in the survey.

• On the use of mobile for follow-up surveys: the team questioned whether wireless connectivity could play a role in the response rate of the follow-up surveys, especially where respondents live in rural areas. The Bank should also consider exploring more factors that may impact survey response rates.

• The analyses in this report should be repeated on the Honduras data and compared to Peru to see if similar trends emerge.

• Lastly, the World Bank should conduct a similar survey in Peru to see if the results can be duplicated. There could be good evidence that mobile survey responses are more reliable and real-time than in-person surveys. The implications of this finding could mean that the World Bank would not only have access to more people around the world through mobile surveys than they could reasonably support in-person, but also that the results would be more accurate and complete than are provided by in-person surveys.

Additional Resources
• Charles F. Turner's work on mode effects in the collection of data regarding sensitive or risky behavior; in particular, T-ACASI Reduces Bias in STD Measurements: The National STD and Behavior Measurement Experiment
• Mick Cooper’s work on mode comparison
• Dr. Edith de Leeuw's research regarding mode comparison
• Eleanor Singer's comprehensive article on the impact of incentives on response rates in household surveys
• CPS, a longitudinal survey that uses mixed methods, which might be useful for survey methodology
• R code to generate response rates by incentive and technology


MEASURING SOCIOECONOMIC INDICATORS IN ARABIC TWEETS

Background and Problem Statement
Conventional poverty measures are time-consuming and expensive to collect. The World Bank is interested in exploring alternative data sources for measuring poverty that are easier to collect and less expensive to update. It was hypothesized that monitoring conversations on Twitter might shed light on socioeconomic conditions based on what people talk about and attributes of their conversations. The team's goal was to explore what information could be drawn from these tweets to inform future research. They explored questions such as:

• Do the frequencies of key socioeconomic keywords (e.g. “broke” or “need money”) change over time and, if so, do those changes reveal anything interesting?

• Can we learn about the social network structure of people tweeting to each other and does that teach us anything about their socioeconomic conditions?

• What can we learn about someone tweeting from just their text or other aspects of their tweets?

• Can we correlate any of the activity on Twitter with standard poverty indicators?

Datasets Available

• 25GB of Arabic tweets spanning a six-week period from November 2011 to January 2012. The Qatar Computing Research Institute (QCRI) delivered the tweets and described them as a nationally representative sample. The dataset was large enough that the team stored it in a database on Amazon Web Services and used samples of the data to study their questions.

• An English to Arabic translation of key socioeconomic terms.

Key Findings The World Bank could monitor tweets and other social media channels to potentially learn more about a range of socioeconomic indicators:

• The team found clear periodic cycles in features of the Twitter data. These could be correlated against existing poverty indicators;

• The team was able to identify a user’s location using only their message text and the times of day they tweeted;

• Gender can likely be detected from language patterns in text and could thus be used as input to socioeconomic modeling;

• It is possible to infer a measure of social connectedness from the network of tweets. This measure could be correlated with socioeconomic conditions.

As with all other projects from the DC Big Data Exploration, the findings should be considered provisional, as there are a number of methodological issues that still need to be addressed, e.g. sample size, selection bias and validity of sources.


Methods and Analysis

Keyword Usage
The team began their exploration by looking at the frequency of keywords over time. Using the English-to-Arabic translation, the team came up with three categories of terms to track: everyday item terms, economic terms, and positive sentiment terms.

“Everyday” item mentions:
Gasoline: نيزنبلورتب
Bread: جاصةنوبطةزبخزبخ شيع
Rice: زرأ زر
Meat: موحل ةمحل محل
Milk: نبلبيلح
Butter: ةدبز دبز
Beans: ايبولةيبوللوفايلوصاف
Cigarettes: ورراج رئاجس
Car/Auto: ةيبرع ةرايس

“Economic” mentions:
Price: ةنمثأنمث راعسا رعس
Money: دوقنلامسولف
Fees/Bills: ةرتكافريتاوف تاقفنفيراصم
Purchasing/Buying: ءارتشاءارشعفد
Credit/Loan: ةراعإفلسفيلستدصقيديركضرق قوست
Salary/Pension: بتارتابترمبترمشاعم
Work/Job: لغشلمع
Rain: اتشةيرطمتاطقاستراطمأرطم
Day(s), Week(s), Month(s): روهشنيرهشرهشعيباسانيعوبساعوبسأمايأنيمويموي


“Positive sentiment” words ليمج ةفحت ةولح ةعئار ♡ ♥ زاتمم بيط شوخ سيوك ديعس ةطوسبم طوسبم

The team wrote code to count the occurrences of phrases in each category to see if they changed significantly over time. Figure 20 shows these frequencies. While there may have been a slight increase in the total number of mentions over time, no significant trend was seen, nor could the team identify sudden increases or decreases of phrases in this graph. With another poverty indicator, it could be possible to find correlations between the two datasets. Next, the team turned to ways of controlling for different variables in the data once an indicator is determined for future projects. If they could write code to extract features of the tweets, such as periodic trends in tweet frequency or the gender of a user, those features could help World Bank researchers correlate the Twitter data with other poverty measures, or control for these variables in an experiment.
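The counting itself is simple once the term lists exist. A simplified sketch of the per-category counting follows; the keywords here are English placeholders standing in for the Arabic term lists above:

```python
from collections import Counter
from datetime import date

# Illustrative categories; the team's real lists were the Arabic terms above.
CATEGORIES = {
    "everyday": {"bread", "rice", "milk"},
    "economic": {"price", "salary", "loan"},
}

def daily_category_counts(tweets):
    """tweets: iterable of (day, text). Returns a Counter keyed by (day, category)."""
    counts = Counter()
    for day, text in tweets:
        words = set(text.lower().split())
        for category, terms in CATEGORIES.items():
            counts[(day, category)] += len(words & terms)
    return counts

tweets = [
    (date(2011, 12, 1), "the price of bread and rice is rising"),
    (date(2011, 12, 2), "got my salary today"),
]
print(daily_category_counts(tweets))
```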

Figure 20: Mentions of everyday terms (red), economic terms (green), and positive terms

(blue) over time.


Timing of Tweets

Figure 21 shows the frequency of tweets by day of the week, while Figure 22 shows the percentage of tweets by time of day. The team observed clear cyclic trends in when people tweet, and therefore should be able to account for time of day when performing a full experiment.

Figure 21: Number of tweets by the day of the week. 28% of tweets occur on the weekends.

The most tweets occur on Monday. All times are in Greenwich Mean Time.

Figure 22: Percentage of tweets by time of day. All times are in Greenwich Mean Time.


While the plot in Figure 23 covers the entire dataset, the same analysis could be applied to an individual user to infer their time zone. Knowing a user’s time zone would allow future researchers to account for time when running experiments. This data may also be available directly from Twitter.
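One simple heuristic for that per-user isolation — an assumption of ours, not the team's method — is to treat a user's quietest hour as roughly 4 a.m. local time and derive a UTC offset from it:

```python
from collections import Counter

def infer_utc_offset(tweet_hours_utc, local_quiet_hour=4):
    """Guess a user's UTC offset from the UTC hours of their tweets,
    assuming their quietest hour falls at about 04:00 local time."""
    counts = Counter({h: 0 for h in range(24)})
    counts.update(tweet_hours_utc)
    quietest_utc = min(range(24), key=lambda h: counts[h])
    return (local_quiet_hour - quietest_utc) % 24

# A user with no tweets at 01:00 UTC and steady activity otherwise
# would be inferred as UTC+3.
hours = [h for h in range(24) if h != 1 for _ in range(5)]
print(infer_utc_offset(hours))  # 3
```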

Country References
The team next looked to see if they could determine the origin location of each tweet based on mentions of countries. Figure 24 shows the number of times each major Arab country was mentioned with a hashtag. Bahrain was the most frequently mentioned country, followed by Syria. Many of the Bahrain tweets referred to the upcoming anniversary of the February 14 protests.

Identifying the location of a tweet is important because it could help researchers infer the socioeconomic conditions in that region. In producing Figure 24, the team also established methods for determining the origin country of a tweet using only the message text. While Twitter may provide this information automatically, other forms of social media may not. This code can be adapted to estimate the locations of messages so that researchers can account for regional effects when running experiments.
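A simplified illustration of this hashtag-counting approach follows; the hashtag-to-country lookup here is a made-up English example, whereas the actual tweets contained Arabic hashtags:

```python
from collections import Counter

# Illustrative lookup only; a real one would include Arabic hashtag variants.
COUNTRY_TAGS = {"#bahrain": "Bahrain", "#syria": "Syria", "#egypt": "Egypt"}

def country_mentions(tweets):
    """Count country mentions by matching tokens against a hashtag lookup."""
    counts = Counter()
    for text in tweets:
        for token in text.lower().split():
            country = COUNTRY_TAGS.get(token)
            if country:
                counts[country] += 1
    return counts

tweets = [
    "protests planned #bahrain #feb14",
    "news from #syria",
    "#bahrain update",
]
print(country_mentions(tweets))  # Counter({'Bahrain': 2, 'Syria': 1})
```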

Figure 23: Number of tweets by hour of day. All times are in Greenwich Mean Time.

Figure 24: Number of mentions of each country, computed by counting country hashtags.


Gender Identification
Knowing the gender of the author could be very beneficial to understanding one’s socioeconomic conditions. It could also be another variable to account for when running experiments. The team identified resources that could be used to infer the gender of the author of each tweet from the message text alone. Using the suffixes of words in Arabic may be a straightforward way of determining gender.

Social Connectedness
Research suggests that people who are more socially connected are more affluent than those who are not. To pursue this idea, the team explored the frequency of tweeting and the social connectedness of tweeters in the dataset.

The team first counted the number of times each person tweeted during the three-month window (Figure 25). Most people tweeted once or twice, with very few people tweeting more. Two accounts tweeted five and six times, one of which appears to be a news source. With these results, Bank experts could try to determine the socioeconomic status of the Twitter accounts involved to see if there is a correlation between tweet frequency and affluence.

Figure 25: Groups of tweeters joined by the number of times they tweeted in this three-month

period.


The team searched for every tweet in which a user “@” mentioned another user, and used these connections to build the social graph of all of the users. Each node represents a Twitter user, and each line indicates that one of the two users it connects mentioned the other in a tweet. Larger nodes have more connections. Colors indicate social groups, e.g. nodes connected with green lines have more friends in common with other nodes connected with green lines than with nodes connected with red lines. From this graph, we can see that there are a few large nodes, namely the large red one toward the upper left corner, which are highly socially connected. It would be interesting to look at these individual accounts to see whether their socioeconomic status can be determined, and whether the connectedness of nodes correlates with socioeconomic conditions.
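The graph construction can be sketched as follows — a minimal stand-in for the team's code (their actual implementation is on the project GitHub page):

```python
import re
from collections import defaultdict

MENTION_RE = re.compile(r"@(\w+)")

def mention_graph(tweets):
    """Build an undirected mention graph from (author, text) pairs."""
    neighbors = defaultdict(set)
    for author, text in tweets:
        for mentioned in MENTION_RE.findall(text):
            if mentioned != author:
                neighbors[author].add(mentioned)
                neighbors[mentioned].add(author)
    return neighbors

tweets = [
    ("alice", "hello @bob"),
    ("bob", "@carol did you see @alice today?"),
]
graph = mention_graph(tweets)
# Node degree = number of distinct connections per user.
print(sorted((user, len(links)) for user, links in graph.items()))
```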

Figure 26: A connectedness graph showing Twitter users who mentioned one another. Each

node represents a Twitter user and each line indicates that one of the two users it connects

mentioned the other in a tweet. Larger nodes have more connections.


Recommendations and Next Steps
None of the assumptions about the frequency of keywords or locations mentioned can be verified without a good indicator to measure against. World Bank experts should identify key indicators that can be matched against the trends in the data, or use their expertise to verify that patterns in the data track with some known qualitative measure of poverty:

• Subject matter experts in poverty should team with data scientists to help advise on the problem formulation beforehand as well as during the project;

• More detailed information about the tweets should be secured. We suspect the data we received was trimmed of GPS data, free-text locations, and more.

Additional Resources

• The team’s HackPad page • The team’s project page on GitHub • Final presentation • UNGP projects on mining tweets for unemployment and crisis-related issues:

o Study on monitoring crisis and stress (video) o Study on unemployment statistics

• Additional information about the Twitter dataset from Vienna Open Data Day
• QCRI permission to use data
• Male/female language differences, from Deborah Tannen
• Stopwords in social signaling
• Kate Niederhoffer
• Jamie Pennebaker's group at UT-Austin


ANALYZING WORLD BANK DATA FOR SIGNS OF FRAUD AND CORRUPTION

Detailed Problem Statement
The World Bank’s Fraud and Corruption team is faced with the weighty task of detecting individuals and companies that misuse or misappropriate funds on Bank-financed projects. Corruption can occur at almost any stage of the project pipeline, from design to bidding and final execution. It can be difficult for the Fraud and Corruption team to gain a full view of potential grievances because data about bidders, contractors, and contracts often lives in different datasets around the Bank and is not consolidated. Moreover, a large amount of data about bidders and projects exists outside the Bank, in areas such as project implementation units, that could be used to gain more insight into the bidders and contractors involved. The Fraud and Corruption team is often faced with the daunting task of tracking suspicious companies by hand. The team explored how to strengthen and scale the World Bank’s methods using new data and analytical tools, focusing on the following main tasks:

• Creating contractor profiles containing external corporate data such as location, chief personnel, date incorporated, etc. Consolidating this information would help identify undisclosed relationships between firms, and hopefully lead to a method to discover shell corporations.

• Consolidating existing World Bank datasets and producing datasets from unstructured sources within the Bank. Using this data, the team explored and built tools to highlight contractor behavior and activity, such as bidder relationships.

Datasets Available

• World Bank Project API
• OpenCorporates
• Major Contracts Awarded
• World Bank Project Pages
• Debarment Documents

For detailed lists of datasets used, see the Datasets sections of the team’s two HackPads: HackPad 1 and HackPad 2.

Key Findings

• Debarment data can be scraped to create a full list of all debarred companies, which can then be analyzed. The team created ranked lists of countries by number of debarments, based on debarment type, and changes in debarments over time;

• Using external corporation data, the team was able to measure relationships between “similar” suppliers. The team built network graphs that showed relationships between debarred and non-debarred firms that shared similar addresses, phone numbers, officers, or names;

• The team was able to scrape co-bidder information from the Web and used that data to build social networks of co-bidders. This code could be used by the Bank to identify suspicious activities between co-bidders;


• The team proved it was feasible to combine disparate Bank datasets into more unified supplier profiles. They wrapped this unified data into an API so that the Bank could have consolidated supplier information;

• The team analyzed project approval trends over time and found increases in the number of projects approved toward the end of each month and in the spring and early summer, specifically. The team did not draw any conclusions from this, but it could prove quite interesting to explore further.

As with all other projects from the DC Big Data Exploration, the findings should be considered provisional, as there are a number of methodological issues that still need to be addressed, e.g. sample size, selection bias and validity of sources.

Methods and Analysis
The major goal of the weekend project was to provide new datasets and algorithms that could automatically identify organizations, either bidders or contractors, as potential risks to the Bank. To this end, the team first created a number of datasets that they then analyzed for suspicious patterns.

Historical Debarment Data
The first task the team tackled was compiling a list of historically debarred firms. With this list, one could compare incoming bidders and contractors against debarred companies to see if they share suspicious similarities, e.g. the same address or phone number. The dataset was compiled with the help of the Wayback Machine, which allowed the team to see the Bank’s list of debarred firms over time. This approach proved that scraping the Web could be used to maintain a constantly updated list of debarred firms. The Bank, however, likely has this information internally; digitizing it could sidestep the need for this approach. Figure 27 shows the average time of debarment by country for firms that are not banned permanently, color-coded by whether countries are borrowing or non-borrowing. Greece tops the list for longest debarments and is a non-borrowing country.
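As an illustration of the kind of computation behind a figure like this, the average non-permanent debarment length per country could be derived from scraped records as follows; the (firm, country, start, end) layout is a hypothetical schema, with end=None marking permanent debarments:

```python
from collections import defaultdict
from datetime import date

def avg_debarment_days(records):
    """Average debarment length in days per country,
    skipping permanent debarments (end is None)."""
    durations = defaultdict(list)
    for _, country, start, end in records:
        if end is not None:
            durations[country].append((end - start).days)
    return {c: sum(d) / len(d) for c, d in durations.items()}

records = [
    ("Firm A", "Greece", date(2005, 1, 1), date(2010, 1, 1)),
    ("Firm B", "Greece", date(2006, 1, 1), date(2009, 1, 1)),
    ("Firm C", "Ireland", date(2007, 1, 1), None),  # permanent
]
print(avg_debarment_days(records))  # {'Greece': 1461.0}
```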


Figure 27: Ranked list of countries by number of firms, along with proportions of firms

permanently debarred.

Figure 28 shows countries ranked by percentage of permanently debarred firms. Here Ireland and the United Arab Emirates top the list, with 100% of debarments in these countries permanent. However, each has only a few debarred firms, so this is not entirely surprising. The UK, in contrast, has more debarred firms than either, but only two-thirds of the debarments against UK firms are permanent. These patterns may be interesting to investigate further.


Figure 28: Ranked list of countries by proportion of permanently debarred firms

vs. temporarily debarred.


The team also looked at the change in the average number of non-permanent debarments in countries before and after 2006. Figure 29 shows the changes in debarment rates as sloping lines, colored by whether they’re increasing or decreasing. From this graph, we can see Bangladesh, China, and the US increased the most between the two time periods, while the UK, Indonesia, and Sweden decreased the most. Figures like these might be interesting to the Bank team in understanding where concentrations of debarments are changing over time. These may be indicative of areas that are becoming more corrupt or that are improving over time.

Figure 29: Changes in number of debarments by country before and after 2006. Each line connects a country’s debarment numbers between the two time periods. Red lines indicate decreases, green lines indicate increases.


Lastly, the team performed a visual analysis of the grounds for debarment for each country. A Tableau report on grounds for debarment can be viewed here (Windows required), while a full Tableau report on the debarred data can be downloaded here. Two highlights from the reports are shown below:

Figure 30: World map of debarments by type in each country.

In Figure 30, the size of each pie chart is proportional to the number of debarred firms. In Figure 31, there appear to be few consistent trends across countries, each being unique in its composition of reasons for debarment.

Figure 31: Ranked list of countries by most debarred firms, broken out by reason for debarment.


Debarred and Non-Debarred Firm Relationships

The team next looked at relationships between debarred and non-debarred firms. To analyze the similarities between these firms (e.g. similar addresses, phone numbers, names), the team first had to supplement the contractor and bidder profiles with identifying information. OpenCorporates, a freely available database of company registrations, was merged with the Bank’s list of non-debarred firms to add addresses, phone numbers, and officers. The team now had two lists of firms with identifying information such as addresses and officers included: one for debarred firms and one for non-debarred firms. They built a network visualization to understand the relationships between debarred and non-debarred firms using a simple matching measure. Firms are represented as nodes: debarred nodes are red and non-debarred nodes are green. Nodes share an edge if they are considered “similar,” in the sense that they share an address, a phone number, officers, or a similar spelling of their names. Figure 32 shows one example of a network of connections between a major debarred company (the large red node on the left) and all other firms. Note the high number of non-debarred firms connected to it.
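Such a matching rule can be sketched as below, using Python's standard-library fuzzy matcher in place of whatever similarity measure the team actually used; the firm records are invented.

```python
# Sketch: two firms are "similar" (and share an edge in the network) if they
# share an address or phone number, or their names are close by fuzzy match.
from difflib import SequenceMatcher

def similar(a, b, name_threshold=0.85):
    if a["address"] and a["address"] == b["address"]:
        return True
    if a["phone"] and a["phone"] == b["phone"]:
        return True
    # Fuzzy comparison of firm names, case-insensitive
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= name_threshold

debarred = {"name": "Acme Construction Ltd", "address": "1 Main St", "phone": "555-0100"}
candidate = {"name": "ACME Construction Limited", "address": "9 Oak Ave", "phone": "555-0199"}
print(similar(debarred, candidate))  # names are close enough to link the firms
```

Running this pairwise over the two lists yields the edge set for the network visualization.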

Figure 32: Network diagram of connections between debarred (red) and non-debarred (green) firms. Edges exist between firms if they share common attributes like addresses, phone numbers, officers, or similar names.

This figure indicates that suspicious relationships may exist between the companies. Further study should be done on the debarred/non-debarred groups that share edges to understand why they are linked and what that means. For future work, the team suggested developing an automated system that flags contractors when: a firm’s geodesic distance to a debarred firm falls below a certain threshold; j of its k nearest neighbors have debarred histories; or it is classified as a debarred firm by a supervised clustering algorithm trained on a carefully vetted sample of the data.
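The first of these rules can be sketched with a plain breadth-first search over the similarity network; the graph and threshold below are illustrative.

```python
# Sketch: flag a firm whose geodesic (shortest-path) distance to any debarred
# firm in the similarity network falls at or below a threshold.
from collections import deque

graph = {  # adjacency list of the similarity network
    "DebarredCo": ["ShellCo"],
    "ShellCo": ["DebarredCo", "BidderX"],
    "BidderX": ["ShellCo"],
    "CleanCo": [],
}
debarred = {"DebarredCo"}

def distance_to_debarred(graph, start, debarred):
    """BFS distance from `start` to the nearest debarred firm (None if unreachable)."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node in debarred:
            return dist
        for nb in graph.get(node, []):
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return None

def flag(firm, threshold=3):
    d = distance_to_debarred(graph, firm, debarred)
    return d is not None and d <= threshold

print(flag("BidderX"), flag("CleanCo"))  # True False
```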


Supplier Profile Database

Having looked at the debarred companies specifically, the team next turned to building a full Supplier Profile Database. Before the DC Big Data Exploration, much of the data on suppliers existed across different datasets. The team created a unified database enabling users to drill down on supplier information. This database was formed by combining results from search APIs to obtain supplier and Bank data. Code to link suppliers with the Bank’s projects can be found here and the full database code can be found on Cam Cook’s GitHub page. The Bank can use this project as a framework to develop a tool that can be used by both fraud and corruption examiners and implementing agencies to analyze contractors or potential contractors. Users can manually examine relationships between firms. If supplemented by data on debarred firms, examiners could identify relationships to known debarred firms or individuals.

Mapping Bidder Relationships

The team next looked for interesting patterns in the relationships between bidders. To begin, the team gathered the URLs of all award notices from the Bank. They then scraped these sites to produce data about each award and all the bidders involved, and generated a network of relationships between bidders. Figure 33 shows networks of co-bidders, where each node is a firm and groups of nodes all bid on contracts together. The visualization only includes firms that bid on three or more contracts. Each edge indicates that the two connected firms bid on a contract together, with darker, thicker edges indicating more co-bids. Node size is based on number of bids, and the bluer a node is, the more centrally connected it is.

Figure 33: Clusters of common co-bidders. Only companies that bid on three or more awards are included. Node sizes are proportional to number of bids and nodes are bluer the more central they are. Edge widths and colors are proportional to number of co-bids.


From the figure we can see some very interesting patterns. G3 appears to be a set of bidders who have all bid on contracts with one another the same number of times. G1 has a highly irregular pattern in which one central node co-bids with a few other partners, each of whom has its own network of co-bidding. Again, a more in-depth analysis of the data needs to be undertaken to ascertain the relevance and accuracy of the findings. Code and data to build these networks can be found in the Additional Resources section. Having shown that mapping bidder relationships is possible, the Bank can extend this method to conduct analyses of the impact of project factors on the bidding process. Adding the debarment data and/or contractor profiles would greatly help identify whether collusion is likely occurring. Fraud and Corruption staff could examine the degree of separation between firms; bidding firms with common addresses, officers, etc. might be likely to be involved in collusion.

Project Approval Trends

Lastly, the team examined trends in the project approval process, specifically the number of approvals made by the World Bank per year. The team acquired data on all projects approved by the World Bank between 1947 and 2012 and ran an analysis of the trends over time (Figures 34-35).

Figure 34: Total number of project approvals made by the Board per year.


Figure 35: The left chart shows approvals per month, where we see an increase in the number of approvals toward the end of each month. The right shows the aggregate of all approvals.


Figure 36: Project approvals by month over time. June approvals seem to increase most.
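The trend lines in Figure 36 can be sketched as a per-month least-squares slope across years; the proportions below are illustrative, not the actual approval data.

```python
# Sketch: for one calendar month, fit a least-squares slope to that month's
# share of approvals across years; a positive slope is a rising (green) trend.
def slope(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

years = [2008, 2009, 2010, 2011, 2012]
june_share = [0.10, 0.12, 0.13, 0.15, 0.16]  # June's proportion of approvals each year
print(slope(years, june_share) > 0)  # a rising (green) trend for June
```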

Figure 36 shows the proportion of approvals by month of year over time, along with trend lines fit by linear regression. Green lines indicate an increase in the number of approvals while red lines indicate a decrease. We can see that the spring and summer months have been rising in their proportion of approvals over time, meaning the Bank is granting proportionally more approvals in late spring and early summer.

Recommendations and Next Steps

For future projects on debarment, it would be useful to have the following:

• A chronology of company/individual actions that resulted in debarment
• For the companies/individuals that have been debarred:
  o Did the projects request extensions?
  o Did the projects request additional funding?
  o Were they “problem” projects?

The Bank needs to create and maintain a formal "data warehouse" of its data that is cleansed, organized and well cataloged. The Bank should consider creating unified profiles for:

• Countries
• Suppliers
• Project types
• Projects
• Evaluation types
• Project activities
• Project activity outcome types
• Time periods


To enable a proper, useful data warehouse, rigorous data cleansing/ETL (Extract, Transform and Load) processes will need to be implemented.

When studying contractor relationships, the team found that there are companies that may not have the capacity to carry out the projects themselves and therefore most likely hire subcontractors. Finding better data on subcontractors and individual consultants could greatly improve the quality of the results. The World Bank may want to supply governments with standardized forms or questionnaires to accompany RFPs, asking for the experience of individuals who will potentially be working on the projects as well as the backgrounds of management teams. This form should be uniform for all projects and ask specific questions about the capability of the contractors and subcontractors and their past experience with similar projects and results. In the case of subcontractors, the same form should be filled out. Also, the Bank may want to pick a few subcontractors at random and interview them confidentially to find out more about the work being subcontracted. Note: even though the governments are the ones who grant/award projects, the Bank can provide this form for the governments and make it a requirement in the RFPs. The team was not sure how involved the Bank is in the RFPs and whether RFPs are standardized.

There is a rich set of potential future projects arising from the weekend’s results. High-potential topics include:

• Creating automated algorithms to flag suspicious firms and companies as they come into the Bank’s pipeline. For example, building on the analysis at the DC Big Data Exploration, the Bank may test methods of flagging contractors when: a firm’s geodesic distance to a debarred firm falls below a certain threshold; j of its k nearest neighbors have debarred histories; or it is classified as an “at risk” firm by a supervised clustering algorithm trained on a carefully vetted sample of the data.

• Further analyzing the distribution of debarred firms over time using factors such as country of origin, firm size, firm industry, etc. The Bank may find it useful to investigate trends such as locally high concentrations of debarred firms, or concentrations of certain types of misconduct over time.

• Building a unified set of profiles for major Bank entities (e.g. suppliers, countries) beyond what the current APIs allow; developing a tool that can be used by both fraud and corruption examiners and implementing agencies to analyze existing or potential contractors. For example: automating the process of cross-checking suppliers against debarred firms and alerting users to known relationships to debarred firms or individuals.

• A deeper analysis of co-bidder relationships to automatically flag suspicious behavior; the World Bank can extend this method to conduct analyses of the impact of project factors on the bidding process. Adding the debarment data and/or contractor profiles would greatly help identify whether collusion is likely occurring. Fraud and Corruption staff could examine the degree of separation between firms; bidding firms with common addresses, officers, etc. might be likely to be involved in collusion.


• Factoring in new data streams in the vein of “civil witness” for understanding corruption during project execution.

Additional Resources

• Team HackPads
  o HackPad 1
  o HackPad 2
• API for Supplier Profiles GitHub
• Data visualization of debarred firms and individuals (Excel file)
• Code to generate network graphs of similarities between debarred and non-debarred firms
• Code to scrape bidder information
• The cleansed co-bidder data with co-bid groups included
• Python code used to scrape and parse award notices
• Excel file used to create network diagram of co-bidders


UNDP RESOURCE ALLOCATION

Background and Problem Statement

The UNDP Capacity and Performance team was trying to improve UNDP’s ability to fund development by examining the relationship between its staffing and its expenditure across programs. Over the last several years, UNDP has been increasingly focusing on measuring and improving its performance. Its objective is to make sure that all resources that UNDP brings to developing countries are used as effectively as possible, produce maximum value and lead to tangible and sustained improvements in people’s lives. To reach this objective, it is critical for UNDP to be able to monitor how well its offices are performing, especially in implementing concrete programs and projects. UNDP must be able to identify and, ideally, predict weaknesses and potential setbacks, and to take timely action to correct course. The DataKind team joined with UNDP experts to use their data to understand how well UNDP’s projects have been performing. The team chose expenditure as the measure of performance for this analysis. They addressed the following questions:

• Are women or men more likely to work in specific program areas?
• What mix of workforce characteristics is associated with the greatest performance?
• Can workforce characteristics accurately predict a downturn in performance?

Datasets Available

The team compiled a dataset on the UNDP workforce, budget and expenditure from programs and projects that took place between 2008 and 2012. Each observation in the data represents an employee, a description of that employee, and the project he or she worked on between 2008 and 2012. Only employees that worked on projects listed in the budget and expenditure data were retained. The dataset is available here.

Key Findings

• Looking purely at budget and expenditure data, the team was able to classify UNDP projects into four broad categories of efficiency and analyze the breakdown of each type of project by country, region, time, and type.

• Key drivers of efficiency were mostly related to characteristics of the project rather than of the staff. Of the staff characteristics in the data, the team found that average years of service, total number of staff, and project recency were indicators of whether a project is likely to spend more than budgeted.

As with all other projects from the DC Big Data Exploration, the findings should be considered provisional, as there are a number of methodological issues that still need to be addressed, e.g. sample size, selection bias and validity of sources.


Methods and Analysis

The team’s first challenge was to define a clear metric for the "success" of a project. UNDP did not appear to have an internal measure of success, so the team devised measures that could be used to define "efficiency." In this analysis, an efficient project spends its budgeted money without too high an overhead. The team combined several data sets to get at these measures. First, they calculated the amount of overhead spent by country and year by summing the estimated salaries of personnel not associated with any program. They then allocated this overhead back to the programs, proportional to the amount expended on each program. Second, the team calculated the ratio of expended to budgeted money; ideally a project meets or exceeds projected figures, while in the worst case it is unable to put its money to use.
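These two measures can be sketched as follows, with invented figures.

```python
# Sketch: allocate country-level overhead to programs in proportion to each
# program's expenditure, then compute the expended/budgeted ratio.
programs = [
    {"name": "P1", "budget": 100.0, "expended": 80.0},
    {"name": "P2", "budget": 50.0, "expended": 20.0},
]
country_overhead = 30.0  # estimated salaries of staff not tied to any program

total_expended = sum(p["expended"] for p in programs)
for p in programs:
    # Overhead share is proportional to the program's expenditure
    p["overhead"] = country_overhead * p["expended"] / total_expended
    p["overhead_ratio"] = p["overhead"] / p["expended"]
    p["expenditure_ratio"] = p["expended"] / p["budget"]

print([(p["name"], round(p["overhead"], 1), p["expenditure_ratio"]) for p in programs])
```

Each program then has the two coordinates plotted in Figure 37: overhead/expended and expended/budget.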

Figure 38: Each point is a project, and large (million dollar or greater) projects are indicated by red dots. The four regions of project types are shown as well.

Figure 37: Plot of (overhead/expended) vs. (expended/budget). Each point is a project, and large (million dollar or greater) projects are indicated by red dots.


Based on these ratios, the team classified projects into four broad categories: those that

• Couldn’t spend all their money (yellow)
• Had high overhead (purple)
• Spent more than allocated (red)
• Were near target (green)
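The categorization can be sketched as below. The report does not state the exact cutoffs used at the event, so the thresholds here are hypothetical.

```python
# Sketch: assign a project to one of the four efficiency categories from its
# expenditure, budget, and allocated overhead. Thresholds are hypothetical.
def categorize(expended, budget, overhead,
               overhead_cut=0.25, low_spend_cut=0.9, near_target=0.05):
    exp_ratio = expended / budget
    if overhead / expended > overhead_cut:
        return "purple"  # had high overhead
    if exp_ratio > 1 + near_target:
        return "red"     # spent more than allocated
    if exp_ratio < low_spend_cut:
        return "yellow"  # couldn't spend all their money
    return "green"       # near target

print(categorize(70, 100, 5), categorize(120, 100, 5))  # yellow red
```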

These categories were then annotated back into the data files containing properties of programs, as described below. The team also created an interactive visualization of these ratios by year, available here.

Properties of Programs

The team then sought to understand basic properties of program spending over time. The team analyzed 324 projects from 2012 and explored their ratio of expenditure to budget. The table below shows the number of projects that fell into each type of expenditure ratio:

Expenditure Ratio   # of Projects
NA                  45
>1                  25
=1                  11
<1                  206
=0                  24
<0                  13

The largest category was projects with an expenditure ratio <1: roughly two-thirds of the projects underspent their budget. Breaking that category down further, the team explored what proportions of projects fell into more specific expenditure bands:

Expenditure Ratio   # of Projects
0.95 - 0.99         47
0.80 - 0.94         62
0.50 - 0.79         53
0.00 - 0.50         44

Here we see that 44 / 312, or about 14% of projects, spent less than half their budget. Dr. Harris, a consultant to the team, had noted that almost a third of projects in 2012 spent less than 80% of their budgets. From this table, we can see that that number is exactly 97 / 312, or 31%. This is an issue that UNDP should explore more thoroughly, as it seems troubling that about a third of projects can’t spend their budgets.

Programs by Year

Using updated versions of the staff and program files, the team then studied program performance by year. Figure 39 shows program expenditure in each of the six regions from 2008 to 2012. This figure shows a large dip in expenditure in the Oceania region from 2009 to 2011; Oceania spent less than the other regions during these years. It turns out that the majority of the almost one third of projects mentioned in Dr. Harris's summary above are concentrated in this region. The team conjectured that reshuffling of funds could be one contributing factor. UNDP should consider comparing other similar programs in the regions to help determine whether UNDP projects overspent/underspent significantly.

Performance Measures of Programs

The team took some initial steps toward identifying indicators of performance in the program data. Trying to correlate staff expertise with performance proved difficult, as many projects had 10 or fewer assigned staff. Turning back to rate of expenditure, the team then explored the ratio of expenditure/budget for each individual bureau. For some bureaus, the histogram of expenditure/budget for their programs peaked sharply near 1, indicating good performance for the majority of projects coming out of that bureau. In other bureaus, the distribution is more spread out between 0 and 1, indicating large variation in how efficiently funds are disbursed. Deeper analysis of the bureaus with more varied distributions should be performed. The team also felt it would be interesting to look at this observation alongside country development data.
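The bureau comparison can be sketched by summarizing each bureau's ratio distribution; the ratios below are invented for illustration.

```python
# Sketch: for each bureau, how concentrated near 1 is the expenditure/budget
# ratio across its programs? A small spread suggests consistent performance.
from statistics import mean, pstdev

bureau_ratios = {
    "Bureau A": [0.97, 0.99, 1.00, 0.98],  # sharply peaked near 1
    "Bureau B": [0.20, 0.50, 0.95, 0.70],  # spread out between 0 and 1
}

summary = {b: {"mean": round(mean(r), 2), "spread": round(pstdev(r), 2)}
           for b, r in bureau_ratios.items()}
print(summary)
```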

Figure 39: Program expenditure by year in each of the six major sites.


Geographic Breakdowns of Program Efficiency

The team also looked at geographic breakdowns of programs. Figure 40 shows the ratio of expenditure/budget by country. From this figure alone it is not clear whether there are surprising trends, but the team would like UNDP to weigh in on what they see in this plot.

Figure 40: Expenditure of projects by country.

Recommendations and Next Steps

First, a better measure of program efficiency and impact would help focus future analysis. Looking purely at expenditure and budget does not speak to the actual performance of a project. Much can be learned by looking at the variables that affect program efficiency as distinct from program performance. Some factors that UNDP could investigate:

• Correlations between staff expertise and project performance using additional personal information about the team members as variables. Some variables that they might consider include: education level, ambition, and income level.

• Using more granular data about the projects such as what was accomplished in a given year, the scope of the project, progress towards outcome metrics.

• The team observed that in some regions, the expenditure-to-budget ratio varied more widely than in others. UNDP could further analyze the variance in expenditure-to-budget ratios by region.

• The team observed that 1/3 of projects spent less than 80% of their budget. UNDP should consider comparing similar programs across the regions to identify correlations across programs (size, sector and project).

Additional Resources

• Team HackPad


ADDITIONAL PROJECTS

There were two additional projects that teams worked on at the DC Big Data Exploration:

• Social networking analysis for risk measurement: Can you forecast project risk using social networking analysis tools?

• Can you use simple heuristic auditing to sniff out discrepancies in expenditure data: What do you do when you have the information but don’t know if it contains signals about potential fraud and corruption related risk?

Because these projects were not set up through DataKind, we unfortunately do not have detailed information about them. However, the link above leads to the hackpad contributed by the authors. Their involvement in the event shows that innovation can come from a wide community of innovators and technologists.

Next Steps

Thanks to the Bank’s willingness to team with DataKind prior to this event, the volunteer teams were able to deliver a huge amount of work to the Bank and its partners in a short amount of time. The major takeaways across all projects were:

• Huge amounts of data exist outside the Bank in the form of mobile, social media, and open data that must be brought to bear on the Bank’s problems. Data scientists could be introduced to fill the capacity gap in using and understanding this type of data.

• Greater effort needs to be made within the Bank to reconcile its data across departments for reusability and advanced analytics. One application could be creating unified profiles for entities like suppliers.

Events like DataDives, competitions, and startup weekends raise visibility for the Bank, unite the community and generate ideas; however, this will not lead to sustainable change unless the Bank commits serious resources to continuing the work and supporting it.

Additional Reading

Blogs that may be relevant:

1. Short recap blog - with links to raw project hackpads
2. Chris Kreutz's recap of the DataDive in Vienna
3. Max Richman on scraping pricing data to measure poverty
4. Francis Gagnon on better data and the power of data visualization
5. Ben Ranoust on using visual analytics to probe risk factors influencing project outcomes
6. Marc Maxson on auditing the world - the sequel
7. Dennis McDonald on learning from data explorations
8. Giulio Quaggiotto and Prasanna Lal Das on personal data philanthropy
9. Milica Begovic, Giulio Quaggiotto, and Ben Ranoust on social networking analysis for development
10. Giulio Quaggiotto, Anoush Tatevossian and Prasanna Lal Das set the stage.