OPEN DATA FOR FINANCIAL INNOVATIONS IN THE DEVELOPING WORLD
DR. BIPLAV SRIVASTAVA
ACM DISTINGUISHED SCIENTIST, ACM DISTINGUISHED SPEAKER
SENIOR RESEARCHER AND MASTER INVENTOR, IBM RESEARCH – INDIA
Talk at IDRBT Doctoral Consortium, Hyderabad, 11 Dec 2015
Why This Talk? Main Messages
• Financial innovations are key for a developing country like India to provide better opportunities to its citizens
  ◦ Impacts not only finance (banking, insurance, …)
  ◦ But all other areas of a society (healthcare, transportation, industry)
• Innovations depend on data, analysis and timely access
• Open data is often the most promising source to start making quick impact
• Eventual aim should be to scale innovations with other data sources and reach people seamlessly at production scale
Actions to Take
Tutorial on 27 July 2015 @ IJCAI 2015
• Join the “AI in India” Google group
  ◦ https://groups.google.com/forum/#!forum/ai-in-india
• Participate in the machine learning competition on using open data for the health area (disease, finance, …)
  ◦ Start: https://www.facebook.com/dataview2016
  ◦ Competition page: http://gator3080.hostgator.com/~sigdata//comad2016/data_challenge_competition.html
  ◦ Data and insights sought: http://gator3080.hostgator.com/~sigdata//comad2016/data_sources.html
Europe GDP Growth
Source: http://ec.europa.eu/eurostat/web/national-accounts/statistics-illustrated
Complexity and Innovation
• Complexity
  ◦ Many countries: 28 in the EU, 19 of which use the euro
  ◦ Changes within Europe: the break-up of Yugoslavia was still playing out in the 2000s (e.g., Montenegro became independent in 2006)
  ◦ There have been continuous currency changes since the euro was introduced in 1999; since 2001, Cyprus, Slovenia, Malta, Slovakia … have joined or changed currency
• Innovation
  ◦ Linked data to represent data, metadata and relationships
  ◦ Contextual and holistic visualization
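The linked-data idea can be sketched with a tiny in-memory triple store. This is only an illustration, not the representation used by Eurostat: the `ex:` identifiers, the predicate names and the helper functions are hypothetical, and a real deployment would use dereferenceable URIs and an RDF library. The adoption years are those of the four countries named on the slide.

```python
# Hypothetical mini triple store: (subject, predicate, object) facts.
# The "ex:" identifiers are made up for illustration only.
TRIPLES = [
    ("ex:Slovenia", "rdf:type", "ex:Country"),
    ("ex:Slovenia", "ex:adoptedEuroIn", 2007),
    ("ex:Cyprus", "rdf:type", "ex:Country"),
    ("ex:Cyprus", "ex:adoptedEuroIn", 2008),
    ("ex:Malta", "rdf:type", "ex:Country"),
    ("ex:Malta", "ex:adoptedEuroIn", 2008),
    ("ex:Slovakia", "rdf:type", "ex:Country"),
    ("ex:Slovakia", "ex:adoptedEuroIn", 2009),
]


def match(subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in TRIPLES
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def euro_adopters_by(year):
    """Countries in the store that had adopted the euro by the given year."""
    adoptions = match(predicate="ex:adoptedEuroIn")
    return sorted(s for (s, _, o) in adoptions if o <= year)


print(euro_adopters_by(2008))
```

Because data, metadata and relationships share one uniform shape, a time-dependent question such as "who used the euro in a given year" becomes a simple pattern match.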
Indian Reality – Kingfisher Airline Case
• A two-term Rajya Sabha MP
  ◦ Heading company and taking loans from banks
  ◦ Leading airline to collapse
  ◦ Delaying repayment
• The airline (company)
  ◦ Not paying employees and vendors
  ◦ Not even paying income tax deducted from employees
• Consequence
  ◦ Airline collapses leading to loss to travellers and employees
  ◦ Banks suffer heavy losses
  ◦ Little impact on company leader
Reality in a Developing Country
• In the private sector, hard to know about the genuineness of
  ◦ Individuals and companies
  ◦ Their needs and expenses
• In the government sector, hard to know about
  ◦ Spending – budgeted and actuals
  ◦ Effectiveness of their spending
  ◦ Benchmarking with best practices, e.g., return on investment
• Consequence
  ◦ Few loans available to the needy
  ◦ High non-performing assets (NPAs) of banks
  ◦ Lower performance of markets since investors stay away
  ◦ Lower country growth, high unemployment and poverty
Resources for Finding About a Person
• Public encyclopedia: Wikipedia
  ◦ Example: http://en.wikipedia.org/wiki/Vijay_Mallya
• Specialized databases
  ◦ Indianboards: http://indianboards.com/pages/index.aspx
    ▪ Example: Infosys (http://indianboards.com/pages/companyprofile.aspx?code=C0000604)
  ◦ US CEOs: http://ceo.com
  ◦ Forbes profile:
    ▪ Example: http://www.forbes.com/profile/ginni-rometty/
Resources for Finding About a Company
• Market regulators
  ◦ SEC (USA): EDGAR filings - http://www.sec.gov/edgar/searchedgar/companysearch.html
  ◦ Ministry of Corporate Affairs (MCA) database: http://www.mca.gov.in/DCAPortalWeb/dca/MyMCALogin.do?method=setDefaultProperty&mode=31
• Private market intelligence companies
  ◦ EMIS:
    ▪ Example: http://www.securities.com/php/company-profile/KR/Samsung_Electronics_CoLtd_en_1651328.html
Snapshot: Financial Innovations Needed for Developing Countries
• [Individuals] Data-based generation of
  ◦ Credit profile of individuals
  ◦ Criminal profile of individuals
• [Entities] Data-based generation of
  ◦ Credit profile of legal entities – companies, NGOs
  ◦ Ranking of companies in an industry
• [Governments] Data-driven automatic
  ◦ Audit of government programs for effectiveness
  ◦ Ranking of cities, state governments
  ◦ Corruption assessment
• Prediction of
  ◦ Stocks
  ◦ Initial public offers (IPOs)
  ◦ Tax collection
Outline
• Motivating Examples
• Open Data
• Analytical Techniques
• Discussion
  ◦ Pattern in Building Usable Systems
  ◦ Smart City – What to Solve?
  ◦ Call to Action
Open Data
• Open data is the notion that data should not be hidden, but made available to everyone. The idea is not new.
• Scientific publications follow this: “standing on the shoulders of giants”
  ◦ Science stands for repeatability of results and hence, sharing
  ◦ The scientific community asserts that open data leads to increased pace of discovery. (See: Ray P. Norris, How to Make the Dream Come True: The Astronomers' Data Manifesto, at http://www.jstage.jst.go.jp/article/dsj/6/0/6_S116/_article, accessed 2 Apr 2012)
• Governments are the new source for open data
  ◦ Data.gov efforts world-wide; 400+ governmental bodies, including 20+ national agencies, among them India's, have opened data
  ◦ In India, an additional movement is the “Right to Information Act”
Not to Be Confused With Orthogonal Trend – Big Data
• Volume
• Variety
• Velocity
• Veracity
• …
Cartoon critical of big data application, by T. Gregorius. http://upload.wikimedia.org/wikipedia/commons/thumb/b/b3/Big_data_cartoon_t_gregorius.jpg/220px-Big_data_cartoon_t_gregorius.jpg
400+ Data Catalogs of Public Data
As on 21 July 2015
Data.gov (USA)
As on 16 June 2015
18 Talk at IEEE Bangalore Workshop, Technologies for Planning and Acting in Real World Systems
City Level – Chicago, USA
As on 16 June 2015
Peek into the Future - Amsterdam
http://citydashboard.waag.org/
Does Opening Data Make It Reusable? No
[Figure: illustration of the five levels (1 to 5 stars) of open data. Source: http://5stardata.info/]
India: Right to Information Act
• Any citizen “may request information from a "public authority" (a body of Government or "instrumentality of State") which is required to reply expeditiously or within thirty days.”
  ◦ Passed by Parliament on 15 June 2005 and came fully into force on 13 October 2005. Citation: Act No. 22 of 2005
• Lauded and reviled
  ◦ Brought transparency
  ◦ Also,
    ▪ Increased bureaucracy
    ▪ Shortcomings in preventing corruption
• More information
  ◦ http://en.wikipedia.org/wiki/Right_to_Information_Act
  ◦ http://rti.gov.in
Data Quality in Public Data in India
• Right to Information
  ◦ Not even 1*
  ◦ Information available to requester, but no one else
• Data.gov.in
  ◦ 2-3*
  ◦ Available in CSV, etc., but not uniquely referenceable
• Open data movements are moving to linked data form for semantics
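The star ratings above follow the 5-star scheme from http://5stardata.info/, in which each level presupposes the previous one. A minimal sketch of such a rating function is shown below; the `Dataset` fields and the example records are hypothetical simplifications, not an official scoring tool.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical checklist, one flag per level of the 5-star scheme."""
    on_web_open_license: bool   # 1*: on the web under an open license
    structured: bool            # 2*: machine-readable structured data
    open_format: bool           # 3*: non-proprietary format (e.g., CSV)
    uses_uris: bool             # 4*: URIs denote things
    linked_to_other_data: bool  # 5*: linked to other data for context


def star_rating(d: Dataset) -> int:
    """Count levels satisfied in order; a failed level stops the count."""
    levels = [
        d.on_web_open_license,
        d.structured,
        d.open_format,
        d.uses_uris,
        d.linked_to_other_data,
    ]
    stars = 0
    for satisfied in levels:
        if not satisfied:
            break
        stars += 1
    return stars


# Illustrative cases mirroring the slide (assumed, not measured):
rti_reply = Dataset(False, False, False, False, False)  # only the requester sees it
portal_xls = Dataset(True, True, False, False, False)   # spreadsheet download
portal_csv = Dataset(True, True, True, False, False)    # CSV, but no URIs

print(star_rating(rti_reply), star_rating(portal_xls), star_rating(portal_csv))
```

Under these assumptions a Right to Information reply scores zero, while portal downloads land at two or three stars depending on the file format, which matches the assessment on the slide.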
Semantics for Published Data
Classify data in the public domain. Use schema.org as illustration.
  ◦ Select an area (e.g., food, news events, crime, customs, diseases, …)
  ◦ Build + disseminate the catalog tags via a website
  ◦ Encourage publishers to use meta-data tags and enable search
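As an illustration of such meta-data tags, the sketch below describes one dataset with the schema.org `Dataset` vocabulary, serialized as JSON-LD. The URL is one of the data.gov.in catalogs referenced later in this talk; the name, keywords and the `has_keyword` helper are assumptions made for the example, not tags the publisher actually uses.

```python
import json

# Hypothetical JSON-LD description using schema.org's Dataset type.
# Name and keywords are illustrative assumptions.
dataset_description = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Number of cases and deaths due to diseases",
    "url": "http://data.gov.in/catalog/number-cases-and-deaths-due-diseases",
    "keywords": ["health", "diseases", "India"],
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
    },
}


def has_keyword(description, keyword):
    """Naive tag search: case-insensitive match against the keywords list."""
    return keyword.lower() in (k.lower() for k in description.get("keywords", []))


print(json.dumps(dataset_description, indent=2))
print(has_keyword(dataset_description, "Health"))
```

Once publishers attach tags like these, even a simple keyword filter can find datasets by area, which is the search capability the last bullet asks for.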
[Figure: spectrum of ontology formality, with labels: Catalog/ID; Terms/glossary; Thesauri “narrower term” relation; Informal is-a; Formal is-a; Formal instance; Frames (properties); Value restrictions; Disjointness, inverse, part-of…; General logical constraints]
Credits: Ontologies Come of Age, McGuinness, 2001. From AAAI Panel 99 – McGuinness, Welty, Uschold, Gruninger, Lehmann. Plus basis of Ontologies Come of Age – McGuinness, 2003
Still Confused on Semantics? Start with Linked Data Glossary
Open Data References
• Concept
  ◦ Open Data, at http://en.wikipedia.org/wiki/Open_data
  ◦ Open 311, at http://open311.org/
  ◦ Catalog of Open Data, at http://datacatalogs.org/dataset
  ◦ Digital City Exchange, at http://www.imperial.ac.uk/digital-city-exchange
• India specific
  ◦ Open data report in India, at http://cis-india.org/openness/publications/ogd-report
• Standards
  ◦ W3C, at http://www.w3.org/2011/gld/
  ◦ 5 Star Linked Data ratings, at http://www.w3.org/DesignIssues/LinkedData.html
• Applications and ecosystems
  ◦ Introduction to Corruption, Youth for Governance, Distance Learning Program, Module 3, World Bank Publication. Accessed on 15 June 2011, at http://info.worldbank.org/etools/docs/library/35970/mod03.pdf
  ◦ Dublinked, at http://dublinked.ie
Advanced AI Techniques (Analytics) like Planning & Machine Learning make use of data and models to provide insight to guide decisions
[Figure: data and models feed analytics, which produces insight that guides a decision]
Data sources: business automation, instrumentation, sensors, Web 2.0, expert knowledge, “real world physics”
Model: a mathematical or algorithmic representation of reality intended to explain or predict some aspect of it
Decision: executed automatically or by people
Example: Talks
• Are they useful? (Descriptive)
  ◦ Answering needs an assessment about the event
• If it happens next time, how many will attend? (Predictive)
  ◦ Above + answering needs an assessment about unknowns (e.g., future)
• Should you attend? (Prescriptive)
  ◦ Above + answering needs understanding the goals and current status of the individual
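The three kinds of questions can be sketched on the talks example. Everything here is an assumption for illustration: the attendance numbers are made up, the forecast is a plain least-squares line, and the decision rule is one possible way to encode an individual's goals.

```python
from statistics import mean

# Hypothetical attendance at four past editions of a talk (made-up numbers).
attendance = [40, 50, 60, 70]

# Descriptive: summarize what happened.
average_attendance = mean(attendance)


def forecast_next(values):
    """Predictive: fit a least-squares line over editions and extrapolate one step."""
    n = len(values)
    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(values)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return intercept + slope * n


def should_attend(predicted, room_capacity, topic_matches_goal):
    """Prescriptive: combine the forecast with the individual's goal and constraints."""
    return topic_matches_goal and predicted <= room_capacity


predicted = forecast_next(attendance)
print(average_attendance, predicted, should_attend(predicted, 100, True))
```

Each step builds on the previous one: the forecast needs the historical summary data, and the recommendation needs both the forecast and knowledge about the person asking.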
Analytics Landscape
[Figure: analytics landscape; horizontal axis: Degree of Complexity; vertical axis: Competitive Advantage]
Descriptive
  ◦ Standard reporting: What happened?
  ◦ Ad hoc reporting: How many, how often, where?
  ◦ Query/drill down: What exactly is the problem?
  ◦ Alerts: What actions are needed?
Predictive
  ◦ Simulation: What could happen…?
  ◦ Forecasting: What if these trends continue?
  ◦ Predictive modeling: What will happen next if?
Prescriptive
  ◦ Optimization: How can we achieve the best outcome?
  ◦ Stochastic optimization: How can we achieve the best outcome including the effects of variability?
Based on: Competing on Analytics, Davenport and Harris, 2007
ML References
• WEKA
  ◦ Website: http://www.cs.waikato.ac.nz/~ml/weka/index.html
  ◦ WEKA Tutorial:
    ▪ Machine Learning with WEKA: A presentation demonstrating all graphical user interfaces (GUI) in Weka.
    ▪ A presentation which explains how to use Weka for exploratory data mining.
  ◦ WEKA Data Mining Book:
    ▪ Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)
    ▪ http://www.cs.waikato.ac.nz/ml/weka/book.html
  ◦ WEKA Wiki: http://weka.sourceforge.net/wiki/index.php/Main_Page
• Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd ed.
• http://www.kdnuggets.com/2015/03/machine-learning-table-elements.html
Discussion: A Pattern in Building Usable Systems
Recap of Key Points from Finance Scenarios
• Very difficult to find reliable information about persons, companies and states
• This is leading to wastage, e.g., non-performing assets in the banking system
• Outside finance: wastage in public spending (healthcare, transportation, industrial production, …), business and individual spending
• Information technology (IT) and financial innovations are needed, especially in developing countries
Real-World Applications of ICT Follow a Pattern
• Value (from Action, Decisions) – Providing benefits that matter, to the people most in need of them, in a timely and cost-efficient manner. Going beyond technology to process and people aspects.
• Data + Insights – Available, Consumable with Semantics, Visualization / Analysis
• Access – Apps (Applications), Usability – Human Computer Interface, Application Programming Interfaces (APIs)
Example – Financial Innovations
• Decision Value – To individuals, businesses, government institutions
  ◦ Individuals Examples – Which person to financially trust? Which bank to trust?
  ◦ Govt Examples – Which company to give contracts to?
  ◦ Business Examples – Which companies and individuals to give credit to? What discounts to give?
• Data – Quantitative as well as qualitative
  ◦ Open data
  ◦ Social data
  ◦ Transactional data
• Access –
  ◦ Today, little reliable information
Key Idea: Can we make insights available when needed and help people make better decisions?
Example – Public Health Innovations
• Decision Value – To individuals, businesses, government institutions
  ◦ Individuals Examples – Which doctor should I go to? Which hospital should I go to? What health policies should I take?
  ◦ Govt Examples – What diseases should be of focus? Which hospitals should be given grants? Which health programs should be discontinued?
• Data – Quantitative as well as qualitative
  ◦ Past incidents – cases, deaths, spending
  ◦ Health trends – vaccines, epidemics, health instruments
  ◦ Financial trends – insurance, policies, social behaviors
• Access –
  ◦ Today, little, and that too in health / technical jargon
  ◦ In PDF documents, websites
Key Idea: Can we make insights available when needed and help people make better decisions?
DataView 2016
Data and insights sought: http://gator3080.hostgator.com/~sigdata//comad2016/data_sources.html
Insights sought
1. What diseases are most prevalent in a given area (e.g., state, district, city, by keyword)?
2. Which diseases have been better controlled than others in India? What states have done better than others? Are there approaches which have worked for controlling / reducing instances of diseases better than others?
3. How much money has been allocated to tackle specific diseases compared to others? Which regions do better than others in controlling diseases relative to money spent?
4. Is there a relationship between water-borne diseases and water pollution?
Datasets
Health
• H-DS-1: http://data.gov.in/catalog/number-cases-and-deaths-due-diseases , All-India (from 2000 to 2011) and state-wise (2010 and 2011) number of cases and deaths due to specified diseases (Acute Diarrhoeal Diseases, Malaria, Acute Respiratory Infection, Japanese Encephalitis, Viral Hepatitis).
• H-DS-2: http://data.gov.in/catalog/cases-and-deaths-due-kala-azar , cases and deaths due to the illness Kala-Azar in Bihar, West Bengal and the country during the years 1996 till 2000.
• H-DS-3: https://data.gov.in/catalog/cases-and-deaths-due-japanese-encephalitis-and-dengue-dhf-during-tenth-plan , cases and deaths due to Japanese Encephalitis and Dengue / DHF during Tenth Plan.
• H-DS-4: https://data.gov.in/catalog/water-quality-affected-habitations , Water Quality Affected Habitations
• H-DS-5: https://data.gov.in/catalog/hospital-directory-national-health-portal , Hospital Directory with Geo Code as on September 2015
Expenditure
• F-DS-1: https://data.gov.in/catalog/outlays-and-expenditure-aids-control-programme-during-ninth-plan , outlays and expenditure of AIDS Control Programme during Ninth Plan.
• F-DS-2: https://data.gov.in/catalog/public-sector-outlaysexpenditure-during-eleventh-five-year-plan , public sector outlays and expenditures during Eleventh Five Year Plan (2007-12) under various Heads of Development (Rs. Crore).
• F-DS-3: http://data.gov.in/catalog/outlays-department-health-agreed-planning-commission-during-tenth-plan , data related to 9th Plan Allocation, 9th Plan Anticipated Expenditure, 10th Plan Allocation as agreed by Planning Commission.
• F-DS-4: https://data.gov.in/catalog/percentage-share-household-expenditure-health-and-drugs-various-states-during-eleventh-five , data related to percentage share of household expenditure on health and drugs in various states during Eleventh Five Year Plan.
• F-DS-5: https://data.gov.in/catalog/state-wise-plan-outlays-and-expenditure , state-wise plan outlays and expenditure during 2011-2012.
• F-DS-6: https://data.gov.in/catalog/outlay-tenth-plan-tenth-plan-sum-annual-outlay-and-tenth-plan-actual-expenditure-department , data related to Outlay Tenth Plan, Tenth Plan (2002-07) sum of Annual Outlay and Tenth Plan (2002-07) Actual Expenditure for Department of Health and Family Welfare.
Water Quality
• W-DS-1: https://data.gov.in/catalog/status-water-quality-india-2012 , http://data.gov.in/catalog/number-cases-and-deaths-due-diseases , status of Water Quality in India in 2012
• W-DS-2: https://data.gov.in/catalog/status-water-quality-india-2008-and-2011 , status of Water Quality in India - 2008 and 2011
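Insight 4 (water-borne diseases versus water pollution) is at heart a correlation question. The sketch below computes a Pearson correlation coefficient in plain Python. The per-state numbers are invented placeholders, not values from the datasets listed above, and a high coefficient would indicate association, not causation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


# Hypothetical per-state values (placeholders, not real data):
affected_habitations = [10, 20, 30, 40]   # water-quality-affected habitations
diarrhoea_cases = [100, 180, 310, 390]    # reported acute diarrhoeal cases

r = pearson(affected_habitations, diarrhoea_cases)
print(round(r, 3))
```

In practice the two series would come from joining a water quality dataset with a disease dataset on state, after normalizing for population.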
Example – River Water Pollution
• Decision Value – To individuals, businesses, government institutions
  ◦ Individuals Examples – Can I take a bath without getting sick? What crops should I grow? What water should I drink and pay for?
  ◦ Govt Examples – How should govt spend money on sewage treatment for maximum disease reduction? How should it inspect industries?
• Data – Quantitative as well as qualitative
  ◦ Dissolved oxygen,
  ◦ pH,
  ◦ … 30+ measurable quantities of interest
• Access –
  ◦ Today, little, and that too in water technical jargon
  ◦ In PDF documents, websites
Key Idea: Can we make insights available when needed and help people make better decisions?
What is a Smart City?
Smart city can mean one or more of the following:
• As a resource optimization objective, it is to know and manage a city's resources using data.
• As a caring objective, it is about improving the standard of life of citizens with health, safety, etc. indices and programs.
• As a vitality objective, it is about generating employment and sustainable growth.
A city leadership can choose among these or define their own objective(s) and manage with measurements to pro-actively achieve it.
See other FAQs at: https://sites.google.com/site/biplavsrivastava/research-1/intelligent-systems/scfaqs
Smarter Cities solution paths leverage a similar approach
[Figure: three-step solution path; horizontal axis: use of Smarter Cities capabilities; vertical axis: unique value realized]
1. Manage Data: Integrate service information to improve department operations
  ◦ ↑ Improve service levels
  ◦ ↓ Reduce fraud and abuse
2. Analyze Patterns: Develop integrated view to improve outcomes and compliance
  ◦ ↑ Focus on the citizen
  ◦ ↑ Savings from overpayment
  ◦ ↑ Assistance with compliance
3. Optimize Outcomes: Leverage end-to-end case management to optimize service delivery
  ◦ ↑ Integrated case management
  ◦ ↑ Automation of citizen support
  ◦ ↓ Reduce operating costs
India’s 100 Smart Cities
Details: https://sites.google.com/site/biplavsrivastava/smart-cities-in-india
Comments on India’s 100 City Plans
• A much-needed, much-delayed start
  ◦ JNNURM and earlier initiatives did not show impact
• However, the selection criteria were non-technical
  ◦ Focus was on funding feasibility (center-state) and administrative considerations
  ◦ No commitment on measurable improvement of any metric in any city domain
• Opportunity to impact India’s transformation (theoretically)
  ◦ However, an environment to try out India-specific, new innovations needs to be created
  ◦ Focus has to be on improvement metrics, accountability for money spent, and quality outcomes
Smart City Challenges
• From a resource angle, decrease waste / inefficiency while improving service delivery to citizens
• Problems are old but accentuated today by population growth and reducing resources
• Open data and the effectiveness of analytical methods hold promise
• Challenges
  ◦ Provide value quickly
  ◦ Use value synergies from different domains (e.g., finance, health, environment, traffic, corruption …)
  ◦ Grow to scale
Common Descriptive Analytics Patterns, Accelerated with Open Data
• Correlation of outcomes, across
  ◦ Data sources in the same domain
  ◦ Different domains
• Return on investment analysis
  ◦ Money invested vs. metrics to measure improvement in the domain
  ◦ Comparison of performance with history
  ◦ Comparison of performance with other regions
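The return on investment pattern above can be sketched as a ranking of regions by improvement per unit of money. The regions, spending figures and case counts below are hypothetical, and "cases averted per Rs. crore" is just one possible metric; a real analysis would also adjust for population and baseline trends.

```python
# Hypothetical records: (region, spending in Rs. crore, cases before, cases after).
records = [
    ("Region A", 100.0, 5000, 4000),
    ("Region B", 50.0, 3000, 2400),
    ("Region C", 80.0, 4000, 3900),
]


def cases_averted_per_crore(spending, cases_before, cases_after):
    """Improvement in the domain metric per unit of money invested."""
    if spending <= 0:
        raise ValueError("spending must be positive")
    return (cases_before - cases_after) / spending


# Compare regions with each other: best return first.
ranking = sorted(
    (
        (region, cases_averted_per_crore(spending, before, after))
        for region, spending, before, after in records
    ),
    key=lambda item: item[1],
    reverse=True,
)

for region, value in ranking:
    print(f"{region}: {value:.2f} cases averted per Rs. crore")
```

The same function applied to one region across several plan periods gives the comparison with history.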
Employing All Data – Data Fusion
• Open data is one source
  ◦ Often easiest to get but with issues (e.g., at aggregate level, with gaps, imprecise semantics)
• Social is another promising data source
  ◦ People are anyway generating it (people-as-sensors)
  ◦ However, social sites have varying data reuse permissions, license costs, access limits
  ◦ Big data techniques are already being used here
• Use sensor data if available
  ◦ Internet of Things (IoT) and big data techniques are relevant
  ◦ Most prevalent in health, environment and transportation
• Key is to release the fused data also for reuse
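A minimal sketch of fusion across the three kinds of sources is a key-based merge, here on district name. The districts, attribute names and values are all hypothetical; real fusion would also need entity resolution, unit harmonization and license checks before the result can be released.

```python
import json

# Hypothetical per-district records from three kinds of sources.
open_data = {
    "District X": {"reported_cases": 120},
    "District Y": {"reported_cases": 45},
}
social_data = {
    "District X": {"complaint_posts": 30},
    "District Z": {"complaint_posts": 8},
}
sensor_data = {
    "District Y": {"dissolved_oxygen_mg_l": 5.1},
}


def fuse(*sources):
    """Merge records that share a key; later sources add or overwrite attributes."""
    fused = {}
    for source in sources:
        for key, attributes in source.items():
            fused.setdefault(key, {}).update(attributes)
    return fused


fused = fuse(open_data, social_data, sensor_data)

# Serialize the fused view so it can itself be released for reuse.
print(json.dumps(fused, indent=2, sort_keys=True))
```

Districts missing from one source simply carry fewer attributes, which makes the gaps in each source visible in the fused output.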
Building Community for Innovations
• Multi-disciplinary
  ◦ In AI
  ◦ In Computer Science
  ◦ In science: domain (finance, health, transport, …), techniques (CS, engg.) and evaluation (public policy, …)
• Multi-stakeholder
  ◦ Citizens
  ◦ Government
  ◦ Academia
  ◦ Business / Industry
  ◦ Non-profits, …
• Getting to scale is key
Main Messages
• Financial innovations are key for a developing country like India to provide better opportunities to its citizens
  ◦ Impacts not only finance (banking, insurance, …)
  ◦ But all other areas of a society (healthcare, transportation, industry)
• Innovations depend on data, analysis and timely access
• Open data is often the most promising source to start making quick impact
• Eventual aim should be to scale innovations with other data sources and reach people seamlessly at production scale
Thank You
Merci (French) · Grazie (Italian) · Gracias (Spanish) · Obrigado (Portuguese) · Danke (German) · Multumesc (Romanian) · Teşekkür ederim (Turkish)
Dr. Biplav Srivastava, [email protected], http://www.research.ibm.com/people/b/biplav/