Open Data: a design for the provisioning of Dutch government public and geo-spatial transport data

  • 8/7/2019 Open Data: a design for the provisioning of Dutch government public and geo-spatial transport data.

    1/48

    University of Groningen

    Industrial Engineering and Management

    Bachelor Thesis

    Supervisors: prof. dr. H.G. Sol (University of Groningen),

    ir. drs. T.A. van den Broek (TNO)

    Open Data: a design for the provisioning of

    Dutch government public and geo-spatial

    transport data.

    J.P.S. van Grieken

    Groningen, February 28, 2011

    Abstract

    Governments are increasingly publishing structured, machine-readable and free public

    sector information for commercial and public re-use. They are moving from a closed model

    in which businesses pay a price that maximizes government profit or covers long-term cost

    towards a free model in which data is available without any cost. This form of public

    sector information provisioning is also referred to as open data. In this paper a design and

    business model for Dutch public and geo-spatial data is presented. Furthermore, the impli-

    cations of a governmental open data policy on the business case of various stakeholders that

    work with public and geo-spatial transport data are examined.

    To establish a design for open data a literature review and interviews with specialists were

    conducted. We found that the proliferation of the internet as a participatory and economic

    platform, the development of freedom of information and transparency policies and

    the perceived economic benefits of free public sector information have contributed to the

    development of open data. We found that if government data were to be made available at

    zero or marginal cost this could lead to significant increases in economic activity. Businesses

    could use the different data sets to create services and therefore add value to the data. This

    economic activity in its turn would lead to more revenue for the businesses and increase

    overall welfare. The government would benefit from this activity through taxation of the

    services.

    A business model of open data in the public and geo-spatial transport sector was designed.

    In this model barriers in legislation were removed, accurate pricing strategies and a tech-

    nical implementation for open data were recommended. We found that this model causes

    changes in the business case of data providing organizations and businesses. Especially the

    cost structure of these respective stakeholders should be changed. Finally, a design for a data

    warehouse for road and public transport data is presented. The design covers a warehouse

    architecture, data model, interface design, hardware recommendations and qualitative as-

    pects. In the final section of the paper we discuss some of the findings in relation to economic

    activity, loss of intellectual property, licensing of open data and changes in government cost-

    structure.

    Keywords: public sector information, open data, design, business case, data-warehouse,

    public transport, geo-data, economics, transparency, governments

    Open Data: a design for the provisioning of Dutch government public and geo-spatial

    transport data, by J.P.S. van Grieken, is licensed under a Creative Commons

    Attribution-NonCommercial-ShareAlike 3.0 Unported License.

    Contents

    1 Introduction

    1.1 Context

    1.1.1 The Networked Society

    1.1.2 Drivers of transparency

    1.2 Open Data

    1.3 Problem Definition

    1.4 Objective

    2 Theory

    2.1 The economics of open data

    2.2 Dutch government information architecture

    2.3 Stakeholders

    2.4 The business model

    2.5 Research Question

    3 Methods

    3.1 Literature Review

    3.2 Open Interviews

    3.3 Stakeholder Identification

    3.4 Structured interviews

    3.5 Business case analysis

    3.6 Requirements analysis

    3.7 Data Warehouse design

    4 Business Model Design

    4.1 Effects of the model

    4.2 Effects on the stakeholder business cases

    5 Technology Design

    5.1 Landscape

    5.2 Warehouse Architecture

    5.3 Data Model

    5.4 Interface

    5.5 Hardware

    5.6 Qualitative Aspects

    6 Discussion

    6.1 Effects on businesses

    6.2 Changes in government cost structures

    6.3 Loss of intellectual property and market disturbance

    6.4 Legal: insuring coverage, quality, privacy and neutrality of data

    6.5 Data vs. Services

    6.6 Risks of the design

    7 Appendix

    .1 Requirements Document

    .2 Interview Protocol

    .3 Final Presentation

    .4 List of Interviews

    .5 Acknowledgement

    Chapter 1

    Introduction

    "Political participation, civil society, and transparency are among the indispensable

    elements that are the imperatives of democratization." As quoted from a

    speech at Harvard University, Kennedy School of Government, by Recep Tayyip

    Erdogan, January 30th 2003

    Long before the rise of computer technology, governments started to collect vast

    amounts of structured data. Already in 1811 the cadastre started measuring and recording

    the ownership of land1. From 1899 the Central Bureau for Statistics (CBS) kept detailed records

    and statistics on the Dutch population in order to allow decision makers to construct effective

    economic policies. Most of this data is used by different governmental organizations to serve

    the public in their daily operations. For example, the cadastre uses the detailed maps they

    have gathered to determine the boundaries of land when sold. Nowadays, this structured

    data is stored in large data warehouses owned and maintained by different branches of

    government. Estimates suggest that between 100-150 Dutch governmental organizations

    possess data that could be relevant to the public or to businesses [1].

    If this government data were to be made available at zero or marginal cost this could

    lead to significant increases in economic activity[23]. Businesses could use the different data

    sets to create services and therefore add value to the data. This economic activity in its turn

    would lead to more revenue for the businesses and increase overall welfare. The government

    would benefit from this activity through taxation of the services. For example, after re-

    leasing the data within months innovative applications in public transport, crime, parking,

    schools, tourism and dining were created2.

    There are three main reasons why this business potential remains untapped in the Netherlands.

    First of all, governments often choose a pricing strategy that either maximizes profit

    or recovers the long-term average cost. This creates a barrier for businesses to re-use the data

    because the cost to gather the information themselves is similar to buying it directly from

    the government. Secondly, law and policy restrictions apply to most of the datasets the

    government owns. For example, copyright and database law restrictions limit businesses in

    the services that could possibly be built on this data. Finally, most government bodies lack

    the technical infrastructure to deliver high quality data to businesses at high speed.

    1.1 Context

    Before we begin the analysis of the economics and technical infrastructure needed for our

    design we first want to explain the developments in legislation and society that have led to

    open data.

    1.1.1 The Networked Society

    The first important development that has made open data possible is the rise of the internet

    within our society. The internet has created a market for information services and goods. It

    has created possibilities for collaboration and trade of information goods and services and

    is developing as a major distribution platform for these services.

    Everywhere around the globe broadband access has been pushed into markets to connect

    people to the internet. For a couple of years now, almost everybody in the Netherlands

    has had access to the internet via a computer or mobile device. Access to the internet

    rose from 77% in 2004 to 93% in 2009 [3]. These new forms of communication have enabled

    citizens to communicate in new ways amongst themselves and with public institutions. Net-

    works of people continue to form the structures and organization of society, a phenomenon

    which is mainly referred to as the rise of the network society [4]. These ways of interaction

    create new ways of collaboration among citizens in terms of speed, scale, anonymity, inter-

    activity and community building. The internet provides a market for people to collaborate

    and is described by Antonijevic and Gurak as

    [The internet] has brought easy to use content-creating applications such as

    blogs, wikis, social networking sites, and file sharing platforms rooted in broad-

    band access, affordable hardware and software solutions, and with the Internet

    perceived and used as a new normal in contemporary way of life [5].

    The development of the internet as a network of individuals collaborating is recognized

    as a new way of creating economic value. The OECD sees the web as one of the drivers

    for creativity and economic development among people in the coming century [6]. In the

    field of software construction this has led to collaborative software creation between

    programmers and other specialists from all over the globe, which is referred to as open source

    software. Open source software challenges the rules of economics, software development and

    IT management. On development networks like sourceforge.net, large numbers of programmers

    work together on software projects without any financial compensation [7].

    These programmers engage in civil society and organize bar camps3 and online platforms

    where they meet and try to construct software that helps governments and citizens in their

    daily lives. A good example of a developed network is the Sunlight Labs in the United States

    which counts around 2700 volunteering programmers 4 that work on various projects. In

    Europe a large community of programmers can be found in the United Kingdom, Denmark

    and Spain.

    A study in the United Kingdom looked at the motivation of these communities of programmers

    in relation to open data. Citizens showed a desire to engage with government in

    open data initiatives. The survey indicated that 36% wanted to be actively involved and

    use the data, versus 33% who were just happy to get the data. Similar effects have been found in

    the relation between citizens and the government in the Netherlands [8]. A study by TNO

    suggests that the rise of the social web (web 2.0) causes citizens to create new platforms

    that they use to organize, collaborate, share, trade and create [10]. These platforms are

    open in nature, require visitors to collaborate and try to use the distributed knowledge of

    all the participants. We have now described the developments that give open data its societal

    context. The networked society has led to a collaboration platform and potential market

    for open data.

    1.1.2 Drivers of transparency

    In most countries that have adopted open data policies the development originated from

    transparency and freedom of information laws. The term transparency has many different

    definitions depending on specific use and context. In the field of politics and government

    transparency is usually referred to as social transparency[10]. This form of transparency

    is defined as: "Social Transparency allows citizens to be more informed and encourages the

    disclosure as a regulation mechanism of centers of authority. It is based on ethics and governance,

    where the interests and needs are focused in the citizens" [11]. Governments use

    Freedom of Information (FOI) laws to define the formal rights and degrees of freedom of

    transparency within a nation. The first freedom of information laws came into effect after

    the Second World War, but in most countries these types of laws are still in development. A

    study on freedom of information laws found that in 1985 only 11 countries had adopted freedom

    of information laws, but by 2004 almost 59 countries had some form of transparency

    law passed through parliament [12]. Transparency and the right to obtain government information

    are seen as essential to corruption prevention, democratic participation, trust in

    government, accountability, informed decision making, and provisioning of information to

    the public [13]. As a tool, the internet allows for easy publishing and rapid sharing of public

    sector information in relation to Freedom of Information rights. The internet has led to

    more transparent public sector organizations that are able to respond to citizen needs more

    rapidly [15].

    The United States has a rich history of freedom of information and transparency

    policies [16]. In 1997 it experimented with one of the first government transparency

    websites, called Fedstats.com. This website publishes statistics on all the federal government

    agencies. Furthermore, in the last 20 years various

    transparency laws have been approved by the Senate. In 2006 the Federal Funding Accountability and

    Transparency Act was adopted, providing high degrees of budget transparency. A year later

    the Honest Leadership and Open Government Act followed and provided accountability and

    openness to citizens. The most recent chapter in freedom of information laws in the United States

    was the Memorandum on Transparency and Open Government5. In this memorandum

    the Obama administration calls on all federal agencies for an unprecedented level of openness.

    The memorandum declares that all departments should be transparent, participatory and

    collaborative. With this memorandum the administration promotes accountability, public

    engagement, public participation and crowdsourcing using internet technology. The most

    important development is that the United States government considers all data it gathers

    to be a national public asset that should therefore be available to all citizens in a structured

    format.

    In Europe similar policies have been adopted in the United Kingdom, Norway, Spain,

    Denmark, Estonia and Greece6. Although most of the initiatives are still in a development

    phase, some similarities can be pointed out. The Danish government launched an open gov-

    ernment strategy which contained public sector information provisioning called Offentlige

    Data I Spil aimed at providing a portal website that provides structured data to citizens.

    Similar data portals have been constructed in the United Kingdom7, the Catalan region of

    Spain (Aporta)8 and Norway9. In terms of policy some developments at the level of the

    European Commission can be pointed out. The first important piece of legislation on the use

    of public sector information is Directive 2003/98/EC on the re-use of public sector information10.

    This directive describes the development of a European data products market based

    on public sector information. The main goal of this directive is to make available, where possible,

    documents that will be re-usable for commercial and non-commercial purposes,

    where possible through electronic means. The member states are allowed to charge for the cost of

    collection, production, reproduction and dissemination together with a reasonable return on

    investment. Some European studies have been carried out on the effects of public sector information.

    The report Commercial exploitation of Europe's public sector information, issued

    by the European Commission, estimates the total value of public sector information in

    Europe at between EUR 28 billion per annum and EUR 134 billion per annum, with a central

    estimate of EUR 68 billion [17]. The last relevant European development was the eUnion

    program that ran under the Swedish presidency of the European Union. In the Visby declaration11

    the European member states declare that "EU member states and community institutions

    should seek to make data freely accessible in open machine-readable formats, for the benefit

    of entrepreneurship, research and transparency." This declaration has as of now not yet

    been put into legislation.

    Although the Netherlands scores high on the digital e-readiness ranking[18] there is no

    clear open government program as can be found in other European member states. An

    open government study found that the Dutch government lacks leadership, central coordination

    and focus, has trouble distinguishing open data and participation and is wary of the

    business case of open government[?]. The Dutch government has been experimenting with

    participation subsidies and has supported some pilots in the field of open data. In terms of

    legislation no far reaching freedom of information laws have been adopted by the govern-

    ment. Copyright, Freedom of Information and database laws still prohibit the distribution

    of open data by central government. Also, no policy programs promoting open government

    or open data have been announced. The government is however conducting some research

    into the possibilities of open data in the Netherlands. In order to successfully implement

    open data within a country a culture of freedom of information supported by legislation is

    required.

    1.2 Open Data

    Before we can elaborate on the problem definition we need a consistent definition of open data.

    Open Data is defined as the publishing of structured, free, and machine-readable public

    sector information [2], where public sector information (PSI) is information gathered by governmental

    bodies and stored in some structured form. Open Data should not be confused

    with open source or open standards, which concern software and digital communication protocols

    respectively. We have used this definition because it is used most often in literature. Furthermore,

    this definition lets us differentiate between publicly available data (which is not

    by definition free or machine readable) and open data.
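
    The definition can be read as a simple three-part test. The sketch below is only an
    illustration of that reading: the Dataset class, its field names and the two example
    datasets are hypothetical and do not come from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical description of a public sector dataset."""
    name: str
    structured: bool        # stored in a defined, consistent structure
    free: bool              # available at no cost
    machine_readable: bool  # can be processed by software without manual work


def is_open_data(dataset: Dataset) -> bool:
    """Apply the definition used in this section: all three properties must hold."""
    return dataset.structured and dataset.free and dataset.machine_readable


# Illustrative examples (assumed, not taken from the study).
timetable_feed = Dataset("timetable feed", structured=True, free=True, machine_readable=True)
scanned_report = Dataset("scanned report", structured=False, free=True, machine_readable=False)

print(is_open_data(timetable_feed))  # True
print(is_open_data(scanned_report))  # False
```

    The second example is publicly available and free, yet it does not count as open data
    under this definition because it is neither structured nor machine readable.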

    1.3 Problem Definition

    In this section we will state the societal problem that underlies our research question. The

    data governments collect in their daily operations represents an economic value, and therefore

    economic potential. This economic value currently remains untapped in the Netherlands.

    Therefore, the problem definition for this study is:

    The business potential of open government data in the Netherlands remains untapped

    which causes loss of economic activity.

    It is still uncertain what consequences an open data model has for different stakeholders

    and how the technical infrastructure changes under open data policies.

    1.4 Objective

    The objective of this study is to create a design for the provisioning of open public and

    geo-spatial transport data. This study has been conducted in a period of three months and

    is part of a larger study into the cost-benefit relations of open data at the Netherlands

    Organization for Applied Scientific Research (TNO). The study also serves as the bachelor

    thesis Industrial Engineering & Management of mr. J.P.S. van Grieken at the University of

    Groningen.

    Before we start with the design we need to establish the basic premise of our problem

    definition: open government data causes economic activity. Once we have established this, we

    need to find the main causes of the problem. Once we have found those causes we will

    create a design that addresses both the societal problem and the technical implementation.

    For scoping purposes we will be looking at two types of data: public and geo-spatial transport

    data. We chose these data types because of their market popularity in foreign open data

    initiatives12.

    Chapter 2

    Theory

    In this chapter we use theory to try to identify the causes of our problem. We will start with

    an elaboration of the economic case for open data. Then we will briefly introduce Dutch

    government information architecture, and describe how this acts as a barrier for open data.

    After that we will describe the business model of open data. This will result in elaboration

    and justification of the research question.

    2.1 The economics of open data

    The main premise of this study is that open data causes a positive economic effect. This

    chapter elaborates on the economic literature available on open data. We will first start

    with an introduction on the economic value of public sector information.

    In their daily operation governments collect data in order to perform their primary

    tasks such as determination of land ownership or running a public bus service. The data

    collected represents both an economic value and an investment value. The investment value

    of this data is what governments pay in order to collect, maintain and distribute data. The

    economic value of this data represents the part of the national income which can be

    attributed to businesses that create services using the data, or combine it with other data in

    order to add value. Studies performed by the European Commission suggest that the total

    economic value lies between €28 billion per annum and €134 billion per annum, with a

    central estimate of €68 billion [17]. In 2000 the total investment of European member states

    in public sector information was valued at €9.5bn [17].

    Usually, public services that have been paid for by taxpayers can only be used once.

    The nature of information and data however provides the option for it to be copied and

    distributed at nearly no extra cost [19]. When governments decide to publish free and

    machine-readable data, value can be created in the market in the same way. Businesses

    reusing public sector information do not need to gather the data themselves, which lowers

    the investment and time to market. Furthermore, companies will use data previously not

    available to create new services. Other economic effects of open data can be found within

    government itself. Research has shown that these forms of openness reduce corruption [20],

    which in the end leads to a more transparent and efficient government due to an effective

    allocation of knowledge [13]. These specific effects, however, are out of scope for this study.

    Before we go into the details of the economic effects of open data we can describe the value

    chain of information products in order to analyze the business case[17]. The value chain for

    information products starts with the creation or collection of various forms of data. After

    this process the data needs to be collected and stored in a form that allows for structured

    retrieval. The next step is processing and packaging, which allows for delivery of the data.

    The final delivery process is used to bring the data to the client or end-user in a form defined

    by the processing and packaging stage.

    Figure 2.1: The data value chain

    We will now give an example of how this value chain applies to the areas we have selected.

    The Dutch railway network operator ProRail embedded sensors in the rail network that

    can pinpoint the location of trains (creation). This data is collected and, together with other

    metadata, stored in a database (collection & storage). The train operators in the Netherlands

    require this data to be able to adjust train schedules. ProRail therefore packages the

    data in such a way that the operators can use it to adjust their planning and communicate

    with travelers about delays (processing & packaging). ProRail uses a computer interface to

    deliver this data to the different train operators in the country (delivery). The data that

    has been delivered to the train operators represents value because it allows the operators to

    use their equipment more efficiently and provide service to their customers. In the

    case of open data, governments will deliver the processed and packaged data at no cost to

    businesses and the public.
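
    The four stages of the value chain in the railway example above can be sketched as a
    small pipeline. This is a minimal illustration under stated assumptions: the function
    names, the record fields and the sample readings are hypothetical, and the real
    interfaces of the network operator are not described in this study.

```python
import json
from typing import Dict, List


def create() -> List[Dict]:
    """Creation: hypothetical sensor readings of train positions."""
    return [
        {"train": "IC-1", "km": 12.5, "delay_min": 0},
        {"train": "IC-2", "km": 48.0, "delay_min": 7},
    ]


def collect_and_store(readings: List[Dict], store: Dict[str, Dict]) -> Dict[str, Dict]:
    """Collection & storage: key each reading by train id for structured retrieval."""
    for reading in readings:
        store[reading["train"]] = reading
    return store


def process_and_package(store: Dict[str, Dict]) -> str:
    """Processing & packaging: select delayed trains and serialize them as JSON."""
    delayed = [r for r in store.values() if r["delay_min"] > 0]
    return json.dumps({"delayed_trains": delayed}, sort_keys=True)


def deliver(package: str) -> Dict:
    """Delivery: stand-in for a computer interface; the consumer parses the package."""
    return json.loads(package)


result = deliver(process_and_package(collect_and_store(create(), {})))
print(result["delayed_trains"])
```

    In an open data setting only the last step changes: the same package would be delivered
    to businesses and the public instead of to the train operators alone.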

    Different costing methods have been proposed for public sector information in order to

    maximize the return on investment for governments. The return governments can get on

    public sector information is a trade-off between charging directly for the data and providing

    the data at marginal or no cost at all. In the latter case the return on investment is

    achieved through regular taxation of the economic activities that businesses perform with

    the data. Pollock describes three possible pricing policies governments could use for public

    sector information distribution and investigates their returns [21]. In a profit-maximization

    strategy governments set their prices to maximize the profit given the demand for the data.

    An average-cost or cost-recovery strategy sets the price so that revenue covers the total cost of

    data collection and distribution. In this case the users of the data pay for the entire value

    chain of the data. The final policy is the marginal or zero cost strategy, in which the prices

    are equal to the short-term marginal cost. In many cases these costs will be zero because

    agencies that have already created distribution channels for the data to other government

    bodies will not have to charge for delivery of data to the market. For example, the cadastre

    already distributes geo-spatial data to local authorities and therefore should not charge

    businesses to use this delivery infrastructure. In the Netherlands depending on the specific

    government organization different pricing strategies are used. The most dominant strategies

    are profit maximization or average-cost policies.
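
    The difference between the three pricing policies can be made concrete with a small
    numerical sketch. Everything in it is an assumption made for illustration: the linear
    demand curve, its parameters, the fixed cost and the zero marginal cost are hypothetical
    and are not taken from Pollock or from Dutch pricing data.

```python
import math


def demand(price: float, a: float, b: float) -> float:
    """Hypothetical linear demand: quantity = a - b * price (never negative)."""
    return max(a - b * price, 0.0)


def profit_maximizing_price(a: float, b: float, marginal_cost: float) -> float:
    """Price that maximizes (p - c) * (a - b * p)."""
    return (a / b + marginal_cost) / 2


def cost_recovery_price(a: float, b: float, marginal_cost: float, fixed_cost: float) -> float:
    """Lowest price at which (p - c) * (a - b * p) equals the fixed cost."""
    # Rearranged into b*p^2 - (a + b*c)*p + (a*c + F) = 0 and solved for the smaller root.
    linear = a + b * marginal_cost
    discriminant = linear ** 2 - 4 * b * (a * marginal_cost + fixed_cost)
    if discriminant < 0:
        raise ValueError("fixed cost cannot be recovered at any price")
    return (linear - math.sqrt(discriminant)) / (2 * b)


def marginal_cost_price(marginal_cost: float) -> float:
    """Marginal or zero cost policy: price equals short-term marginal cost."""
    return marginal_cost


# Assumed parameters: demand of 1000 copies at price zero, falling by 10 per euro.
A, B = 1000.0, 10.0
MARGINAL_COST = 0.0    # existing delivery channel, so no extra cost per copy
FIXED_COST = 16000.0   # cost of collection and distribution to be recovered

prices = {
    "profit maximization": profit_maximizing_price(A, B, MARGINAL_COST),
    "cost recovery": cost_recovery_price(A, B, MARGINAL_COST, FIXED_COST),
    "marginal cost": marginal_cost_price(MARGINAL_COST),
}
quantities = {name: demand(price, A, B) for name, price in prices.items()}

for name, price in prices.items():
    print(f"{name}: price={price:.2f}, copies used={quantities[name]:.0f}")
```

    Under these assumed numbers the price falls from 50 to 20 to 0 and usage rises from 500
    to 800 to 1000 copies. The sketch only shows the direction of the effect; the size of
    the effect depends entirely on the real demand for a dataset.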

    Several studies have shown that the case for a marginal or zero cost policy is strong.

    A study on the economic effects of statistical data approaches the problem from the angle of

    economic theory. The study reasons that economic efficiency is maximized when services that

    are produced actually change hands in the most efficient manner to avoid waste and fulfill

    customer needs. Pricing of public sector information is therefore not economically efficient

    because the collection and distribution infrastructure is already funded by taxpayers. In this

    case strategies other than zero-cost will prevent the public from enjoying the benefit of these

    goods through consumption [22]. Another study shows that the case for marginal or zero cost

    policies is quite strong. The marginal cost of delivering data to users other than those primarily

    intended approaches zero for many government datasets. Moreover, the business demand for

    this data is likely to be high and grow over time. Furthermore, it is likely that the distribution

    of free data will generate new innovative services. It is certainly safe to assume that

    the market will be better equipped to innovate on this data than public institutions facing

    heavy regulatory and budget constraints [23].

    When we look at the economics of open data in the public and geospatial transport

    data we find that similar effects occur. A study on the impact of public sector geographic

    information in the Netherlands shows that a reduction in the price of the entire vector map

    of the Netherlands from €1 million to €200,000 caused a significant increase in demand and

    revenue for the cadastre[24]. Furthermore, a case study of the new map of the Nether-

    lands containing planning information on housing and infrastructure projects maintained

    by the Department of Housing and Spatial Planning sheds an interesting light on the increase

    of dataset usage. The department brought this dataset under creative commons license13

    making it freely available for downloading. At first, the dataset was bought on average once

    every month, but after the data was released under a public license usage increased to 200

    downloads per month[24].
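
    The price reduction reported above can be put in perspective with a little arithmetic. The
    sketch below is illustrative only. It assumes that revenue is simply price times units sold,
    which the cited study does not state, and the function name is hypothetical. It computes how
    much demand has to grow for revenue to stay level after the cut from €1 million to €200,000.

```python
def break_even_demand_multiplier(old_price: float, new_price: float) -> float:
    """Factor by which units sold must grow so that price * units stays constant.

    Assumption (not from the source): revenue = price * units sold.
    """
    if old_price <= 0 or new_price <= 0:
        raise ValueError("prices must be positive")
    return old_price / new_price


# Price of the vector map before and after the reduction, in euros.
multiplier = break_even_demand_multiplier(1_000_000, 200_000)
print(multiplier)
```

    Under this assumption, a rise in revenue after the price cut implies that the number of sales
    grew more than fivefold.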

    A similar study on the economic effects of cadastral information was performed in Spain.

    In 2004 the Catalan regional government launched a cadastral information system providing

    topographical and geo-data in an open way. Using a survey the cost-benefit effects of this

    investment for government organizations (municipalities, regional and public authorities)

    were investigated. The study showed that the information system increases the efficiency

    and workings of other governmental organizations significantly. Although the investment in

    the portal was high (€1.2 million), the benefits within other government authorities amounted

    to €2,371,000 in 2006[25]. We can conclude that in some cases internal governmental organizations

    can benefit greatly from open public sector information because data becomes available

    in a standardized way to both businesses and other branches of government.

    Most of the research on open public sector information focuses on a macroeconomic

    analysis of data provisioning. The Pira[17] study and most of the works of Pollock [19][21]

    focus on macroeconomic descriptions of the market and estimates of the value of public

    sector information. At the micro level, however, the literature lacks an analysis of the business

    cases and economics.

    2.2 Dutch government information architecture

    In order to understand the context of the ICT landscape in this study we will briefly in-

    troduce the information architecture of the Dutch Government. The Dutch Ministry of the

    Interior and Kingdom Relations is formally responsible for the ICT within the government.

    The basic architecture that the central government should follow is formulated in NORA

    (Dutch Government Reference Architecture), a set of principles, guidelines and technologies

    that branches of government can follow to organize their ICT. The goals of NORA are to guide

    individual government bodies in the design of their information architecture and to support

    policy making and deployment[27]. Within the architecture three types of principles are defined:

    basic principles, collaboration principles and regulations. The basic principles describe the

    relation between government, the public and businesses. The collaboration principles de-

    scribe interoperability constraints and finally the regulations describe technical constraints,

    standards and messages.

    In the architecture different components can be identified:

    1. Data Sources: (basisregistraties) the data sources or basis registries contain various

    forms of data the government collects.

    2. Service Bus: (servicebussen) the service bus is a data transportation facility that

    can move pieces of information through a messaging system.

    3. Transaction Gate: (transactiepoort) the transaction gate allows organizations to

    interact with the government on a machine level. For example when applying for a

    tax refund.

    4. Security and Identity: security and identity management are organized on the level

    of the individual datasets but can be accessed through one identification system called

    DigiD.

    5. Front Office: the front office systems are used by various organizations to interact

    with citizens and businesses. This can be a government website, but also a civil servant

    supporting a citizen.

    6. Organizations: the model allows for different organizations using similar architec-

    tures within their organization to interact with each other.

    The following image describes the relation between the different components.

    Figure 2.2: The Dutch Government Reference Architecture (NORA)

    The NORA architecture can be classified as a service-oriented architecture. In a service-oriented

    architecture various virtual information services are defined which can be requested

    by a user. Furthermore, service-oriented architectures use well-defined standards for messages

    and communication and are built up in a modular fashion. Technical implementations

    of these service-oriented architectures are usually web services or some other form of information

    service bus. The Dutch government is still in the phase of constructing this unified

    information service bus. In this phase the focus is on enabling interoperability by providing basic

    technical standards and policies to enable information flow between different governmental

    organizations. In the coming years it can be expected that these systems will evolve towards

    the alignment of administrative procedures and technical systems[28].

    For the deployment of vast amounts of data in an open fashion it is important that both

    the information service bus as well as alignment of technical systems and administrative

    procedures are well organized.

    Reflecting on this architecture in relation to open data we can identify several problems.

    First of all, the architecture does not include means to deliver raw data (basisregistraties)

    to businesses. The current model includes a government transaction port that allows

    for message transactions such as declaring tax. Furthermore, the central front office

    allows for the provision of services such as requesting a new passport. No data interface is

    provided in this architecture. Secondly, the current architecture only allows for security and

    identity management at the front office or transaction port. The service bus that transports

    the data is organized internally. This causes problems with open data because both public

    and non-public data travel over the same bus. Finally, the architecture does not dictate

    message or data standards that would be useful when distributing open data. We can

    conclude that the current architecture works as a barrier to open data. No central technical

    infrastructure is in place to deliver the data.
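
    The second problem, public and non-public data sharing one bus, can be made concrete with a
    small sketch. Everything in it is hypothetical: NORA defines no such message type or
    classification flag, and the registry names are made up for illustration. It only shows the kind
    of separation an open data interface would need before anything from the bus is exposed.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class BusMessage:
    """Hypothetical message travelling over the internal service bus."""
    source_registry: str  # illustrative registry name, not an official identifier
    payload: dict
    is_public: bool       # assumed classification flag; not part of NORA


def open_data_view(messages: Iterable[BusMessage]) -> Iterator[BusMessage]:
    """Yield only the messages that are explicitly marked as public."""
    for message in messages:
        if message.is_public:
            yield message


bus = [
    BusMessage("topography", {"tile": "example"}, is_public=True),
    BusMessage("persons", {"record": "example"}, is_public=False),
]
published = list(open_data_view(bus))
print([m.source_registry for m in published])
```

    Filtering on an explicit flag means that unclassified data would have to be marked public
    before it could leave the bus, which is one possible design choice rather than a requirement
    taken from the architecture.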

    2.3 Stakeholders

    In this section we elaborate more on our choice of stakeholders and how they relate to

    available literature. Most studies on open data consider only the government and

    businesses as stakeholders. We will use more specific definitions of stakeholders based on

    Rowley's e-government stakeholder definition[31].

    1. Data provider: a governmental organization delivering some form of valuable

    public transport data. The data provider is dependent on central government funding,

    but can be outside of direct democratic control. The stake of this organization is to

    fulfill its lawful obligation at the lowest cost. An example of this stakeholder group in

    the Netherlands is the Dutch cadastre.

    2. Network Operator: the network operator stakeholder is the owner of the physical

    infrastructure of the transport network (i.e. roads, tracks) and can be either a governmental

    or a non-governmental organization. An example is the rail network

    operator Prorail. A network operator can also be a data provider if law forces this

    stakeholder group to deliver this data at zero cost. As an e-government stakeholder

    the network operator can be classified as Governmental Organization.

    3. Service Operators: Using these networks to provide travel services are the service

    operators. These operators can also be a governmental or non-governmental organi-

    zation. The stake of the service operator is to provide an efficient and high quality

    travel service. An example of this stakeholder group in the Netherlands is the rail

    operator NS. As an e-government stakeholder the service operators can be classified

    as Businesses.

    4. Businesses: The businesses are privately owned for-profit organizations that can use

    data provided by the operators to create services for the traveler. The stake of this

    group is to get the data at the lowest possible cost in a usable format. As an e-government

    stakeholder the businesses can be classified as Businesses. An example

    of this stakeholder group is the navigation company TomTom.

    5. Traveler: The traveler is the end-user of the services from both the operators and the

    businesses. As an e-government stakeholder the traveler can be classified as People

    as service users. The stake of this group in this research is to maximize quality of

    services and minimize cost.

    6. Transport authorities: the transport authorities are the regulatory bodies involved

    in public transport. As an e-government stakeholder the transport authorities can

    be classified as Public Administrators. The stake of this group is to gain a good

    understanding of the transport networks in order to control safety.

    7. Civil Society: the civil society are citizens and foundations that advocate various

    subjects. As an e-government stakeholder the civil society can be classified as People

    as citizens. They are interested in the way policies are organized and what their impact

    on society is. The stake of this group in this research is to provide transparency and

    accountability to decide on and evaluate policy.

    Throughout the study these are the definitions of the stakeholders used.
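
    The stakeholder definitions above can be summarized as a simple lookup table. This is a sketch
    for reference only. The variable and function names are hypothetical, the data provider is left
    unclassified because the text gives no e-government class for it, and the class given in item 2
    is read as applying to the network operator.

```python
from typing import Dict, List, Optional

# Stakeholder group -> e-government stakeholder class, following the list above.
# None marks a group for which the text states no classification.
STAKEHOLDER_CLASS: Dict[str, Optional[str]] = {
    "Data provider": None,
    "Network operator": "Governmental Organization",
    "Service operator": "Businesses",
    "Business": "Businesses",
    "Traveler": "People as service users",
    "Transport authority": "Public Administrators",
    "Civil society": "People as citizens",
}


def groups_in_class(egov_class: str) -> List[str]:
    """Return the stakeholder groups that fall under one e-government class."""
    return [group for group, cls in STAKEHOLDER_CLASS.items() if cls == egov_class]


print(groups_in_class("Businesses"))
```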

    2.4 The business model

    In this section we describe the current business case of open data in the Netherlands. Furthermore,

    we will elaborate on some blind spots in the literature and the effects on the business

    cases of different stakeholders.

    The current business case of government data starts at different government organizations

    that collect data. These organizations collect and store the data. The data is then provided

    under legal, financial and technical limitations. In the Netherlands, no central policy on

    these limitations applies. A study on these limitations suggests that 31% of the databases

    do not allow for commercial re-use. Furthermore, in 72% of the cases the data is available

    free but only for non-commercial use. Finally, only 22% of the databases provide access

    through other means than a web interface (no direct access to the data). Only 4% of the

    databases are accessible through an API[1]. In the cases where data is not freely available, profit

    maximization or cost-averaging pricing strategies apply. The data is then sold to businesses

    that re-use the data in their applications. The businesses use some of the data to improve

    their products. The limitations in this business model cause a lack of economic activity around

    the government data.

    We found that a gap exists in the current literature on open data. Most of the research on

    distribution of public sector information at marginal cost has focused on economic (macro),

    policy or transparency effects. We put forward that to study the case of open data more

    precisely the business case of different stakeholders should be analyzed more thoroughly. In

    most of the studies conducted the stakeholders defined are government and businesses or

    the public. These narrow definitions leave little room for the investigation of effects other

    than the primary value chain and revenue models. In order to create a good design for open

    data we will need to gain more insight into the business cases of the different stakeholders

    instead of only looking at the global business model.

    2.5 Research Question

    Based on our problem definition and the exploration of the subject of open data in the

    Netherlands we are ready to introduce the research question. In the previous sections we

    made the economic case for open data and found the most important causes of our problem.

    We now need to find out how we can solve these problems with our design. We will

    focus on two causes of the problem:

    1. Pricing: we will need to find a pricing strategy that maximizes net-value for both

    businesses and government. We will design a business model that deals with this cause.

    2. Technology: we will need to find a technical infrastructure to deliver the data.

    From our theory section we expect that open data policies will cause changes in the

    business cases of different stakeholders. We will need to investigate the effects of the design

    of the new open data business model. Based on the theory and hypothesis about changes

    in the business case we can introduce the primary research question.

    What changes in the business model for public- and geospatial transport data could be

    observed when open data would be made available?

    The research question aims at finding the effects of an open data business model on various

    stakeholders. We focus on public and geospatial transport data based on the statistics of

    the American data portal data.gov. The statistics of this website show that geospatial and

    transport data are among the most popular datasets businesses tend to reuse. Furthermore,

    we focus on the Netherlands in order to be able to study the cases in detail in the amount

    of time available.

    The secondary research question focuses on solving the design question of our technical

    infrastructure. If the government were to decide on an open data policy this would require

    significant changes to the information architecture of government organizations. In the

    current closed model data is used primarily internally and therefore interfaces to other

    information systems external to the organizations have not been realized. To be able to

    deliver open data to businesses an interface should be designed. Therefore, the secondary

    research question is:

    What technical infrastructure should be provided in order to deliver open public- and

    geospatial transport data to businesses?

    Chapter 3

    Methods

    The goal of this study is to design a business case and technical infrastructure for open

    data. The study is based on a literature review, open and structured interviews of various

    stakeholders and specialists. Also various design methods such as requirements analysis,

    business model generation, ORM modeling and data warehouse modeling have been used.

    Because open data is subject to many influences concerning the economy, privacy and civil society,

    and is influenced by many different stakeholders such as citizens, businesses, civil society and civil

    servants, we believe that a literature review and a stakeholder analysis are appropriate methods to

    review the depth of the subject.

    Figure 3.1: The design process

    3.1 Literature Review

    The literature review serves to find out the theoretical underpinnings of open data. We used

    the literature review to find the main causes of the problem, and provide context to the

    topic of open data. Furthermore, we looked into the electronic government architectures,

    specifically the Dutch government's information architecture NORA.

    3.2 Open Interviews

    In order to gain more insight into the specific case of open data in the Netherlands and

    to outline the methods used to design a business case for open data, interviews with various

    specialists were conducted. These specialists range from government officials and business

    leaders to civil servants and activists. Based on these interviews and the literature review

    the structured interviews for analysis of the business case were constructed. A list of the

    interview subjects can be found in the appendix.

    3.3 Stakeholder Identification

    Based on the open interviews and the literature review we made an analysis of the relevant

    stakeholders. These stakeholders were used to select respondents for the structured interviews.

    Furthermore, this identification served as a means to establish consistent terminology

    throughout the design phase. The list of stakeholders and their description can be found in

    the previous chapter.

    3.4 Structured interviews

    Structured interviews were then performed where the interviewer used a fixed set of ques-

    tions to gain insight in both the business case and technical requirements. The interviews

    were conducted with an interview protocol based on interview techniques by Emans[36]. We

    chose this interview form because it provides a good base for comparison of the different

    answers that respondents give. We interviewed 2-3 respondents from organizations within

    every stakeholder group that we defined. The interviews were performed in a special interviewing

    room. Respondents could choose to remain anonymous. All of the conversations

    were recorded for future reference. The interviews took between 1.5 and 2 hours and were

    performed during the day. The interviews were conducted in the same order with every

    respondent. The language of the interviews was Dutch. Depending on the respondent's

    technological background the business case question set, the interface question set or both sets

    were asked. A list of the interview subjects can be found in the appendix together with

    the interview protocol.

    3.5 Business case analysis

    To be able to gain insight in the low level effects of open data an analysis of the business

    case of different stakeholders was performed. The business model generation method[26] was

    used to analyze the business case of these various stakeholders. Since the design proposes a

    change in the business model of government data provisioning, an in-depth analysis of the

    effects is required. We used Osterwalder's method to identify the effects on the business

    case of all of the stakeholders within the value chain. This method provides us with a clear

    overview of all the possible changes to these respective stakeholders. The business model

    generation method uses nine areas to describe a stakeholder's business case, which we will

    explain here:

    1. Partners: describes who the key partners are, such as suppliers or government institutions,

    and what motivates the partnership.

    2. Activities: describes what key activities are performed and how they contribute to

    the revenue streams.

    3. Value Proposition: describes what value is delivered to the customer and what

    customer need is solved.

    4. Customer Relations: describes what type of relationship the organization has with

    its customers, how costly these relationships are and how they are established.

    5. Customer Segments: describes in what markets the organization operates.

    6. Distribution Channels: describes the distribution channel of the organization.

    7. Resources: describes what resources are necessary in order to create the value propo-

    sition.

    8. Cost Structure: describes what the most important costs inherent in the business

    model are.

    9. Revenue Stream: describes the nature of the revenue streams and identifies what value

    customers are really willing to pay for.
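
    The nine areas listed above map naturally onto a record with one field per area. The sketch
    below is a minimal illustration, not the tooling used in this study. The class name, the helper
    function and the example values for a data provider are assumptions chosen for illustration.
    Comparing two such records shows which areas of a business case change between two models.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class BusinessCase:
    """One stakeholder's business case, with one field per area of the method."""
    stakeholder: str
    partners: List[str] = field(default_factory=list)
    activities: List[str] = field(default_factory=list)
    value_proposition: List[str] = field(default_factory=list)
    customer_relations: List[str] = field(default_factory=list)
    customer_segments: List[str] = field(default_factory=list)
    distribution_channels: List[str] = field(default_factory=list)
    resources: List[str] = field(default_factory=list)
    cost_structure: List[str] = field(default_factory=list)
    revenue_streams: List[str] = field(default_factory=list)


def changed_areas(before: BusinessCase, after: BusinessCase) -> List[str]:
    """Names of the areas whose content differs between two business cases."""
    return [
        f.name
        for f in fields(BusinessCase)
        if f.name != "stakeholder"
        and getattr(before, f.name) != getattr(after, f.name)
    ]


# Illustrative values only; they are not interview results.
closed = BusinessCase(
    "Data provider",
    distribution_channels=["internal delivery"],
    revenue_streams=["data sales"],
)
opened = BusinessCase(
    "Data provider",
    distribution_channels=["internal delivery", "open data warehouse"],
    revenue_streams=["government compensation"],
)
print(changed_areas(closed, opened))
```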

    The results of the business case analysis and proposed model are presented in the business

    case design section.

    3.6 Requirements analysis

    For the data warehouse design we used van Lamsweerde's requirements engineering method[29].

    Furthermore, Boehm's analysis of non-functional requirements was used to gain insight into

    qualitative aspects of the warehouse design[30]. The requirements engineering method uses

    a process of scoping, stakeholder analysis, user characteristics definitions, product perspective,

    use case analysis and requirements specification to create a software interface design.

    In order to account for non-functional requirements that might be important for the interface

    we looked for usability, safety, efficiency, performance, capacity and interoperability

    constraints.

    3.7 Data Warehouse design

    We chose to design a data warehouse as a technical solution for delivering open data to

    businesses. To design this data warehouse we used a UML-based method [33]. However,

    instead of using UML to describe the data model, we used Object Role Modeling (ORM) [34].

    This specific method was used because we have more experience with this type of modeling,

    and this method allows for detailed conceptual modeling in a compact schema. The results

    of this design are presented in the technology design section.

    Chapter 4

    Business Model Design

    In this chapter we propose a design for the business model of open data in the Netherlands.

    Furthermore, we analyze the impact of this business model on the different stakeholders.

    The current business model of public sector information works as follows. Government bod-

    ies collect various forms of transport data and store this for internal use. When a business

    wants to use this data for commercial purpose the data can be bought. This data is offered

    at a competing or cost-averaging pricing strategy. Most government organizations do not

    structure their data in open standards. Furthermore, various types of license limitations

    apply to the data. After the data has been sold, the business uses the data in an existing

    product or service which in turn is sold to an end user.

    Figure 4.1: The business model of open data

    We propose an open business model. The business model of open data for public and

    geo-spatial transport data essentially works as follows. Government organizations like the

    Ministry of Transportation, the cadastre and the public transport network operators publish

    structured, machine readable and free data sources in a data warehouse. Businesses then

    download or link to this data and create new services. These services are then provided to

    end-users. The government provides the data in a structured form based on available open

    standards.

    In this business model the situation for some of the stakeholders changes. The most

    significant changes occur for the government organizations (i.e. data provider and network

    operator stakeholder groups). In the designed business model these organizations will have

    to change:

    1. Pricing Strategy: the pricing strategy for re-use of public sector data has to change

    from competing or cost-averaging strategies to a free or marginal cost strategy.

    2. Legislation: copyright, intellectual property and database law are adjusted in such a

    way the data can be easily used by the businesses.

    3. Technical Infrastructure: the organizations provide a technical infrastructure to

    deliver the data sets or web-services to businesses.
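
    The three changes above can be read as conditions that a dataset offer has to meet before it
    counts as open in the sense of this business model. The sketch below expresses them as a simple
    check. The class, its field names and the example figures are hypothetical and serve only as an
    illustration under those assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetOffer:
    """Hypothetical description of how a government dataset is offered."""
    name: str
    price_eur: float               # price charged to a business
    marginal_cost_eur: float       # assumed cost of serving one extra user
    commercial_reuse_allowed: bool
    machine_readable: bool
    open_standard_format: bool


def is_open(offer: DatasetOffer) -> bool:
    """True if the offer is free or priced at marginal cost and freely re-usable."""
    priced_openly = offer.price_eur <= offer.marginal_cost_eur
    return (
        priced_openly
        and offer.commercial_reuse_allowed
        and offer.machine_readable
        and offer.open_standard_format
    )


# Illustrative offers; the figures are examples, not measured values.
closed_offer = DatasetOffer("vector map", 200_000.0, 0.0, True, True, False)
open_offer = DatasetOffer("vector map", 0.0, 0.0, True, True, True)
print(is_open(closed_offer), is_open(open_offer))
```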

    4.1 Effects of the model

    It can be expected that in this business model the economic activity of businesses around

    this data increases significantly. All of the stakeholders that were interviewed expect a significant

    increase in economic activity. For example, the developers behind the train iPhone

    App (Trein) expect that such a development will cause severe competition to create the best

    travel app on a mobile device. The planning service OV9292 expects that not only competi-

    tion will increase, but explains that the use of public transport will probably increase when

    travel information is more widely available. Their own research has shown that OV9292

    increases use of public transport by 8%. We can thus expect more businesses will start to

    use open data to generate revenue.

    Furthermore, it can be expected that new types of innovative services will emerge with

    open data. In New York, San Francisco and other major cities that opened up their data,

    various types of travel services emerged within months14. The respondents from the interviews

    also expect new and innovative services to emerge when government data is combined

    with commercial data sets and services. One of the examples that was mentioned in the

    interviews was a toilet finding service in Denmark. This service provides citizens with a

    bladder defect with the location of toilets in their area, a service that could not have been

    created without open data. With our business model we can expect that the business po-

    tential currently untapped in the Netherlands could be opened up. The effects that this

    business model has on the business cases of the various stakeholders will be explored in the

    next section.

    4.2 Effects on the stakeholder business cases

    This section describes the effects of the business model on the specific business cases of the

    stakeholders we interviewed. We use the definitions of the different aspects of the business

    case introduced in the methods section. For every stakeholder the aspects of the business

    case that change are described. If an aspect is not described in this section no relevant

    changes were observed.

    1. Data provider: for the data provider some significant changes to the business model

    can be observed. The most significant change is the loss of income due to different

    pricing strategies. The revenue streams of these data providers change because they

    will have to compensate for the loss of income. For example, the cadastre expects that

    open data will force them to provide topographic data and information on the legal

    status of land for free. However, to maintain the quality expected by law costs have to

    be incurred. Somehow the loss in income has to be compensated. Also, organizations

    like OV9292 explained that providing the data for free would probably cause a loss in

    income on for example the timetable services. They also pointed out that certain data

    quality requires maintenance and expertise, which costs money. At the business end

    stakeholders agree that this quality of data is one of the most important requirements

    for them to re-use the data. We propose that this loss of income is compensated by

    the national government since it is the beneficiary of the effects of open data through

    taxation. Furthermore, the distribution channels of the data providers will change.

    Based on the interviews we can observe that both the cadastre and the providers of

    transport data fear this loss in income. The cadastre furthermore fears that national

    government is not willing to compensate for the loss of income. In this case they will

    either decrease the number of key activities, or will increase the price of other products

    they currently deliver to the market.

    Furthermore, some organizations will have to provide a technical infrastructure to

    deliver vast amounts of data to businesses. This infrastructure will change the way

    distribution channels are organized. This change in infrastructure will also require an

    investment in technology for some of the organizations. Other areas of the business

    case of these organizations like customer segments, resources and partners will not

    change in our business model.

    2. Network Operator: for the network operator the most significant changes occur

    when they are a provider of data. For example, in the railway sector Prorail maintains

    the network and provides the data on locations of trains to the different service

    operators on the network. In this case the change in pricing strategy will decrease

    their overall income. However, the network operators in general are already obliged to

    provide this data to their main customers: the service operators under Dutch public

    transport law (wet personenvervoer). The travel information service OV9292 said that they

    would make the data available if requested. However, this would be the raw data,

    but not the planning service they provide. OV9292 thinks that this planning software

    is the core intellectual property, not the raw data. The most significant change for

    the network operator is the change in customer segments. When open data would be

    introduced a new group of customers for the data would emerge: businesses.

    3. Service Operators: for the service operator changes in the cost structure will occur.

    Data that was only commercially available can now be obtained at zero or marginal

    cost. For some operators like for example NS this could be a significant decrease

    in cost for data collection. Furthermore, based on the interviews with OV9292 the

    availability of free public transport data will increase the number of customers that

    use their services. This increases the volume of the revenue stream obtained from

    travel services.

    4. Businesses: like the data providers, the changes to the business model of businesses

    are significant. In the old model businesses had to pay for the acquisition of data

    from government bodies. In the proposed model this data is available for free, which

    significantly lowers the cost of acquisition of data products. Furthermore, by enforcing

    the use of open standards the cost for changing the data into appropriate formats will

    decrease. We can therefore conclude that the cost structure of these businesses changes

    in the business model.

    Furthermore, based on the interviews we can conclude that competition will increase.

    Respondents expect that the barrier to entering the market with a certain service will

    be lowered. For example, one of the respondents expects that navigation products of

    acceptable quality could be made with the map provided by the cadastre. The main cause

    of this lower barrier is that no significant investment in the acquisition of high quality

    mapping data is required when the map can be downloaded for free from the cadastre.

    Also, key activities of some businesses can change due to the change in the business

    model. For example, commercial mapping organizations like Google, TomTom and

    Navteq currently rely on land surveying and other mapping techniques for their mapping

    products. At least 20 properties of these mapping products could be made available

    for free through the cadastre. Several business organizations pointed out that it is

    important that the data is license free and that coverage and quality of the data are

    guaranteed.

    5. Traveler: for travelers we cannot really speak of a business case. We will, however, state

    the obvious changes this stakeholder faces in our business model. The traveler will

    experience an increase in the number of services available to them. Furthermore, due

    to the increase in competition the quality and functions of the services provided will

    probably increase.

    6. Transport authorities: since the transport authorities play no vital role in the

    business model we will deem them out of scope. One effect that we might expect

    for transport authorities is that the availability of more data will

    give vital insight into the performance of the transport networks. This could lead to

    better policies at the government level.

    7. Civil Society: civil society organizations currently play no significant role in the

    business model of open data. However, it can be expected that civil society organizations

    will engage in the creation of social applications. These applications were previously

    too expensive to develop because of the data acquisition efforts, but become viable in

    our new model. One example of this type of application is Schoolscope in the

    United Kingdom. This website offers parents a benchmark of the quality of schools.

    Another application reports on hazardous locations in the New York Manhattan area

    based on traffic data published by the government.

    By using the business model generation method we found that the most significant

    changes in our design are a change in cost structure of the providers and users of data.

    Chapter 5

    Technology Design

    One of the causes of the problem is the lack of technical infrastructure to deliver high quality

    data to businesses at high speed. We performed a requirements analysis that has led to a

    technical solution to our problem. In this chapter we propose a design of a data warehouse

    for public and geo-spatial transport data.

    A data warehouse is essentially a data storage and decision support system based on a

    variety of different datasets. In business, data warehouses are frequently used as management

    support tools. A data warehouse is always subject-oriented and records and interprets

    attributes of these subjects over time. Some examples of subjects in our case are vehicles,

    stops and travelers. We chose to design a data warehouse rather than a normal database

    system because a data warehouse allows for decision support (planning) and can cope with

    multiple sources of different information. The scope of this design is an analysis of the

    landscape in which the warehouse will operate, a draft architecture of the different data

    warehouse layers, a data model for the storage of public and geospatial transport data, an

    interface design and recommendations on standards and hardware. We will not look into

    front-end applications, query structure, optimization, rollout or maintenance aspects of the

    data warehouse. We used the UML-based data warehouse design method to create this

    design [33].

    5.1 Landscape

    Before we can describe the interface design we need to define the context architecture in relation

    to the value chain. The data warehouse collects data from different data providers and

    network operators. This data is processed and packaged in the warehouse. We assume that

    the Service Interface for Real Time Information standard (CEN/TS 15531), defined by the

    European Committee for Standardization (CEN), will be used. It covers timetables,

    network monitoring, vehicle monitoring, connection monitoring and a general message

    service. For the geographical data various vector forms can be distributed. In

    this study we assume that the Web Map Service, Web Feature Service and Web Map Tile Service of

    the Open Geospatial Consortium are used. For the traffic and delay data we suggest using

    the European Open Travel Data Access Protocol (OTAP) and the standards defined by the

    National Database Road-traffic (NDW).

    Figure 5.1: The data warehouse in its context

    After the data is processed and packaged it can be delivered through the interface. Public

    transport data can be defined as data regarding the physical infrastructure (stops, stations,

    routes), the timetable (planning, platforms), and the status of the network (delays, outages).

    Geo-spatial transport data can be defined as data regarding the main motorway network

    (network, ramps) and the status of the network (traffic jams).

    5.2 Warehouse Architecture

    This section describes the general architecture of the data warehouse. A data warehouse

    is generally built up out of four main components. First, there are multiple data sources

    that provide different sorts of information to the data warehouse. In our example road, train,

    network and mapping data feed into the data warehouse. After the data has been processed

    through the different layers of the data warehouse it is offered to users in a data mart. This

    data mart is a subset of the larger data store and is oriented to either public transport or

    road network relevant data. When a user requests certain data from the data mart through

    the interface (API) it can be re-used in an application. In this model we also included a

    planning layer that can interpret the different sorts of raw data and return routing and

    planning information.

    We explicitly place this layer outside the data processing part of the data warehouse

    because we want to keep this planning capability of the data warehouse optional. We want

    to keep this optional because these specific types of planning packages are also used in the

    market and might introduce unfair competition to other vendors of planning software.

    Figure 5.2: The data warehouse architecture

    The source layer of the data warehouse is the physical infrastructure that gathers the

    data from the different data sources. In our data warehouse the data sources either push

    the data to the data warehouse at some predetermined interval, or a separate data scraper

    is used to collect the data. In the extraction layer the scheduling of the data extraction from

    the data sources is organized. For example, the vector map of the road network probably

    will not require an update more often than once or twice every week, whereas the location of

    a train will probably have to be updated every 30 seconds. Some data warehouses feature

    a staging area that is used to normalize the data and check for quality, coverage and other

    constraints. Such a staging area would be relevant if a large number of data sources were

    used and if the quality of this data could not be trusted. Since the providers of the

    data are all known, agreements can be made on these aspects of the data delivery and we

    will not require data staging. In the ETL (Extraction, Transformation and Load) layer the

    data from the extraction layer is used and transformed into the relevant data structure,

    meta data is extracted and the data is loaded into the databases. In this process the data

    is checked for integrity, cleaned and sometimes translated. The ETL stage does not

    directly operate on the databases of the data warehouse but uses staging tables. Depending

    on the requirements of the data and the update frequency the different steps used can vary.

    After the ETL layer the data is processed in the storage layer. This layer is essentially the

    database management system (DBMS) of the data warehouse. The primary task of this

    layer is to store and retrieve data from the data warehouse. It uses the ACID properties

    (atomicity, consistency, isolation, durability) to guarantee data warehouse transactions are

    processed reliably. The storage layer pushes different types of data on set intervals to the

    two data marts that we included in the design. The data marts are a subset of the data

    present in the data warehouse relevant to the user group. We use two different data marts

    for different redundancy purposes. First, the data marts can be hosted on different hardware

    environments than the data warehouse. This ensures that if the data warehouse for

    some reason goes offline, data can still be extracted. Furthermore, if these data marts did

    not exist and the API were coupled to the data warehouse directly, a failure in the data

    warehouse would cause both the vital road and public transport information infrastructure

    to go offline together. This could lead to major delays on both the public transport and

    road network. Finally, the data marts allow for a much cheaper failover environment than

    the data warehouse. Because a data mart is essentially a big cache of a subset of the data

    warehouse it could be mirrored onto different physical locations. The final layer in our data

    warehouse design is the interface with the end-users. This interface design will be defined

    further on in this chapter.
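
The per-source scheduling handled in the extraction layer can be illustrated with a minimal Python sketch. The source names and the helper functions are hypothetical and not part of the design; the intervals follow the examples given in the text (a road vector map refreshed about weekly, train locations every 30 seconds).

```python
from dataclasses import dataclass


@dataclass
class Source:
    """A data source with its own refresh interval (names are illustrative)."""
    name: str
    interval_s: float                   # seconds between two extractions
    last_run_s: float = float("-inf")   # time of the last extraction


def due_sources(sources, now_s):
    """Return the names of sources whose interval has elapsed at time now_s."""
    return [s.name for s in sources if now_s - s.last_run_s >= s.interval_s]


def mark_extracted(sources, names, now_s):
    """Record that the named sources were extracted at time now_s."""
    for s in sources:
        if s.name in names:
            s.last_run_s = now_s


sources = [
    Source("road_vector_map", interval_s=7 * 24 * 3600),  # about weekly
    Source("train_locations", interval_s=30),             # every 30 seconds
]

first = due_sources(sources, now_s=0)    # nothing extracted yet: both are due
mark_extracted(sources, first, now_s=0)
later = due_sources(sources, now_s=45)   # only the fast-moving feed is due again
```

Keeping the interval as a property of each source means a new feed can be added without changing the scheduling logic itself.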

    5.3 Data Model

    To be able to store data in our data warehouse we will have to model the data first. For the

    geo-data and traffic data some good internationally accepted data models are already freely

    available to use. We chose to adopt these standards in our design. For the geo-spatial

    information the OpenGIS Map Service standard will be used [35]. The road data model will

    be based on the model already used by the Dutch National Database Road-traffic (NDW). However,

    such a well-defined data model is missing for public transport data in the Netherlands. Some

    efforts have been put into the BISON standard. This standard, however, only models the

    interfaces between various service providers in the public transport domain. For the public

    transport data a draft version of the BISON standard and the interviews have been used to

    derive a data model. We tried to combine the BISON standard with the already available

    CEN/TS 15531 standard for public transport defined by the European Committee for Standardization.

    Figure 5.3: Available data models

    Based on the service interface requirements we used the Object Role Modeling (ORM)

    technique [34] to generate the model for public transport. The model only describes the

    conceptual data relations in the data warehouse. We used nine elementary object types

    to describe the domain of public transport.

    Figure 5.4: The ORM data model for public transport

    The vehicle object type is the physical means of transportation (e.g. train, bus, taxi)

    and has various attributes such as a location, capacity and the availability of a toilet. A

    vehicle is maintained by a certain service operator, which only has a name in our model. On

    the infrastructure side of the spectrum we defined a stop, platform and connection. A stop

    is a physical location where a vehicle can stop to drop off travelers. A stop can have multiple

    platforms. The route between two stops or platforms can be defined as a connection, which

    has a distance and can be available or unavailable. A connection is maintained by a network

    operator. Furthermore, the unique combination of a connection, vehicle and a planning

    item results in a schedule. The planning item contains a departure and arrival timestamp

    (date & time) and may contain a note for the operator. Different planning items together

    generate a route for a passenger. When the planning changes an exception can be created.

    This exception is a message to the traveler and operators that a certain planning item has

    changed. An exception can also be a single message that has no influence on the planning.
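
As an illustration only, the nine object types described above could be written down as Python data classes. This is a sketch of the conceptual relations, not the ORM model itself: all field names are assumptions, and for brevity a connection here links two stops rather than stops or platforms.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class ServiceOperator:
    name: str                      # an operator only has a name in the model


@dataclass
class NetworkOperator:
    name: str


@dataclass
class Vehicle:
    kind: str                      # e.g. "train", "bus", "taxi"
    operator: ServiceOperator
    capacity: int
    has_toilet: bool
    location: Optional[str] = None


@dataclass
class Platform:
    label: str


@dataclass
class Stop:
    name: str
    platforms: List[Platform] = field(default_factory=list)


@dataclass
class Connection:
    origin: Stop
    destination: Stop
    distance_km: float
    maintained_by: NetworkOperator
    available: bool = True


@dataclass
class PlanningItem:
    departure: datetime
    arrival: datetime
    note: str = ""


@dataclass
class Schedule:
    # The combination of a connection, a vehicle and a planning item.
    connection: Connection
    vehicle: Vehicle
    item: PlanningItem


@dataclass
class ExceptionMessage:
    message: str
    affects: Optional[PlanningItem] = None  # None: no influence on the planning


# Hypothetical example data, purely to show how the types fit together.
stop_a = Stop("Stop A", [Platform("1"), Platform("2")])
stop_b = Stop("Stop B")
link = Connection(stop_a, stop_b, distance_km=12.5,
                  maintained_by=NetworkOperator("Example Network"))
bus = Vehicle("bus", ServiceOperator("Example Operator"), capacity=50, has_toilet=False)
item = PlanningItem(datetime(2011, 2, 28, 9, 0), datetime(2011, 2, 28, 9, 20))
schedule = Schedule(link, bus, item)
delay = ExceptionMessage("Departure delayed", affects=item)
```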

    5.4 Interface

    To connect the data warehouse to the business users an Application Programming Interface

    (API) will be constructed. The interface will act as a data provisioning system for public

    transport and geo-spatial data. A separate API will be constructed for each data type:

    one for the public transport data and one for the geo-spatial transport data.

    The interface will be run as a web service that allows for access through the HTTP protocol

    (over the web). The interface will be constructed on a Representational State Transfer

    (REST) communication bus that uses messages formatted in Extensible Markup Language

    (XML). The choice for REST is based on its focus on different system states that can be

    retrieved through the interface using common HTTP methods (like GET, POST, PUT, DELETE).

    This type of API provides scalability, safety, stability, generality in interfaces and latency

    reduction, and is flexible enough to extend with more services in the future. For the messages

    that are being sent through the interface the XML standard will be used. XML is a W3C

    approved standard for machine readable document markup. It provides enough

    freedom to define custom schemas for the purpose of geo and public transport data

    provisioning without losing standardization.
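
To give an impression of what an XML message returned by such an interface might look like, the following Python sketch builds and parses a small payload with the standard library. The element and attribute names are invented for illustration and do not follow the CEN/TS 15531 schema; the vehicle identifier and coordinates are arbitrary.

```python
import xml.etree.ElementTree as ET


def vehicle_location_message(vehicle_id, latitude, longitude):
    """Build a small XML payload; element names are illustrative only."""
    root = ET.Element("VehicleMonitoring")
    vehicle = ET.SubElement(root, "Vehicle", attrib={"id": vehicle_id})
    ET.SubElement(vehicle, "Latitude").text = f"{latitude:.5f}"
    ET.SubElement(vehicle, "Longitude").text = f"{longitude:.5f}"
    return ET.tostring(root, encoding="unicode")


# What a client might receive in the body of a GET response and parse again.
xml_text = vehicle_location_message("bus-42", 52.0, 5.0)
parsed = ET.fromstring(xml_text)
vehicle_id = parsed.find("Vehicle").get("id")
latitude = parsed.find("Vehicle/Latitude").text
```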

    A REST interface can be built on different programming languages, databases and services.

    Since the systems that are being used by the different data providers are unknown to us

    some assumptions have to be made. We assume that the data providers want high flexibility

    and extensibility in the programming language. Furthermore, they want low implementation

    and maintenance cost. Finally, they want the interface to be compatible with the wishes of

    the third party developers.

    Taking into account these requirements the interface will be built in Python. Python is

    a multi-paradigm language allowing programmers to incorporate different styles of coding.

    Python is a stable language that is provided natively in many Linux distributions and works

    with Oracle web servers. Many large organizations like Google, ABN-AMRO,

    CERN and NASA use Python for their interfaces.

    Depending on the relation with the data provider (either local caching or direct API) a

    database is required. This interface will be built on an Oracle 11g

    database. The database can be manipulated using Structured Query Language (SQL), which

    is an international standard for interaction with relational databases.

    The interface will deliver data through web services. When a user registers for an API key

    the services can be used. We split the API for the rail and road network into two separate

    APIs for redundancy. We believe this redundancy is required because if the system were

    to be one single API, a failure would result in no transportation data whatsoever. For the

    public transport data the following categories of service calls to the API can be defined:

    1. Planning Services: the planning service category contains several planning and

    decision services. These services are used to determine optimal routes based on various

    parameters. The most important services are the Planned Timetable Service, which

    returns the current timetable, and the Estimated Timetable Service, which also takes into

    account the actual state of the network and adjusts the planning accordingly.

    2. Monitoring Services: the monitoring services category contains several network

    monitoring services. The goal of these services is to determine the current state of the

    networks and vehicles. The exception monitoring service provides information on

    network exceptions like the failure of turnpikes. The stop monitoring service provides

    information on the stations and platforms. The vehicle monitoring service provides

    information on the location of individual vehicles. Finally, the network and connection

    monitoring service provides meta-information on the state of the network.

    3. Other Services: the other services category contains services that relate to pricing,

    messaging and interaction with the network operator.
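
The service categories listed above could be exposed as REST resources behind an API-key check. The sketch below shows one possible dispatch scheme in Python; the URL paths, handler functions, key value and response fields are all hypothetical and are not taken from the design in the appendix.

```python
VALID_API_KEYS = {"demo-key"}  # hypothetical: keys issued when a user registers


def planned_timetable(params):
    """Placeholder for the Planned Timetable Service."""
    return {"service": "planned-timetable", "stop": params.get("stop")}


def vehicle_monitoring(params):
    """Placeholder for the vehicle monitoring service."""
    return {"service": "vehicle-monitoring", "vehicle": params.get("vehicle")}


# One entry per service call, keyed by HTTP method and resource path.
ROUTES = {
    ("GET", "/planning/planned-timetable"): planned_timetable,
    ("GET", "/monitoring/vehicle"): vehicle_monitoring,
}


def handle_request(method, path, params, api_key):
    """Return an (HTTP status, body) pair for a single API call."""
    if api_key not in VALID_API_KEYS:
        return 401, {"error": "unknown API key"}
    handler = ROUTES.get((method, path))
    if handler is None:
        return 404, {"error": "unknown service"}
    return 200, handler(params)


status, body = handle_request(
    "GET", "/monitoring/vehicle", {"vehicle": "bus-42"}, "demo-key"
)
```

A table of routes like this keeps each service category independent, so a new service can be registered without touching the key check or the other handlers.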

    For the road network data the following categories of service calls to the API can be

    defined:

    1. Planning Services: the planning service category contains two services that can

    return the delays on the specific sections of road. Furthermore, the estimated capacity

    service returns the probability of a capacity shortage on a certain section of road based

    on real time measurement and statistical data.

    2. Monitoring Services: the monitoring services category contains several network

    monitoring services. The goal of these services is to determine the current state of the

    network and connections. Several different services report on planned maintenance,

    incidents, connections etc.

    3. Map and Network Services: the map and network category contains services returning

    static data on the road network. Several services provide a download of the latest

    version of the road vector map, static information on junctions and exits and static

    information on road facilities and signs.

    4. Other Services: the other services category contains services that relate to pricing,

    messaging and interaction with the network operator. Furthermore, it provides streams

    from video cameras and weather stations at the roadside.

    A more extensive analysis of the services and the design can be found in the appendix.

    5.5 Hardware

    The data warehouse will have to run on a solid physical infrastructure. We will present

    some recommendations on the hardware of the data warehouse. We will have to take into

    account the scalability, parallel processing capabilities, database management / hardware

    combination and cost effectiveness of the hardware environment. Based on the expected

    usage of the data warehouse we can expect that the system will sometimes require a high

    peak capacity. For example, when major malfunctions in the public transport system occur

    the number of API requests per minute can triple. We cannot plan for these types of outages,

    so our hardware will have to be able to cope with these peak loads. Furthermore, since high

    volumes of API requests are performed on the system, parallel processing support could

    increase reliability and speed. Finally, it is important that the software and operating systems

    used match the database management tool that we selected.

    The goal of this recommendation is to find a solution that has a high reliability and

    is cost-efficient. We recommend the use of cloud-oriented hardware. In a cloud server

    setup virtual server capacity is rented from a cloud infrastructure provider like Amazon.

    The advantage of cloud operated services is that they can scale elastically with the

    end-user demand. Furthermore, cloud infrastructure providers have preconfigured virtual servers

    readily available for use. This will reduce the cost for maintenance personnel significantly.

    A possible specification for this hardware could be:

    Amazon Elastic Compute Cloud (Amazon EC2)

    Servers: High-Memory Double Extra Large Instance, 34.2 GB of memory, 13 EC2

    Compute Units (4 virtual cores with 3.25 EC2 Compute Units each), 850 GB of local

    instance storage, 64-bit platform. This setup allows for high transaction volumes.

    Operating System: Oracle Enterprise Linux

    Database System: Oracle Database 11g

    Application Server (running Python): Oracle WebLogic Server

    Service Packages: Amazon Elastic Block Store, Elastic IP Addresses, Amazon

    Virtual Private Cloud, Amazon CloudWatch, Auto Scaling, Elastic Load Balancing

    5.6 Qualitative Aspects

    The final design specifications for this data warehouse have a non-functional nature. We have

    investigated the performance aspects of the database based on the interviews. For the geo-spatial

    data we can expect 5000-10000 requests / min. With the public transport data we

    expect 500 planning requests, which we estimate will cause 5000 requests / min. We were

    unable to retrieve the expected amount of requests for the road network. We estimate the

    number of requests to be 5000 / min. The total number of requests that should be handled

    by the data warehouse should therefore be up to 20,000 API requests per minute.
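
The total can be checked with a short calculation. The sketch below adds up the three estimates from this section and converts the upper bound to requests per second. Applying the tripling factor from the hardware section to the whole load is an assumption made here for illustration; the text mentions it only for public transport malfunctions.

```python
# Estimates taken from the text, in API requests per minute.
geo_per_min = (5000, 10000)        # geo-spatial data, lower and upper estimate
public_transport_per_min = 5000    # caused by about 500 planning requests
road_per_min = 5000                # the authors' own estimate

low_total = geo_per_min[0] + public_transport_per_min + road_per_min
high_total = geo_per_min[1] + public_transport_per_min + road_per_min

per_second = high_total / 60       # average load at the upper bound

# Assumption: the whole load triples during a major malfunction.
peak_per_min = 3 * high_total
```

Under these assumptions the upper bound corresponds to roughly 333 requests per second on average, and a tripled peak would reach 60,000 requests per minute.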

    The update frequency of the data depends on the specific type of data. The vector map

    has an update speed of twice a year, while the locat