Automated (meta)data collection – problems and solutions Grete Christina Lingjærde and Andora Sjøgren USIT, University of Oslo

Automated (meta)data collection – problems and solutions

Grete Christina Lingjærde and Andora Sjøgren

USIT, University of Oslo

Topics

• Authoritative registers

• Import from HR

• ITAR

• Open Access

What is Frida?

• Frida is an integrated research environment for the documentation and presentation of research activities, research results and scientific competence.

• Data from Frida is used to generate statistics for research activities at Norwegian universities. Information provided by this system plays a major role in the annual funding of universities by the Norwegian Ministry of Education and Research.

• Therefore, data quality has been a major issue in the development of the system.

What does Frida provide?

• a unified view of researchers, research projects and research production at all organizational levels of the institution

• a flexible and distributed model for registration and validation of data, where researchers have full insight into and control over their own data

What does Frida provide – 2?

• direct import of research publications from ISI and Norart, which makes registration less time-consuming

• a system suitable for internal presentation and external profiling of research groups, research centers, departments, etc.

• a system that satisfies the government’s demands for documentation of research production

Information needs that Frida has to cover

Internal needs

• Internal division/distribution of assets

• Presentation/overview of scientific activities

• Information for developing an institutional strategy for research activities

Government

• Reports to DBH at an aggregated level

Financial model

• Research activities are part of the basis for the grants/funds given to the universities

Profiling of researchers and research activities

The five modules of Frida

• Research results

• Projects

• Scientists

• Research units

• Annual reporting

The authoritative registers

The system contains registers (separate tables) of:

• periodicals, series

• publishers

• organizations (institutions)

• common code tables

Frida institutions share these registers. The common use and maintenance of these registers is an important quality measure in Frida.

The Institution register

• The Institution register is a common register containing data on cooperating institutions, both national and international.

• Each institution is assigned a unique number called the 'workplace code'. This code is used as the root element when describing the institution's hierarchy of workplace codes in XML.

• The Institution register is maintained by FS, and automatically copied to Frida

The Institution register - 2

• Initially we had problems caused by the delay between the time an institution was registered in FS and the time the data became available in Frida.

• The register now contains some 20 000 institutions, so this problem rarely occurs today.

Import of institutional data from local systems

Before Frida can be used at an institution, some data specific to the institution must be in place in Frida:

• Data about places: workplace codes for every unit

• For a user to be able to register data in Frida, a personal record must be imported into Frida from the institution’s user administrative system.

Data in such systems are based on data from the institution’s human resource (HR) system. In other words, a user must be employed at the institution or in some other way be associated with the institution in order to register data in Frida.

Import of institutional data from local systems - 2

• The ability to import data from HR systems into Frida means that data delivered in a specified format can be loaded directly into Frida

• The specified format is described in XML

• The main benefit of importing is simplified maintenance of data concerning people and employments. This can be very labour-saving for large institutions and institutions with high turnover.
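
As an illustration of loading person data delivered in XML, the following is a minimal sketch. The slides do not show the actual import format, so the element names, attribute names and sample values (persons, person, employment, workplace, the identifiers) are all assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical import file. The real Frida import format is not shown
# in the slides; every element and attribute name here is an assumption.
SAMPLE_XML = """
<persons>
  <person id="id-0001">
    <name>Kari Nordmann</name>
    <employment workplace="185.15.5.0" role="professor"/>
  </person>
  <person id="id-0002">
    <name>Ola Hansen</name>
    <employment workplace="185.15.6.0" role="guest"/>
    <employment workplace="185.15.7.0" role="guest"/>
  </person>
</persons>
"""


def parse_persons(xml_text):
    """Return one dict per <person>, including all workplace codes."""
    root = ET.fromstring(xml_text.strip())
    persons = []
    for node in root.findall("person"):
        persons.append({
            "person_id": node.get("id"),
            "name": node.findtext("name", default="").strip(),
            "workplaces": [e.get("workplace") for e in node.findall("employment")],
        })
    return persons


records = parse_persons(SAMPLE_XML)
```

Each parsed record carries the person identifier and the workplace codes, which are the two keys the import relies on when connecting people to units.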

Import of institutional data from local systems - 3

Because local HR-systems are authoritative sources for data about persons, importing correct information ensures better quality of data.

The guiding principle is to register data only once and in the authoritative system for this data.

When we join together data from multiple systems, we use social security numbers for persons and workplace codes for organisational units.
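
The joining principle above can be sketched as follows. This is only an illustration under assumed record layouts: the field names (person_id, unit_code) and the sample values are hypothetical, and the placeholder identifiers stand in for social security numbers.

```python
def link_records(persons, units, publications):
    """Join publications to persons and units using the two shared keys.

    Returns (linked, unmatched): publications whose person identifier or
    workplace code is unknown are reported instead of silently dropped.
    """
    persons_by_id = {p["person_id"]: p for p in persons}
    units_by_code = {u["unit_code"]: u for u in units}
    linked, unmatched = [], []
    for pub in publications:
        person = persons_by_id.get(pub["person_id"])
        unit = units_by_code.get(pub["unit_code"])
        if person is None or unit is None:
            unmatched.append(pub)
            continue
        linked.append({
            "title": pub["title"],
            "author": person["name"],
            "unit": unit["name"],
        })
    return linked, unmatched


# Hypothetical sample data from three different systems.
persons = [{"person_id": "id-0001", "name": "Kari Nordmann"}]
units = [{"unit_code": "185.15.5.0", "name": "Department of Informatics"}]
publications = [
    {"title": "Paper A", "person_id": "id-0001", "unit_code": "185.15.5.0"},
    {"title": "Paper B", "person_id": "id-9999", "unit_code": "185.15.5.0"},
]

linked, unmatched = link_records(persons, units, publications)
```

Reporting unmatched records separately makes it visible when a person or unit is missing from the authoritative source, which is where the correction should be made.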

Personal data register

Each Frida institution has its own personal data register.

Challenge:

Guests/associated persons such as visiting researchers, professors emeritus, etc, are not always registered in the local personnel system.

Solution: Non-employees are registered in the personnel system as guests, after which their data are imported into Frida.

History of the organization structure

Frida contains only the current organization structure. In our experience, many institutions have enough difficulty keeping the present structure correct.

The history of changes to the organization structure should be maintained in other systems, for example in a human resource system.

Frida is not the authoritative system for data representing the organization structure of the institution!

Changes in the organization structure – challenges

Frida can automatically change the code of an organization unit to another code. Data about publications, projects, etc. are then connected to the new unit code and detached from the old one.

If an organization unit is divided into two or more units, the work of connecting the data to the right unit must mostly be carried out manually.

When a unit code is no longer referenced, it can be removed.

It is important that Frida is updated when the organization structure is changed. If Frida is not updated, some organization units may occur several times, and persons, publications, etc. may be connected to the wrong units.
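
The automatic case described above (one unit code replaced by another, followed by removal of codes that are no longer referenced) can be sketched as below. The record layout and the codes are hypothetical; the sketch does not cover the case where a unit is split, which the slides say is mostly manual work.

```python
def recode_unit(records, old_code, new_code):
    """Point every record at new_code instead of old_code; return the count moved."""
    moved = 0
    for rec in records:
        if rec["unit_code"] == old_code:
            rec["unit_code"] = new_code
            moved += 1
    return moved


def unreferenced_codes(unit_codes, records):
    """Return the unit codes that no record refers to any more."""
    used = {rec["unit_code"] for rec in records}
    return sorted(set(unit_codes) - used)


# Hypothetical data: unit 185.15.5.0 is renumbered to 185.15.8.0.
unit_codes = ["185.15.5.0", "185.15.6.0", "185.15.8.0"]
records = [
    {"title": "Paper A", "unit_code": "185.15.5.0"},
    {"title": "Project B", "unit_code": "185.15.5.0"},
    {"title": "Paper C", "unit_code": "185.15.6.0"},
]

moved = recode_unit(records, "185.15.5.0", "185.15.8.0")
removable = unreferenced_codes(unit_codes, records)
```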

Changes in the organization structure – challenges - 2

It is important that the institutions know their organization structure, the codes that represent it, and how these codes are used in different systems and why they matter.

Documented routines for how these codes are updated are important.

In our experience, institutions pay attention to this problem when they first start using a system. Once things run automatically, they tend to lose focus on this area.

A new financial model

The Norwegian documentation system for research funding was approved by the Ministry of Education and Research in 2005, and the model was applied for the first time during budget allocations in 2006.

The system is designed to facilitate a performance-based distribution of research funding to institutions based on factors including academic publishing activity.

Central initiatives

The Ministry of Education and Research took the initiative to improve the quality of publication data. This resulted in:

• (1) The creation of a national register of publication channels (periodicals, series, publishers) and institutions (organizations).

• (2) An information pool of bibliographic data to be distributed to local research documentation systems.

• A system called ITAR (Import Service and Authority Registers) was developed in order to organize information from authoritative registers and bibliographic data. These data are made available to Frida via an export service in ITAR. Suppliers of bibliographic data: ISI, Norart and BIBSYS.

Data from external bibliographical data sources - 1

An import component has been developed in the Frida application. It allows academic staff to import their own publications and administrative staff to import all publications for their institution.

The import component in Frida has been designed to handle the different statuses a publication may have:

• The import publication has already been manually registered

• The import publication has already been imported but lacks additional data

• The import publication is new (has not been previously registered in Frida)
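
The three statuses listed above can be sketched as a small classification step. This is an illustration only: the registry layout, the lookup key, the "origin" field and the list of required fields are hypothetical, and the fourth outcome (already imported and complete) is an assumption added so that the function always returns a value.

```python
from enum import Enum


class ImportStatus(Enum):
    ALREADY_REGISTERED_MANUALLY = "already registered manually"
    ALREADY_IMPORTED_INCOMPLETE = "already imported, lacks additional data"
    NEW = "new"
    # Assumption: not listed in the slides, added for completeness.
    ALREADY_IMPORTED_COMPLETE = "already imported, complete"


# Hypothetical set of fields a complete record is expected to have.
REQUIRED_FIELDS = ("title", "year", "publication_channel")


def classify(incoming_key, registry):
    """Decide which status an incoming publication has relative to the registry."""
    existing = registry.get(incoming_key)
    if existing is None:
        return ImportStatus.NEW
    if existing["origin"] == "manual":
        return ImportStatus.ALREADY_REGISTERED_MANUALLY
    missing = [f for f in REQUIRED_FIELDS if not existing.get(f)]
    if missing:
        return ImportStatus.ALREADY_IMPORTED_INCOMPLETE
    return ImportStatus.ALREADY_IMPORTED_COMPLETE


# Hypothetical registry keyed by an identifier from the data supplier.
registry = {
    "pub-1": {"origin": "manual", "title": "Paper A", "year": 2005},
    "pub-2": {"origin": "import", "title": "Paper B", "year": 2006},
}

status_manual = classify("pub-1", registry)
status_partial = classify("pub-2", registry)
status_new = classify("pub-3", registry)
```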


During the import phase, a selection of ITAR data is defined as authoritative and will override manually registered data. This is particularly relevant for data that are later submitted when applying for funding from the Ministry of Education and Research, including the publication channel, the number of authors and the publication type (article, letter, etc.). These data cannot be changed by the user. Other data, such as title and volume, can be changed.
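
The override rule can be sketched as follows. The field names are hypothetical labels for the three authoritative items named above, and how Frida treats non-authoritative fields that exist in both sources is not stated in the slides; here the manually registered value is kept, which is an assumption.

```python
# Hypothetical field names for the authoritative items named in the text.
AUTHORITATIVE_FIELDS = frozenset(
    {"publication_channel", "author_count", "publication_type"}
)


def merge_import(manual, imported):
    """Merge an imported record into a manually registered one.

    Authoritative fields always take the imported value. Other fields keep
    the manual value if present (an assumption) and are filled in otherwise.
    """
    merged = dict(manual)
    for field, value in imported.items():
        if field in AUTHORITATIVE_FIELDS or field not in merged:
            merged[field] = value
    return merged


def apply_user_edit(record, field, value):
    """Return an edited copy of the record, refusing edits to authoritative fields."""
    if field in AUTHORITATIVE_FIELDS:
        raise PermissionError(f"{field} is authoritative and cannot be edited")
    updated = dict(record)
    updated[field] = value
    return updated


manual = {"title": "My title", "publication_type": "letter", "volume": "12"}
imported = {
    "title": "Imported title",
    "publication_type": "article",
    "publication_channel": "Journal X",
    "author_count": 3,
}

merged = merge_import(manual, imported)
edited = apply_user_edit(merged, "volume", "13")
```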


Problems with duplicates

We encountered problems with duplicates when Frida was young. The functionality of the relevant application windows was not good enough, and several users failed to search for entries that had already been made.

Both the interface and the general control mechanism for duplicates have been improved.
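
The slides do not describe how the duplicate control works. One common approach, shown here only as a hypothetical sketch and not as Frida's actual mechanism, is to compare a normalized title together with the publication year before a new entry is accepted.

```python
import re
import unicodedata


def normalize_title(title):
    """Lower-case a title and reduce punctuation and spacing to single spaces."""
    text = unicodedata.normalize("NFKD", title)
    # Drop combining marks so accented and unaccented spellings compare equal.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def find_duplicates(candidate, existing):
    """Return existing records with the same normalized title and year."""
    key = (normalize_title(candidate["title"]), candidate["year"])
    return [
        rec for rec in existing
        if (normalize_title(rec["title"]), rec["year"]) == key
    ]


# Hypothetical records for illustration.
existing = [
    {"title": "Automated (Meta)data Collection: Problems and Solutions", "year": 2006},
    {"title": "A different paper", "year": 2006},
]
candidate = {"title": "automated (meta)data collection - problems and solutions", "year": 2006}

matches = find_duplicates(candidate, existing)
```

A check of this kind would only flag candidates for the user to review; titles that differ by more than punctuation and case would still need a manual search.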

Full-text databases

All universities that use Frida today also use open archives, also called Open Access databases, to store their publications in full text:

• DIVA, BORA, DUO, Munin

Scientific full text documents can be delivered to Frida:

• Metadata (title, authors, etc.) are registered in Frida

• The full text documents with the metadata are transferred to the open archive of the respective university

NORA is the organisation of the Norwegian Open Research Archives.

