Alexandre Donizeti Alves Horacio Hideki Yanasse Nei Yoshihiro Soma October 24, 2011 11 th Workshop on Domain-Specific Modeling

LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

Alexandre Donizeti AlvesHoracio Hideki Yanasse

Nei Yoshihiro Soma

October 24, 2011

11th Workshop on Domain-Specific Modeling

Introduction

Lattes Platform is an information system implanted

by CNPq (National Council for Scientific and

Technological Development) to manage

information on science, technology and innovation

related to researchers and institutions in Brazil

This platform is undoubtedly the major source of

information available on Brazilian researchers

Introduction: Lattes Platform

http://lattes.cnpq.br

Introduction

The Lattes CV system, a curricular information

system, is the main component of the platform

Currently, the Lattes CV system stores around

2,000,000 curricula of researchers, lectures,

students and professionals from diverse areas of

knowledge

Introduction: Lattes CV system

http://buscatextual.cnpq.br/buscatextual

Jorge Almeida Guimaraes

Introduction: Lattes curriculum (English)

Introduction: Lattes curriculum (English)

Introduction: Lattes curriculum (Portuguese)

Introduction

In the last years, many works were developed

using data extracted from Lattes Platform of

researchers of different areas of knowledge

A common problem presented in these works is

that the curricula and the information extracted

had to be obtained manually

Introduction

Therefore, this system has a very high

quality information extraction potential

LattesMiner

LattesMinerLattesMiner is an internal multilingual DSL for automatic

information extraction from Lattes curricula

It is composed by a set of classes written in Java that

allows developers to implement their own applications

with a high-level abstraction and expression power

LattesMiner

Data Discovery is used to find the (ID) number of the researchers.

Usually, only the name of the researcher is available.

Data Acquisition is responsible for downloading the Lattes curricula

of the researchers from Lattes CV system on the Web.

Data Extraction is the main component of LattesMiner. It is

responsible for extracting data from the HTML files. The technique

of information extraction based on regular expressions was used.

The extracted data can be stored in XML files or in any database

using the Data Structure component.

The Data Visualization component is responsible for the identification

and visualization of the academic social networks. These networks are

identified by checking the relationships between researchers.

The Data Analysis component is responsible for the analysis of the

data extracted and also for the analysis of the relationships identified.

LattesMiner

LattesMiner

Biodata

Board

BiodataIE

BoardIE

BoardDaoBiodataDao

lattes.miner

lattes.miner.ielattes.miner.en

lattes.miner.dao

Perfil Banca

lattes.miner.brThe LattesMiner class is composed by instances of classes Biodata

and Board, in addition to many others not presented here.

LattesMiner

LattesMiner was created through a fluent interface, that

provides a compact and yet easy-read representation of

the domain problem

Fluent interfaces are implemented using the method

chaining

LattesMiner makes use of static factory methods and

imports

Case Study

http://plsql1.cnpq.br/divulg/RESULTADO_PQ_102003.curso

For the following examples researchers of the Computer Science area

with CNPq Research Productivity Scholarship were considered.

The list contains all the names of the researchers.

However, their corresponding (ID) number are not provided.

Listing 1

import java.util.*;

import lattes.util.Util;

import static lattes.miner.LattesMiner.*;

public class Listing1{ public static void main(String[] args) {

}}

List<String> list = new ArrayList<String>();

for (String name : Util.getList("names.txt"))

list.add( );

Util.setList(list, "ids.txt");

search(name)

Java application code

Listing 2

dir("cvs");

for (String id : Util.getList("ids.txt"))

download(id). save();

Code fragment used to download the lattes curricula of

the researchers.

Listing 3

props("mysql");

for (String id : Util.getList("ids.txt")) {

}

load(id). biodata(). address();

publications( )JOURNAL . save();

This listing shows as to extracted data from Lattes curricula

of the researchers.

Listing 4

for (String id : Util.getList("ids.txt")){

}

// Portuguese

// English

for (Banca b : ){

}

for (Board b : ){

}

carregar(id).bancas() . getBancas()

load(id). boards() . getBoards()

if ( )

System.out.println( );

if ( )

System.out.println( );

b.ano() == 2010

b.aluno()

b.year() == 2010

b.student()

Code fragment to illustrate how the LattesMiner is used to extract information in different languages.

Results

The SUCUPIRA is a system for identification and visualization of academic social networks.

Here is shows the geographical distribution of the five researchers

that have published more articles in scientific journals.

Results

This is a graph of contacts of the five researchers that have published more in scientific journals.

The graph depicts an academic social network of the five researchers.

Nodes are presented with

the name of researcher

The color of the edges represent the number of relationships among researchers.

Conclusions

Currently, the Lattes curricula are available in HTML

format

LattesMiner however does not depend on the data

format because it allows users to program their

own applications with a high-level abstraction

If the data format is eventually modified, the DSL

interface remains the same

Conclusions

An advantage of LattesMiner is that it searches by

the name of the researcher

LattesMiner is multilingual

Another advantage is that the data extracted can

are stored in a structural format (XML or

database), allowing these data to be easily used by

others applications

Future work

The future step that is already being implemented

in the LattesMiner DSL is a statistical analysis of

the data

ACNOWLEDGMENTS

Documents

Alexandre Donizeti Alves Horacio Hideki Yanasse Nei Yoshihiro Soma October 24, 2011 11 th Workshop on Domain-Specific Modeling