Upload
sydney-richardson
View
218
Download
4
Tags:
Embed Size (px)
Citation preview
LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform
Alexandre Donizeti AlvesHoracio Hideki Yanasse
Nei Yoshihiro Soma
October 24, 2011
11th Workshop on Domain-Specific Modeling
Introduction
Lattes Platform is an information system implanted
by CNPq (National Council for Scientific and
Technological Development) to manage
information on science, technology and innovation
related to researchers and institutions in Brazil
This platform is undoubtedly the major source of
information available on Brazilian researchers
Introduction: Lattes Platform
http://lattes.cnpq.br
Introduction
The Lattes CV system, a curricular information
system, is the main component of the platform
Currently, the Lattes CV system stores around
2,000,000 curricula of researchers, lectures,
students and professionals from diverse areas of
knowledge
Introduction: Lattes CV system
http://buscatextual.cnpq.br/buscatextual
Jorge Almeida Guimaraes
Introduction: Lattes curriculum (English)
Introduction: Lattes curriculum (English)
Introduction: Lattes curriculum (Portuguese)
Introduction
In the last years, many works were developed
using data extracted from Lattes Platform of
researchers of different areas of knowledge
A common problem presented in these works is
that the curricula and the information extracted
had to be obtained manually
Introduction
Therefore, this system has a very high
quality information extraction potential
LattesMiner
LattesMinerLattesMiner is an internal multilingual DSL for automatic
information extraction from Lattes curricula
It is composed by a set of classes written in Java that
allows developers to implement their own applications
with a high-level abstraction and expression power
LattesMiner
Data Discovery is used to find the (ID) number of the researchers.
Usually, only the name of the researcher is available.
Data Acquisition is responsible for downloading the Lattes curricula
of the researchers from Lattes CV system on the Web.
Data Extraction is the main component of LattesMiner. It is
responsible for extracting data from the HTML files. The technique
of information extraction based on regular expressions was used.
The extracted data can be stored in XML files or in any database
using the Data Structure component.
The Data Visualization component is responsible for the identification
and visualization of the academic social networks. These networks are
identified by checking the relationships between researchers.
The Data Analysis component is responsible for the analysis of the
data extracted and also for the analysis of the relationships identified.
LattesMiner
LattesMiner
Biodata
Board
BiodataIE
BoardIE
BoardDaoBiodataDao
lattes.miner
lattes.miner.ielattes.miner.en
lattes.miner.dao
Perfil Banca
lattes.miner.brThe LattesMiner class is composed by instances of classes Biodata
and Board, in addition to many others not presented here.
LattesMiner
LattesMiner was created through a fluent interface, that
provides a compact and yet easy-read representation of
the domain problem
Fluent interfaces are implemented using the method
chaining
LattesMiner makes use of static factory methods and
imports
Case Study
http://plsql1.cnpq.br/divulg/RESULTADO_PQ_102003.curso
For the following examples researchers of the Computer Science area
with CNPq Research Productivity Scholarship were considered.
The list contains all the names of the researchers.
However, their corresponding (ID) number are not provided.
Listing 1
import java.util.*;
import lattes.util.Util;
import static lattes.miner.LattesMiner.*;
public class Listing1{ public static void main(String[] args) {
}}
List<String> list = new ArrayList<String>();
for (String name : Util.getList("names.txt"))
list.add( );
Util.setList(list, "ids.txt");
search(name)
Java application code
Listing 2
dir("cvs");
for (String id : Util.getList("ids.txt"))
download(id). save();
Code fragment used to download the lattes curricula of
the researchers.
Listing 3
props("mysql");
for (String id : Util.getList("ids.txt")) {
}
load(id). biodata(). address();
publications( )JOURNAL . save();
This listing shows as to extracted data from Lattes curricula
of the researchers.
Listing 4
for (String id : Util.getList("ids.txt")){
}
// Portuguese
// English
for (Banca b : ){
}
for (Board b : ){
}
carregar(id).bancas() . getBancas()
load(id). boards() . getBoards()
if ( )
System.out.println( );
if ( )
System.out.println( );
b.ano() == 2010
b.aluno()
b.year() == 2010
b.student()
Code fragment to illustrate how the LattesMiner is used to extract information in different languages.
Results
The SUCUPIRA is a system for identification and visualization of academic social networks.
Here is shows the geographical distribution of the five researchers
that have published more articles in scientific journals.
Results
This is a graph of contacts of the five researchers that have published more in scientific journals.
The graph depicts an academic social network of the five researchers.
Nodes are presented with
the name of researcher
The color of the edges represent the number of relationships among researchers.
Conclusions
Currently, the Lattes curricula are available in HTML
format
LattesMiner however does not depend on the data
format because it allows users to program their
own applications with a high-level abstraction
If the data format is eventually modified, the DSL
interface remains the same
Conclusions
An advantage of LattesMiner is that it searches by
the name of the researcher
LattesMiner is multilingual
Another advantage is that the data extracted can
are stored in a structural format (XML or
database), allowing these data to be easily used by
others applications
Future work
The future step that is already being implemented
in the LattesMiner DSL is a statistical analysis of
the data
ACNOWLEDGMENTS