May 2011 Personal Validation and Entity Resolution Conference. Presenter: Antoine Chevrette, System Engineering Division, Statistics Canada
G-Link
A Probabilistic Record Linkage System
Antoine Chevrette
System Engineering Division
Statistics Canada
Agenda
• Background: early days of record linkage
• Motivation for building G-Link
• G-Link design objectives
• System overview
• Software installation
• What’s in the future?
23-04-102 Statistics Canada • Statistique Canada
Theory of Record Linkage
Ivan Fellegi & Alan Sunter
• “A Theory for Record Linkage” (1969)
Still widely regarded as both pivotal and definitive
Implemented in Statistics Canada’s linkage software
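As a rough illustration of the Fellegi-Sunter idea, each compared field contributes a weight of log2(m/u) when it agrees and log2((1-m)/(1-u)) when it disagrees, where m is the probability of agreement among true matches and u the probability of agreement among non-matches. The sketch below is not G-Link code: the function names and the m and u values are invented for illustration, and the field weights are simply summed, which assumes the fields are conditionally independent.

```python
import math

def field_weight(agrees: bool, m: float, u: float) -> float:
    """Fellegi-Sunter log2 weight for a single field comparison."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1.0 - m) / (1.0 - u))

def pair_weight(outcomes: dict, params: dict) -> float:
    """Total weight of a record pair: the sum of its field weights."""
    return sum(field_weight(outcomes[f], *params[f]) for f in outcomes)

# Invented (m, u) probabilities per field, for illustration only.
PARAMS = {"surname": (0.95, 0.01), "birth_year": (0.90, 0.05)}

# One pair that agrees on surname but disagrees on birth year.
total = pair_weight({"surname": True, "birth_year": False}, PARAMS)
print(round(total, 3))
```

Agreement on a rare value pushes the weight up sharply, while disagreement on a reliable field pulls it down.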
Linkage Systems at Statistics Canada
Ted Hill (SDD) and Martha Fair (Health) produced the “Generalized Iterative Record Linkage System” (GIRLS)
• First released as a mainframe-only product (GIRLS) ca. 1980
• Re-engineered for Unix servers ca. 1990 (renamed GRLS)
Larger linkages became practical over time
Functionality and ease of use encouraged wider application
Why Replace GRLS?
GRLS fully functional, and very popular, but:
• Requires the use of a Unix-based server
• Requires connection with the Oracle DBMS
Potential applications saw architecture as a barrier
GRLS software was aging & required significant updates
G-Link Design Objectives
Operable on all Windows desktops
Available for both Windows & Unix servers
No third-party software dependencies
No additional licensing fees
Full GRLS work-alike functionality
Processing speed comparable to GRLS
Extensible
Easy to use
G-LINK Overview
G-LINK introduction through:
• Menu options
• The following screens:
Project creation
Data importation
Data analysis
Pairs creation
Index creation
Rules creation
Pairs weight distribution graph
Pairs review
Group creation and mapping
Data exportation
Batch functionality
• Installation instructions
Project Creation
External or Internal Linkage
Internal: e.g. find duplicate records in an address file.
External: e.g. link a cancer database with a death database.
Information taken from a configuration file (for server mode only)
Project protected by a username and password
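The difference between the two linkage types can be sketched in terms of which record pairs are candidates for comparison. This is an illustrative sketch only: the function names and sample values are invented, and it enumerates every possible pair, whereas a real system would normally restrict the comparisons.

```python
from itertools import combinations, product

def internal_pairs(records):
    """Internal linkage: compare a file with itself, each unordered pair once."""
    return list(combinations(records, 2))

def external_pairs(file_a, file_b):
    """External linkage: compare each record of one file with each record of the other."""
    return list(product(file_a, file_b))

# Invented sample data.
addresses = ["12 Main St", "12 Main Street", "7 Oak Ave"]
cancer_ids = ["c1", "c2"]
death_ids = ["d1", "d2", "d3"]

print(len(internal_pairs(addresses)))              # 3 pairs from 3 records
print(len(external_pairs(cancer_ids, death_ids)))  # 2 x 3 = 6 pairs
```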
Data Importation
You can see the first 100 observations from the SAS file
Once the importation is complete you can create derived columns based on NYSIIS and Soundex
Definitions for the columns to import
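To show what a Soundex-derived column contains, here is a minimal sketch of the standard American Soundex coding (first letter plus three digits). It is not taken from G-Link; the function name is an assumption, letters outside A-Z are treated like vowels, and NYSIIS is not shown.

```python
# Letter-to-digit table used by standard American Soundex.
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for ch in letters:
        CODES[ch] = digit

def soundex(name: str) -> str:
    """Return the 4-character Soundex code, or "" if the name has no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return (result + "000")[:4]

for surname in ["Robert", "Rupert", "Ashcraft", "Tymczak"]:
    print(surname, soundex(surname))
```

Names that sound alike, such as Robert and Rupert, receive the same code, which is what makes the derived column useful for comparing misspelled names.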
Data analysis
Obtain the frequency of each field value
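A frequency count of this kind can be sketched in a few lines. The records and field names below are invented for illustration and do not reflect G-Link's internal storage.

```python
from collections import Counter

def field_frequencies(records, field):
    """Count how often each value of `field` occurs in the records."""
    return Counter(rec[field] for rec in records)

# Invented sample records.
records = [
    {"surname": "Smith", "city": "Ottawa"},
    {"surname": "Smith", "city": "Gatineau"},
    {"surname": "Roy", "city": "Ottawa"},
]

surname_counts = field_frequencies(records, "surname")
for value, count in surname_counts.most_common():
    print(value, count)
```

Such frequencies indicate how discriminating a field is: agreement on a rare value is stronger evidence of a match than agreement on a common one.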
Pairs Creation
Create pairs interactively
Experienced users can directly create SQL statements
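A pair-creation statement of the kind an experienced user might write could look like the sketch below. It uses an in-memory SQLite database as a stand-in; the table names, column names, and data are hypothetical, and the SQL dialect and schema used by G-Link are not described in these slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file_a (id INTEGER PRIMARY KEY, surname TEXT, birth_year INTEGER);
    CREATE TABLE file_b (id INTEGER PRIMARY KEY, surname TEXT, birth_year INTEGER);
    INSERT INTO file_a VALUES (1, 'Smith', 1950), (2, 'Tremblay', 1962);
    INSERT INTO file_b VALUES (10, 'Smith', 1950), (11, 'Smith', 1971), (12, 'Roy', 1962);
""")

# Create candidate pairs only for records that share a surname,
# instead of comparing every record in file_a with every record in file_b.
pairs = conn.execute("""
    SELECT a.id, b.id
    FROM file_a AS a
    JOIN file_b AS b ON a.surname = b.surname
    ORDER BY a.id, b.id
""").fetchall()

print(pairs)
conn.close()
```

Restricting the join condition keeps the number of pairs manageable; the rules are then applied only to the pairs that were created.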
Rule Creation
3 level character rule
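The slide does not define the three levels, so the following is only one plausible reading: full agreement, partial agreement on a leading prefix, and disagreement. The function name, the prefix length of 4, and the handling of missing values are all assumptions made for this sketch.

```python
from typing import Optional

def three_level_character_rule(a: str, b: str, prefix_len: int = 4) -> Optional[int]:
    """Return an outcome level for two character values.

    1 = full agreement, 2 = agreement on the first `prefix_len` characters,
    3 = disagreement, None = at least one value is missing (assumed convention).
    """
    a, b = a.strip().upper(), b.strip().upper()
    if not a or not b:
        return None
    if a == b:
        return 1
    if a[:prefix_len] == b[:prefix_len]:
        return 2
    return 3

print(three_level_character_rule("Johnson", "JOHNSON"))   # full agreement
print(three_level_character_rule("Johnson", "Johnston"))  # prefix agreement
print(three_level_character_rule("Johnson", "Smith"))     # disagreement
```

Each outcome level would then carry its own weight, so that a partial agreement counts for less than a full one.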
Rule Creation
3 level character matrix rule
Rule Creation
2 level date rule
Rule Creation
Numerical condition rule
User Rules
Type must be custom
Outcome set by users. (use in the user rule psql)
Include field from your input tables
Pairs weight distribution graph
You can choose the range selection
Minimum and maximum weight + the threshold values
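The role of the threshold values can be sketched as follows: in the Fellegi-Sunter framework, pairs above an upper threshold are treated as links, pairs below a lower threshold as non-links, and pairs in between as possible links that need review. The function name, the sample weights, the threshold values, and the treatment of weights exactly on a threshold are assumptions for this sketch.

```python
from collections import Counter

def classify_pair(weight: float, lower: float, upper: float) -> str:
    """Assign a pair to 'link', 'possible' or 'non-link' from its total weight."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if weight >= upper:
        return "link"
    if weight < lower:
        return "non-link"
    return "possible"

# Invented pair weights and thresholds.
weights = [-6.2, -1.0, 0.5, 3.3, 8.9, 12.4]
LOWER, UPPER = 0.0, 8.0

status_counts = Counter(classify_pair(w, LOWER, UPPER) for w in weights)
print(min(weights), max(weights), dict(status_counts))
```

Moving the thresholds closer together shrinks the review workload but increases the risk of false links and missed links.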
Pairs revision
Special criteria in order to revise groups of pairs
Rules outcome level
Manual update
Group creation and mapping
Mapping screen
Group creation screen
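One common way to turn linked pairs into groups is to treat the pairs as edges and take the connected components. The slides do not say how G-Link forms its groups, so the sketch below, with its invented function name and record identifiers, only illustrates that general technique using a small union-find structure.

```python
def build_groups(pairs):
    """Group record ids that are connected, directly or indirectly, by linked pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for rec in list(parent):
        groups.setdefault(find(rec), set()).add(rec)
    return sorted(sorted(members) for members in groups.values())

# Invented linked pairs: a1-b1 and b1-a2 share a record, so all three form one group.
linked = [("a1", "b1"), ("b1", "a2"), ("a3", "b3")]
print(build_groups(linked))
```

Here a1 and a2 end up in the same group even though they were never compared directly, which is why groups are worth reviewing after creation.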
Data Exportation
Export to flat or SAS files
Batch
Set a G-Link project as batch. Run from the command line, embedded script with timed execution.
How to install G-LINK
G-LINK is installed using an .exe file on a Windows machine.
G-LINK can be installed locally or in server mode.
• You should use the client-server mode when:
Performance is important (option of using multiple CPUs)
Data confidentiality is required
[Diagram labels: Interface, Logical, Processing (DBMS) Local, Processing (DBMS) Server]
G-Link: The Future?
Product will continue to evolve:
• Faster processing
• Enhanced pre-processing and post-processing
• Enhanced fuzzy matching
Possibility of “record-at-a-time” linkages:
• For interactive applications (capture, un-duplication)
• Potential for embedded processing
Contact: