CIRCLE
Centre for Innovation, Research and Competence in the Learning Economy
LUND UNIVERSITY
P.O. Box 117, SE-221 00 Lund, Sweden

Swedish inventors - matching to registers and descriptive data

Presentation at APE-INV, Brussels, September 5th 2011

Lina Ahlin and Olof Ejermo
lina.ahlin@circle.lu.se
olof.ejermo@circle.lu.se

On the agenda

• What is so special about Swedish data?
• 1st matching
• 2nd matching
• Future: how to reach a 100% match rate?
• (Results)

Linking inventors to registers

• EPO patent applications 1978-2009 by inventors with addresses in Sweden
• Matching done on name-home address combinations
• Problem 1: different inventors may have the same name
• Problem 2: addresses may be old
• How to verify person identity and connect to Swedish register data?

Swedish data

Q: What makes Swedish data so exciting (and why do we want a high match rate)?
A: Through Statistics Sweden, individuals can be connected to register data that links several levels of information relevant for innovation studies:
• Individual level: field/level of education, age, income, gender, workplace
• Regions: workplace, home municipality
• Sectoral level: sectors, firm size, level of R&D...

This can give a multifaceted view of innovation, but it requires a personal identifier, the "personnummer", e.g. 19500131-3422:
• Birth date: Jan 31st, 1950
• Even number = female
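As a small illustration of how the two pieces of information on this slide can be read out of a personnummer, here is a minimal Python sketch. The function name `parse_personnummer` is hypothetical. The slide only says "even number = female"; that the relevant digit is the second-to-last one is an assumption added here, and the final check digit is not validated.

```python
from datetime import date


def parse_personnummer(pnr: str) -> dict:
    """Split a personnummer written as 'YYYYMMDD-NNNN' into its parts."""
    birth_part, suffix = pnr.split("-")
    if len(birth_part) != 8 or len(suffix) != 4 or not (birth_part + suffix).isdigit():
        raise ValueError(f"unexpected format: {pnr!r}")
    birth_date = date(int(birth_part[:4]), int(birth_part[4:6]), int(birth_part[6:8]))
    # Assumption (not stated on the slide): the second-to-last digit encodes
    # the sex, with an even digit meaning female.
    sex = "female" if int(suffix[2]) % 2 == 0 else "male"
    return {"birth_date": birth_date, "sex": sex, "suffix": suffix}


info = parse_personnummer("19500131-3422")
print(info["birth_date"].isoformat(), info["sex"])  # 1950-01-31 female
```

The birth-date part is what the later matching steps use when the four last digits are not yet known.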

1st matching (Oct-Dec 2010)

• All Swedes (incl. personnummer) are listed in the address register "SPAR"
• Matching of addresses through InfoTorg, which stores addresses/address changes for the latest 3 years; addition of personnummer
  – Individuals under 16 not matched
• Old patents added under the assumption that an identical name and address mean the same person:

  Register 2008-2010          Patent applied for in 1992
  Sven Ivar Johanson          Sven Ivar Johanson
  Storgatan 1            =    Storgatan 1
  111 00 Stockholm            111 00 Stockholm

Match rate: 64% of inventor-patent pairs, rising from a low of 23% in 1978 to a high of 93% in 2008. The difference is due to the mobility of inventors: those on older patents are more likely to have moved since.
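The exact-match assumption and the resulting match rate per year can be sketched as follows. This is only an illustration under stated assumptions: the helper names (`make_key`, `match_rate_by_year`), the normalisation (lower-casing, collapsing whitespace) and all records are hypothetical, and the personnummer value is the made-up example from the earlier slide.

```python
from collections import defaultdict


def make_key(name: str, street: str, postal: str) -> str:
    """Lower-case and collapse whitespace so trivial spelling variants compare equal."""
    return " ".join(f"{name} {street} {postal}".lower().split())


# Hypothetical register extract: name-address key -> personnummer.
register = {
    make_key("Sven Ivar Johanson", "Storgatan 1", "111 00 Stockholm"): "19500131-3422",
}

# Hypothetical inventor-patent pairs: (application year, name, street, postal address).
pairs = [
    (1992, "Sven Ivar Johanson", "Storgatan 1", "111 00  Stockholm"),
    (1992, "Anna Berg", "Lillgatan 2", "222 22 Lund"),
    (2008, "SVEN IVAR JOHANSON", "Storgatan 1", "111 00 Stockholm"),
]


def match_rate_by_year(pairs, register):
    """Share of inventor-patent pairs per year whose key is found in the register."""
    matched, total = defaultdict(int), defaultdict(int)
    for year, name, street, postal in pairs:
        total[year] += 1
        if make_key(name, street, postal) in register:
            matched[year] += 1
    return {year: matched[year] / total[year] for year in total}


rates = match_rate_by_year(pairs, register)
print(rates)  # {1992: 0.5, 2008: 1.0}
```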

• InfoTorg returned a 56% match rate
• Manual check (visual, no robot): + 8%

64% match rate

[Figure: match rate by year, 1978-2008; vertical axis 0% to 100%; series "Fractions 64%"]

1985-2005: present access to individual registers at Statistics Sweden
2006-2009: additions as of Sep. 30th 2011

2nd matching (April-Sep 2011)

• Use public access to registers (Swedish genealogical association)
  – CDs of the Swedish population (1980)/1990, listing old addresses and birth dates
  – CD "Book of dead" 1901-2009, listing address at death + personnummer
• Match birth date + name to personnummer using a service by InfoTorg or online sources


Methodology

• Extract data from Swedish deadbook and Swedish genealogy records for 1990 (to some extent also 1980) on all individuals in the population by letter

• Generate a variable containing name, address and postal address for all individuals in the population as well as for inventors who are not fully matched
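A minimal sketch of the second step: building one comparison string per person and grouping records by letter so that only records within the same group are compared. It rests on two assumptions not stated on the slide: "by letter" is read as the initial of the surname, and the surname is taken to be the last word of the name. The function names are hypothetical.

```python
from collections import defaultdict


def name_address_string(name: str, street: str, postal: str) -> str:
    """Combine name, address and postal address into a single comparison string."""
    return f"{name}, {street}, {postal}"


def block_by_letter(records):
    """Group comparison strings by surname initial (assumed: surname = last word)."""
    blocks = defaultdict(list)
    for rec in records:
        surname = rec["name"].split()[-1]
        blocks[surname[0].upper()].append(
            name_address_string(rec["name"], rec["street"], rec["postal"])
        )
    return dict(blocks)


# Hypothetical population records.
population = [
    {"name": "Sven Ifwar Johanson", "street": "Storgatan 1", "postal": "111 00 Stockholm"},
    {"name": "Anna Berg", "street": "Lillgatan 2", "postal": "222 22 Lund"},
]
blocks = block_by_letter(population)
print(sorted(blocks))  # ['B', 'J']
```

Grouping like this keeps the number of pairwise string comparisons manageable when the whole population is on one side of the match.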

Normalized Levenshtein ("strgroup") in Stata

• An example of the "name-address" string:
  "Sven Ivar Johanson, Storgatan 1, 111 00 Stockholm" (from EPO)
  = "Sven Ifwar Johanson, Storgatan 1, 111 00 Stockholm" (from Swedish population 1990)
• Replace/insert 3 letters to make the strings equal
• Divide by the length of the shortest string (48): 3/48 = 0.0625 (= a good hit)
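The measure described above, an edit distance divided by the length of the shorter string, can be sketched in plain Python. This is an illustrative re-implementation, not the Stata "strgroup" code. The slide's own figures (3 edits, length 48) are the authors' illustration, and the counts obtained for a given pair depend on exactly how the strings are written (punctuation, spacing), so they may differ slightly from the slide.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the length of the shorter string."""
    shortest = min(len(a), len(b))
    if shortest == 0:
        raise ValueError("both strings must be non-empty")
    return levenshtein(a, b) / shortest


epo = "Sven Ivar Johanson, Storgatan 1, 111 00 Stockholm"
population = "Sven Ifwar Johanson, Storgatan 1, 111 00 Stockholm"
print(levenshtein(epo, population), normalized_levenshtein(epo, population))
```

A small value means the two strings are close; which value counts as "a good hit" is a threshold the analyst has to choose.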

Adding date of birth

1. 1990 Levenshtein names & addresses
2. 1990 Levenshtein unique names
3. Levenshtein from CD dead 1901-2009: names and addresses
4. Strgroup: similarity on name-address hits 1-3
5. Some manual additions and minor changes
6. 1980 Levenshtein names and addresses (letters D&H)


Methodology: continued

• Manually examine each match to see whether Levenshtein-command has matched correctly

• Some hits discarded incl ambiguous name match hits

New match rate 80%

[Figure: match rate by year, 1978-2009; vertical axis 0% to 100%; series "Fractions 64%" and "Fractions 80%"]

Adding personnummer (ongoing)

New match rate 80%, but not full personnummer. What to do?
1. Use the date of birth-part of the personal number for fully matched inventors
2. Join all possible combinations of birth dates for those fully matched and those with only birth dates.
3. Run Levenshtein-distance on inventor names
4. Small Levenshtein-distance: accept that the inventors are the same since name and birth date match
5. Large Levenshtein-distance: reject
6. Further, manually check remaining inventors. Look at addresses for further confirmation if uncertain.
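Steps 2 to 6 can be sketched as a join on birth date followed by a three-way decision on the name distance. The thresholds `ACCEPT_MAX` and `REJECT_MIN`, the function names and the records are hypothetical; the slide does not say where the line between "small" and "large" is drawn, and the identifiers `pnr-A` and `pnr-B` are placeholders rather than real personal numbers.

```python
from collections import defaultdict

ACCEPT_MAX = 0.15  # hypothetical: at or below this, treat as the same person
REJECT_MIN = 0.40  # hypothetical: at or above this, treat as different persons


def levenshtein(a: str, b: str) -> int:
    """Plain edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def classify_candidates(known, partial):
    """known: (name, birth date, personnummer) of fully matched inventors.
    partial: (name, birth date) of inventors with birth date only.
    Names are assumed to be non-empty."""
    by_birth = defaultdict(list)
    for name, birth, pnr in known:
        by_birth[birth].append((name, pnr))
    decisions = []
    for name, birth in partial:
        for known_name, pnr in by_birth.get(birth, []):  # join on birth date
            dist = levenshtein(name.lower(), known_name.lower())
            ratio = dist / min(len(name), len(known_name))
            if ratio <= ACCEPT_MAX:
                label = "accept"
            elif ratio >= REJECT_MIN:
                label = "reject"
            else:
                label = "manual check"
            decisions.append((name, known_name, pnr, label))
    return decisions


known = [("Sven Ivar Johanson", "1950-01-31", "pnr-A"),
         ("Karin Lind", "1950-01-31", "pnr-B")]
partial = [("Sven Ifwar Johanson", "1950-01-31")]
decisions = classify_candidates(known, partial)
for row in decisions:
    print(row)
```

Pairs that fall between the two thresholds are left for the manual check in step 6, where addresses can be used for confirmation.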

Adding personnummer ctd.

• Use Deathbook years 1975-2009. Use the date of birth-part of personal numbers
• Re-run steps 2-6 on the previous slide

Adding personnummer ctd.

Problem: not all inventors were previously identified, so their 4 last digits are missing.
Two options to get full personal numbers from birth dates:
1. Use InfoTorg again with name + added parameter "birthdate"
2. Manually add the four last digits by using an internet service (www.upplysning.se)

Some matching problems

• Difficult to match individuals who change last names (mainly women), or who have common names and move a lot.
• Two people with the same name can live at the same address (e.g. a father names his son after himself), so the wrong person may be matched. If detected, the oldest person is chosen.
• For inventors affiliated with some firms (AstraZeneca), the company address is given

Towards 100%

• Idea: scoring methods based on identified inventors
  – Name
  – Identified co-inventors
  – Technology class
  – City
  – Postal code
  – Which algorithm?
• Statistics Sweden for validating parent/child name similarity problem?
• Use 1980 population CD?
• Strategy of focusing on highly productive unmatched inventors?
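One possible shape for such a scoring method is a weighted sum over the listed attributes. The sketch below is only one candidate answer to "Which algorithm?": the weights, the use of Jaccard overlap for co-inventors and technology classes, and the record layout are all assumptions made for illustration, not a method proposed in the presentation.

```python
# Hypothetical weights; they sum to 1 so the score lies between 0 and 1.
WEIGHTS = {"name": 0.40, "coinventors": 0.25, "tech_class": 0.15,
           "city": 0.10, "postal_code": 0.10}


def jaccard(a, b) -> float:
    """Overlap of two collections: size of intersection over size of union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def score(candidate: dict, identified: dict) -> float:
    """Similarity between an unmatched inventor record and an identified inventor."""
    parts = {
        "name": float(candidate["name"].lower() == identified["name"].lower()),
        "coinventors": jaccard(candidate["coinventors"], identified["coinventors"]),
        "tech_class": jaccard(candidate["tech_classes"], identified["tech_classes"]),
        "city": float(candidate["city"].lower() == identified["city"].lower()),
        "postal_code": float(candidate["postal_code"] == identified["postal_code"]),
    }
    return sum(WEIGHTS[key] * value for key, value in parts.items())


# Hypothetical records.
identified = {"name": "Sven Ivar Johanson", "coinventors": {"Anna Berg"},
              "tech_classes": {"A61K"}, "city": "Stockholm", "postal_code": "111 00"}
moved = dict(identified, city="Lund", postal_code="222 22",
             coinventors={"Anna Berg", "Karin Lind"})
print(round(score(moved, identified), 3))  # 0.675
```

In the example the inventor has moved, so city and postal code contribute nothing, while name, technology class and a shared co-inventor still give a high score.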

Suggestions/questions

Patent distribution by sector

Patent distribution in manufacturing (share of total patenting)

Patent distribution in services (share of total patenting)

Education level among inventors

Percentile distribution of inventors’ patent productivity.

Percentile            All patents        Contribution       Patents 2004-07    Contribution 2004-07
                      Percentile value   Percentile value   Percentile value   Percentile value
1%                                   1               0.12                  1                   0.11
5%                                   1               0.20                  1                   0.17
10%                                  1               0.25                  1                   0.20
25%                                  1               0.33                  1                   0.33
50%                                  1               0.83                  1                   0.50
75%                                  3               1.50                  2                   1.00
90%                                  6               3.00                  4                   2.00
95%                                  9               5.00                  6                   3.00
99%                                 21              11.50                 12                   5.83
Mean/inventor                     2.81               1.40               2.06                   0.97
Number of inventors             18 489             18 489              8 526                  8 526

Sectors, SNI92-codes, # inventors, contribution 2004-2005.

Sector          Unique inventors,      Contribution*,    % cooperation cross   % cooperation cross
                mean/year 2004-2005    mean 2004-2005    sector 1994-1995      sector 2004-2005
Primary                         8.5               5.9                   28%                   28%
Manufacturing                  1567             749.9                   11%                   11%
Services                      806.5             411.1                   23%                   23%
Academia                        190              72.6                   54%                   54%
Public sector                  62.5              28.4                   67%                   67%

SNI92-codes by sector:
• Primary: 1000-14999
• Manufacturing: 15000-37999
• Services: 38000-74999, 80410, 80423-80425, 80427-80429, 85200, 85325, 91111-91330, 92110-92130, 92310, 92330-92400, 92611-92614, 92621-99000
• Academia: 80301-80309 and **
• Public sector: 75000-80299, 80421-80422, 80426, 85000-85140, 85311-85324, 90000-90008, 92200, 92320, 92511/92530, 92615

* "Contribution" counts patent fractions, which adjusts for co-inventorship.
** "Academia" can also in a few cases be found in the sectors R&D in technical and natural sciences (73101-73104) and in technical testing and analysis (74300).
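To make the difference between a patent count and a "contribution" concrete, here is a minimal sketch of fractional counting. It assumes that each patent is split equally among its listed inventors (a share of 1/n each), which is the usual convention but is not spelled out in the footnote. The function name and the data are hypothetical.

```python
from collections import defaultdict


def fractional_contributions(patents: dict):
    """patents maps a patent id to the list of its inventors.
    Returns (whole patent counts, fractional contributions) per inventor."""
    counts = defaultdict(int)
    contributions = defaultdict(float)
    for inventors in patents.values():
        share = 1 / len(inventors)  # assumption: equal split among co-inventors
        for inventor in inventors:
            counts[inventor] += 1
            contributions[inventor] += share
    return dict(counts), dict(contributions)


# Hypothetical patents and inventors.
patents = {"P1": ["A", "B"], "P2": ["A"], "P3": ["A", "B", "C", "D"]}
counts, contributions = fractional_contributions(patents)
print(counts["A"], contributions["A"])  # 3 1.75
```

Inventor A is listed on three patents but contributes 1.75 patent fractions, which is why contribution figures are lower than patent counts wherever co-inventorship is common.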

Cooperation by sector, 2004-05

Columns: Primary, Manufacturing, Services, Academia, Public sector, Sum

Primary:        43%   57%    0%    0%   100%
Manufacturing:   1%   77%   17%    5%   100%
Services:        1%   66%   24%    9%   100%
Academia:        0%   29%   48%   22%   100%
Public sector:   0%   18%   37%   45%   100%

The most important patenting academic institutions 2004-2005

Univ/institute   Contributions/year   Share   Patents/billion research   Patents/thousand
                                              revenue SEK                FTE, NTM
Lund                           20.3     23%                        6.3               15.0
Uppsala                        11.6     13%                        4.2                9.7
Karolinska                     11.6     13%                        3.9                9.3
KTH                             9.8     11%                        5.7                8.7
Göteborg                        9.0     10%                        3.7               10.9
Linköping                       7.9      9%                        6.4               10.3
Chalmers                        7.2      8%                        5.1                8.6
Stockholm                       2.9      3%                        1.7                4.1
Umeå                            2.3      3%                        1.5                2.8
Sum                            82.6     94%                        4.4                9.3
Others (13)                     5.0      6%                        1.3                1.8