On the Genealogy of Knowledge:

Alternative patent data sources to complicate old

problems and come up with new ones

Luis A. Rios

Duke University

AOM Patent PDW, August 2014

Preface

The Tree of Knowledge

In patent research, we sometimes take a conveniently

stylized view of what patents and firms look like

A firm. CUSIP 369604BC6

A patent. US 223,898

Simplification increases tractability

Usually good enough

Both patents and firms are conceptualized as stable and discrete units,

with clear 1:m relationships

So we say things like “GE has 37,268 patents”

But the genealogy of firms and patents is complicated

And often intertwined

Most commercially meaningful patents are actually

intricate legal constructs

Source: www.patentlens.net

In the last few years, a wealth of new resources have

become available, which allow us to paint a more

realistic picture of how firms and IP co-evolve

Presentation Roadmap

PATSTAT and BvD: Two exciting European data sources

Not well-known in the U.S.

Great for “genealogical” patent research

Four problems that empirical patent researchers should at least be aware of.

One classic and three new ones.

The new ones are hard to ignore in light of the new data

Examples of some neat things that we can do. Suggestions for future research.

1. Intro to PATSTAT and BVD

PATSTAT

PATSTAT, the EPO Worldwide Patent Statistical Database, is a

snapshot of the EPO master documentation database (DOCDB) with

worldwide coverage

Contains more than 20 tables with bibliographic data, citations and

family links of about 70 million applications of more than 80

countries

Has been available on optical media since 2008, now also online (and

there is a free trial)

Most researchers use it via SQL queries, but raw data platforms are very promising.

PATSTAT virtually unknown pre-2010

PATSTAT (simplified structure)

PATSTAT (full structure)

PATSTAT + BVD contribution

Since 2011, the OECD has been working with Idener on a new firm-name matching project, which adds to PATSTAT the massive company information database of Bureau van Dijk (BvD).

“How To Kill Inventors: Testing The Massacrator© Algorithm For

Inventor Disambiguation” (Pezzoni, Lissoni & Tarasconi, 2012)

PATSTAT + BVD contribution

The technical name matching methodologies are similar to other

projects (e.g. approximate string matching; weighted token-based

comparisons; distance measures)

However, the BvD data adds detailed company information (~120

million firms globally), which increases the accuracy of matching

considerably.

Also adds crucial firm ownership and structure information

Many institutions have had BvD products for years (e.g. it is included in

WRDS), but the patent module is new.

You may already have this!

2. Why this is exciting: four headaches for patent

researchers

(three of which may be new to you)

Challenges in using and interpreting patent data

a. The classic problem: Need to disambiguate names. Lots of attention, lots

of progress.

b. Less obvious problem: Subsidiary assignees mask true ownership

c. Not clear what we can/should do problem: equivalents and priority

families

d. Very obscure problem: invisible transfers, citations to applications

a. The “classic” name disambiguation problem

Often illustrated by the variations of IBM’s name on patents.

Hundreds of different spellings, such as “International Business

Machines,” “IBM, INC” “Intl Business Machines,” etc.

It affects both firm and inventor names.

It matters because we want to know who created the IP. Critical for

accurate patent counts.
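
A minimal sketch of the kind of cleaning and comparison involved, assuming a hand-made abbreviation map and legal-suffix list (both hypothetical, not taken from NBER, PATSTAT or BvD). It normalizes the IBM variants quoted above, then compares them with a token-based measure and a character-level distance measure.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation map and legal-suffix list, for illustration only.
ABBREVIATIONS = {"intl": "international", "corp": "corporation"}
LEGAL_SUFFIXES = {"inc", "corporation", "co", "ltd", "llc", "company"}


def normalize(name):
    """Lowercase, strip punctuation, expand abbreviations, drop legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def token_jaccard(a, b):
    """Share of tokens that two normalized names have in common."""
    ta, tb = set(normalize(a).split()), set(normalize(b).split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def edit_similarity(a, b):
    """Character-level similarity ratio between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


variants = ["International Business Machines", "IBM, INC", "Intl Business Machines"]
normalized = [normalize(v) for v in variants]
```

Here the first and third variants collapse to the same string, while "IBM, INC" shares no tokens with either. Acronyms like this cannot be recovered by string distance alone, which is one reason external company information helps.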

Some of my favorite versions of IBM

Many ongoing projects on disambiguation

Many different approaches over the years: NBER firm name

standardization routines, inventor names from HBS Dataverse (Lai;

D'Amour; Yu; Sun; Fleming, 2013)

“How To Kill Inventors: Testing The Massacrator© Algorithm For

Inventor Disambiguation” (Pezzoni, Lissoni & Tarasconi, 2012)

PATSTAT data are a great complement to other approaches. They are especially suited for small and private firms, and European firms.

Its HAN database standardizes inventors with unique IDs. Continues to

be improved. Works very nicely when merged to DATAVERSE

b. Less obvious problem:

Subsidiary assignees mask true ownership

Even if we had perfectly matched names, many patents get assigned

to subsidiaries!

Source: Arora, Belenzon and Rios, SMJ 2014

Here are 8 (out of 312) J&J subsidiaries

Source: Arora, Belenzon and Rios, SMJ 2014

Subsidiaries are a problem for any kind of patent

analysis.

Because firms adopt different assignment strategies, the error will not be simply “noise.”

IBM assigns over 90% of its patents to IBM, whereas Johnson & Johnson assigns less than 10% to itself. Most go to wholly owned subsidiaries.

Failure to properly aggregate to the parent firm could be a problem for measures such as patent propensity, innovative performance, knowledge transfer, etc.

Empirical work needs to make a “theory of the firm” decision each time it defines what a firm is, in order to draw the boundaries

These are J&J’s patents per the NBER data:

19 Subsidiaries

5,816 patents

These are J&J’s patents using

PATSTAT/BVD:

90 Subsidiaries

12,192 patents

NOTE: This does not simply come out of

PATSTAT. Result of months of refinement.

The ability to aggregate/disaggregate

critical for defining what the firm is on a

project-by-project basis

We may be interested in the internal

division of innovative labor (disaggregate)

Or in a top-down firm strategy question

(aggregate)
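
A minimal sketch of switching between the two views, assuming a table of ownership links is available. The firm names, links and patent records below are hypothetical and only illustrate the roll-up; real ownership data also change over time, which this ignores.

```python
from collections import Counter

# Hypothetical ownership links (subsidiary -> immediate owner); not real BvD data.
OWNER = {
    "Sub A": "Holding B",
    "Holding B": "Parent Corp",
    "Sub C": "Parent Corp",
}

# Hypothetical patent-to-assignee records.
PATENTS = [
    ("P1", "Sub A"),
    ("P2", "Sub A"),
    ("P3", "Sub C"),
    ("P4", "Parent Corp"),
    ("P5", "Holding B"),
]


def ultimate_owner(firm, owner):
    """Follow ownership links upward until a firm with no recorded owner."""
    seen = set()
    while firm in owner and firm not in seen:  # guard against circular links
        seen.add(firm)
        firm = owner[firm]
    return firm


# Disaggregated view: patents counted at the named assignee.
by_assignee = Counter(assignee for _, assignee in PATENTS)

# Aggregated view: patents counted at the ultimate parent.
by_parent = Counter(ultimate_owner(assignee, OWNER) for _, assignee in PATENTS)
```

In this toy case the parent holds one patent in its own name but five once subsidiaries are rolled up, the same kind of gap as in the J&J comparison above.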

c. Not clear what we can/should do problem:

equivalents and priority families

In the “real world” most patents are not discrete quanta. They are often parts of families,

pools or complex thickets. So rather than a 1:1 we often have a m:m match between patents

and the technologies they cover

Priority families are patents that share a common priority via the same original application. Often, this is the set of “equivalents” (Martinez, 2010).

They grow via divisionals (e.g. original app split into two) or continuations (original app replaced by revised app).

Very few commercially significant patents are “singletons.”

Most valuable patents also spawn their own slightly tweaked fences.

Take this patent.

ContentGuard’s “System, Apparatus, and Media for

Granting Access to and Utilizing Content”

In the “Related U.S. Application Data” section, it tells us that this is a continuation of a continuation of a continuation of a continuation.

It might seem that this is the fifth patent in a series of

continuations, four of which were granted before.

But it is not that simple. The “related” section on the face

only tells us the direct “parents”

In fact, there are actually 35 granted US patents with little more than a different title and some changed words, ALL sharing just one priority
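
A minimal sketch of how counting at the family level changes the picture, assuming each patent can be linked to its earliest priority application. The identifiers are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (patent, earliest priority application) pairs; identifiers are made up.
PRIORITY = [
    ("US-A1", "PRIO-1"),
    ("US-A2", "PRIO-1"),
    ("US-A3", "PRIO-1"),
    ("US-B1", "PRIO-2"),
]

# Group patents that share the same priority into one family.
families = defaultdict(list)
for patent, priority in PRIORITY:
    families[priority].append(patent)

patent_count = len(PRIORITY)    # every granted patent counted separately
family_count = len(families)    # one count per shared priority
family_sizes = {priority: len(members) for priority, members in families.items()}
singletons = [members[0] for members in families.values() if len(members) == 1]
```

Four patents become two families, only one of which is a singleton. Whether the patent count or the family count is the right unit depends on the research question.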

Equivalents are tricky, and

what we do about them has

to be driven by theoretical

framing also.

Is this one USPTO patent?

Or is it 35?

What do we make of the 35th

patent in this series?

Is that more “incremental”?

More “strategic”?

Is the original application where the “novel” invention lies?

d. Very obscure problem: invisible transfers,

citations to applications

How do we account for a firm’s acquired patents? What about a firm’s

sold patents?

These are inputs into its production function that are generally ignored.

But new data suggests that this is more than just a “rounding error”

Some of the most innovative firms also have the absorptive capacity to

identify and acquire very early technology.

Often these acquisitions occur at the application stage, pre-publication

Most current patent methodologies mistakenly assign the eventual

granted patents as having been internally generated.

This underestimates the significance of open innovation and markets for technology
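
A minimal sketch of flagging such transfers, assuming both the applicant at filing and the assignee at grant are observed. The records and the subsidiary map are hypothetical. Names are rolled up to the parent first, so that internal reassignments are not mistaken for acquisitions.

```python
# Hypothetical records: (patent, applicant at filing, assignee at grant).
RECORDS = [
    ("P1", "Startup X", "Big Firm"),
    ("P2", "Big Firm", "Big Firm"),
    ("P3", "Big Firm Labs", "Big Firm"),
]

# Hypothetical subsidiary-to-parent map.
PARENT = {"Big Firm Labs": "Big Firm"}


def parent_of(firm):
    """Return the parent if one is recorded, otherwise the firm itself."""
    return PARENT.get(firm, firm)


def classify(applicant, grantee):
    """Label a granted patent by where its application originated."""
    return "internal" if parent_of(applicant) == parent_of(grantee) else "acquired"


labels = {patent: classify(applicant, grantee) for patent, applicant, grantee in RECORDS}
acquired_share = sum(label == "acquired" for label in labels.values()) / len(labels)
```

A method that looks only at the assignee at grant would treat all three patents as internally generated. Here one of the three originated outside the firm.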

Just when you thought things

could not get more complicated…

ContentGuard’s fertile patent was

bought from Xerox!

ContentGuard has been

exploiting a patent that it did not

“invent”

A metaphysical conundrum:

Did ContentGuard “create” 34

patents?

Is Xerox the real innovator of this

technology?

Many cites occur at the A1 stage (application) and are

not counted in B’s

Application filed by

Nevengineering, Inc

(45 cites in 7 years)

Granted to Google

(2 cites in 3 years)

Google buys Neven…and uses

the technology in Google Glass

(Source: Rios, 2014)

If more innovative patents are cited during application, then the granted patent’s cites are undercounted!

This is especially true for patents that are

traded during their application process and

then modified by the inventing firm

(continuations and divisionals)

This is especially important

in fast-moving technology

areas
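
A minimal sketch of the undercount, assuming each publication can be linked back to its application. The publication identifiers are hypothetical. The 45 and 2 echo the citation counts in the example above, and the second application is made up for contrast.

```python
from collections import Counter

# Hypothetical publications: publication -> (application id, kind code).
PUBLICATIONS = {
    "PUB-1-A1": ("APP-1", "A1"),   # published application
    "PUB-1-B2": ("APP-1", "B2"),   # granted patent from the same application
    "PUB-2-B1": ("APP-2", "B1"),   # granted patent with no cited application
}

# Hypothetical forward citations, each naming the publication it cites.
CITED = ["PUB-1-A1"] * 45 + ["PUB-1-B2"] * 2 + ["PUB-2-B1"] * 5

grant_only = Counter()        # cites to granted (B) publications only
per_application = Counter()   # cites to any publication of the application
for publication in CITED:
    application, kind = PUBLICATIONS[publication]
    per_application[application] += 1
    if kind.startswith("B"):
        grant_only[application] += 1
```

Counting only cites to the granted patent ranks APP-2 above APP-1 (5 versus 2). Counting at the application level reverses the ranking (47 versus 5).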

Figure: where the patents and people behind Google Glass came from. Labels recoverable from the diagram: Internal Inventors (37 patents); Neven Vision, founder Neven Hartmut, Director of Engineering (9 patents); Prof. Babak Parviz, University of Washington, Head of Project Glass (12 patents); Motion Research Technologies Inc, Dobson Dominic (3 patents); MicroOptics Corporation, Mark Spitzer, Director of Operations (6 patents); Salvatore Scellato, PhD student (1 patent); DNNresearch, Geoffrey Hinton.

Google as a successful technology aggregator

This is what I mean by the “genealogy of knowledge”

If we want to trace the locus of innovation, we need to

go down corporate and patent rabbit holes

US2003208447

US6236971, US6708157, US6714921,

US6895392, US6898576, US6910022,

US6925448, US6928419, US6934693,

US6944600, US6957193, US6959290,

US7024392, US7039613, US7043453,

US7058606, US7065505, US7113912,

US7200574, US7209902, US7225160,

US7260556, US7266529, US7269576,

US7269577, US7359881, US7389270,

US7523072, US7664708, US7788182,

US7970709, US8170955, US8370956,

US8393007, US8443457, US8484751,

US8671461

Pendrell has only 4 subsidiaries. Imagine what this looks like for J&J!

Google’s genealogy inseparable from its technological

roots

Some interesting areas for future research

What firms are accessing early technology?

How does organizational structure relate to innovation performance?

Does better aggregation of citations (at the family level) + sources of

citation (examiner vs. applicant) increase the reliability of citations as

measures of knowledge flows?

Are family measures (size, diversity) fruitful for capturing innovative

performance?

PATSTAT: pros and cons

Massive amounts of data. International coverage and patent family

information facilitates cross-country comparisons

Constant updates.

But: expensive (~$2,400), and frequent errors and revisions.

Must get to know it intimately before relying on it.

For example, just one buried non-ASCII character can lose millions of observations during imports. You would never know if you did not know what to expect the data to look like.
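
A minimal sketch of a sanity check around an import, assuming the raw file is delimited text. The sample data and column names are hypothetical. It compares the number of rows the reader returns with the number of physical lines, and lists lines containing non-ASCII characters for inspection.

```python
import csv
import io

# Hypothetical raw export: the third data row contains non-ASCII characters.
RAW = "appln_id,applicant\n1,ACME INC\n2,SIEMENS AG\n3,SOCIÉTÉ GÉNÉRALE\n4,IBM\n"


def non_ascii_rows(text):
    """Return (line number, line) for every line containing a non-ASCII character."""
    return [(number, line)
            for number, line in enumerate(text.splitlines(), start=1)
            if not line.isascii()]


def count_rows(text):
    """Count data rows (excluding the header) as a CSV reader sees them."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return sum(1 for _ in reader)


# Physical lines minus the header; assumes no line breaks inside quoted fields.
expected_rows = RAW.count("\n") - 1
loaded_rows = count_rows(RAW)
suspects = non_ascii_rows(RAW)
```

If `loaded_rows` falls short of `expected_rows` after a real import, the lines in `suspects` are a reasonable first place to look.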

Still a bit of a hacker/wild west frontier, but a great place to start for those who want to be ahead of the curve

Must read: Gianluca Tarasconi’s RawPatentData blog. User

contributors catch errors faster than EPO

PATSTAT: pros and cons

Tips: Hardware
