On the Genealogy of Knowledge: Alternative patent data sources to complicate old problems and come up with new ones Luis A. Rios Duke University AOM Patent PDW, August 2014



On the Genealogy of Knowledge:

Alternative patent data sources to complicate old

problems and come up with new ones

Luis A. Rios

Duke University

AOM Patent PDW, August 2014


Preface

The Tree of Knowledge


In patent research, we sometimes take a conveniently

stylized view of what patents and firms look like

A firm. CUSIP 369604BC6

A patent. US 223,898


Simplification increases tractability

Usually good enough

Both patents and firms are conceptualized as stable and discrete units,

with clear 1:m relationships

So we say things like “GE has 37,268 patents”


But the genealogy of firms and patents is complicated

And often intertwined


Most commercially meaningful patents are actually

intricate legal constructs

Source: www.patentlens.net


In the last few years, a wealth of new resources has become available, allowing us to paint a more realistic picture of how firms and IP co-evolve


Presentation Roadmap

PATSTAT and BvD: Two exciting European data sources

Not well-known in the U.S.

Great for “genealogical” patent research

Four problems that empirical patent researchers should at least be aware of.

One classic and three new ones.

The new ones hard to ignore in light of the new data

Examples of some neat things that we can do. Suggestions for future research.


1. Intro to PATSTAT and BvD


PATSTAT

PATSTAT, the EPO Worldwide Patent Statistical Database, is a

snapshot of the EPO master documentation database (DOCDB) with

worldwide coverage

Contains more than 20 tables with bibliographic data, citations and

family links of about 70 million applications of more than 80

countries

Has been available on optical media since 2008, and is now also online (with a free trial)

Most researchers use it via SQL queries, but raw data platforms look very promising.
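As a rough illustration of the SQL workflow, the sketch below builds a tiny in-memory SQLite database and counts applications and families per filing authority. The table and column names (`applications`, `appln_id`, `appln_auth`, `family_id`) are simplified placeholders chosen for this example, not the actual PATSTAT schema, and the rows are made up.

```python
import sqlite3

# Toy stand-in for a PATSTAT-style application table.
# Table and column names are illustrative assumptions, not the real schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE applications ("
    "appln_id INTEGER PRIMARY KEY, appln_auth TEXT, family_id INTEGER)"
)
rows = [
    (1, "US", 100),  # the same invention filed in three offices
    (2, "EP", 100),
    (3, "JP", 100),
    (4, "US", 200),
    (5, "US", 300),
]
conn.executemany("INSERT INTO applications VALUES (?, ?, ?)", rows)

# Applications and distinct families per filing authority.
query = """
    SELECT appln_auth,
           COUNT(*) AS n_applications,
           COUNT(DISTINCT family_id) AS n_families
    FROM applications
    GROUP BY appln_auth
    ORDER BY appln_auth
"""
counts = conn.execute(query).fetchall()
for auth, n_applications, n_families in counts:
    print(auth, n_applications, n_families)

# Worldwide totals: five applications, but only three families.
totals = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT family_id) FROM applications"
).fetchone()
print(totals)
```

The same pattern carries over to a full database server; only the connection and the real table names would change.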


PATSTAT virtually unknown pre-2010


PATSTAT (simplified structure)



PATSTAT (full structure)


PATSTAT + BvD contribution

Since 2011, the OECD has been working with Idener on a new firm-name matching project, which adds to PATSTAT the massive company information database of Bureau van Dijk (BvD).

“How To Kill Inventors: Testing The Massacrator© Algorithm For Inventor Disambiguation” (Pezzoni, Lissoni & Tarasconi, 2012)


PATSTAT + BvD contribution

The technical name matching methodologies are similar to other

projects (e.g. approximate string matching; weighted token-based

comparisons; distance measures)

However, the BvD data adds detailed company information (~120

million firms globally), which increases the accuracy of matching

considerably.

Also adds crucial firm ownership and structure information

Many institutions have had BvD products for years (e.g. it is included in

WRDS), but the patent module is new.

You may already have this!


2. Why this is exciting: four headaches for patent

researchers

(three of which may be new to you)


Challenges in using and interpreting patent data

a. The classic problem: the need to disambiguate names. Lots of attention, lots of progress.

b. The less obvious problem: subsidiary assignees mask true ownership.

c. The “not clear what we can or should do” problem: equivalents and priority families.

d. The very obscure problem: invisible transfers and citations to applications.


a. The “classic” name disambiguation problem

Often illustrated by the variations of IBM’s name on patents.

Hundreds of different spellings, such as “International Business Machines,” “IBM, INC,” “Intl Business Machines,” etc.

It affects both firm and inventor names.

It matters because we want to know who created the IP. Critical for

accurate patent counts.
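A minimal sketch of what name cleaning can look like, using only the Python standard library. The legal-form list and the abbreviation table are small illustrative assumptions, not a published standardization routine, and `SequenceMatcher` is just one of many possible similarity measures.

```python
import re
from difflib import SequenceMatcher

# Legal-form tokens to drop; this short list is an illustrative assumption.
LEGAL_FORMS = {"inc", "corp", "corporation", "co", "ltd", "llc"}
# Abbreviation expansions; also illustrative, not a published dictionary.
ABBREVIATIONS = {"intl": "international"}


def normalize(name: str) -> str:
    """Lowercase, strip punctuation, expand abbreviations, drop legal forms."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    tokens = [ABBREVIATIONS.get(token, token) for token in tokens]
    return " ".join(token for token in tokens if token not in LEGAL_FORMS)


def similarity(a: str, b: str) -> float:
    """Character-level similarity of the normalized names, from 0 to 1."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


variants = [
    "International Business Machines",
    "Intl Business Machines Corp.",
    "INTERNATIONAL BUSINESS MACHINES, INC",
]
normalized = {normalize(v) for v in variants}
print(normalized)

# The acronym does not collapse to the same string; it needs a lookup table
# or extra firm information to be matched.
acronym_score = similarity("IBM, INC", "International Business Machines")
print(round(acronym_score, 2))
```

String cleaning alone handles spelling variants but not acronyms, which is one reason extra company information helps the matching.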


Some of my favorite versions of IBM


Many ongoing projects on disambiguation

Many different approaches over the years: NBER firm name

standardization routines, inventor names from HBS Dataverse (Lai;

D'Amour; Yu; Sun; Fleming, 2013)

“How To Kill Inventors: Testing The Massacrator© Algorithm For

Inventor Disambiguation” (Pezzoni, Lissoni & Tarasconi, 2012)

PATSTAT data are a great complement to other approaches. Especially suited for small and private firms, and European firms.

Its HAN database standardizes inventors with unique IDs. Continues to be improved. Works very nicely when merged to Dataverse


b. Less obvious problem:

Subsidiary assignees mask true ownership

Even if we had perfectly matched names, many patents get assigned

to subsidiaries!

Source: Arora, Belenzon and Rios, SMJ 2014


Here are 8 (out of 312) J&J subsidiaries

Source: Arora, Belenzon and Rios, SMJ 2014


Subsidiaries are a problem for any kind of patent

analysis.

Because firms adopt different assignment strategies, the error will not be simply “noise.”

IBM assigns over 90% of its patents to IBM itself, whereas Johnson & Johnson assigns less than 10% to itself; most go to wholly owned subsidiaries.

Failure to properly aggregate to the parent firm could be a problem for measures such as patent propensity, innovative performance, knowledge transfer, etc.

Empirical work needs to make a “theory of the firm” decision each time it defines what a firm is, in order to draw the boundaries
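To make the aggregation step concrete, here is a small sketch that rolls assignee-level patent counts up to the ultimate parent. The firm names, ownership links and counts are invented for illustration; in practice the ownership table would come from a company database, and the choice of how far up to aggregate is exactly the "theory of the firm" decision.

```python
from collections import Counter

# Hypothetical ownership links: subsidiary -> direct owner.
OWNER = {
    "Sub A": "Parent X",
    "Sub B": "Sub A",  # second-tier subsidiary
    "Sub C": "Parent Y",
}


def ultimate_parent(firm: str, owner: dict[str, str]) -> str:
    """Follow ownership links upward until reaching a firm with no owner."""
    seen = set()
    while firm in owner:
        if firm in seen:
            raise ValueError(f"ownership cycle at {firm!r}")
        seen.add(firm)
        firm = owner[firm]
    return firm


# Made-up patent counts by assignee name as printed on the patents.
assignee_counts = {
    "Parent X": 10,
    "Sub A": 40,
    "Sub B": 50,
    "Parent Y": 90,
    "Sub C": 5,
}

parent_counts = Counter()
for assignee, n_patents in assignee_counts.items():
    parent_counts[ultimate_parent(assignee, OWNER)] += n_patents

# Share of each group's patents assigned to the parent itself.
self_share = {
    parent: assignee_counts[parent] / total
    for parent, total in parent_counts.items()
}
print(dict(parent_counts))
print(self_share)
```

Counting only patents assigned to the parent name would rank these two hypothetical firms very differently from the rolled-up totals, which is why the error is not simply noise.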


These are J&J’s patents per the NBER data:

19 Subsidiaries

5,816 patents


These are J&J’s patents using PATSTAT/BvD:

90 Subsidiaries

12,192 patents

NOTE: This does not simply come out of PATSTAT. It is the result of months of refinement.

The ability to aggregate/disaggregate is critical for defining what the firm is on a project-by-project basis.

We may be interested in the internal division of innovative labor (disaggregate), or in a top-down firm strategy question (aggregate).


c. Not clear what we can/should do problem:

equivalents and priority families

In the “real world” most patents are not discrete quanta. They are often parts of families, pools or complex thickets. So rather than a 1:1 match, we often have an m:m match between patents and the technologies they cover.

Priority families are patents that share a common priority via the same original application. This is often the set of “equivalents” (Martinez, 2010).

They grow via divisionals (e.g. original app split into two) or continuations (original app replaced by revised app).

Very few commercially significant patents are “singletons.”

Most valuable patents also spawn their own slightly tweaked fences.
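One way to recover families from parent/child links is a union-find pass, sketched below. The patent IDs and links are hypothetical, and treating every continuation or divisional link as "same family" is a simplifying assumption; real family definitions differ in the details.

```python
from collections import defaultdict

# Hypothetical (child, parent) links, e.g. continuations and divisionals.
LINKS = [("P2", "P1"), ("P3", "P2"), ("P4", "P1"), ("P6", "P5")]
PATENTS = ["P1", "P2", "P3", "P4", "P5", "P6", "P7"]

# Union-find: every patent starts as the root of its own family.
root_of = {p: p for p in PATENTS}


def find(p: str) -> str:
    """Return the family root of p, shortening the path along the way."""
    while root_of[p] != p:
        root_of[p] = root_of[root_of[p]]
        p = root_of[p]
    return p


def union(a: str, b: str) -> None:
    """Merge the families containing a and b."""
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        root_of[root_a] = root_b


for child, parent in LINKS:
    union(child, parent)

families = defaultdict(list)
for p in PATENTS:
    families[find(p)].append(p)

family_sizes = sorted(len(members) for members in families.values())
print(len(PATENTS), "patents in", len(families), "families")
print(family_sizes)
```

Here only P7 is a singleton; whether the analysis should count seven patents or three families depends on the question being asked.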


Take this patent.

ContentGuard’s “System, Apparatus, and Media for

Granting Access to and Utilizing Content”

In the “Related U.S. Application Data” section, it tells us that this is a continuation of a continuation of a continuation of a continuation.

It might seem that this is the fifth patent in a series of

continuations, four of which were granted before.

But it is not that simple. The “related” section on the face

only tells us the direct “parents”


In fact, there are actually 35 granted US patents with little more than a different title and some changed words, ALL sharing just one priority


Equivalents are tricky, and what we do about them also has to be driven by theoretical framing.

Is this one USPTO patent? Or is it 35?

What do we make of the 35th patent in this series?

Is that more “incremental”? More “strategic”?

Is the original application where the “novel” invention lies?


d. Very obscure problem: invisible transfers,

citations to applications

How do we account for a firm’s acquired patents? What about a firm’s

sold patents?

These are inputs into its production function that are generally ignored.

But new data suggests that this is more than just a “rounding error”

Some of the most innovative firms also have the absorptive capacity to

identify and acquire very early technology.

Often these acquisitions occur at the application stage, pre-publication

Most current patent methodologies mistakenly treat the eventual granted patents as having been internally generated.

This underestimates the significance of open innovation and markets for technology
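A toy sketch of the distinction: compare who filed the application with who holds the grant. The record fields (`first_applicant`, `grant_assignee`) and the firms are hypothetical, and the comparison assumes names have already been harmonized and rolled up to the right firm boundary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatentRecord:
    patent_id: str
    first_applicant: str  # owner when the application was filed
    grant_assignee: str   # owner printed on the granted patent


def origin(record: PatentRecord) -> str:
    """Label a grant as internally generated or acquired before grant."""
    if record.first_applicant == record.grant_assignee:
        return "internal"
    return "acquired"


# Made-up records for illustration.
RECORDS = [
    PatentRecord("G1", "BigCo", "BigCo"),
    PatentRecord("G2", "StartupCo", "BigCo"),
    PatentRecord("G3", "BigCo", "BigCo"),
    PatentRecord("G4", "UniLab", "BigCo"),
]

portfolio = [r for r in RECORDS if r.grant_assignee == "BigCo"]
labels = {r.patent_id: origin(r) for r in portfolio}
acquired_share = sum(label == "acquired" for label in labels.values()) / len(labels)
print(labels)
print(acquired_share)
```

A method that looks only at the grant assignee would credit all four of these grants to BigCo's internal R&D.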


Just when you thought things

could not get more complicated…

ContentGuard’s fertile patent was

bought from Xerox!

ContentGuard has been

exploiting a patent that it did not

“invent”

A metaphysical conundrum:

Did ContentGuard “create” 34

patents?

Is Xerox the real innovator of this

technology?


Many cites occur at the A1 stage (application) and are not counted in the B (grant) publication’s cites

Application filed by

Nevengineering, Inc

(45 cites in 7 years)

Granted to Google

(2 cites in 3 years)

Google buys Neven…and uses

the technology in Google Glass

(Source: Rios, 2014)

If more innovative patents are cited during application, then the granted patent’s cites are undercounted!

This is especially true for patents that are

traded during their application process and

then modified by the inventing firm

(continuations and divisionals)

This is especially important

in fast-moving technology

areas
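The undercount can be sketched by pooling cites across all publications of the same application instead of counting only cites to the grant. The publication IDs, kind codes and citation pairs below are invented; the only assumption about real data is that a publication can be mapped back to its application.

```python
from collections import defaultdict

# Hypothetical publications: id -> (application id, kind code), where "A1" is
# the published application and "B2" the granted patent.
PUBLICATIONS = {
    "US-001-A1": ("app1", "A1"),
    "US-001-B2": ("app1", "B2"),
    "US-002-B2": ("app2", "B2"),
}

# Made-up (citing document, cited publication) pairs.
CITATIONS = [
    ("c1", "US-001-A1"),
    ("c2", "US-001-A1"),
    ("c3", "US-001-A1"),
    ("c3", "US-001-B2"),  # the same citer cites both publications
    ("c4", "US-001-B2"),
    ("c5", "US-002-B2"),
]

grant_citers = defaultdict(set)
pooled_citers = defaultdict(set)
for citing, cited in CITATIONS:
    application, kind = PUBLICATIONS[cited]
    pooled_citers[application].add(citing)  # sets avoid double counting
    if kind.startswith("B"):
        grant_citers[application].add(citing)

grant_only = {app: len(citers) for app, citers in grant_citers.items()}
pooled = {app: len(citers) for app, citers in pooled_citers.items()}
print(grant_only)
print(pooled)
```

In this toy data the grant-only count for app1 misses half of its distinct citers.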

Google as a successful technology aggregator

[Diagram of external and internal sources of Google's technology. Labels: Neven Vision (founder Hartmut Neven, Director of Engineering; 9 patents); University of Washington (Prof. Babak Parviz, Head of Project Glass; 12 patents); Motion Research Technologies Inc (Dominic Dobson; 3 patents); MicroOptics Corporation (Mark Spitzer, Director of Operations; 6 patents); Salvatore Scellato (PhD student; 1 patent); DNNresearch (Geoffrey Hinton); internal inventors (37 patents).]


This is what I mean by the “genealogy of knowledge”

If we want to trace the locus of innovation, we need to

go down corporate and patent rabbit holes

US2003208447

US6236971, US6708157, US6714921,

US6895392, US6898576, US6910022,

US6925448, US6928419, US6934693,

US6944600, US6957193, US6959290,

US7024392, US7039613, US7043453,

US7058606, US7065505, US7113912,

US7200574, US7209902, US7225160,

US7260556, US7266529, US7269576,

US7269577, US7359881, US7389270,

US7523072, US7664708, US7788182,

US7970709, US8170955, US8370956,

US8393007, US8443457, US8484751,

US8671461


Pendrell has only 4 subsidiaries. Imagine what this looks like for J&J!


Google’s genealogy inseparable from its technological

roots


Some interesting areas for future research

What firms are accessing early technology?

How does organizational structure relate to innovation performance?

Does better aggregation of citations (at the family level) + sources of

citation (examiner vs. applicant) increase the reliability of citations as

measures of knowledge flows?

Are family measures (size, diversity) fruitful for capturing innovative

performance?


PATSTAT: pros and cons

Massive amounts of data. International coverage and patent family

information facilitates cross-country comparisons

Constant updates.

But: expensive (~$2,400), with frequent errors and revisions.

Must get to know it intimately before relying on it.

For example, just one buried non-ASCII character can lose millions of observations during imports. You would never know if you did not know what to expect the data to look like.
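A simple pre-import sanity check along these lines: locate every non-ASCII character, then confirm that the number of parsed rows matches what you expect. The sample data and the helper name `non_ascii_positions` are made up for this sketch; the goal is to find such characters and verify the import handles them, not to strip legitimate accented names.

```python
import csv
import io


def non_ascii_positions(text: str) -> list[tuple[int, int, str]]:
    """Return (line number, column, character) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch))
    return hits


# Made-up sample; \u00c9 is an accented capital E.
sample = "id,name\n1,SIEMENS AG\n2,SOCI\u00c9T\u00c9 ANONYME\n3,ACME\n"

hits = non_ascii_positions(sample)
print(hits)

# Compare the parsed row count against the expected number of data rows.
parsed_rows = list(csv.reader(io.StringIO(sample)))
n_data_rows = len(parsed_rows) - 1  # minus the header row
expected_rows = 3
print(n_data_rows == expected_rows)
```

Running the row-count comparison after every import is the cheap way to notice that observations went missing.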


PATSTAT: pros and cons

Still a bit of a hacker/wild west frontier, but a great place to start for those who want to be ahead of the curve

Must read: Gianluca Tarasconi’s RawPatentData blog. User contributors catch errors faster than EPO


Tips: Hardware
