On the Genealogy of Knowledge:
Alternative patent data sources to complicate old
problems and come up with new ones
Luis A. Rios
Duke University
AOM Patent PDW, August 2014
Preface
The Tree of Knowledge
In patent research, we sometimes take a conveniently
stylized view of what patents and firms look like
A firm. CUSIP 369604BC6
A patent. US 223,898
Simplification increases tractability
Usually good enough
Both patents and firms are conceptualized as stable and discrete units,
with clear 1:m relationships
So we say things like “GE has 37,268 patents”
But the genealogy of firms and patents is complicated
And often intertwined
Most commercially meaningful patents are actually
intricate legal constructs
Source: www.patentlens.net
In the last few years, a wealth of new resources have
become available, which allow us to paint a more
realistic picture of how firms and IP co-evolve
Presentation Roadmap
PATSTAT and BvD: Two exciting European data sources
Not well-known in the U.S.
Great for “genealogical” patent research
Four problems that empirical patent researchers should at least be aware of.
One classic and three new ones.
The new ones are hard to ignore in light of the new data
Examples of some neat things that we can do. Suggestions for future research.
1. Intro to PATSTAT and BVD
PATSTAT
PATSTAT, the EPO Worldwide Patent Statistical Database, is a
snapshot of the EPO master documentation database (DOCDB) with
worldwide coverage
Contains more than 20 tables with bibliographic data, citations and
family links for about 70 million applications from more than 80
countries
Has been available on optical media since 2008, now also online (and
there is a free trial)
Most researchers use it via SQL queries, but raw data platforms are very
promising.
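As a rough illustration of the SQL workflow mentioned above, here is a minimal sketch that counts applications per filing authority. The table `applications` and its columns are hypothetical, simplified stand-ins (not the actual PATSTAT schema), and the rows are made up; an in-memory SQLite database stands in for a real PATSTAT installation.

```python
import sqlite3

# In-memory toy database; a real PATSTAT setup would be a full SQL server.
conn = sqlite3.connect(":memory:")

# Hypothetical, simplified table: one row per patent application.
conn.execute(
    "CREATE TABLE applications ("
    "appln_id INTEGER PRIMARY KEY, authority TEXT, filing_year INTEGER)"
)
conn.executemany(
    "INSERT INTO applications VALUES (?, ?, ?)",
    [(1, "US", 2005), (2, "US", 2006), (3, "EP", 2006), (4, "JP", 2006)],
)

# Count applications per filing authority, largest first.
rows = conn.execute(
    "SELECT authority, COUNT(*) AS n "
    "FROM applications "
    "GROUP BY authority "
    "ORDER BY n DESC, authority"
).fetchall()

for authority, n in rows:
    print(authority, n)
```

The same GROUP BY pattern should carry over to the real tables once the actual table and column names are substituted.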
PATSTAT virtually unknown pre-2010
PATSTAT (simplified structure)
PATSTAT (full structure)
PATSTAT + BVD contribution
Since 2011, the OECD has been working with Idener on a new
firm-name matching project which adds to PATSTAT the massive company
info database of Bureau van Dijk.
“How To Kill Inventors: Testing The Massacrator© Algorithm For
Inventor Disambiguation” (Pezzoni, Lissoni & Tarasconi, 2012)
PATSTAT + BVD contribution
The technical name matching methodologies are similar to other
projects (e.g. approximate string matching; weighted token-based
comparisons; distance measures)
However, the BvD data adds detailed company information (~120
million firms globally), which increases the accuracy of matching
considerably.
Also adds crucial firm ownership and structure information
Many institutions have had BvD products for years (e.g. it is included in
WRDS), but the patent module is new.
You may already have this!
2. Why this is exciting: four headaches for patent
researchers
(three of which may be new to you)
Challenges in using and interpreting patent data
a. The classic problem: Need to disambiguate names. Lots of attention, lots
of progress.
b. Less obvious problem: Subsidiary assignees mask true ownership
c. Not clear what we can/should do problem: equivalents and priority
families
d. Very obscure problem: invisible transfers, citations to applications
a. The “classic” name disambiguation problem
Often illustrated by the variations of IBM’s name on patents.
Hundreds of different spellings, such as “International Business
Machines,” “IBM, INC” “Intl Business Machines,” etc.
It affects both firm and inventor names.
It matters because we want to know who created the IP. Critical for
accurate patent counts.
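A minimal sketch of the kind of approximate string matching used for name cleaning, applied to the IBM variants above. The abbreviation table and legal-suffix list are hypothetical and far smaller than anything a real project would use; the similarity score comes from Python's standard `difflib`, which is only one of many possible distance measures.

```python
import re
from difflib import SequenceMatcher

# Hypothetical lookup tables, for illustration only.
ABBREVIATIONS = {"intl": "international", "corp": "corporation"}
LEGAL_SUFFIXES = {"inc", "corporation", "co", "ltd", "llc"}


def normalize(name: str) -> str:
    """Lowercase, strip punctuation, expand abbreviations, drop legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1] on the normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


variants = ["International Business Machines", "Intl Business Machines", "IBM, INC"]
for v in variants:
    print(f"{v!r}: {similarity(variants[0], v):.2f}")
```

After normalization the "Intl" variant matches the full name exactly, while "IBM, INC" still scores very low. String distance alone does not resolve acronyms, which is one reason external firm information helps.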
Some of my favorite versions of IBM
Many ongoing projects on disambiguation
Many different approaches over the years: NBER firm name
standardization routines, inventor names from HBS Dataverse (Lai;
D'Amour; Yu; Sun; Fleming, 2013)
“How To Kill Inventors: Testing The Massacrator© Algorithm For
Inventor Disambiguation” (Pezzoni, Lissoni & Tarasconi, 2012)
PATSTAT data are a great complement to other approaches. Especially
suited for small and private firms, and European firms.
Its HAN database standardizes inventors with unique IDs. Continues to
be improved. Works very nicely when merged to Dataverse
b. Less obvious problem:
Subsidiary assignees mask true ownership
Even if we had perfectly matched names, many patents get assigned
to subsidiaries!
Source: Arora, Belenzon and Rios, SMJ 2014
Here are 8 (out of 312) J&J subsidiaries
Source: Arora, Belenzon and Rios, SMJ 2014
Subsidiaries are a problem for any kind of patent
analysis.
Because firms adopt different assignment strategies, the error will not be simply “noise.”
IBM assigns over 90% of its patents to IBM, whereas Johnson & Johnson assigns less than 10% to itself. Most of J&J’s patents go to wholly owned subsidiaries.
Failure to properly aggregate to the parent firm could be a problem for measures such as patent propensity, innovative performance, knowledge transfer, etc.
Empirical work needs to make a “theory of the firm” decision each time it defines what a firm is, in order to draw the boundaries
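To make the aggregation point concrete, here is a minimal sketch that rolls patent counts up from the named assignee to the ultimate parent. The firm names, ownership links and patent list are all made up; real ownership data would come from a source such as BvD and would also need to handle changes in ownership over time, which this sketch ignores.

```python
from collections import Counter

# Hypothetical ownership links: firm -> its direct owner.
OWNER = {
    "Sub A": "Holding B",
    "Holding B": "Parent Corp",
    "Sub C": "Parent Corp",
}


def ultimate_parent(firm, owner=OWNER):
    """Follow ownership links upward until a firm with no recorded owner."""
    seen = set()
    while firm in owner and firm not in seen:  # 'seen' guards against cycles
        seen.add(firm)
        firm = owner[firm]
    return firm


# One entry per patent: the assignee named on the patent (made-up data).
assignees = ["Sub A", "Sub A", "Sub C", "Parent Corp", "Other Firm"]

by_assignee = Counter(assignees)
by_parent = Counter(ultimate_parent(a) for a in assignees)

print("By named assignee:", dict(by_assignee))
print("By ultimate parent:", dict(by_parent))
```

In this toy data the parent holds one patent by name but four once subsidiaries are aggregated. Stopping the walk at an intermediate level is where the "theory of the firm" decision enters.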
These are J&J’s patents per the NBER data:
19 Subsidiaries
5,816 patents
These are J&J’s patents using
PATSTAT/BVD:
90 Subsidiaries
12,192 patents
NOTE: This does not simply come out of
PATSTAT. Result of months of refinement.
The ability to aggregate/disaggregate
critical for defining what the firm is on a
project-by-project basis
We may be interested in the internal
division of innovative labor (disaggregate)
Or in a top-down firm strategy question
(aggregate)
c. Not clear what we can/should do problem:
equivalents and priority families
In the “real world” most patents are not discrete quanta. They are often parts of families,
pools or complex thickets. So rather than a 1:1 match, we often have an m:m match between patents
and the technologies they cover.
Priority families are patents that share a common priority via the same original application.
These are often treated as the set of “equivalents” (Martinez, 2010).
They grow via divisionals (e.g. original app split into two) or continuations (original app
replaced by revised app).
Very few commercially significant patents are “singletons.”
Most valuable patents also spawn their own slightly tweaked fences.
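One way to think about family construction is as finding connected components in the graph of priority, continuation and divisional links. The sketch below does this with a small union-find structure. The application IDs and links are hypothetical, and treating a family as a connected component is a simplifying assumption; published family definitions differ in their details.

```python
from collections import defaultdict


def build_families(apps, links):
    """Group applications into connected components of (child, earlier_app) links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for app in apps:  # make sure singletons are registered too
        find(app)
    for child, earlier in links:
        root_child, root_earlier = find(child), find(earlier)
        if root_child != root_earlier:
            parent[root_child] = root_earlier

    groups = defaultdict(set)
    for app in parent:
        groups[find(app)].add(app)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical data: app2 continues app1, app3 and app4 derive from app2, etc.
apps = ["app1", "app2", "app3", "app4", "app5", "app6", "app7"]
links = [("app2", "app1"), ("app3", "app2"), ("app4", "app2"), ("app6", "app5")]

families = build_families(apps, links)
print(len(apps), "applications in", len(families), "families")
for fam in families:
    print(fam)
```

Seven applications collapse into three families here, one of them a singleton. Whether to count patents or families is then a research design choice rather than a data limitation.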
Take this patent.
ContentGuard’s “System, Apparatus, and Media for
Granting Access to and Utilizing Content”
In the “Related U.S. Application Data” section, it tells us that this
is a continuation of a continuation of a continuation of a
continuation.
It might seem that this is the fifth patent in a series of
continuations, four of which were granted before.
But it is not that simple. The “related” section on the face
only tells us the direct “parents”
In fact, there are actually 35 granted US patents with little more than
a different title and some changed words, ALL sharing just one priority
Equivalents are tricky, and
what we do about them has
to be driven by theoretical
framing also.
Is this one USPTO patent?
Or is it 35?
What do we make of the 35th
patent in this series?
Is that more “incremental”?
More “strategic”?
Is the original application
where the “novel” invention
lies?
d. Very obscure problem: invisible transfers,
citations to applications
How do we account for a firm’s acquired patents? What about a firm’s
sold patents?
These are inputs into its production function that are generally ignored.
But new data suggests that this is more than just a “rounding error”
Some of the most innovative firms also have the absorptive capacity to
identify and acquire very early technology.
Often these acquisitions occur at the application stage, pre-publication
Most current patent methodologies mistakenly assign the eventual
granted patents as having been internally generated.
This underestimates the significance of open innovation and markets
for technology
Just when you thought things
could not get more complicated…
ContentGuard’s fertile patent was
bought from Xerox!
ContentGuard has been
exploiting a patent that it did not
“invent”
A metaphysical conundrum:
Did ContentGuard “create” 34
patents?
Is Xerox the real innovator of this
technology?
Many cites occur at the A1 stage (application) and are
not counted in the B (granted) publications
Application filed by
Nevengineering, Inc
(45 cites in 7 years)
Granted to Google
(2 cites in 3 years)
Google buys Neven…and uses
the technology in Google Glass
(Source: Rios, 2014)
If more innovative patents are cited during
application, then the granted patent’s cites are
undercounted!
This is especially true for patents that are
traded during their application process and
then modified by the inventing firm
(continuations and divisionals)
This is especially important
in fast-moving technology
areas
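The undercounting described above can be sketched by comparing two counts for the same invention: cites to the granted publication only, versus distinct citing documents across all publications of the application. All IDs below are hypothetical, and the kind-code test (`"B"` prefix for grants) is a simplifying assumption.

```python
# Hypothetical publications of one application: publication id -> (kind, application id).
PUBLICATIONS = {
    "PUB-A1": ("A1", "APP-1"),  # published application
    "PUB-B": ("B", "APP-1"),    # granted patent
}

# Hypothetical citation pairs: (citing document, cited publication).
CITATIONS = [
    ("X1", "PUB-A1"),
    ("X2", "PUB-A1"),
    ("X3", "PUB-A1"),
    ("X3", "PUB-B"),  # X3 cites both publications; it should count once
    ("X4", "PUB-B"),
]


def citing_docs(app_id, grants_only=False):
    """Distinct documents citing any (or only granted) publications of an application."""
    docs = set()
    for citing, cited in CITATIONS:
        kind, cited_app = PUBLICATIONS[cited]
        if cited_app != app_id:
            continue
        if grants_only and not kind.startswith("B"):
            continue
        docs.add(citing)
    return docs


grant_cites = len(citing_docs("APP-1", grants_only=True))
all_cites = len(citing_docs("APP-1"))
print("Cites to the granted patent only:", grant_cites)
print("Cites at the application level:", all_cites)
```

Collecting citing documents in a set avoids double counting a document that cites both the application and the grant.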
[Figure labels recovered from the diagram]
Neven Vision: founder Neven Hartmut (Director of Engineering), 9 patents
University of Washington: Prof. Babak Parviz (Head, Project Glass), 12 patents
Motion Research Technologies Inc: Dobson Dominic, 3 patents
MicroOptics Corporation: Mark Spitzer (Director of Operations), 6 patents
Salvatore Scellato (PhD student): 1 patent
DNNresearch: Geoffrey Hinton
Internal inventors: 37 patents
Google as a successful technology aggregator
This is what I mean by the “genealogy of knowledge”
If we want to trace the locus of innovation, we need to
go down corporate and patent rabbit holes
US2003208447
US6236971, US6708157, US6714921,
US6895392, US6898576, US6910022,
US6925448, US6928419, US6934693,
US6944600, US6957193, US6959290,
US7024392, US7039613, US7043453,
US7058606, US7065505, US7113912,
US7200574, US7209902, US7225160,
US7260556, US7266529, US7269576,
US7269577, US7359881, US7389270,
US7523072, US7664708, US7788182,
US7970709, US8170955, US8370956,
US8393007, US8443457, US8484751,
US8671461
Pendrell has only 4 subsidiaries. Imagine what this looks like for J&J!
Google’s genealogy inseparable from its technological
roots
Some interesting areas for future research
What firms are accessing early technology?
How does organizational structure relate to innovation performance?
Does better aggregation of citations (at the family level) + sources of
citation (examiner vs. applicant) increase the reliability of citations as
measures of knowledge flows?
Are family measures (size, diversity) fruitful for capturing innovative
performance?
PATSTAT: pros and cons
Massive amounts of data. International coverage and patent family
information facilitates cross-country comparisons
Constant updates.
But expensive (~$2,400), with frequent errors and revisions.
Must get to know it intimately before relying on it.
For example, just one buried non-ASCII character can lose millions of
observations during imports. You would never know if you did not know
what to expect the data to look like.
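A minimal sketch of two cheap sanity checks for the import problem described above: locating non-ASCII characters in the raw lines, and comparing the number of rows read against the number expected. The sample rows and the expected count are made up, and non-ASCII characters are often legitimate in names, so a hit is a prompt to inspect the import settings rather than an error in itself.

```python
def find_non_ascii(lines):
    """Yield (line_number, column, character) for every non-ASCII character."""
    for line_no, line in enumerate(lines, start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                yield line_no, col, ch


def check_row_count(lines, expected_rows):
    """Return (ok, rows_read), counting only non-empty lines."""
    rows_read = sum(1 for line in lines if line.strip())
    return rows_read == expected_rows, rows_read


# Made-up raw rows; \u00c9 is an accented capital E.
raw_lines = [
    "1,ACME INC",
    "2,SOCI\u00c9T\u00c9 X",
    "3,FOO LTD",
]

hits = list(find_non_ascii(raw_lines))
ok, rows_read = check_row_count(raw_lines, expected_rows=3)

for line_no, col, ch in hits:
    print(f"line {line_no}, column {col}: {ch!r}")
print("row count matches expectation:", ok)
```

Running the row-count check both on the raw file and on the imported table is what reveals a silent truncation.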
Still a bit of a hacker/wild west frontier, but a great place to start for
those who want to be ahead of the curve
Must read: Gianluca Tarasconi’s RawPatentData blog. User
contributors catch errors faster than EPO
Tips: Hardware