The Challenges of Managing “Big Data” in the Patent Field
14-15 April 2014, Nice
Olivier Huc
Specialists in Patent Information
Building Intelligent Patent Information Solutions since 1996
What we do
Trusted by IP experts Worldwide
Corporations, National Patent Offices, Patent Attorneys and Patent Search Firms worldwide
International Customer Support
Global client base
With offices and support across Europe, North America and Asia
Patent Families
Analytics
Quality Control
Fast Search
Legal Status Review
Alerts
• 23 Full Text Collections
• 48 Million Families
• 103 Issuing Authorities
• IPC, CPC, US and JP classes
• Quality Controlled content
• Normalised data
3 Patent Data Myths
• Myth #1: Patent data is just another type of “Big Data”
• Myth #2: Patent Data is handled automatically
• Myth #3: Patent Data is consistent worldwide
• Patent data volume might be smaller, but the data is more complex (languages, text, fields)
• Patent data is not retrieved on the fly, it is hosted, indexed and optimized
• There are multiple sources with overlap
• Data quality is a major issue
• Users have a low tolerance for errors
The reality
• Total data volume exceeds 35 TB
• 49 million families and 103 publishing bodies
• 95 million publications
• 47 million full texts, including over 23 million machine translations from non-Latin languages into English
• 54 million clipped images and 45 million complete sets of drawings
Database Facts
• Minesoft and RWS host their own data center, located just outside of London
  • Control
  • Confidentiality
  • Reactivity
  • Speed
• Distributed search engine
• Continuous data update and indexing => no need to interrupt or restart the online services, and new data is immediately searchable
Hardware & Search Engine
• Multiple data sources:
  • DOCDB weekly feeds (EPO)
  • National Patent Offices
  • Commercial collections
  • External information (such as National Registers)
• Despite the complexity, having multiple sources for the same country is a great advantage:
  • Complementarity
  • Improved quality
  • Security
  • Speed
Sources
• We perform stringent quality checks
  • Human
  • Programmatic
• Manual checks on some source data collections as they arrive: e.g. India (IN), Thailand (TH) and the Philippines (PH)
• Errors in data are identified programmatically against strict pre-set parameters and then corrected manually by our data team
  • e.g. IC8=AO1G1/00 (the letter O in place of the digit 0)
• Although we follow the EPO’s INPADOC (extended) family rules, we recreate all our families to ensure consistency
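A programmatic check of the kind that catches IC8=AO1G1/00, where the letter O appears to stand in for the digit 0, can be sketched as a pattern match on the classification symbol. The pattern below follows the general IPC layout (section letter A to H, two-digit class, subclass letter, main group, slash, subgroup). The function names are illustrative assumptions, not the actual PatBase checks.

```python
import re

# IPC symbol layout: section (A-H), two-digit class, subclass letter,
# main group (1-4 digits), "/", subgroup (2-6 digits).
IPC_PATTERN = re.compile(r"[A-H]\d{2}[A-Z]\s?\d{1,4}/\d{2,6}")


def is_valid_ipc(symbol: str) -> bool:
    """Return True when the whole symbol matches the IPC layout above."""
    return IPC_PATTERN.fullmatch(symbol.strip()) is not None


def flag_for_manual_edit(symbols: list[str]) -> list[str]:
    """Collect the symbols that fail the check (hypothetical helper)."""
    return [s for s in symbols if not is_valid_ipc(s)]
```

Symbols returned by `flag_for_manual_edit` would then go to a person for correction rather than being repaired automatically.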
Data Quality
Adding extra value to PatBase data:
• Families are automatically reviewed and then, if necessary, rebuilt when we receive new and/or corrected information (e.g. priority)
• Tagging of examples, paragraphs and claims is done in order to facilitate searching specific sections of text
• Machine translation: when a family gets new text, the family is reassessed to see if a machine translation needs to be added/replaced/deleted.
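Rebuilding families from priority data can be illustrated with a small union-find sketch. It assumes the INPADOC extended family definition, in which publications belong to one family when they are linked directly or indirectly through at least one shared priority. The identifiers and the function name are hypothetical, and this is not the actual PatBase implementation.

```python
from collections import defaultdict


def build_extended_families(priorities_by_publication: dict[str, set[str]]) -> list[set[str]]:
    """Group publications that share a priority, directly or indirectly."""
    parent: dict[tuple[str, str], tuple[str, str]] = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link every publication to each of its priorities.
    for publication, priorities in priorities_by_publication.items():
        find(("pub", publication))
        for priority in priorities:
            union(("pub", publication), ("prio", priority))

    # Collect publications by the root of their connected component.
    families: dict[tuple[str, str], set[str]] = defaultdict(set)
    for publication in priorities_by_publication:
        families[find(("pub", publication))].add(publication)
    return sorted(families.values(), key=lambda family: sorted(family))
```

When a corrected priority arrives, rerunning the grouping on the affected records shows whether a family has to be merged or split.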
Data Quality
TW AN/PR inputs → TW AN/PR outputs (Emperor year conversion & type of application)
083303675 → TW19940303675F
092128911 → TW20030128911
092128911 → TW20040201682U
US AN/PR inputs → US AN/PR outputs (type of application & year)
US29/356,858 20100303 → US20100356858F
1301618611 A → US20110016186
AT AN/PR inputs → AT AN/PR outputs (type of application & year)
A 709/95 → AT19950000709
GM647/96 → AT19960000647U
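The first two Taiwanese rows above can be reproduced with a short sketch. It relies on the fact that year 1 of the Republic of China (Emperor year) calendar is 1912, so the Gregorian year is the ROC year plus 1911. The type-digit mapping (1 for no suffix, 2 for U, 3 for F) is an assumption inferred from the example outputs, and the function name is hypothetical.

```python
ROC_YEAR_OFFSET = 1911  # ROC year 1 corresponds to Gregorian year 1912

# Assumption inferred from the examples: the first digit after the year
# encodes the type of application.
TYPE_SUFFIX = {"1": "", "2": "U", "3": "F"}


def normalise_tw_application(raw: str) -> str:
    """Convert a 9-digit TW number (3-digit ROC year + 6-digit serial)."""
    digits = raw.strip()
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError(f"unexpected TW application number: {raw!r}")
    gregorian_year = int(digits[:3]) + ROC_YEAR_OFFSET
    serial = digits[3:]
    suffix = TYPE_SUFFIX.get(serial[0], "")
    # Pad the serial to seven digits, as in the outputs shown above.
    return f"TW{gregorian_year}{serial.zfill(7)}{suffix}"
```

Inputs that do not fit the expected shape raise an error instead of being guessed at, which matches the idea of routing anomalies to manual editing.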
Standardisation of patent data
Formatting application and priority information
• Formatting patent numbers and kind codes
• Formatting dates
Thailand uses Buddhist years (Gregorian calendar year plus 543)
US date format: 2011/09/02 (2 September 2011)
European date format: 2011/09/02 (9 February 2011)
Standardisation of patent data
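Both date issues can be made concrete with a minimal sketch: subtracting 543 from a Buddhist Era year, and parsing the same slash-separated string with an explicit month-first or day-first order instead of guessing. The function names and the `day_first` flag are illustrative assumptions.

```python
from datetime import date, datetime

BUDDHIST_ERA_OFFSET = 543  # Buddhist Era year = Gregorian year + 543


def buddhist_to_gregorian_year(be_year: int) -> int:
    """Convert a Thai Buddhist Era year to the Gregorian year."""
    return be_year - BUDDHIST_ERA_OFFSET


def parse_slash_date(raw: str, day_first: bool) -> date:
    """Parse 'YYYY/xx/yy' with the day/month order stated by the caller."""
    fmt = "%Y/%d/%m" if day_first else "%Y/%m/%d"
    return datetime.strptime(raw, fmt).date()
```

The same string `2011/09/02` yields two different dates depending on `day_first`, so the order has to be fixed per source before the data is loaded.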
The EPO standardises names to assist searching.
PatBase contains both standard and non-standard names.
Standard name: assigned by the EPO
Non-standard name: whatever is filed or published on the patent
Standard       Non-standard
PIRELLI IND    PIRELI SPA
PIRELLI IND    PIRELLA SPA
PIRELLI IND    PIRELLE S P A
PIRELLI IND    PIRELLI DPA
PIRELLI IND    PIRELLI S p A
PIRELLI IND    PIRELLI S A
PIRELLI IND    PIRELLI S P A
PIRELLI IND    PIRELLI S P A FIRMA
PIRELLI IND    PIRELLI S P A IT
PIRELLI IND    PIRELLI S P CA
PIRELLI IND    PIRELLI SPA IT
PIRELLI IND    PIRELLI SPP
PIRELLI IND    PIRELLU SPA
PIRELLI IND    PIRELLY SPA
This is a small sample of the non-standard names to which the EPO assigns the standard name ‘Pirelli’
There are currently 188 non-standard names for the standard name ‘Pirelli’
Standardisation of patent data
• Date Formats
• All fields, e.g. patent classifications, assignees, text etc., have set parameters. Where these are not matched, data errors are flagged for manual editing.
• If a text is illegible (we have programmatic systems in place to measure this), it is not allowed into the database and is flagged as requiring manual attention (often manual retyping).
• Character conversions
  • We have thousands of symbol / letter conversions in our programs:
    • & is replaced by and
    • œ is replaced by oe
    • ß is replaced by ss
Data Improvements
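The character conversions listed above can be sketched with a translation table, taking the third rule to refer to the German sharp s (ß). Only the three conversions named on the slide are included, and the function name is an assumption.

```python
# Three of the symbol / letter conversions named above.
CONVERSIONS = str.maketrans({
    "&": "and",
    "œ": "oe",
    "ß": "ss",
})


def convert_characters(text: str) -> str:
    """Apply the single-character replacements to a text."""
    return text.translate(CONVERSIONS)
```

A translation table handles all replacements in one pass over the text, which helps when thousands of conversions are in play.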
Insertion of paragraph breaks and paragraph numbers
Data Improvements
Output in PatBase
Source text
• Errors appear in source data so manual checks are essential
• Example – Granted patent information from the Indian Patent Office Journal. Three different inventions have incorrectly been given the same publication number
Manual checks
IN000008
Data quality issues
On the Thai patent office website, the same publication number is used for two different applications
Patent copy for TH48405 A
In PatBase
Application number: TH19981004295, Publication number: TH48406 A
Application number: TH19981002185, Publication number: TH48405 A
Wrong number
Correct number
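A reused publication number of this kind can be surfaced programmatically before a person decides which record is right. The sketch below groups incoming records by publication number and reports any number attached to more than one application. The record layout and the function name are assumptions.

```python
from collections import defaultdict


def find_reused_publication_numbers(
    records: list[tuple[str, str]],
) -> dict[str, list[str]]:
    """Map each publication number used by several applications to them.

    Each record is a (publication_number, application_number) pair.
    """
    applications: dict[str, set[str]] = defaultdict(set)
    for publication_number, application_number in records:
        applications[publication_number].add(application_number)
    return {
        publication: sorted(apps)
        for publication, apps in applications.items()
        if len(apps) > 1
    }
```

Feeding it the two Thai applications above under the shared number TH48405 A returns that number with both application numbers, ready for a manual check against the patent copy.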
Manual checks
• Acquiring data from multiple sources lets us supplement records and also alerts us to errors, which improves accuracy
KR20010012826 A – Glial Cell Line-Derived Neurotrophic Factor Receptors
KR20010112826 A – Single phase six pole DC brushless axial fan motor of transistor type
Source EPO: error in information. This EPO record is a combination of two inventions. The publication number does not match the invention.
Identifying data errors
Incorrect data received from source
In cases such as these we correct the error in PatBase and inform the EPO
NULL values were supplied in the EPO’s DOCDB file as Applicants
Identifying data errors
Example of an incorrect assignment from the USPTO
PatBase family 41683901
Excerpt from USPTO assignments database
Identifying data errors
Translations
• Principle: the English text of an equivalent is always better than a machine translation
• All non-Latin texts are machine translated into English and indexed when added to PatBase
• On a rolling basis we re-translate texts to benefit from the continuous improvements of translation engines
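The principle above translates into a simple selection rule per family: use the English text of an equivalent when one exists, and fall back to a machine translation otherwise. The sketch below is illustrative only. The `FamilyText` record, the language codes and the returned labels are assumptions, not PatBase structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FamilyText:
    publication: str
    language: str  # language code such as "en", "th" or "ko"
    text: str


def choose_english_source(
    texts: list[FamilyText],
) -> tuple[str, Optional[FamilyText]]:
    """Pick the text to index in English for one family."""
    for member in texts:
        if member.language == "en":
            # An English equivalent always wins over machine translation.
            return "native", member
    if texts:
        # No English equivalent: the first text is sent for translation.
        return "machine_translation", texts[0]
    return "none", None
```

Re-running the rule whenever a family gains or loses text shows whether a machine translation should be added, replaced or deleted.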
Machine translation
• Machine translations are made as data is added, removed or rebuilt. This is all done before indexing.
• We run a rolling re-translate and re-index program to optimize the quality of our machine translated full-text
Original translation, Thai into English Re-translation, Thai into English
Translations
Re-translation, Korean into English
Original translation, Korean into English
Translations
Assignee translations
• Non-Latin assignees are indexed
• Non-Latin assignees are also translated
  • The first 10,000 CN and JP assignees have been manually translated by RWS
  • All others are machine translated until an “official” Latin name appears in the family
Cross-lingual Tool
• Initially developed by WIPO, CLIR (Cross Lingual Information Retrieval) allows our users to generate multilingual searches
• Using an advanced statistical text analysis system based on the PCT corpus, the cross-lingual search tool identifies variants in multiple languages for search terms entered by the user.
=> Better translation – translated words originate from PCT applications
• Source: INPADOC
• All legal status events are categorised with a PRS code
• Challenge: 2628 different PRS codes, some no longer in use
• Solution: Grouping similar legal events together:
Legal Status
Reassignment
Deemed Withdrawn/Abandoned
Examined
Renewal Fees Paid
Granted
Lapsed/Expired/Ceased/Dead
Licence
Non-Entry into National Phase
National Phase Entry
Opposition Filed/Request for revocation
Published
Restored/Reinstated/Amended
Revoked/Rejected/Annulled/Invalid
Withdrawn/Abandoned/Terminated/Void
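Grouping legal events is essentially a many-to-one lookup from event code to group. The sketch below shows the shape of such a lookup. The codes are made-up placeholders and NOT real PRS codes, and the mapping, the fallback label and the function names are assumptions.

```python
from collections import Counter

# Placeholder codes for illustration only; these are not real PRS codes.
CODE_TO_GROUP = {
    "XX-GRANT": "Granted",
    "XX-FEE": "Renewal Fees Paid",
    "XX-LAPSE": "Lapsed/Expired/Ceased/Dead",
    "XX-OPPO": "Opposition Filed/Request for revocation",
}


def group_for(code: str) -> str:
    """Return the group for an event code, or a fallback label."""
    return CODE_TO_GROUP.get(code.strip().upper(), "Unclassified")


def summarise(codes: list[str]) -> Counter:
    """Count the events of a record by group."""
    return Counter(group_for(code) for code in codes)
```

An explicit fallback keeps codes that are unknown or no longer in use visible instead of silently dropping them.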
Legal Status Timeline
• Most patent databases are structured and optimized for Patent Searching, not for Analytics
• At Minesoft, we developed a special database with proprietary meta tags dedicated to analytics
• Coverage is important – Beware of data gaps
• Importance of a web service (API)
• Importance of incorporating your own custom data or legal status information in your analysis
Analytics