Searching Chinese Patents Presentation at Enterprise Data World

Searching Chinese Patents: Challenges and Solutions When Building an Innovative Discovery Interface

ERIC PUGH | epugh@o19s.com | @dep4b

Who am I?

• Principal at OpenSource Connections - Solr/Lucene Search Consultancy http://bit.ly/OSCCommercialSummary

• Member of Apache Software Foundation

• SOLR-284 UpdateRichDocuments (July 07)

Co-Author

Next Edition May!

Agilista

Selected Customers

Telling some war stories

• Cloud is new at USPTO

• Discovery is a tenuous concept

• Conflicting User Goals

• Fixed Budget: trade scope for budget/quality

Telling some stories

➡How to inject “Discovery” into your app

• The Cloud to the Rescue (sorta!)

• Parsers and Parsers and Parsers

• Don’t be Afraid to Share!

Flow of understanding

Data -> Information -> Understanding

Building “Discovery”

Engine

UX

Data

Tension

Grok data at gut level

Look for outliers

User Interviews

Surveys

Card Sorting

Scenarios/Personas

Brainstorm

Mockups

Proof of concept

Where to spend time?

Engine

We spent

• How to inject “Discovery” into your app

➡The Cloud to the Rescue (sorta!)

Boy meets Girl Story

Metadata

Ingest Pipeline

Discovery UX

Content Files

How we built it

EmberJS Single Page Search App

Server Dashboard

GPSN UI (Bootstrap CSS)

Browsers

Mobile/Tablet

Third Party Application

Servers

S3 Bucket

Solr

Solr as a NoSQL Datastore

• Used “atomic updates” to merge three source datasets into single final dataset.

• All text displayed in application stored in Solr.

• Dynamic schema supports many languages: en and cn right now.
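As a rough illustration of the "atomic updates" idea, the sketch below builds the JSON body for a Solr atomic update, where each field carries a {"set": value} modifier so that only the named fields change and the rest of the document is left alone. Each source dataset could send its own fields for the same id this way. The class name, the field name title_en and the choice of the patent number as the id are assumptions for illustration, not details taken from GPSN. Note that Solr can only apply atomic updates when the existing fields are stored (or have docValues), which fits with keeping all displayed text in Solr.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: builds a Solr atomic-update JSON body using the "set" modifier. */
class AtomicUpdateSketch {

    /** Minimal JSON quoting: escapes backslash and double quote only (no control characters). */
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Returns a one-document update where every field is wrapped in {"set": value}. */
    static String atomicUpdate(String id, Map<String, String> fields) {
        StringBuilder json = new StringBuilder("[{\"id\":").append(quote(id));
        for (Map.Entry<String, String> e : fields.entrySet()) {
            json.append(',').append(quote(e.getKey()))
                .append(":{\"set\":").append(quote(e.getValue())).append('}');
        }
        return json.append("}]").toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title_en", "Golf glove"); // hypothetical field name
        System.out.println(atomicUpdate("039302717", fields));
    }
}
```

The resulting body would be sent to the collection's update handler as JSON; a second source can later send a different set of fields for the same id without overwriting the first.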

Lessons Learned

Don’t Move Files

• Copying 5 TB of data up to S3 was very painful.

• We used S3Funnel which is “rsync like”

• We bought more network bandwidth for our office
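To see why the upload hurt, a back-of-the-envelope calculation helps. The sketch below estimates how long 5 TB takes at a given sustained link speed. The link speeds are assumptions chosen for illustration; the deck does not say what bandwidth the office had, and real transfers rarely sustain the full line rate.

```java
/** Sketch: idealized transfer time for a given data volume and link speed. */
class TransferTimeSketch {

    /** Days needed to move `bytes` over a link sustaining `megabitsPerSecond`. */
    static double transferDays(double bytes, double megabitsPerSecond) {
        double bits = bytes * 8;
        double seconds = bits / (megabitsPerSecond * 1_000_000);
        return seconds / 86_400; // seconds per day
    }

    public static void main(String[] args) {
        double fiveTerabytes = 5e12; // 5 TB in decimal units
        // Assumed link speeds in Mbit/s, not figures from the project.
        for (double mbps : new double[] {10, 100, 1000}) {
            System.out.printf("%.0f Mbit/s: %.1f days%n", mbps, transferDays(fiveTerabytes, mbps));
        }
    }
}
```

Even at a fully sustained 100 Mbit/s the copy takes more than four and a half days, and at 10 Mbit/s it takes over six weeks.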

Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.

–Andrew Tanenbaum, 1981

Data Size

[Chart: Patent Count by year, 1985 to 2011; y-axis from 250000 to 1000000; labeled value 277871]

Think about Data Volume

• Started with older dataset, and tasks like TIFF -> PNG conversion became progressively harder. Map/Reduce is nice, but we needed more visibility into progress.

• Should have sharded our Search Index from the beginning, just to make indexing a faster and cheaper process (500 GB index!)

• Using 8 shards dropped indexing time from 12 hours to 2 hours. Merging took 5!

• We had too many steps in our pipeline
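The sharding lesson can be sketched as a simple routing function: hash each document id to one of N shards so that the shards can be built in parallel. This is a simplified illustration, not how GPSN or Solr assigns documents (Solr's own routers use a different hash); the class and method names are made up for the example, and only the first id comes from the deck's sample record.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: hash-based assignment of documents to index shards. */
class ShardRoutingSketch {

    /** Maps a document id to a shard number in [0, numShards). */
    static int shardFor(String docId, int numShards) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(docId.hashCode(), numShards);
    }

    /** Splits ids into one bucket per shard so each bucket can be indexed independently. */
    static List<List<String>> partition(List<String> docIds, int numShards) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numShards; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String id : docIds) {
            buckets.get(shardFor(id, numShards)).add(id);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Only the first id appears in the deck; the others are hypothetical.
        List<String> ids = List.of("039302717", "039302718", "039302719", "039302720");
        List<List<String>> buckets = partition(ids, 8);
        for (int shard = 0; shard < buckets.size(); shard++) {
            System.out.println("shard " + shard + ": " + buckets.get(shard));
        }
    }
}
```

Because the same id always maps to the same shard, re-indexing a document touches only one shard, and keeping the shards separate at query time avoids a merge step altogether.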

Building a Patents Index

5 days 3 days 30 Minutes

Key scaling concept behind GPSN:

Cloud meets Ocean

More prosaically…

Database

Server

Client

➡Parsers and Parsers and Parsers

Why so many pipelines?

Morphlines

Tika as a pipeline?

Lots of File Types

• Sometimes in ZIP archives, sometimes not!

• Multiple XML formats as well as CSV and EDI

• Purplebook, Yellowbook, Redbook, Greenbook, Questel, SIPO…
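The "sometimes in ZIP archives, sometimes not" problem can be handled by peeking at the first bytes of each input before deciding how to read it. The sketch below uses only the JDK: it checks for the ZIP local-file-header signature and either lists the archive's entries or returns the file itself. It is an illustration of the idea, not the GPSN pipeline; the class name, method names and the "!/" naming convention for entries are assumptions.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Sketch: treat an input as a ZIP archive only if its first bytes say so. */
class ArchiveSketch {

    /** True if the stream starts with the signature "PK\3\4"; the stream is reset afterwards. */
    static boolean looksLikeZip(BufferedInputStream in) throws IOException {
        in.mark(4);
        byte[] head = new byte[4];
        int n = in.readNBytes(head, 0, 4); // Java 9+
        in.reset();
        // An empty archive starts with a different signature and is not handled here.
        return n == 4 && head[0] == 'P' && head[1] == 'K' && head[2] == 3 && head[3] == 4;
    }

    /** Names of the files to process: the archive's entries, or the single file itself. */
    static List<String> filesToProcess(String name, InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        List<String> names = new ArrayList<>();
        if (!looksLikeZip(in)) {
            names.add(name);
            return names;
        }
        try (ZipInputStream zip = new ZipInputStream(in)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (!e.isDirectory()) {
                    names.add(name + "!/" + e.getName());
                }
            }
        }
        return names;
    }
}
```

Checking content rather than the file extension matters here because the same feed can arrive zipped one week and unzipped the next.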

Tika as a pipeline!

• Auto detects content type

• Metadata structure has all the key/value needed for Solr

• Allows us to scale up with Behemoth project (and others!).

Lots of files!

HHHHHT APS1 ISSUE - 760106
PATN
WKU 039302717
SRC 5
APN 5328756
APT 1
ART 353
APD 19741216
TTL Golf glove
ISD 19760106
NCL 4
ECL 1
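A record like the sample above is essentially a list of tag/value lines, which maps naturally onto the key/value metadata that a search index wants. The sketch below is a simplified, JDK-only illustration of that mapping and not the project's GreenbookParser: it treats any line that starts with a three or four letter uppercase tag followed by a value as a field, and skips the header line and bare section markers such as PATN. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: turns tag/value lines of a Greenbook-style record into a field map. */
class GreenbookFieldsSketch {

    static Map<String, String> parseFields(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : record.split("\\R")) {
            // Split into the tag and the rest of the line.
            String[] parts = line.trim().split("\\s+", 2);
            if (parts.length == 2 && parts[0].matches("[A-Z]{3,4}")) {
                // Keep the first value if a tag repeats; a real parser would collect all of them.
                fields.putIfAbsent(parts[0], parts[1]);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String record = String.join("\n",
            "HHHHHT APS1 ISSUE - 760106",
            "PATN",
            "WKU 039302717",
            "APD 19741216",
            "TTL Golf glove",
            "ISD 19760106");
        parseFields(record).forEach((tag, value) -> System.out.println(tag + " = " + value));
    }
}
```

In this sample the TTL tag carries the title ("Golf glove"); each tag could then be mapped to a field in the index.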

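Detector to pick File

import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.io.IOUtils;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.LookaheadInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class GreenbookDetector implements Detector {

    // Greenbook files contain a PATN marker near the top.
    private static Pattern pattern = Pattern.compile("PATN");

    @Override
    public MediaType detect(InputStream stream, Metadata metadata) throws IOException {
        MediaType type = MediaType.OCTET_STREAM;
        // Read at most the first 1024 bytes; closing the lookahead resets the underlying stream.
        InputStream lookahead = new LookaheadInputStream(stream, 1024);
        String extract = IOUtils.toString(lookahead, "UTF-8");
        Matcher matcher = pattern.matcher(extract);
        if (matcher.find()) {
            type = GreenbookParser.MEDIA_TYPE;
        }
        lookahead.close();
        return type;
    }
}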

➡Don’t be Afraid to Share!

Your BigData solution isn’t perfect

• Allow users to export data

• Most business users want to work in Excel. Accept it!

• Allow other applications to build on top of your application.

GPSN has

• Lots of easy “Print to PDF” options.

• Data stored in S3 as:

• individual patent files

• chunky downloads.

• Filtering to expand or select specific data sets.

• Permalinks: simple, very sharable URLs.

• Underlying Solr service is exposed to public via proxy. You can query Solr yourself.

• Need advanced querying? Use Lucene syntax in search bar.
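Because the Solr service is reachable through a proxy, a query is just a URL, which is also what makes permalinks easy. The sketch below builds a select URL from a Lucene-syntax query using the standard Solr parameters q, rows and wt. The base URL, the core name and the field names title_en and year are placeholders invented for this example; the deck does not give the real endpoint or schema.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Sketch: builds a Solr select URL from a Lucene-syntax query string. */
class SolrQueryUrlSketch {

    static String selectUrl(String baseUrl, String luceneQuery, int rows) {
        // URL-encode the query so quotes, colons and brackets survive the trip.
        return baseUrl + "/select?q=" + URLEncoder.encode(luceneQuery, StandardCharsets.UTF_8)
            + "&rows=" + rows + "&wt=json";
    }

    public static void main(String[] args) {
        // Hypothetical endpoint and field names.
        String url = selectUrl("https://example.org/solr/patents",
            "title_en:\"golf glove\" AND year:[1975 TO 1980]", 10);
        System.out.println(url);
    }
}
```

The same query string, with its phrase and range clauses, is what a user could type into the search bar directly.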

One more thought...

Measuring the impact of our algorithm changes is just getting harder with Big Data.

www.quepid.com

Quepid: Give your Queries some Love

Project SolrPanl

We need beta users!

Thank you!

Questions?

• epugh@o19s.com

• @dep4b

• www.opensourceconnections.com

• slideshare.com/o19s

Nervous about speaking up? Ask me later!
