Searching images from the past

New Searching images from the past - Arquivo.pt · 2019. 7. 4. · 6 000 000 000 files ~15% images 900 million searchable images We need more servers!!! Image Search - estimate

Arquivo.pt

https://arquivo.pt

From Portugal

Publicly available web archive

Research infrastructure

Source code on GitHub: github.com/arquivo

Free

URL Search

Text Search

Image Search (new)

Image Search - viewer

Image Search - estimate

6 000 000 000 files

~15% images

900 million searchable images

We need more servers!!!
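The estimate on this slide is plain arithmetic. A quick check, assuming the ~15% image share applies uniformly across all archived files (the function name is illustrative only):

```python
# Back-of-the-envelope estimate using the figures from the slide.
TOTAL_FILES = 6_000_000_000   # archived files
IMAGE_SHARE = 0.15            # roughly 15% of files are images


def estimate_image_count(total_files: int, image_share: float) -> int:
    """Return the estimated number of image files, rounded to a whole number."""
    return round(total_files * image_share)


if __name__ == "__main__":
    # Prints 900,000,000
    print(f"{estimate_image_count(TOTAL_FILES, IMAGE_SHARE):,}")
```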

Image Search - reality

17 million searchable images

Unique images within a collection

From 1996 to 2017

Size greater than 50px width and 50px height

Each image has to link to an archived web page that contains the image.
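The criteria on this slide can be read as a filter over candidate images. The sketch below is one possible reading, not the Arquivo.pt implementation: the `ImageCandidate` fields (`collection`, `digest`, and so on) are hypothetical, and duplicates are assumed to be detected by a content hash.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple

MIN_SIDE_PX = 50                    # both sides must exceed this (from the slide)
FIRST_YEAR, LAST_YEAR = 1996, 2017  # capture years covered (from the slide)


@dataclass(frozen=True)
class ImageCandidate:
    collection: str           # hypothetical: name of the crawl collection
    digest: str               # hypothetical: content hash used to spot duplicates
    width: int
    height: int
    year: int                 # year the image was archived
    page_url: Optional[str]   # archived page that embeds the image, if known


def passes_filters(img: ImageCandidate) -> bool:
    """Apply the size, date-range and linked-page criteria."""
    big_enough = img.width > MIN_SIDE_PX and img.height > MIN_SIDE_PX
    in_range = FIRST_YEAR <= img.year <= LAST_YEAR
    return big_enough and in_range and bool(img.page_url)


def select_searchable(candidates: Iterable[ImageCandidate]) -> List[ImageCandidate]:
    """Keep the first copy of each image per collection that passes the filters."""
    seen: Set[Tuple[str, str]] = set()
    kept: List[ImageCandidate] = []
    for img in candidates:
        key = (img.collection, img.digest)
        if key in seen or not passes_filters(img):
            continue
        seen.add(key)
        kept.append(img)
    return kept
```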

Image Search Workflow

Workflow

1. Create image indexes from ARC/WARC files

2. Image classification

3. Solr indexing

1. Create image indexes - infrastructure

Hadoop 3 cluster

MongoDB sharded cluster

1. Create image indexes - steps

1.1 Extract all ARC/WARC image records

• Store Image Record

  -- image thumbnail

  -- image attributes

1.2 Extract all ARC/WARC HTML records

• Extract <img> tags in each HTML record

• If the image exists in the database, store Image Index

  -- page URL

  -- page timestamp

  -- page title
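Step 1.2 can be sketched on a single page with the standard library. This is a simplified illustration only: a plain `dict` stands in for the database of image records from step 1.1, the production pipeline runs on Hadoop and MongoDB, and all names and dictionary keys here are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List
from urllib.parse import urljoin


class ImgSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def build_image_indexes(page_url: str, page_timestamp: str, page_title: str,
                        html: str, image_records: Dict[str, dict]) -> List[dict]:
    """Return one index entry per <img> whose image is already in image_records."""
    parser = ImgSrcParser()
    parser.feed(html)
    indexes = []
    for src in parser.sources:
        image_url = urljoin(page_url, src)   # resolve relative src values
        if image_url in image_records:       # "if image exists in the database"
            indexes.append({
                "image_url": image_url,
                "page_url": page_url,
                "page_timestamp": page_timestamp,
                "page_title": page_title,
            })
    return indexes
```

Images referenced by a page but missing from the image records are skipped, which mirrors the "if the image exists" condition in step 1.2.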

2. Image classification - infrastructure

2 Tesla P4 GPUs

2. Image classification - step

Automatically classify images from step 1

Safe for work:

• Value from 0.000 to 1.000

• Greater than 0.500 (considered safe for work)

• Less than 0.500 (may have explicit content)

Add more classifiers in the future
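The threshold rule above can be written as a small function. The slide does not say how a score of exactly 0.500 is handled; this sketch assumes the conservative choice of treating it as possibly explicit, and the label strings are hypothetical.

```python
SAFE_THRESHOLD = 0.5  # scores above this are considered safe for work


def label_image(safe_score: float) -> str:
    """Map a safe-for-work score in [0, 1] to a label."""
    if not 0.0 <= safe_score <= 1.0:
        raise ValueError(f"score out of range: {safe_score}")
    # Assumption: a score of exactly 0.5 is not treated as safe.
    return "safe" if safe_score > SAFE_THRESHOLD else "maybe_explicit"
```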

3. Solr indexing - infrastructure

2 servers with Apache Solr

~2-3 GB of RAM

~400 GB of disk space (indexes)

Configure SolrCloud in the near future

3. Solr indexing - step

Index JSON image records
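Indexing JSON records in Solr usually means posting a JSON array of documents to a collection's `/update` handler. The sketch below only builds such a request. The base URL, the collection name `images`, and the record fields are assumptions for illustration, not the actual Arquivo.pt schema.

```python
import json
from typing import Dict, List
from urllib.request import Request


def build_solr_update(solr_base: str, collection: str, docs: List[Dict]) -> Request:
    """Build (but do not send) a POST that adds JSON documents to a Solr collection."""
    url = f"{solr_base.rstrip('/')}/{collection}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return Request(url, data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")


# Hypothetical image record; field names are illustrative only.
record = {
    "id": "example-image-1",
    "page_title": "Example page",
    "page_url": "http://example.com/page.html",
    "safe": 0.93,
}
request = build_solr_update("http://localhost:8983/solr", "images", [record])
```

Passing `request` to `urllib.request.urlopen` would send it to a running Solr server.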

3. Solr index – document

Image Search - architecture

Arquivo Web App

Image Search API

Solr Servers

arquivo.pt/api

Image Search API

Image Search API – q parameter

arquivo.pt/imagesearch?q=soccer

Search for images related to the word "soccer"
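A minimal sketch of calling the endpoint from Python. The `https` scheme follows the site URL given earlier, the response is returned as raw bytes because its format is not described here, and the function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

IMAGE_SEARCH_ENDPOINT = "https://arquivo.pt/imagesearch"  # path as shown on the slide


def image_search_url(query: str) -> str:
    """Build the search URL for a query string, URL-encoding it as needed."""
    query_string = urlencode({"q": query})
    return f"{IMAGE_SEARCH_ENDPOINT}?{query_string}"


def fetch_image_search(query: str, timeout: float = 10.0) -> bytes:
    """Send the request and return the raw response body (needs network access)."""
    with urlopen(image_search_url(query), timeout=timeout) as response:
        return response.read()
```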

Image Search API – response header

Image Search API – response item

Image Search API – type parameter

Search for images related to the words "Euro 2004", in PNG image format.

http://arquivo.pt/imagesearch?q=Euro 2004&type=png

Image Search API – size parameter

Search for small images related to the words "Euro 2004".

http://arquivo.pt/imagesearch?q=Euro 2004&size=sm

Other values: size=md (medium image size), size=lg (large image)
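The `type` and `size` filters combine with `q` as ordinary query parameters. The sketch below URL-encodes the space in "Euro 2004" and checks `size` against the three values named on the slide. The function name is hypothetical, and `type` is passed through unchecked because only `png` is shown.

```python
from typing import Optional
from urllib.parse import urlencode

IMAGE_SEARCH_ENDPOINT = "https://arquivo.pt/imagesearch"  # path as shown on the slide
ALLOWED_SIZES = {"sm", "md", "lg"}                        # small, medium, large


def filtered_image_search_url(query: str,
                              image_type: Optional[str] = None,
                              size: Optional[str] = None) -> str:
    """Build a search URL with optional type and size filters."""
    params = {"q": query}
    if image_type is not None:
        params["type"] = image_type
    if size is not None:
        if size not in ALLOWED_SIZES:
            raise ValueError(f"size must be one of {sorted(ALLOWED_SIZES)}")
        params["size"] = size
    return f"{IMAGE_SEARCH_ENDPOINT}?{urlencode(params)}"
```

For example, `filtered_image_search_url("Euro 2004", size="sm")` yields the same query as the slide, with the space encoded as `+`.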

Future Work

Future work – scene classification

http://places2.csail.mit.edu/

Future work – mobile version

Thank you

Fernando Melo

Try our APIs and send us feedback: arquivo.pt/api