Writing a Search Engine. How hard could it be?

ANTHONY BROWN @BRUINBROWN93 ANTHONY@COMPOSITIONAL-IT.COM

ABOUT

ABOUT ME

▸ Consultant at Compositional IT

▸ F# dev for ~3 years now

▸ Interested in Big Data, IoT, Cloud and Distributed Systems

COMPOSITIONAL IT

FUNCTIONAL FIRST. CLOUD READY. @COMPOSITIONALIT

HOW HARD COULD IT BE?

Every software developer ever

INTRODUCTION

IT’S ONLY AN OPERATING SYSTEM, ALL IT DOES IS RUN PROGRAMS!

Everybody when Windows blue screens

INTRODUCTION

IT’S ONLY A MULTIPLAYER ONLINE VIDEO GAME!

Anybody playing a game when lag spikes hit

INTRODUCTION

IT’S ONLY 2 LINES OF JAVASCRIPT

Backend developer needing to make a small API change

INTRODUCTION

DUDE. HOLD MY BEER.

Drunk people 10 seconds before making a terrible mistake

SATURDAY MORNING. PLANS CANCELLED.

WHAT NEXT? HIT UP GOOGLE.

WHAT TO DO IN LONDON THIS WEEKEND?

WRITING A SEARCH ENGINE. HOW HARD COULD IT BE?

WRITING A SEARCH ENGINE WITH AZURE AND F# IN A WEEKEND.

BUT FIRST.

THIS WAS A WEEKEND PROJECT.

YOU SHOULD EXPECT: - HACKY CODE.

YOU SHOULD EXPECT: - DEMOS TO FAIL.

YOU SHOULD NOT EXPECT: - A DEEP DIVE INTO SEARCH ENGINE TECH.

SEARCH ENGINE BACKGROUND

CONSTRAINTS

▸ Not a priority

▸ Can’t cost more than £85 per month

▸ No operations investment

▸ Limit to the weekend

BACKGROUND

EVERYTHING I KNOW ABOUT HOW SEARCH ENGINES WORK

THE ANATOMY OF A LARGE-SCALE HYPERTEXTUAL WEB SEARCH ENGINE

SERGEY BRIN AND LARRY PAGE

IT’S 2016. THE WEB’S CHANGED. A LOT.

WHAT’S NEW? + SCALE

WHAT’S NEW? + USERS

WHAT’S NEW? + GLOBALISATION

WHAT’S NEW? + CLOUD

WHAT’S NEW? + PLATFORM AS A SERVICE

WHAT’S NEW? - INFRASTRUCTURE

WHAT’S NEW? - PERSONAL HOSTING

SEARCH ENGINE BACKGROUND

WHAT’S IMPORTANT?

▸ Search

▸ Scraping

▸ Page rank

SEARCH IMPLEMENTATION

HOW TO FIND A NEEDLE IN A HAYSTACK

▸ Take all of your documents

▸ Record all of the words which occur within each document

▸ Invert that index

▸ The result is a list of all words and the documents they appear in

▸ For a search query, find the documents which appear in the inverted index entry of every word in the query
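
The steps above can be sketched in a few lines. This is a minimal Python illustration rather than the talk's own F# code; the sample documents and the function names (`tokenize`, `build_inverted_index`, `search`) are assumptions made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings)


# Hypothetical sample documents, for illustration only.
documents = {
    "doc1": "Things to do in London this weekend",
    "doc2": "Weekend weather forecast for London",
    "doc3": "Things to do in Paris",
}
index = build_inverted_index(documents)
results = search(index, "london weekend")  # doc1 and doc2
```

A real engine would also handle stemming, stop words and relevance scoring, which is part of why the talk hands this job to a managed service.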

SOUNDS EASY RIGHT? I DON’T CARE ABOUT IT.

AZURE SEARCH

MANAGED SEARCH AS A SERVICE

AZURE SEARCH

WHAT DOES AZURE SEARCH GIVE US?

▸ Hosted Search as a Service

▸ HTTP API for indexing and retrieving documents

▸ Ability to scale out (more replicas, more indexes)

▸ Free basic tier
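
As a rough idea of what indexing over the HTTP API looks like, the Python sketch below only builds the request and does not send it. It follows the general shape of the Azure Search REST call for uploading documents (a `value` array of actions, an `api-key` header). The service name, index name, document fields and API version shown are placeholders and should be checked against the current documentation.

```python
import json


def build_index_request(service_name, index_name, api_key, documents,
                        api_version="2016-09-01"):
    """Build (url, headers, body) for uploading documents to an index."""
    url = (
        f"https://{service_name}.search.windows.net"
        f"/indexes/{index_name}/docs/index?api-version={api_version}"
    )
    headers = {"Content-Type": "application/json", "api-key": api_key}
    # Each document becomes an "upload" action in the batch.
    actions = [{"@search.action": "upload", **doc} for doc in documents]
    body = json.dumps({"value": actions})
    return url, headers, body


# Hypothetical service, index, key and document fields.
url, headers, body = build_index_request(
    "my-search-service",
    "pages",
    "<admin-key>",
    [{"id": "1", "url": "http://example.com/", "content": "Hello world"}],
)
```

The returned triple can be passed to any HTTP client as a POST request.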

AZURE SEARCH IN THE AZURE PORTAL.

BOOSTING DEMO.

WE HAVE SEARCH. WHAT NEXT?

INDEXING DATA

WHAT IS A CRAWLER

▸ Autonomously find every web page on the internet

▸ Pull the content from that web page and index it

▸ Read the links on that page and queue those links for crawling

▸ Recursively process until every page on the internet has been reached
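
The crawl loop can be sketched with an explicit queue instead of recursion, which also maps naturally onto a message queue. This Python sketch is illustrative only: `fetch_links` is a hypothetical callback standing in for "download the page, index it, return its links", and the tiny in-memory web replaces real HTTP requests.

```python
from collections import deque


def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were visited."""
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)  # a real crawler would fetch and index here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical in-memory "web": page -> links found on that page.
fake_web = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["a.example/", "c.example/"],
    "c.example/": [],
}
order = crawl(["a.example/"], lambda url: fake_web.get(url, []))
```

The `seen` set stops the crawler from looping forever on pages that link back to each other, and `max_pages` is a crude guard against the size of the real internet.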

THE PROBLEM? THE INTERNET’S PRETTY BIG.

AZURE SERVICE BUS

DISTRIBUTED MESSAGE QUEUES

INDEXING DATA

WHAT DOES AZURE SERVICE BUS GIVE US?

▸ Scalable durable queues and topics with guaranteed availability

▸ .NET APIs to communicate with the service bus

▸ Free basic tier

WORKING WITH A SERVICE BUS QUEUE.

WE NEED TO BE GOOD CITIZENS. WE DON’T WANT TO DDOS A SINGLE WEBSITE DURING CRAWLING.
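
One common way to stay polite is to enforce a minimum delay between requests to the same host. The Python sketch below is an assumption about how that could look, not the talk's implementation; the class name and the delay value are made up for illustration.

```python
from urllib.parse import urlparse


class PolitenessTracker:
    """Tracks when each host may next be fetched."""

    def __init__(self, min_delay_seconds=1.0):
        self.min_delay = min_delay_seconds
        self.next_allowed = {}  # host -> earliest time of the next fetch

    def wait_time(self, url, now):
        """Seconds to wait before fetching url, given the current time."""
        host = urlparse(url).netloc
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url, now):
        """Note that url was fetched at time `now`."""
        host = urlparse(url).netloc
        self.next_allowed[host] = now + self.min_delay


tracker = PolitenessTracker(min_delay_seconds=2.0)
tracker.record_fetch("http://example.com/a", now=10.0)
same_host_wait = tracker.wait_time("http://example.com/b", now=10.5)
other_host_wait = tracker.wait_time("http://other.example/", now=10.5)
```

In a queue-based crawler, a URL with a non-zero wait time could be put back on the queue for later rather than blocking the worker.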

SERVICE BUS PROVIDES SUPPORT FOR MESSAGE DE-DUPLICATION BASED ON THE MESSAGE ID, WHICH CAN BE DERIVED FROM THE CONTENT.
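
Service Bus duplicate detection matches on a message identifier within a time window, so one way to avoid queueing the same page twice is to derive that identifier deterministically from the URL. The Python sketch below shows the idea with an in-memory stand-in for the queue. `DedupQueue`, `normalize_url` and `message_id_for` are hypothetical names, and the normalisation rules are a simplifying assumption.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lower-case the scheme and host and drop the fragment."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(),
         parts.path or "/", parts.query, "")
    )


def message_id_for(url):
    """Deterministic id: the same page always maps to the same id."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()


class DedupQueue:
    """In-memory stand-in for a queue with duplicate detection."""

    def __init__(self):
        self.seen_ids = set()
        self.messages = []

    def send(self, url):
        """Enqueue url unless a message with the same id was already sent."""
        msg_id = message_id_for(url)
        if msg_id in self.seen_ids:
            return False
        self.seen_ids.add(msg_id)
        self.messages.append(normalize_url(url))
        return True


queue = DedupQueue()
first = queue.send("http://Example.com/page#top")
second = queue.send("http://example.com/page")  # same page, dropped
```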

WE DON’T WANT TO SCRAPE THROUGH EVERY WEB PAGE IN THE WORLD.

WE DON’T WANT TO INDEX: - GOOGLE SEARCH QUERIES

WE DON’T WANT TO INDEX: - PROTECTED CONTENT

WE DON’T WANT TO INDEX: - IRRELEVANT CONTENT

DEALING WITH THE ROBOTS.TXT FILE

WRITING BASIC PARSERS IN F#

BEING A WELL BEHAVED SCRAPER

WHAT IS ROBOTS.TXT?

▸ Text file standard for telling web scrapers what they may and may not scrape

▸ Voluntary: crawlers can ignore the robots.txt file

▸ Simple file stored at the root of the web server
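
To make the format concrete, here is a small Python sketch of a parser that understands only `User-agent` and `Disallow` lines. It is a simplified illustration, not the F# parser from the talk: `Allow`, wildcards and `Crawl-delay` are ignored, and the example file contents are hypothetical.

```python
def parse_robots(text):
    """Parse robots.txt text into {user_agent: [disallowed path prefixes]}."""
    rules = {}
    current_agents = []
    last_was_agent = False
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not last_was_agent:
                current_agents = []  # a new group of rules starts
            current_agents.append(value.lower())
            rules.setdefault(value.lower(), [])
            last_was_agent = True
        else:
            if field == "disallow" and value:
                for agent in current_agents:
                    rules[agent].append(value)
            last_was_agent = False
    return rules


def is_allowed(rules, user_agent, path):
    """True unless path starts with a prefix disallowed for this agent."""
    disallowed = rules.get(user_agent.lower(), rules.get("*", []))
    return not any(path.startswith(prefix) for prefix in disallowed)


example = """\
# Hypothetical robots.txt
User-agent: *
Disallow: /search
Disallow: /private/

User-agent: friendlybot
Disallow:
"""
rules = parse_robots(example)
```

With these rules, `is_allowed(rules, "otherbot", "/search?q=x")` is false, while `friendlybot` may fetch any path because its `Disallow` line is empty.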

AN EXAMPLE ROBOTS.TXT FILE.

SIMPLE PARSING WITH F#.

HTML AND INFORMATION RETRIEVAL

QUERYING HTML DOCUMENTS WITH HTML AGILITY PACK

WE HAVE AN HTML FILE. WE NEED THE CONTENT OUT OF IT.

INFORMATION RETRIEVAL FROM HTML DOCUMENTS

WORKING WITH THE HTML AGILITY PACK

▸ Provides a simple query layer over HTML documents

▸ Works with well formatted and poorly formatted HTML

▸ Provides XPath support over the document

▸ Allows for querying for individual properties and elements

EXTRACTING LINKS FROM AN HTML DOCUMENT

EXTRACTING ALL OF THE CONTENT FROM AN HTML DOCUMENT
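
The talk does this with the HTML Agility Pack from F#. As a language-neutral illustration of the same two tasks, the Python sketch below uses the standard library's tolerant `html.parser` to collect link targets and visible text. The class and function names are assumptions for this example, and skipping `<script>` and `<style>` content is a design choice rather than something taken from the slides.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageExtractor(HTMLParser):
    """Collects absolute link targets and visible text from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # >0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html, base_url):
    """Return (links, text) for an HTML string."""
    parser = PageExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)


# Hypothetical page, for illustration only.
page = (
    "<html><head><title>Hi</title><style>p{}</style></head><body>"
    '<p>Hello <a href="/about">about us</a></p>'
    '<a href="http://other.example/x">x</a>'
    "<script>var a=1;</script></body></html>"
)
links, text = extract(page, "http://site.example/index.html")
```

The links feed the crawl queue and the text is what gets sent to the search index.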

WE NOW HAVE A WEB SCRAPER. WE NEED TO RUN THE WEB SCRAPER.

AZURE WEBJOBS

SIMPLE HOSTING OF LONG RUNNING PROCESSES

AZURE WEB JOBS

WHAT ARE WEB JOBS?

▸ A means of hosting basic executables in the cloud

▸ Provides simplified deployment and monitoring

▸ Pricing per minute of usage

WE NOW HAVE A SEARCH ENGINE. KIND OF.

SEARCH IS A RECOMMENDATION PROBLEM.

HOW DO WE RECOMMEND CONTENT TO USERS?

PAGE RANK

FINDING THE MOST INFLUENTIAL SITES ON THE INTERNET

PAGE RANK

WHAT IS PAGE RANK?

▸ Stanford’s patented algorithm

▸ Helps you find the most influential websites on the internet

▸ Websites with lots of links to them are more influential
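
The intuition can be made concrete with a small power-iteration sketch in Python. This is one common normalised formulation of PageRank, not necessarily the variant used in the talk; the damping factor of 0.85, the iteration count and the four-page graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * ranks[page] / n
                for other in pages:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks


# Hypothetical link graph: a, b and d all link to c; c links back to a.
graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

Here `c` ends up with the highest rank because three pages link to it, and `a` outranks `b` and `d` because its single inbound link comes from the influential `c`.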

THE PROBLEM? THERE ARE LOTS OF WEBSITES ON THE INTERNET.

THERE ARE EVEN MORE LINKS BETWEEN WEBSITES.

WE HAVE A HUGE LINK GRAPH. WE NEED TO PROCESS THAT GRAPH.

BIG DATA PROCESSING WITH MBRACE AND CLOUDFLOWS.

WE HAVE A QUERY WHICH NEEDS TO RUN DAILY. WE NEED TO ORCHESTRATE IT.

AZURE FUNCTIONS + AZURE RESOURCE MANAGER

USING AZURE FUNCTIONS FOR DEVOPS

DEVOPS

WHAT IS AZURE RESOURCE MANAGER?

▸ Declarative way of describing Azure infrastructure

▸ REST APIs to deploy infrastructure template files

▸ APIs to see current deployment status

DEVOPS

WHAT IS AZURE FUNCTIONS?

▸ Lightweight scripting of Azure web jobs

▸ Allows for running scripts in response to certain events

▸ Billing based on number of function invocations

DEVOPS

USING AZURE FUNCTIONS FOR DEVOPS

▸ Set up a timer triggered Azure Function

▸ Deploy an MBrace cluster through Azure Resource Manager

▸ Send an event when the job completes

▸ Second Azure Function for deleting the MBrace cluster

AZURE FUNCTIONS AND AZURE RESOURCE MANAGER.

WE NOW HAVE EVERYTHING IN PLACE FOR A SEARCH ENGINE. NOBODY CAN ACCESS IT THOUGH.

AZURE FUNCTIONS

SERVERLESS WEB APIS WITH AZURE FUNCTIONS

AZURE FUNCTIONS CAN OPERATE ON HTTP REQUESTS.

NO LONG TERM HOSTING COSTS.

AZURE FUNCTIONS HTTP API DEMO.
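
Stripped of the hosting details, an HTTP search endpoint is a small function from query parameters to a JSON response. The Python sketch below shows that shape only. It does not use the Azure Functions programming model, and `handle_search_request` and `search_fn` are hypothetical names, with `search_fn` standing in for the call to the search service.

```python
import json


def handle_search_request(query_params, search_fn, max_results=10):
    """Turn query-string parameters into a (status, json_body) response."""
    query = (query_params.get("q") or "").strip()
    if not query:
        return 400, json.dumps({"error": "missing query parameter 'q'"})
    results = search_fn(query)[:max_results]
    return 200, json.dumps({"query": query, "results": results})


# Hypothetical search backend returning a fixed result list.
def fake_search(query):
    return [{"url": "http://example.com/", "title": "Example"}]


status, body = handle_search_request({"q": "london weekend"}, fake_search)
bad_status, bad_body = handle_search_request({}, fake_search)
```

Keeping the handler free of hosting concerns makes it easy to test locally before wiring it to an HTTP trigger.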

DONE. SEARCH ENGINE COMPLETE.

ARCHITECTURE DIAGRAM: HTTP API, AZURE SEARCH, LINK DATABASE, PAGERANK, CLUSTER ORCHESTRATOR, AZURE SERVICE BUS, INDEXER, PAGERANK IMPORTER, PAGERANK SCORE STORE

PLENTY OF ROOM FOR IMPROVEMENTS.

CACHING SEARCH QUERIES.

QUERY AUTO COMPLETE.

SEARCH A GIVEN DOMAIN.

MULTIPLE LANGUAGE SUPPORT.

SUPPORT FOR OTHER DOCUMENT TYPES.

BETTER INFORMATION RETRIEVAL ALGORITHMS.

WHAT’S NEXT FOR IT? NOTHING.

PRODUCTISING A GOOGLE COMPETITOR IS BASICALLY IMPOSSIBLE.

IN SUMMARY

WRAPPING UP & KEY TAKEAWAYS

AZURE + F# = <3

AZURE MAKES HARD INFRASTRUCTURE PROBLEMS SIMPLE.

F# MAKES HARD SOFTWARE PROBLEMS SIMPLE.

TOGETHER THEY MAKE HARD PROBLEMS SIMPLE.

IT’S NOT GOOGLE. BUT IT TOOK 1 DEV 2 DAYS.

CLOUD IS THE EPITOME OF BUSINESS AGILITY

COMPOSITIONAL IT

ANTHONY@COMPOSITIONAL-IT.COM FUNCTIONAL FIRST. CLOUD READY.

Q&A.
