SMI 2018 :: web scraping, part I. Augusto Fadel, November 6, 2018


Part I

· R and RStudio
· Web data and web scraping
· Collecting data from the web
  - download
  - API
  - web scraping


Web Scraping


web data

· Download
· APIs (Application Programming Interfaces)
· Web scraping
· Crawler


Download


File types

· csv (comma-separated values)
· tsv (tab-separated values)
· MS Excel (xls, xlsx)
· zip
· JSON (JavaScript Object Notation)
· XML (Extensible Markup Language)
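As a minimal sketch of how some of these formats can be read in R, the block below parses small in-memory samples. The sample data and variable names are made up for illustration, and the JSON step assumes the jsonlite package is installed.

```r
# Illustrative samples: the same two-row table as csv and as tsv.
csv_text <- "id,name\n1,alpha\n2,beta"
tsv_text <- "id\tname\n1\talpha\n2\tbeta"

# Base R readers; `text =` parses a string instead of a file path.
csv_df <- read.csv(text = csv_text, stringsAsFactors = FALSE)
tsv_df <- read.delim(text = tsv_text, stringsAsFactors = FALSE)

# Both formats carry the same table, so the data frames should match.
same_table <- identical(csv_df, tsv_df)

# JSON needs an add-on package (assumption: jsonlite is available).
if (requireNamespace("jsonlite", quietly = TRUE)) {
  json_df <- jsonlite::fromJSON('[{"id": 1, "name": "alpha"}]')
}

# For remote files, download.file(url, destfile) saves a local copy first,
# and unzip(zipfile) extracts a zip archive before reading its contents.
```

In practice the `text =` argument would be replaced by the path of the downloaded file.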


APIs


HTTP requests

· GET
· POST
· DELETE
· HEAD
· others
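The request types above map directly onto functions of the httr package. The following is a sketch under assumptions, not a reference implementation: the helper names (`fetch_items`, `create_item`, `remove_item`), the `page` query parameter, and any URL passed to them are hypothetical, and the functions are only defined here, not called.

```r
# Assumption: the httr package is installed. Helper names are hypothetical.

fetch_items <- function(url) {
  # GET retrieves a resource; query parameters go in `query`.
  resp <- httr::GET(url, query = list(page = 1))
  # HEAD asks for the headers only, without the response body.
  head_resp <- httr::HEAD(url)
  list(status = httr::status_code(resp), headers = httr::headers(head_resp))
}

create_item <- function(url, item) {
  # POST sends data to the server; here the body is encoded as JSON.
  httr::POST(url, body = item, encode = "json")
}

remove_item <- function(url) {
  # DELETE asks the server to remove the resource.
  httr::DELETE(url)
}
```

A call such as `fetch_items("https://example.com/api/items")` would need a real endpoint in place of the placeholder URL.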


Responses

HTTP response codes

· 2xx (e.g. 200): success!
· 3xx: redirection
· 4xx: client error
· 5xx: server error
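The code classes above depend only on the first digit, which a small base R helper can make explicit. The function name `status_class` is hypothetical; with httr, `httr::status_code()` returns the numeric code of a response that could be passed to it.

```r
# Map an HTTP status code to its class using the hundreds digit.
status_class <- function(code) {
  switch(as.character(code %/% 100),
    "1" = "informational",
    "2" = "success",
    "3" = "redirection",
    "4" = "client error",
    "5" = "server error",
    "other"  # anything outside 100-599
  )
}

status_class(200)  # "success"
status_class(301)  # "redirection"
status_class(404)  # "client error"
status_class(503)  # "server error"
```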


· The Star Wars API: swapi.co (rwars package)
· IBGE Automatic Retrieval System (Sistema IBGE de Recuperação Automática): SIDRA
· Banco Central do Brasil: Open Data Portal (Portal de Dados Abertos)


Web Scraping


HTML structure (tags)

<a href="http://www.ibge.gov.br/">IBGE</a>

· title: <title> ... </title>
· text paragraph: <p> ... </p>
· blocks: <div> ... </div>
· table: <table> ... </table>
· hyperlink (anchor): <a> ... </a>


Extracting with rvest:

<a href="http://www.ibge.gov.br/">IBGE</a>

· html_name()
· html_attr()
· html_text()
· html_table()
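Applied to the anchor tag from the slide, the four rvest functions can be sketched as follows. The small table is made up for illustration, and `html_node()` is used for selection (newer rvest versions also offer `html_element()`); rvest is assumed to be installed.

```r
library(rvest)

# Parse the anchor from the slide as a tiny HTML document.
page <- read_html('<a href="http://www.ibge.gov.br/">IBGE</a>')
link <- html_node(page, "a")

tag  <- html_name(link)          # tag name: "a"
href <- html_attr(link, "href")  # attribute value: "http://www.ibge.gov.br/"
txt  <- html_text(link)          # text between the tags: "IBGE"

# html_table() converts a <table> node into a data frame (illustrative table).
tbl_page <- read_html(
  "<table><tr><th>x</th><th>y</th></tr><tr><td>1</td><td>2</td></tr></table>"
)
tbl <- html_table(html_node(tbl_page, "table"))
```

On a real page, `read_html()` would receive the page URL instead of a literal string.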


Amazon: amazon.com.br


Selenium

· Selenium 2: WebDriver
· Standalone server (v3.9.1)
· RSelenium package

Running the server

· Using Docker
· Using RSelenium::rsDriver()
· Manually
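Two of the ways of running the server can be sketched in R as below. This is an illustration under assumptions: the helper names, the Firefox browser choice, and the port numbers are hypothetical, RSelenium is assumed to be installed, and the functions are only defined here, not called.

```r
# Option 1: let RSelenium start the server and open a browser.
start_with_rsdriver <- function(port = 4567L) {
  # rsDriver() starts a Selenium server and returns a list with
  # `server` and `client` components.
  RSelenium::rsDriver(browser = "firefox", port = port, verbose = FALSE)
}

# Option 2: connect to a server that is already running, for example a
# Docker container or a manually started standalone jar (assumed to be
# listening on localhost:4445).
connect_to_running_server <- function(port = 4445L) {
  remDr <- RSelenium::remoteDriver(
    remoteServerAddr = "localhost",
    port = port,
    browserName = "firefox"
  )
  remDr$open()
  remDr
}

# Clean up a session created with start_with_rsdriver().
stop_session <- function(rD) {
  rD$client$close()  # close the browser
  rD$server$stop()   # stop the Selenium server
}
```

With option 1, the browser client is available as `rD$client` on the returned object.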


RSelenium functions:

· navigate()
· goBack()
· goForward()
· refresh()
· findElement()
· highlightElement()
· clickElement()
· mouseMoveToLocation()
· click()
· sendKeysToElement()
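A typical session chains several of these functions. In the sketch below, `remDr` is assumed to be an open RSelenium client; the function name `search_site`, the URL arguments, and the CSS selectors are hypothetical placeholders.

```r
# Sketch: search a site, move through history, click a button, return the HTML.
search_site <- function(remDr, url, term) {
  remDr$navigate(url)

  # Locate the search box (hypothetical selector) and type the term.
  box <- remDr$findElement(using = "css selector", value = "input[name='q']")
  box$highlightElement()                            # flash it to check the selector
  box$sendKeysToElement(list(term, key = "enter"))  # type and press Enter
  Sys.sleep(1)                                      # give the page time to load

  # Browser history and reload.
  remDr$goBack()
  remDr$goForward()
  remDr$refresh()

  # Two ways to click: on the element itself, or by moving the mouse first.
  button <- remDr$findElement(using = "css selector", value = "button.next")
  button$clickElement()
  remDr$mouseMoveToLocation(webElement = button)
  remDr$click()

  # Page source as a single string, ready for rvest::read_html().
  remDr$getPageSource()[[1]]
}
```

The returned HTML can then be parsed with the rvest functions shown earlier in the deck.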


GOL Linhas Aéreas: voegol.com.br


Best practices

· Check and respect robots.txt.
· Identify yourself.
· Use httr::user_agent() with a contact e-mail.
· Respect the request limit (rate limiting).
· If there is no explicit limit, use common sense.
· Rule of thumb: wait one second between requests, using Sys.sleep(1).
· Prefer off-peak hours.
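These practices can be combined in a small helper. The sketch below is one possible arrangement, not a prescribed implementation: the function names `robots_url` and `polite_get` and the `fetch` argument are hypothetical, and the default fetcher assumes httr is installed.

```r
# Build the robots.txt address for a site so it can be inspected first.
robots_url <- function(site) {
  paste0(sub("/+$", "", site), "/robots.txt")
}

# Fetch several URLs, identifying the scraper and pausing between requests.
# `fetch` can be swapped out, e.g. for testing without network access.
polite_get <- function(urls, contact, delay = 1,
                       fetch = function(u) {
                         # Identify yourself with a contact e-mail.
                         httr::GET(u, httr::user_agent(contact))
                       }) {
  lapply(urls, function(u) {
    resp <- fetch(u)
    Sys.sleep(delay)  # rule of thumb from the slide: one second per request
    resp
  })
}

robots_url("https://www.ibge.gov.br/")  # "https://www.ibge.gov.br/robots.txt"
```

A call such as `polite_get(urls, contact = "name@example.com")` would then send one identified request per second.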


Thank you!

Augusto Fadel

DPE/CEEC/GCAD

21 2142-0452

[email protected]

augustofadel
