

curation-oriented web crawler for the social sciences

Hyphe

Mathieu Jacomy, Paul Girard, Benjamin Ooghe-Tabanou, Tommaso Venturini

Supported by Equipex DIME-SHS (ANR-10-EQPX-19-01)

Iterative curation: Import URLs, Define entities, Crawl, Prospect, Monitor corpus

DEMO: http://hyphe.medialab.sciences-po.fr/demo

The web is a field of investigation for the social sciences, and platform-based studies have long proven their relevance. However, the generic web is rarely studied in its own right, although it contains crucial aspects of the embodiment of social actors: personal blogs, institutional websites, hobby-specific media… We realized that some sociologists see existing web crawlers as “black boxes” unsuitable for research, even though they are willing to study the broad web.

Hyphe is a crawler developed with and for social scientists, with an innovative “curation-oriented” approach. It addresses web-mining problems through specific features such as iterative corpus building and a memory structure that lets researchers dynamically redefine the granularity of their “web entities”.
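To make the idea of redefinable granularity concrete, here is a minimal sketch in Python. It is not Hyphe's actual memory structure: it simply assumes that a web entity can be described by a URL prefix and that a page belongs to the entity with the longest matching prefix. The names `url_to_stems`, `WebEntityIndex`, `define` and `entity_for`, and the `example.org` URLs, are hypothetical and chosen for illustration only.

```python
from urllib.parse import urlsplit


def url_to_stems(url):
    """Break a URL into ordered stems, most generic first.

    The host is reversed, so 'blog.example.org' yields ('org', 'example', 'blog'),
    and the path segments follow.
    """
    parts = urlsplit(url)
    host = [label for label in (parts.hostname or "").split(".") if label]
    path = [segment for segment in parts.path.split("/") if segment]
    return tuple(reversed(host)) + tuple(path)


class WebEntityIndex:
    """Hypothetical index mapping stem prefixes to web-entity names."""

    def __init__(self):
        self.prefixes = {}  # stem tuple -> entity name

    def define(self, prefix_url, name):
        """Declare that every page under prefix_url belongs to entity `name`."""
        self.prefixes[url_to_stems(prefix_url)] = name

    def entity_for(self, url):
        """Return the entity with the longest matching prefix, or None."""
        stems = url_to_stems(url)
        for end in range(len(stems), 0, -1):
            name = self.prefixes.get(stems[:end])
            if name is not None:
                return name
        return None


index = WebEntityIndex()
index.define("http://example.org", "Example (whole site)")
page = "http://example.org/blogs/alice/post-1"
before = index.entity_for(page)

# Redefine the granularity: Alice's blog becomes an entity of its own.
index.define("http://example.org/blogs/alice", "Alice's blog")
after = index.entity_for(page)
other = index.entity_for("http://example.org/about")
```

In this sketch the same page is first attributed to the whole site and then, once a finer prefix is declared, to the blog alone, while the rest of the site keeps its original entity. Nothing has to be re-crawled to change the boundary, which is the property the paragraph above attributes to Hyphe's memory structure.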

Custom Web-entity Granularity

Hyphe’s Software Architecture

Example corpus: STEM CELLS COMMUNITY (Virginie Tournay, Sciences Po Paris CEVIPOF, EUCELLEX)

IN 490 · UND. 41 · DISC. 9864 · OUT 647
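The curation cycle named on the poster (Import URLs, Define entities, Crawl, Prospect, Monitor corpus) can be read as a loop in which crawling accepted entities reveals new ones that the researcher then sorts. The Python sketch below illustrates that reading under stated assumptions: the status names IN, UNDECIDED, DISCOVERED and OUT are our expansion of the poster's labels IN, UND., DISC. and OUT, and the link graph, the `crawl`, `curate` and `decide` functions, and the domain names are all invented for the example. It is not Hyphe's code or API.

```python
# Toy link graph: entity -> entities it links to (invented data).
LINKS = {
    "a.org": ["b.org", "c.org"],
    "b.org": ["c.org", "d.org"],
    "c.org": [],
    "d.org": ["a.org"],
}


def crawl(entity):
    """Stand-in for a real crawl: return the entities cited by `entity`."""
    return LINKS.get(entity, [])


def curate(seeds, decide, max_rounds=10):
    """Repeat crawl and prospect rounds until no accepted entity is left to crawl."""
    status = {seed: "IN" for seed in seeds}  # import URLs, define entities
    crawled = set()
    for _ in range(max_rounds):
        to_crawl = [e for e, s in status.items() if s == "IN" and e not in crawled]
        if not to_crawl:
            break
        for entity in to_crawl:  # crawl
            crawled.add(entity)
            for target in crawl(entity):
                # Entities seen for the first time start as DISCOVERED.
                status.setdefault(target, "DISCOVERED")
        for entity, current in list(status.items()):  # prospect
            if current == "DISCOVERED":
                status[entity] = decide(entity)
    return status  # the corpus to monitor


def decide(entity):
    """Hypothetical researcher judgement for each discovered entity."""
    return {"b.org": "IN", "c.org": "OUT"}.get(entity, "UNDECIDED")


result = curate(["a.org"], decide)
counts = {}
for value in result.values():
    counts[value] = counts.get(value, 0) + 1
```

Here the decision step is a plain function so the example can run unattended; in a curation-oriented workflow that judgement belongs to the researcher, and entities left unsorted would remain DISCOVERED, which would be consistent with the large DISC. count in the example corpus.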