Relational to Graph Importing Data into Neo4j June 2015 Michael Hunger [email protected] |@mesirii

Relational to Graph - Import


Page 1: Relational to Graph - Import

Relational to Graph: Importing Data into Neo4j

June 2015
Michael Hunger
[email protected] | @mesirii

Page 2: Relational to Graph - Import

Agenda

• Review Webinar Series
• Importing Data into Neo4j
• Getting Data from RDBMS
• Concrete Examples
• Demo
• Q&A

Page 3: Relational to Graph - Import

Webinar Review

Relational to Graph

Page 4: Relational to Graph - Import

Webinar Review – Relational to Graph

• Introduction and Overview
  • Introduction of Neo4j, Solving RDBMS Issues, Northwind Demo

• Modeling Concerns
  • Modeling in Graphs and RDBMS, Good Modeling Practices
  • Model first, incremental Modeling, Model Transformation (Rules)

• Import
  • Importing into Neo4j, Getting Data from RDBMS, Concrete Examples

• NEXT: Querying
  • SQL to Cypher, Comparison, Example Queries, Hard in SQL -> Easy and Fast in Cypher

Page 5: Relational to Graph - Import

Why are we doing this?

The Graph Advantage

Page 6: Relational to Graph - Import

Relational DBs Can’t Handle Relationships Well

• Cannot model or store data and relationships without complexity

• Performance degrades with number and levels of relationships, and database size

• Query complexity grows with need for JOINs

• Adding new types of data and relationships requires schema redesign, increasing time to market

… making traditional databases inappropriate when data relationships are valuable in real-time

• Slow development
• Poor performance
• Low scalability
• Hard to maintain

Page 7: Relational to Graph - Import

Unlocking Value from Your Data Relationships

• Model your data naturally as a graph of data and relationships

• Drive graph model from domain and use-cases

• Use relationship information in real-time to transform your business

• Add new relationships on the fly to adapt to your changing requirements

Page 8: Relational to Graph - Import

High Query Performance with a Native Graph DB

• Relationships are first-class citizens
• No need for joins, just follow pre-materialized relationships of nodes
• Query & data locality: navigate out from your starting points
• Only load what’s needed
• Aggregate and project results as you go
• Optimized disk and memory model for graphs

Page 9: Relational to Graph - Import

Importing into Neo4j

APIs, Tools, Tricks

Page 10: Relational to Graph - Import

Getting Data into Neo4j: CSV

Cypher-Based “LOAD CSV” Capability
• Transactional (ACID) writes
• Initial and incremental loads of up to 10 million nodes and relationships
• From HTTP and files
• Power of Cypher
  • Create and update graph structures
  • Data conversion, filtering, aggregation
  • Destructuring of input data
• Transaction size control
• Also via Neo4j-Shell

[Diagram: CSV input, scale marker 10M]

Page 11: Relational to Graph - Import

Getting Data into Neo4j: CSV

Command-Line Bulk Loader neo4j-import
• For initial database population
• Scales across CPUs and disk performance
• Efficient RAM usage
• Split- and compressed-file support
• For loads up to 10B+ records
• Up to 1M records per second

[Diagram: CSV input, scale marker 100B]

Page 12: Relational to Graph - Import

Getting Data into Neo4j: APIs

Custom Cypher-Based Loader
• Uses transactional Cypher HTTP endpoint
• Parameterized, batched, concurrent Cypher statements
• Any programming/script language with driver or plain HTTP requests
• Also for JSON and other formats
• Also available as JDBC driver

[Diagram: Any Data, three Program boxes, scale marker 10M]
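The batching idea on this slide can be sketched in Python. This is a minimal sketch under stated assumptions, not the loader shown in the talk: it only builds the JSON request bodies and leaves the HTTP POST to the caller. The endpoint path `/db/data/transaction/commit` and the `{rows}` parameter syntax are those of Neo4j 2.x (newer versions use `$rows`); the `Person` label, the property names and the helper names `batches` and `build_payloads` are hypothetical.

```python
import json
from itertools import islice

# Assumed statement: one parameterized Cypher statement reused for every batch.
# {rows} is the Neo4j 2.x parameter syntax; newer versions use $rows instead.
STATEMENT = "UNWIND {rows} AS row MERGE (p:Person {id: row.id}) SET p.name = row.name"


def batches(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def build_payloads(rows, batch_size=1000):
    """Yield one JSON request body per batch (one transaction per request)."""
    for chunk in batches(rows, batch_size):
        body = {"statements": [{"statement": STATEMENT,
                                "parameters": {"rows": chunk}}]}
        yield json.dumps(body)


if __name__ == "__main__":
    # Made-up input rows, only to show the batching behaviour.
    people = [{"id": i, "name": "person-%d" % i} for i in range(2500)]
    for payload in build_payloads(people, batch_size=1000):
        rows = json.loads(payload)["statements"][0]["parameters"]["rows"]
        print(len(rows), "rows in this request")
```

Each payload would be sent as the body of one POST request, so the batch size doubles as the transaction size. Several workers can send payloads concurrently as long as they do not update the same nodes.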

Page 13: Relational to Graph - Import

Getting Data into Neo4j: APIs

JVM Transactional Loader
• Use Neo4j’s Java API
• From any JVM language, concurrent
• Fine-grained TX management
• Create nodes and relationships directly
• Also possible as server extension
• Arbitrary data loading

[Diagram: Any Data, three Program boxes, scale marker 1B]

Page 14: Relational to Graph - Import

Getting Data into Neo4j: API

Bulk Loader API
• Used by neo4j-import tool
• Create streams of node and relationship data
• Id-groups, id-handling & generation, conversions
• Highly concurrent and memory efficient
• High-performance CSV parser, decorators

[Diagram: CSV input, scale marker 100B]

Page 15: Relational to Graph - Import

Import Performance: Some Numbers

• Cypher import: 10k-10M records
• Import 100K-100M records per second transactionally
• Bulk import tens of billions of records in a few hours

Page 16: Relational to Graph - Import

Import Performance: Hardware Requirements

• Fast disk: SSD or SSD RAID
• Many cores
• Medium amount of RAM (8-64 GB)
• Local data files, compressed to save space
• High-performance concurrent connection to the relational DB
• Linux and OS X work better than Windows (file-system handling)
• Disable virus scanners, check the disk scheduler

Page 17: Relational to Graph - Import

Accessing Relational Data

Dump, Connect, Extract

Page 18: Relational to Graph - Import

Accessing Relational Data

• Dump to CSV: all relational databases can dump query results and tables to CSV

• Access with DB driver: access the DB with JDBC/ODBC or another driver to pull out selected datasets

• Use built-in or external endpoints: some databases expose HTTP APIs or can be integrated (DataClips)

• Use ETL tools: existing ETL tools can read from relational databases and write to Neo4j, e.g. via JDBC
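The first two options (dump to CSV, access with a DB driver) can be combined in a few lines. The following is a minimal sketch, assuming Python's built-in `sqlite3` module as a stand-in for the real relational database; the tables `city` and `zip` and the helper `dump_query_to_csv` are hypothetical and only illustrate the idea of exporting a JOIN result with a header row.

```python
import csv
import io
import sqlite3


def dump_query_to_csv(conn, sql, out):
    """Run `sql` on a database connection and write the result as CSV with a header row."""
    cur = conn.cursor()
    cur.execute(sql)
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cur.description])  # column names become the header
    for row in cur:                                        # stream rows from the cursor
        writer.writerow(row)


if __name__ == "__main__":
    # In-memory stand-in for the source database (hypothetical schema).
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE zip  (code TEXT, city_id INTEGER REFERENCES city(id));
        INSERT INTO city VALUES (1, 'Gent');
        INSERT INTO city VALUES (2, 'Destelbergen');
        INSERT INTO zip  VALUES ('9000', 1);
        INSERT INTO zip  VALUES ('9070', 2);
    """)
    out = io.StringIO()
    dump_query_to_csv(
        conn,
        "SELECT c.name AS City, z.code AS Zip "
        "FROM city c JOIN zip z ON z.city_id = c.id ORDER BY z.code",
        out,
    )
    print(out.getvalue())
```

The column aliases in the query decide the header names of the CSV file, which are the names a later LOAD CSV WITH HEADERS step refers to.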

Page 19: Relational to Graph - Import

Importing Your Data

Examples

Page 20: Relational to Graph - Import

Import Demo

Cypher-Based “LOAD CSV” Capability
• Use to import address data

Command-Line Bulk Loader neo4j-import
• Chicago Crime Dataset

Relational Import Tool neo4j-rdbms-import
• Proof of Concept

[Diagram: JDBC + API, CSV]

Page 21: Relational to Graph - Import

LOAD CSV

Workhorse of Graph ETL

Page 22: Relational to Graph - Import

Data Quality – Beware of Real-World Data!

• Messy! Don’t trust the data
• Byte Order Mark
• Binary zeros, non-text characters
• Inconsistent line breaks
• Header inconsistent with data
• Special characters in non-quoted text
• Unexpected newlines in quoted and unquoted text fields
• Stray quotes
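Several of these defects can be detected or repaired before the import starts. The sketch below is one possible pre-check in Python, not a tool from the talk: `clean_csv_text` and `check_rows` are hypothetical helper names, UTF-8 input is assumed, and stray quotes or special characters still need a dedicated tool or manual inspection.

```python
import csv
import io


def clean_csv_text(raw: bytes) -> str:
    """Decode raw CSV bytes and remove common low-level defects."""
    text = raw.decode("utf-8-sig", errors="replace")       # utf-8-sig drops a leading BOM
    text = text.replace("\x00", "")                         # binary zeros
    text = text.replace("\r\n", "\n").replace("\r", "\n")   # unify line breaks
    return text


def check_rows(text: str):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    return [(reader.line_num, len(row)) for row in reader if len(row) != len(header)]


if __name__ == "__main__":
    # Deliberately broken sample: BOM, mixed line breaks, a binary zero, an extra field.
    raw = b"\xef\xbb\xbfCity,Zip\r\nGent,9000\r\nDestel\x00bergen,9070,extra\rGent,9000\n"
    text = clean_csv_text(raw)
    print(check_rows(text))
```

Note that unifying line breaks also rewrites line breaks inside quoted fields, which is usually acceptable for an import but changes the data slightly.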

Page 23: Relational to Graph - Import

CSV – Workhorse of Data Exchange

• Most Databases, ETL and Office-Tools can read and write CSV

• Format only loosely specified
• Problems with quotes, newlines, charsets

• Some good checking tools (CSVKit)

Page 24: Relational to Graph - Import

Address Dataset

• Exported as large JOIN between
  • City
  • Zip
  • Street
  • Number
  • Enterprise

• address.csv

EntityNumber  TypeOfAddress  Zipcode  MunicipalityNL  StreetNL             StreetFR             HouseNr
200.065.765   REGO           9070     Destelbergen    Dendermondesteenweg  Dendermondesteenweg  430
200.068.636   REGO           9000     Gent            Stropstraat          Stropstraat          1

Page 25: Relational to Graph - Import

LOAD CSV

// create constraints
CREATE CONSTRAINT ON (c:City) ASSERT c.name IS UNIQUE;
CREATE CONSTRAINT ON (z:Zip) ASSERT z.name IS UNIQUE;

// manage tx
USING PERIODIC COMMIT 50000
// load csv row by row
LOAD CSV WITH HEADERS FROM "file:address.csv" AS csv
// transform values (csv.City and csv.Zip must match column names in the file's header row)
WITH DISTINCT toUpper(csv.City) AS city, toUpper(csv.Zip) AS zip
// create nodes
MERGE (:City {name: city})
MERGE (:Zip {name: zip});

Page 26: Relational to Graph - Import

LOAD CSV

// manage tx
USING PERIODIC COMMIT 100000
// load csv row by row
LOAD CSV WITH HEADERS FROM "file:address.csv" AS csv
// transform values
WITH DISTINCT toUpper(csv.City) AS city, toUpper(csv.Zip) AS zip
// find nodes
MATCH (c:City {name: city}), (z:Zip {name: zip})
// create relationships
MERGE (c)-[:HAS_ZIP_CODE]->(z);

Page 27: Relational to Graph - Import

LOAD CSV Considerations

• Provide enough memory (heap & page-cache)
• Make sure your data is clean
• Create indexes and constraints upfront
• Use labels for matching
• DISTINCT, SKIP, LIMIT to control data volume
• Test with a small batch
• Use PERIODIC COMMIT for larger volumes (> 20k)
• Beware of the EAGER operation
  • Will pull in all your CSV data
  • Use EXPLAIN to detect it

Links: Simplest LOAD CSV Example | Guide Import CSV | RDBMS ETL Guide

Page 28: Relational to Graph - Import

Demo

Page 29: Relational to Graph - Import

Mass Data Bulk Importer

neo4j-import --into graph.db

Page 30: Relational to Graph - Import

Neo4j Bulk Import Tool

• Memory-efficient and scalable bulk inserter
• Proven to work well for billions of records
• Easy to use, no memory configuration needed

Reference Manual: Import Tool

Page 31: Relational to Graph - Import

Chicago Crime Dataset

• City of Chicago, Crime Data since 2001

• Go to website, download dataset
• Prepare dataset, cleanup
• Specify headers (direct or separate file)
• ID definition, data types, labels, rel-types
• Import (30-50s)
• Use!

https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
http://markhneedham.com/blog?s=Chicago+Crime

Page 32: Relational to Graph - Import

Chicago Crime Dataset

• crimeTypes.csv
  • Types of crimes

• beats.csv
  • Police areas

• crimes.csv
  • Crime description

• crimesBeats.csv
  • In which beat did a crime happen

• crimesPrimaryTypes.csv
  • Primary type assignment

Page 33: Relational to Graph - Import

Chicago Crime Dataset

crimes.csv
    :ID(Crime),id,:LABEL,date,description
    8920441,8920441,Crime,12/07/2012 07:50:00 AM,AUTOMOBILE

primaryTypes.csv
    :ID(PrimaryType),crimeType
    ARSON,ARSON

crimesPrimaryTypes.csv
    :START_ID(Crime),:END_ID(PrimaryType)
    5221115,NARCOTICS
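The "prepare dataset" step amounts to splitting flat records into node files and relationship files with the headers shown above. The following Python sketch illustrates that split for crimes and primary types only. It is an illustration under assumptions, not the script used for the demo: the function name, the record keys (`id`, `date`, `description`, `primary_type`) and the sample records are made up.

```python
import csv
import io


def write_import_files(records, crimes_out, types_out, rels_out):
    """Split flat records into two node files and one relationship file."""
    crimes = csv.writer(crimes_out, lineterminator="\n")
    types = csv.writer(types_out, lineterminator="\n")
    rels = csv.writer(rels_out, lineterminator="\n")

    # Header rows in the format shown on the slide.
    crimes.writerow([":ID(Crime)", "id", ":LABEL", "date", "description"])
    types.writerow([":ID(PrimaryType)", "crimeType"])
    rels.writerow([":START_ID(Crime)", ":END_ID(PrimaryType)"])

    seen_types = set()
    for rec in records:
        crimes.writerow([rec["id"], rec["id"], "Crime", rec["date"], rec["description"]])
        if rec["primary_type"] not in seen_types:  # write each type node only once
            seen_types.add(rec["primary_type"])
            types.writerow([rec["primary_type"], rec["primary_type"]])
        rels.writerow([rec["id"], rec["primary_type"]])


if __name__ == "__main__":
    # Made-up sample records, not taken from the real dataset.
    records = [
        {"id": "1001", "date": "01/01/2015 10:00:00 AM", "description": "SAMPLE A", "primary_type": "ARSON"},
        {"id": "1002", "date": "01/02/2015 11:30:00 AM", "description": "SAMPLE B", "primary_type": "ARSON"},
    ]
    crimes_out, types_out, rels_out = io.StringIO(), io.StringIO(), io.StringIO()
    write_import_files(records, crimes_out, types_out, rels_out)
    print(crimes_out.getvalue())
    print(types_out.getvalue())
    print(rels_out.getvalue())
```

In a real run the three outputs would be files on local disk. The ID groups in parentheses (`Crime`, `PrimaryType`) keep identifiers from different node files apart, so the relationship file can refer to both unambiguously.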

Page 34: Relational to Graph - Import

Chicago Crime Dataset

./neo/bin/neo4j-import --into crimes.db \
    --nodes:CrimeType primaryTypes.csv \
    --nodes beats.csv \
    --nodes crimes_header.csv,crimes.csv \
    --relationships:CRIME_TYPE crimesPrimaryTypes.csv \
    --relationships crimesBeats.csv

Page 35: Relational to Graph - Import

Demo

Page 36: Relational to Graph - Import

Neo4j-RDBMS-Importer

Proof of Concept

Page 37: Relational to Graph - Import

Recap – Transformation Rules

Page 38: Relational to Graph - Import

Normalized ER-Models: Transformation Rules

• Tables become nodes
  • Table name as node label
  • Columns turn into properties
  • Convert values if needed
• Foreign keys (1:1, 1:n, n:1) turn into relationships, column name into relationship type (or better, a verb)
• JOIN tables represent relationships
  • Also other tables without domain identity (w/o PK) and two FKs
  • Columns turn into relationship properties
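The rules above can be written down as a small mapping function. This Python sketch is a simplified reading of the rules, not the logic of the neo4j-rdbms-import tool: the schema dictionary layout, the function name `table_to_graph` and the example tables are hypothetical, and value conversion is left out.

```python
def table_to_graph(schema):
    """Map a relational schema description to node and relationship definitions.

    `schema` maps table name -> {"pk": column or None,
                                 "columns": [...],
                                 "fks": {column: referenced_table}}.
    """
    nodes, rels = [], []
    for table, info in schema.items():
        fks = info.get("fks", {})
        props = [c for c in info["columns"] if c not in fks]  # FK columns are not properties
        if not info.get("pk") and len(fks) == 2:
            # JOIN table: becomes a relationship between the two referenced tables,
            # remaining columns become relationship properties.
            (_, start), (_, end) = fks.items()
            rels.append({"type": table.upper(), "start": start, "end": end, "properties": props})
            continue
        # Entity table: becomes a node label, columns become properties.
        nodes.append({"label": table, "properties": props})
        for column, target in fks.items():
            # Default relationship type is derived from the column name;
            # a verb such as LIVES_IN usually reads better.
            rels.append({"type": column.upper(), "start": table, "end": target, "properties": []})
    return nodes, rels


if __name__ == "__main__":
    # Hypothetical schema for illustration only.
    schema = {
        "Person": {"pk": "id", "columns": ["id", "name", "city_id"], "fks": {"city_id": "City"}},
        "City": {"pk": "id", "columns": ["id", "name"], "fks": {}},
        "Friendship": {"pk": None, "columns": ["person_a", "person_b", "since"],
                       "fks": {"person_a": "Person", "person_b": "Person"}},
    }
    nodes, rels = table_to_graph(schema)
    for node in nodes:
        print(node)
    for rel in rels:
        print(rel)
```

Keeping the mapping as data (lists of node and relationship definitions) makes it easy to review the proposed graph model, and to override single entries, before any import is run.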

Page 39: Relational to Graph - Import

Normalized ER-Models: Cleanup Rules

• Remove technical IDs (auto-incrementing PKs)
• Keep domain IDs (e.g. ISBN)
  • Add constraints for those
• Add indexes for lookup fields
• Adjust names for Label, REL_TYPE and propertyName

Note: currently no composite constraints and indexes
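The naming rule (Label, REL_TYPE, propertyName) and the constraint rule lend themselves to small helpers. The Python sketch below shows one way to do this. The helper names are hypothetical, and the generated statement follows the constraint syntax used on the LOAD CSV slide, which may differ in other Neo4j versions.

```python
import re


def words(name):
    """Split snake_case, kebab-case or camelCase names into lowercase words."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return [w.lower() for w in re.split(r"[\s_\-]+", spaced) if w]


def label_name(name):
    """Table name to node label, e.g. order_items -> OrderItems."""
    return "".join(w.capitalize() for w in words(name))


def rel_type_name(name):
    """Column or table name to relationship type, e.g. placedBy -> PLACED_BY."""
    return "_".join(w.upper() for w in words(name))


def property_name(name):
    """Column name to property name, e.g. unit_price -> unitPrice."""
    first, *rest = words(name)
    return first + "".join(w.capitalize() for w in rest)


def unique_constraint(label, prop):
    """Uniqueness constraint statement for a domain ID."""
    return "CREATE CONSTRAINT ON (n:%s) ASSERT n.%s IS UNIQUE;" % (label, prop)


if __name__ == "__main__":
    print(label_name("order_items"))
    print(rel_type_name("placedBy"))
    print(property_name("unit_price"))
    print(unique_constraint(label_name("book"), property_name("ISBN")))
```

Technical IDs would simply be left out of the property list, while each domain ID that is kept gets one generated constraint.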

Page 40: Relational to Graph - Import

RDBMS Import Tool Demo – Proof of Concept

• JDBC for vendor-independent database connection
• SchemaCrawler to extract DB metadata
• Use rules to drive graph model import
• Optional means to override default behavior
• Scales writes with Parallel Batch Importer API
• Reads tables concurrently for nodes & relationships

Demo: MySQL - Employee Demo Database

Source: github.com/jexp/neo4j-rdbms-import | Blog Post

[Diagram: Postgres, MySQL, Oracle]

Page 41: Relational to Graph - Import

Demo

Page 42: Relational to Graph - Import

Architecture & Integration
“Polyglot Persistence”

Page 43: Relational to Graph - Import

Three Ways to Migrate Data to Neo4j

• Migrate all data
• Migrate graph data
• Duplicate graph data

[Diagram: Application, Relational Database and Graph Database for each option, with data flows labeled "All data", "Graph data" and "Non-graph data"]

Page 44: Relational to Graph - Import

Neo4j Fits into Your Enterprise Environment

[Diagram labels: Application; End User; Graph Database Cluster (Neo4j, Neo4j, Neo4j); Data Storage and Business Rules Execution; Data Mining and Aggregation; Ad Hoc Analysis; Bulk Analytic Infrastructure; Graph Compute Engine; EDW; Data Scientist; Databases: Relational, NoSQL, Hadoop]

Page 45: Relational to Graph - Import

Next Steps
Community. Training. Support.

Page 46: Relational to Graph - Import

There Are Lots of Ways to Easily Learn Neo4j

Page 47: Relational to Graph - Import

Resources

Online
• Developer Site: neo4j.com/developer
  • RDBMS to Graph
  • Guide: ETL from RDBMS
  • Guide: CSV Import
• LOAD CSV Webinar
• Reference Manual
• StackOverflow

Offline
• In-Browser Guide “Northwind”
• Import Training Classes
• Office Hours
• Professional Services Workshop
• Free Books:
  • Graph Databases 2nd Edition
  • Learning Neo4j

Page 48: Relational to Graph - Import

Register today at graphconnect.com
Early Bird only $99

Page 49: Relational to Graph - Import

Relational to Graph: Data Import

Thank you! Questions?
neo4j.com | @neo4j