Neo4j: The Fastest and Most Scalable Native Graph Database
Agenda
• Review Webinar Series
• Importing Data into Neo4j
• Getting Data from RDBMS
• Concrete Examples
• Demo
• Q&A
Webinar Review
Relational to Graph
Webinar Review – Relational to Graph
• Introduction and Overview
  • Introduction of Neo4j, Solving RDBMS Issues, Northwind Demo
• Modeling Concerns
  • Modeling in Graphs and RDBMS, Good Modeling Practices
  • Model First, Incremental Modeling, Model Transformation (Rules)
• Import
  • Importing into Neo4j, Getting Data from RDBMS, Concrete Examples
• NEXT: Querying
  • SQL to Cypher, Comparison, Example Queries, Hard in SQL -> Easy and Fast in Cypher
Why are we doing this?
The Graph Advantage
Relational DBs Can’t Handle Relationships Well
• Cannot model or store data and relationships without complexity
• Performance degrades with number and levels of relationships, and database size
• Query complexity grows with the need for JOINs
• Adding new types of data and relationships requires schema redesign, increasing time to market
… making traditional databases inappropriate when data relationships are valuable in real-time
Slow development • Poor performance • Low scalability • Hard to maintain
Unlocking Value from Your Data Relationships
• Model your data naturally as a graph of data and relationships
• Drive graph model from domain and use-cases
• Use relationship information in real-time to transform your business
• Add new relationships on the fly to adapt to your changing requirements
High Query Performance with a Native Graph DB
• Relationships are first-class citizens
• No need for joins, just follow pre-materialized relationships of nodes
• Query and data locality: navigate out from your starting points
• Only load what's needed
• Aggregate and project results as you go
• Optimized disk and memory model for graphs
Importing into Neo4j
APIs, Tools, Tricks
Getting Data into Neo4j: CSV
Cypher-Based "LOAD CSV" Capability
• Transactional (ACID) writes
• Initial and incremental loads of up to 10 million nodes and relationships
• From HTTP and files
• Power of Cypher
  • Create and update graph structures
  • Data conversion, filtering, aggregation
  • Destructuring of input data
• Transaction size control
• Also available via neo4j-shell
(CSV files; up to ~10M records)
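A minimal LOAD CSV statement might look like this (the persons.csv file and its id/name columns are hypothetical; toInt is the Neo4j 2.x function name, newer versions use toInteger):

```cypher
// read the file row by row; each row becomes a map keyed by the header
LOAD CSV WITH HEADERS FROM "file:persons.csv" AS row
// convert the text id to an integer, create or update the node
MERGE (p:Person {id: toInt(row.id)})
SET p.name = row.name;
```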
Getting Data into Neo4j: CSV
Command-Line Bulk Loader neo4j-import
• For initial database population
• Scales across CPUs and disk performance
• Efficient RAM usage
• Split- and compressed-file support
• For loads up to 10B+ records
• Up to 1M records per second

(CSV files; up to ~100B records)
Getting Data into Neo4j: APIs
Custom Cypher-Based Loader
• Uses the transactional Cypher HTTP endpoint
• Parameterized, batched, concurrent Cypher statements
• Any programming/script language with a driver, or plain HTTP requests
• Also for JSON and other formats
• Also available as a JDBC driver
(Any data source, read by your own program; up to ~10M records)
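A batched, parameterized statement typically unwinds one list parameter per transaction; a sketch of such a statement (the rows parameter and the Person properties are illustrative; {param} is Neo4j 2.x syntax, newer versions use $param):

```cypher
// the driver sends e.g. {rows: [{id: 1, name: "Ann"}, {id: 2, name: "Bob"}]}
// as the parameters of one transactional request
UNWIND {rows} AS row
MERGE (p:Person {id: row.id})
SET p.name = row.name;
```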
Getting Data into Neo4j: APIs
JVM Transactional Loader
• Use Neo4j's Java API
• From any JVM language, concurrent
• Fine-grained TX management
• Create nodes and relationships directly
• Also possible as a server extension
• Arbitrary data loading
(Any data source, read by your own program; up to ~1B records)
Getting Data into Neo4j: API
Bulk Loader API
• Used by the neo4j-import tool
• Create streams of node and relationship data
• Id-groups, id-handling & generation, conversions
• Highly concurrent and memory efficient
• High-performance CSV parser, decorators

(CSV files; up to ~100B records)
Import Performance: Some Numbers
• Cypher import: 10k-10M records
• Import 100K-100M records per second transactionally
• Bulk import tens of billions of records in a few hours
Import Performance: Hardware Requirements
• Fast disk: SSD or SSD RAID
• Many cores
• Medium amount of RAM (8-64 GB)
• Local data files; compress to save space
• High-performance concurrent connection to the relational DB
• Linux and OS X work better than Windows (filesystem handling)
• Disable virus scanners; check the disk scheduler
Accessing Relational Data
Dump, Connect, Extract
Accessing Relational Data
• Dump to CSV: all relational databases have the option to dump query results and tables to CSV
• Access with a DB driver: access the DB with JDBC/ODBC or another driver to pull out selected datasets
• Use built-in or external endpoints: some databases expose HTTP APIs or can be integrated (DataClips)
• Use ETL tools: existing ETL tools can read from relational databases and write to Neo4j, e.g. via JDBC
Importing Your Data
Examples
Import Demo
Cypher-Based "LOAD CSV" Capability – used to import address data (CSV)
Command-Line Bulk Loader neo4j-import – Chicago Crime Dataset (CSV)
Relational Import Tool neo4j-rdbms-import – Proof of Concept (JDBC + API)
Powerhorse of Graph ETL
Data Quality – Beware of Real World Data !
• Messy! Don't trust the data
• Byte order mark
• Binary zeros, non-text characters
• Inconsistent line breaks
• Header inconsistent with data
• Special characters in non-quoted text
• Unexpected newlines in quoted and unquoted text fields
• Stray quotes
CSV – Powerhorse of Data Exchange
• Most databases, ETL and office tools can read and write CSV
• Format only loosely specified
  • Problems with quotes, newlines, charsets
• Some good checking tools (CSVKit)
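LOAD CSV itself also makes a handy checking tool: before a full import, inspect how a file is parsed (shown here against the address.csv used later in this deck):

```cypher
// look at the first rows as parsed maps
LOAD CSV WITH HEADERS FROM "file:address.csv" AS row
RETURN row LIMIT 5;

// count rows and empty values in a key column
LOAD CSV WITH HEADERS FROM "file:address.csv" AS row
RETURN count(*) AS rows,
       sum(CASE WHEN row.Zipcode IS NULL THEN 1 ELSE 0 END) AS missing_zip;
```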
Address Dataset
• Exported as a large JOIN between
  • City
  • Zip
  • Street
  • Number
  • Enterprise

• address.csv:

EntityNumber  TypeOfAddress  Zipcode  MunicipalityNL  StreetNL             StreetFR             HouseNr
200.065.765   REGO           9070     Destelbergen    Dendermondesteenweg  Dendermondesteenweg  430
200.068.636   REGO           9000     Gent            Stropstraat          Stropstraat          1
LOAD CSV

// create constraints
CREATE CONSTRAINT ON (c:City) ASSERT c.name IS UNIQUE;
CREATE CONSTRAINT ON (z:Zip) ASSERT z.name IS UNIQUE;
// manage tx
USING PERIODIC COMMIT 50000
// load csv row by row
LOAD CSV WITH HEADERS FROM "file:address.csv" AS csv
// transform values
WITH DISTINCT toUpper(csv.City) AS city, toUpper(csv.Zip) AS zip
// create nodes
MERGE (:City {name: city})
MERGE (:Zip {name: zip});
LOAD CSV

// manage tx
USING PERIODIC COMMIT 100000
// load csv row by row
LOAD CSV WITH HEADERS FROM "file:address.csv" AS csv
// transform values
WITH DISTINCT toUpper(csv.City) AS city, toUpper(csv.Zip) AS zip
// find nodes
MATCH (c:City {name: city}), (z:Zip {name: zip})
// create relationships
MERGE (c)-[:HAS_ZIP_CODE]->(z);
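After both passes, a quick sanity check that the nodes and relationships arrived:

```cypher
// count imported nodes and relationships
MATCH (c:City) RETURN count(c) AS cities;
MATCH (:City)-[r:HAS_ZIP_CODE]->(:Zip) RETURN count(r) AS has_zip_rels;
```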
LOAD CSV Considerations
• Provide enough memory (heap & page-cache)
• Make sure your data is clean
• Create indexes and constraints upfront
• Use labels for matching
• DISTINCT, SKIP, LIMIT to control data volume
• Test with a small batch
• Use PERIODIC COMMIT for larger volumes (> 20k)
• Beware of the Eager operation
  • Will pull in all your CSV data
  • Use EXPLAIN to detect it

Guides: Simplest LOAD CSV Example | Guide: Import CSV | RDBMS ETL Guide
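To detect the Eager operator before running an import, prefix the statement with EXPLAIN and look for Eager in the resulting plan (no data is modified):

```cypher
EXPLAIN
LOAD CSV WITH HEADERS FROM "file:address.csv" AS csv
MERGE (:City {name: csv.City})
MERGE (:Zip {name: csv.Zip});
// if the plan shows an Eager operator, split the statement into
// several single-purpose passes so PERIODIC COMMIT can batch properly
```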
Demo
Mass Data Bulk Importer
neo4j-import --into graph.db
Neo4j Bulk Import Tool
• Memory-efficient and scalable bulk inserter
• Proven to work well for billions of records
• Easy to use, no memory configuration needed
CSV
Reference Manual: Import Tool
Chicago Crime Dataset
• City of Chicago, Crime Data since 2001
• Go to website, download dataset
• Prepare dataset, cleanup
• Specify headers (direct or in a separate file)
• ID definition, data types, labels, rel-types
• Import (30-50s)
• Use!

https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
http://markhneedham.com/blog?s=Chicago+Crime
Chicago Crime Dataset
• crimeTypes.csv – types of crimes
• beats.csv – police areas
• crimes.csv – crime description
• crimesBeats.csv – in which beat a crime happened
• crimesPrimaryTypes.csv – primary type assignment
Chicago Crime Dataset
crimes.csv:
:ID(Crime),id,:LABEL,date,description
8920441,8920441,Crime,12/07/2012 07:50:00 AM,AUTOMOBILE

primaryTypes.csv:
:ID(PrimaryType),crimeType
ARSON,ARSON

crimesPrimaryTypes.csv:
:START_ID(Crime),:END_ID(PrimaryType)
5221115,NARCOTICS
Chicago Crime Dataset
./neo/bin/neo4j-import --into crimes.db \
  --nodes:CrimeType primaryTypes.csv \
  --nodes beats.csv \
  --nodes crimes_header.csv,crimes.csv \
  --relationships:CRIME_TYPE crimesPrimaryTypes.csv \
  --relationships crimesBeats.csv
Demo
Neo4j-RDBMS-Importer
Proof of Concept
Recap – Transformation Rules
Normalized ER-Models: Transformation Rules
• Tables become nodes
  • Table name as node label
• Columns turn into properties
  • Convert values if needed
• Foreign keys (1:1, 1:n, n:1) become relationships
  • Column name becomes the relationship type (or better: a verb)
• JOIN tables represent relationships
  • Likewise other tables without domain identity (w/o PK) and with two FKs
  • Columns turn into relationship properties
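As a sketch of the JOIN-table rule, assume a hypothetical order_items.csv exported from an ORDER_ITEMS join table with columns order_id, product_id and quantity (toInt is the Neo4j 2.x function name; newer versions use toInteger):

```cypher
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:order_items.csv" AS row
// the two foreign keys locate the nodes created from the parent tables
MATCH (o:Order   {id: toInt(row.order_id)})
MATCH (p:Product {id: toInt(row.product_id)})
// the join table becomes a relationship; its extra column becomes a property
MERGE (o)-[rel:CONTAINS]->(p)
SET rel.quantity = toInt(row.quantity);
```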
Normalized ER-Models: Cleanup Rules
• Remove technical IDs (auto-incrementing PKs)
• Keep domain IDs (e.g. ISBN)
  • Add constraints for those
• Add indexes for lookup fields
• Adjust names for Label, REL_TYPE and propertyName

Note: currently no composite constraints and indexes
RDBMS Import Tool Demo – Proof of Concept
• JDBC for vendor-independent database connection
• SchemaCrawler to extract DB meta-data
• Uses rules to drive graph model import
• Optional means to override default behavior
• Scales writes with the Parallel Batch Importer API
• Reads tables concurrently for nodes & relationships
Demo: MySQL - Employee Demo Database
Source: github.com/jexp/neo4j-rdbms-import (Blog Post)
Works with Postgres, MySQL, Oracle
Demo
Architecture & Integration
"Polyglot Persistence"
Three Ways to Migrate Data to Neo4j

[Diagram: an application backed by a relational database and a graph database; three options are shown: migrate all data, migrate only the graph data, or duplicate the graph data in both stores. Labels: non-graph data, graph data, all data; data storage and business rules execution; data mining and aggregation.]
Neo4j Fits into Your Enterprise Environment
[Diagram: applications talk to a graph database cluster (Neo4j x3); a data scientist runs ad-hoc analysis; bulk analytic infrastructure (graph compute engine, EDW, …) and existing databases (relational, NoSQL, Hadoop) feed the cluster; end users access the applications.]
Next Steps
Community. Training. Support.
There Are Lots of Ways to Easily Learn Neo4j
Resources
Online
• Developer Site: neo4j.com/developer
  • RDBMS to Graph
  • Guide: ETL from RDBMS
  • Guide: CSV Import
• LOAD CSV Webinar
• Reference Manual
• StackOverflow

Offline
• In-Browser Guide "Northwind"
• Import Training Classes
• Office Hours
• Professional Services Workshop
• Free Books:
  • Graph Databases, 2nd Edition
  • Learning Neo4j
Register today at graphconnect.com. Early Bird only $99.
Relational to Graph: Data Import
Thank you! Questions?
neo4j.com | @neo4j