Neo4j: The Fastest and Most Scalable Native Graph Database
Agenda
• Review Webinar Series
• Importing Data into Neo4j
• Getting Data from RDBMS
• Concrete Examples
• Demo
• Q&A
Webinar Review
Relational to Graph
Webinar Review – Relational to Graph
• Introduction and Overview
  • Introduction of Neo4j, Solving RDBMS Issues, Northwind Demo
• Modeling Concerns
  • Modeling in Graphs and RDBMS, Good Modeling Practices
  • Model First, Incremental Modeling, Model Transformation (Rules)
• Import
  • Importing into Neo4j, Getting Data from RDBMS, Concrete Examples
• NEXT: Querying
  • SQL to Cypher, Comparison, Example Queries, Hard in SQL -> Easy and Fast in Cypher
Why are we doing this?
The Graph Advantage
Relational DBs Can’t Handle Relationships Well
• Cannot model or store data and relationships without complexity
• Performance degrades with number and levels of relationships, and database size
• Query complexity grows with the need for JOINs
• Adding new types of data and relationships requires schema redesign, increasing time to market
… making traditional databases inappropriate when data relationships are valuable in real-time
Slow development • Poor performance • Low scalability • Hard to maintain
Unlocking Value from Your Data Relationships
• Model your data naturally as a graph of data and relationships
• Drive graph model from domain and use-cases
• Use relationship information in real-time to transform your business
• Add new relationships on the fly to adapt to your changing requirements
High Query Performance with a Native Graph DB
• Relationships are first-class citizens
• No need for joins, just follow pre-materialized relationships of nodes
• Query and data locality: navigate out from your starting points
• Only load what's needed
• Aggregate and project results as you go
• Optimized disk and memory model for graphs
Importing into Neo4j
APIs, Tools, Tricks
Getting Data into Neo4j: CSV
Cypher-Based "LOAD CSV" Capability
• Transactional (ACID) writes
• Initial and incremental loads of up to 10 million nodes and relationships
• From HTTP and files
• Power of Cypher
  • Create and update graph structures
  • Data conversion, filtering, aggregation
  • Destructuring of input data
• Transaction size control
• Also available via neo4j-shell
(CSV files; up to ~10M records)
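A minimal LOAD CSV statement might look like this (the persons.csv file and its id/name columns are hypothetical; toInt is the Neo4j 2.x function name, newer versions use toInteger):

```cypher
// read the file row by row; each row becomes a map keyed by the header
LOAD CSV WITH HEADERS FROM "file:persons.csv" AS row
// convert the text id to an integer, create or update the node
MERGE (p:Person {id: toInt(row.id)})
SET p.name = row.name;
```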
Getting Data into Neo4j: CSV
Command-Line Bulk Loader neo4j-import
• For initial database population
• Scales across CPUs and disk performance
• Efficient RAM usage
• Split- and compressed-file support
• For loads up to 10B+ records
• Up to 1M records per second

(CSV files; up to ~100B records)
Getting Data into Neo4j: APIs
Custom Cypher-Based Loader
• Uses the transactional Cypher HTTP endpoint
• Parameterized, batched, concurrent Cypher statements
• Any programming/script language with a driver, or plain HTTP requests
• Also for JSON and other formats
• Also available as a JDBC driver
(Any data source, read by your own program; up to ~10M records)
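A batched, parameterized statement typically unwinds one list parameter per transaction; a sketch of such a statement (the rows parameter and the Person properties are illustrative; {param} is Neo4j 2.x syntax, newer versions use $param):

```cypher
// the driver sends e.g. {rows: [{id: 1, name: "Ann"}, {id: 2, name: "Bob"}]}
// as the parameters of one transactional request
UNWIND {rows} AS row
MERGE (p:Person {id: row.id})
SET p.name = row.name;
```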
Getting Data into Neo4j: APIs
JVM Transactional Loader
• Use Neo4j's Java API
• From any JVM language, concurrent
• Fine-grained TX management
• Create nodes and relationships directly
• Also possible as a server extension
• Arbitrary data loading
(Any data source, read by your own program; up to ~1B records)
Getting Data into Neo4j: API
Bulk Loader API
• Used by the neo4j-import tool
• Create streams of node and relationship data
• Id-groups, id-handling & generation, conversions
• Highly concurrent and memory efficient
• High-performance CSV parser, decorators

(CSV files; up to ~100B records)
Import Performance: Some Numbers
• Cypher import: 10k-10M records
• Import 100K-100M records per second transactionally
• Bulk import tens of billions of records in a few hours
Import Performance: Hardware Requirements
• Fast disk: SSD or SSD RAID
• Many cores
• Medium amount of RAM (8-64 GB)
• Local data files; compress to save space
• High-performance concurrent connection to the relational DB
• Linux and OS X work better than Windows (filesystem handling)
• Disable virus scanners; check the disk scheduler
Accessing Relational Data
Dump, Connect, Extract
Accessing Relational Data
• Dump to CSV: all relational databases have the option to dump query results and tables to CSV
• Access with a DB driver: access the DB with JDBC/ODBC or another driver to pull out selected datasets
• Use built-in or external endpoints: some databases expose HTTP APIs or can be integrated (DataClips)
• Use ETL tools: existing ETL tools can read from relational databases and write to Neo4j, e.g. via JDBC
Importing Your Data
Examples
Import Demo
Cypher-Based "LOAD CSV" Capability – used to import address data (CSV)
Command-Line Bulk Loader neo4j-import – Chicago Crime Dataset (CSV)
Relational Import Tool neo4j-rdbms-import – Proof of Concept (JDBC + API)
Powerhorse of Graph ETL
Data Quality – Beware of Real World Data !
• Messy! Don't trust the data
• Byte order mark
• Binary zeros, non-text characters
• Inconsistent line breaks
• Header inconsistent with data
• Special characters in non-quoted text
• Unexpected newlines in quoted and unquoted text fields
• Stray quotes
CSV – Powerhorse of Data Exchange
• Most databases, ETL and office tools can read and write CSV
• Format only loosely specified
  • Problems with quotes, newlines, charsets
• Some good checking tools (CSVKit)
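LOAD CSV itself also makes a handy checking tool: before a full import, inspect how a file is parsed (shown here against the address.csv used later in this deck):

```cypher
// look at the first rows as parsed maps
LOAD CSV WITH HEADERS FROM "file:address.csv" AS row
RETURN row LIMIT 5;

// count rows and empty values in a key column
LOAD CSV WITH HEADERS FROM "file:address.csv" AS row
RETURN count(*) AS rows,
       sum(CASE WHEN row.Zipcode IS NULL THEN 1 ELSE 0 END) AS missing_zip;
```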
Address Dataset
• Exported as a large JOIN between
  • City
  • Zip
  • Street
  • Number
  • Enterprise

• address.csv:

EntityNumber  TypeOfAddress  Zipcode  MunicipalityNL  StreetNL             StreetFR             HouseNr
200.065.765   REGO           9070     Destelbergen    Dendermondesteenweg  Dendermondesteenweg  430
200.068.636   REGO           9000     Gent            Stropstraat          Stropstraat          1
LOAD CSV

// create constraints
CREATE CONSTRAINT ON (c:City) ASSERT c.name IS UNIQUE;
CREATE CONSTRAINT ON (z:Zip) ASSERT z.name IS UNIQUE;
// manage tx
USING PERIODIC COMMIT 50000
// load csv row by row
LOAD CSV WITH HEADERS FROM "file:address.csv" AS csv
// transform values
WITH DISTINCT toUpper(csv.City) AS city, toUpper(csv.Zip) AS zip
// create nodes
MERGE (:City {name: city})
MERGE (:Zip {name: zip});
LOAD CSV

// manage tx
USING PERIODIC COMMIT 100000
// load csv row by row
LOAD CSV WITH HEADERS FROM "file:address.csv" AS csv
// transform values
WITH DISTINCT toUpper(csv.City) AS city, toUpper(csv.Zip) AS zip
// find nodes
MATCH (c:City {name: city}), (z:Zip {name: zip})
// create relationships
MERGE (c)-[:HAS_ZIP_CODE]->(z);
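After both passes, a quick sanity check that the nodes and relationships arrived:

```cypher
// count imported nodes and relationships
MATCH (c:City) RETURN count(c) AS cities;
MATCH (:City)-[r:HAS_ZIP_CODE]->(:Zip) RETURN count(r) AS has_zip_rels;
```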
LOAD CSV Considerations
• Provide enough memory (heap & page-cache)
• Make sure your data is clean
• Create indexes and constraints upfront
• Use labels for matching
• DISTINCT, SKIP, LIMIT to control data volume
• Test with a small batch
• Use PERIODIC COMMIT for larger volumes (> 20k)
• Beware of the Eager operation
  • Will pull in all your CSV data
  • Use EXPLAIN to detect it

Guides: Simplest LOAD CSV Example | Guide: Import CSV | RDBMS ETL Guide
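To detect the Eager operator before running an import, prefix the statement with EXPLAIN and look for Eager in the resulting plan (no data is modified):

```cypher
EXPLAIN
LOAD CSV WITH HEADERS FROM "file:address.csv" AS csv
MERGE (:City {name: csv.City})
MERGE (:Zip {name: csv.Zip});
// if the plan shows an Eager operator, split the statement into
// several single-purpose passes so PERIODIC COMMIT can batch properly
```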
Demo
Mass Data Bulk Importer
neo4j-import --into graph.db
Neo4j Bulk Import Tool
• Memory-efficient and scalable bulk inserter
• Proven to work well for billions of records
• Easy to use, no memory configuration needed
CSV
Reference Manual: Import Tool
Chicago Crime Dataset
• City of Chicago, Crime Data since 2001
• Go to website, download dataset
• Prepare dataset, cleanup
• Specify headers (direct or in a separate file)
• ID definition, data types, labels, rel-types
• Import (30-50s)
• Use!

https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
http://markhneedham.com/blog?s=Chicago+Crime
Chicago Crime Dataset
• crimeTypes.csv – types of crimes
• beats.csv – police areas
• crimes.csv – crime description
• crimesBeats.csv – in which beat a crime happened
• crimesPrimaryTypes.csv – primary type assignment
Chicago Crime Dataset
crimes.csv:
:ID(Crime),id,:LABEL,date,description
8920441,8920441,Crime,12/07/2012 07:50:00 AM,AUTOMOBILE

primaryTypes.csv:
:ID(PrimaryType),crimeType
ARSON,ARSON

crimesPrimaryTypes.csv:
:START_ID(Crime),:END_ID(PrimaryType)
5221115,NARCOTICS
Chicago Crime Dataset
./neo/bin/neo4j-import --into crimes.db \
  --nodes:CrimeType primaryTypes.csv \
  --nodes beats.csv \
  --nodes crimes_header.csv,crimes.csv \
  --relationships:CRIME_TYPE crimesPrimaryTypes.csv \
  --relationships crimesBeats.csv
Demo
Neo4j-RDBMS-Importer
Proof of Concept
Recap – Transformation Rules
Normalized ER-Models: Transformation Rules
• Tables become nodes
  • Table name as node label
• Columns turn into properties
  • Convert values if needed
• Foreign keys (1:1, 1:n, n:1) become relationships
  • Column name becomes the relationship type (or better: a verb)
• JOIN tables represent relationships
  • Likewise other tables without domain identity (w/o PK) and with two FKs
  • Columns turn into relationship properties
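As a sketch of the JOIN-table rule, assume a hypothetical order_items.csv exported from an ORDER_ITEMS join table with columns order_id, product_id and quantity (toInt is the Neo4j 2.x function name; newer versions use toInteger):

```cypher
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:order_items.csv" AS row
// the two foreign keys locate the nodes created from the parent tables
MATCH (o:Order   {id: toInt(row.order_id)})
MATCH (p:Product {id: toInt(row.product_id)})
// the join table becomes a relationship; its extra column becomes a property
MERGE (o)-[rel:CONTAINS]->(p)
SET rel.quantity = toInt(row.quantity);
```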
Normalized ER-Models: Cleanup Rules
• Remove technical IDs (auto-incrementing PKs)
• Keep domain IDs (e.g. ISBN)
  • Add constraints for those
• Add indexes for lookup fields
• Adjust names for Label, REL_TYPE and propertyName

Note: currently no composite constraints and indexes
RDBMS Import Tool Demo – Proof of Concept
• JDBC for vendor-independent database connection
• SchemaCrawler to extract DB meta-data
• Uses rules to drive graph model import
• Optional means to override default behavior
• Scales writes with the Parallel Batch Importer API
• Reads tables concurrently for nodes & relationships
Demo: MySQL - Employee Demo Database
Source: github.com/jexp/neo4j-rdbms-import (Blog Post)
Works with Postgres, MySQL, Oracle
Demo
Architecture & Integration
"Polyglot Persistence"
Three Ways to Migrate Data to Neo4j

[Diagram: an application backed by a relational database and a graph database; three options are shown: migrate all data, migrate only the graph data, or duplicate the graph data in both stores. Labels: non-graph data, graph data, all data; data storage and business rules execution; data mining and aggregation.]
Neo4j Fits into Your Enterprise Environment
[Diagram: applications talk to a graph database cluster (Neo4j x3); a data scientist runs ad-hoc analysis; bulk analytic infrastructure (graph compute engine, EDW, …) and existing databases (relational, NoSQL, Hadoop) feed the cluster; end users access the applications.]
Next Steps
Community. Training. Support.
There Are Lots of Ways to Easily Learn Neo4j
Resources
Online
• Developer Site: neo4j.com/developer
  • RDBMS to Graph
  • Guide: ETL from RDBMS
  • Guide: CSV Import
• LOAD CSV Webinar
• Reference Manual
• StackOverflow

Offline
• In-Browser Guide "Northwind"
• Import Training Classes
• Office Hours
• Professional Services Workshop
• Free Books:
  • Graph Databases, 2nd Edition
  • Learning Neo4j
Register today at graphconnect.com. Early Bird only $99.
Relational to Graph: Data Import
Thank you! Questions?
neo4j.com | @neo4j