Design Principles
• Future-proof, scalable and auto-recoverable; compatible with existing technologies; loosely coupled, layered architecture
Centralized Data Governance service
• Build a schema catalog service to track all data entities and attributes for both structured and unstructured data sets
• Establish and enforce sound practices, including solution patterns/design, coding, test automation and release procedures
Logical Architecture
[Architecture diagram: data flows from acquisition through transformation/storage and processing to distribution]
• Data Acquisition: text files, image files, XML files, EDI files, events, …
• Data Transformation and Storage / Data Processing Pipeline: Hadoop (HDFS, MapReduce), Hive, Pig, Flume, Spark, Java/Scala; NoSQL (MongoDB, Cassandra); relational databases (MS SQL Server, Oracle, MySQL)
• Data Distribution: BI reports, text files, image files, XML files, EDI files, events, …
Logical Architecture
• Data lifecycle control, access audit, replication and DR
• On-disk and in-memory data processing technology stack: SQL or NoSQL database, Hadoop MapReduce, Spark, or an ETL tool, etc. (see the sketch after this list)
• Central data inventory services for discovery, tracking and optimization
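To make the on-disk/in-memory split concrete, here is a minimal Spark sketch in Scala: it reads a data set from HDFS, caches it in memory for repeated processing, and writes the derived result back. The paths and the user_id column are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object OnDiskInMemorySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("on-disk-in-memory-sketch")
      .getOrCreate()

    // On-disk: source data lives in HDFS (hypothetical path).
    val events = spark.read.parquet("hdfs:///data/staging/events")

    // In-memory: cache the working set so repeated passes avoid re-reading disk.
    events.cache()

    // Example aggregation over the cached data (user_id is a hypothetical column).
    val usage = events.groupBy("user_id").count()

    // Persist the derived data set back to HDFS.
    usage.write.mode("overwrite").parquet("hdfs:///data/derived/usage")

    spark.stop()
  }
}
```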
Technology Stack
• HDFS, MapReduce, YARN
• Oozie, Hive, Spark, Kafka, Cassandra, MongoDB (a Kafka producer sketch follows this list)
• BI & reporting, data acquisition and distribution, data inventory and data model
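Kafka in the stack above carries acquisition and distribution events between layers. A minimal producer sketch, assuming a hypothetical data-events topic and a local broker:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object EventProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // Broker address is an assumption; point at the real cluster.
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Topic name "data-events" and the payload are hypothetical.
    val record = new ProducerRecord[String, String](
      "data-events", "orders", """{"file":"orders.xml","status":"landed"}""")
    producer.send(record)
    producer.close()
  }
}
```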
Schema Catalog
• MongoDB schema store
• Schemas, entities and attributes defined in Avro format
• Define all data sources and destinations, including format, transfer protocol, file system, schedule, etc. (see the sketch below)
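A sketch of registering a catalog entry, assuming a hypothetical schema_catalog database and schemas collection in the MongoDB schema store; the Customer entity and its transfer metadata are illustrative. The Avro definition is parsed first so only valid schemas get catalogued.

```scala
import com.mongodb.client.MongoClients
import org.apache.avro.Schema
import org.bson.Document

object SchemaCatalogSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical Avro schema for a customer entity.
    val avroJson =
      """{
        |  "type": "record",
        |  "name": "Customer",
        |  "fields": [
        |    {"name": "id", "type": "long"},
        |    {"name": "name", "type": "string"}
        |  ]
        |}""".stripMargin

    // Validate the definition before cataloguing it.
    val schema = new Schema.Parser().parse(avroJson)

    // Store the schema plus source/destination metadata in the schema store.
    // Connection string and database/collection names are assumptions.
    val client = MongoClients.create("mongodb://localhost:27017")
    val catalog = client.getDatabase("schema_catalog").getCollection("schemas")
    catalog.insertOne(new Document("entity", schema.getFullName)
      .append("avro", avroJson)
      .append("source", new Document("protocol", "sftp").append("format", "xml"))
      .append("destination", new Document("fileSystem", "hdfs").append("schedule", "daily")))
    client.close()
  }
}
```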
Data Ledger
• Ledger inventory of all business data sets across the enterprise
• Data set producer and consumer registration
• Data sets are tagged and can be queried for traceability and usage (see the sketch below)
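A sketch of ledger registration and tag-based lookup under similar assumptions (hypothetical data_ledger database, data_sets collection, and producer/consumer names):

```scala
import com.mongodb.client.MongoClients
import com.mongodb.client.model.Filters
import java.util.Arrays
import org.bson.Document

object DataLedgerSketch {
  def main(args: Array[String]): Unit = {
    // Connection string and collection names are assumptions.
    val client = MongoClients.create("mongodb://localhost:27017")
    val ledger = client.getDatabase("data_ledger").getCollection("data_sets")

    // Register a data set with its producer, consumers and tags.
    ledger.insertOne(new Document("name", "daily_orders")
      .append("producer", "order-service")
      .append("consumers", Arrays.asList("bi-reports", "fraud-model"))
      .append("tags", Arrays.asList("pii", "finance")))

    // Trace every data set carrying a given tag.
    ledger.find(Filters.eq("tags", "pii"))
      .forEach(doc => println(doc.toJson))

    client.close()
  }
}
```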
Data Processing and Persistence
• Relational database for OLTP, data warehouse and BI workloads that need SQL access and integration with existing systems
• HDFS for source, destination and staging data, unstructured documents, and large-to-huge data processing; data saved in either Avro or Parquet format for better interchange and performance
• Cassandra for high-frequency, write-heavy transactional systems; MongoDB for document workloads (see the sketch below)
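As a sketch of the HDFS persistence choice, the Spark job below converts staged Avro records to Parquet for downstream warehouse and BI scans. It assumes the spark-avro package is on the classpath; the paths are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object StagingToParquetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("staging-to-parquet")
      .getOrCreate()

    // Read staged Avro records from HDFS (requires the spark-avro package;
    // paths are hypothetical).
    val staged = spark.read.format("avro").load("hdfs:///data/staging/orders")

    // Persist in Parquet for columnar scans by downstream BI and warehouse jobs.
    staged.write.mode("overwrite").parquet("hdfs:///data/warehouse/orders")

    spark.stop()
  }
}
```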