jim-mlodgenski
DESCRIPTION
This talk will begin with a discussion of the strengths of PostgreSQL and Hadoop. We will then lead into a high-level overview of Hadoop and its community of projects like Hive, Flume and Sqoop. Finally, we will dig down into various use cases detailing how you can leverage Hadoop technologies for your PostgreSQL databases today. The use cases will range from using HDFS for simple database backups to using PostgreSQL and Foreign Data Wrappers to do low-latency analytics on your Big Data.
Postgres & Hadoop
Who am I?
Jim Mlodgenski
CTO, OpenSCG
Co-organizer, NYCPUG
Co-organizer, Philly PUG
Co-chair, PGConf US
@jim_mlodgenski
Agenda
Strengths of PostgreSQL
Strengths of Hadoop
Hadoop Community
Use Cases
Best of Both Worlds
Postgres
World’s most advanced open source database solution
Enterprise class including MVCC, streaming replication & rich data type support (to name a few!)
Robust transaction support with strong ANSI-SQL compliance
Hadoop
Big data distributed framework
Reliable, massively scalable & proven
Failures handled at the application layer allowing commodity hardware
Strengths of PostgreSQL
Strong Data Types
Concurrency
Transactions
Security
Indexes
Connectors
Components of PostgreSQL
Database
Connectors
– JDBC
– ODBC
– libpq
Foreign Data Wrappers
And more...
Strengths of Hadoop
Parallelism
Flexibility
Redundancy
Scalability
Components of Hadoop
HDFS
Hive
Flume
Sqoop
ZooKeeper
HBase
And many more...
HDFS
Hadoop Distributed File System
HBase
Modeled after Google Bigtable
Column-oriented database on top of HDFS
ZooKeeper
Distributed Configuration Service
Supports synchronization and distributed locking
Automatic leader election
Hive
Adds SQL on Hadoop
Converts SQL (HQL) to MapReduce Jobs
Flume
Streams data into HDFS
Distributed and Highly Available
Sqoop
Allows for bulk transfers of data between Hadoop and an RDBMS
Hadoop Community
Much more like the Linux community than the PostgreSQL community
Some competing commercial interests make the direction unclear to some
Use Cases
Hive Metastore
All of the metadata for the Hive tables resides in an RDBMS
The default is to use Derby
– Limits to a single connection
Hive Metastore (cont.)
Use PostgreSQL for scalability and reliability
Many concurrent users
PostgreSQL Backups
PostgreSQL's WAL archiving and Point In Time Recovery are powerful
– But it requires a lot of storage
Typically used with some sort of NFS
PostgreSQL Backups (cont.)
Use HDFS
– Redundancy & Scalability
PostgreSQL Backups (cont.)
Archive Command
archive_command = 'hadoop dfs -copyFromLocal %p /user/postgres/wal/%f'
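Before running archive_command, PostgreSQL replaces %p with the path of the WAL segment to archive and %f with its file name only. The sketch below mimics that substitution in Python so the resulting hadoop command line can be inspected. The function name `expand_archive_command` and the sample segment name are made up for illustration and are not part of PostgreSQL or Hadoop.

```python
# Minimal sketch: mimic how %p, %f and %% are expanded in archive_command.
# expand_archive_command is a hypothetical helper, not a PostgreSQL API.

ARCHIVE_COMMAND = "hadoop dfs -copyFromLocal %p /user/postgres/wal/%f"


def expand_archive_command(template, wal_path, wal_file):
    """Return the shell command that would be run for one WAL segment."""
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "%" and i + 1 < len(template):
            nxt = template[i + 1]
            if nxt == "p":      # path of the segment to archive
                out.append(wal_path)
                i += 2
                continue
            if nxt == "f":      # file name of the segment only
                out.append(wal_file)
                i += 2
                continue
            if nxt == "%":      # literal percent sign
                out.append("%")
                i += 2
                continue
        out.append(ch)
        i += 1
    return "".join(out)


if __name__ == "__main__":
    segment = "000000010000000000000003"  # sample name, assumed for the demo
    print(expand_archive_command(ARCHIVE_COMMAND, "pg_xlog/" + segment, segment))
```

Printing the expanded command this way is a cheap check that the target HDFS directory and file name come out as intended before the setting is put into postgresql.conf.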
Log Files
Maintain log files for months or years
May use Syslog to consolidate multiple database logs
Turning on query logging makes the log file huge
Log Files (cont.)
Use Flume
Consolidates logs across databases
MapReduce allows for parallel analysis
Log Files (cont.)
Set up Syslog to forward messages to Flume
rsyslog.conf:
*.* @127.0.0.1:5140
Configure Flume to act as a Syslog server
pglogs.sources.sl.type = syslogudp
pglogs.sources.sl.port = 5140
pglogs.sources.sl.host = 0.0.0.0
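To check that forwarding works end to end, it can help to push a hand-made syslog datagram at the UDP port configured above. The following is a minimal sketch, assuming a classic BSD-style syslog line (priority, timestamp, host, tag, message); the helper names `format_syslog_line` and `send_udp` and the host name `db1` are hypothetical.

```python
# Minimal sketch: build a BSD-style syslog line and (optionally) send it
# over UDP. Helper names and sample values are assumptions for illustration.
import socket
from datetime import datetime

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
LOCAL0 = 16  # syslog facility code for local0
INFO = 6     # syslog severity code for informational messages


def format_syslog_line(facility, severity, when, host, tag, message):
    """Return the datagram payload as bytes."""
    pri = facility * 8 + severity  # priority value = facility * 8 + severity
    stamp = "%s %2d %02d:%02d:%02d" % (
        MONTHS[when.month - 1], when.day, when.hour, when.minute, when.second)
    return ("<%d>%s %s %s: %s" % (pri, stamp, host, tag, message)).encode("utf-8")


def send_udp(payload, address="127.0.0.1", port=5140):
    """Send one datagram to the syslog listener (port taken from the slide)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (address, port))
    finally:
        sock.close()


if __name__ == "__main__":
    line = format_syslog_line(LOCAL0, INFO, datetime.now(), "db1",
                              "postgres", "LOG:  statement: SELECT 1")
    print(line.decode("utf-8"))  # call send_udp(line) to actually transmit it
```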
Log Files (cont.)
MapReduce jobs can quickly analyze the logs
public static class MapClass extends MapReduceBase
        implements Mapper<StatementOffset, Text, Text, LongWritable> {

    private final static String STATEMENT_DELIM = "statement: ";
    private final static String SYSLOG_IDENT = "postgres";

    private final static LongWritable one = new LongWritable(1);

    public void map(StatementOffset key, Text value,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();

        // Count only PostgreSQL statement log lines, keyed by statement type
        if (line.startsWith(SYSLOG_IDENT) && line.contains(STATEMENT_DELIM)) {
            output.collect(getStatementType(line), one);
        }
    }
    ...
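The same map-then-sum idea can be tried on a small log sample without a cluster. This is a rough Python sketch of the counting logic only; `statement_type` is an assumed stand-in for the `getStatementType` helper, whose body is not shown on the slide, and here it simply takes the first keyword of the SQL text.

```python
# Minimal sketch of the mapper/reducer logic on plain Python lists.
# statement_type() is an assumption; the slide does not show getStatementType().
from collections import Counter

STATEMENT_DELIM = "statement: "
SYSLOG_IDENT = "postgres"


def statement_type(line):
    """Assumed behavior: first keyword of the logged SQL, upper-cased."""
    sql = line.split(STATEMENT_DELIM, 1)[1].strip()
    return sql.split(None, 1)[0].upper() if sql else "UNKNOWN"


def map_line(line):
    """Emit (statement type, 1) for PostgreSQL statement lines only."""
    if line.startswith(SYSLOG_IDENT) and STATEMENT_DELIM in line:
        yield statement_type(line), 1


def count_statements(lines):
    """Sum the emitted pairs per key, as a reducer would."""
    totals = Counter()
    for line in lines:
        for key, value in map_line(line):
            totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    sample = [
        "postgres[4242]: [3-1] LOG:  statement: SELECT 1",
        "postgres[4242]: [4-1] LOG:  statement: UPDATE t SET a = 1",
    ]
    print(count_statements(sample))
```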
Transaction History
History tables grow very rapidly
Maintaining the tables over time is a huge undertaking
Partitioning is frequently used
Transaction History (cont.)
Use Sqoop
– Add a sequence to the table for fast incremental loads
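With a sequence-backed column, each load only needs the rows whose value is above the last one already imported. The sketch below assembles a Sqoop incremental-append command line in Python and tracks the high-water mark between runs. The JDBC URL, table, column and directory names are hypothetical, and a real invocation would normally also carry credentials.

```python
# Minimal sketch: build a Sqoop incremental import command and track the
# last imported sequence value. All names and values are placeholders.

def sqoop_incremental_args(jdbc_url, table, check_column, last_value, target_dir):
    """Return the argument list for an append-mode incremental import."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--incremental", "append",       # only rows added since the last run
        "--check-column", check_column,  # the sequence-backed column
        "--last-value", str(last_value),
    ]


def next_last_value(previous, imported_ids):
    """New high-water mark after a run; unchanged when nothing was imported."""
    return max([previous] + list(imported_ids))


if __name__ == "__main__":
    args = sqoop_incremental_args(
        "jdbc:postgresql://localhost/sales", "order_history",
        "history_id", 1000, "/user/postgres/order_history")
    print(" ".join(args))
```

Keeping the high-water mark outside the database (or letting a saved Sqoop job track it) is a design choice; the point is only that the sequence gives a cheap, monotonic cut-off.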
OLAP Cubes
Can take a very long time to build
PostgreSQL will use only a single CPU
Drilling down to the details can be a very long query
OLAP Cubes
Use a Foreign Data Wrapper
Looks like a native table to reporting tools
Drill down takes place on Hadoop
OLAP Cubes (cont.)
Create a Foreign Server
CREATE EXTENSION hadoop_fdw;

CREATE SERVER hadoop_server
    FOREIGN DATA WRAPPER hadoop_fdw
    OPTIONS (address '127.0.0.1', port '10000');

CREATE USER MAPPING FOR PUBLIC SERVER hadoop_server;
OLAP Cubes (cont.)
Create a Foreign Table
CREATE FOREIGN TABLE order_line (
    ol_w_id        integer,
    ol_d_id        integer,
    ol_o_id        integer,
    ol_number      integer,
    ol_i_id        integer,
    ol_delivery_d  timestamp,
    ol_amount      decimal(6,2),
    ol_supply_w_id integer,
    ol_quantity    decimal(2,0),
    ol_dist_info   varchar(24)
) SERVER hadoop_server
  OPTIONS (table 'order_line');
OLAP Cubes (cont.)
Loading PostgreSQL aggregate tables is a simple SQL statement
Use Hive views for more complex aggregations
INSERT INTO item_sale_month
SELECT ol_i_id AS i_id,
       EXTRACT(YEAR FROM ol_delivery_d) AS year,
       EXTRACT(MONTH FROM ol_delivery_d) AS month,
       sum(ol_amount) AS amount
  FROM order_line
 GROUP BY 1, 2, 3;
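For readers less used to positional GROUP BY, the statement groups by item, year and month and sums the amounts. A small Python sketch of the same roll-up, on made-up rows, is shown here purely to illustrate the shape of the result; `item_sales_by_month` is a hypothetical helper.

```python
# Minimal sketch of the monthly roll-up the INSERT ... SELECT performs.
# The rows are invented sample data, not results from the talk.
from collections import defaultdict
from datetime import datetime
from decimal import Decimal


def item_sales_by_month(order_lines):
    """order_lines: iterable of (ol_i_id, ol_delivery_d, ol_amount)."""
    totals = defaultdict(Decimal)  # missing keys start at Decimal('0')
    for item_id, delivered, amount in order_lines:
        totals[(item_id, delivered.year, delivered.month)] += amount
    # One output row per (item, year, month), like GROUP BY 1, 2, 3
    return sorted((i, y, m, total) for (i, y, m), total in totals.items())


if __name__ == "__main__":
    rows = [
        (34928, datetime(2014, 1, 3), Decimal("10.50")),
        (34928, datetime(2014, 1, 20), Decimal("4.25")),
        (34928, datetime(2014, 2, 1), Decimal("1.00")),
    ]
    for row in item_sales_by_month(rows):
        print(row)
```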
OLAP Cubes (cont.)
Drill downs pass the processing down to Hive
postgres=# explain verbose select sum(ol_amount) from order_line where ol_i_id = 34928;
                                QUERY PLAN
---------------------------------------------------------------------------
 Aggregate  (cost=11002.50..11002.51 rows=1 width=14)
   Output: sum(ol_amount)
   ->  Foreign Scan on public.order_line  (cost=10000.00..11000.00 rows=1000 width=14)
         Output: ol_w_id, ol_d_id, ol_o_id, ol_number, ol_i_id, ol_delivery_d, ol_amount, ol_supply_w_id, ol_quantity, ol_dist_info
         Remote SQL: SELECT * FROM order_line WHERE ((ol_i_id = 34928))
(5 rows)
Audit History
All database access should be audited and autonomously logged
Must be maintained for years
Audit History (cont.)
Use the Hadoop Foreign Data Wrapper to Flume
Audit History (cont.)
Create a writable foreign table
-- "table" and "user" are reserved words in PostgreSQL, so they are quoted
CREATE FOREIGN TABLE audit (
    audit_id bigint,
    event_d  timestamp,
    "table"  varchar,
    action   varchar,
    "user"   varchar
) SERVER hadoop_server
  OPTIONS (table 'audit', flume_port '44444');
Message Queue
Tables have a lot of churn with many updates and deletes
Causes a lot of table and index bloat in PostgreSQL
AKA a vacuuming nightmare
Message Queue (cont.)
Use an FDW to HBase
HBase is not an “Eventually Consistent” architecture, so it is well suited to message queues
Message Queue (cont.)
Create a writable foreign table
CREATE FOREIGN TABLE hbase_table (
    key   varchar,
    value varchar
) SERVER hadoop_server
  OPTIONS (table 'hbase_table', hbase_address 'localhost',
           hbase_port '9090', hbase_mapping ':key,cf:val');

INSERT INTO hbase_table VALUES ('key1', 'value1');
INSERT INTO hbase_table VALUES ('key2', 'value2');
UPDATE hbase_table SET value = 'update' WHERE key = 'key2';
DELETE FROM hbase_table WHERE key = 'key1';
SELECT * FROM hbase_table;
High Availability
When setting up replication for high availability, many necessary components are not provided by PostgreSQL
Failure detection
Split brain prevention
Replica promotion
Notification to clients of failover
High Availability (cont.)
ZooKeeper with a custom background worker can handle all of the missing components
High Availability (cont.)
Failure Detection – Replicas watch an ephemeral lock created by the master
void watch_master() {
    ...
    sprintf(root_path, "%s/lock", zookeeper_path);

    while (!found_master && !got_sigterm) {
        elog(DEBUG1, "Looking for the master lock...");

        rc = zoo_get_children(zh, root_path, 0, &children);

        if (rc == ZOK) {
            /* Pick the lexicographically lowest child as the master lock */
            sprintf(child, "%s", "~");
            for (i = 0; i < children.count; i++) {
                if (strcmp(child, children.data[i]) > 0) {
                    sprintf(child, "%s", children.data[i]);
                    found_master = 1;
                }
            }

            if (found_master) {
                sprintf(lock_path, "%s/%s", root_path, child);
                elog(DEBUG1, "Found a lock at %s", lock_path);

                /* Set the watch on the lock */
                bufferlen = sizeof(buffer);
                rc = zoo_get(zh, lock_path, 1, buffer, &bufferlen, NULL);
                if (rc != ZOK) {
                    found_master = 0;
                    elog(LOG, "Unable to watch %s. Retrying...", lock_path);
                }
            }
        } else {
            elog(LOG, "The path %s does not have any children yet. ...", root_path);
        }
    ...
}
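The child-selection step in the loop above can be isolated and tried on its own. This is a small Python sketch of that step only, assuming sequence-style node names such as s-0000000003; `find_master_lock` is a hypothetical name and no ZooKeeper connection is involved.

```python
# Minimal sketch of the "lowest child wins" scan from watch_master().
# find_master_lock is a hypothetical helper; inputs are plain strings.

def find_master_lock(children):
    """Return the lexicographically smallest child name, or None if empty."""
    lowest = "~"   # sorts after ASCII letters and digits, as in the C code
    found = False
    for name in children:
        if name < lowest:
            lowest = name
            found = True
    return lowest if found else None


if __name__ == "__main__":
    print(find_master_lock(["s-0000000007", "s-0000000003"]))
```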
High Availability (cont.)
Split brain prevention – the master grabs an exclusive ZooKeeper lock on startup and shuts down immediately if unsuccessful
char *create_lock() {
    char path[PATH_LEN];
    char *buffer;
    int rc;

    buffer = (char *) palloc(PATH_LEN);

    ensure_connected();

    /* Create the parent lock node if it does not exist yet */
    sprintf(path, "%s/lock", zookeeper_path);
    if (zoo_exists(zh, path, 0, NULL) == ZNONODE) {
        /* buffer is heap-allocated, so its capacity is PATH_LEN */
        rc = zoo_create(zh, path, NULL, -1, &ZOO_OPEN_ACL_UNSAFE, 0,
                        buffer, PATH_LEN - 1);
        if (rc) {
            elog(FATAL, "Failure creating zooKeeper path: %d", rc);
        }
    }

    /* Append the sequence prefix to the lock path in place */
    strcat(path, "/s-");

    /* Ephemeral + sequential: the node disappears with the session */
    rc = zoo_create(zh, path, "master", 6, &ZOO_OPEN_ACL_UNSAFE,
                    ZOO_EPHEMERAL | ZOO_SEQUENCE, buffer, PATH_LEN - 1);
    if (rc) {
        elog(FATAL, "Failure creating zooKeeper lock: %d", rc);
    }
    elog(DEBUG1, "Created a zooKeeper ephemeral path at: %s", buffer);

    return buffer;
}
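The lock relies on ZooKeeper's ephemeral sequential nodes: each contender gets a name with an increasing, zero-padded counter, and the owner of the lowest one holds the lock. The sketch below simulates that in memory so the rule can be exercised without a ZooKeeper ensemble. `FakeLockDir` is an invented stand-in, and `has_lock` here is only a guess at what the helper of the same name in the talk's code checks.

```python
# Minimal in-memory simulation of a lock built on ephemeral sequential
# nodes. FakeLockDir is a made-up stand-in, not a ZooKeeper client.

class FakeLockDir:
    """Holds the children of a lock node such as <zookeeper_path>/lock."""

    def __init__(self):
        self._counter = 0
        self._children = set()

    def create_sequential(self, prefix="s-"):
        # Sequential nodes get a zero-padded 10-digit counter appended.
        name = "%s%010d" % (prefix, self._counter)
        self._counter += 1
        self._children.add(name)
        return name

    def delete(self, name):
        # Stands in for an ephemeral node vanishing when its session ends.
        self._children.discard(name)

    def children(self):
        return sorted(self._children)


def has_lock(lock_dir, my_node):
    """Assumed rule: the lowest-numbered child is the lock holder."""
    children = lock_dir.children()
    return bool(children) and children[0] == my_node


if __name__ == "__main__":
    locks = FakeLockDir()
    first = locks.create_sequential()
    second = locks.create_sequential()
    print(has_lock(locks, first), has_lock(locks, second))
```

A node that fails this check on startup knows another master already exists, which is the condition under which the slide says to shut down.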
High Availability (cont.)
Replica promotion – use ZooKeeper to hold the ballots of an election. The highest LSN wins
void elect_master() {
    ...
    recptr = GetWalRcvWriteRecPtr(NULL, NULL);
    sprintf(lsn, "%X/%08X", (uint32) (recptr >> 32), (uint32) recptr);

    elog(DEBUG1, "Entering a ballot with an LSN of: %s", lsn);

    sprintf(path, "%s/lock/%s", zookeeper_path, replica_id);

    /* Cast this replica's ballot: an ephemeral node holding its LSN */
    rc = zoo_create(zh, path, lsn, strlen(lsn), &ZOO_OPEN_ACL_UNSAFE,
                    ZOO_EPHEMERAL, buffer, sizeof(buffer) - 1);
    if (rc) {
        elog(FATAL, "Failure creating zooKeeper path: %s", path);
    }
    elog(DEBUG1, "Created a zooKeeper ephemeral path at: %s", buffer);

    /* Wait until every registered replica has cast a ballot */
    all_votes_in = false;
    while (!all_votes_in && !got_sigterm) {
        sprintf(path, "%s/replica", zookeeper_path);
        rc = zoo_get_children(zh, path, 0, &replicas);

        if (rc == ZOK) {
            sprintf(path, "%s/lock", zookeeper_path);
            rc = zoo_get_children(zh, path, 0, &ballots);

            if (rc == ZOK) {
                all_votes_in = true;
                for (i = 0; i < replicas.count; i++) {
                    found = false;
                    for (j = 0; j < ballots.count; j++) {
                        if (strcmp(replicas.data[i], ballots.data[j]) == 0) {
                            found = true;
                            break;
                        }
                    }

                    if (!found) {
                        all_votes_in = false;
                        break;
                    }
                }
            }
        }
        ...
    }

    /* Compare this replica's LSN against every other ballot */
    for (j = 0; j < ballots.count; j++) {
        if (strcmp(ballots.data[j], replica_id) != 0) {
            sprintf(path, "%s/lock/%s", zookeeper_path, ballots.data[j]);

            memset(buffer, 0, sizeof(buffer));
            bufferlen = sizeof(buffer);
            rc = zoo_get(zh, path, 0, buffer, &bufferlen, NULL);
            if (rc != ZOK) {
                elog(LOG, "Unable to get %s. New master probably already found...", path);
            }

            elog(DEBUG1, "Comparing the LSN: %s", buffer);

            if (strcmp(lsn, buffer) < 0) {
                elog(DEBUG1, "Found an LSN greater than mine. I am not the winner.");
                return;
            } else if (strcmp(lsn, buffer) == 0) {
                elog(DEBUG1, "Found an LSN equal to mine. See if I was the first to the start.");
                if (strcmp(replica_id, ballots.data[j]) > 0) {
                    elog(DEBUG1, "Found an LSN equal to mine and a sequence earlier than mine. I am not the winner.");
                    return;
                }
            }
        }
    }

    elog(LOG, "Becoming the new master. Acquiring the proper locks.");

    lock = create_lock();

    for (j = 0; j < ballots.count; j++) {
        /* Build the path of each ballot before removing it */
        sprintf(path, "%s/lock/%s", zookeeper_path, ballots.data[j]);
        elog(DEBUG1, "Removing ballot at %s", path);
        rc = zoo_delete(zh, path, -1);
        if (rc != ZOK) {
            elog(LOG, "Unable to delete %s", path);
        }
    }

    if (!has_lock(lock)) {
        elog(LOG, "Unable to acquire a zooKeeper lock. Shutting down to prevent a split brain scenario");
        do_stop();
    } else {
        elog(LOG, "Promoting to become the new master.");
        do_promote();
    }

    publish_master_info();
}
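The election rule itself (highest LSN wins, ties go to the lower replica id) is easy to check in isolation. The sketch below is one possible way to express it in Python; it parses each LSN into an integer before comparing, which is a substitution for the string comparison used above and avoids misordering LSNs whose high halves differ in length. `parse_lsn` and `elect` are hypothetical names.

```python
# Minimal sketch of the ballot comparison: highest LSN wins, and the lower
# replica id breaks ties. LSNs are compared numerically here, which is a
# deliberate change from comparing the text form.

def parse_lsn(text):
    """Convert an LSN such as '16/B374D848' into a single integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def elect(ballots):
    """ballots maps replica id -> LSN string; returns the winning id."""
    return min(ballots, key=lambda rid: (-parse_lsn(ballots[rid]), rid))


if __name__ == "__main__":
    votes = {"r1": "0/3000060", "r2": "0/3000148", "r3": "0/3000148"}
    print(elect(votes))
```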
High Availability (cont.)
Client notification – Python (or others) can watch the master and act appropriately
def __init__(self, zkHosts, pathName):
    zkHosts = zkHosts
    pathName = pathName

    watchPath = pathName + "/master"

    zk = KazooClient(hosts=zkHosts)
    zk.start()

    if zk.exists(watchPath + "/connection"):
        (data, stat) = zk.get(watchPath + "/connection")
        self.pgConnection = data

        # Called whenever the published master connection changes
        @zk.DataWatch(watchPath + "/connection")
        def host_watch(data, stat):
            print("The new master connection is %s" % data)
            self.pgConnection = data

            host = self.pgConnection.split(":")[0]
            port = self.pgConnection.split(":")[1]

            # Rewrite host and port for the database in the bouncer config
            Config = ConfigParser.SafeConfigParser()
            Config.read(bouncer_config_file)
            for name, value in Config.items("databases"):
                if name == bouncer_db:
                    newValue = ""
                    options = value.split(" ")

                    for option in options:
                        if option.split("=")[0] == "host":
                            newValue = newValue + "host=" + host + " "
                        elif option.split("=")[0] == "port":
                            newValue = newValue + "port=" + port + " "
                        else:
                            newValue = newValue + option + " "

                    Config.set("databases", bouncer_db, newValue)

            cfgfile = open(bouncer_config_file, 'w')
            Config.write(cfgfile)
            cfgfile.close()

            self.reloadBouncer()
    else:
        raise NameError("The path (" + watchPath + \
                        "/connection) does not exist in ZooKeeper")
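The string handling inside the watch callback can be pulled out into two small functions and tested without ZooKeeper or a connection pooler. This is a sketch under the assumption that the published value has the form host:port and that the database entry is a space-separated list of key=value options; `parse_master` and `rewrite_connection` are hypothetical names.

```python
# Minimal sketch of the host/port rewrite done in the watch callback.
# Function names and sample values are assumptions for illustration.

def parse_master(connection):
    """Split 'host:port' into its parts; bytes input is decoded first."""
    if isinstance(connection, bytes):
        connection = connection.decode("utf-8")
    host, port = connection.split(":", 1)
    return host, port


def rewrite_connection(value, host, port):
    """Replace host= and port= in a key=value option string."""
    rewritten = []
    for option in value.split():
        key = option.split("=", 1)[0]
        if key == "host":
            rewritten.append("host=" + host)
        elif key == "port":
            rewritten.append("port=" + port)
        else:
            rewritten.append(option)  # keep every other option untouched
    return " ".join(rewritten)


if __name__ == "__main__":
    host, port = parse_master(b"10.0.0.2:5433")
    print(rewrite_connection("host=10.0.0.1 port=5432 dbname=app", host, port))
```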
Getting the Components
http://hadoop.apache.org/
http://hive.apache.org/
http://flume.apache.org/
http://sqoop.apache.org/
http://zookeeper.apache.org/
http://hbase.apache.org/
http://www.postgresql.org/
http://jdbc.postgresql.org/
http://openjdk.java.net/
http://openscg.com/se/hadoop-fdw/
Or...
BigSQL.org
Questions?