8/11/2019 Ver Tica Database Multi Terabyte
http://slidepdf.com/reader/full/ver-tica-database-multi-terabyte 1/15
Building a Multi-terabyte Vertica Database
Vertica Confidential. Copyright Vertica Systems Inc. 2007
February, 2007
Logical tables are physically stored as columns, and partitioned into segments on several machines and in several different projections.
Projections and Performance
A projection, because of its sorting, localizes logically-grouped values, so that a single disk read can
pick up many results at once. Today's disks can read astoundingly fast once positioned on the data,
but still take many milliseconds to seek to a single record.
The best sort orders are determined by the where clauses of queries. If a sort order is (x,y) and a
query has “where x=1 and y=2”, all the needed data is found together in one place in the sort
order. The query will fly.
Another important performance factor is data compression. Compression and localization play
well together, because localized data has repeated sequences that compress very well. Also,
compression makes multiple projections affordable, and thus lets more queries fly.
DB Designer
DB Designer is a component of Vertica that uses cost-based metrics to analyze your schema,
data and queries and design the best sort orders for projections. Your essential contribution is
If surprise queries are rare and the training queries are selective (under 5%), the number of disks
can be lowered. The number of processors cannot be lowered, however, without slowing down
the training queries, since they tend to run CPU-bound.
K-safety
If K=1 safety is needed, add one or two systems as replacement systems. These do not need to
be hot standby systems, just on-site. The disk space suggestions above cover K=1 needs.
Building the Linux Infrastructure
Suggested hardware
A wide range of possible systems can be utilized. Systems from Dell and HP are well-made and
reliable, although more expensive than ones put together without a name brand. However,
beware of cheap consumer systems, as they are often shoddily made. For good performance,
small size (rack mountable, usually 2U), and low cost, the pizza box server is an excellent choice.
A typical 2U system can accommodate 6 disks. The following assumes direct attached storage,
the classical shared-nothing approach with local disks on each system. A SAN (storage area
network) configuration can also be utilized, but is more expensive.
Linux system for one Vertica node
1. Two CPUs, dual core if available, with at least 2MB cache; 4MB is preferable.
2. At least 4GB memory, or 2GB/CPU core for more than two CPU cores.
3. 4 matched disks for Vertica data, preferably SATA (or SAS, or other SCSI variants, but
SCSI is more expensive). 10K rotation speed is good, 15K is better, but more expensive.
Hardware RAID is another possibility, but requires another single disk for the system
disk.
4. Possibly another, small disk for the operating system. This disk can have lower performance, IDE for example. See Linux System Setup below for discussion.
5. 1Gbps Ethernet interface.
6. USB 2.0 port, for loading from moveable disk.
7. If it comes with RedHat 4, make sure the OS is installed on one approximately 50 GB
partition of one disk, leaving the rest of the system disk usable for other purposes.
Additional Hardware
1. 1Gbps Ethernet switch with ports for all N systems, plus at least 2 more ports for external
connection, and a spare. Dell sells a 24-port 1Gbps switch with a 30 Gbps backplane that
works well in our experience.
2. Enough USB 2.0 disks to hold the fact data, and 5 USB hubs. Not needed if there is a 1Gbps LAN connection from the source of the data.
Linux System Setup
During Linux install, choose locale en-US.UTF-8 for US English, or another appropriate UTF-8
locale. You can set up one system partition and clone it by image copy to the other systems, and
then fix the few things that are specific to each system.
A straightforward disk layout uses one disk for the system, and four matched disks for Vertica
data. Vertica does not require separate “temp space”, so the 4 disks can be put together in one
Linux software RAID 0 “md0” filesystem for each node, as shown in the figure below. The stripe
size (chunk size for the Linux RAID tool mdadm) should be 1MB.
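The RAID 0 setup described above might be scripted roughly as follows. This is a sketch, not a tested procedure: the device names (/dev/sdb through /dev/sde) are placeholder assumptions, and the commands must run as root on real hardware.

```shell
# Build a 4-disk software RAID 0 array with a 1MB chunk size.
# Device names are placeholders -- substitute your four matched data disks.
mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=1024 \
    /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Create an ext3 filesystem on the array and mount it as the Vertica data area.
mkfs.ext3 /dev/md0
mkdir -p /vdata
mount /dev/md0 /vdata
```

Note that mdadm's --chunk option takes kilobytes, so 1024 corresponds to the 1MB stripe size suggested above.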
The system disk does not need to be a high performance disk; it could be IDE, for
example. If the system is restricted to a 50 GB partition of the single disk, the rest can be used
as a second filesystem without worry of using up the system filespace. This is the current disk
layout used for Vertica testing.
Proposed disk layout with system disk and 4-disk md RAID for a node
The system disk can double as a data disk as long as it is matched to the other data disks. The
operating system does not use significant disk bandwidth once booted, so its disk is largely idle in
the above plan.
The most audacious design is a single monolithic md0 filesystem, containing the OS as well as
the Vertica data. Here the reasoning goes that the OS is no more and no less important than the
database, since the failure of either causes a full rebuild of the node, so everything can be in one
boat. Diagnostics can be run from a bootable CD or a temporary additional disk. However, once
RAID is allowed under the system, there is no reason not to give the system its own 50GB
partition, and swap its own partition sized at twice memory, as follows:
(Figure: one md0 RAID filesystem spanning all disks, with system, swap, and extra-space partitions carved out of it.)
Proposed disk layout with system in RAID partition for a node
Mount the big data partition (md0) under the same name on all nodes, say “/vdata”. With 4 250G
disks, and 60GB reserved for the system, this provides 940 GB on each /vdata filesystem.
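The per-node capacity arithmetic can be checked quickly in the shell:

```shell
# Per-node /vdata capacity: four 250GB disks striped together,
# minus 60GB reserved for the system partition and swap.
disks_gb=$((4 * 250))          # 1000 GB raw
vdata_gb=$((disks_gb - 60))    # 940 GB usable per node
echo "$vdata_gb GB per node"   # prints "940 GB per node"
```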
NTFS support
If you are bringing data from Windows, add NTFS support to the RedHat Linux kernel. It is
missing from RedHat for legal reasons, not stability issues. It is available as an RPM at
www.linux-ntfs.org.
Naming your systems
For ease of administration, name your Vertica systems using a repeating pattern such as
vnode01, vnode02, …, vnode20. Test the hostnames using the procedure in the installation
guide under the heading Network Configuration: “Check Hostname Resolution.”
Connectivity via port 5433
TCP port 5433 is used for JDBC, underlying both external and internal (psql) tools. Make sure
port 5433 is enabled for inside- and outside-cluster connections. See “Check Remote Access” in
the installation guide. Another port can be used if necessary, but 5433 is the default.
Testing Connectivity
A test for this connectivity is as follows. From the client machine or another node, try “telnet
vnode01.whatever.com 5433”, and see if it connects, presenting you with a blank screen.
Disconnect with Ctrl-], then q. If this test fails, it may be the fault of the RedHat firewall or SELinux
security protections.
Fixing Connectivity
The firewall and SELinux can be disabled by running the Linux command system-config-securitylevel
and turning both off, if that is consistent with your security policy. If system security depends
on these protections, instead enable just port 5433 for JDBC and port 22 for ssh.
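For scripted setups, the same holes can be opened with iptables rules along these lines. This is a sketch for the RedHat-4-era iptables service, to be run as root and adapted to your local firewall configuration:

```shell
# Allow inbound JDBC/psql traffic on port 5433 and ssh on port 22.
iptables -A INPUT -p tcp --dport 5433 -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -j ACCEPT

# Persist the rules across reboots on RedHat.
service iptables save
```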
Linux note: getting X Windows to work for root
To enable X for root use for system-config-securitylevel and other X-enabled (GUI) tools, use the
shell command “xhost +” as root.
Setting the stage for the DB Design
Vertica customizes its data representation to your actual needs, expressed by you in your most
important queries, the training queries.
Determining the Training Queries
Consider the most time-crucial queries. Note that all queries on your data will be supported, so
there is no need to include all the columns of the fact table in the training queries. In fact, it is
important to be as restrictive as possible, consistent with your real needs.
What is most important in a training query is the where clause. A typical star query looks like this:
select … from fact
where … and … and …      -- most important part: which columns restrict the data
group by …
order by …
However, the columns mentioned anywhere in the query are also important, to ensure their
presence along with the where-clause columns in a Vertica projection. This is not to say that the
queries should be artificial. They should be legitimate important queries, except for the exact
constants involved, which are expected to be variable in practice.
Determining the Segmentation Key
The DB Designer will choose a good segmentation key based on your sample data. Still, it’s a
good idea to think about the choices and understand the considerations.
The proper choice of segmentation key, a certain column of the fact table, is an important part of
the Vertica database implementation and does not follow directly from the schema or training
queries. Each projection for a fact table has a certain segmentation key, and it is possible to
have more than one segmentation key in use for a table. However, K-safety considerations tend to align
these keys so that one segment of data from one projection can be used to reload the same
segment of another projection on a failed node. Thus the important thing is determining one or
two good segmentation keys for a fact table.
The wrong segmentation key could slow down queries because of poor load balance between
nodes. The segmentation key should be a column of the fact table that is not present in the where
clauses of important queries. It should be able to compartmentalize fact data into segments so
that each important query will use all the segments.
For example, if the fact data is about employees, the Social Security number (an integer) would
have this property. Twenty segments could be based on ranges of this number:
Site s1: values less than 050000000
Site s2: values less than 100000000
…and so on.
In many database systems, practical maintenance requires horizontal partitioning by time. This is
not true of Vertica, because it incrementally merges in new data and deletes old data. In fact,
since many queries are on recent data, segmentation by time is not recommended.
Mobilizing the Data
Vertica is loaded from delimited text files. See the COPY command documentation in the
Database Administrator’s Guide and further discussion below.
A common case is moving data from its current database home to Vertica. Here is more on that
case.
Migrating the Schema
Obtaining the Schema
You may need to reconstruct the create table statements, etc., that define the schema. An ETL
tool such as Informatica can do this for you. If you want to do it without such a tool, you can see
if your source database system has a way to generate DDL. DB2 has DDL generation in its
db2look tool. Oracle has the DBMS_METADATA package, which can output SQL create table
statements, etc. However, these tools (especially from Oracle) tend to use proprietary data types
and additional storage clauses, so some edits will be needed.
DB Visualizer (www.minq.se/products/dbvis) can export schema and is moderately priced, and
supports most important databases. It is based on JDBC, so Vertica should soon be usable
through this tool.
If you are using the free Eclipse IDE, add the WTP (Web Tools Project) package, and try its
Generate DDL wizard in the Database Explorer. However, this is not enterprise software, and
may not work for all databases.
Checking the Schema
Make sure that the needed foreign key clauses are there, in the table definitions themselves or in
separate “alter table T add constraint …” commands. See retail_define_schema.sql of the quick
start example database for an example. Add any missing foreign key constraints that hold the
star or snowflake together. The foreign key constraints are very important to guide the DB
Designer in its work.
Exporting the Data from the Source Database
The exact way to extract the data varies across source database systems. The data should be
exported to text form by the source database to a local file or attached disk if possible.
Oracle Note
In Oracle for example, there is no export (to text) tool, only a load tool from text (SQL
Loader). To export data, you can run a select query in Oracle’s SQLPlus command line
query tool with specified column delimiter, suppressed headers, etc., with redirection of
output to a local file (for example, on a USB 2.0 disk.)
To make a practical export scheme, design queries or unloads that produce 250-500GB files of
various parts of the data of the fact table, plus files for each non-fact table. For 10 TB, this
means 20-40 files for the fact table.
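The file-count arithmetic for a 10 TB fact table works out as follows:

```shell
# Number of export files needed to split 10 TB (10000 GB) of fact data
# into chunks of 250-500 GB each.
total_gb=10000
echo $((total_gb / 500))   # 20 files at 500 GB each
echo $((total_gb / 250))   # 40 files at 250 GB each
```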
Some attention to handling special characters is needed to make sure the Vertica COPY
command will accept all the exported rows. See the Appendix on Load Format Details.
ETL products, of course, can handle these well-known difficulties of moving data from one
database system to another. They typically use ODBC or JDBC to extract data, which gives them
program-level access to column values to fix them up as needed for the load files.
Moving the Data
The data can be transported from its source to the Vertica installation on USB 2.0 (or possibly
SATA) disks, or across a fast local network connection. Deliver chunks of data to the different
Vertica nodes by connecting the transport disk or writing files from a network copy.
Fast network transfer of data
A 1Gbps network can deliver about 50 MB/s, or 180GB/hr. Vertica can load about 200GB in 4
hours on each node (of 4 nodes), so about 50GB/hr on each node. Thus a dedicated 1Gbps LAN
should be usable. Slower LANs will be proportionally slower, and non-local networks are
probably untenable, because delays over distance slow down the TCP protocol to a small
percentage of its apparent bandwidth, even without competing traffic.
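The rates quoted above can be verified with a little shell arithmetic:

```shell
# A sustained 50 MB/s on a dedicated 1Gbps LAN, in GB per hour:
mb_per_hr=$((50 * 3600))            # 180000 MB/hr
echo "$((mb_per_hr / 1000)) GB/hr"  # prints "180 GB/hr"

# Vertica load rate per node: 200 GB in 4 hours.
echo "$((200 / 4)) GB/hr per node"  # prints "50 GB/hr per node"
```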
Disk transfer of data
USB 2.0 disks can deliver data at about 30 MB/s, or 108 GB/hour, fast enough. SATA disks are
usually internal, but can be external, or safely unplugged from an internal bay. USB 2.0 disks are easy to
use for transporting data from Linux to Linux. Simply set up an ext3 filesystem on the disk and
write large files there. Linux 2.6 has USB plug-and-play support, so a USB 2.0 disk is instantly
usable on various Linux systems.
Using a disk for one file
For other variants of UNIX, if there is no common filesystem format available, the disk can be
used without a filesystem for a single large file. You can use “cp bigfile /dev/sdc1” for example on
the source system and access the file on the Linux system as /dev/sdd1 or whatever device it
ends up with. Even without a filesystem on the disk, the plug-and-play support still works on
Linux to provide a device node for the disk. You can find out the assigned device by the shell
command “dmesg | tail -40” after plugging in the disk.
Data from a Windows System
For Windows to Linux, NTFS is the clear choice for the filesystem, which requires the added RPM
for Linux as discussed above under Linux System Setup. Although RedHat Linux as originally
installed can read Windows FAT32 filesystems, they are useless for such large files (FAT32
limits a file to 4 GB).
Building the Vertica Cluster
Setting up Vertica
The Quick Start guide shows the basic steps. The systems involved are the set of nodes of the
cluster, plus at least one additional client system to play the part of the eventual users.
Following the Installation Guide:
1. Make sure the hostnames pass the hostname tests listed in the Installation Guide, under Check Hostname Resolution. The hostnames are used in the Vertica installation process.
2. Make sure the RedHat firewall and SELinux pass port 5433 (or whatever port you are using instead). Test with "telnet <host> 5433" to and from nodes and from the client system.
3. Create the unprivileged Linux account for administration on each node. I called it vadmin. Enable the SSH logins as directed in the Installation Guide. Give vadmin ownership of /vdata.
4. In a root login, install Vertica on one node. This node will be your top-level administration node. See the Installation Guide for details, under Initial Software Installation.
5. Create the sample data, for testing.
Following the Quick Start Guide, a first test on one node.
Do the following logged in on some node as vadmin, the one user who runs adminTools in the
current version of Vertica.
1. Follow the directions to copy the sample data, but to make it more realistic, put it on
/vdata: mkdir /vdata/retail_example_database, etc. All “big” data should be under /vdata.
Of course this data is not really big.
2. Try out the single-site quick start install. It only takes a few minutes, and tests the core subset of the configuration. Use /vdata/retail_example_database/single/catalog and
/vdata/retail_example_database/single/data in the create-database step (admin tool 4).
3. Try out the canned queries, as described under Running Simple Queries.
Following the Quick Start Guide, a second test on two nodes.
Again do the following as vadmin, while logged in on the admin node.
1. Shut down the first database. Only one database at a time may be running with the
current version of Vertica.
2. Try out the multi-site quick start install. Use /vdata/retail_example_database/multi/catalog
and /vdata/retail_example_database/multi/data in the create-database step (admin tool 4).
3. Again try out the canned queries, on each of the two nodes.
Now you have installed and tested the core Vertica server.
Since the psql SQL environment is working from the above tests, and it depends on JDBC, we
are assured that JDBC is being served. You can use "netstat -a | grep 5433" to see the listener.
Again try “telnet <host> 5433” from the client system to any node involved in a Vertica database.
Note that psql can be run outside of adminTools on any Vertica node. From a client system, try
out your favorite JDBC client.
Building Your Database
Creating the Database and Running DB Designer
Leave the little example database where it is (and shut down) and set up another Vertica
database for your real data. First bring over your schema, training queries and data for non-fact
tables, and sample data for the fact table. Because this is only a moderate amount of data, it can
be transferred over the network easily.
1. Make a top-level directory in /vdata, say /vdata/sales, for this database.
2. Put the schema definition (say schema.sql) and training queries (say queries.sql) in
directory /vdata/sales/config and the data files in say /vdata/sales/inidata.
3. Do the create-cluster step of the Multi-Site Procedure in the Quick Start Guide, except
name your cluster appropriately, say “sales”.
4. The installation of Vertica has already been done at this point for two nodes, but needs to be done for the others.
5. Do the create-database step using the same name for your database as you did for the
cluster. Specify directories in /vdata such as /vdata/sales/catalog and /vdata/sales/data.
6. Run the Database Designer, from the config directory, entering the schema.sql and
queries.sql. Specify a temp directory on /vdata, say /vdata/temp, for best results. You
also specify the delimiter and null-value representations here. Provide the disk budget.
For our example 10 TB system with 20 nodes, each with 4 250G disks and 60GB reserved
for the system, we have 940 GB * 20 = 18800 GB for the disk budget.
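The budget figure above is simply the per-node usable space scaled to the cluster:

```shell
# Disk budget for the example 20-node cluster: 940 GB usable per node.
echo "$((940 * 20)) GB"   # prints "18800 GB"
```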
Checking the Projection Design
It is a good idea to examine the output of the DB Designer to make sure the projections are in fact sorted on columns of importance to where clauses of your training queries. Make sure the
segmentation key is right. If a projection seems unneeded or otherwise unexpected, check if a
training query is not what you intended, or possibly not really important. Contact technical support
if needed. It is much easier to fix design problems at this point than later on.
Implementing the Projections and doing the initial small load
1. Connect to the production database and set up the schema from
/vdata/sales/config/schema.sql. Primary and foreign key columns cannot have null
values. If “not null” is missing from the schema definition for such a column, it will be
added with a warning at this point, but this is harmless.
2. Generate the projections from the autoDBDesign.sql generated by DB Designer.
3. Load the dimensions and then the sample fact data using COPY DIRECT.
Here is a sample load command for comma-delimited data for a dimension table named
promotion, from a file on a USB disk mounted as /extdata, with comma-separated column values
and null values indicated as “null”:
copy promotion from '/extdata/promotion.dat' delimiter ',' null 'null' direct;
Copy commands will not fail if a few rows are rejected, for example, for having the wrong number
of delimited values. Thus the whole COPY is not a transaction but rather each row addition is
committed. See the COPY command documentation in the DBA Guide. Check the Vertica log file
for rejected rows and other diagnostics. The log file location is displayed by the tool
/opt/vertica/bin/dbInfo. You can fix up the problems and load the corrected rows.
Testing the initial database
The database is now functional, although still relatively small, since only the sample fact data is
loaded so far. You should try out your training queries at this point. If anything fails or runs
slowly, study the projection definitions for problems, and contact technical support. If another
projection is needed, for example, you will need to redo the database build with the new
projection, a relatively easy task at this point before the main body of data is loaded, but much
harder later.
The Big Load
Suppose we have a 10-20 node system and 20 500G USB 2.0 disks for transporting data. We
can start loads on each of 5 nodes by accessing one transport disk on each and starting a COPY
DIRECT for its data. When they are all done, another 5 can be loaded, and so on for four rounds.
This choice of 5 parallel loads is just an example. You may be able to do more at once. You can
add parallel loads until the overall load stops getting faster.
With a USB hub, the 4 transport disks for a node can all be accessible without recabling, and the
four parts can proceed one after another following a script.
Testing Your Database
Before doing serious queries, be sure to run SELECT ANALYZE_STATISTICS('projectionname');
for each projection. The tuple mover will periodically rerun this to keep statistics current.
Now your cluster is up and running. Try the training queries and then some other queries. Check
the size of each table with a count(*) query. Enjoy the speed!
Running Your Database
Over time, new data needs to be added, and eventually old data deleted. Unlike many other
databases, no “reorg” is needed, since Vertica is continuously merging in new data and rewriting
the older data. A process that inserts new rows and deletes old rows typically accesses the
database via JDBC, connected from the intermediate systems that control the external data flow.
This process is called a trickle load, and allows new data to be added even while queries are
actively running.
APPENDIX
Load Data format details
The data delimiter and quote character
Choosing the right column-value delimiter is important. You need to choose a character that
does not show up in any char(n) or varchar(n) data values. The vertical bar, '|', is a good one to
try. You can test for the existence of a certain character c in column x by using the query "select
count(*) from T where x like '%c%'". If a few values are using |, they can be eliminated from the
main load by a where clause and separately loaded using another delimiter. Alternatively, one
could try to quote the delimiter character with \ if the database can do this. Also, \ chars in the
char data will disappear on load into Vertica unless doubled up, and newlines will cause trouble
too.
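Once the data is exported to text, the same check can be run directly on the load files; the file path here is a placeholder for one of your export files:

```shell
# Count lines in an exported load file that already contain the candidate
# delimiter '|' -- if the count is 0, '|' is safe to use for this file.
grep -c '|' /extdata/promotion.dat
```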
Oracle has a REGEXP_REPLACE function that can substitute one substring for another,
although this will slow down the unload operation significantly. It might be practical to use a where
clause to avoid problem rows on the main load, and the opposite where clause with
REGEXP_REPLACE for just the problem rows.
Non-ASCII data
Vertica stores data in the UTF-8 compressed encoding of Unicode. The resulting UTF-8 codes
are identical to ASCII codes for the ASCII characters (codes 0 to 127 in one byte). If your table
data (char columns) is all ASCII, it should be easy to transport, since all current OS environments
treat it the same way. If you have UTF-8 data, it is just a matter of preserving it that way. Make
sure that the extraction method does not convert char column values to the current (non-UTF-8)
locale of the source system. On most UNIX systems, you can see the current locale with the
“locale” command, and change it for a session by setting the LANG environment variable to
en_US.UTF-8. If you have data in another character encoding such as Latin-1 (ISO 8859), it
needs to be converted to UTF-8 if your data actively uses the non-ASCII characters of Latin-1,
such as the euro sign and the diacritical marks of many European languages. The Linux tool
iconv can do the needed conversions. Luckily it is rare to have non-ASCII characters in the fact
table, so these conversions are usually needed only for the smaller dimension tables' data.
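An iconv conversion of a Latin-1 export file looks like this; the file names are placeholders for your own dimension-table exports:

```shell
# Convert a Latin-1 (ISO 8859-1) export file to the UTF-8 encoding
# that Vertica expects, before loading it with COPY.
iconv -f ISO-8859-1 -t UTF-8 customer_latin1.dat > customer_utf8.dat
```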
Recommended