RODBC Package R software


  • 8/9/2019 RODBC Package R software


    ODBC Connectivity

    by Brian Ripley
    Department of Statistics, University of Oxford


    March 9, 2012

    Package RODBC implements ODBC database connectivity. It was originally written by Michael Lapsley (St George's Medical School, University of London) in the early days of R (1999), but after he disappeared in 2002, it was rescued and since much extended by Brian Ripley. Version 1.0-1 was released in January 2003, and RODBC is nowadays a mature and much-used platform for interfacing R to database systems.

    1 ODBC Concepts

    ODBC aims to provide a common API for access to SQL[1]-based database management systems (DBMSs) such as MySQL, PostgreSQL, Microsoft Access and SQL Server, DB2, Oracle and SQLite. It originated on Windows in the early 1990s, but ODBC driver managers unixODBC and iODBC are nowadays available on a wide range of platforms (and a version of iODBC ships with recent versions of Mac OS X). The connection to the particular DBMS needs an ODBC driver: these may come with the DBMS or the ODBC driver manager or be provided separately by the DBMS developers, and there are third-party[2] developers such as Actual Technologies, Easysoft and OpenLink. (This means that for some DBMSs there are several different ODBC drivers available, and they can behave differently.)

    Microsoft provides drivers on Windows for non-SQL database systems such as DBase and FoxPro, and even for flat files and Excel spreadsheets. Actual Technologies sell a driver for Mac OS X that covers (some) Excel spreadsheets and flat files.

    A connection to a specific database is called a Data Source Name or DSN (see http://en.wikipedia.org/wiki/Database_Source_Name). See Appendix B for how to set up DSNs on your system. One of the greatest advantages of ODBC is that it is a cross-platform client-server design, so it is common to run R on a personal computer and access data on a remote server whose OS may not even be known to the end user. This does rely on suitable ODBC drivers being available on the client: they are for the major cross-platform DBMSs, and some vendors provide bridge drivers, so that for example a bridge ODBC driver is run on a Linux client and talks to the Access ODBC driver on a remote Windows machine.

    [1] SQL is a language for querying and managing data in databases: see http://en.wikipedia.org/wiki/SQL.

    [2] but there are close links between unixODBC and Easysoft, and iODBC and OpenLink.

    ODBC provides an abstraction that papers over many of the differences between DBMSs. That abstraction has developed over the years, and RODBC works with ODBC version 3. This number describes both the API (most drivers nowadays work with API 3.51 or 3.52) and capabilities. The latter allow ODBC drivers to implement newer features partially or not at all, so some drivers are much more capable than others: in the main RODBC works with basic features. ODBC is a superset of the ISO/IEC 9075-3:1995 SQL/CLI standard.

    A somewhat biased overview of ODBC on Unix-alikes can be found at http://www.easysoft.com/developer/interfaces/odbc/linux.html .

    2 Basic Usage

    Two groups of functions are provided in RODBC. The mainly internal odbc* commands implement low-level access to C-level ODBC functions with similar[3] names. The sql* functions operate at a higher level to read, save, copy and manipulate data between data frames and SQL tables. The two low-level functions which are commonly used make or break a connection.

    [3] in most cases with prefix SQL replacing odbc.

    2.1 Making a connection

    ODBC works by setting up a connection or channel from the client (here RODBC) to the DBMSs as specified in the DSN. Such connections are normally used throughout a session, but should be closed explicitly at the end of the session; however, RODBC will clear up after you if you forget (with a warning that might not be seen in a GUI environment). There can be many simultaneous connections.

    The simplest way to make a connection is

    library(RODBC)

    ch
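
    As a hedged illustration of the connection life cycle described above, the sketch below opens a channel, applies a caller-supplied function to it and always closes the channel afterwards. The helper name with_channel and the DSN "some dsn" are hypothetical; odbcConnect, odbcClose and odbcGetInfo are RODBC functions.

```r
# Hypothetical helper: open a channel to `dsn`, apply `f` to it, then close it.
with_channel <- function(dsn, f) {
  ch <- RODBC::odbcConnect(dsn)
  # odbcConnect() returns an object of class "RODBC" on success, -1 on failure
  if (!inherits(ch, "RODBC")) stop("could not connect to DSN: ", dsn)
  on.exit(RODBC::odbcClose(ch), add = TRUE)
  f(ch)
}

# Usage (needs RODBC and a configured DSN, so it is not run here):
# with_channel("some dsn", function(ch) RODBC::odbcGetInfo(ch))
```

    Closing through on.exit means the channel is released even if f fails, rather than relying on RODBC to clear up later.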


    sqlTables(ch, tableType = "TABLE")
    sqlTables(ch, schema = "some pattern")
    sqlTables(ch, tableName = "some pattern")

    The details are driver-specific but in most cases some pattern can use wildcards with underscore matching a single character and percent matching zero or more characters. Since underscore is a valid character in a table name it can be handled literally by preceding it by a backslash, but it is rarely necessary to do so.
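
    A minimal sketch of the backslash escaping mentioned above, assuming the driver honours it (the details are driver-specific). The helper name escape_underscore and the table name my_table are hypothetical.

```r
# Hypothetical helper: make underscores literal in a sqlTables() pattern by
# preceding each one with a backslash.
escape_underscore <- function(name) {
  # fixed = TRUE treats both pattern and replacement as literal text
  gsub("_", "\\_", name, fixed = TRUE)
}

pattern <- escape_underscore("my_table")
cat(pattern, "\n")  # prints the pattern as it would be sent to the driver

# Usage (needs an open channel `ch`, so it is not run here):
# RODBC::sqlTables(ch, tableName = pattern)
```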

    A table can be retrieved as a data frame by

    res

    > sqlQuery(ch, paste('SELECT "State", "Murder" FROM "USArrests"',
    +                    'WHERE "Rape" > 30 ORDER BY "Murder"'))

    or even in upper case. Describing how to extract data from databases is the forte of the SQL language, and doing so efficiently is the aim of many of the DBMSs, so this is a very powerful tool. To learn SQL it is best to find a tutorial specific to the dialect you will use; for example Chapter 3 of the MySQL manual is a tutorial. A basic tutorial which covers some common dialects[6] can be found at http://www.1keydata.com/sql/sql.html: tutorials on how to perform common tasks in several commonly used DBMSs are available at http://sqlzoo.net/.

    2.3 Table Names

    SQL-92 expects both table and column names to be alphanumeric plus underscore, and RODBC does not in general support vendor extensions (for example Access allows spaces). There are some system-specific quoting schemes: Access and Excel allow table names to be enclosed in [ ] in SQL queries, MySQL (by default) quotes via backticks, and most other systems use the ANSI SQL standard of double quotes.

    The odbcConnect function allows the specification of the quoting rules for names RODBC itself sends, but sensible defaults[7] are selected. Users do need to be aware of the quoting issue when writing queries for sqlQuery themselves.

    Note the underscore is a wildcard character in table names for some of the functions, and so may need to be escaped (by backslash) at times.

    Normally table names containing a period are interpreted as references to another schema (see below): this can be suppressed by opening the connection with argument interpretDot = FALSE.
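
    Since queries written for sqlQuery must be quoted by the user, a small helper can keep the quoting scheme in one place. This is a sketch only: the names quote_name and build_select are hypothetical, and names that themselves contain a quoting character are not handled.

```r
# Hypothetical helper: wrap a table or column name in the quoting characters
# used by a given system. Embedded quote characters are not handled.
quote_name <- function(name, style = c("ansi", "mysql", "access")) {
  style <- match.arg(style)
  switch(style,
         ansi   = paste0('"', name, '"'),   # ANSI SQL double quotes
         mysql  = paste0("`", name, "`"),   # MySQL backticks (default mode)
         access = paste0("[", name, "]"))   # Access and Excel square brackets
}

# Hypothetical helper: build a SELECT statement to pass to sqlQuery().
build_select <- function(columns, table, style = "ansi") {
  paste("SELECT", paste(quote_name(columns, style), collapse = ", "),
        "FROM", quote_name(table, style))
}

query <- build_select(c("State", "Murder"), "USArrests", style = "mysql")

# Usage (needs an open channel `ch`, so it is not run here):
# RODBC::sqlQuery(ch, query)
```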

    2.4 Types of table

    The details are somewhat DBMS-specific, but "tables" usually means tables, views or similar objects.

    In some systems tables are physical objects (files) that actually store data: Mimer calls these base tables. For these other tables can be derived that present information to the user, usually called views. The principal distinctions between a (base) table and a view are

    • Using DROP on a table removes the data, whereas using it on a view merely removes the convenient access to a representation of the data.

    • The access permission (privilege) of a view can be very different from those of a table: this is commonly used to hide sensitive information.

    [6] MySQL, Oracle and SQL Server.

    [7] backticks for MySQL, [ ] for the Access and Excel convenience wrappers, otherwise ANSI double quotes.

    A view can contain a subset of the information available in a single table or combine information from two or more tables.

    Further, some DBMSs distinguish between tables and views generated by ordinary users and system tables used by the DBMS itself. Where present, this distinction is reflected in the result of sqlTables() calls.

    Some DBMSs support synonyms and/or aliases which are simply alternative names for an existing table/view/synonym, often those in other schemas (see below).

    Typically tables, views, synonyms and aliases share a name space and so must have a name that is unique (in the enclosing schema where schemas are implemented).

    3 Writing to a Database

    To create or update a table in a database some more details need to be considered. For some systems, all table and column names need to be lower case (e.g. PostgreSQL, MySQL on Windows) or upper case (e.g. some versions of Oracle). To make this a little easier, the odbcConnect function allows a remapping of table names to be specified, and this happens by default for DBMSs where remapping is known to be needed.

    The main tool to create a table is sqlSave. It is safest to use this after having removed any existing table of the same name, which can be done by

    sqlDrop(ch, "table name", errors = FALSE)

    Then in the simplest usage

    sqlSave(ch, some data frame)

    creates a new table whose name is the name of the data frame (remapped to upper or lower case as needed) and with first column rownames the row names of the data frame, and remaining columns the columns of the data frame (with names remapped as necessary). For the many options, see the help page.

    sqlSave works well when asked to write integer, numeric and reasonable-length[8] character strings to the database. It needs some help with other types of columns in mapping to the DBMS-specific types of column. For some drivers it can do a good job with date and date-time columns; in others it needs some hints (and e.g. for Oracle dates are stored as date-times). The files in the RODBC/tests directory in the sources and the installed file tests.R provide some examples. One of the options is the fast argument: the default is fast=TRUE which transfers data in binary format: the alternative is fast=FALSE which transfers data as character strings a row at a time. This is slower but can work better with some drivers (and worse with others).

    [8] which of course depends on the DBMS. Almost all have an implementation of varchar that allows up to 255 bytes or characters, and some have much larger limits. Calling sqlTypeInfo will tell you about the data type limits.

    The other main tool for writing is sqlUpdate which is used to change rows in an existing table. Note that RODBC only does this in a simple fashion, and on up-market DBMSs it may be better to set cursors and use direct SQL queries, or at least to control transactions by calls to odbcSetAutoCommit and odbcEndTran. The basic operation of sqlUpdate is to take a data frame with the same column names (up to remapping) as some or all of the columns of an existing table: the values in the data frame are then used either to replace entries or to create new rows in the table.

    Rows in a DBMS table are in principle unordered and so cannot be referred to by number: the sometimes tricky question is to know what rows are to be replaced. We can help the process by giving one or more index columns whose values must match: for a data frame the row names are often a good choice. If no index argument is supplied, a suitable set of columns is chosen based on the properties of the table.

    3.1 Primary keys and indices

    When a table is created (or afterwards) it can be given additional information to enable it to be used effectively or efficiently.

    Primary keys are one (usually) or more columns that provide a reliable way to reference rows in the table: values of the primary key must be unique and not NULL (SQL parlance for missing). Primary keys in one table are also used as foreign keys in another table: this ensures that e.g. values of customer_id only take values which are included in the primary key column of that name in table customers. Support of foreign keys is patchy: some DBMSs (e.g. MySQL prior to 6.0) accept specifications but ignore them.

    RODBC allows primary keys to be set as part of the sqlSave() function when it creates a table: otherwise they can be set by sqlQuery() in DBMS-specific ways (usually by ALTER TABLE).


    Columns in a table can be declared as UNIQUE: primary keys and such columns are usually used as the basis for table indices, but other indices (sometimes called secondary indices) can be declared by a CREATE INDEX SQL command. Whether adding primary keys or other indices has any effect on performance depends on the DBMS and the query.
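
    The writing workflow of this section can be sketched as two small wrappers. This is an illustration under assumptions rather than a recommended recipe: the helper names save_fresh and update_in_transaction are hypothetical, the sketch assumes that sqlUpdate signals an R error on failure, and whether a rollback is honoured depends on the DBMS and driver.

```r
# Hypothetical helper: replace any existing table with the contents of `df`.
save_fresh <- function(ch, df, table) {
  RODBC::sqlDrop(ch, table, errors = FALSE)   # ignore a missing table
  RODBC::sqlSave(ch, df, tablename = table, rownames = FALSE)
}

# Hypothetical helper: run sqlUpdate() inside one transaction and roll back
# if it signals an error. `index` names the column(s) whose values must match.
update_in_transaction <- function(ch, df, table, index) {
  RODBC::odbcSetAutoCommit(ch, autoCommit = FALSE)
  on.exit(RODBC::odbcSetAutoCommit(ch, autoCommit = TRUE), add = TRUE)
  tryCatch({
    RODBC::sqlUpdate(ch, df, tablename = table, index = index)
    RODBC::odbcEndTran(ch, commit = TRUE)     # commit the changes
  }, error = function(e) {
    RODBC::odbcEndTran(ch, commit = FALSE)    # roll back
    stop(e)
  })
  invisible(TRUE)
}

# Usage (needs an open channel `ch`, so it is not run here):
# save_fresh(ch, USArrests, "USArrests")
# update_in_transaction(ch, changed_rows, "USArrests", index = "State")
```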

    4 Data types

    This can be confusing: R has data types (including character, double, integer and various classes including Date and POSIXct), ODBC has both C and SQL data types, the SQL standards have data types and so do the various DBMSs, and they all have different names and different usages of the same names.

    Double- and single-precision numeric values and 32- and 16-bit integers (only) are transferred as binary values, and all other types as character strings. However, unless as.is=TRUE, sqlGetResults (used by all the higher-level functions to return a data frame) converts character data to a date/date-time class or via type.convert.

    You can find out the DBMS names for the data types used in the columns of a table by a call to sqlColumns, and further information is given on those types in the result of sqlTypeInfo. For example in MySQL,

      TABLE_CAT TABLE_SCHEM TABLE_NAME COLUMN_NAME DATA_TYPE TYPE_NAME COLUMN_SIZE
    1    ripley              USArrests       State        12   varchar         255
    2    ripley              USArrests      Murder         8    double          15
    3    ripley              USArrests     Assault         4   integer          10
    4    ripley              USArrests    UrbanPop         4   integer          10
    5    ripley              USArrests        Rape         8    double          15

      BUFFER_LENGTH DECIMAL_DIGITS NUM_PREC_RADIX NULLABLE REMARKS COLUMN_DEF
    1           255             NA             NA        0
    2             8             NA             NA        1
    3             4              0             10        1
    4             4              0             10        1
    5             8             NA             NA        1

      SQL_DATA_TYPE SQL_DATETIME_SUB CHAR_OCTET_LENGTH ORDINAL_POSITION IS_NULLABLE
    1            12               NA               255                1          NO
    2             8               NA                NA                2         YES
    3             4               NA                NA                3         YES
    4             4               NA                NA                4         YES
    5             8               NA                NA                5         YES

    This gives the DBMS data by name and by number (twice, once the number used in the DBMS and once that used by SQL: they agree here). Other things of interest here are the column size, which gives the maximum size of the character representation, and the two columns about nullable which indicate if the column is allowed to contain missing values (SQL NULLs).

    The result of sqlTypeInfo has 19 columns and in the version of MySQL used here, 52 types. We show a small subset of the more common types:

    > sqlTypeInfo(channel)[, c(1:3,7,16)]
        TYPE_NAME DATA_TYPE COLUMN_SIZE NULLABLE SQL_DATATYPE
    1         bit        -7           1        1           -7
    2     tinyint        -6           3        1           -6
    6      bigint        -5          19        1           -5
    18       text        -1       65535        1           -1
    19 mediumtext        -1    16777215        1           -1
    20   longtext        -1  2147483647        1           -1
    22       char         1         255        1            1
    23    numeric         2          19        1            2
    24    decimal         3          19        1            3
    25    integer         4          10        1            4
    37   smallint         5           5        1            5
    41     double         6          15        1            6
    43      float         7           7        1            7
    45     double         8          15        1            8
    47       date        91          10        1            9
    48       time        92           8        1            9
    49       year         5           4        1            5
    50   datetime        93          21        1            9
    51  timestamp        93          14        0            9
    52    varchar        12         255        1           12

    Note that there are both duplicate type names and duplicate type numbers.

    Most DBMSs started with their own data types and later mapped the standard SQL data types on to them, although these may only be partially implemented. Some DBMSs allow user-defined data types, for example enumerations.

    Commonly used data types fall into a number of groups:

    Character types: Character types can be classified three ways: fixed or variable length, by the maximum size and by the character set used. The most commonly used types[9] are varchar for short strings of variable length (up to some maximum) and char for short strings of fixed length (usually right-padded with spaces). The value of "short" differs by DBMS and is at least 254, often a few thousand; often other types will be available for longer character strings. There is a sanity check which will allow only strings of up to 65535 bytes when reading: this can be removed by recompiling RODBC.

    Many other DBMSs have separate types to hold Unicode character strings, often with names like nvarchar or wvarchar. Note that currently RODBC only uses the current locale for character data, which could be UTF-8 (and will be on Mac OS X and in many cases on Linux and other Unix-alikes), but is never UCS-2 as used on Windows. So if character data is stored in the database in Unicode, it will be translated (with a possible loss of information) in non-Unicode locales. (This may change in future versions of RODBC.)

    Some DBMSs such as PostgreSQL and SQL Server allow variable-length character strings of length only limited by resources. These do not fit well with the ODBC model that requires buffers to be allocated to transfer character data, and so such types may be subjected (by the ODBC driver) to a fixed limit or not work at all.

    [9] the SQL names for these are CHARACTER VARYING and CHARACTER, but these are too cumbersome for routine use.

    Integer types: Most DBMSs have types for 32-bit (integer, synonym int) and 16-bit (smallint) integers. Some, including MySQL, also have unsigned versions and 1-bit, 8-bit and 64-bit integer types: these further types would usually be transferred as character strings and converted on reading to an integer or double vector.

    Type names int2, int4 and int8 are common as synonyms for the basic type names.

    The SQL standard does not require integer and smallint to be binary (rather than decimal) types, but they almost always are binary.

    Note that 64-bit integers will be transferred as character strings and read by sqlGetResults as character vectors or (for $2^{31} \le |x| < 2^{53}$) as double vectors.

    Floating-point types: The basic SQL floating-point types are 8 and 7 for double- and single-precision binary types. The SQL names are double precision and real, but beware of the variety of names. Type 6 is float in the standard, but is used by some DBMSs[10] for single-precision and by some for double-precision: the forms float(24) and float(53) are also commonly supported.

    You should not assume that these types can store Inf, -Inf or NaN, but they often can.

    Other numeric types: It is common to store decimal quantities in databases (e.g. currency amounts) and types 2 and 3 are for decimals. Some DBMSs have specialized types to handle currencies, e.g. money in SQL Server.

    Decimal types have a precision (the maximum number of significant decimal digits) and scale (the position of the decimal point). numeric and decimal are usually synonymous, but the distinction in the standards is that for numeric the precision is exact whereas for decimal the DBMS can use a larger value than that specified.

    Some DBMSs have a type integer(p) to represent up to p decimal digits, and this may or may not be distinct from decimal(p, 0).

    DBMSs do not necessarily fully implement decimal types, e.g. MySQL currently stores them in binary and used to store them as character strings.

    [10] In Oracle the FLOAT type is a decimal and not a binary type.

    Dates and times: The handling of dates and times is very much specific to the DBMS. Some allow fractional seconds in date-times, and some do not; some store timezones with date-times or always use UTC and some do not, and so on. Usually there are also types for time intervals.

    All such types are transferred as character strings in RODBC.

    Binary types: These are less common, and unsupported by RODBC prior to version 1.3-0. They parallel character types in that they are a sequence of bytes of fixed or variable length, sometimes with additional types for long sequences: there are separate ODBC types for SQL_BINARY, SQL_VARBINARY and SQL_LONGVARBINARY.

    Binary types can currently only be read as such, and they are returned as a column of class "ODBC_binary" which is a list of raw vectors.

    It is possible (but rare) for the DBMS to support data types that the ODBC driver cannot handle. Most DBMSs have binary data types which have no corresponding R data type (raw corresponds to a single byte, not a fixed or variable length set of bytes): these are not currently covered by RODBC.
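
    The remark above about 64-bit integers can be made concrete with a small post-processing step for a column that has arrived as character strings (for example when read with as.is = TRUE). This is a sketch only: the helper name as_exact_double is hypothetical, and it converts to double only when every value lies below 2^53 in magnitude, where doubles represent integers exactly.

```r
# Hypothetical helper: convert a character vector of integers to double only
# when every non-missing value parses and satisfies |x| < 2^53.
as_exact_double <- function(x) {
  num <- suppressWarnings(as.numeric(x))
  ok <- is.na(x) | (!is.na(num) & abs(num) < 2^53)
  if (all(ok)) num else x   # otherwise keep the strings untouched
}

small <- as_exact_double(c("4294967296", NA, "-9007199254740991"))
large <- as_exact_double(c("1", "9223372036854775807"))
```

    Here small becomes a double vector, while large stays character because its second element exceeds the exactly representable range.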

    4.1 Data types when saving a data frame

    When sqlSave creates a table, there is some choice as to the SQL data types used.

    The default is to select the SQL data type from the R type via the typeInfo argument to sqlSave. If this is not supplied (usual) a default mapping is looked up using getSqlTypeInfo() or by interrogating sqlTypeInfo(). This will almost always produce the correct mapping for numeric, integer and character columns of up to 254 characters (or bytes). In other cases (including dates and date-times) the desired SQL type can be specified for each column via the argument varTypes, a named character vector with names corresponding to (some of) the names in the data frame to be saved.

    Only a very few DBMSs have a logical data type and the default mapping is to store R logical vectors as varchar(5). For other DBMSs BIT, TINYINT or an enumeration type could be used (but the column may need to be converted to and from a suitable representation). For example, in MySQL we could use enum('FALSE', 'TRUE'), but this is actually stored as char(5). Note that to represent NA the SQL data type chosen needs to be nullable, which BIT often is not. (Mimer has a nullable data type BOOLEAN but this is not supported by the ODBC client.)
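
    A minimal sketch of building a varTypes vector for the columns that the default mapping is less likely to handle. The helper name suggest_var_types and the table name are hypothetical, and the SQL type names on the right-hand side are assumptions that must be adapted to the DBMS in use (sqlTypeInfo lists what is available).

```r
# Hypothetical helper: suggest SQL types for date, date-time and logical
# columns of a data frame. The SQL type names are assumptions, not defaults
# taken from RODBC.
suggest_var_types <- function(df) {
  pick <- function(col) {
    if (inherits(col, "Date")) "date"
    else if (inherits(col, "POSIXct")) "timestamp"
    else if (is.logical(col)) "varchar(5)"
    else NA_character_
  }
  types <- vapply(df, pick, character(1))
  types[!is.na(types)]      # keep only the columns that need a hint
}

df <- data.frame(id = 1:2,
                 day = as.Date(c("2012-03-09", "2012-03-10")),
                 flag = c(TRUE, NA))
var_types <- suggest_var_types(df)

# Usage (needs an open channel `ch`, so it is not run here):
# RODBC::sqlSave(ch, df, tablename = "example", varTypes = var_types)
```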

    4.2 SQLite

    SQLite's concept of data type is anomalous: version 3 does recognize types of data (in version 2 everything was a character string), but it does not have a fixed type for a column in a table (although the type specified in the CREATE TABLE statement is a recommended type for the values of that column). Every value is categorized as null, integer (of length 1, 2, 3, 4, 6 or 8 bytes), double, text (UTF-8 or UTF-16) or BLOB (a sequence of bytes). This does not fit well with the ODBC interface which pre-determines a type for each column before reading or writing it: the SQLite ODBC driver falls back to a SQL_VARCHAR or SQL_LONGVARCHAR type if the column type is not available.

    4.3 ODBC data types

    ODBC defines two sets of data types: SQL data types and C data types. SQL data types indicate the data types of data stored at the data source using standard names. C data types indicate the data types used in the compiled code in the application (here RODBC) when transferring data and are the same for all drivers.

    The ODBC SQL data types are abstractions of the data types discussed above with names like SQL_INTEGER. They include SQL_LONGVARCHAR for large character types and SQL_WVARCHAR for Unicode character types. It is usually these types that are returned (by number) in the SQL_DATA_TYPE column of the result of sqlColumns and SQL_DATATYPE column of the result of sqlTypeInfo. The mapping from names to numbers is given in table 1.

    The only ODBC C data types currently used by RODBC are SQL_C_DOUBLE, SQL_C_SLONG (32-bit signed integers) and SQL_C_CHAR for reading and writing, and SQL_C_FLOAT (single-precision), SQL_C_SSHORT (16-bit signed integers) and SQL_C_BINARY for reading from the database.

    http://msdn.microsoft.com/en-us/library/ms713607%28VS.85%29.aspx is the definitive source of information about ODBC data types.


    SQL_CHAR              1    SQL_LONGVARCHAR      -1
    SQL_NUMERIC           2    SQL_BINARY           -2
    SQL_DECIMAL           3    SQL_VARBINARY        -3
    SQL_INTEGER           4    SQL_LONGVARBINARY    -4
    SQL_SMALLINT          5    SQL_BIGINT           -5
    SQL_FLOAT             6    SQL_TINYINT          -6
    SQL_REAL              7    SQL_BIT              -7
    SQL_DOUBLE            8    SQL_WCHAR            -8
    SQL_DATETIME          9    SQL_WVARCHAR         -9
    SQL_INTERVAL         10    SQL_WLONGVARCHAR    -10
    SQL_TIMESTAMP        11    SQL_GUID            -11
    SQL_VARCHAR          12
    SQL_TYPE_DATE        91
    SQL_TYPE_TIME        92
    SQL_TYPE_TIMESTAMP   93

    Table 1: Mapping between ODBC SQL data type names and numbers. (GUIDs are 16-byte numbers, Microsoft's implementation of UUIDs.)
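
    For convenience when reading the numeric DATA_TYPE columns returned by sqlColumns or sqlTypeInfo, Table 1 can be transcribed into a lookup vector. The object name odbc_sql_types and the helper odbc_type_name are hypothetical; the numbers and names are those of Table 1.

```r
# ODBC SQL data type numbers and names, transcribed from Table 1.
odbc_sql_types <- c(
  "1" = "SQL_CHAR", "2" = "SQL_NUMERIC", "3" = "SQL_DECIMAL",
  "4" = "SQL_INTEGER", "5" = "SQL_SMALLINT", "6" = "SQL_FLOAT",
  "7" = "SQL_REAL", "8" = "SQL_DOUBLE", "9" = "SQL_DATETIME",
  "10" = "SQL_INTERVAL", "11" = "SQL_TIMESTAMP", "12" = "SQL_VARCHAR",
  "91" = "SQL_TYPE_DATE", "92" = "SQL_TYPE_TIME", "93" = "SQL_TYPE_TIMESTAMP",
  "-1" = "SQL_LONGVARCHAR", "-2" = "SQL_BINARY", "-3" = "SQL_VARBINARY",
  "-4" = "SQL_LONGVARBINARY", "-5" = "SQL_BIGINT", "-6" = "SQL_TINYINT",
  "-7" = "SQL_BIT", "-8" = "SQL_WCHAR", "-9" = "SQL_WVARCHAR",
  "-10" = "SQL_WLONGVARCHAR", "-11" = "SQL_GUID"
)

# Hypothetical helper: translate type numbers into ODBC SQL type names.
odbc_type_name <- function(number) {
  unname(odbc_sql_types[as.character(number)])
}

type_names <- odbc_type_name(c(12, 8, 4, -5))

# Usage (needs an open channel `ch`, so it is not run here):
# odbc_type_name(RODBC::sqlColumns(ch, "USArrests")$DATA_TYPE)
```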

    5 Schemas and Catalogs

    This is a more technical section: few users will need to deal with these concepts.

    Schemas[11] are collections of objects (such as tables and views) within a database that are supported by some DBMSs: often a separate schema is associated with each user (and schema in ODBC 3 replaced owner in ODBC 2). In SQL-92, schemas are collected in a catalog which is often implemented as a database. Where schemas are implemented, there is a current schema used to find unqualified table names, and tables in other schemas can be referred to within SQL queries using the schema.table notation. You can think of a schema as analogous to a name space; it allows related objects to be grouped together without worrying about name clashes with other groups. (Some DBMSs will search for unqualified table names in a search path: see the detailed descriptions below.)

    Note that schema is used in another sense in the database literature, for the design of a database and in particular of tables, views and privileges.

    Here are some details of various DBMSs' interpretations of catalog and schema current at the time of writing (mid 2009). (These descriptions are simplistic, and in some cases experimental observations.)

    • SQLite uses dotted names for alternative databases that are attached by an ATTACH DATABASE command.[12] There is a search path of databases, so it is only necessary to use the dotted name notation when there are tables of the same name on attached databases. The initial database is known as main and that used for temporary tables as temp.

    • MySQL uses catalog to refer to a database. In MySQL's parlance, schema is a little-used synonym for database.

    • PostgreSQL only allows a session to access one database, and does not use catalog except to refer to the current database. Version 7.3 introduced schemas: users can create their own schemas with a CREATE SCHEMA query. Tables are by default in the public schema, and unqualified table names are searched for along a search path of schemas (by default, containing public).

    • Oracle uses schemas as synonymous with owner (also known as user). There is no way for a user to create additional schemas (that is not what CREATE SCHEMA does in Oracle).

    • IBM DB2 uses schemas as name spaces for objects that may lie on different databases: using aliases allows objects to be in more than one schema. The initial current schema is named the same as the user (SQLID in DB2 parlance), but users can create additional schemas with CREATE SCHEMA statements.

    • Microsoft SQL Server 2008 uses both catalog and schema, catalog for the database and schema for the type of object, e.g. "sys" for most of the system tables/views and (default) "dbo" for user tables. Further schemas can be created by users. The default schema for a user can be set when the user is created and changed via ALTER USER.

      Prior to SQL Server 2005, schema meant user, and the search path for unqualified names was the database user then "dbo".

    • The Microsoft Excel and Access ODBC drivers do not use schemas, but do use catalog to refer to other database/spreadsheet files.

    • Mimer (www.mimer.com) uses schemas which are normally the same as users (which it calls IDENTs), but users can create additional schemas with CREATE SCHEMA statements. There are also system schemas. Mimer uses schemata as the plural of schema.

    It is often possible to use sqlTables to list the available catalogs or schemas: see its help page for the driver-specific details.

    [11] which is the usual plural in this technical usage, although schemata is more usual in English.

    [12] and may be subsequently detached by a DETACH DATABASE command


    RODBC usually works with tables in the current schema, but unless the connection was opened with interpretDot = FALSE most functions will attempt to interpret the dotted name notation. The interpretation depends on the DBMS: the SQL-92 meaning is schema.table and this is accepted by PostgreSQL, Microsoft SQL Server, Oracle, DB2 and Mimer. However, MySQL uses database.table, and the functions try[13] that interpretation if they recognize a MySQL driver. Some DBMSs allow more than two components, but these are not currently supported by the RODBC functions.

    Functions sqlTables, sqlColumns and sqlPrimaryKeys have arguments catalog and schema which in principle allow tables in other schemas to be listed or examined: however these are only partially implemented in many current ODBC drivers. See the help page for sqlTables for some further details.

    For other uses, the trick is to select the schema(s) you want to use, which is done via an SQL statement sent by sqlQuery. For Oracle you can set the default schema (owner) by

    ALTER SESSION SET CURRENT_SCHEMA = schema

    whereas for PostgreSQL the search path can be changed via

    SET search_path TO schema1, schema2

    In DB2, creating an alias in the current schema can be used to access tables in other schemas, and a CURRENT SCHEMA query can be used to change the current schema. In MySQL and SQL Server a database can be selected by a USE database query.
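
    The statements quoted above can be assembled by a small helper before being sent with sqlQuery. This is a sketch: the helper name schema_statement and the schema names are hypothetical, names are inserted unquoted, and DB2 is omitted because its exact statement is not given in the text.

```r
# Hypothetical helper: build the statement that selects a schema (or, for
# MySQL and SQL Server, a database), following the statements quoted above.
schema_statement <- function(dbms = c("oracle", "postgresql", "mysql", "sqlserver"),
                             schemas) {
  dbms <- match.arg(dbms)
  switch(dbms,
         oracle     = paste("ALTER SESSION SET CURRENT_SCHEMA =", schemas[1]),
         postgresql = paste("SET search_path TO", paste(schemas, collapse = ", ")),
         mysql      = ,   # falls through to the SQL Server form
         sqlserver  = paste("USE", schemas[1]))
}

pg <- schema_statement("postgresql", c("schema1", "schema2"))
ora <- schema_statement("oracle", "schema1")

# Usage (needs an open channel `ch`, so it is not run here):
# RODBC::sqlQuery(ch, pg)
```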

    6 Internationalization Issues

    Internationalization issues are made more complex by ODBC being a client-server system, and the ODBC client (RODBC) and the server may be running on different machines with different OSes on different continents. So the client may need some help.

    In most cases numeric data are transferred to and from R in binary form, so the representation of the decimal point is not an issue. But in some cases it could be (e.g. decimal rather than binary SQL data types will be transferred as character strings) and then the decimal point to be used will be taken from options("dec"): if unset this is set when RODBC is loaded from the setting of the current locale on the machine running R (via Sys.localeconv). Some ODBC drivers (e.g. for SQL Server, Oracle) allow the locale (NLS) to be used for numeric values to be selected for the connection.

    The other internationalization issue is the character encoding used. When R and the DBMS are running on the same machine this is unlikely to be an issue, and in many cases the ODBC driver has some options to translate character sets. SQL is an ANSI (US) standard, and DBMSs tended to assume that character data was ASCII or perhaps 8-bit. More recently DBMSs have started (optionally or by default) to store data in Unicode, which unfortunately means UCS-2 on Windows and UTF-8 elsewhere. So cross-OS solutions are not guaranteed to work, but most do.

    [13] currently this is stymied by bugs in the ODBC driver, so SQLColumns is unable to report on tables in specified databases.

    Encoding issues are best resolved in the ODBC driver or in DBMS settings. In the unusual case that this cannot be done, the DBMSencoding argument to odbcDriverConnect allows for recoding when sending data to or from the ODBC driver and thence the DBMS.
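
    A hedged sketch of the two client-side settings mentioned in this section. The helper name connect_latin1, the connection string and the choice of "latin1" are assumptions for illustration; DBMSencoding and options("dec") are the mechanisms named in the text.

```r
# Hypothetical helper: open a connection to a DBMS assumed to store character
# data in Latin-1, so that RODBC recodes to and from the current locale.
connect_latin1 <- function(connection_string) {
  RODBC::odbcDriverConnect(connection_string, DBMSencoding = "latin1")
}

# The decimal point used for numbers transferred as character strings is
# taken from options("dec"); it can be set explicitly before reading.
old <- options(dec = ".")

# Usage (needs RODBC and a configured DSN, so it is not run here):
# ch <- connect_latin1("DSN=some dsn")
```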

    7 Excel Drivers

    The Microsoft Excel ODBC drivers (Windows only) have a number of peculiarities which mean that they should be used with care.

    It seems that the driver's concept of a table is principally a named range. It treats worksheets as system tables, and appends a dollar to their name (making them non-standard SQL table names: the quoting convention used is to enclose such names in square brackets).

    Column names are taken as the first row of the named range/worksheet. Non-standard SQL names are allowed here too, but the driver maps . to # in column names. Annoyingly, sqlTables is allowed to select named ranges only by tableType = "TABLE" but not to select only worksheets.
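The naming conventions just described can be sketched with two small helpers. These are illustrative functions, not part of RODBC; the sheet name "Sheet1" and the file name "book.xls" are placeholders.

```r
# Refer to a worksheet the way the Excel driver names it: a "$" is
# appended, and the non-standard name is enclosed in square brackets.
excel_sheet_ref <- function(sheet) paste0("[", sheet, "$]")

# Predict the column name the driver reports: "." is mapped to "#".
excel_col_name <- function(name) gsub(".", "#", name, fixed = TRUE)

query <- paste("SELECT * FROM", excel_sheet_ref("Sheet1"))
cols  <- excel_col_name(c("Sepal.Length", "Species"))

# Hypothetical use (Windows only, needs the Excel ODBC driver; not run):
# channel <- odbcConnectExcel("book.xls")
# dat <- sqlQuery(channel, query)
```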

    There are at least two known problems with reading columns that do not have a format set before data entry, and so start with format 'General'. First, the driver uses the first few rows to determine the column type, and is over-fond of declaring 'Numeric' even when there are non-numeric entries. The default number of rows consulted is 8, but attempts to change this in the DSN setup are ignored. Second, if a column is declared as 'Text', numeric entries will be read as SQL nulls and hence R NAs. Unfortunately, in neither case does reformatting the column help.

    The connection is by default read-only. It is possible to de-select this in the DSN (and the convenience wrapper odbcConnectExcel has a readOnly = FALSE argument to do so), but this does not support deletion, including SQL DROP, DELETE, UPDATE and ALTER statements. In particular, sqlDrop will remove the data in a worksheet but not the worksheet itself. The driver does allow a worksheet to be updated by sqlUpdate, and for a new worksheet (with a different name from existing worksheets) to be created by sqlSave (which also creates a named range).

    As far as we know, no similar issues affect the Actual Technologies Mac OS X Excel driver: however, it allows only read-only access to Excel files and does not support Excel 2007/2008 .xlsx files.

    8 DBMS-specific tidbits

    This section covers some useful DBMS-specific SQL commands and other useful details.

    Recent versions of several DBMSs have a schema INFORMATION_SCHEMA that holds many predefined system views. These include MySQL (the name of a database, mainly populated beginning with MySQL 5.1), SQL Server and Mimer.

    MySQL

    We have already mentioned USE database as the way to change the database in use. SHOW DATABASES lists the databases for which you have some kind of privilege, and can have a LIKE clause to restrict the result to some pattern of database names.

    The DESCRIBE table command is a compact way to get a description of a table or view, similar to the most useful parts of the result of a call to sqlColumns. (It is also known as SHOW COLUMNS FROM table.)

    SHOW TABLES is the command to produce a table of the tables/views on the current database, similar to sqlTables. For example,

        > sqlQuery(channel, "USE ripley")
        [1] "No Data"
        > sqlQuery(channel, "SHOW TABLES")
          Tables_in_ripley
        1        USArrests
        > sqlQuery(channel, "DESCRIBE USArrests")
             Field         Type Null Key Default Extra
        1    State varchar(255)   NO PRI      NA    NA
        2   Murder       double  YES          NA    NA
        3  Assault      int(11)  YES          NA    NA
        4 UrbanPop      int(11)  YES          NA    NA
        5     Rape       double  YES          NA    NA


    SHOW FULL TABLES gives an additional column Table_type, the types of the tables/views.

    There is useful information for end users in the INFORMATION_SCHEMA database, much more extensively as from MySQL 5.1.

    Some of the non-standard behaviour can be turned off, e.g. starting MySQL with --sql-mode=ANSI gives closer conformance to the standard, and this can be set for a single session by

        SET SESSION sql_mode='ANSI'

    To change just the behaviour of quotes (to use double quotes in place of backticks) replace ANSI by ANSI_QUOTES.

    The maximum size of a char column is 255 characters. That of a varchar column is up to 65535 characters (but there is a limit of 65535 bytes on the total size of a row), and those with a maximum of 255 or less are stored more efficiently. Types text, mediumtext and longtext can hold more, and are not subject to the row-size limit (text has default maximum size 65535, the default RODBC limit on transfers).

    There are binary, varbinary and blob types which are very similar to their character counterparts but with lengths in bytes.

    PostgreSQL

    Table pg_tables lists all tables in all schemas; you probably want to filter on tableowner='current_user' (substituting your own user name), e.g.

        > sqlQuery(channel, "select * from pg_tables where tableowner='ripley'")
          schemaname tablename tableowner tablespace hasindexes hasrules hastriggers
        1     public     dtest     ripley         NA          0        0           0

    There are both ANSI and Unicode versions of the ODBC driver on Windows: they provide many customizations. One of these is read-only access, another is whether system tables are reported by sqlTables.

    The default size of a varchar column is unlimited, but those with maximum length of 126 bytes or less are stored more efficiently. However, the ODBC interface has limits, which can be set in the configuration options. These include the maximum sizes for varchar (default 254) and longvarchar (default 8190), and how to handle unknown column sizes (default as the maximum), and whether 'Text' is taken as varchar or longvarchar (which affects the reported maximum size for a varchar column).

    There is a single binary data type, bytea.


    Oracle's character data types are CHAR, VARCHAR2 (character set specified when the database was created) and NCHAR, NVARCHAR2 (Unicode), as well as CLOB and NCLOB for large character strings. For the non-Unicode types the units of length are either bytes or characters (set as a default for the database) but can be overridden by adding a BYTE or CHAR qualifier. The limits are 4000 bytes apart from for CLOB and NCLOB, which have very high limits.

    There are RAW and BLOB data types.

    DB2

    Schema syscat contains many views with information about tables: for example view syscat.tables lists all tables, and

        > sqlQuery(channel, "select * from syscat.columns where tabname='USArrests'")
          TABSCHEMA   TABNAME  COLNAME COLNO TYPESCHEMA TYPENAME LENGTH SCALE
        1    RIPLEY USArrests    State     0     SYSIBM  VARCHAR    255     0
        2    RIPLEY USArrests   Murder     1     SYSIBM   DOUBLE      8     0
        3    RIPLEY USArrests  Assault     2     SYSIBM  INTEGER      4     0
        4    RIPLEY USArrests UrbanPop     3     SYSIBM  INTEGER      4     0
        5    RIPLEY USArrests     Rape     4     SYSIBM   DOUBLE      8     0
        ...

    The CHAR type can have size up to 254 bytes: the maximum size of the VARCHAR type is 32762 bytes. For larger character strings there is the CLOB type (up to 2GB). These types can be used to store data in a MBCS, including various Unicode encodings.

    There are corresponding BINARY, VARBINARY and BLOB data types.

    SQL Server

    There are several hundred views in schemas INFORMATION_SCHEMA and sys which will be listed by sqlTables and also by the stored procedure sp_tables. Another way to list tables is

        SELECT * FROM sysobjects WHERE xtype='U'

    where the condition restricts to user tables.

    USE database changes the database in use.

    Types char and varchar have a maximum specified size of 8000 bytes. It is possible to use varchar(max) (previously known as text) for a limit of 2GB, but this may not work well with the ODBC interface. The Unicode types nchar and nvarchar have a maximum specified size of 4000 characters: again there is nvarchar(max) (formerly ntext).

    There are corresponding binary and varbinary data types (with image as an earlier name for varbinary(max)).

    Mimer

    There are tens of views in schema INFORMATION_SCHEMA which can be read by SQL SELECT queries of the form

        SELECT column-list
        FROM INFORMATION_SCHEMA.view-name
        WHERE condition

    See the Mimer SQL Reference Manual chapter on Data Dictionary views for full details: two views are TABLES and VIEWS.

    A session can be set to be read-only by the SQL command SET SESSION READ ONLY.

    Mimer uses Latin-1 for its default character types but Unicode types (NCHAR and NVARCHAR) are also available. Unsurprisingly given that the company is Swedish, different collations are allowed for both Latin-1 and Unicode character types.

    The char and varchar columns have a maximum size of 15000 bytes: the clob data type is available for larger character columns. The nchar and nvarchar columns have a maximum size of 5000 characters: the nclob data type is available for larger Unicode columns.

    There are corresponding binary, varbinary and blob binary data types.
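The table-listing queries mentioned in this section can be gathered into a small lookup, sketched below. The helper name and the lookup vector are hypothetical, and the Mimer entry is one assumed instance of the SELECT template above (all columns of the TABLES view); the other queries are taken from the text.

```r
# DBMS-specific queries that list tables, as quoted in this section.
table_listing_query <- c(
  MySQL        = "SHOW TABLES",
  PostgreSQL   = "select * from pg_tables",
  DB2          = "select * from syscat.tables",
  "SQL Server" = "SELECT * FROM sysobjects WHERE xtype='U'",
  Mimer        = "SELECT * FROM INFORMATION_SCHEMA.TABLES"
)

# Return the listing query for a DBMS, failing loudly for unknown names.
list_tables_sql <- function(dbms) {
  if (!dbms %in% names(table_listing_query))
    stop("no listing query recorded for ", dbms)
  table_listing_query[[dbms]]
}

# Not run (needs an open RODBC channel):
# sqlQuery(channel, list_tables_sql("MySQL"))
```

In most cases sqlTables is the portable choice; a lookup like this is only useful when the DBMS-specific view carries extra information.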


    A Installation

    RODBC is simple to install, and binary distributions are available for Mac OS X and Windows from CRAN.

    To install from the sources, an ODBC Driver Manager is required. Windows normally comes with one (it is part of MDAC and can be installed separately if required). Mac OS X since 10.2 has shipped with iODBC, which is also available for other Unix-alikes. But for other systems the driver manager of choice is unixODBC, part of almost all Linux distributions and with sources downloadable from http://www.unixODBC.org. In Linux binary distributions it is likely that package unixODBC-devel or unixodbc-dev or some such will be needed.

    In most cases the package's configure script will find the driver manager files, and the package will install with no extra settings. However, if further information is required, use --with-odbc-include and --with-odbc-lib or environment variables ODBC_INCLUDE and ODBC_LIBS to set the include and library paths as needed. A specific ODBC driver manager can be specified by the --with-odbc-manager configure option, with likely values odbc or iodbc: if this is done for odbc and the program odbc_config is found, it is used to set the libpath as a last resort (it is often wrong), and to add any additional CFLAGS.
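As a sketch of how the configure options above might be passed from within R, the helper below assembles them into a single string. The helper name and the /opt/odbc paths are placeholders, not recommended locations; the option names are those given in the text.

```r
# Build the configure arguments from include/library directories and,
# optionally, the driver manager to use. NULL arguments are omitted.
rodbc_configure_args <- function(include = NULL, lib = NULL, manager = NULL) {
  args <- c(
    if (!is.null(include)) paste0("--with-odbc-include=", include),
    if (!is.null(lib))     paste0("--with-odbc-lib=", lib),
    if (!is.null(manager)) paste0("--with-odbc-manager=", manager)
  )
  paste(args, collapse = " ")
}

cfg <- rodbc_configure_args(include = "/opt/odbc/include",
                            lib     = "/opt/odbc/lib",
                            manager = "odbc")

# Not run: install from source with those settings.
# install.packages("RODBC", type = "source", configure.args = cfg)
```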

    Sources of drivers

    A fairly comprehensive list of drivers is maintained at http://www.sqlsummit.com/ODBCVend.htm, and one for unixODBC 14 at http://www.unixodbc.org/drivers.html. unixODBC ships with a number of drivers (although in most cases the DBMS vendor's driver is preferred): these include drivers for MySQL, PostgreSQL, Mimer and flat files.

    MySQL provides drivers under the name 'Connector/ODBC' (formerly MyODBC) in source form, and binaries for all common 32-bit and most 64-bit R platforms.

    PostgreSQL has an associated project at http://pgfoundry.org/projects/psqlodbc/ and another project at http://pgfoundry.org/projects/odbcng/ and http://projects.commandprompt.com/public/odbcng. (Documentation for psqlodbc is currently hard to find, but there is some in the PostgreSQL 7.2 manual at http://www.postgresql.org/docs/7.2/static/odbc.html from before it was unbundled.) There are drivers for Unix-alikes and Windows; 64-bit Windows support is available for PostgreSQL 9.0.

    14 that the author works for Easysoft is conspicuous.

    An SQLite ODBC driver for Unix-alikes, including Mac OS X, and (32- and 64-bit) Windows is available from http://www.ch-werner.de/sqliteodbc/.

    Oracle provides ODBC drivers as a supplement to its Instant Client for some of its platforms (including 32/64-bit Windows and Linux but not currently Mac OS X). See http://www.oracle.com/technology/software/tech/oci/instantclient/. One quirk of the Windows drivers is that the Oracle binaries must be in the path, so PATH should include e.g. c:\Oracle\bin.

    For IBM's DB2, search its site for drivers for 'ODBC and CLI'. There are some notes about using this under Linux at http://www.unixodbc.org/doc/db2.html.

    Mimer (www.mimer.com) is a cross-platform DBMS with integral ODBC support, so

        The Mimer SQL setup process automatically installs an ODBC driver when the Mimer SQL client is installed on any Windows or UNIX platform.

    The 'HowTos' at http://developer.mimer.se/howto/index.tml provide some useful hints.

    Some details of the 32-bit Microsoft 'ODBC Desktop Database Drivers' (for Access, Excel, Paradox, dBase and text files on Windows) can be found at http://msdn.microsoft.com/en-us/library/ms709326%28VS.85%29.aspx. There is also a Visual FoxPro driver and an (outdated) Oracle driver.

    32-bit Windows drivers for Access 2007 and Excel 2007 are bundled with Office 2007 but can be installed separately via the installer AccessDatabaseEngine.exe available from http://www.microsoft.com/downloads/details.aspx?FamilyID=7554f536-8c28-4598-9b72-ef94e038c891&DisplayLang=en. The Access/Excel 2010 versions at http://www.microsoft.com/downloads/details.aspx?familyid=C06B8369-60DD-4B64-A44B-84B371EDE16D&displaylang=en have a 64-bit version: however the 64-bit drivers cannot be installed alongside 32-bit versions of Office (as far as we know, and definitely not for Office 2007).

    For recent versions of Mac OS X, low-cost and easy-to-use drivers are available from http://www.actualtechnologies.com/products.php: these cover MySQL/PostgreSQL/SQLite (one driver), SQL Server/Sybase, Oracle, and a read-only driver for Access and related formats (including Access 2007 and Excel, but not Excel 2007). That SQLite driver needs believeNRows = FALSE set.

    Mac OS X drivers for MySQL, PostgreSQL and the major commercial databases are available from http://uda.openlinksw.com/.

    Specifying ODBC drivers

    The next step is to specify the ODBC drivers to be used for specific DBMSs. On Windows installing the drivers will register them automatically. This might happen as part of the installation on other systems, but usually does not.

    Both unixODBC and iODBC store information on drivers in configuration files, normally system-wide in /etc/odbcinst.ini and per-user in ~/.odbcinst.ini. However, the system location can vary, and on systems with unixODBC can be found at the Unix command line by one of

        $ odbcinst -j
        $ odbc_config --odbcinstini

    For iODBC use iodbc-config: on Mac OS X the system location is /Library/ODBC/odbcinst.ini.

    The format can be seen from figure 1. (unixODBC allows Driver64 here to allow for different paths on 32-bit and 64-bit platforms sharing a file system.) The MySQL and PostgreSQL drivers were installed from the Fedora RPMs mysql-connector-odbc and postgresql-odbc, and also from the mysql-connector-odbc RPM in the MySQL distribution (which inserted the entry in the driver file).

    The MySQL manual gives detailed information (including screenshots) on installing its drivers and setting up DSNs that may also be informative to users of other DBMSs.


        $ cat /etc/odbcinst.ini
        [MySQL]
        Description     = ODBC 3.51.26 for MySQL
        Driver          = /usr/lib64/libmyodbc3.so
        FileUsage       = 1

        [MySQL ODBC 5.1 Driver]
        Description     = ODBC 5.1.05 for MySQL
        Driver          = /usr/lib64/libmyodbc5.so
        UsageCount      = 1

        [PostgreSQL]
        Description     = ODBC for PostgreSQL
        Driver          = /usr/lib64/psqlodbc.so
        FileUsage       = 1

        [sqlite3]
        Description     = sqliteodbc
        Driver          = /usr/local/lib64/libsqlite3odbc.so
        Setup           = /usr/local/lib64/libsqlite3odbc.so
        FileUsage       = 1

    Figure 1: A system ODBC driver file from a x86_64 Fedora 10 Linux system using unixODBC.
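A driver file in this format is easy to inspect from R. The following is a minimal sketch of a reader for such files, assuming only bracketed section headers and key = value lines as in the figure; the function name is hypothetical and it is not part of RODBC or of either driver manager.

```r
# Parse lines of an odbcinst.ini-style file into a named list with one
# character vector of settings per [section].
parse_odbc_ini <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]              # drop blank lines
  out <- list()
  section <- NULL
  for (ln in lines) {
    if (grepl("^\\[.*\\]$", ln)) {
      section <- substr(ln, 2, nchar(ln) - 1)
      out[[section]] <- character(0)
    } else if (!is.null(section) && grepl("=", ln, fixed = TRUE)) {
      key   <- trimws(sub("=.*$", "", ln))   # text before the first "="
      value <- trimws(sub("^[^=]*=", "", ln))  # text after the first "="
      out[[section]][key] <- value
    }
  }
  out
}

ini <- c("[MySQL]",
         "Description = ODBC 3.51.26 for MySQL",
         "Driver = /usr/lib64/libmyodbc3.so",
         "FileUsage = 1",
         "",
         "[sqlite3]",
         "Description = sqliteodbc",
         "Driver = /usr/local/lib64/libsqlite3odbc.so")
drivers <- parse_odbc_ini(ini)

# Not run: read the real system file instead of the inline example.
# drivers <- parse_odbc_ini(readLines("/etc/odbcinst.ini"))
```

The section names are the driver names that a DSN refers to in its Driver field, so names(drivers) gives the values that can be used there.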


    B Specifying DSNs

    The ODBC driver managers have 'User DSNs' and 'System DSNs': these differ only in where the information is stored, the first on a per-user basis and the second for all users of the system.

    Windows has a GUI 15 to set up DSNs, called something like 'Data Sources (ODBC)' under 'Administrative Tools' in the Control Panel. You can add, remove and edit ('configure') DSNs there (see figure 2). When adding a DSN, first select the ODBC driver and then complete the driver-specific dialog box. There will usually be an option to test the DSN and it is wise to do so.

    If Rgui is to be used on Windows, incomplete DSNs can be created and the dialog box will be brought up for completion when odbcConnect is called: this can be helpful to avoid storing passwords in the Windows Registry or to allow alternate users or databases. On that platform, calling odbcDriverConnect() with no arguments will bring up the main ODBC Data Sources dialog box to allow a DSN to be constructed on the fly.

    Mac OS X comes with a very similar GUI (figure 3) found at Applications / Utilities / ODBC Administrator.

    Both unixODBC and iODBC provide GUIs (which might be packaged separately in binary distributions) to create DSNs, and iODBC also has a web-grounded DSN administrator. unixODBC's GUI is currently called ODBCConfig (see figure 4), and there is a KDE control widget called DataManager to manage both ODBC drivers and DSNs. See the unixODBC user manual at http://www.unixodbc.org/doc/UserManual/. (On Fedora these are in the unixODBC-kde RPM. It has been announced that they will become separate projects after unixODBC 2.2.14.)

    On Unix-alikes DSNs can also be specified in files (and the graphical tools just manipulate these files). The system-wide file is usually /etc/odbc.ini and the per-user file 16 ~/.odbc.ini. Some examples of the format are shown in figure 5.

    What fields are supported is driver-specific (and it can be hard to find documentation). There is no clear distinction between fields that specify the driver and those which specify the DSN, so any parts of the driver specification which might differ between connections can be used in the DSN file.

    Things that are often set here are whether the connection is read-only (test_pg is not read-only) and the character encoding to be used.

    15 Extra care is needed on a 64-bit version of Windows, as this GUI shows only 64-bit settings for ODBC, including drivers and DSNs. If you are running 32-bit R (and hence 32-bit ODBC) on 64-bit Windows, you need the 32-bit version of the GUI at something like c:\Windows\SysWOW64\odbcad32.exe, and beware that both 32- and 64-bit versions are called odbcad32.exe.

    16 ~/Library/ODBC/odbc.ini on Mac OS X.

    Figure 2: (Top) The main Data Sources (ODBC) dialog box from a Windows XP system. (Bottom) The dialog box to select a driver that comes up when the Add button is clicked.

    Figure 3: (Top) The main ODBC Administrator dialog box from a Mac OS X system. (Bottom) A page of the dialog box to specify a DSN for the Actual Technologies Access/Excel driver.

    Figure 4: The dialog box of ODBCConfig on Fedora 10 Linux, and the Configure screen for the SQLite driver.

        [test_mysql]
        Description     = test MySQL
        Driver          = MySQL
        Trace           = No
        Server          = localhost
        Port            = 3306
        Database        = test

        [test_mysql5]
        Description     = myodbc5
        Driver          = MySQL ODBC 5.1 Driver
        Server          = gannet
        Port            = 3306
        Database        = ripley

        [test_pg]
        Description     = test PostgreSQL
        Driver          = PostgreSQL
        Trace           = No
        TraceFile       =
        ServerName      = localhost
        UserName        = ripley
        Port            = 5432
        Socket          =
        Database        = testdb
        ReadOnly        = 0

        [test_sqlite3]
        Description     = test SQLite3
        Driver          = sqlite3
        Database        = /tmp/mysqlite3.db

    Figure 5: A personal (~/.odbc.ini) file from a Fedora 10 Linux system using unixODBC.

    Command-line programs isql (unixODBC) and iodbctest (iODBC) can be used to test a DSN that has been created manually in a file. The formats are

        $ isql -v dsn db_username db_password
        $ iodbctest

    Both give a command-line SQL interface: use quit to terminate.
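When DSNs are maintained by hand, an entry in the odbc.ini format can be generated from R. The sketch below is one possible way to do this; the helper name is hypothetical, and the entry reproduces the test_sqlite3 example from figure 5.

```r
# Format one DSN entry as lines of an odbc.ini-style file:
# a [name] header followed by aligned "key = value" lines.
format_dsn <- function(name, fields) {
  c(paste0("[", name, "]"),
    paste(format(names(fields)), "=", unlist(fields)))
}

entry <- format_dsn("test_sqlite3",
                    list(Description = "test SQLite3",
                         Driver      = "sqlite3",
                         Database    = "/tmp/mysqlite3.db"))
writeLines(entry)

# Not run: append the entry to the per-user file, then test it with
# 'isql -v test_sqlite3' at the command line.
# cat(entry, file = "~/.odbc.ini", sep = "\n", append = TRUE)
```

The Driver value must match a section name in the driver file (odbcinst.ini), which is how the driver manager locates the shared library to load.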


    Figure 6: Parts of the ODBC driver configuration screens on Windows XP for Microsoft Access, MySQL Connector/ODBC 5.1, Oracle's ODBC driver and Microsoft SQL Server.


    C Internals

    This appendix is in part an aide-memoire for the maintainer, but may interest the curious user.

    RODBC connection objects are an integer with several attributes: they are numbered consecutively in the current session. For example

        > unclass(channel)
        [1] 1
        attr(,"connection.string")
        [1] "DATABASE=ripley;DESCRIPTION=myodbc;DSN=test;OPTION=0;PORT=3306;SERVER=localhost;"
        attr(,"handle_ptr")
        <pointer: ...>
        attr(,"case")
        [1] "nochange"
        attr(,"id")
        [1] 11371
        attr(,"believeNRows")
        [1] TRUE
        attr(,"colQuote")
        [1] ""
        attr(,"tabQuote")
        [1] ""
        attr(,"encoding")
        [1] ""
        attr(,"rows_at_time")
        [1] 100
        attr(,"isMySQL")
        [1] FALSE

    Most of the attributes record the arguments of odbcDriverConnect. The "connection.string" attribute is as returned by SQLDriverConnect and lists driver-specific parameters separated (and perhaps terminated) by a semicolon. The "id" attribute is a random integer used for integrity checks (and in particular to reject connection objects should they be saved and restored in a different session). The "isMySQL" attribute is used both to select the default quote character and the interpretation of qualifier.table names.
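Since the connection string is a semicolon-separated list of KEY=value pairs, it can be split into a named vector for inspection. This is a small sketch with a hypothetical helper name; it assumes that values contain no semicolons, which holds for the example above but is not guaranteed for every driver.

```r
# Split a connection string into a named character vector.
parse_connection_string <- function(cs) {
  parts <- strsplit(cs, ";", fixed = TRUE)[[1]]
  parts <- parts[nzchar(parts)]            # ignore any empty pieces
  values <- sub("^[^=]*=", "", parts)      # text after the first "="
  names(values) <- sub("=.*$", "", parts)  # text before the first "="
  values
}

cs <- "DATABASE=ripley;DESCRIPTION=myodbc;DSN=test;OPTION=0;PORT=3306;SERVER=localhost;"
params <- parse_connection_string(cs)
params[["DSN"]]

# Not run: the same for a live connection.
# parse_connection_string(attr(channel, "connection.string"))
```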

    The main structure of the connection is kept as a C struct, a pointer to which is passed around as the R external pointer "handle_ptr". This has a finalizer that will close the connection when there is no longer an R object referring to it (including at the end of the R session), with a warning unless the connection has already been closed by close or odbcClose. In addition, a C-level table keeps the pointers of the first 1000 connections of an R session, to enable odbcCloseAll to close them.

    The struct is currently defined as

        typedef struct rodbcHandle {
            SQLHDBC     hDbc;         /* connection handle */
            SQLHSTMT    hStmt;        /* statement handle */
            SQLLEN      nRows;        /* number of rows and columns in result set */
            SQLSMALLINT nColumns;
            int         channel;      /* as stored on the R-level object */
            int         id;           /* ditto */
            int         useNRows;     /* value of believeNRows */
            /* entries used to bind data for result sets and updates */
            COLUMNS     *ColData;
            int         nAllocated;
            SQLUINTEGER rowsFetched;  /* use to indicate the number of rows fetched */
            SQLUINTEGER rowArraySize; /* use to indicate the number of rows we expect back */
            SQLUINTEGER rowsUsed;     /* for when we fetch more than we need */
            SQLMSG      *msglist;     /* root of linked list of messages */
            SEXP        extPtr;       /* the external pointer address */
        } RODBCHandle, *pRODBCHandle;

    Most ODBC operations work by sending a query, explicitly or implicitly via e.g. sqlColumns, and this creates a result set which is transferred to an R data frame by sqlGetResults. nRows and nColumns indicate the size of the pending result set, with nColumns = -1 used if there are no pending results.

    ODBC works with various handles. There is a SQLHENV handle for the environment that RODBC opens when a connection is first opened or DSNs are listed: its main use is to request ODBC 3 semantics. Then each connection has a SQLHDBC handle, and each query (statement) a SQLHSTMT handle. Argument literal=TRUE of sqlTables and sqlColumns is used to set the SQL_ATTR_METADATA_ID attribute of the statement handle to be true.

    All the functions 17 that create a result set call C function cachenbind. This allocates buffers under the ColData pointer and binds the result set to them by SQLBindCol. Then when sqlGetResults calls the C function SQLFetch or SQLFetchScroll the results for one or more (up to MAX_ROWS_FETCH = 1024) rows are loaded into the buffers and then copied into R vectors.

    Prior to RODBC 1.3-0 the default was to fetch a row at a time, but it is now to fetch up to 100 rows at a time. Entries rowArraySize and rowsFetched are used to indicate how many rows were requested and how many were available. Since e.g. sqlFetch allows a maximum number of rows to be returned in the data frame, rowsUsed indicates how many of the rows last fetched have so far been returned to R.
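The bookkeeping described above can be illustrated with a small simulation in R. This is a sketch of the idea only, not a transcription of the C code: the function and its arguments are hypothetical, with max = 0 taken to mean "no limit" as in sqlGetResults.

```r
# Simulate fetching 'available' rows in batches of 'rows_at_time',
# stopping once 'max' rows have been returned (max = 0: no limit).
simulate_fetch <- function(available, rows_at_time = 100L, max = 0L) {
  returned <- 0L; fetches <- 0L; rowsUsed <- 0L; rowsFetched <- 0L
  remaining <- available
  while (remaining > 0L && (max == 0L || returned < max)) {
    rowsFetched <- min(rows_at_time, remaining)  # rows delivered by this fetch
    remaining   <- remaining - rowsFetched
    fetches     <- fetches + 1L
    # rows of this batch actually handed back to R
    rowsUsed    <- if (max == 0L) rowsFetched else min(rowsFetched, max - returned)
    returned    <- returned + rowsUsed
  }
  list(fetches = fetches, returned = returned,
       rowsFetched = rowsFetched, rowsUsed = rowsUsed)
}

res <- simulate_fetch(available = 250L, rows_at_time = 100L, max = 130L)
```

In this run the second fetch delivers 100 rows but only 30 are needed to reach the limit of 130, so after it rowsFetched is 100 and rowsUsed is 30; the unused 70 rows remain in the buffers for a later call.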

    The buffers are part of the ColData entry, which is an array of COLUMNS structures, one for each column in the result set. These have the form

        typedef struct cols {
            SQLCHAR     ColName[256];
            SQLSMALLINT NameLength;
            SQLSMALLINT DataType;
            SQLULEN     ColSize;
            SQLSMALLINT DecimalDigits;
            SQLSMALLINT Nullable;
            char        *pData;
            int         datalen;
            SQLDOUBLE   RData [MAX_ROWS_FETCH];
            SQLREAL     R4Data[MAX_ROWS_FETCH];
            SQLINTEGER  IData [MAX_ROWS_FETCH];
            SQLSMALLINT I2Data[MAX_ROWS_FETCH];
            SQLLEN      IndPtr[MAX_ROWS_FETCH];
        } COLUMNS;

    17 odbcQuery, sqlColumns, sqlPrimaryKeys, sqlTables and sqlTypeInfo.

    The first six entries are returned by a call to SQLDescribeCol: DataType is used to select the buffer to use. There are separate buffers for double-precision, single-precision, 32-bit and 16-bit integer and character/byte data. When character/byte buffers are allocated, datalen records the length allocated per row (which is based on the value returned as ColSize). The IndPtr value is used to record the actual size of the item in the current row for variable length character and binary types, and for all nullable types the special value SQL_NULL_DATA (-1) indicates an SQL null value.

    The other main C-level operation is to send data to the ODBC driver for sqlSave and sqlUpdate. These use INSERT INTO and UPDATE queries respectively, and for fast = TRUE use parametrized queries. So we have the queries (split across lines for display)

        > sqlSave(channel, USArrests, rownames = "State", addPK = TRUE, verbose = TRUE)
        Query: CREATE TABLE "USArrests"
          ("State" varchar(255) NOT NULL PRIMARY KEY, "Murder" double, "Assault" integer,
           "UrbanPop" integer, "Rape" double)
        Query: INSERT INTO "USArrests"
          ( "State", "Murder", "Assault", "UrbanPop", "Rape" ) VALUES ( ?,?,?,?,? )
        Binding: State DataType 12, ColSize 255
        Binding: Murder DataType 8, ColSize 15
        Binding: Assault DataType 4, ColSize 10
        Binding: UrbanPop DataType 4, ColSize 10
        Binding: Rape DataType 8, ColSize 15
        Parameters:
        ...

        > sqlUpdate(channel, foo, "USArrests", verbose=TRUE)
        Query: UPDATE "USArrests" SET "Assault"=? WHERE "State"=?
        Binding: Assault DataType 4, ColSize 10
        Binding: State DataType 12, ColSize 255
        Parameters:
        ...

    At C level, this works by calling SQLPrepare to record the insert/update query on the statement handle, then calling SQLBindParameter to bind a buffer for each column with values to be sent, and finally in a loop over rows copying the data into the buffer and calling SQLExecute on the statement handle.

    The same buffer structure is used as when retrieving result sets. The difference is that the arguments which were outputs from SQLBindCol are inputs to SQLBindParameter, so we need to use sqlColumns to retrieve the column characteristics of the table and pass these down to the C interface.
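The text of the parametrized queries shown in the verbose output can be reproduced with a few lines of R. The helpers below are hypothetical and only build the query strings, with double quotes assumed as the quote character; they do not prepare or execute anything.

```r
# Quote an identifier; the quote character is DBMS-dependent.
quote_name <- function(x, q = "\"") paste0(q, x, q)

# INSERT with one "?" placeholder per column.
insert_query <- function(table, cols) {
  paste0("INSERT INTO ", quote_name(table), " ( ",
         paste(quote_name(cols), collapse = ", "),
         " ) VALUES ( ",
         paste(rep("?", length(cols)), collapse = ","),
         " )")
}

# UPDATE setting 'set_cols' for the rows identified by 'key_cols'.
update_query <- function(table, set_cols, key_cols) {
  paste0("UPDATE ", quote_name(table), " SET ",
         paste0(quote_name(set_cols), "=?", collapse = ", "),
         " WHERE ",
         paste0(quote_name(key_cols), "=?", collapse = " AND "))
}

cols <- c("State", "Murder", "Assault", "UrbanPop", "Rape")
ins  <- insert_query("USArrests", cols)
upd  <- update_query("USArrests", "Assault", "State")
```

Because the statement text is fixed and only the bound values change from row to row, the driver needs to parse the query once, which is the point of preparing it before the loop over rows.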
