
Teradata DBA

Rolf Hanusa

A Lesson in Outer Joins (Learned the Hard Way!)

Using outer joins to ease the query process

Outer joins are extremely powerful tools, and as such they are very difficult to understand and use properly. A lack of understanding can give you unexpected and costly results. (For example, your company might mail promotional fliers to 17 million customers, instead of the 17,000 customers that you intended to target.) Although this article won't make you an expert on outer joins, it will help you understand their complexities using real-world examples and explanations. Before we get too deep into the SQL syntax, we need a framework on which to build. Because most of the outer joins that I see are used in Teradata, I'll limit this article to queries written and executed in a Teradata environment. The rules should still apply with other RDBMSs, but the queries may execute differently.

OUTER JOIN: A LOGICAL DEFINITION

An outer join is defined in sections: it is the UNION ALL of various pieces. Which pieces are pulled together depends on the type of outer join:

• Piece 1: The inner join result of the two tables as described by the full ON clause, with all conditions applied.
• Piece 2: All rows from the left table not included in Piece 1, extended with NULL values for each column of the right table.
• Piece 3: All rows from the right table not included in Piece 1, extended with NULL values for each column of the left table.

Left Outer Join is Piece 1 UNION ALL Piece 2.
Right Outer Join is Piece 1 UNION ALL Piece 3.
Full Outer Join is Piece 1 UNION ALL Piece 2 UNION ALL Piece 3.

For each type of outer join (left, right, full), just put the proper "pieces" together using UNION ALL.
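The "pieces" definition can be checked on a small scale. The following is a minimal sketch, assuming SQLite (through Python's standard sqlite3 module) as a stand-in for Teradata; the lowercase tables and their three rows each are invented for illustration and are not the article's data. It builds a left outer join from Piece 1 and Piece 2 and compares it with the LEFT OUTER JOIN itself.

```python
import sqlite3

# Toy data invented for illustration; SQLite stands in for Teradata here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (custnum INTEGER, district TEXT);
    CREATE TABLE revenue  (custnum INTEGER, data_date INTEGER, monthly_revenue REAL);
    INSERT INTO customer VALUES (1, 'K'), (2, 'K'), (3, 'J');
    INSERT INTO revenue  VALUES (1, 199707, 50.0), (3, 199707, 20.0), (2, 199706, 75.0);
""")

# The outer join as it would normally be written.
left_join = conn.execute("""
    SELECT c.custnum, b.monthly_revenue
    FROM customer c
    LEFT OUTER JOIN revenue b
      ON c.custnum = b.custnum AND b.data_date = 199707
    ORDER BY 1
""").fetchall()

# The same answer assembled by hand:
#   first SELECT  = Piece 1, the inner join with the full ON clause applied
#   second SELECT = Piece 2, left-table rows not in Piece 1, padded with NULL
pieces = conn.execute("""
    SELECT c.custnum, b.monthly_revenue
    FROM customer c
    INNER JOIN revenue b
      ON c.custnum = b.custnum AND b.data_date = 199707
    UNION ALL
    SELECT c.custnum, NULL
    FROM customer c
    WHERE NOT EXISTS (
        SELECT 1 FROM revenue b
        WHERE c.custnum = b.custnum AND b.data_date = 199707
    )
    ORDER BY 1
""").fetchall()

print(left_join)
print(pieces)
```

Customer 2 has revenue only for 199706, so it fails the full ON clause, drops out of Piece 1, and comes back through Piece 2 with a NULL.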

SOME BASIC RULES AND RECOMMENDATIONS

One or more join conditions, also called "connecting terms," are required in the ON clause for each relation in an outer join. These join conditions are used to define the rows in the outer table that take part in the match to the inner table. I recommend that you use only join conditions in ON clauses. However, when a search condition (used for row selection) is required on the inner table, it should be put in the ON clause as well. A search condition in the ON clause of the inner table will not limit the number of rows in the answer set. It only defines the rows eligible to take part in the match to the outer table.

An outer join can also include a WHERE clause; however, the results you get when you do include it may be surprising, or at least not obvious. This will be explained in more detail later in the article. To limit the number of qualifying rows in the outer table (and therefore the answer set), the search condition for the outer table must be in the WHERE clause. Note: The WHERE clause is applied only after the outer join has been produced.

Here's a little known (or less understood) outer join rule: If a search condition on the inner table is placed in the WHERE clause, and that condition cannot evaluate true when the column is NULL, the join is logically equivalent to an INNER JOIN, even if you code OUTER JOIN in the query. Read on to see how this can impact your results.

These rules are not strange concepts unique to Teradata. This is a fully SQL-92-compliant implementation (for better or worse). Teradata's optimizer does, however, take advantage of these concepts in processing these queries. Instead of executing the outer join just as it is defined, the optimizer rewrites the query to roll the whole, complex process into a single step, as well as to eliminate outer joins that really aren't.
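The rule about inner-table search conditions is easy to see on toy data. This is a minimal sketch, again assuming SQLite via Python's sqlite3 module rather than Teradata, with invented tables and rows: the same search condition on the inner table is placed first in the ON clause, then in the WHERE clause, and the second form is compared with a plain inner join.

```python
import sqlite3

# Toy data invented for illustration; SQLite stands in for Teradata here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (custnum INTEGER);
    CREATE TABLE revenue  (custnum INTEGER, data_date INTEGER, monthly_revenue REAL);
    INSERT INTO customer VALUES (1), (2), (3);
    INSERT INTO revenue  VALUES (1, 199707, 50.0), (2, 199706, 75.0);
""")

# Inner-table search condition in the ON clause: every customer survives.
in_on = conn.execute("""
    SELECT c.custnum, b.monthly_revenue
    FROM customer c
    LEFT OUTER JOIN revenue b
      ON c.custnum = b.custnum AND b.data_date = 199707
    ORDER BY 1
""").fetchall()

# The same condition in the WHERE clause: it is applied after the outer join
# and rejects every row whose data_date is NULL, that is, every nonmatch.
in_where = conn.execute("""
    SELECT c.custnum, b.monthly_revenue
    FROM customer c
    LEFT OUTER JOIN revenue b
      ON c.custnum = b.custnum
    WHERE b.data_date = 199707
    ORDER BY 1
""").fetchall()

# A plain inner join, for comparison.
inner = conn.execute("""
    SELECT c.custnum, b.monthly_revenue
    FROM customer c
    INNER JOIN revenue b
      ON c.custnum = b.custnum
    WHERE b.data_date = 199707
    ORDER BY 1
""").fetchall()

print(in_on)     # three rows, two of them with NULL revenue
print(in_where)  # one row
print(inner)     # the same single row
```

On this data the WHERE version and the inner join return identical rows, which is the behavior the rule describes.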

FROM THEORY TO REAL-WORLD ANALYSIS

The following examples represent actual cases that I have encountered as a DBA. Although I've changed them slightly to avoid any conflict of interest, the basic syntax and counts remain accurate. Since Teradata EXPLAINs may be new to some readers, they have been altered slightly for clarity (that is, aliases were replaced with database names, and so forth).

Before writing a query, it is important to understand the business question that it is supposed to answer. Here is a simple explanation of the business question we are trying to answer in the remainder of this article. We want to know all the customers (using table CUSTOMER, which contains over 17 million rows):

• Who reside in the DISTRICT of K, and
• Who have a SERVICE_TYPE of ABC or XYZ, and
• Their monthly revenue (using table REVENUE, which contains over 234 million rows) for the month of July 1997, using DATA_DATE = 199707, and (here's where the outer join comes in)
• If the customer revenue is unknown (that is, if no revenue records are found), we want to keep the customer record with a NULL for MONTHLY_REVENUE.

Sounds simple enough, doesn't it? I thought so too until I started analyzing my original answer sets and found them to be incorrect and, in some cases, very surprising. In fact, until I researched several coding alternatives and repeatedly questioned one of NCR's developers (who now probably uses caller ID to screen my calls), I was convinced that Teradata's optimizer had lost its mind. It hadn't, but I almost did. You'll see what I mean as we go through the following examples and analyze the results.

The first example (see Listing 1) is a single table select, which provides the base of customer records that we want. The second example (see Listing 2) is an inner join that will help explain the remaining queries and results. It starts with the same base of customer records but matches them with revenue records for a particular month. Note that every customer record it returns found a matching revenue record.

Listing 1. Single table select.

SELECT C.CUSTNUM
FROM SAMPDB.CUSTOMER C
WHERE C.DISTRICT = 'K'
  AND (C.SERVICE_TYPE = 'ABC' OR C.SERVICE_TYPE = 'XYZ')
ORDER BY 1;

Result: This query returns 18,034 rows.

Listing 2. Inner join.

SELECT C.CUSTNUM, B.MONTHLY_REVENUE
FROM SAMPDB.CUSTOMER C, SAMPDB2.REVENUE B
WHERE C.CUSTNUM = B.CUSTNUM
  AND C.DISTRICT = 'K'
  AND B.DATA_DATE = 199707
  AND (C.SERVICE_TYPE = 'ABC' OR C.SERVICE_TYPE = 'XYZ')
ORDER BY 1;


In Listing 3, an outer join is requested, but if we apply the rules stated above, we end up with a surprising result. Although we are asking for a LEFT OUTER JOIN, it is in fact treated as an inner join. Because all the selection criteria are in the WHERE clause, they are logically applied only after the outer join processing has been completed. This means that Listings 2 and 3 are logically equivalent and will provide the same result. It is important to note that Teradata recognizes that this query is the same as an inner join and executes it as such (see EXPLAIN Exq3). Therefore, it executes with the speed of an inner join.

Listing 3. Outer join. (But is it?)

SELECT C.CUSTNUM, B.MONTHLY_REVENUE
FROM SAMPDB.CUSTOMER C
LEFT OUTER JOIN SAMPDB2.REVENUE B
  ON C.CUSTNUM = B.CUSTNUM
WHERE C.DISTRICT = 'K'
  AND B.DATA_DATE = 199707
  AND (C.SERVICE_TYPE = 'ABC' OR C.SERVICE_TYPE = 'XYZ')
ORDER BY 1;


Note: For those of you who are unfamiliar with a Teradata EXPLAIN, it is a textual description of the processing steps that the Teradata Optimizer will use to execute an SQL query.

EXPLAIN Exq3:

1. First, we lock SAMPDB.CUSTOMER for access, and we lock SAMPDB2.REVENUE for access.
2. Next, we do an all-AMPs JOIN step from SAMPDB.CUSTOMER by way of a RowHash match scan with a condition of ("(SAMPDB.T1.DISTRICT = 'K') and ((SAMPDB.T1.SERVICE_TYPE = 'ABC ') or (SAMPDB.T1.SERVICE_TYPE = 'XYZ '))"), which is joined to SAMPDB2.REVENUE with a condition of ("SAMPDB2.REVENUE.DATA_DATE = 199707"). SAMPDB.CUSTOMER and SAMPDB2.REVENUE are joined using a merge join, with a join condition of ("(SAMPDB.CUSTOMER.CUSTNUM = SAMPDB2.REVENUE.CUSTNUM)"). The input table SAMPDB.CUSTOMER will not be cached in memory. The result goes into Spool 1, which is built locally on the AMPs. Then we do a SORT to order Spool 1 by the sort key in spool field1. The size of Spool 1 is estimated to be 1,328,513 rows. The estimated time for this step is 6 minutes and 2 seconds.
3. Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
--> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0 hours and 6 minutes and 2 seconds.

The NCR/Teradata developer's explanation: Logically, terms in the WHERE clause are supposed to be applied after the outer join has been performed using the terms in the ON clause. If we do that, there will be 18,034 rows in the result. But when we apply the term B.DATA_DATE = 199707 afterward, it will eliminate all rows where B.DATA_DATE is null (these are the rows where no inner table rows matched outer table rows). Thus, it is quite reasonable to expect that this client's request should return fewer than 18,034 rows. Perhaps (since the EXPLAIN will not appear to reflect the logic I've described) I should mention that we do not really apply the term B.DATA_DATE = 199707 after doing the outer join. The optimizer recognizes that outer joins with a WHERE clause containing a term referencing the inner table, which would not evaluate true when the column is null, are logically equivalent to an inner join. In such cases, the optimizer generates a plan to perform an inner join. (Note that step 2 of the EXPLAIN says that we do a merge join, not an outer merge join.)

My explanation of the developer's explanation: Notice that the restrictions on the outer table are in the WHERE clause. This causes the left table to be reduced from 17,713,502 to 18,034 rows. The restrictions on the inner table are also in the WHERE clause (instead of the ON clause), so they will be applied afterward to remove all rows containing NULLs (as a result of the outer join). This reduces the answer set to 13,010 rows.

Confusing, yes. But it gets worse. Our next example (see Listing 4) is an outer join, but the answer set returned is vastly different from the desired result, as we shall see. This query was the most confusing for me to understand, at least at first. As the developer told me, it is counterintuitive.

Listing 4. Outer join. (Yes, but is this what you want?)

SELECT C.CUSTNUM, B.MONTHLY_REVENUE
FROM SAMPDB.CUSTOMER C
LEFT OUTER JOIN SAMPDB2.REVENUE B
  ON C.CUSTNUM = B.CUSTNUM
  AND C.DISTRICT = 'K'
  AND B.DATA_DATE = 199707
  AND (C.SERVICE_TYPE = 'ABC' OR C.SERVICE_TYPE = 'XYZ')
ORDER BY 1;

Result: This query returns 17,713,502 rows.
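The same effect shows up with four rows instead of 17 million. This is a minimal sketch, assuming SQLite via Python's sqlite3 module as a stand-in for Teradata, with invented tables and data: a search condition on the outer table is placed in the ON clause, and the query still returns every outer-table row.

```python
import sqlite3

# Toy data invented for illustration; SQLite stands in for Teradata here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (custnum INTEGER, district TEXT);
    CREATE TABLE revenue  (custnum INTEGER, monthly_revenue REAL);
    INSERT INTO customer VALUES (1, 'K'), (2, 'K'), (3, 'J'), (4, 'J');
    INSERT INTO revenue  VALUES (1, 50.0), (3, 20.0);
""")

# The outer-table condition (district = 'K') sits in the ON clause,
# so it decides which rows can match, not which rows are returned.
rows = conn.execute("""
    SELECT c.custnum, c.district, b.monthly_revenue
    FROM customer c
    LEFT OUTER JOIN revenue b
      ON c.custnum = b.custnum AND c.district = 'K'
    ORDER BY 1
""").fetchall()

for row in rows:
    print(row)
```

Customer 3 has a revenue row, but because its district is not 'K' it is treated as a nonmatch and comes back with a NULL, alongside the customers who have no revenue at all.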

The NCR/Teradata developer's explanation: As long as there is no WHERE clause, the result of an outer join will always have at least one row in the result for every row in the outer relation. That is what we have here. Listing 4 demonstrates the result of one of the possible placements of single-relation terms on the outer relation. When such terms are placed in the ON clause, they do not eliminate any rows from the result. Outer table rows for which DISTRICT = 'K' and (SERVICE_TYPE = 'ABC' OR SERVICE_TYPE = 'XYZ') do not both evaluate true are considered to be nonmatches with the inner table whether or not the join terms (those that reference both inner and outer relations) all evaluate as true. In other words, every outer relation row for which those two terms do not evaluate true does not match any inner relation rows, even if all the connecting terms in the ON clause evaluate true for that outer row and some inner relation row.

My explanation of the developer's explanation: The selection criteria (search conditions) in the ON clause only define which rows count as matches; every nonmatching row is still returned, with NULLs for the right table's columns (see EXPLAIN Exq4). This means that all the rows (17,713,502 of them) in the left table (CUSTOMER) will be returned. But only the rows (13,010) with a SERVICE_TYPE of 'ABC' or 'XYZ' in DISTRICT 'K' and matching rows from the right table (REVENUE) for month 199707 will have a non-NULL value for MONTHLY_REVENUE. This query will also perform more slowly since there are no WHERE conditions to limit the query ... well, almost none. Teradata is smart enough to treat the right table as it would in an inner join, applying DATA_DATE = 199707 to limit the rows it reads. Otherwise, this query would run much longer. Note that when you review EXPLAIN Exq4, you will see the words "left outer joined using a merge join." This statement confirms that this query is in fact an outer join.

EXPLAIN Exq4:

1) First, we lock SAMPDB.CUSTOMER for access, and we lock SAMPDB2.REVENUE for access.
2) Next, we do an all-AMPs JOIN step from SAMPDB.CUSTOMER by way of a RowHash match scan with no residual conditions, which is joined to SAMPDB2.REVENUE with a condition of ("SAMPDB2.REVENUE.DATA_DATE = 199707"). SAMPDB.CUSTOMER and SAMPDB2.REVENUE are left outer joined using a merge join, with condition(s) used for nonmatching on left table ("((SAMPDB.T1.SERVICE_TYPE = 'ABC') or (SAMPDB.T1.SERVICE_TYPE = 'XYZ')) and (SAMPDB.T1.DISTRICT = 'K')"), with a join condition of ("(SAMPDB.T1.CUSTNUM = SAMPDB2.REVENUE.CUSTNUM)"). The input table SAMPDB.CUSTOMER will not be cached in memory. The result goes into Spool 1, which is built locally on the AMPs. Then we do a SORT to order Spool 1 by the sort key in spool field1. The size of Spool 1 is estimated to be 17,713,502 rows. The estimated time for this step is 7 minutes and 15 seconds.
3) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
--> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0 hours and 7 minutes and 15 seconds.

Listing 5 is another example of a query where an outer join is requested, but it is logically an inner join and is therefore transformed into one.

Listing 5. Outer join. (Not! This will be treated as an inner join.)

SELECT C.CUSTNUM, B.MONTHLY_REVENUE
FROM SAMPDB.CUSTOMER C
LEFT OUTER JOIN SAMPDB2.REVENUE B
  ON C.CUSTNUM = B.CUSTNUM
  AND C.DISTRICT = 'K'
  AND (C.SERVICE_TYPE = 'ABC' OR C.SERVICE_TYPE = 'XYZ')
WHERE B.DATA_DATE = 199707
ORDER BY 1;


My explanation of Listing 5: Using what we have learned from the previous examples, we can quickly see the similarity to Listing 3. Again, this query is treated as an inner join, even though we asked for an outer join. The WHERE clause condition on the right (inner) table logically changes this query from an outer join to an inner join (see EXPLAIN Exq5). As in previous examples, the WHERE clause is logically applied after the outer join processing has been completed, removing all rows that were NULLed in the process (that is, nonmatching rows between left and right table). As before, the optimizer knows to execute this as an inner join to improve the performance of the query.

EXPLAIN Exq5 (As you can see, this EXPLAIN output is identical to EXPLAIN Exq3 and, as expected, so is the answer set.):

1. First, we lock SAMPDB.CUSTOMER for access, and we lock SAMPDB2.REVENUE for access.
2. Next, we do an all-AMPs JOIN step from SAMPDB.CUSTOMER by way of a RowHash match scan with a condition of ("(SAMPDB.T1.DISTRICT = 'K') and ((SAMPDB.T1.SERVICE_TYPE = 'ABC') or (SAMPDB.T1.SERVICE_TYPE = 'XYZ'))"), which is joined to SAMPDB2.REVENUE with a condition of ("SAMPDB2.REVENUE.DATA_DATE = 199707"). SAMPDB.CUSTOMER and SAMPDB2.REVENUE are joined using a merge join, with a join condition of ("(SAMPDB.T1.CUSTNUM = SAMPDB2.REVENUE.CUSTNUM)"). The input table SAMPDB.CUSTOMER will not be cached in memory. The result goes into Spool 1, which is built locally on the AMPs. Then we do a SORT to order Spool 1 by the sort key in spool field1. The size of Spool 1 is estimated to be 1,328,513 rows. The estimated time for this step is 6 minutes and 2 seconds.
3. Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
--> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0 hours and 6 minutes and 2 seconds.

Finally, we have the correct answer. This example (see Listing 6) is an outer join whose answer set answers the original business question.

Listing 6. Outer join. (The correct answer.)

SELECT C.CUSTNUM, B.MONTHLY_REVENUE
FROM SAMPDB.CUSTOMER C
LEFT OUTER JOIN SAMPDB2.REVENUE B
  ON C.CUSTNUM = B.CUSTNUM
  AND B.DATA_DATE = 199707
WHERE C.DISTRICT = 'K'
  AND (C.SERVICE_TYPE = 'ABC' OR C.SERVICE_TYPE = 'XYZ')
ORDER BY 1;

This query returns 18,034 rows. 13,010 rows have non-NULL values for MONTHLY_REVENUE.
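These two counts line up with the earlier listings: the row count equals the single table select (Listing 1), and the non-NULL count equals the answer set of the inner join (13,010 rows). That relationship makes a handy cross-check. The following is a minimal sketch of it, assuming SQLite via Python's sqlite3 module as a stand-in for Teradata, invented toy data, and two further assumptions: each customer has at most one revenue row per month, and MONTHLY_REVENUE is never NULL in a stored revenue row.

```python
import sqlite3

# Toy data invented for illustration; SQLite stands in for Teradata here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (custnum INTEGER, district TEXT, service_type TEXT);
    CREATE TABLE revenue  (custnum INTEGER, data_date INTEGER, monthly_revenue REAL);
    INSERT INTO customer VALUES
        (1, 'K', 'ABC'), (2, 'K', 'XYZ'), (3, 'K', 'DEF'),
        (4, 'J', 'ABC'), (5, 'K', 'ABC');
    INSERT INTO revenue VALUES
        (1, 199707, 50.0), (2, 199706, 75.0), (4, 199707, 20.0), (5, 199707, 10.0);
""")

outer_filter = "c.district = 'K' AND c.service_type IN ('ABC', 'XYZ')"

# Analogue of Listing 1: the base of customers we want.
base_count = conn.execute(
    f"SELECT COUNT(*) FROM customer c WHERE {outer_filter}"
).fetchone()[0]

# Analogue of Listing 2: customers that have revenue for the month.
inner_count = conn.execute(f"""
    SELECT COUNT(*)
    FROM customer c
    INNER JOIN revenue b ON c.custnum = b.custnum
    WHERE {outer_filter} AND b.data_date = 199707
""").fetchone()[0]

# Analogue of Listing 6: inner-table condition in ON, outer-table conditions in WHERE.
outer_rows = conn.execute(f"""
    SELECT c.custnum, b.monthly_revenue
    FROM customer c
    LEFT OUTER JOIN revenue b
      ON c.custnum = b.custnum AND b.data_date = 199707
    WHERE {outer_filter}
    ORDER BY 1
""").fetchall()

non_null = sum(1 for _, rev in outer_rows if rev is not None)
print(base_count, inner_count, len(outer_rows), non_null)
```

If the outer join returned more rows than the base select, or fewer non-NULL rows than the inner join, that would be a sign that a condition has landed in the wrong clause.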

In this query, the left (outer) table is limited by the search conditions in the WHERE clause, and the search condition in the ON clause for the right (inner) table defines the NULL-able nonmatching rows. This EXPLAIN confirms that this is in fact an outer join (see EXPLAIN Exq6).

EXPLAIN Exq6:

1. First, we lock SAMPDB.CUSTOMER for access, and we lock SAMPDB2.REVENUE for access.
2. Next, we do an all-AMPs JOIN step from SAMPDB.CUSTOMER by way of a RowHash match scan with a condition of ("((SAMPDB.T1.SERVICE_TYPE = 'ABC') or (SAMPDB.T1.SERVICE_TYPE = 'XYZ')) and (SAMPDB.T1.DISTRICT = 'K')"), which is joined to SAMPDB2.REVENUE with a condition of ("SAMPDB2.REVENUE.DATA_DATE = 199707"). SAMPDB.CUSTOMER and SAMPDB2.REVENUE are left outer joined using a merge join, with a join condition of ("(SAMPDB.T1.CUSTNUM = SAMPDB2.REVENUE.CUSTNUM)"). The input table SAMPDB.CUSTOMER will not be cached in memory. The result goes into Spool 1, which is built locally on the AMPs. Then we do a SORT to order Spool 1 by the sort key in spool field1. The size of Spool 1 is estimated to be 1,328,513 rows. The estimated time for this step is 6 minutes and 2 seconds.
3. Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
--> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0 hours and 6 minutes and 2 seconds.

GETTING THE "CORRECT" ANSWER

As the previous examples show, outer joins, when used properly, provide additional information from a single query that formerly required multiple queries and/or steps to achieve. However, the proper use of outer joins requires training and/or experience because simple logic does not always apply. Use the following steps to be sure that you're getting the "correct" answer (that is, the one you expect to get):

1. Make sure that you understand the question you are trying to answer; you should have a pretty good idea what the answer set should look like.
2. Write the query, keeping in mind the proper placement of join conditions and search conditions:
• All join conditions are placed in the ON clause.
• Search conditions for the inner table are placed in the ON clause, while search conditions on the outer table are placed in the WHERE clause.
3. Always EXPLAIN the query before executing it. Look for the words "outer join." If you don't see them, it's not one.
4. Run the query and compare the result with your expectations. If your answer set matches your expectations, it is probably correct. If not, check the locations of any selection criteria that you have placed in the ON and/or WHERE clauses.

As this article demonstrates, many results are possible and the correct solution is not necessarily intuitive, especially in a more complex query. Now let's look at that 12-way complex join...

Rolf Hanusa is the project leader and lead DBA for Southwestern Bell's Corporate Data Warehouse (CDW) Project. Rolf has more than 10 years' experience as a DBA, supporting both Teradata and DB2 DSS systems. He is also an active member of the Partners Product Advisory Council, a group of NCR/Teradata customers that provides input to NCR on the product direction of NCR's large system products, as well as enhancements to the Teradata RDBMS. You can reach him via email at [email protected].
