Oracle Optimizer
Combining Output From Multiple Index Scans
• AND-EQUAL:
– select * from sailors
where sname = 'Jim' and rating = 10
• Suppose we have 2 indexes: sname, rating
TABLE ACCESS BY ROWID
  AND-EQUAL
    INDEX RANGE SCAN Sailors(sname)
    INDEX RANGE SCAN Sailors(rating)
• Suppose we also have an index on (sname, rating)
– How should the query be performed?
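The AND-EQUAL step can be pictured as intersecting the rowid lists returned by the two single-column index scans. The following is only a sketch of that idea, not Oracle's implementation: the two dictionaries standing in for the indexes, the rowid values, and the function name `and_equal` are all assumptions made for illustration.

```python
# Hypothetical single-column indexes: each maps a key value to a sorted list of rowids.
sname_index = {"Jim": [3, 8, 15, 21], "Joe": [4, 9]}
rating_index = {10: [2, 8, 9, 21, 30], 7: [3, 4]}


def and_equal(rowids_a, rowids_b):
    """Merge two sorted rowid lists, keeping only rowids that appear in both."""
    i = j = 0
    matches = []
    while i < len(rowids_a) and j < len(rowids_b):
        if rowids_a[i] < rowids_b[j]:
            i += 1
        elif rowids_a[i] > rowids_b[j]:
            j += 1
        else:
            matches.append(rowids_a[i])
            i += 1
            j += 1
    return matches


# Rowids of rows with sname = 'Jim' and rating = 10; only these are fetched from the table.
rowids = and_equal(sname_index["Jim"], rating_index[10])
```

Only the surviving rowids need a TABLE ACCESS BY ROWID, which is why the table access sits above AND-EQUAL in the plan.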
Operations that Manipulate Data Sets
• Up until now, all operations returned the rows as they were found
• There are operations that must find all rows before returning a single row
• Try to avoid these operations for online users!
– SORT ORDER BY: query with order by
select sname, age
from Sailors
order by age;
Operations that Manipulate Data Sets
– SORT UNIQUE: sorting records while
eliminating duplicates
e.g., query with distinct; query with minus,
intersect or union
select DISTINCT age from Sailors;
– SORT AGGREGATE, SORT GROUP BY:
queries with aggregate or grouping
functions (like MIN, MAX)
Is the table always accessed?
What if there is no index?
Operations that Manipulate Data Sets
• Consider the query:
– select sname from sailors
union
select bname from boats;
Operations that Manipulate Data Sets
• Consider the query:
– select sname from sailors
minus
select bname from boats;
How do you think Oracle implements intersect? union all?
• Select age, COUNT(*)
from Sailors
GROUP BY age
SORT GROUP BY
  TABLE ACCESS FULL
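The SORT GROUP BY plan above can be sketched as "sort all rows on the grouping column, then count each run of equal values". This is a minimal illustration under assumed data; the sample rows and the function name `sort_group_by_count` are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical Sailors rows: (sname, age).
sailors = [("Dustin", 45), ("Lubber", 55), ("Rusty", 35), ("Horatio", 35), ("Zorba", 45)]


def sort_group_by_count(rows, key_index):
    """Sort on the grouping column, then count each run of equal keys."""
    # The sort is the blocking step: every row must be read before the first group is known.
    ordered = sorted(rows, key=itemgetter(key_index))
    return [(key, sum(1 for _ in group))
            for key, group in groupby(ordered, key=itemgetter(key_index))]


counts = sort_group_by_count(sailors, key_index=1)
```

Because no group can be emitted until the sort has seen all rows, this is one of the operations that cannot return its first row early.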
Operations that Manipulate Data Sets
Distinct
• What should Oracle do when
processing the query (assuming that
sid is the primary key):
– select distinct sid
from Sailors
Join Methods
• Select * from Sailors, Reserves
where Sailors.sid = Reserves.sid
• Oracle can use an index on Sailors.sid
or on Reserves.sid (note that both will
not be used)
• Join Methods: MERGE JOIN, NESTED
LOOPS, HASH JOIN
Nested Loops Joins
• Block nested loop join
NESTED LOOPS
  TABLE ACCESS FULL OF our_outer_table
  TABLE ACCESS FULL OF our_inner_table
• Index nested loop join
NESTED LOOPS
  TABLE ACCESS FULL OF our_outer_table
  TABLE ACCESS BY ROWID OF our_inner_table
    INDEX RANGE SCAN OF inner_table_index
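The two plans above differ only in how the inner table is searched for each outer row. The sketch below contrasts a plain tuple-at-a-time nested loop (a full scan of the inner table per outer row, without the page blocking of a true block nested loop) with an index nested loop. The sample rows and function names are assumptions, and a Python dict stands in for the index on the inner table.

```python
# Hypothetical rows: Sailors as (sid, sname), Reserves as (sid, bid).
sailors = [(22, "Dustin"), (31, "Lubber"), (58, "Rusty")]
reserves = [(22, 101), (58, 103), (22, 102)]


def nested_loops_join(outer, inner):
    """For every outer row, scan the whole inner table."""
    for o in outer:
        for i in inner:
            if o[0] == i[0]:
                yield o + i


def index_nested_loops_join(outer, inner):
    """For every outer row, look up matching inner rows through an index."""
    # Stand-in for inner_table_index: join key -> positions ("rowids") in the inner table.
    index = {}
    for rowid, row in enumerate(inner):
        index.setdefault(row[0], []).append(rowid)
    for o in outer:
        for rowid in index.get(o[0], []):
            yield o + inner[rowid]  # corresponds to TABLE ACCESS BY ROWID


plain = list(nested_loops_join(sailors, reserves))
indexed = list(index_nested_loops_join(sailors, reserves))
```

Both variants can emit a joined row as soon as the first match is found, which fits the slide's point about returning results online.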
When Are Nested Loops Joins Used?
• If tables are of unequal size
• If results should be returned
online
Hash Join
// Partition R into k partitions
foreach tuple r in R do            // flush when fills
    read r and add it to buffer page h(r_i)
foreach tuple s in S do            // flush when fills
    read s and add it to buffer page h(s_j)
for l = 1..k
    // Build in-memory hash table for R_l using h2
    foreach tuple r in R_l do
        read r and insert into hash table with h2
    foreach tuple s in S_l do
        read s and probe table using h2
        output matching pairs <r,s>
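The pseudocode above can be turned into a small in-memory sketch. Here Python lists play the role of the partitions that would be flushed to disk, `hash(value) % k` plays the role of h, and a Python dict plays the role of the second hash function h2. The function name, the parameter names, and the sample rows are assumptions for illustration.

```python
from collections import defaultdict


def partitioned_hash_join(R, S, r_key, s_key, k=4):
    """Join R and S on R[r_key] == S[s_key] by partitioning, then build and probe."""
    # Phase 1: partition both inputs with the same hash function h.
    r_parts = defaultdict(list)
    s_parts = defaultdict(list)
    for r in R:
        r_parts[hash(r[r_key]) % k].append(r)
    for s in S:
        s_parts[hash(s[s_key]) % k].append(s)

    # Phase 2: for each partition pair, build a table on R_l and probe it with S_l.
    output = []
    for l in range(k):
        table = defaultdict(list)  # the dict's own hashing stands in for h2
        for r in r_parts[l]:
            table[r[r_key]].append(r)
        for s in s_parts[l]:
            for r in table.get(s[s_key], []):
                output.append((r, s))
    return output


# Hypothetical rows: Sailors as (sid, sname), Reserves as (sid, bid).
sailors = [(22, "Dustin"), (31, "Lubber"), (58, "Rusty")]
reserves = [(22, 101), (58, 103), (22, 102)]
pairs = partitioned_hash_join(sailors, reserves, r_key=0, s_key=0)
```

Matching tuples always land in the same partition number because both inputs are partitioned with the same function, so each partition pair can be joined independently.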
Hash Join Plan
HASH JOIN
  TABLE ACCESS FULL OF table_A
  TABLE ACCESS FULL OF table_B
When Are Hash Joins Used?
• If tables are small
• If results should be returned online
Sort-Merge Join Plan
MERGE JOIN
  SORT JOIN
    TABLE ACCESS FULL OF table_A
  SORT JOIN
    TABLE ACCESS FULL OF table_B
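The plan above sorts both inputs on the join column (the two SORT JOIN steps) and then merges them. A minimal sketch of that idea follows; the function name and sample rows are assumptions, and everything is held in memory.

```python
def sort_merge_join(A, B, a_key, b_key):
    """Sort both inputs on the join key, then merge them in one pass."""
    A = sorted(A, key=lambda row: row[a_key])  # SORT JOIN on table_A
    B = sorted(B, key=lambda row: row[b_key])  # SORT JOIN on table_B
    i = j = 0
    output = []
    while i < len(A) and j < len(B):
        ka, kb = A[i][a_key], B[j][b_key]
        if ka < kb:
            i += 1
        elif ka > kb:
            j += 1
        else:
            # Find the run of B rows sharing this key, then pair it with every matching A row.
            j_end = j
            while j_end < len(B) and B[j_end][b_key] == ka:
                j_end += 1
            while i < len(A) and A[i][a_key] == ka:
                for jj in range(j, j_end):
                    output.append((A[i], B[jj]))
                i += 1
            j = j_end
    return output


# Hypothetical rows: Sailors as (sid, sname), Reserves as (sid, bid).
sailors = [(58, "Rusty"), (22, "Dustin"), (31, "Lubber")]
reserves = [(22, 101), (58, 103), (22, 102)]
pairs = sort_merge_join(sailors, reserves, a_key=0, b_key=0)
```

Both sorts must finish before the first pair is produced, so this join also belongs to the operations that read everything before returning a row.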
When Are Sort/Merge Joins Used?
• Performs badly when tables are
of unequal size. Why?
Hints
• You can give the optimizer hints about
how to perform query evaluation
• Hints are written in /*+ */ right after
the select
• Note: These are only hints. The Oracle
optimizer can choose to ignore your
hints
Examples
Select /*+ FULL (sailors) */ sid
From sailors
Where sname='Joe';

Select /*+ INDEX (sailors) */ sid
From sailors
Where sname='Joe';

Select /*+ INDEX (S s_ind) */ S.sid
From sailors S, reserves R
Where S.sid=R.sid AND sname='Joe';
More Examples
Select /*+ USE_NL (S) */ S.sid
From sailors S, reserves R
Where S.sid=R.sid AND sname='Joe';

Select /*+ USE_MERGE (S R) */ S.sid
From sailors S, reserves R
Where S.sid=R.sid AND sname='Joe';

Select /*+ USE_HASH (S R) */ S.sid
From sailors S, reserves R
Where S.sid=R.sid AND sname='Joe';

• In USE_NL, the table named in the hint is used as the inner table
• When a table has an alias in the query, the hint must name the alias
Information Retrieval and DB
CONTAINS
• Introduce text search in SQL
• CONTAINS operator
select Name
from article
where CONTAINS(abstract, 'play') > 0;
• Search terms can be combined with OR, AND
Stemming
• Given the “stem” of a word, Oracle will
expand the list of words to search for
to include all words having the same
stem
– Stem of plays, played, playing, playful:
play
– where CONTAINS(abstract, '$play') > 0;
Ranking
• We need to rank the retrieved
tuples according to their relevance
– Open challenge
– Several implementations for Oracle
The following slides are based on those of Dr. Sara Cohen
The Vector Space Model
• The Vector Space Model (VSM) is a way of representing text data through the words that it contains
• It is a standard technique in Information Retrieval
• In the following, we call each piece of text data a document (as in classical IR)
• The VSM allows decisions to be made about which documents are similar to each other and to keyword queries
How Does it Work?
• Each document is represented as a vector
which contains a value for each word in the
vocabulary
– this value is 0, if the word does not appear in the
document
• Similarly, a query is represented as a vector
• The rank of the document with respect to the
query is the distance between their vectors
Example: Boolean Value
• P1 = “I live in a green
house with a green roof”
• P2 = “There is no life
form on Mars”
• P3 = “Men love green
cars”
• P4 = “I saw some little
green men yesterday”
1 if the word appears, 0 otherwise

word    P1  P2  P3  P4
a        1   0   0   0
cars     0   0   1   0
green    1   0   1   1
house    1   0   0   0
I        0   0   0   1
is       1   1   0   0
life     0   1   0   0
little   1   0   0   1
love     0   0   1   0
mars     0   1   0   0
men      0   0   1   1
my       1   0   0   0
no       0   1   0   0
on       1   1   0   0
roof     1   0   0   0
saw      0   0   0   1
there    1   1   0   0
The P1 column of the table is the vector for P1.
Example: Boolean Value
• Q = green OR men OR mars

word    Query
a        0
cars     0
green    1
house    0
I        0
is       0
life     0
little   0
love     0
mars     1
men      1
my       0
no       0
on       0
roof     0
saw      0
there    0
Distance Between Vectors
• For two vectors d and d' the cosine distance between d and d' is given by:

$$\cos(d, d') = \frac{d \cdot d'}{|d| \, |d'|}$$

• d · d' is the scalar product of d and d', calculated by multiplying corresponding values together
• |d| is the norm of d
• The “cosine measure” calculates the cosine between the vectors in a high-dimensional virtual space
Distance Between Documents
[Figure: document vectors d1 to d5 drawn in a space with term axes t1, t2, t3; θ and φ mark angles between vectors]
Example
• Consider the query Q="green men" and the document P3 = "Men love green cars"

word    P3  Query
cars     1    0
green    1    1
love     1    0
men      1    1

Only dimensions that are non-zero in one of the vectors are shown
• The cosine distance:
– scalar product: $1 \cdot 0 + 1 \cdot 1 + 1 \cdot 0 + 1 \cdot 1 = 2$
– norms: $\sqrt{1^2 + 1^2 + 1^2 + 1^2} = 2$ and $\sqrt{0^2 + 1^2 + 0^2 + 1^2} = \sqrt{2}$
– Similarity: $2 / (2\sqrt{2}) = 1/\sqrt{2}$
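The calculation in this example can be checked with a few lines of code. This is a small sketch using plain Python lists as vectors; the function name `cosine` and the dimension order (cars, green, love, men) are choices made here for illustration.

```python
import math


def cosine(u, v):
    """Cosine between two equal-length vectors: scalar product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Dimensions in order: cars, green, love, men.
p3 = [1, 1, 1, 1]      # "Men love green cars"
query = [0, 1, 0, 1]   # "green men"

similarity = cosine(p3, query)  # equals 1 / sqrt(2)
```

Dimensions that are zero in both vectors contribute nothing to the scalar product or to either norm, which is why they can be left out.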
Defining Vector Values: TF
• Instead of a boolean value, put the word frequency (called tf, for "term frequency")
• What effect does this give?
• Sometimes a normalized version is used:
– term frequency / number of words in the document

word    P1  P2  P3  P4
a        1   0   0   0
cars     0   0   1   0
green    2   0   1   1
house    1   0   0   0
I        0   0   0   1
is       1   1   0   0
life     0   1   0   0
little   1   0   0   1
love     0   0   1   0
mars     0   1   0   0
men      0   0   1   1
my       1   0   0   0
no       0   1   0   0
on       1   1   0   0
roof     1   0   0   0
saw      0   0   0   1
there    1   1   0   0

Normalized TF (the values in each column always sum to 1)

word    P1    P2      P3    P4
a       0.1   0       0     0
cars    0     0       0.25  0
green   0.2   0       0.25  0.2
house   0.1   0       0     0
I       0     0       0     0.2
is      0.1   0.1667  0     0
life    0     0.1667  0     0
little  0.1   0       0     0.2
love    0     0       0.25  0
mars    0     0.1667  0     0
men     0     0       0.25  0.2
my      0.1   0       0     0
no      0     0.1667  0     0
on      0.1   0.1667  0     0
roof    0.1   0       0     0
saw     0     0       0     0.2
there   0.1   0.1667  0     0
Another Option: Defining Vector Values as IDF
• We can combine TF with IDF, inverse document frequency
– 1/(number of documents containing the word)
• What is the effect?

word    P1      P2   P3      P4
a       1       0    0       0
cars    0       0    1       0
green   0.3333  0    0.3333  0.3333
house   1       0    0       0
I       0       0    0       1
is      0.5     0.5  0       0
life    0       1    0       0
little  0.5     0    0       0.5
love    0       0    1       0
mars    0       1    0       0
men     0       0    0.5     0.5
my      1       0    0       0
no      0       1    0       0
on      0.5     0.5  0       0
roof    1       0    0       0
saw     0       0    0       1
there   0.5     0.5  0       0
Normalized IDF
• Sometimes a normalized version is used:

$$\mathrm{idf}_w = \log \frac{N}{n_w}$$

– N is the number of documents
– n_w is the number of documents in which w appears
• The logarithm gives less influence to IDF when TF and IDF are combined
• What is the value for a word that appears in all documents? Why?

word    P1      P2      P3      P4
a       0.6021  0       0       0
cars    0       0       0.6021  0
green   0.1249  0       0.1249  0.1249
house   0.6021  0       0       0
I       0       0       0       0.6021
is      0.301   0.301   0       0
life    0       0.6021  0       0
little  0.301   0       0       0.301
love    0       0       0.6021  0
mars    0       0.6021  0       0
men     0       0       0.301   0.301
my      0.6021  0       0       0
no      0       0.6021  0       0
on      0.301   0.301   0       0
roof    0.6021  0       0       0
saw     0       0       0       0.6021
there   0.301   0.301   0       0
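The normalized IDF formula can be tried out directly. The sketch below assumes a base-10 logarithm, which is inferred from the table values (for example 0.6021 for a word found in 1 of 4 documents) rather than stated on the slide. The `doc_freq` entries are read off the example table, and the function name `idf` is hypothetical.

```python
import math

N = 4  # number of documents in the example (P1 to P4)

# n_w for a few words, read off the example table.
doc_freq = {"green": 3, "men": 2, "is": 2, "cars": 1, "mars": 1}


def idf(n_w, n_docs):
    """Normalized IDF: log of (number of documents / documents containing the word)."""
    return math.log10(n_docs / n_w)


idf_values = {word: round(idf(n_w, N), 4) for word, n_w in doc_freq.items()}
```

Rarer words receive larger values, so a match on "cars" counts for more than a match on "green".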
Standard Measure is TF-IDF
• Use normalized TF times normalized IDF
• Note: Once the values are chosen (using any of the schemes considered), we use cosine distance to compare the document and query

word    P1       P2      P3      P4
a       0.06021  0       0       0
cars    0        0       0.1505  0
green   0.02498  0       0.0312  0.025
house   0.06021  0       0       0
I       0        0       0       0.1204
is      0.0301   0.0502  0       0
life    0        0.1003  0       0
little  0.0301   0       0       0.0602
love    0        0       0.1505  0
mars    0        0.1003  0       0
men     0        0       0.0753  0.0602
my      0.06021  0       0       0
no      0        0.1003  0       0
on      0.0301   0.0502  0       0
roof    0.06021  0       0       0
saw     0        0       0       0.1204
there   0.0301   0.0502  0       0
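Putting the pieces together, the sketch below computes TF-IDF weights from the term counts in the TF table and ranks the four documents against a query using the cosine measure. It assumes a base-10 logarithm (inferred from the table values) and a boolean query vector. The names `tf_idf`, `cosine`, and the query "green men" applied to all four documents are choices made here for illustration.

```python
import math

# Term counts per document, taken from the TF table.
docs = {
    "P1": {"a": 1, "green": 2, "house": 1, "is": 1, "little": 1,
           "my": 1, "on": 1, "roof": 1, "there": 1},
    "P2": {"is": 1, "life": 1, "mars": 1, "no": 1, "on": 1, "there": 1},
    "P3": {"cars": 1, "green": 1, "love": 1, "men": 1},
    "P4": {"green": 1, "I": 1, "little": 1, "men": 1, "saw": 1},
}


def tf_idf(docs):
    """Weight of word w in a document = normalized TF times log10(N / n_w)."""
    n_docs = len(docs)
    doc_freq = {}
    for counts in docs.values():
        for word in counts:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    weights = {}
    for name, counts in docs.items():
        total = sum(counts.values())
        weights[name] = {word: (count / total) * math.log10(n_docs / doc_freq[word])
                         for word, count in counts.items()}
    return weights


def cosine(u, v):
    """Cosine between two sparse vectors stored as dicts; 0.0 if either is all zeros."""
    dot = sum(value * v.get(word, 0.0) for word, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


weights = tf_idf(docs)
query = {"green": 1.0, "men": 1.0}  # boolean vector for the query "green men"
ranking = sorted(docs, key=lambda name: cosine(weights[name], query), reverse=True)
```

Storing each vector as a dict of its non-zero entries keeps the code short, since most words do not occur in most documents.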
XML (Extensible Markup Language) and the Semi-Structured Data Model
Motivation
• We have seen that relational databases
are very convenient to query. However:
– There is a LOT of data not in relational
databases!!
• Perhaps the most widely accessed
database is the web, and it certainly
isn’t a relational database.
Querying the Web
• The web can be queried using a search engine; however, we can’t ask questions like:
– What is the lowest price for which a Jaguar is sold on the web?
• Problems:
– There are no facilities for asking complex questions, such as aggregation of data
Understanding the Web
• In order to query the web, we must be
able to understand it.
• 2 Computer Science Approaches:
– Artificial Intelligence Approach
– Database Approach
Database Approach
“The web is unstructured and we will structure it”
• Sometimes problems that are very difficult can be solved easily by enforcing a standard
• Encourage the use of XML as a standard for data exchange on the web
<addresses >
<person friend="yes">
<name> Jeff Cohen</name>
<tel> 04-828-1345 </tel>
<tel> 054-470-778 </tel>
<email> [email protected] </email>
</person>
<person friend="no">
<name> Irma Levy</name>
<tel> 03-426-1142 </tel>
<email>[email protected]</email>
</person>
</addresses>
Example XML Document
Labels on the slide: Opening Tag, Attribute, Element, Closing Tag
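To see how elements and attributes of such a document can be read by a program, here is a small sketch using Python's standard `xml.etree.ElementTree` module. The document string is a trimmed copy of the example above; the email elements are left out because their values are not readable in the source.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the address-book example (email elements omitted).
doc = """<addresses>
  <person friend="yes">
    <name>Jeff Cohen</name>
    <tel>04-828-1345</tel>
    <tel>054-470-778</tel>
  </person>
  <person friend="no">
    <name>Irma Levy</name>
    <tel>03-426-1142</tel>
  </person>
</addresses>"""

root = ET.fromstring(doc)

# Attributes are read with get(); child element text with findtext().
friends = [p.findtext("name") for p in root.findall("person") if p.get("friend") == "yes"]

# An element may repeat: count the <tel> children of each person.
tel_counts = {p.findtext("name"): len(p.findall("tel")) for p in root.findall("person")}
```

Unlike a relational tuple, a person here may carry any number of tel elements, which is part of what makes the data semi-structured.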
Very Unstructured XML
<?xml version="1.0"?>
<DamageReport>
The insured’s <Vehicle Make="Toyota">
Corolla </Vehicle> broke through the guard rail and plummeted into the ravine. The cause was determined to be <Cause>faulty brakes </Cause>. Amazingly there were no casualties.
</DamageReport>
XML Vs. HTML
• XML and HTML are brothers. They are both
special cases of SGML.
• HTML has specific tag and attribute names.
These are associated with a specific meaning
• XML can have any tag and attribute name.
These are not associated with any meaning
• HTML is used to specify visual style
• XML is used to specify meaning
A Different Data Model

                 Relational       Semi-Structured
Abstract Model   Sets of tuples   Labeled Directed Graph
Concrete Model   Tables           XML Documents
Standard for     Storing Data     Data Exchange, Separating Content from Style
Data Exchange
• Problem: Many data sources, each of a different type (different vendor), with a different schema.
– How can the data be combined and used together?
– How can different companies collaborate on their data?
– What format should be used to exchange the data?
Separating Content from Style
• Web sites develop over time
• Important to separate style from data in order to allow changes to the site structure and appearance
• Using XML, we can store data alone
• CSS separates style from data only in a limited way
• Using XSL, this data can be translated into HTML
• The data can be translated differently as the site develops
Write Once Use Everywhere
[Figure: the same XML data is translated by different XSL stylesheets into WML (hand-held devices), HTML (web browser), and TEXT (Excel)]
Using XML
• Querying and Searching XML: There are query languages and search engines that query XML and return XML. Examples: XPath, XQuery/SQL4X, Equix, XSEarch
• Displaying XML: An XML document can have an associated style-sheet which specifies how the document should be translated to HTML. Examples: CSS, XSL
DTD: Document Type Definitions
• Document Type Definitions (DTDs)
impose structure on an XML
document
• There is some relationship
between a DTD and a schema