SEWI ZG514 Data Warehousing

Purushotham BV, utham74@gmail.com

Performance Enhancing Techniques:
• Partitioning Strategy
• Aggregation




Performance Enhancing Techniques

• Partitioning Strategy
– Introduction
– Horizontal Partitioning
– Vertical Partitioning
– Hardware Partitioning
– Which Key to Partition by?
– Sizing the Partition

• Aggregations
– Introduction
– Why Aggregate?
– What is an Aggregation?
– Designing Summary Tables
– Which Summaries to create?

• Summary


Partitioning

• Partitioning is performed for a number of performance-related and manageability reasons, and the strategy as a whole must balance all the various requirements.

• Partitioning is needed in any large data warehouse to improve performance and manageability.

• It allows queries to be redirected to the appropriate partition, thereby reducing the overall time taken for query processing.

• Three types:
1. Horizontal Partitioning
2. Vertical Partitioning
3. Hardware Partitioning


Horizontal Partitioning

• Partition a fact table into segments.

• The table is split by rows: for example, the first few thousand entries go into one partition, the next few thousand into another, and so on.

• This works because, in most cases, not all the information in the fact table is needed all the time.

• Horizontal partitioning therefore reduces query access time by directly cutting down the amount of data to be scanned by the queries.

• Horizontally partitioning the fact table is a good way to speed up queries, because it minimizes the set of data to be scanned (without using an index).
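The effect of horizontal partitioning on the amount of data scanned can be sketched in Python. This is an illustration under stated assumptions, not tied to any particular database: the fact table is held as a dict of monthly segments, and the names `partition_key`, `load`, and `scan`, as well as the sample rows, are hypothetical.

```python
from collections import defaultdict
from datetime import date

def partition_key(d: date) -> tuple[int, int]:
    """Assumed scheme: one horizontal partition per calendar month."""
    return (d.year, d.month)

def load(rows):
    """Split fact rows (sale_date, amount) into monthly segments."""
    partitions = defaultdict(list)
    for sale_date, amount in rows:
        partitions[partition_key(sale_date)].append((sale_date, amount))
    return partitions

def scan(partitions, start: date, end: date):
    """Sum amounts in [start, end], reading only the partitions that can match."""
    total, rows_read = 0, 0
    for key, segment in partitions.items():
        if partition_key(start) <= key <= partition_key(end):
            for sale_date, amount in segment:
                rows_read += 1
                if start <= sale_date <= end:
                    total += amount
    return total, rows_read

# One hypothetical sale per month of 2023.
rows = [(date(2023, m, 15), 100 * m) for m in range(1, 13)]
facts = load(rows)
total, rows_read = scan(facts, date(2023, 12, 1), date(2023, 12, 31))
print(total, rows_read)  # 1200 1: only the December segment is read
```

The query for December touches one segment out of twelve, without any index on the date column.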


Horizontal Partitioning (Contd.)

• Each segment may be of a different size, because the number of transactions within the business is not the same at every point in the year.

• Example
– If the sales fact table is partitioned monthly, the partitions for peak periods such as Christmas hold a higher transaction volume than the others.


Horizontal Partitioning

• Fact data can be partitioned in various ways. Before deciding on the optimum solution, we have to consider the requirements for manageability of the data warehouse.

1. Partitioning by Time into Different-Sized Segments
2. Partitioning on a Different Dimension
3. Partitioning by Size of Table
4. Using Round Robin Partitions


Partitioning by Time into Equal Segments

• Partition the fact table on a time period basis.

• Example
– Partitioning into monthly segments.
– The number of tables does not exceed the order of 500.
– A number of the partitions will store transactions over a busy period in the business, and the rest may be substantially smaller.

• This is the most straightforward method: partitioning by months, years, etc.


Partitioning by Time into Equal Segments (Contd.)

• This helps if queries often concern fortnightly or monthly performance, sales, etc.


Advantages and Disadvantages

• The advantage is that the slots are reusable.
– Suppose we are sure that we will no longer need the data from 10 years back; then we can simply delete the data in that slot and use it again.

• The scheme has a serious drawback if the partitions tend to differ too much in size.
– The number of visitors to a hill station in the summer months, say, will be much larger than in the winter months, so each segment has to be big enough to take care of the summer rush.
– Partitioning the table into same-sized segments would therefore mean wasted space in the winter-month partitions.
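The space wasted by equal-sized slots can be estimated with a little arithmetic. The sketch below uses invented monthly row counts (in thousands) for a summer-peaked business; the figures are hypothetical and only illustrate the calculation.

```python
# Hypothetical monthly row counts, in thousands, January to December.
monthly_rows = [20, 25, 30, 60, 90, 100, 95, 70, 40, 30, 25, 20]

slot_size = max(monthly_rows)              # every equal-sized slot must fit the peak month
capacity = slot_size * len(monthly_rows)   # space reserved for the whole year
used = sum(monthly_rows)                   # space actually filled
wasted = capacity - used

print(f"slot size: {slot_size}k rows, used: {used}k, wasted: {wasted}k")
print(f"utilisation: {used / capacity:.0%}")
```

With these assumed figures roughly half of the reserved space stays empty, which is the drawback described above.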


Partitioning by Time into Different-Sized Segments

• Three monthly partitions for the last three months (including current month).

• One quarterly partition for the previous quarter.

• One half-year partition for the remainder of the year.
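One possible reading of this profile can be sketched as a function that derives the five partitions from the current date. The rolling 12-month window and the names `months_back` and `partition_profile` are assumptions of this sketch, not something the slides specify.

```python
from datetime import date

def months_back(year: int, month: int, n: int) -> tuple[int, int]:
    """Return the (year, month) that lies n months before the given month."""
    index = year * 12 + (month - 1) - n
    return index // 12, index % 12 + 1

def partition_profile(today: date):
    """Trailing 12 months as 3 monthly + 1 quarterly + 1 half-year partition.

    Each entry is (name, first_month, last_month), months given as (year, month).
    The rolling 12-month window is an assumption of this sketch.
    """
    y, m = today.year, today.month
    profile = []
    for n in range(3):                     # current month and the two before it
        ym = months_back(y, m, n)
        profile.append(("monthly", ym, ym))
    profile.append(("quarterly", months_back(y, m, 5), months_back(y, m, 3)))
    profile.append(("half-year", months_back(y, m, 11), months_back(y, m, 6)))
    return profile

for name, first, last in partition_profile(date(2023, 12, 10)):
    print(name, first, last)
```

Calling the function a month later returns different boundaries, so the data has to be repartitioned whenever the window moves.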


Advantages and Disadvantages

1. Detailed information remains available online, without having to resort to using aggregations.

2. Number of physical tables is kept relatively small, reducing operating costs.

This technique may be particularly appropriate in environments that require a mix of data dipping into recent history and data mining through the entire history.

3. The partitioning profile will change on a regular basis

4. Repartitioning will increase the operational cost of the data warehouse.


Partitioning on a Different Dimension

• Stored data need not always be partitioned by time, though that is a very safe and relatively straightforward method.

• It can be partitioned by the different regions of operation, by the different items under consideration, or by any other such dimension.

• This suits cases where most of the queries are likely to be about region-wise performance, region-wise sales, etc.


Partitioning on a Different Dimension (Contd.)

• If we are interested in the total performance of all regions, the total sales of a month, the total sales of a product, etc., then region-wise partitioning could be a disadvantage, since each such query will have to move across several partitions.


Partitioning by Size of Table

• Sometimes we cannot be sure of any dimension on which partitions can be made.

• Neither time nor products nor regions, etc., offers a clear basis.

• We are sure of the type of queries that we are likely to encounter frequently.

• In such cases, it is ideal to partition by size.

• Load the data until a pre-specified amount of memory is consumed, then create a new partition.

• However, this creates a very complex situation, similar to simply dumping objects in a room.

• Normally, metadata (data about data) is needed to keep track of which data is stored in each of the partitions.
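A minimal sketch of size-based partitioning with a metadata catalogue follows. A row-count limit stands in for the pre-specified memory limit, and the class `SizePartitionedTable` with its methods is hypothetical.

```python
class SizePartitionedTable:
    """Rows go into the current partition until it holds max_rows, then a new one starts."""

    def __init__(self, max_rows: int):
        self.max_rows = max_rows   # stands in for the pre-specified memory limit
        self.partitions = []       # list of partitions, each a list of (key, value)
        self.metadata = []         # per partition: [lowest key, highest key]

    def insert(self, key, value):
        # Open a new partition when there is none yet or the current one is full.
        if not self.partitions or len(self.partitions[-1]) >= self.max_rows:
            self.partitions.append([])
            self.metadata.append([key, key])
        self.partitions[-1].append((key, value))
        low, high = self.metadata[-1]
        self.metadata[-1] = [min(low, key), max(high, key)]

    def candidates(self, key):
        """Partition numbers whose recorded key range could contain `key`."""
        return [i for i, (low, high) in enumerate(self.metadata) if low <= key <= high]

table = SizePartitionedTable(max_rows=3)
for key in [5, 1, 9, 2, 8, 3, 7]:
    table.insert(key, f"row-{key}")
print(table.metadata)       # [[1, 9], [2, 8], [7, 7]]
print(table.candidates(2))  # [0, 1]
```

Because rows arrive in no particular order, the key ranges of the partitions overlap and a lookup may have to search more than one of them. This is the "objects dumped in a room" problem, and the metadata only narrows the search.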


Using Round Robin Partitions

• Once the warehouse is holding its full amount of data, a new partition can be provided only by reusing the oldest partition.

• Metadata is then needed to note the beginning and end of the historical data.

• This method, though simple, may run into trouble if the partitions are not all the same size.

• Special techniques to hold the overflowing data may become necessary.
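Round robin reuse can be sketched as a fixed pool of slots with a small piece of metadata recording which periods are currently held. The class `RoundRobinPartitions` and the period labels are hypothetical, and the sketch ignores the overflow problem mentioned above.

```python
from collections import deque

class RoundRobinPartitions:
    """Fixed pool of partition slots; the oldest slot is reused for new data."""

    def __init__(self, slots: int):
        self.slots = slots
        self.active = deque()      # (period, slot_number), oldest first

    def add_period(self, period: str) -> int:
        """Assign a slot to a new period and return the slot number."""
        if len(self.active) < self.slots:
            slot = len(self.active)            # a free slot is still available
        else:
            _, slot = self.active.popleft()    # drop the oldest period, reuse its slot
        self.active.append((period, slot))
        return slot

    def history_range(self):
        """Metadata: the first and last period currently held."""
        return self.active[0][0], self.active[-1][0]

rr = RoundRobinPartitions(slots=3)
for period in ["2023-01", "2023-02", "2023-03", "2023-04"]:
    print(period, "-> slot", rr.add_period(period))
print(rr.history_range())  # ('2023-02', '2023-04')
```

The fourth period lands in slot 0, which previously held the oldest data, and the recorded history range moves forward accordingly.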


Vertical Partitioning

• As the name suggests, a vertical partitioning scheme divides the table vertically – i.e. each row is divided into 2 or more partitions.


Vertical Partitioning (Contd.)

• Consider the following table:


Normalization

• The usual approach to normalization in database applications is to divide the data into two or more tables, so that updating the data in one of them does not lead to data anomalies.


Row Splitting

• The method involves identifying the less frequently used fields and putting them into another table.

• This ensures that the frequently used fields can be accessed with much less computation time.
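Row splitting can be sketched as follows. The choice of which fields count as frequently used, the sample customer rows, and the helper names `split_rows` and `rebuild` are all assumptions made for illustration.

```python
def split_rows(rows, hot_fields, key="id"):
    """Split each row into a frequently used part and a rarely used part.

    Both parts keep the key so the original row can be rebuilt with a join.
    """
    hot, cold = {}, {}
    for row in rows:
        k = row[key]
        hot[k] = {f: v for f, v in row.items() if f in hot_fields or f == key}
        cold[k] = {f: v for f, v in row.items() if f not in hot_fields or f == key}
    return hot, cold

def rebuild(hot, cold, k):
    """Join the two partitions back into the full row."""
    return {**hot[k], **cold[k]}

customers = [
    {"id": 1, "name": "Asha", "region": "South", "fax": "n/a", "notes": "prefers email"},
    {"id": 2, "name": "Ravi", "region": "North", "fax": "n/a", "notes": ""},
]
hot, cold = split_rows(customers, hot_fields={"name", "region"})
print(hot[1])                                 # narrow row used by most queries
print(rebuild(hot, cold, 1) == customers[0])  # True
```

Most queries read only the narrow `hot` rows; the full row is available through a join on the key when the rarely used fields are needed.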


Hardware Partitioning

• The data warehouse design process should try to maximize the performance of the system.

• One of the ways to ensure this is to optimize the database design with respect to the specific hardware architecture.

• The exact details of the optimization depend on the hardware platform.

• Normally the following guidelines are useful:
– maximize the processing power availability,
– maximize disk and I/O operations,
– reduce bottlenecks at the CPU and in I/O throughput.


Maximizing the Processing and Avoiding Bottlenecks

• One of the ways of ensuring faster processing is to split the data query into several parallel queries, convert them into parallel threads and run them in parallel.

• This method works only when there is a sufficient number of processors, or sufficient processing power, to ensure that they can actually run in parallel.

• Example:
– To run five threads, it is not always necessary to have five processors.
– To ensure optimality, even a smaller number of processors should be able to do the job, provided they can do it fast enough to avoid bottlenecks at the processor.
– Shared architectures are ideal for such situations, because one can be almost sure that sufficient processing power is available most of the time.


Maximizing the Processing and Avoiding Bottlenecks

• In such a networked environment, where each of the processors is able to access data on several active disks, several problems of data contention and data integrity need to be resolved.
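The idea of splitting one query into parallel sub-queries can be sketched with Python's standard thread pool. The functions `sub_query` and `parallel_total` and the sample segments are hypothetical. In standard CPython builds the global interpreter lock keeps CPU-bound threads from running truly in parallel, so the sketch shows the structure of the approach rather than a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

def sub_query(segment):
    """One sub-query: aggregate a single segment of the fact data."""
    return sum(segment)

def parallel_total(segments, workers):
    """Run one sub-query per segment on a pool of worker threads and combine the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sub_query, segments))

# Five segments of 1000 values each, covering 0..4999.
segments = [list(range(i, i + 1000)) for i in range(0, 5000, 1000)]
print(parallel_total(segments, workers=2))  # five sub-queries served by two workers
```

Five sub-queries are served by only two workers here, which mirrors the point that the number of processors need not equal the number of threads.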


Striping Data Across MPP Nodes

• This mechanism distributes the data by dividing a large table into several smaller units and storing each of them on a different disk.

• These sub-tables need not be of equal size, but are distributed so as to ensure optimum query performance.

• The trick is to ensure that the queries are directed to the respective processors, which access the corresponding data disks to service the queries.


Striping Data Across MPP Nodes (Contd.)

• The method is unsuitable for smaller data volumes.
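One common way to decide which node holds which row is to hash a key. The slides do not name a distribution method, so hash distribution is an assumption of the sketch below, as are the function names and the sample rows.

```python
import zlib

def node_for(key: str, nodes: int) -> int:
    """Pick a node with a stable hash so the same key always lands on the same node."""
    return zlib.crc32(key.encode("utf-8")) % nodes

def stripe(rows, nodes: int):
    """Distribute (key, value) rows across the disks of `nodes` MPP nodes."""
    disks = [[] for _ in range(nodes)]
    for key, value in rows:
        disks[node_for(key, nodes)].append((key, value))
    return disks

def lookup(disks, key: str):
    """Direct the query to the one node that owns the key."""
    owner = node_for(key, len(disks))
    return owner, [v for k, v in disks[owner] if k == key]

rows = [(f"cust-{i}", i * 10) for i in range(100)]
disks = stripe(rows, nodes=4)
print([len(d) for d in disks])    # segment sizes need not be equal
print(lookup(disks, "cust-42"))   # (owning node, matching values)
```

A lookup by key is answered by a single node. With only a handful of rows the bookkeeping outweighs the benefit, which is consistent with the method being unsuitable for small data volumes.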


Horizontal Hardware Partitioning

• This technique spreads the processing load by horizontally partitioning the fact table into smaller segments and physically storing each segment on a different node.

• When a query needs to access several partitions, the access is done in a way similar to the methods above.

• If the query is parallelized, each sub-query can run on one of the other nodes.


Horizontal Hardware Partitioning (Contd.)

• This technique will minimize the traffic on the network.


Which Key to Partition By?

• Choosing the right key is crucial.

• If the wrong key is chosen, you will eventually end up having to totally reorganize your fact data.


Which Key to Partition By? (Contd.)

• The table could be partitioned on any key, possibly:
– region
– transaction_date

• Suppose the business is organized into 20 geographical regions, each with a varying number of branches of different sizes.

• Partitioning by region leads to 20 partitions, which is reasonable.

• This is a nice partitioning scheme, because the vast majority of queries are restricted to the user's own business region.


Which Key to Partition By? (Contd.)

• If the table is partitioned by transaction_date rather than region:
– All the latest transactions from every region will be in one partition.
– This is a problem, because a user who wants data by region has to look across multiple partitions.

• So partitioning by region is better.
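The trade-off between the two keys can be made concrete by counting how many partitions a query has to touch under each scheme. The data below (20 regions, 12 months, one row each) and the function names are hypothetical.

```python
from datetime import date

def partitions_touched(rows, key_fn, predicate):
    """Count the distinct partitions holding at least one row the query needs."""
    return len({key_fn(row) for row in rows if predicate(row)})

def by_region(row):
    """Partition key: the region code."""
    return row[0]

def by_month(row):
    """Partition key: (year, month) of the transaction date."""
    return (row[1].year, row[1].month)

def own_region(row):
    """Typical query: a user looking at one business region."""
    return row[0] == "R07"

def latest_month(row):
    """Cross-region query: all regions for the latest month."""
    return row[1].month == 12

# Hypothetical fact rows: (region, transaction_date).
rows = [(f"R{r:02d}", date(2023, m, 1)) for r in range(1, 21) for m in range(1, 13)]

print(partitions_touched(rows, by_region, own_region))    # 1
print(partitions_touched(rows, by_month, own_region))     # 12
print(partitions_touched(rows, by_region, latest_month))  # 20
print(partitions_touched(rows, by_month, latest_month))   # 1
```

When most queries are restricted to one region, the region key touches the fewest partitions. The last two lines show the opposite case: a cross-region total is cheaper under date partitioning, so the right key depends on the query mix.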


Sizing the Partition

• The size of partition to use is a key decision, and several considerations affect it.

• The SLA also acts as a limit on the size of any partitioning scheme

• A partition will most likely become the unit of backup and recovery

• The availability stipulations in the SLA will act as a limit on the size of a partition

• The disk setup used will act as a constraint on the number of partitions you can use

• Query performance is a major consideration


Summary

• Partitioning
• Horizontal Partitioning
• Vertical Partitioning
• Hardware Partitioning
• Which Key to Partition by?
• Sizing the Partition


Thank You