Improved Extraction mechanism in ETL process for building of a Data Warehouse
MPSTME, SVKM’s NMIMS, Mumbai
Chapter 3:
Results and observations Part-I
3.1 Flat file database
Experimental Setup
The experiments were run on:
Intel ® Core ™ i3 CPU, [email protected] GHz Processor
4.00 GB RAM
Windows XP Operating System
Summary of the sample/real DB for these Experimentations
Portfolios/ Scripts details from Moneycontrol.com
Balance Sheets of various Scripts from Religare Securities
National Stock Exchange www.nseindia.com
Bombay Stock Exchange www.bseindia.com
Figure 2 Scripts Sold Data
Figure 3 Portfolio Data
Figure 4 Balance Sheet Data
Figure 5 History and Alerts Data
Figure 6 One Year Backup Basic Data
Figure 7 One Year Advanced in detailed Data
Files are called “flat files” when they contain a single data structure. Generally this
structure is a row-and-column layout like a spreadsheet or table, but a binary file, or a file
encrypted with a single encryption key, could also be called a flat file. Files that are not flat
include marked-up files such as XML or HTML, EDI files, and other formats such as HL7 or SEF.
There are two types of flat file: delimited files and fixed width files.
Delimited File
A delimited file is a file in which the data is organized in rows and columns. Each row holds
one set of data, and each column holds one type of data, which is the same layout as a
spreadsheet. To form the columns, the values in each row are separated by a character called a
delimiter. Figure 8 shows a delimited flat file, Figure 9 a fixed length delimited flat file, and
Figure 10 a variable length delimited flat file.
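As a minimal sketch of how such a file is read, the Python snippet below parses the first six columns of three portfolio records listed in this chapter, using the standard csv module. The function name parse_delimited is illustrative only and is not part of the implementation described in this work.

```python
import csv
import io

# First six columns of three comma-delimited portfolio records from this chapter.
SAMPLE = """Stock,Price,Change,Close,Volume,Qty
Reliance,748.1,-3.45,751.55,2.40m,1
Alok Industries,19.65,0.05,19.6,2.73m,1
DCB,49.9,1.75,48.15,6.42m,100
"""


def parse_delimited(text, delimiter=","):
    """Return one dict per data row, keyed by the column names in the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


rows = parse_delimited(SAMPLE)
for row in rows:
    # Every value arrives as text; numeric columns must be converted explicitly.
    print(row["Stock"], float(row["Price"]) * int(row["Qty"]))
```

Note that every field is read as text, so a value such as 2.40m in the Volume column needs its own conversion rule before it can be treated as a number.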
Figure 8 Delimited Flat file
Figure 9 Fixed Length Delimited Flat file
Figure 10 Variable Length Delimited Flat file
Stock,Price,Change,Close,Volume,Qty,Inv.Price,Inv.Amt,Gain,Overall Gain,Latest Value
Reliance,748.1,-3.45,751.55,2.40m,1,799,799,-3,-51,748
Alok Industries,19.65,0.05,19.6,2.73m,1,16.55,17,0,3,20
FCS Software,0.45,0.05,0.4,209633,9820,2.55,25041,491,-20622,4419
Bharti Airtel,317.05,-6.45,323.5,3.88m,1,425.2,425,-6,-108,317
DCB,49.9,1.75,48.15,6.42m,100,64,6400,175,-1410,4990
Punj Lloyd,55.65,1.1,54.55,2.01m,111,97.6,10834,122,-4657,6177
Unitech,28.6,0.15,28.45,8.54m,60,32.5,1950,9,-234,1716
Apollo Tyres,86.65,0.45,86.2,2.08m,1,81.8,82,0,5,87
GMR Infra,29.05,0.35,28.7,2.51m,1,27.15,27,0,2,29
Jaypee Infra,48.4,-0.2,48.6,104078,1,90,69,0,-21,48
Mahindra Satyam,78.9,0.45,78.45,1.97m,1,91,91,0,-12,79
ILandFS,26.75,-0.05,26.8,26293,20,29.25,585,-1,-50,535
Suzlon Energy,24.1,-0.05,24.15,9.72m,110,36.8,4048,-5,-1397,2651
Hind Constr,25.15,-0.15,25.3,1.83m,140,51.5,7210,-21,-3689,3521
Wire & Wireless,9.25,-0.05,9.3,338281,1000,14.01,14010,-50,-4760,9250
Flat files have some advantages over databases:
Available and versatile: We can create and save data in any operating system's file
system without installing any extra software. Additionally, text data stored in flat
files can be read by a variety of software programs, such as word processors or
spreadsheets [59].
Easy to use: We do not need any extra preparation, such as installing database software,
designing a database, or creating a database. We just create the file and store the data with
statements in a PHP script [54].
Smaller: Flat files store data using less disk space than databases [58].
A flat file is quick and easy and takes less space than a database. Flat files are particularly useful
for making information available to other software, such as an editing program or a spreadsheet.
Flat files can be looked at by anyone with access to the computer directory where they are stored,
so they are useful when information needs to be made available to other people.
Databases have some advantages as well:
Security: A database provides a security layer of its own, in addition to the security
provided by the operating system. A database protects the data from outside intrusion
better than a flat file.
Accessibility of data: We can store data in a database using a very complex data
structure, specifying data types and relationships among the data. This organization of the
data makes it easy to search the data and retrieve what is needed.
Ability to handle multiple users: When many users store or access data in a single file,
such as a file containing names and addresses, a database ensures that users take their turn
with the file to avoid overwriting each other's data.
Databases require more start-up effort and use more space than a flat file, but are much more
suitable for handling complex information. The database handles the internal organization of the
data, making data retrieval much simpler [50, 57]. A database provides more security, making it
more suitable for sensitive, private information. Databases can more easily and efficiently handle
high traffic when many users may try to access the data almost simultaneously [53].
Some disadvantages of flat file systems are listed below.
Disadvantages
1. Less security: information is easy to extract.
2. Data inconsistency.
3. Data redundancy.
4. Searching for a record is very time consuming.
This research work makes use of all the advantages of the flat file database and also tries to
nullify all the disadvantages mentioned above.
The following measures were taken to overcome these disadvantages.
a. This research work explicitly secures the file with a one-time password, so that security
is maintained while the file is transferred from the source to the destination. Using a
one-time password guards against password leakage and so protects the file against
hacking.
b. Data consistency is maintained using a change data capture mechanism, in which triggers
alert the destination system about record updates at the source systems. Because the
flat files are converted from the database file itself, the data is already consistent
before extraction [48, 51].
c. Data redundancy does not exist in the flat file because it is created from the database
file itself before extraction. Once the extraction is done, it is converted back to a
database file at the destination system [49, 52].
d. Search time is very high in an ordinary flat file because there are no constraints on the
records of the file. Redundant data and missing primary keys are the main reasons the
search time becomes very high. This research work does not use flat files for any kind
of operation. Flat files are used only during the transfer of files, and because the flat
file is created from the database file itself, most of the advantages of database files
also apply.
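The round trip described in items b to d (database table to flat file for transfer, then back to a database table at the destination) can be sketched as follows. This is an illustration under stated assumptions, not the implementation of this work: it uses Python's built-in sqlite3 module in place of MySQL, and the table name, column names, and helper functions are hypothetical. The three sample rows reuse stock, price, and quantity values from the portfolio data in this chapter.

```python
import csv
import io
import sqlite3

# Hypothetical table; the primary key rules out duplicate records at both ends.
SCHEMA = "CREATE TABLE portfolio (stock TEXT PRIMARY KEY, price REAL, qty INTEGER)"
QUERY = "SELECT stock, price, qty FROM portfolio ORDER BY stock"


def export_to_flat_file(conn, out):
    """Write every row of the portfolio table as comma-delimited text."""
    writer = csv.writer(out)
    writer.writerow(["stock", "price", "qty"])
    writer.writerows(conn.execute(QUERY))


def import_from_flat_file(conn, src):
    """Recreate the table at the destination and load the flat file into it."""
    conn.execute(SCHEMA)
    rows = [(r["stock"], float(r["price"]), int(r["qty"])) for r in csv.DictReader(src)]
    conn.executemany("INSERT INTO portfolio VALUES (?, ?, ?)", rows)
    conn.commit()


# Source system: an in-memory database standing in for the real source.
source = sqlite3.connect(":memory:")
source.execute(SCHEMA)
source.executemany(
    "INSERT INTO portfolio VALUES (?, ?, ?)",
    [("Reliance", 748.1, 1), ("DCB", 49.9, 100), ("Unitech", 28.6, 60)],
)

# The flat file exists only for the transfer step.
flat_file = io.StringIO()
export_to_flat_file(source, flat_file)

# Destination system: load the flat file back into a database table.
destination = sqlite3.connect(":memory:")
flat_file.seek(0)
import_from_flat_file(destination, flat_file)

print(destination.execute(QUERY).fetchall())
```

Because the primary key is declared again at the destination, a flat file that had picked up a duplicate record during transfer would be rejected on import rather than silently loaded.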
Fixed Width Flat File
There is another type of flat file, called a fixed width or fixed position file. It differs from
a delimited file in that the data fields are defined by character position. See the example
below.
In a fixed width file, the delimiter characters are eliminated. If the data is formatted so that
each field always has the same size, this format can be more compact than a delimited file. Here
the size of the Birthdate data is known, so all the spaces between the Bdate and Department
fields are eliminated. If all of the data were sized like this, the file could be made very
small, containing only the data.
Fixed width files also eliminate the problem of delimiters appearing in the data. When a comma
delimited file contains a field that has a comma in its data, the parser cannot tell, without
quoting or escaping, whether that comma is a delimiter or part of the data. That problem does
not arise in a fixed width file.
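The difference can be illustrated with a short Python sketch. The field layout (a 20-character name, an 8-character birth date, a 12-character department) and the sample record are assumptions made for this example only and are not taken from the data sets used in this work.

```python
import csv
import io

# Hypothetical layout: field name -> (start, end) character offsets.
LAYOUT = {"name": (0, 20), "bdate": (20, 28), "department": (28, 40)}


def format_fixed(record):
    """Pad (or truncate) each field to its column width and concatenate."""
    return "".join(
        str(record[field]).ljust(end - start)[: end - start]
        for field, (start, end) in LAYOUT.items()
    )


def parse_fixed(line):
    """Slice each field out by character position and strip the padding."""
    return {field: line[start:end].strip() for field, (start, end) in LAYOUT.items()}


# Made-up record whose name contains the comma that troubles delimited files.
record = {"name": "Doe, Jane", "bdate": "19850412", "department": "Finance"}

line = format_fixed(record)
print(repr(line))
print(parse_fixed(line))

# A naive split on commas sees four fields instead of three.
print(",".join(record.values()).split(","))

# A delimited writer has to quote the field to protect the embedded comma.
buffer = io.StringIO()
csv.writer(buffer).writerow(record.values())
print(buffer.getvalue().strip())
```

The fixed width line is always 40 characters long, so the comma in the name needs no special handling, whereas the delimited form depends on every reader honouring the quoting rule.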
Comparison
This is not a contest of which format is superior. Both file architectures are useful, and both
are used commonly enough that one needs to be at ease working with both. Delimited files are
easy to work with as long as the data is free of the delimiter character. For the quick data
integration common in ETL tasks, delimited files are far more common than fixed width files.
For continuous, unattended data integration and import operations, including unattended ETL,
fixed width or fixed position files are often found to be more reliable.
As with many things in integration work, we want to pick the best option. Knowing and working
with both fixed width and delimited files helps determine the right choice for the task at
hand.
3.2 Implementation
Experimental Results:
FRONT PAGE:
The main front page, designed to compare the extraction time for a flat file and a database file
while varying the number of records in the file, is shown in the following snapshots.
To compare the performance with respect to size and time, the number of records was varied from
100 to 10000. The performance of the flat file can be compared with that of the database file in
the following implementation snapshots. (For part of the implementation module code, refer to
Appendix-B.) The implementation was carried out in ASP.NET and MySQL.
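The shape of such a comparison can be sketched as below. This is not the ASP.NET and MySQL implementation of this work (see Appendix-B for that); it is a minimal Python illustration that uses the built-in sqlite3 module as a stand-in database, synthetic records, and hypothetical function names. It reports whatever times the local machine produces and does not reproduce the results shown in the figures.

```python
import csv
import os
import sqlite3
import tempfile
import time

RECORD_COUNTS = [100, 500, 1000, 2000, 5000, 10000]  # sizes used in this chapter


def make_records(n):
    """Generate n synthetic (id, stock, price) records; the data is made up."""
    return [(i, f"STOCK{i}", round(10 + i * 0.05, 2)) for i in range(n)]


def time_flat_file_extraction(path):
    """Read every record from a comma-delimited file; return (seconds, row count)."""
    start = time.perf_counter()
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle))
    return time.perf_counter() - start, len(rows)


def time_database_extraction(path):
    """Read every record from a database table; return (seconds, row count)."""
    start = time.perf_counter()
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute("SELECT id, stock, price FROM records").fetchall()
    finally:
        conn.close()
    return time.perf_counter() - start, len(rows)


def run_comparison(counts=RECORD_COUNTS):
    """Build a flat file and a database of each size, then time reading both."""
    results = []
    with tempfile.TemporaryDirectory() as workdir:
        for n in counts:
            records = make_records(n)
            csv_path = os.path.join(workdir, f"records_{n}.csv")
            db_path = os.path.join(workdir, f"records_{n}.db")

            with open(csv_path, "w", newline="") as handle:
                csv.writer(handle).writerows(records)

            conn = sqlite3.connect(db_path)
            conn.execute(
                "CREATE TABLE records (id INTEGER PRIMARY KEY, stock TEXT, price REAL)"
            )
            conn.executemany("INSERT INTO records VALUES (?, ?, ?)", records)
            conn.commit()
            conn.close()

            flat_seconds, flat_rows = time_flat_file_extraction(csv_path)
            db_seconds, db_rows = time_database_extraction(db_path)
            results.append((n, flat_rows, db_rows, flat_seconds, db_seconds))
    return results


for n, flat_rows, db_rows, flat_s, db_s in run_comparison():
    print(f"{n:>6} records  flat file: {flat_s:.6f} s  database: {db_s:.6f} s")
```

A single run per size is noisy; repeating each measurement several times and averaging would give steadier numbers.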
Figure 11: Main page for measuring the extraction time for a flat file and a database file.
Number of Records: 100
a) Database file extraction
Figure 12: Data extraction time result when the number of records is 100 in a database file
b) Flat file extraction
Figure 13: Data extraction time result when the number of records is 100 in a flat file
Number of Records: 500
a) Database file extraction
Figure 14: Data extraction time result when the number of records is 500 in a database file
b) Flat file extraction
Figure 15: Data extraction time result when the number of records is 500 in a flat file
Number of Records: 1000
a) Database file extraction
Figure 16: Data extraction time result when the number of records is 1000 in a database file
b) Flat file extraction
Figure 17: Data extraction time result when the number of records is 1000 in a flat file
Number of Records: 2000
a) Database file extraction
Figure 18: Data extraction time result when the number of records is 2000 in a database file
b) Flat file extraction
Figure 19: Data extraction time result when the number of records is 2000 in a flat file
Number of Records: 5000
a) Database file extraction
Figure 20: Data extraction time result when the number of records is 5000 in a database file
b) Flat file extraction
Figure 21: Data extraction time result when the number of records is 5000 in a flat file
Number of Records: 10000
a) Database file extraction
Figure 22: Data extraction time result when the number of records is 10000 in a database file