
Centralization, Normalization, and Warehousing Non-common data points

By Kaleb Joel Albee

2/29/2012

Project Report for a Master of Science in Computer Science Degree

at Eastern Washington University

Table of Contents

1 Introduction
1.1 Background
1.2 Problem defined and project involvement
1.3 Need for managing many data sources
1.4 Proposed solutions
1.5 Project Objectives
2 Literature Review
2.1 Database
2.1.1 Definition and Attributes
2.1.2 Typical use in Businesses
2.2 Data Warehouse
2.2.1 History
2.2.2 Definition and Attributes
2.2.3 Importance and Trends
2.2.4 Database Vs. Data Warehouse
2.2.5 Schemas
2.2.5.1 Star schema
2.2.5.2 Snowflake Schema
2.2.5.3 Fact Constellation
2.3 Data formatting
2.3.1 Data cleansing and Standardization
2.3.2 Definition
2.3.3 Procedure
2.3.4 Approaches
2.3.5 Challenges
3 Scrum
3.1 Definition
3.2 History
3.3 Attributes
3.4 Popularity
3.5 Application of Scrum
4 Data Modeling and Architecting provide Ad-hoc reporting
4.1 Clarify/Amdocs
4.2 Family History Center (FHC) Profile
4.3 Kanisa
4.4 Omniture web services
4.5 LANDesk
5 Operation of the Warehouse
5.1 Now vs. Before
5.2 User satisfaction
6 Conclusions
7 Future Work
APPENDIX A – SQL SCRIPTS
AMDOCS/CLARIFY
LANDesk
APPENDIX B – SCRIPTS
KANISA
OMNITURE
Family History Center (FHC) Profile
APPENDIX C – REPORTS
Works Cited


1 Introduction

1.1 Background

The Church of Jesus Christ of Latter Day Saints (hereafter LDS Church) has a historical focus on genealogical research, derived from an interpretation of a section in the Old Testament of the Bible (KJV Malachi 4:5-6). The LDS Church maintains a Family History Department dedicated to genealogy research and has placed high priority on, and allocated significant resources to, genealogy. The FamilySearch.org website of the LDS Church is one of the fastest growing genealogy Internet sites in the world (Top 10 U.S. Websites to Search for Your Ancestors, 2012). The Tools, Technology, and Support (TTS) Division of the Family History Department is tasked with improving the accessibility and usability of the FamilySearch.org site.

In a climate where personal computer access and power are rapidly expanding across the world, the TTS Division has observed the inevitable limitations of diverse users and recognized the importance of addressing those limitations. Genealogical raw data consists of government records (e.g., census data), graveyard records, library and community histories, newspaper articles, and personal records (e.g., journals where accessible). Raw data has been physically stored in a climate-controlled facility near Salt Lake City termed ‘Granite Mountain’, where approximately 35 billion images of genealogical information, contained mostly on 2.4 million rolls of microfilm, reside (Taylor, 2010). Those raw data sources must go through a lengthy process of record validation, digitization, storage, and archival prior to end use and research. This digitization is conducted primarily by volunteers, but also through collaboration with other businesses, to make the records viable for worldwide genealogical research at the FamilySearch.org site.

1.2 Problem defined and project involvement

A large volume of end users such as indexers, genealogists, and curious website surfers from various backgrounds and from different countries use the website on a regular basis. When these users encountered problems, customer service for them was previously disjointed and inefficient. Customer service responses were stored in a number of different storage systems, and the volume of these records was quite large, amounting to gigabytes of data. These records were used to improve the LDS genealogy website and provide service to users.

The main problem with the customer service records was the disjointed, disparate sources of these records. The challenge was how to integrate unique data record storage systems that lacked obvious associations. These storage systems were commonly obtained from different applications and indeed from different countries. Initial attempts to streamline responses were inefficient because the staff discovered that much of the user response data was not recorded in the databases. Instances occurred in which computationally intensive reports analyzing the user experience with several of the contributing databases, such as LANDesk data or client management tools, crippled entire production systems.

These inefficiencies led to an initiative to provide the most complete feedback possible to FamilySearch.org management by identifying a single point of access for quality reports analyzing user experiences and reporting in a universally accessible format (e.g., MS Excel, HTTPS, and Crystal). The solution to the storage and retrieval of the massive amount of user experience data in separate formats was to gather user experience data in all its forms and place it into a centralized warehouse in a universally accessible format. This is the main purpose and task of this Master's project and is described in full detail in the following sections.

Figure 1: Process chart. The five most common user experience data sources from FamilySearch.org research efforts.

1.3 Need for managing many data sources

User experience data is derived from multiple sources such as the Family History Center (FHC) Profile (ref), Amdocs (ref), LANDesk (ref), Kanisa (ref), and online tracking tools (see Figure 1). Each source of data has its own unique set of metrics for the data that will be tracked. The Family History Center (FHC) Profile tracked personnel usage, software usage, and volunteer usage. LANDesk (http://www.landesk.com) tracked the specific usage of a Family History Center's computers. Amdocs (http://www.amdocs.com/Pages/HomePage.aspx) tracked patron-agent interactions, resources accessed, and how quickly a solution was found. Kanisa (http://crm.consona.com/software/products/knowledge-management.aspx) tracked which documents were accessed, what key words were used, and the approximate time spent on each page. Online tracking tools, such as Omniture (http://www.omniture.com/en/), tracked the user's country of origin, IP address, where the user would enter the FamilySearch website, and where they would leave. Together these five sources of user experience data, with a combined storage footprint of 60 gigabytes and approximately 100 million records, comprised nearly all the inputs and proved extremely challenging to incorporate into a single warehouse.

1.4 Proposed solutions

The purpose of the five data sources was to improve the user experience by utilizing each application's record-keeping and analysis tools. Thus, the goal was to create a data warehouse derived from the five user experience data applications and provide an enterprise-wide solution in which a business user of any expertise could create a customizable report from the warehouse. The proposed solution to this challenge was to assess data integration feasibility, design cleansing and standardization procedures, automate data consumption, and architect and integrate data warehouse schemas.

The process of creating a data warehouse required the use of enterprise-level tools, assessment of databases and programming languages, and incorporation of custom scripts at the database (DB) level to deliver data manipulations. The enterprise-level tools consisted of Business Objects Data Services (BODS) and other extended data cleansing tools (http://www.sap-businessobjects.info/data-services.htm). Additionally, techniques were researched that reduced the time required to query the data warehouse. Members of the TTS Division were tasked with utilizing the proposed warehoused data records to create applications that enable end-user output containing graphs, charts, and raw data.

1.5 Project Objectives

Given data record storage systems in diverse formats with few obvious relational connections, the objectives of this project were to:

  • Identify the major data record systems
  • Cleanse and standardize the data
  • Unify the data records into one warehouse
  • Conduct user accessibility testing to ensure the storage warehouse would operate properly with each application

The overall goal of the TTS project was to provide a simple graphical interface that a user of any technical background, with nominal knowledge of SQL, could use to create reports. The end users of the proposed data warehouse comprised executives, managers, business analysts, volunteers, and product developers. The following sections provide an explanation of the research process, techniques, and final accomplishments of this project.

2 Literature Review

2.1 Database

2.1.1 Definition and Attributes

A database is a “structured collection of data. It may be anything from a simple shopping list to a picture gallery or the vast amounts of information in a corporate network. A relational database stores data in separate tables rather than putting all the data in one big combined table. The database structures are organized into physical files optimized for speed. The logical model, with objects such as databases, tables, views, rows, and columns, offers a flexible programming environment” (What is MySQL, 2012). For the purposes of the data warehousing project, MS Excel spreadsheets and MS Access databases are also included as database data sources.

2.1.2 Typical use in Businesses

Some of the most common uses of databases in industry are retail customer records, governmental records, large complex computations on statistical data, and medical (patient) records (ref). Retail businesses often use customer records to analyze consumer habits, track inventory usage, or target ads to regional products. Examples of government records are a person's social security records or tax records. Large, complex queries can be run against a database because databases have been optimized to perform these kinds of transactions frequently. Finally, medical patient records allow medical facilities to call up a patient's records in an emergency or look up medication usage, and patient records can provide insight from large data sets.

2.2 Data Warehouse

2.2.1 History

Data warehousing is a relatively new technology born out of consumer needs. The warehouse technology was driven by consumers assembling their assets and technologies to accomplish one goal: a single access point from which to mine data from many sources (ref). Consumers, businesses, and organizations needed to analyze data in ways previously impossible or impractical, because single and separate reports could not be combined in a reasonable time period. Further, the computational resources were often inadequate to sustain production performance while generating reports from the data. In an effort to meet their needs, customers combined several pieces of hardware, software, data mining techniques, and analytic tools (ref). As a result, the movement toward data analytics across multiple data sources was created. The following sections detail the attributes and importance of data warehouses.

2.2.2 Definition and Attributes

William Inmon introduced four standards required for the design of a good data warehouse (DWH Concepts and Fundamentals, 2007): the warehouse must be, first, subject oriented; second, integrated; third, non-volatile; and fourth, time variant. Each of these qualities allows a business analyst to ask a wide variety of questions about a company and to have them answered in a timely, reliable, and focused way. Each standard is described in more detail below.

A data warehouse must be subject oriented. The data within the data warehouse has to be organized in such a way that it can answer questions about the company. An example of a business question could be: how many users from a given country are accessing the systems, and at what time of day or night? The data must also be organized in a manner that facilitates many different kinds of questions (ref).
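As a rough illustration of what subject orientation buys the analyst, a question like the one above reduces to a single query against a fact table. This is only a sketch; the table and column names (fact_session, dim_country, event_ts) are hypothetical and are not the project's actual schema.

  -- Hypothetical example: count user sessions by country and hour of day.
  SELECT c.country_name,
         EXTRACT(HOUR FROM f.event_ts) AS hour_of_day,
         COUNT(*)                      AS session_count
  FROM   fact_session f
  JOIN   dim_country c ON c.country_key = f.country_key
  GROUP  BY c.country_name, EXTRACT(HOUR FROM f.event_ts)
  ORDER  BY c.country_name, hour_of_day;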

For a warehouse to be integrated, all the data in the warehouse should be unified. The data fields have to match formats, and naming conflicts need to be corrected everywhere, such as in country fields: a country name can be abbreviated, capitalized, or misspelled, and to be unified, one of these representations must become the standard. Units also have to match to guarantee that a report writer will receive accurate results from the warehouse; inaccuracies develop when they are not coordinated. For example, when multiple servers placed around the world have their clocks set to different time zones, the data time stamps will vary.

A non-volatile data warehouse has to ensure that the data already in place never changes. Considering the business questions the data warehouse is designed for, the warehouse provides a historical snapshot of the business and its performance. As a result, the warehouse grows perpetually larger by nature of its design.

Lastly, a data warehouse must be time variant. The purpose of a data warehouse is largely to report on trends, statistics, and other needs of the business. So, whenever new data is entered into the system, a time stamp or other detail linking to a date needs to be inserted (ref).

A data warehouse has several other features attributed to its design. Oftentimes, the data warehouse must have several indexes placed on its tables. Indexes give the host database quick, pre-calculated access into vast quantities of data. Because they are pre-calculated, every time new data is stored into the warehouse the indexes must be recalculated in order to retain their effectiveness. Indexes take up space in proportion to the number of indexes and the quantity of data being indexed.
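A minimal sketch of this idea, again with hypothetical table and column names, is to index the foreign-key and date columns that reports filter and join on most often; the maintenance cost described above grows with each load.

  -- Hypothetical example: indexes on the columns reporting queries filter and join on.
  CREATE INDEX idx_fact_session_country ON fact_session (country_key);
  CREATE INDEX idx_fact_session_event   ON fact_session (event_ts);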

Pre-calculated metrics are another common feature of a data warehouse. The use of pre-calculated metrics is another consequence of the immense amount of data stored in a data warehouse (ref).

A warehouse is usually de-normalized, which produces duplicate data. This would be a problem in a database; in a warehouse, however, de-normalizing the data structure decreases the complexity, increases the search speed of the warehouse, and improves the simplicity of the queries. The overall performance of the queries increases substantially (ref).

2.2.3 Importance and Trends

A data warehouse can provide business executives deep insights into how their business is performing in near real time (Benefits of a Data Warehouse, 2011). In a global, twenty-four hour market, every business needs an edge over the competition. However, sustaining the required appliances, including expensive reporting software for analytics services, can be prohibitive. The average business is moving toward analytic tools that do not require specialized technical skill sets. As a result, vendors such as WebFOCUS, SAP Business Objects, MicroStrategy, and Microsoft Business Intelligence have all created tools to aid in analytics (List of Business Intelligence (BI) Tools, 2012).

Due to changes in the global market, businesses are trying to reduce costs while achieving the same processing potential delivered by custom-built appliances within their businesses. They have turned to cloud analytics, encouraged by offerings such as Google's BigQuery™ and Infinit.e™ (Higginbotham, 2012).

In a harsh economy, businesses are trying to save money wherever they can. By using cloud computing and the appliances provided by cloud analytics businesses, corporations can make decisions based upon near real-time reports and statistics on their products.

2.2.4 Database Vs. Data Warehouse

What is the difference between a database and a data warehouse? A data warehouse can be a database, but a database is not necessarily a data warehouse. A database is optimized for disk writes and is normalized to conserve disk space due to the high volume of data. A data warehouse is built with the intent to do analytics and reporting on joined metrics across several sources. A data warehouse will commonly contain data from several different databases, whereas a standard database will be tuned to handle only one application. Further, a data warehouse will be optimized to handle the analytic business intelligence (BI) questions the business needs answered.

Several enterprise applications are available if the data warehouse has been architected to accommodate a BI tool. WebFOCUS, SAP Business Objects, MicroStrategy, and Microsoft Business Intelligence were the four tools we evaluated to provide an enterprise solution after the warehouse was built. We decided to focus our efforts on SAP Business Objects, which primarily required a fact schema to produce its solution. Microsoft Business Intelligence, which relied on data cubes, was a second choice.

2.2.5 Schemas

Three types of schemas are commonly used in industry: the Star schema, the Snowflake schema, and the fact constellation schema.

2.2.5.1 Star schema

The Star schema has the most parsimonious joins among records. “A Star schema is characterized by one or more very large fact tables that contain the primary information in the data warehouse, and a number of much smaller dimension tables (or lookup tables), each of which contains information about the entries for a particular attribute in the fact table.” (Oracle.com) (Figure 2). Within a fact table, the first data type is aggregates of the dimension tables and the second type is the foreign keys to the associated dimension tables. The fact table contains foreign keys which join to the surrounding dimension tables within the schema. Databases such as MySQL, Oracle, and MS SQL Server recognize Star schema queries and automatically optimize the execution plan to take advantage of the schema architecture (Star-Schema Design, 2010).

Figure 2: Star schema

The straightforward approach the Star schema provides, and the large number of Business Intelligence (BI) tools available to read Star schemas, make it an efficient choice for organizing information for reporting. Because the Star architecture is simple to understand and maintain, it is less expensive to maintain and therefore more palatable for businesses. Additionally, its simplicity means fewer dependencies that could otherwise prohibit system improvements (Star Schema, 2009).

Nevertheless, implementation has increased costs due to record storage volume and limited availability of complex reports. A data warehouse architect must carefully analyze the resources available and plan appropriately for business growth. If data warehouse systems are limited by storage space, or complex reports are required, then the Star schema is not the ideal choice (Star Schema, 2009).
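A minimal sketch of the layout in Figure 2 follows. The table names are hypothetical and are not the project's schema; the point is that one fact table carries the measures and foreign keys while small dimension tables hold the descriptive attributes.

  -- Hypothetical star schema: one fact table joined directly to its dimensions.
  CREATE TABLE dim_date    (date_key    NUMBER PRIMARY KEY, calendar_date DATE, day_of_week VARCHAR2(10));
  CREATE TABLE dim_country (country_key NUMBER PRIMARY KEY, country_name VARCHAR2(60));
  CREATE TABLE dim_source  (source_key  NUMBER PRIMARY KEY, source_name  VARCHAR2(30));

  CREATE TABLE fact_user_event (
    date_key      NUMBER REFERENCES dim_date (date_key),
    country_key   NUMBER REFERENCES dim_country (country_key),
    source_key    NUMBER REFERENCES dim_source (source_key),
    event_count   NUMBER,   -- pre-aggregated measure
    total_seconds NUMBER    -- pre-aggregated measure
  );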

2.2.5.2 Snowflake Schema

Figure 3: Snowflake Schema

“In computing, a Snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a Snowflake in shape. The Snowflake schema is represented by centralized fact tables which are connected to multiple dimensions.” (wikipedia.org, 2012)

An advantage of the Snowflake schema, illustrated in Figure 3, over the Star schema is its ability to handle more complex reports and queries. Another advantage rests in its ability to save on storage space if that is a factor in the warehouse. Business needs are often met by these two schemas (wikipedia.org, 2012).

Some disadvantages of a Snowflake schema include the potential for overly complex sets of queries and increased difficulty in reading the schema. The complexity of queries against a Snowflake schema increases the workload on the host database's CPU, RAM, and I/O. Furthermore, the complexity of the table relationships increases the difficulty of maintaining the schema structure as more data is added.
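The difference from the star layout sketched earlier can be illustrated by normalizing one dimension into a sub-dimension, again with hypothetical names; a report by region now costs one extra join but avoids repeating the region text on every country row.

  -- Hypothetical snowflake variant: the country dimension is normalized into a region table.
  CREATE TABLE dim_region (region_key NUMBER PRIMARY KEY, region_name VARCHAR2(40));
  CREATE TABLE dim_country_sf (
    country_key  NUMBER PRIMARY KEY,
    country_name VARCHAR2(60),
    region_key   NUMBER REFERENCES dim_region (region_key)
  );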

2.2.5.3 Fact Constellation

A fact constellation is a set of Star schemas joined together by common dimension tables or fact tables. By creating a fact constellation, the complexity of the schema increases considerably, maintenance is costly, and the space usage remains large due to the size of the dimension tables.

Figure 4: Fact Constellation

Thus, the fact constellation schema is used sparingly, for complex and in-depth reports (Figure 4) (Dimensional Model Schemas - Star, Snow-Flake and Constellation, 2012).

2.3 Data formatting

2.3.1 Data cleansing and Standardization

2.3.2 Definition

Data cleansing is the process of detecting enigmatic values and then removing them or entering values that provide a standard answer. Data cleansing can be one of the most time-consuming and difficult complications in creating a data warehouse. An example is how a database interprets empty, or null, values. A NULL is an expression of an empty field, which needs to be present in case a record has need of it. A conflict arises when one database parses a zero as an empty value while a different database construes the value as a literal integer.
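One such standardization rule can be sketched as follows, assuming a hypothetical staging table and the convention that a zero in the source really means "unknown"; the names stg_support_case and handle_seconds are illustrative only.

  -- Hypothetical cleansing step: one source uses 0 to mean "no value", another uses NULL.
  -- NULLIF maps the literal 0 to NULL so both sources agree before loading the warehouse.
  SELECT case_id,
         NULLIF(handle_seconds, 0) AS handle_seconds_clean
  FROM   stg_support_case;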

2.3.3 Procedure

Before data can be entered into a database, the records are audited for contradictions such as spelling, formatting, and false entries. Then a process is developed to remove or correct the discrepancies within the data records. The cleansing process is accepted and automated for data alteration. Next, the process is implemented and tested in a staging environment. Finally, the data is examined a third time to inspect for irregularities in the records.

2.3.4 Approaches

Common practices used to prepare data records for data warehousing include parsing, duplicate record deletion, and statistical methods. Each practice has its advantages and disadvantages, which apply in different situations and in different combinations.

The parsing approach employs regular expressions and strict word matching. An example of parsing is as follows:

  SELECT SUBSTR(v_content, 0, REGEXP_INSTR(v_content, '<PROBLEM(.*?)>', 1, 1, 0, 'i') - 1)
  INTO   fp
  FROM   document_wvar_mv
  WHERE  documented = '110118'
  AND    draft = 0;

The preceding SQL regular expression looks for an XML tag containing the word PROBLEM followed by a closing brace. Strict word or phrase matching is useful when the possible values in a particular field or dataset are limited to a small subset. In a database, the unique set of values is queried and a standardized value is agreed upon; a script or developer then executes the alteration devised by the architect or engineers.
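That strict-matching workflow can be sketched as below; the table and column names are hypothetical, and the mapping list would come from whatever standard values the data experts agreed upon.

  -- Hypothetical strict-matching cleanup: list the distinct raw values first,
  -- then map the agreed variants onto the standard value.
  SELECT DISTINCT country FROM stg_fhc_profile;

  UPDATE stg_fhc_profile
  SET    country = 'United States'
  WHERE  UPPER(country) IN ('US', 'USA', 'U.S.', 'UNITED STATES OF AMERICA');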

Duplicate deletion removes all duplicate records and adjusts all join values to point to the single remaining instance. Data transformation is the approach in which a certain value is detected by several means and then changed into an agreed-upon value. One such example is how a geographical state such as Utah is expressed: Utah can appear as UT, Utah, or even one of its zip codes.
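A minimal sketch of duplicate removal on a hypothetical staging table keeps one row per business key and deletes the rest (an Oracle-style ROWID comparison, assuming case_id identifies a duplicate group):

  -- Hypothetical de-duplication: delete every copy of a case record except one.
  DELETE FROM stg_support_case a
  WHERE  a.ROWID > (SELECT MIN(b.ROWID)
                    FROM   stg_support_case b
                    WHERE  b.case_id = a.case_id);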

Finally, statistical methods can prove useful when data records are too numerous to format into a report. If records for an application are kept for every instance of an event, the records quickly become too extensive to provide timely reports, and that level of detail is often unnecessary. Instead, records can be summarized using statistical methods such as a mean or a deviation computed over a specified time period.
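For example, a nightly load might keep only daily summary rows rather than one row per event. The target and source table names below are hypothetical.

  -- Hypothetical summarization: store one row per source per day instead of per event.
  INSERT INTO agg_daily_usage (usage_date, source_name, event_count, avg_seconds)
  SELECT TRUNC(event_ts), source_name, COUNT(*), AVG(duration_seconds)
  FROM   stg_raw_events
  GROUP  BY TRUNC(event_ts), source_name;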

2.3.5 Challenges

The most common challenges with data cleansing are errors in the error-correction procedure and the time required for maintenance. Error correction can be difficult due to the nature of the corrections. If the users assigned to implement the cleansing procedure do not understand the data, a desired value may be altered, skewing the results of any report that uses the field(s). Because the data sources are live documents (constantly changing), errors occur regularly, so the warehouse needs to be continually maintained. The time required for this maintenance can be prohibitive.

3 Scrum

Scrum is one of many possible software engineering methodologies used in the development of large projects. Scrum was utilized for this project as the best way to foster communication between everyone involved and to reduce the complexity into manageable tasks. This section describes the Scrum methodology, its background, and its history.

3.1 Definition

The methodology my team used to deliver all of the requested products is based on a process called Scrum. Scrum “[…] is an agile framework for completing complex projects” (Scrum Is an Innovative Approach to Getting Work Done, 2012). An agile framework relies upon the ability to take a large task and break it down into smaller tasks.

3.2 History

In 1986 Hirotaka Takeuchi and Ikujiro Nonaka described a new methodology for product development called Scrum. Scrum dictates that teams overlap responsibilities and encourages teams to work together. The term comes from the sport of rugby: a scrum is called by the referee after a game violation, and the two teams then bring the ball back into play. Each team must work together to gain control of the ball. The ball is equated with the problem or project at hand, and to reach their goal the team must work together against the opposition.

The Scrum methodology received further formal attention through the book “Wicked Problems, Righteous Solutions” by DeGrace and Stahl (ref), the first book to apply the term scrum to this agile approach. Then, in 1995, Jeff Sutherland and Ken Schwaber presented “Business Object Design and Implementation: OOPSLA '95 Workshop Proceedings” (ref), which laid out the initial Scrum process.

Since then, businesses and organizations have enhanced and personalized the process to meet their needs, but the foundation of the approach is attributed to Ken Schwaber. Scrum has become widely popular not only in software development but also in other engineering fields.

3.3 Attributes

In order to understand the Scrum process, we first need to define some terms associated with the Scrum philosophy: Scrum Master, product backlog, sprint backlog, sprint, daily Scrum meeting, product owner, development team, and usable product.

A Scrum Master is the person on a Scrum team assigned the responsibility to remove problems, to encourage and require project members to follow the Scrum process, and to prevent distractions from reaching the development team. Problems that may need to be removed include, but are not limited to, missing resources, outside teams refusing to do necessary work, and implementation issues.

When resources are required, such as guidance from a data expert or a software package, the Scrum Master can dedicate time to finding the expert and setting up appointments and meetings, or to finding monetary resources to acquire the software or equipment.

Process is especially important to ensure work progresses: it provides transparency between the development team and management, communication between the client and the project team, and visibility into overall project progress. Transparency between the development team, client, and management prevents constant rework, management misunderstandings, and client impatience. Rework is caused when an amount of work is done and it is not what the client or manager wants. A further advantage is that when the client is aware of project problems, progress, and realistic expectations of project completion, they can plan their schedules to meet their own needs.

Distractions can cause serious delays, especially when ‘scope creep’ enters a project. Scope creep is when the customer requests additional features that are not in the accepted project outline. Another common distraction occurs when other people in the company ask a developer for ‘favors’ or tasks to be done ‘real quick.’ Every time a project team member is pulled away for a quick fix, the developer spends roughly 15 minutes switching from the current task to the requested task and then back again. If that transition happens once a day all week long, an average team member loses two and a half hours a week.

Next, a product backlog is the list of features or tasks that must be completed for an entire project to be finished. A product backlog is a list of requirements (LoR) that the Scrum Master and the client have negotiated and worked together to break down into a set of manageable tasks.

A sprint is an interval of time, two to four weeks in length. Sprints are designed to encourage assigning smaller tasks, thereby preventing large amounts of rework and emphasizing constant communication between the stakeholders in the project.

Daily Scrums are necessary to promote communication between team members, expedite problem solving, and draw attention to potentially time-consuming problems. During the fifteen-minute Scrum meeting the team members talk about the problems they had and give suggestions on how to solve them. Further, the Scrum Master is informed of potential problems so that they can attempt to remove them.

The product owner is the person who represents the client and the design of the product. The product owner needs to communicate the needs of the customer and the potential problems that arise during the process.

Finally, a usable product is something that can be given to the customer in working form, no matter how minimal it might be. Something workable can be as simple as a login feature or as complex as a complete security suite built into a custom-made application. However it is defined, the product must be useful to the client.

Figure 5: Basic Scrum Process

The basic process of Scrum, shown in Figure 5, is as follows:

1. Project conception: an idea for work is first presented.

2. Product backlog: the problem or idea is put into the product/project backlog, the list of tasks to be completed for a given project.

3. Select tasks: work with the development team to plan what can be accomplished in the Scrum time period and move those tasks to the sprint backlog.

4. Sprint backlog: assign tasks to each developer or group of developers according to what they feel they can handle.

5. Communicate with stakeholders and acquire resources: talk with the stakeholders and negotiate the amount of work to be done during that time frame.

6. Sprint: work on sprint items and report to the stakeholders when each item is done. The stakeholders can give their approval of the work and its quality. If a developer finishes all their tasks for the sprint, they go to other developers and help them complete their tasks; by helping each other, cross-training occurs and the team becomes stronger.

3.4 Popularity

Scrum was initially used for software development but has quickly been adopted by other projects and industries. Scrum has become increasingly popular due to its ability to control new requirements of a project, the relative ease with which the agile framework can manage enormous projects, and the simplicity of its structure.

3.5 Application of Scrum

In the first iteration of the Scrum process, the data was analyzed and a risk assessment of the deliverables was performed. Approximate time intervals were planned for the next two-week interval. The assessment included the requirements a proper data warehouse sets in place.

The first standard was easy to assess: could the data be modeled to answer business questions? Each of the data sources had business analysts who had already worked with the data, and those analysts commonly had a set of business questions they were already answering. Frequently, the analysts also had a set of backlogged questions that needed to be answered in addition to the current ones. As a result, I was tasked with figuring out how to add those backlogged queries to our ETL and thus into our system.

Next, the integratability of the data into our existing warehouse needed to be assessed. I consulted with the data experts, and they identified common and, all too often, hard-to-solve integration points. Technologies were then engaged to ensure data would come across automatically to the data warehouse.

Next, the data needed to be non-volatile. One of the frustrating problems with record volatility was that the engineers would report that the data was corrupted due to some problem in the production systems. During the two-week Scrum process, the data was first captured in raw format. Because our existing and regularly used reports relied on certain columns of data, we needed to take advantage of materialized views [1], table views [2], or simple SQL queries so that I could compensate for the possibility of corruption by adding additional corruption checks.

[1] A materialized view is a replica of a target master from a single point in time. The master can be either a master table at a master site or a master materialized view at a materialized view site. Whereas in multimaster replication tables are continuously updated by other master sites, materialized views are updated from one or more masters through individual batch updates, known as refreshes, from a single master site or master materialized view site.

[2] A view is a representation of a SQL statement that is stored in memory so that it can be re-used.
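A minimal sketch of the materialized-view approach follows; the view name, source table, and refresh options are hypothetical and would depend on the actual source system.

  -- Hypothetical Oracle materialized view: a periodically refreshed snapshot of raw case data
  -- that reports can keep querying even if the production source later becomes corrupted.
  CREATE MATERIALIZED VIEW mv_case_snapshot
    BUILD IMMEDIATE
    REFRESH COMPLETE ON DEMAND
  AS
  SELECT case_id, country, created_ts, closed_ts
  FROM   clarify_case_raw;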

Finally, the data needed to be time variant. However, adding a simple time stamp was not sufficient for many of the sources. The business experts wanted the time of the event based upon several factors within the system, and sometimes within other systems. I therefore had to apply two timestamps in the database: a time of insertion and a timestamp of the client/system event.
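That two-timestamp convention can be sketched on a hypothetical warehouse table as follows.

  -- Hypothetical table carrying both timestamps: when the row was loaded into the
  -- warehouse, and when the underlying client/system event actually occurred.
  CREATE TABLE wh_user_event (
    event_id  NUMBER,
    insert_ts TIMESTAMP DEFAULT SYSTIMESTAMP,  -- load time
    event_ts  TIMESTAMP                        -- time of the original event
  );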

Next in the Scrum process, a burn-down chart was used. A burn-down chart tracks the progress each team member makes during the sprint. Using the burn-down chart to estimate the number of tasks we could handle, we would bid for tasks for the sprint, and the clients would then be informed of what would be possible during it. Often, “no” was an accepted answer to new tasks given to me in the middle of a Scrum period until the current task was completed. The only exceptions were business needs considered critical to running the business.

The Scrum assignments were tasks to create automated import processes. Specifically, the tasks were to create scripts for cleansing records, unifying core tables, adding metrics, building simple reports, and adding to existing reports.

An example of a task performed was dealing with data integrity issues. Data integrity problems were a constant throughout the entire project, and users consequently would question the validity of the reports. Each discrepancy was analyzed; in many instances the reason was the user's understanding of the data, while in other instances a problem in the data also meant a problem with the source data. The source was occasionally corrupted by engineers or system users trying to correct another problem, and corrections would then need to be made to our views and reports.

Another part of the development process was the agile piece of the SCRUM methodology. Executives requested custom reports or projects which took precedence over the current tasks, resulting in reports to the customer and adjustments to our scrum timelines. In many cases, I had to negotiate with the executive and ask them to wait until our sprint ended so we could complete our current tasks. One paradigm of scrum is to eliminate the disturbances created by switching between projects.

Delivering products was crucial to the process. Reworking a project takes a great deal of time and effort. By delivering a project in small pieces, frequent client input could be considered quickly and changes could be made before the requested adjustments became too difficult to apply during the data architecting and design phase. Data experts were frequently consulted to verify that queries against the current schema were accurate. Through these efforts, and after many reworks, the schema became stable and more reliable.

Additionally, after the product was delivered, relationships of trust were created with the clients as problems were addressed. The clients were always aware of arising problems and could anticipate delays, correcting their schedules to fit the needs of the product.

Even with the product's delivery and constant communication between the team and the client, we were still obligated to ask for feedback. The feedback was expected at the end of each cycle and was used to improve the cycle for future efforts.

4 Data Modeling and Architecting provide Ad-hoc reporting

In this section the application of these methods toward warehousing data in user-friendly formats is presented. Because each of the five most common data source types required a distinct approach and attention to different details, this chapter is separated into five subsections corresponding to those five major input data source types. In the following subsections each of the databases is described along with the process of incorporating it into the warehouse. The different sources were Clarify/Amdocs, the Family History Missionary Profile system, the Kanisa knowledge document management system, Omniture web reporting analytics, and the LANDesk systems reporting server as it applies to our worldwide infrastructure.

4.1 Clarify/Amdocs

The Clarify/Amdocs data source (http://www.amdocs.com/Pages/HomePage.aspx) is built on an Oracle 11g database and is called the Case Management System (CMS). This system is used for patrons calling into the FamilySearch support centers. Clarify is a legacy name for the current system Amdocs uses; Clarify had hooks in many of the Amdocs components even though, at the time, Clarify was being upgraded and phased out.

Amdocs is TTS's agent case management tool. The tool was bought under the assumption that it had the desired reporting capabilities and could handle large case loads. However, after extensive use and testing, the Amdocs tool lacked many of the reporting capabilities the business needed to make effective decisions. The FHD engineers knew, however, that all of the report metrics were available within the Amdocs database; in fact the database held more information than anticipated. As a result of the reporting deficiencies within Amdocs, the Amdocs database was in the process of being upgraded, so the current legacy warehouse had to be imported into the new data warehouse. SQL was used to upgrade existing reports that had been built upon an older Oracle 9i database, which restricted the use of many functions available in newer database versions.

All the existing reports were converted by working through the more complex reports and analyzing the objectives of each report. Then, by modeling the reports within a simple time frame, they could be matched against the corresponding legacy report. The next step was to utilize the database's 'explain plan' to determine which query would be the lower cost to implement. "A statement's execution plan is the sequence of operations Oracle performs to run the statement and obtain results" (Oracle.com). Significantly, an explain/cost plan is similar across databases such as MySQL, PostgreSQL, and MS SQL. The cost is a numeric representation of the sequence of operations Oracle and other databases perform in order to complete the query. The process of reading existing examples, researching how and what they were doing, and implementing improved and optimized versions taught me which queries to research and why one SQL query might work better than another.
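To make the comparison concrete, the following is a minimal sketch of how a candidate query can be costed in Oracle; the query shown is a simplified stand-in built on the Clarify tables from Appendix A, not one of the actual report queries:

    -- Cost a candidate rewrite before adopting it.
    EXPLAIN PLAN FOR
      SELECT c.id_number, COUNT(*) AS activity_count
      FROM   table_case c, table_act_entry a
      WHERE  c.objid = a.act_entry2case
      GROUP BY c.id_number;

    -- Display the operation tree and its estimated cost.
    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

MySQL's EXPLAIN and SQL Server's showplan output serve the same comparative purpose, which is why the technique carried across engines.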

Next, Business Objects Data Services (BODS) was used to do the majority of the complex ETL operations that did not need custom scripts. Although BODS was not designed to be a data transfer tool and would be considerably slow when used that way, TTS utilized BODS for that purpose in many cases. BODS also had a comprehensive set of built-in ETL operations designed to ease the burden of data validation and integration.

Amdocs data imported to the warehouse required its data records to be manipulated and transferred to another database for further manipulation. The data was sent to a MySQL server, only to be transferred back after certain manipulations were performed. Oracle 9i, which was the basis of Amdocs, did not have complex regular expressions built into the engine, while MySQL 5.0 did. As a result of these requirements, I had to learn how BODS controlled its connections and its manipulation of data records. Ultimately, BODS was used to do the majority of the ETL and data transfers. At that phase of the project, I knew I would be integrating more databases into the warehouse, so I leveraged the strengths of BODS so I could concentrate on learning how to architect a warehouse, on development techniques for integrating data records, and on better data cleansing techniques.

The integration of the Amdocs data records required interaction with three different databases: Oracle, MySQL, and MS SQL. Although the TTS Division was primarily Oracle oriented and usually stayed current on Oracle appliances, the Oracle databases were of the older version 9i. Oracle 9i only supported simple 'wild card' data matching and did not support complex regular expressions; Oracle began to support regular expressions as of version 10g (Goyvaerts, 2010). Oracle 9i therefore required that the Amdocs data be transferred to a MySQL database to utilize many of the MySQL server's query expressions. The data would then be transferred back to the data warehouse in its cleansed form. After our data warehouse was upgraded to Oracle version 11g, transferring data to the MySQL server was no longer necessary and a direct database link was established between the two databases.
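As an illustration of the difference (the column and pattern here are hypothetical examples rather than the actual cleansing rules), the same format check must be approximated with wildcards on 9i but can be expressed directly on later versions:

    -- Oracle 9i: only LIKE-style wildcard matching is available.
    SELECT objid FROM table_site WHERE phone LIKE '1-___-___-____';

    -- Oracle 10g and later: a real regular expression can enforce the format.
    SELECT objid FROM table_site WHERE REGEXP_LIKE(phone, '^1-[0-9]{3}-[0-9]{3}-[0-9]{4}$');

    -- MySQL 5.0 equivalent, used while the data was round-tripped through MySQL.
    SELECT objid FROM table_site WHERE phone REGEXP '^1-[0-9]{3}-[0-9]{3}-[0-9]{4}$';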

Many of the existing reports related to the Amdocs system also required statistical data from our LANDesk database servers, which were MS SQL. Database links from the LANDesk database server were integrated into the Oracle data warehouse. Once the database links were established, simple data manipulations could be performed before the data even reached the warehouse. The accepted reporting tool used in conjunction with the new Oracle 11g was Crystal Reports, a Business Objects tool utilized to create reports from multiple systems.
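A minimal sketch of that pattern on the Oracle side is shown below; the link name, credentials, and connection alias are hypothetical, and a link to an MS SQL source additionally depends on Oracle's heterogeneous gateway configuration, which is omitted here:

    -- Hypothetical database link from the Oracle warehouse to a source database.
    CREATE DATABASE LINK landesk_link
      CONNECT TO report_user IDENTIFIED BY "report_password"
      USING 'LANDESK_TNS_ALIAS';

    -- Remote tables can then be queried as if they were local, so light manipulation
    -- happens before the rows land in warehouse tables (remote object naming depends
    -- on the gateway setup).
    SELECT c.computer_idn, c.HWLastScanDate
    FROM   computer@landesk_link c
    WHERE  c.HWLastScanDate < SYSDATE - 21;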

The reports that needed to be corrected were SQL based, and Crystal Reports can be configured to use ODBC connections to individual servers and sources. However, allowing users access to multiple systems would not be in alignment with the objectives of this project. We addressed this by restricting access to only the data warehouse we were building. Funneling data to one site allowed me to control what data was seen and how it would be seen by end-users, and to create uniform reports across the TTS Division.

As a final step in warehousing the Amdocs database, the newly developed warehouse had to handle an Amdocs database upgrade. Although the data was supposed to be unaffected, several critical columns were found to have been cleared and others consolidated. The large number of incorrect Crystal reports resulting from the Amdocs database upgrade forced a systematic transformation of all the pre-built materialized views, and additional views were necessary to compensate for the changes introduced by the upgrade. Despite some nominal data loss, the majority of the data was salvaged and re-integrated into the TTS Division warehouse.

The snapshot in Figure 6 shows the Amdocs/Clarify portion of the warehouse.

Figure 6 Amdocs/Clarify portion of the Warehouse

4.2 Family History Center (FHC) Profile

The FHC Profile database has been an evolving system which was initially an MS Access database utilizing an MS Access data entry interface. FHC then moved from the Access database to an Oracle 10g system utilizing a simple Ruby on Rails interface, drawing upon an additional data source, the Church Directory Online Listing (CDOL). Finally, the FHC Profile database system was integrated into a complex Ruby on Rails web interface drawing upon CDOL, LANDesk information, and a custom missionary application developed by the FHC engineering teams. FHC Profile evolved into a completely custom application by the end of the warehousing project.

The FHC Profile was the second major system integrated into the TTS data warehouse. The FHC missionary profile system had several problems which needed to be overcome. First, the profile had several data sources, of which all but two had no data validation. Second, the SQL queries were unnecessarily complex. Finally, the original implementation was poorly built and made debugging extremely difficult. While addressing these problems, an additional task was to maintain and write reports for the Clarify/Amdocs management and user agents; these reports would become increasingly involved and complex. Ultimately, techniques were developed to reduce the lines of SQL and eliminate the 'bugs' within the data and queries.

Reports generated by the FHC Profile system were constantly in question due to data integrity issues. The integrity problems stemmed from the lack of data validation tools in place during usage. A volunteer group and I were assigned to deal with the integrity problem. The volunteer group was assigned to build a web interface which would interact with the warehouse and to stay in constant contact with me while they were building it. At this point, the data was gathered, unnecessary sources were eliminated, data guards were implemented on the database, and finally new tables were integrated into the data warehouse.

Commonly encountered sources of data were MS Excel spreadsheets that were being used as databases for entire projects. The spreadsheets accumulated errors because multiple users entered data at any time and in any format. The greatest problem in phasing out the spreadsheets was tracking down all the owners and experts of the data, since many of the sources had complex macros and functions cross-linking to other sites and sources.

The Data Services data integration tool was utilized to extract, transform, and load the MS Access and Excel spreadsheets into our warehouse. The extraction and loading were easy, but the transformation was difficult. The largest MS Access database had historical data recorded by adding columns to a table as needed. A pivot transformation, which reorients data either from rows to columns or vice versa, was implemented to turn the many columns into rows and to separate the names from the other data. Joins within the MS Access data yielded duplication, so all the data was extracted and placed into a separate spreadsheet and an external database to align those fields with the warehouse standards. Dates, countries, and addresses had to be run through the Data Services address libraries. Two major corruptions occurred, in the Family History centers' statistical data and in the employee schedules. The statistical data would contain answers like yes, 1, 0, no, not passed, and so on; a distinct list of all the possible values was extracted, the corrupted answers were cleansed out, and they were replaced appropriately. The schedules needed to have their numbers corrected and then be reprocessed to count the hours the people listed on the spreadsheet and Access database worked or did not work.
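Data Services provides this reorientation as a built-in transform; the same idea expressed directly in SQL, with hypothetical column names standing in for the per-period columns that had accumulated in the Access table, is a sketch like the following:

    -- Hypothetical wide table: one row per center, one column added per reporting period.
    -- Turning the period columns into rows yields one (center, period, visits) row each.
    SELECT center_name, 'JAN' AS stat_period, jan_visits AS visits FROM center_stats_wide
    UNION ALL
    SELECT center_name, 'FEB', feb_visits FROM center_stats_wide
    UNION ALL
    SELECT center_name, 'MAR', mar_visits FROM center_stats_wide;

Oracle 11g also offers an UNPIVOT clause that performs the same reshaping declaratively.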

The data sources needed to be further prepared by adding primary keys, foreign key relationships, indexes, and data type checks. Assigning primary and foreign keys would prevent data duplication, indexes ensured query speed, and data type checks guaranteed proper formatting in many cases.

By placing primary keys, foreign key relationships, and referential integrity checks we were able to control data changes, including updates and deletions of the data (Foreign Key Constraints, 2012). Indexing data columns allowed the warehouse to answer queries much more quickly, and to support more complex queries if necessary. Enabling data type checks forced the warehouse to conserve memory and ensured proper data formatting.
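A minimal sketch of these guards, using hypothetical table and column names rather than the actual FHC Profile schema, looks like this:

    -- Hypothetical dimension and fact tables illustrating the guards that were added.
    CREATE TABLE fhc_center_dim (
        center_id    NUMBER        PRIMARY KEY,   -- prevents duplicate centers
        center_name  VARCHAR2(100) NOT NULL,
        country_code CHAR(2)       NOT NULL
    );

    CREATE TABLE fhc_schedule_fact (
        schedule_id  NUMBER      PRIMARY KEY,
        center_id    NUMBER      NOT NULL REFERENCES fhc_center_dim (center_id),  -- referential integrity
        work_date    DATE        NOT NULL,
        hours_worked NUMBER(4,1) CHECK (hours_worked BETWEEN 0 AND 24)            -- range/type guard
    );

    -- Index the column most reports filter on.
    CREATE INDEX fhc_schedule_fact_date_ix ON fhc_schedule_fact (work_date);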

The data coming from CDOL was produced by existing SQL queries which were extremely inefficient, so a method had to be designed to optimize them. The queries were stored in procedures and sometimes just written in custom user scripts. First, a schema diagram was acquired for the source database. We then researched how to replicate the queries in ways that would greatly reduce the execution costs reported by the explain plan, a tool provided by the database engines. Further research was done to better utilize sub-queries, materialized views, temporary tables, and SQL features such as GROUP BY and finding the maximum or minimum values in fields.

Understanding how the database engines execute queries, and the cost of placing a process in one spot rather than another, became highly important to the performance of the overall warehouse. As a result, SQL queries were systematically reduced in size and their results compared against the existing queries. Many of the original queries had flaws in their counts, groupings, and sub-queries which were creating duplication and unintended data elimination.
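A typical simplification of this kind, sketched here with hypothetical tables rather than the real CDOL queries, replaces a duplicating join with a grouped subquery that selects one row per key:

    -- The grouped subquery picks the latest change per unit; joining back on that pair
    -- avoids the row multiplication the original nested sub-queries produced.
    SELECT u.unit_id, u.unit_name, d.change_date, d.status
    FROM   unit u,
           unit_detail d,
           ( SELECT unit_id, MAX(change_date) AS max_change
             FROM   unit_detail
             GROUP BY unit_id ) latest
    WHERE  d.unit_id     = latest.unit_id
      AND  d.change_date = latest.max_change
      AND  u.unit_id     = d.unit_id;

Comparing the explain plan cost of the before and after versions, as described earlier, confirmed which rewrite to keep.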

By the conclusion of this stage of the project we had learned how to reduce the overall amount of code. We reduced the number of lines of code by 50 percent, and data guards were implemented in the form of data validations, key constraints, and applications which guard against errors. The improved error handling meant the volunteer group no longer needed to spend time each week removing errors, users no longer needed to insert data into multiple locations, and report writing no longer had to be done by a developer and could be shifted to an analyst.

The following is a graph of the Family History Center (FHC) Profile portion of the

Warehouse.

4.3 Kanisa

The Kanisa database (http://crm.consona.com/software/products/knowledge-management.aspx) was built on an Oracle 9i DB and is a Knowledge Management System custom configured to monitor usage of the different knowledge documents. Server logs, which held a plethora of data on system usage that the Oracle DB did not include in its tables, were also utilized for the data warehouse.

Kanisa was the third major system successfully integrated into our data warehouse, as illustrated by Figure 7. Kanisa is a knowledge document management system (KDMS) used to support all the patrons of the FamilySearch.com research sites and software. Kanisa's data helped FamilySearch.com manage and improve the self-help documentation and reduce support personnel costs.

There were four significant issues that had to be resolved with the Kanisa data. First, primary data was constantly being changed by users, which made capturing historical statistics difficult if not impossible as time progressed; we had to figure out what to capture and how to capture it. Second, the database was not capturing all the data points we wanted, so a plan was formulated to parse the cache of log files which held the data we wanted and to make the warehousing process automated. Third, the cache of logs contained duplicate data that interfered with reports and other metrics, so a procedure was created to detect when the log cache contained a duplicate and to eliminate the extra data. Finally, we needed to reassess the kinds of questions the warehouse could answer with all the new data available to the warehouse users.

Note that there were nine different file types of interest. Each file type held data that would provide a different perspective on the system and give us insight into the user's experience. The data would be extracted from these file types.

To solve the export and import issues, a Ruby script had to be created, and the Data Services scripting language was then used to execute the Ruby script on the remote system; Data Services also had to be notified when the script was done. The challenge in creating the script was optimizing the Ruby so the script would be able to extract thousands of files and append them onto one of the nine master files.

Ruby 1.8.7 does not provide true multi-threading, meaning each thread cannot be run as an individual process on the processor (Mittag, 2008). We discovered that with Ruby 1.8.7's 'green' threads the system would take almost a day and a half to go through just one of the nine files. Consequently, we switched to Ruby 1.9.2, which supported concurrency (Ruby, Concurrency, and You, 2011). Though Ruby 1.9.2 does not support 'true parallelism', the concurrency did improve the script's performance. The script would first extract a row and then count how many commas were in it. If the row had too many or too few, the script would check whether one of the fields was missing or contained another set of data. Columns would often contain XML, so the script had to detect the beginning and end of the XML and replace its commas with another character which did not occur in that particular master file; and in the event a comma was missing, the script had to detect that and insert the comma in the appropriate place.

By definition, a data warehouse must not be volatile or changing. However, due to the nature of the source, its contents were constantly changing, so we needed to constantly query the changes and document when they happened. Several materialized views were created, consisting of snapshots of data in time, to retain the data twice a day. Business Objects Data Services was scheduled to run an extraction, transformation, and load against the Kanisa system to retain valuable metrics about its usage. An interestingly difficult problem to solve was a legacy data type originating from the Kanisa database: one of the database columns was of type LONG (Oracle Datatypes Data types for oracle 8 to Oracle 11g, 2012). A LONG is like a binary type for a standard file system; it is neither an integer nor just a character. LONG columns in Oracle cannot be directly queried once extracted, and they cannot be indexed. We had to extract the column together with the primary key identifying its place in the export to the warehouse, and convert it to a Character Large Object (CLOB). A CLOB then allows indexing and direct querying, and to optimize for performance we included the exported data in our materialized view.
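The conversion can be sketched as follows; the staging table, source column names, and the twice-daily refresh interval are illustrative assumptions rather than the exact objects used in the project:

    -- Copy the legacy LONG column into a CLOB so it can be queried and indexed.
    -- TO_LOB is valid only when inserting from a LONG column into a LOB column.
    CREATE TABLE kanisa_doc_stage (
        doc_id   NUMBER PRIMARY KEY,
        doc_body CLOB
    );

    INSERT INTO kanisa_doc_stage (doc_id, doc_body)
    SELECT d.objid, TO_LOB(d.doc_text)
    FROM   kanisa_documents d;

    -- Snapshot the staged data twice a day so the warehouse copy stays non-volatile.
    CREATE MATERIALIZED VIEW kanisa_doc_mv
      REFRESH COMPLETE START WITH SYSDATE NEXT SYSDATE + 1/2
    AS SELECT doc_id, doc_body FROM kanisa_doc_stage;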

Taking the cache of logs and parsing through them was far more difficult than anticipated and prevented the use of the default import tools in the databases or in BODS. Oracle's flat file extraction tools were tried along with the Data Services data import functions, but we discovered many of the columns contained several types of data. One column would contain the query string the user used to access the system, another column within the logs would retain the XML output from the data server, and another column in the same file would contain CSV-formatted data output from the log files.

Finally, after all the scripts were written, the data was imported, and the business questions had been defined, the Kanisa data was loaded into our warehouse, producing ad-hoc reports generated by the user. Along the way many optimization techniques in Ruby and Data Services were used. Data warehouse data guards were put in place to prevent information from being corrupted. One of the new guards discovered was a way to detect historical data duplication through Data Services: Data Services had to read in the current warehouse records and compare them to the information being passed in from the logs and source databases. The records being passed in had session identification numbers and timestamps of the event.
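Expressed in SQL rather than in the Data Services transform that was actually used, the duplicate guard amounts to a sketch like this (the staging and fact table names are hypothetical):

    -- Load only log rows whose session id and event timestamp are not already in the
    -- warehouse, so re-processed log files cannot double-count usage.
    INSERT INTO kanisa_usage_fact (session_id, event_ts, doc_id, event_type)
    SELECT s.session_id, s.event_ts, s.doc_id, s.event_type
    FROM   kanisa_usage_stage s
    WHERE  NOT EXISTS ( SELECT 1
                        FROM   kanisa_usage_fact f
                        WHERE  f.session_id = s.session_id
                          AND  f.event_ts   = s.event_ts );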

The following is a figure of the Kanisa portion of the Warehouse.

Figure 7 Kanisa Portion of the Warehouse

4.4 Omniture web services

Omniture (http://www.omniture.com/en/) is commonly used website analytics software which was chosen to be integrated through a Business Objects application, Data Services.

Omniture is website analytics software similar to Google Analytics, and at FamilySearch it serves as our primary tool for observing the activity on our many sites. The analytics tool has a set of statistics it can track out of the box, and one of Omniture's more powerful attributes is the ability to track custom fields which can be built into the tracking code.

Omniture was the fourth major source successfully integrated into our data warehouse. In order to integrate the data from Omniture, the import process had to be automated, which created new challenges. Three major obstacles had to be overcome: the web service documentation was poorly written and in many cases missing, the WSDL was in a non-standard format which BODS could not understand, and finally we had to figure out how to connect the Omniture data to the existing data warehouse data.

Interfacing BODS to Omniture was significant at the time because no one had been able to do it so far. The WSDL, as written, did not conform to the W3C standards (W3C, 2001) which BODS needed in order to automate the connection and import. As a result, a custom interface to Omniture had to be written which BODS could use to extract the queried data into our data warehouse.

Java was chosen as the language for the custom interface because there were existing examples of the needed connection. The connection had to be generic enough that the script could be altered to allow for other queries without the user needing knowledge of Java programming. The queries would need to answer questions like, "Give me all users from China who entered the site from March 2, 2002 to March 4, 2002, along with each user's computer type, version, and web browser, and then compare the results to the current systems we have at the FamilySearch centers." Further, I had to custom craft JavaScript Object Notation (JSON) queries, a lightweight data-interchange format, inside the Java calls so the extraction could be further automated. In order to create a model for the Omniture data I had to research a concept called database polymorphic associations, which was then utilized. By combining all these techniques, truly ad-hoc reporting cross-joined with Amdocs, Kanisa, LANDesk, and FHC Profile would be achieved.
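A minimal sketch of the polymorphic-association idea as applied here (the table and column names are hypothetical illustrations, not the schema that was actually built) is a single link table whose type column says which source system each related row belongs to:

    -- Hypothetical polymorphic link: source_type tells which system source_id refers to.
    CREATE TABLE web_visit_link (
        visit_id    NUMBER       NOT NULL,   -- Omniture visit
        source_type VARCHAR2(20) NOT NULL,   -- e.g. 'AMDOCS_CASE', 'KANISA_DOC', 'LANDESK_PC', 'FHC_CENTER'
        source_id   NUMBER       NOT NULL,
        CONSTRAINT web_visit_link_pk PRIMARY KEY (visit_id, source_type, source_id)
    );

    -- Example: web visits that ended up as Amdocs support cases.
    SELECT l.visit_id, c.id_number AS case_id
    FROM   web_visit_link l, table_case c
    WHERE  l.source_type = 'AMDOCS_CASE'
      AND  l.source_id   = c.objid;

One link table per association type would work as well; the single-table form simply keeps the ad-hoc reporting joins uniform across the five sources.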

One last hurdle to overcome was the extremely poor and incomplete documentation for the web services. Omniture support services were contacted constantly for clarification, and custom queries were created as a result; because the documentation was incomplete, a great deal of experimentation was required to create those custom queries.

4.5 LANDesk

LDS FHD LANDesk (http://www.landesk.com) is built on an MS SQL database (DB) server which monitors all the computer systems throughout the corporation worldwide. It reports on the health of all the systems based on the system hardware. Data was stored in the registry, where the servers worldwide could query the systems' registries and then deposit the data into one central location.

"LANDesk Software provides systems management, security management, service management, asset management, and process management solutions to organizations. It is one of the oldest companies providing this type of product." (LANDesk, 2012) We use LANDesk on all of our FamilySearch center computers. Each of the systems constantly tracks both hardware and software usage. LANDesk captures statistics on the state of the hardware and is able to give us insight into which centers need more or newer computer systems.

The fifth major data source successfully integrated was LANDesk, whose records were connected into our data warehouse as illustrated by Figure 8. At the time we had difficulty acquiring an up-to-date database schema because of how the data was being stored. Later the database was upgraded and the fields were better defined, but until then we had to deal with un-modeled data. Un-modeled data fields were data points extracted from the custom registry fields inserted into all the computers around the world. The data types had IDs assigned to them by the LANDesk servers, but we needed to experiment to figure out which IDs belonged to which description. Four types of problems were encountered: there were multiple data sources; no current database schemas were available; historical counts were not being captured properly; and finally, a data point to link all the other systems to was difficult to find and, once found, required complex SQL to use.

The LANDesk project needed to gather its records from two different sources: the LANDesk servers and Sophos, an anti-virus and firewall product. Sophos was included in the data warehouse integration project because it related to the health of the computer systems housed in the family history centers around the world. We needed to figure out how to link the LANDesk computers to the computers Sophos was installed on. Sophos provided insight into how many intrusions were detected, how many times sites were accessed, and how current the anti-virus system was.

At first, we did not know how to join the Sophos and LANDesk computers together, and the join was not obvious. We figured out that LANDesk could query the type of anti-virus software installed and then use the serial number as a joining point to the LANDesk system. Once we had the joins set, we needed to utilize Profile to connect all the LANDesk computers to specific FamilySearch centers and libraries around the world. Linking all the systems together allowed us to get statistics from the country level all the way down to internal organizations.

After we had solved all the joins up to this point, we next had to figure out the database schema within the Sophos antivirus and firewall database. The schema was traced by slow and steady querying of the tables and by joining them together to produce a schema diagram which could be used to join LANDesk and Sophos.

Then the existing SQL used to capture historical data was translated from MS SQL into optimized Oracle queries. In many cases, the SQL was rewritten to be faster and shorter. Furthermore, the captured records could potentially grow very large: we had approximately 55,000 computers around the world and needed to keep complete statistical records of all the systems and surrounding networks. Because not all the systems changed every time, a SQL query was created which would detect when a system's statistics had changed and update only those which did. To apply the updates, we leveraged BODS to accomplish the task. The difficult part was rewriting all the original SQL to improve the database explain plan costs; a sample piece of code is displayed in the appendix to illustrate the complexity of this process.
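The change-detection step can be sketched as a MERGE keyed on the computer identifier that compares the incoming scan values with the stored ones; the history and source table names below are hypothetical stand-ins, the column names come from the LANDesk queries in Appendix A, and how the MS SQL source is actually reached (for example over a database link) is omitted:

    -- Update a history row only when the latest scan values differ; insert new computers.
    MERGE INTO landesk_computer_hist h
    USING ( SELECT computer_idn, HWLastScanDate, SWLastScanDate
            FROM   landesk_computer_src ) src
    ON ( h.computer_idn = src.computer_idn )
    WHEN MATCHED THEN
      UPDATE SET h.HWLastScanDate = src.HWLastScanDate,
                 h.SWLastScanDate = src.SWLastScanDate
      WHERE  h.HWLastScanDate <> src.HWLastScanDate
          OR h.SWLastScanDate <> src.SWLastScanDate
    WHEN NOT MATCHED THEN
      INSERT (computer_idn, HWLastScanDate, SWLastScanDate)
      VALUES (src.computer_idn, src.HWLastScanDate, src.SWLastScanDate);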

As a result, historical reports and trends of our existing FamilySearch centers were created. The redundant SQL for historical records was removed and the performance of the reporting system was greatly improved. Most importantly, reports were created that joined with our four other sources, which enhanced our understanding of the systems.

The following is a figure of the LANDesk portion of the Warehouse.

Figure 8 LANDesk portion of the Warehouse

5 Operation of the Warehouse

5.1 Now vs. Before

Previously, users were unable to gain access to reports, production system performance was compromised, and report results were conflicting. After a concerted effort, the engineers, the TTS team members, and I were able to accomplish all of the goals set forth in this project. All five data sources were unified into one warehouse, and an interface was provided where any user could create custom, accurate reports.

5.2 User satisfaction

The warehouse has lasted two years and has undergone improvements from other engineering teams to include new sources. However, the core warehouse is still intact and in use today. The system has added value to the LDS Center by providing reports to business members at all levels.

A survey was distributed to the users of the system, and many of the responses were similar. James Ison, a manager of the Family History Department, was asked, "What aspect of the reporting portal (data warehouse) was most beneficial?" His response, "Church-wide insight into use of the New.FamilySearch system via the Area Adviser report," reflected the value of the data warehouse in providing new reports.

One of the major report writers, David Lifferth, found added value when he responded to the same question, "What aspect of the reporting portal (data warehouse) was most beneficial?" His answer was, "Drag-and-Drop simplicity in creating new, ad-hoc reports."

A major business analyst, David Armond Acree, answered the question, "Did you see cost savings in using the reporting portal (data warehouse)?" with, "Yes, we saw a savings of 5 hours per week * ($30 per hr estimated) * (48 weeks per year) = $7200 per year."

These are but a few of the users who benefited from the data warehouse. Each user's needs were wide and varied, but every user saw a benefit from the project and was able to improve the business with it.

6 Conclusions

The FHD set out to identify the major data records; to align, clean, and standardize the data; and to unify the data records into one warehouse which could be used as a tool enabling users to act and make changes. To accomplish the project, proper techniques in data modeling, architecting, and data warehousing had to be understood and implemented. A warehouse had to be built from beginning to end, and proper standards had to be established to accommodate future data sources.

The scrum methodology played a key role in user satisfaction by enhancing the

user experience from beginning to end. Further, scrum improved the overall productivity

of all the team members by encouraging an open environment and reducing costs

across the board.

The results of the project can be seen in the numerous hours saved, the reports built from the warehouse, and the hundreds of thousands of dollars saved in upgrade and systems costs. The appendix provides example screen shots of reports made and a few samples of the scripts necessary to undertake the project.

7 Future Work

The FamilySearch user experience data warehouse has so far taken in only five different sources of user experience data. To achieve the goals of the business executives, many other sources need to be integrated into the data warehouse. Further, to conform more closely to the corporation's Information Communication Systems standards, table names and column names need to be aligned with the business's standards. Finally, as the warehouse grows, further system optimizations will be required.

APPENDIX A – SQL SCRIPTS

AMDOCS/CLARIFY

--###############################################
--###############################################
-- Clarify_case_mv_dm
SELECT c.objid, c.creation_time, c.id_number AS "CASE_ID", c.title AS "CASE_TITLE",
       c.x_lang AS "LANGUAGE", c.x_routing, h.title AS "CATEGORY1", h1.title AS "CATEGORY2",
       h2.title AS "CATEGORY3",
       -- h3.title AS "CATEGORY4",
       con.title AS "CASE_CONDITION", st.title AS "CASE_STATUS", q.title AS "CASE_QUEUE"
FROM   table_case c, table_hgbst_elm h, table_hgbst_elm h1, table_hgbst_elm h2,
       -- table_hgbst_elm h3,
       table_condition con, table_gbst_elm st, table_queue q
WHERE  c.CASE_LVL12HGBST_ELM = h.objid(+)
  AND  c.CASE_LVL12HGBST_ELM = h1.objid(+)
  AND  c.CASE_LVL12HGBST_ELM = h2.objid(+)
  -- AND c.x_case_type42hgbst_elm = h3.objid(+)
  AND  c.CASE_CURRQ2QUEUE = q.objid(+)
  AND  c.case_state2condition = con.objid(+)
  AND  c.casests2gbst_elm = st.objid(+)
--###############################################

--###############################################
-- clarify_email_response_mv_dm
SELECT a1.objid, a1.act_entry2case, a1.title, a1.entry_time,
       MIN( CASE WHEN a2.entry_time > a1.entry_time THEN a2.entry_time END ) AS "RESPONSE_DATE",
       ROUND( ( MIN( CASE WHEN a2.entry_time > a1.entry_time THEN a2.entry_time ELSE SYSDATE END )
                - a1.entry_time ) * 24, 3 ) AS "EMAIL_SLA"
FROM  ( SELECT a.objid, eb.title, a.act_code, a.ACT_ENTRY2CASE, a.entry_time, a.addnl_info
        FROM   table_act_entry a, table_gbst_elm eb
        WHERE  a.act_code = eb.rank
          AND  a.ACT_ENTRY2CASE IS NOT NULL ) a1,   -- clarify_act_all_mv_dm
      ( SELECT a.objid, eb.title, a.act_code, a.ACT_ENTRY2CASE, a.entry_time, a.addnl_info
        FROM   table_act_entry a, table_gbst_elm eb
        WHERE  a.act_code = eb.rank
          AND  a.ACT_ENTRY2CASE IS NOT NULL ) a2
WHERE a1.act_entry2case = a2.act_entry2case
  AND a1.ACT_CODE = '3500'
  AND a2.ACT_CODE IN ('500','1700','200','3400')
GROUP BY a1.objid, a1.act_entry2case, a1.title, a1.entry_time
--###############################################

--###############################################
--###############################################
-- clarify_act_all_mv_dm
SELECT a.objid, eb.title, a.act_code, a.ACT_ENTRY2CASE, a.entry_time, a.addnl_info
FROM   table_act_entry a, table_gbst_elm eb
WHERE  a.act_code = eb.rank
  AND  a.ACT_ENTRY2CASE IS NOT NULL;

--###############################################
--###############################################
--###############################################
-- clarify_user_mv_dm
SELECT u.objid, u.login_name, e.first_name, e.last_name, c.NAME AS COUNTRY_NAME,
       e2.first_name AS MANAGER_FNAME, e2.last_name AS MANAGER_LNAME,
       u2.login_name AS MANAGER_LOGIN, u3.login_name AS TOP_MANAGER,
       hb.title AS WORKGROUP, hb2.title AS TOP_WORKGROUP
FROM   table_user u, table_employee e, table_employee e2, table_employee e3,
       table_site s, table_address a, table_country c,
       table_user u2, table_user u3, table_hgbst_elm hb, table_hgbst_elm hb2
WHERE  u.objid = e.employee2user
  AND  e.supp_person_off2site = s.objid
  AND  s.cust_primaddr2address = a.objid
  AND  a.address2country = c.objid
  AND  e.emp_supvr2employee = e2.objid(+)
  AND  e2.work_group = hb.ref_id(+)
  AND  u2.objid = e2.employee2user
  AND  e2.emp_supvr2employee = e3.objid(+)
  AND  e3.work_group = hb2.ref_id(+)
  AND  u3.objid(+) = e3.employee2user
--###############################################

--###############################################
--###############################################
-- CLARIFY_FACT_MV
SELECT c.objid AS "CASE_OBJID", c.case_reporter2site, c.case_reporter2contact, c.case_owner2user,
       COUNT( CASE WHEN a.ACT_CODE = '3500' THEN e.objid END ) AS "EMAIL_IN",
       COUNT( CASE WHEN a.ACT_CODE = '3400' THEN e.objid END ) AS "EMAIL_OUT",
       ( CASE WHEN con.condition <> 4 THEN ((sysdate - c.creation_time)*24*60*60)
              ELSE ((cc.close_date - c.creation_time)*24*60*60) END ) AS "CASE_SEC"
FROM   table_case c, table_act_entry a, table_email_log e,
       ( SELECT last_close2case, MAX(close_date) AS "CLOSE_DATE"
         FROM   table_close_case
         GROUP BY last_close2case ) cc,
       table_condition con
WHERE  c.objid = a.act_entry2case(+)
  AND  a.act_entry2email_log = e.objid(+)
  AND  cc.last_close2case(+) = c.OBJID
  AND  c.case_state2condition(+) = con.objid
GROUP BY c.objid, c.case_reporter2site, c.case_reporter2contact, c.case_owner2user,
       ( CASE WHEN con.condition <> 4 THEN ((sysdate - c.creation_time)*24*60*60)
              ELSE ((cc.close_date - c.creation_time)*24*60*60) END )
--###############################################

LANDesk

USE [DTM_FCH_9]
GO
-- ##########################################################################################
-- Landesk configuration DIM
-- ##########################################################################################
SELECT
    ISNULL( CAST(cd.FHCIDNum AS INT), 0) fhcidnum
  , c.computer_idn
  , ISNULL( c.HWLastScanDate, CAST('1/1/1830' AS DATETIME)) HWLastScanDate
  , CAST( CONVERT( VARCHAR( 8 ), ISNULL(c.HWLastScanDate, '1/1/1830'), 112) AS INTEGER) HWLastScanDate_key
  , ISNULL( c.LastUpdInvSvr, CAST('1/1/1830' AS DATETIME)) LastUpdInvSvr
  , CAST( CONVERT( VARCHAR( 8 ), ISNULL(c.HWLastScanDate, '1/1/1830'), 112) AS INTEGER) LastUpdInvSvr_key
  , ISNULL( c.SecurityLastScanDate, CAST('1/1/1830' AS DATETIME)) SecurityLastScanDate
  , CAST( CONVERT( VARCHAR( 8 ), ISNULL(c.SecurityLastScanDate, '1/1/1830'), 112) AS INTEGER) SecurityLastScanDate_key
  , ISNULL( c.SWLastScanDate, CAST('1/1/1830' AS DATETIME)) SWLastScanDate
  , CAST( CONVERT( VARCHAR( 8 ), ISNULL(c.SecurityLastScanDate, '1/1/1830'), 112) AS INTEGER) SWLastScanDate_key
  , ISNULL( CAST(fh.DPCustomCfg_Date AS DATETIME), CAST('1/1/1830' AS DATETIME)) DPCustomCfg_Date
  , CAST( CONVERT( VARCHAR( 8 ), CAST(ISNULL(fh.DPCustomCfg_Date, '1/1/1830') AS DATETIME), 112) AS INTEGER) DPCustomCfgDate_key
  , (CASE WHEN localsch_version > 98 OR localsch_version < 7 THEN 1 ELSE 0 END) dpcustomcfg_out_of_date
  , ISNULL( fh.LDSReconnectVer, '0.0.0.0') LDSReconnectVer
  , ISNULL( fh.Localsch_Version, -1) Localsch_Version
  , ISNULL( fh.Policy_Ran, 18300101000000 ) LANDesk_Policy_Checkin
  , ISNULL( fh.Sophos_Primary, 'http://www.example.com/') Sophos_Primary
  , ISNULL( fh.Sophos_Secondary, 'http://www.example.com/') Sophos_Secondary
  , ISNULL( fh.Version_Installed, '0.0.0.0') Version_Installed
  , ld.FileDate LDAPPL3_File_Date
  , nt.Language OS_Language
  , nt.MUILang OS_MUILanguage
  , ISNULL( pm.AUTmonVer, 0.0) AUTmonVer
  , ISNULL( pm.GMT_Offset, -99) GMT_Offset
  , ISNULL( pm.nFSmonVer, '0.0') nFSmonVer
  , CASE
      WHEN umd.DATASTRING IS NULL THEN 'Null'
      WHEN umd.DATASTRING LIKE '5.[0-9.]%' THEN '5.x'
      WHEN umd.DATASTRING LIKE '6.[0-9.]%' THEN '6.x'
      WHEN umd.DATASTRING LIKE '7.[0-9.]%' THEN '7.x'
      WHEN umd.DATASTRING LIKE '8.[0-9.]%' THEN '8.x'
      WHEN umd.DATASTRING LIKE '9.[0-9.]%' THEN '9.x'
      ELSE 'Unknown'
    END ie_version_grp
  , umd.DATASTRING ie_version
FROM
    dbo.Computer c
    LEFT OUTER JOIN dbo.CDF cd            ON c.computer_idn = cd.computer_idn
    LEFT OUTER JOIN dbo.Family_History fh ON c.computer_idn = fh.computer_idn
    LEFT OUTER JOIN dbo.LanDesk ld        ON c.computer_idn = ld.computer_idn
    LEFT OUTER JOIN dbo.osnt nt           ON c.computer_idn = nt.computer_idn
    LEFT OUTER JOIN dbo.pem pm            ON c.computer_idn = pm.computer_idn
    LEFT OUTER JOIN dbo.UNMODELEDDATA umd ON c.computer_idn = umd.computer_idn
                                         AND umd.METAOBJATTRRELATIONS_IDN = 1799
ORDER BY
    c.computer_idn;

USE [DTM_FCH_9]
GO
-- ##########################################################################################
-- Landesk configuration FACT
-- ood is out of date
-- ##########################################################################################
SELECT
    ISNULL( CAST(cd.FHCIDNum AS INT), 0) fhcidnum
  , COUNT( DISTINCT c.computer_idn ) NUM_OF_COMPUTER
  , COUNT( CASE WHEN ISNULL( c.HWLastScanDate, CAST('1/1/1830' AS DATETIME)) < GETDATE()-21 THEN 1 ELSE NULL END) HWLastScan_ood  -- gt 21 days
  , COUNT( CASE WHEN ISNULL( c.LastUpdInvSvr, CAST('1/1/1830' AS DATETIME)) < GETDATE()-21 THEN 1 ELSE NULL END) LastUpdInvSvr_ood  -- gt 21 days
  , COUNT( CASE WHEN ISNULL( c.SecurityLastScanDate, CAST('1/1/1830' AS DATETIME)) < GETDATE() - 30 THEN 1 ELSE NULL END ) SecurityLastScanDate_ood  -- gt 30 days
  , COUNT( CASE WHEN ISNULL( c.SWLastScanDate, CAST('1/1/1830' AS DATETIME)) < GETDATE() - 30 THEN 1 ELSE NULL END ) SWLastScanDate_ood  -- gt 30 days
  , COUNT( CASE WHEN ISNULL( fh.DPCustomCfg_Date, CAST('1/1/1830' AS DATETIME)) < GETDATE() - 21 THEN 1 ELSE NULL END ) DPCustomCfg_ood  -- gt 30 days
  , COUNT( CASE WHEN fh.localsch_version > 98 OR fh.localsch_version < 7 THEN 1 ELSE NULL END ) cnt_locsch_ver_out_of_date
  , COUNT( CASE WHEN ISNULL( fh.LDSReconnectVer, '0.0.0.0') LIKE '0.[0-9.]%' THEN 1 ELSE NULL END ) LDSReconnectVer_0x
  , COUNT( CASE WHEN ISNULL( fh.LDSReconnectVer, '0.0.0.0') LIKE '1.[0-9.]%' THEN 1 ELSE NULL END ) LDSReconnectVer_1x
  , COUNT( CASE WHEN ISNULL( fh.LDSReconnectVer, '0.0.0.0') LIKE '2.[0-9.]%' THEN 1 ELSE NULL END ) LDSReconnectVer_2x
  , COUNT( CASE WHEN ISNULL( fh.LDSReconnectVer, '0.0.0.0') LIKE '3.[0-9.]%' THEN 1 ELSE NULL END ) LDSReconnectVer_3x
  , COUNT( CASE WHEN ISNULL( fh.LDSReconnectVer, '0.0.0.0') LIKE '4.[0-9.]%' THEN 1 ELSE NULL END ) LDSReconnectVer_4x
  , COUNT( CASE WHEN ISNULL( fh.LDSReconnectVer, '0.0.0.0') LIKE '5.[0-9.]%' THEN 1 ELSE NULL END ) LDSReconnectVer_5x
  , COUNT( CASE WHEN CONVERT( DATETIME, CAST( ISNULL( fh.Policy_Ran, '18300101000000' ) AS CHAR( 8 ) ) ) < GETDATE() - 21 THEN 1 ELSE NULL END ) LANDesk_Policy_Checkin_ood  -- problem lies here
  , COUNT( CASE WHEN ISNULL( fh.Sophos_Primary, 'http://www.example.com/') NOT LIKE '%ldssr3[de]%' THEN 1 ELSE NULL END ) Sophos_Primary_NC
  , COUNT( CASE WHEN ISNULL( fh.Sophos_Secondary, 'http://www.example.com/') NOT LIKE 'http://es-web.sophos.com/update/' THEN 1 ELSE NULL END ) Sophos_Secondary_NC
  , COUNT( CASE WHEN ISNULL( fh.Version_Installed, '0.0.0.0' ) <> '9.0.1.0' THEN 1 ELSE NULL END ) Version_Installed_NC
  , COUNT( DISTINCT nt.Language ) OS_Language_CNT  -- number of diff langs
  , COUNT( CASE WHEN ISNULL( pm.AUTmonVer, 0.0) LIKE '0.[0-9.]%' THEN 1 ELSE NULL END ) AUTmonVer_0x
  , COUNT( CASE WHEN ISNULL( pm.AUTmonVer, 0.0) LIKE '1.[0-9.]%' THEN 1 ELSE NULL END ) AUTmonVer_1x
  , COUNT( CASE WHEN ISNULL( pm.AUTmonVer, 0.0) LIKE '2.[0-9.]%' THEN 1 ELSE NULL END ) AUTmonVer_2x
  , COUNT( CASE WHEN ISNULL( pm.AUTmonVer, 0.0) LIKE '3.[0-9.]%' THEN 1 ELSE NULL END ) AUTmonVer_3x
  , COUNT( CASE WHEN ISNULL( pm.AUTmonVer, 0.0) LIKE '4.[0-9.]%' THEN 1 ELSE NULL END ) AUTmonVer_4x
  , COUNT( CASE WHEN ISNULL( pm.nFSmonVer, 0.0) LIKE '0.[0-9.]%' THEN 1 ELSE NULL END ) nFSmonVer_0x
  , COUNT( CASE WHEN ISNULL( pm.nFSmonVer, 0.0) LIKE '1.[0-9.]%' THEN 1 ELSE NULL END ) nFSmonVer_1x
  , COUNT( CASE WHEN ISNULL( pm.nFSmonVer, 0.0) LIKE '2.[0-9.]%' THEN 1 ELSE NULL END ) nFSmonVer_2x
  , COUNT( CASE WHEN umd.DATASTRING IS NULL THEN 1 ELSE NULL END) IE_NULLS
  , COUNT( CASE WHEN umd.DATASTRING LIKE '5.[0-9.]%' THEN 1 ELSE NULL END) 'IE_5x'
  , COUNT( CASE WHEN umd.DATASTRING LIKE '6.[0-9.]%' THEN 1 ELSE NULL END) 'IE_6x'
  , COUNT( CASE WHEN umd.DATASTRING LIKE '7.[0-9.]%' THEN 1 ELSE NULL END) 'IE_7x'
  , COUNT( CASE WHEN umd.DATASTRING LIKE '8.[0-9.]%' THEN 1 ELSE NULL END) 'IE_8x'
  , COUNT( CASE WHEN umd.DATASTRING LIKE '9.[0-9.]%' THEN 1 ELSE NULL END) 'IE_9x'
FROM
    dbo.Computer c
    LEFT OUTER JOIN dbo.CDF cd            ON c.computer_idn = cd.computer_idn
    LEFT OUTER JOIN dbo.Family_History fh ON c.computer_idn = fh.computer_idn
    LEFT OUTER JOIN dbo.LanDesk ld        ON c.computer_idn = ld.computer_idn
    LEFT OUTER JOIN dbo.osnt nt           ON c.computer_idn = nt.computer_idn
    LEFT OUTER JOIN dbo.pem pm            ON c.computer_idn = pm.computer_idn
    LEFT OUTER JOIN dbo.UNMODELEDDATA umd ON c.computer_idn = umd.computer_idn
                                         AND umd.METAOBJATTRRELATIONS_IDN = 1799
GROUP BY
    ISNULL( CAST(cd.FHCIDNum AS INT), 0)
ORDER BY
    ISNULL( CAST(cd.FHCIDNum AS INT), 0);

APPENDIX B – SCRIPTS

KANISA

=begin

************************************* NOTES SECTION ********************************************

# KSC_authoring-Production-PASK-009-033-00_00_00-2009_05_28.log

# KSC_case_activity-Production-PASK-009-033-00_00_00-2009_05_28.log

# KSC_response_central-Production-PASK-009-033-00_00_00-2009_05_28.log

# KSS_favorites-Production-PASK-009-033-00_00_00-2009_05_28.log

# KSS_forum-Production-PASK-009-033-00_00_00-2009_05_28.log

# KSS_kc_view-Production-PASK-009-033-00_00_00-2009_05_28.log

# KSS_RAR_events-Production-PASK-009-033-00_00_00-2009_05_28.log

# PLATFORM-Kanisa-Build-1242789327-PASK-009-033-2009_05_28.log

DI will load the master files in by updating the tables.

i.e. if the table has data in it, DI will append to the table.

rules:

- delete old master files

- load only with the new lines

get all the file names in the directory

hash the filenames like the following:

@filenames = {

'file_one' => [[one, date_1], [two, date_2], [three, date_3]],

'file_two' => [[one, date_1], [two, date_2], [three, date_3]],

'file_three' => [[one, date_1], [two, date_2], [three, date_3]]

}

delete all entries that are strictly older than current set group of files

loop through all the remaining files appending them together and updating the logs_status

goto the current file of the set

if cur.mtime != f.mtime

open f

load into an array

goto cur.line

add lines to master file

close file

update log_status

end

=end

require 'yaml'
require 'date'   # Date.parse is used below when comparing log-file dates

class XmlStuff

###################################################################################################

attr_accessor :keys, :log_config, :category_f_name_arrays, :file_names

LOG_STATS = 'logs_status.yml'

AUTHORING = 0

CASE_ACTIVITY = 1

FAVORITES = 2

FORUM = 3

KC_VIEW = 4

RAR = 5

PLATFORM_KANISA = 6

RESPONSE_CENTRAL = 7

###################################################################################################

###################################################################################################

# Setup all the variables that are need for this transfer

###################################################################################################

def initialize()

@log_config = YAML.load_file( LOG_STATS )

@keys = Array.new( 8, true )

@category_f_name_arrays = nil

end

###################################################################################################

###################################################################################################

# Takes in no parameters, but utilizes a constant to figure out what file you want to use.

# The only dependant variable which is needed is @log_config.

###################################################################################################

def update_logs_status

x = File.open( LOG_STATS, 'w' ) do |out|

YAML.dump( @log_config , out )

end

end

###################################################################################################

###################################################################################################

# loop through all the remaining files appending them together and updating the logs_status

# loop_and_append_to_masters depends on @log_config and on which catagory is passed in
# (the different file names, i.e. authoring, case_activity, etc.). Please see logs_status.yml
# to see the different attributes @log_config can have.

###################################################################################################

def loop_and_append_to_masters(options={})

# Each new file will be started with master then the given catagory. i.e. forum, response central

fout = File.open("master_#{ options[ :catagory ] }.log", File::WRONLY|File::TRUNC|File::CREAT )

fout.puts( @log_config[ options[ :catagory ] ][ 'headers' ] )

#Now lets loop through the rest of the files.

@category_f_name_arrays[ options[:catagory] ].each do |l_file|

# sod stands for start of data. I don't want to grab the headers when I append the data.

sod = @log_config[ options[ :catagory ] ][ 'start_of_data' ]

# grab the file by its full name and then split it into an array so we can skip ahead past the headers.

fin = open_and_load( l_file.first )

# Since the file has data in it, state the new current file and date it was created.

@log_config[ options[ :catagory ] ]['current_file'] = l_file.first

@log_config[ options[ :catagory ] ]['cdate'] = l_file.last

# We don't need to parse through the rest of the file if it is exactly the length of the start of data.

next if (fin.length - 1) == sod

# sod is based on zero, so since fin.length is the actual length, we need to return one less so i know which index

# to start on.

@log_config[ options[ :catagory ] ]['end_line'] = fin.length - 1

# grab the range of data and work with it.

fin[ sod..fin.length-1 ].each do |item|

# Now let's concatenate the lines if the line is not empty.

fout.puts( item.strip ) unless item.empty?

# execute the block unless the slice is empty, which is what nil means in this case.

end unless fin[ sod..fin.length-1 ].nil? # end fin[ sod..fin.length-1 ].each |item|

end # end @category_f_name_arrays[:catagory].each do |fname|

# once file processing is done for the range, close and write all the newly acquired data to the output.

fout.close()

end

###################################################################################################

###################################################################################################

# files_with_date uses @category_f_name_arrays to categorize the different files. The
# decision to use @category_f_name_arrays in this manner was due to debugging, file loading,
# and complexity issues. If you are on Linux you can execute: ruby make_xml_file.rb `ls *.log`
# On Windows, just executing the file will look in a default location, which can be changed.

###################################################################################################

def files_with_date

# Find out whether the platform the script is running on is Windows or Linux. If it is Linux,
# just look for the arguments. If it is Windows, look for the files in a predefined folder.

filenames = []

# Check for the platform and execute the appropriate commands

# filenames = ( RUBY_PLATFORM =~ /mswin32/ ) ? ( %x{ dir /B files\\Logs }.split( "\n" ) ) : ARGV

# %x{ dir /B files\\Logs }.split( "\n" ).each do |f| filenames << "files\\Logs\\" + f end

# Gather all the files from the correct drives and directories. What we are not seeing

# is the given drives are mapped directly to the files we are looking for.

%x{ dir /B w:\\ }.split( "\n" ).each do |f| filenames << "w:\\" + f end

%x{ dir /B x:\\ }.split( "\n" ).each do |f| filenames << "x:\\" + f end

%x{ dir /B y:\\ }.split( "\n" ).each do |f| filenames << "y:\\" + f end

# remove from the list all the names that don't have a date attached to them

filenames = filenames.delete_if{ | x |

!( x =~ /\d{4}_\d{2}_\d{2}/ )

}

# we used @category_f_name_arrays as a hash because we wanted versatility.

@category_f_name_arrays = { }

@category_f_name_arrays[ 'authoring' ] ||= []

@category_f_name_arrays[ 'case_activity' ] ||= []

@category_f_name_arrays[ 'favorites' ] ||= []

@category_f_name_arrays[ 'forum' ] ||= []

@category_f_name_arrays[ 'kc_view' ] ||= []

@category_f_name_arrays[ 'rar' ] ||= []

@category_f_name_arrays[ 'platform-kanisa' ] ||=[]

@category_f_name_arrays[ 'response_central' ] ||=[]

# Loop through all the newly acquired filenames and place them in the appropriate hash to be sorted later.
# :i is used for keys on the debug. :category is the first match in the regular expression. :cdate is
# the date that is grabbed by the regular expression. :fn is the full name of the file.

for fn in filenames

fin = fn.downcase

if fin =~ /(authoring).*(\d{4}_\d{2}_\d{2})/ and @keys[ 0 ]

load_files( { :fn => fn, :i => 0, :cdate => $2, :category => $1 } )

elsif fin =~ /(case_activity).*(\d{4}_\d{2}_\d{2})/ and @keys[ 1 ]

load_files( { :fn => fn, :i => 1, :cdate => $2, :category => $1 } )

elsif fin =~ /(favorites).*(\d{4}_\d{2}_\d{2})/ and @keys[ 2 ]

load_files( { :fn => fn, :i => 2, :cdate => $2, :category => $1 } )

elsif fin =~ /(forum).*(\d{4}_\d{2}_\d{2})/ and @keys[ 3 ]

load_files( { :fn => fn, :i => 3, :cdate => $2, :category => $1 } )

elsif fin =~ /(kc_view).*(\d{4}_\d{2}_\d{2})/ and @keys[ 4 ]

load_files( { :fn => fn, :i => 4, :cdate => $2, :category => $1 } )

elsif fin =~ /(rar).*(\d{4}_\d{2}_\d{2})/ and @keys[ 5 ]

load_files( { :fn => fn, :i => 5, :cdate => $2, :category => $1 } )

elsif fin =~ /(platform-kanisa).*(\d{4}_\d{2}_\d{2})/ and @keys[ 6 ]

load_files( { :fn => fn, :i => 6, :cdate => $2, :category => $1, :min_length => 4, :depth => 7 }

)

elsif fin =~ /(response_central).*(\d{4}_\d{2}_\d{2})/ and @keys[ 7 ]

load_files( { :fn => fn, :i => 7, :cdate => $2, :category => $1 } )

end # end if fn =~ /authoring/

end # end for fn in filenames

# after we have loaded all the files into @category_f_name_arrays, we need to sort them by date.

@category_f_name_arrays.each do |item, value|

next if value.empty?

value.sort! do |x,y|

x.last <=> y.last

end

end

end

###################################################################################################

###################################################################################################

# Load the files in by using the cdate (the date pulled out by the regular expression) and the yaml file.

###################################################################################################

def load_files( options={} )

# d is the date in question from the regular expression

date_from_file_name = Date.parse(options[:cdate].gsub(/_/, '-'))

# dc is the date from the logs_status.yml file. :category should be the name garnered from
# the regular expression or some simplified name for the logs_status file.

stored_date_in_yaml_file = @log_config[ options[ :category ] ][ 'cdate' ]

# if the current file is older than the file in question, put it in the queue

# to be processed.

if stored_date_in_yaml_file < date_from_file_name && date_from_file_name != Date.today() # 1-1-2000 < 1-1-2009

@category_f_name_arrays[ options[ :category ] ] << [ options[:fn], date_from_file_name ]

end

end

###################################################################################################

###################################################################################################

# debug_info is used to display information,

###################################################################################################

def debug_info( options={} )

# @keys[ options[:i] ] = false

# d is the date in question

d = Date.parse(options[:cdate].gsub(/_/, '-'))

#dc == currently logged date

dc = @log_config[ options[:category] ]['cdate']

if dc < d

@category_f_name_arrays[ options[ :category ] ] << [ options[:fn], d ]

#@log_config[ options[ :category ] ]['date'] = d.to_s

end

the_file = open_and_parse( options[:fn] )

if the_file.length > options[:min_length] ||= 4

#puts "The name of the file is: #{ options[ :fn ].downcase }"

#puts "The length of the file is: #{ the_file.length }"

the_file.each_with_index do |item, index|

#break if index > options[ :depth ] ||= 5

item.each do |cell|

puts cell

end if item.length > 52

puts "The length of the line[#{index}] is: #{item.length}" if item.length > 52

end

end

#puts @keys.to_yaml

end

###################################################################################################

###################################################################################################

def setup_the_xml

filenames = ( RUBY_PLATFORM =~ /mswin32/ ) ? ( %x{ dir /B files\\Logs }.split( "\n" ) ) : ARGV

filenames = filenames.delete_if{ | x |

!( x =~ /\d{4}_\d{2}_\d{2}/ )

}

@category_f_name_arrays = { }

@category_f_name_arrays[ 'authoring' ] ||= []

@category_f_name_arrays[ 'case_activity' ] ||= []

@category_f_name_arrays[ 'favorites' ] ||= []

@category_f_name_arrays[ 'forum' ] ||= []

@category_f_name_arrays[ 'kc_view' ] ||= []

@category_f_name_arrays[ 'rar' ] ||= []

@category_f_name_arrays[ 'platform-kanisa' ] ||=[]

@category_f_name_arrays[ 'response_central' ] ||=[]

for fn in filenames

fin = fn.downcase

#headers are at line[3] and the length is 17

if fin =~ /(authoring).*(\d{4}_\d{2}_\d{2})/ and @keys[ 0 ]

debug_info( { :fn => fn, :i => 0, :cdate => $2, :category => $1 } )

elsif fin =~ /(case_activity).*(\d{4}_\d{2}_\d{2})/ and @keys[ 1 ]

debug_info( { :fn => fn, :i => 1, :cdate => $2, :category => $1 } )

elsif fin =~ /(favorites).*(\d{4}_\d{2}_\d{2})/ and @keys[ 2 ]

debug_info( { :fn => fn, :i => 2, :cdate => $2, :category => $1 } )

elsif fin =~ /(forum).*(\d{4}_\d{2}_\d{2})/ and @keys[ 3 ]

debug_info( { :fn => fn, :i => 3, :cdate => $2, :category => $1 } )

elsif fin =~ /(kc_view).*(\d{4}_\d{2}_\d{2})/ and @keys[ 4 ]

debug_info( { :fn => fn, :i => 4, :cdate => $2, :category => $1 } )

elsif fin =~ /(rar).*(\d{4}_\d{2}_\d{2})/ and @keys[ 5 ]

debug_info( { :fn => fn, :i => 5, :cdate => $2, :category => $1 } )

elsif fin =~ /(platform-kanisa).*(\d{4}_\d{2}_\d{2})/ and @keys[ 6 ]

debug_info( { :fn => fn, :i => 6, :cdate => $2, :category => $1, :min_length => 4, :depth => 7 } )

elsif fin =~ /(response_central).*(\d{4}_\d{2}_\d{2})/ and @keys[ 7 ]

debug_info( { :fn => fn, :i => 7, :cdate => $2, :category => $1 } )

end # end if fn =~ /authoring/

end # end for fn in filenames

end # end def setup_the_xml
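# Illustration (added for clarity, not part of the original script): how the regular expressions
# above route a file. For a hypothetical name such as "Authoring_2011_03_02.log":
#   "authoring_2011_03_02.log" =~ /(authoring).*(\d{4}_\d{2}_\d{2})/
#   $1   # => "authoring"    (passed to debug_info as :category)
#   $2   # => "2011_03_02"   (passed to debug_info as :cdate)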

###################################################################################################

###################################################################################################

def open_and_load(fn_in=nil)

fin = File.open(fn_in, "r" )

file_array = []

fin.each_line { |line| file_array.push line }

#close the file

fin.close

return file_array

end

###################################################################################################

###################################################################################################

def open_and_parse( fn_in=nil, fn_out="temp.log", start_line=0 )

fin = File.open(fn_in, "r" )

file_array = []

fin.each_line { |line| file_array.push line.strip.split("\t") }

#close the file

fin.close

return file_array

end # end def open_and_parse fn_in=nil, fn_out="temp.log", start_line=0
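# Illustration (added for clarity, not part of the original script): for a hypothetical
# tab-delimited log line such as "2011_03_02\tSearch\t42\tOK\n", open_and_parse strips the
# trailing newline and splits on tabs, so the returned file_array contains
# ["2011_03_02", "Search", "42", "OK"] for that line.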

###################################################################################################

###################################################################################################

end

# Delete all the log files just to be clean.

puts "Deleting all log files"

%x{ erase *.log }

# delete the old alldcn.zip file

puts "Deleting the old alldcn.zip"

%x{ erase alldcn.zip }

# Delete the log_conversion.zip file

puts "Deleting log_conversion.zip"

%x{ erase log_conversion.zip }

puts "creating tapestry"

x = XmlStuff.new

tapestry = []

x.files_with_date

# Let's start to create the tapestry of all the threads which create the files.

x.category_f_name_arrays.each do |key, value|

tapestry << Thread.new {x.loop_and_append_to_masters( {:catagory => key} )}

puts "#{key} has been threaded."

end

# Before we move on, let's wait for all the threads to finish.

puts "waiting to join all the threads"

tapestry.each do |t|

t.join

end

# We need to update the log status file so we have the most up-to-date data. The if statement is used for debugging purposes.

if true

puts "updating logs status file"

x.update_logs_status

end

# Use the 7-Zip command-line utility (32-bit) to compress all the logs after they are made.

# http://sourceforge.net/projects/sevenzip/files/7-Zip/4.65/7za465.zip/download

puts "creating log_conversion.zip"

%x{ 7za a -tzip log_conversion.zip *.log }

puts "creating alldcn.zip"

%x{ 7za a -tzip alldcn.zip "c:/fch/Kanisa/Kanisa Platform/KSM/Archive/dcnFiles/*" }
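For reference, the following is a minimal, self-contained sketch (not part of the script above) of the bookkeeping the XmlStuff class performs: each matching file is stored as a [filename, Date] pair under its category, and each category is then sorted chronologically with x.last <=> y.last. The file names and the single 'authoring' category are hypothetical examples.

require 'date'

# Hypothetical file names in the style matched by the regular expressions above.
filenames = [ 'Authoring_2011_03_02.log', 'Authoring_2011_01_15.log' ]

category_f_name_arrays = { 'authoring' => [] }

filenames.each do |fn|
  if fn.downcase =~ /(authoring).*(\d{4}_\d{2}_\d{2})/
    # Store the file as a [filename, Date] pair under the captured category.
    category_f_name_arrays[ $1 ] << [ fn, Date.parse( $2.gsub(/_/, '-') ) ]
  end
end

# Same comparison used above: order each category's files by date.
category_f_name_arrays.each_value { |list| list.sort! { |x, y| x.last <=> y.last } }

puts category_f_name_arrays['authoring'].map(&:first)
# => Authoring_2011_01_15.log
#    Authoring_2011_03_02.log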

OMNITURE

/*
 * Simple example that makes a call to the Omniture API to get a company's report suites.
 *
 * Requires the following libraries:
 *   jakarta commons-lang 2.4
 *   jakarta commons-beanutils 1.7.0
 *   jakarta commons-collections 3.2
 *   jakarta commons-logging 1.1.1
 *   ezmorph 1.0.6
 *   json-lib-2.3-jdk13
 *
 * @author Lamont Crook
 * @email  [email protected]
 *
 * @edited Kaleb J. Albee
 * @email  [email protected]
 */

//package com.omniture.security;

import java.io.*;
import java.util.regex.*;
import java.net.URL;
import java.net.URLConnection;
import java.security.MessageDigest;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import net.sf.json.JSONArray;
import net.sf.json.JSONObject;
import java.text.DateFormat;
import java.text.Format;
import java.text.ParseException;

public class OMTR_REST {

    private static String USERNAME = "albeekj:LDS";
    private static String PASSWORD = "849a07ace5ad6451ac861f158d77dd05";
    private static String LOGOUTPUT = "F:\\fh_share\\fhd_tts\\wiki\\omniture";
    private static String DEVELOPMENT_FOLDER = "F:\\fh_share\\fhd_tts\\dev";
    private static Integer WAIT_TIME = 30;
    //private static String ENDPOINT = "https://sc.omniture.com/p/am/1.2/rest-api.html"; //san jose endpoint
    private static String ENDPOINT = "https://api.omniture.com/admin/1.2/rest/"; //san jose endpoint
    public static final String DATE_FORMAT_NOW = "yyyy-MM-dd";

    private OMTR_REST() {}

    public static String now(int back) {
        Calendar cal = Calendar.getInstance();
        cal.add(Calendar.DATE, -back); // this will put in yesterday's date when back == 1
        SimpleDateFormat sdf = new SimpleDateFormat(DATE_FORMAT_NOW);
        return sdf.format(cal.getTime());
    }
    //#############################################################################################

    //#############################################################################################
    public static void loadCsv(String msg, String rptName) {

        // Set up the variable to catch the JSON message.
        String response = msg;
        String fout = "";

        // Assign the DateFormat class the formatting of the string coming in.
        DateFormat df = new SimpleDateFormat("EEE d MMM yyyy");

        // Set up the output formatting.
        Format f = new SimpleDateFormat("yyyyMMdd");

        // I needed to know what date I was going to access and this is its placeholder.
        Date given_date = null;

        // Reach in and grab the report object.
        JSONObject jsonObj = JSONObject.fromObject(response).getJSONObject("report");

        // Now pass the array data to the JSON array.
        JSONArray jsonArry = JSONArray.fromObject(jsonObj.get("data"));
        JSONArray jtmp = null;

        // The headers of the CSV being pumped out.
        fout = "date\tpageViews\tvisits\tunique_visitors\n";

        // I need another date holder.
        String jdate = null;

        // Now loop through the report => data array.
        for (int i = 0; i < jsonArry.size(); i++) {

            // jtmp stands for JSON temporary.
            // For each item in the array I am looking for the element "counts".
            jtmp = JSONArray.fromObject(JSONObject.fromObject(jsonArry.get(i)).get("counts"));

            // Now I need a way to store the date, which comes from the element "name".
            jdate = JSONObject.fromObject(jsonArry.get(i)).get("name").toString();

            try {
                jdate = jdate.replaceAll("(?i)\\.", "");
                jdate = jdate.replaceAll("(?i)\\s{2,}", " ");
            } catch (PatternSyntaxException ex) {
                // Syntax error in the regular expression.
                ex.printStackTrace();
            } catch (IllegalArgumentException ex) {
                // Syntax error in the replacement text (unescaped $ signs?).
                ex.printStackTrace();
            } catch (IndexOutOfBoundsException ex) {
                // Non-existent backreference used in the replacement text.
                ex.printStackTrace();
            }

            // Now try to parse jdate and convert it to a Date type.
            try {
                given_date = df.parse(jdate);
            } catch (ParseException e) {
                e.printStackTrace();
            } // end catch (ParseException e)

            // Once it is formatted we can concatenate it to fout, which later
            // will be written to the log file on the F: drive.
            fout += f.format(given_date)   // get the date from the array
                  + "\t" + jtmp.get(0)
                  + "\t" + jtmp.get(1)
                  + "\t" + jtmp.get(2)
                  + "\n";

        } // end for (int i = 0; i < jsonArry.size(); i++)

        // Since fout might have some trailing whitespace characters, we are
        // going to trim off those characters just in case.
        toLogs(fout.trim(), rptName);

        // I am not totally sure if this is useful, but at this point it is just displaying
        // to the screen/console.
        jsonArry = JSONArray.fromObject(jsonObj.get("totals"));
        System.out.println(jsonArry.get(1));

    } // end public static void loadCsv(String msg, String rptName)
    //#############################################################################################

    //#############################################################################################
    public static void processReport(Map map, String rptName) throws IOException {

        // Now we need to send the request to the Omniture web services server.
        String response = OMTR_REST.callMethod("Report.QueueOvertime", JSONObject.fromObject(map).toString());
        // on 2/18/2011 we had 1942 tokens

        // The following line is used to parse the JSON-formatted response. We could request this in
        // XML format, but I don't want to do that until we get more complicated queries.
        JSONObject jsonObj = JSONObject.fromObject(response);

        // I cast the number to an int so it is easier to ask questions about it later.
        int rptid = (Integer) jsonObj.get("reportID");

        // Check to see if we have a report.
        if (rptid > 0) {

            // We need to wait WAIT_TIME seconds so Omniture has time to generate the report.
            waiting(WAIT_TIME);

            // Print the report id we received from the Omniture servers.
            System.out.println("the value we got was: " + rptid);

            // Set map to null so we can clear out the value. I ran into problems with this earlier.
            map = null;

            // Now assign the map variable to a new hash map. Once that is done we can assign the parameters to it.
            map = new HashMap();

            // Assign the key reportID the report id. We have to cast it as a string so the mapping will put quotes around it.
            map.put("reportID", "" + rptid);

            // Tell the user the value of the hash map.
            System.out.println(JSONObject.fromObject(map).toString());

            // We need to check the status of the report. If the status is done we can move on.
            response = OMTR_REST.callMethod("Report.GetStatus", JSONObject.fromObject(map).toString());

            // Pass the response to the JSON parser so we can quickly check what the value was.
            jsonObj = JSONObject.fromObject(response);

            // Finally, let's get the actual report from the Omniture server and then parse it later.
            response = OMTR_REST.callMethod("Report.GetReport", JSONObject.fromObject(map).toString());
            jsonObj = JSONObject.fromObject(response);
            int trys = 10;
            while (jsonObj.get("status").toString().compareTo("not ready") == 0 && trys > 0) {
                trys -= 1;
                System.out.println("not ready waiting for 10 seconds");
                System.out.println(trys + " trys left");
                waiting(10);
                response = OMTR_REST.callMethod("Report.GetReport", JSONObject.fromObject(map).toString());
                jsonObj = JSONObject.fromObject(response);
            } // end while (status is "not ready" and trys > 0)

            // We need to wait WAIT_TIME seconds so it has time to pass in all the data.
            waiting(WAIT_TIME);

            if (trys > 1) // if trys is greater than one, then the report must have been ready
                loadCsv(response, rptName);

        } // end if (rptid > 0)
        else {
            System.out.println("Report id was " + rptid + ". That was not acceptable.");
        } // end if (rptid > 0) else

    } // end public static void processReport()
    //#############################################################################################

    //#############################################################################################
    public static void toLogs(String msg, String rptName) {
        BufferedWriter fout = null;
        try {
            // If I want this to append, I need to pass true to the FileWriter constructor as a
            // second parameter.
            fout = new BufferedWriter(new FileWriter(LOGOUTPUT + "_" + rptName + ".csv"));
            fout.write(msg);
            fout.flush();
            fout.close();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (fout != null) try {
                fout.close();
            } catch (IOException ioe2) {
                // ignore
            }
        }
    }
    //#############################################################################################

    //#############################################################################################
    public static void debug_ouput(String msg) {
        BufferedWriter fout = null;
        try {
            // If I want this to append, I need to pass true to the FileWriter constructor as a
            // second parameter.
            fout = new BufferedWriter(new FileWriter(DEVELOPMENT_FOLDER + "\\output.txt"));
            fout.write(msg);
            fout.flush();
            fout.close();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (fout != null) try {
                fout.close();
            } catch (IOException ioe2) {
                // ignore
            }
        }
    }

//#############################################################################################

    //#############################################################################################
    public static String callMethod(String method, String data) throws IOException {
        URL url = new URL(ENDPOINT + "?method=" + method);
        System.out.println(url);
        URLConnection connection = url.openConnection();
        connection.addRequestProperty("X-WSSE", getHeader());
        connection.setDoOutput(true);
        OutputStreamWriter wr = new OutputStreamWriter(connection.getOutputStream());
        wr.write(data);
        wr.flush();

        InputStream in = connection.getInputStream();
        BufferedReader res = new BufferedReader(new InputStreamReader(in, "UTF-8"));

        StringBuffer sBuffer = new StringBuffer();
        String inputLine;
        while ((inputLine = res.readLine()) != null)
            sBuffer.append(inputLine);

        res.close();
        return sBuffer.toString();
    }
    //#############################################################################################

    //#############################################################################################
    private static String getHeader() throws UnsupportedEncodingException {
        byte[] nonceB = generateNonce();
        String nonce = base64Encode(nonceB);
        String created = generateTimestamp();
        String password64 = getBase64Digest(nonceB, created.getBytes("UTF-8"), PASSWORD.getBytes("UTF-8"));
        StringBuffer header = new StringBuffer("UsernameToken Username=\"");
        header.append(USERNAME);
        header.append("\", ");
        header.append("PasswordDigest=\"");
        header.append(password64.trim());
        header.append("\", ");
        header.append("Nonce=\"");
        header.append(nonce.trim());
        header.append("\", ");
        header.append("Created=\"");
        header.append(created);
        header.append("\"");
        return header.toString();
    }
    //#############################################################################################

    //#############################################################################################
    private static byte[] generateNonce() {
        String nonce = Long.toString(new Date().getTime());
        return nonce.getBytes();
    }
    //#############################################################################################

    //#############################################################################################
    private static String generateTimestamp() {
        SimpleDateFormat dateFormatter = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        return dateFormatter.format(new Date());
    }
    //#############################################################################################

    //#############################################################################################
    private static synchronized String getBase64Digest(byte[] nonce, byte[] created, byte[] password) {
        try {
            MessageDigest messageDigester = MessageDigest.getInstance("SHA-1");
            // SHA-1 ( nonce + created + password )
            messageDigester.reset();
            messageDigester.update(nonce);
            messageDigester.update(created);
            messageDigester.update(password);
            return base64Encode(messageDigester.digest());
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }
    //#############################################################################################

    //#############################################################################################
    // waiting was taken from a website whose URL I can't remember.
    public static void waiting(int n) {
        long t0, t1;
        t0 = System.currentTimeMillis();
        do {
            t1 = System.currentTimeMillis();
        } while ((t1 - t0) < (n * 1000));
    }
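    // Note (added for clarity, not in the original listing): this loop busy-waits, keeping the CPU
    // spinning for n seconds. A lower-cost alternative with the same effect would be
    // Thread.sleep(n * 1000L) wrapped in a try/catch for InterruptedException.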

//#############################################################################################

    //#############################################################################################
    private static String base64Encode(byte[] bytes) {
        return Base64Coder.encodeLines(bytes);
    }
    //#############################################################################################

    //#############################################################################################
    public static void main(String[] args) throws IOException {

        // Declare the base map so we can eventually send the request to the Omniture server.
        Map map = new HashMap();

        // desc is used to pass parameters to the Omniture server.
        Map desc = new HashMap();

        // a is used to state the type of report we are asking for. We will be asking for an overtime report later.
        Map a = new HashMap();

        // We need to ask for the page views report.
        a.put("id", "pageViews");

        // And we are asking specifically for visits.
        Map b = new HashMap();
        b.put("id", "visits");

        Map c = new HashMap();
        c.put("id", "visitorsdaily");

        // Now we set the starting date. The formatting does matter at this point.
        // I am going to have DI check for duplicate dates. If one does exist, then ignore the entry.
        // The integer parameter is a number of days back. If you really want now, you need to enter 0.
        desc.put("dateFrom", OMTR_REST.now(14).toString());

        // The formatting of the date matters to Omniture.
        // Since we pass a 1, that means we are one day back, i.e. yesterday.
        desc.put("dateTo", OMTR_REST.now(1).toString()); // "now" is misleading; it is actually yesterday.

        // If we look into the API we have several options, but we have chosen to look at the day.
        desc.put("dateGranularity", "day");

        // Now we need to pass the metrics set up earlier to the metrics portion of the JSON request.
        desc.put("metrics", new Map[]{a, b, c});

        // We need to loop through all the arguments passed in to the application.
        for (int i = 0; i < args.length; i++) {

            // We are now going to ask for the specific site we want to look at.
            desc.put("reportSuiteID", args[i]);

            // With the description assembled, we need to pass it to the reportDescription key.
            map.put("reportDescription", desc);

            // The following line is primarily used for debug purposes. It is useful to see the format of the JSON
            // passed to the Omniture web services server.
            processReport(map, args[i]);
        }
    }
}
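To make the request format easier to see, the following is a small, illustrative Ruby sketch (not part of the Java listing above) of the reportDescription that main() assembles and posts to Report.QueueOvertime. The report suite id and the two dates are placeholder values; in the program above they come from the command-line arguments and from OMTR_REST.now(14) and OMTR_REST.now(1).

require 'json'

report_description = {
  'reportDescription' => {
    'reportSuiteID'   => 'example-suite-id',  # hypothetical; main() takes this from args[i]
    'dateFrom'        => '2012-02-15',        # placeholder for OMTR_REST.now(14), fourteen days back
    'dateTo'          => '2012-02-28',        # placeholder for OMTR_REST.now(1), yesterday
    'dateGranularity' => 'day',
    'metrics'         => [
      { 'id' => 'pageViews' },
      { 'id' => 'visits' },
      { 'id' => 'visitorsdaily' }
    ]
  }
}

# Pretty-print the JSON body that would be sent to the web service.
puts JSON.pretty_generate( report_description )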

Family History Center (FHC) Profile

SELECT

ot.fhc_no,

ot.fhc_unit_no,

ot.fhc_name,

DECODE(ot.org_type_id, 1211, 'AHC', NVL(cl.fhc_type, 'FHC') ) AS fhc_type,

ot.fhc_area,

cfdt.cfar_no AS cfar_no,

good.spons_unit_no,

good.sponsoring_unit,

ot.parent_unit_no,

ot.parent_unit,

cfdt.hrs_open AS hrs_open,

ot.center_hours AS fhc_hrs,

cfdt.closed AS closed,

cfdt.area_advisor AS area_advisor,

ot.bill_to_unit_no,

ot.bill_to_unit_name,

cfdt.supp_stks AS supp_stks,

ot.temple_district,

cfdt.visitor_ctr AS visitor_ctr,

cfdt.corr_fac AS corr_fac,

ta.assgn_person_name AS dir_name,

ta.home_phone_number AS dir_phone,

ta.work_phone_number AS dir_work_phone,

ot.fhc_phone,

ta.assgn_email_address AS dir_email,

ot.fhc_email,

l.lang_name AS fhc_lang,

cfdt.NETWK_TYPE AS NETWK_TYPE,

ot.fhc_loc_add1,

ot.fhc_loc_add2,

ot.fhc_loc_add3,

ot.fhc_loc_add4,

ot.fhc_loc_country,

ot.fhc_loc_postal,

ot.fhc_loc_city,

ot.fhc_loc_state,

cfdt.fhc_loc_county AS FHC_LOC_COUNTY,

ot.approval_date,

cfdt.meetinghouse AS MEETINGHOUSE,

cfdt.BLDG_TYPE AS BLDG_TYPE,

cfdt.FM_PROPERTY_NO AS FM_PROPERTY_NO,

cfdt.FM_GROUP_UNIT_NO AS FM_GROUP_UNIT_NO,

cfdt.FM_GROUP AS FM_GROUP,

cfdt.FM_GROUP_PHONE AS FM_GROUP_PHONE,

cfdt.fm_same_bldg AS FM_SAME_BLDG,

cfdt.notes AS NOTES,

cfdt.admin_notes AS ADMIN_NOTES,

cfdt.attention_notes AS ATTENTION_NOTES,

cfdt.INITIAL_FHC_NO AS INITIAL_FHC_NO,

cfdt.CO_FLAG AS CO_FLAG,

cfdt.NET_NO_SHOW AS NET_NO_SHOW,

ot.fhc_name AS fhc_mail_ctr_name,

ta.ASSGN_PERSON_NAME AS fhc_mail_name,

ta.MAILING_STREET_4 AS fhc_mail_add1,

ta.MAILING_STREET_3 as fhc_mail_add2,

ta.MAILING_STREET_2 as fhc_mail_add3,

ta.MAILING_STREET_1 as fhc_mail_add4,

ta.MAILING_POSTAL_CODE as fhc_mail_postal,

ta.MAILING_COUNTRY_COMMON_NAME as fhc_mail_country,

ta.MAILING_STATE_PROV_COMMON_NAME as fhc_mail_state,

cfdt.SUPPORT_OFFICE as SUPPORT_OFFICE,

cfdt.COUNTRY_ADVISOR as COUNTRY_ADVISOR,

ot.mission,

cfdt.no_hrs_open as NO_HRS_OPEN,

cfdt.rept_to_country as REPT_TO_COUNTRY,

ta.cell_phone_number as dir_cell_phone,

cfdt.film_circulation_q as FILM_CIRCULATION_Q,

ot.fhc_loc_add_comp,

ta.MAILING_ADDRESS_COMPOSED as fhc_mail_comp_addr,

ta.MAILING_CITY as fhc_mail_city,

cfdt.FHC_HISTORICAL_NOTES as FHC_HISTORICAL_NOTES,

ta.ASSIGNMENT_ACTIVE_DATE as dir_start_date,

ot.ORG_STATUS_CODE,

ot.fax,

ot.date_loaded updated,

cfdt.XP_LICENSES as XP_LICENSES,

cfdt.NODE as NODE

FROM

orgs_temp ot,

(

--selecting director

select * from tmp_asst sub_ta where sub_ta.POSITION_TYPE_ID = 97

) ta,

(

--########################

--sponsoring unit and number

--########################

select

fhc_stuff.fhc_unit_no, fhc_stuff.fhc_sponsoring_unit_type, fhc_stuff.spons_unit_no, fhc_stuff.Sponsoring_Unit

FROM

( SELECT

nvl(CASE fhc.PARENT_ORG_TYPE_ID

WHEN 5 THEN par.org_name

WHEN 6 THEN par.org_name

WHEN 3 THEN par.org_name

WHEN 1 THEN par.org_name

ELSE gpar.org_name

END, fhc.PARENT_ORG_NAME) AS Sponsoring_Unit,

nvl(CASE fhc.PARENT_ORG_TYPE_ID

WHEN 5 THEN par.unit_number

WHEN 6 THEN par.unit_number

WHEN 3 THEN par.unit_number

WHEN 1 THEN par.unit_number

ELSE gpar.unit_number

END, fhc.PARENT_UNIT_NUMBER) AS spons_unit_no,

nvl(CASE fhc.PARENT_ORG_TYPE_ID

WHEN 5 THEN par.org_type

WHEN 6 THEN par.org_type

WHEN 3 THEN par.org_type

WHEN 1 THEN par.org_type

ELSE gpar.org_type

END, fhc.PARENT_ORG_type) AS fhc_sponsoring_unit_type,

fhc.UNIT_NUMBER AS fhc_unit_no

FROM

mdmr.mdm_org_association moa,

mdmr.mdm_org fhc,

mdmr.mdm_org sup_stake,

mdmr.mdm_org par,

mdmr.mdm_org gpar

WHERE

moa.association_type_code(+)=78

and fhc.org_id=moa.CONSUMER_ORG_ID(+)

and fhc.ORG_TYPE_ID in (44, 49)

and sup_stake.org_id(+)=moa.ASSOCIATED_PROVIDER_ORG_ID

and fhc.ORG_STATUS_CODE=1

and par.org_id(+)=fhc.PARENT_ORG_ID

and gpar.org_id(+)=par.parent_org_id) fhc_stuff

group by

fhc_stuff.fhc_unit_no, fhc_stuff.fhc_sponsoring_unit_type, fhc_stuff.spons_unit_no, fhc_stuff.Sponsoring_Unit) good,

(--###########################

--language subquery join

select

ol.org_id,l.lang_name

from

mdmr.mdm_org_language ol,

mdmr.mdm_language l

where

ol.LANGUAGE_CODE = l.LANGUAGE_CODE and

ol.ORG_SPOKEN_LANGUAGE_RANK = 1

) l, --#########################

(

SELECT

case oc.org_subclass_id

when 91 THEN 'RGN'

when 89 THEN 'CO'

when 90 THEN 'FHC'

ELSE

'FHC'

end AS FHC_TYPE,

oc.ORG_ID

FROM

mdmr.mdm_org_classification oc

WHERE

upper(oc.ORG_CLASSIFICATION) like upper('%family%')

) cl,

DEBUG_TMP_CFHCD4 cfdt

WHERE

ot.org_id = l.org_id(+)

AND ot.FHC_UNIT_NO = cfdt.fhc_unit_no(+)

AND ot.org_id = cl.org_id(+)

AND good.fhc_unit_no(+) = ot.fhc_unit_no

AND ta.org_id(+) = ot.ORG_ID

AND ot.org_type_id in (44, 48, 47)

ORDER BY

ot.fhc_unit_no desc

APPENDIX C – REPORTS
