Table of Contents
1 Introduction............................................................................................................................................10
1.1 Background .....................................................................................................................................10
1.2 Problem defined and project involvement.......................................................................................10
1.3 Need for managing many data sources ...........................................................................................11
1.4 Proposed solutions..........................................................................................................................11
1.5 Project Objectives............................................................................................................................11
2 Literature Review....................................................................................................................................11
2.1 Database..........................................................................................................................................11
2.1.1 Definition and Attributes...........................................................................................................11
2.1.2 Typical use in Businesses...........................................................................................................11
2.2 Data Warehouse.................................................................................................11
2.2.1 History.......................................................................................................................................11
2.2.2 Definition and Attributes...........................................................................................................12
2.2.3 Importance and Trends.............................................................................................................12
2.2.4 Database Vs. Data Warehouse..................................................................................................12
2.2.5 Schemas....................................................................................................................................12
2.2.5.1 Star schema....................................................................................................................12
2.2.5.2 Snowflake Schema..........................................................................................................13
2.2.5.3 Fact Constellation...........................................................................................................13
2.3 Data formatting................................................................................................................................14
2.3.1 Data cleansing and Standardization..........................................................................................14
2.3.2 Definition...........................................................................................................................14
2.3.3 Procedure..........................................................................................................................14
2.3.4 Approaches........................................................................................................................14
2.3.5 Challenges.........................................................................................................................14
3 Scrum......................................................................................................................................................14
3.1 Definition.........................................................................................................................................14
3.2 History..............................................................................................................................................14
3.3 Attributes.........................................................................................................................................14
3.4 Popularity.........................................................................................................................................15
3.5 Application of Scrum........................................................................................................................15
4 Data Modeling and Architecting to Provide Ad-hoc Reporting....................................................16
4.1 Clarify/Amdocs.................................................................................................................................16
4.2 Family History Center (FHC) Profile..................................................................................................17
4.3 Kanisa...............................................................................................................................................17
4.4 Omniture web services....................................................................................................................18
4.5 LANDesk...........................................................................................................................................18
5 Operation of the Warehouse..................................................................................................................19
5.1 Now vs. Before.................................................................................................................................19
5.2 User satisfaction...............................................................................................................................19
6 Conclusions.............................................................................................................................................20
7 Future Work............................................................................................................................................21
APPENDIX A – SQL SCRIPTS........................................................................................................................22
AMDOCS/CLARIFY..................................................................................................................................22
LANDesk.................................................................................................................................................30
APPENDIX B – SCRIPTS...............................................................................................................................35
KANISA...................................................................................................................................................35
OMNITURE.............................................................................................................................................51
Family History Center (FHC) Profile........................................................................................................70
APPENDIX C – REPORTS.............................................................................................................................77
Works Cited...............................................................................................................................................78
1 Introduction
1.1 Background
The Church of Jesus Christ of Latter-day Saints (hereafter LDS Church) has a historical focus on genealogical research, derived from an interpretation of a section in the Old Testament of the Bible (KJV Malachi 4:5-6). The LDS Church maintains a Family History Department dedicated to genealogy research, and has placed a high priority on, and allocated significant resources to, genealogy. The FamilySearch.org website of the LDS Church is one of the fastest growing genealogy Internet sites in the world (Top 10 U.S. Websites to Search for Your Ancestors, 2012). The Tools, Technology, and Support (TTS) Division of the Family History Department is tasked with improving the accessibility and usability of the FamilySearch.org site.
In a climate where personal computer access and power are rapidly expanding across the world, the TTS Division has observed the inevitable limitations of diverse users and recognized the importance of addressing those limitations. Genealogical raw data consists of government records (e.g., census data), graveyard records, library/community histories, newspaper articles, and personal records (e.g., journals where accessible). Raw data has been physically stored in a climate-controlled facility near Salt Lake City termed 'Granite Mountain', where approximately 35 billion images of genealogical information, contained mostly on 2.4 million rolls of microfilm, reside (Taylor, 2010). Those raw data sources must go through a lengthy process of record validation, digitization, storage, and archival prior to end use and research. This digitization process is conducted primarily by volunteers, and also through collaboration with other businesses, to make the records viable for worldwide genealogical research on the FamilySearch.org site.
1.2 Problem defined and project involvement
A large volume of end users such as indexers, genealogists, and curious website surfers from various backgrounds and different countries use the website on a regular basis. When these users encountered problems, customer service was previously disjointed and inefficient. Customer service responses were stored in a number of different storage systems. The volume of these records was quite large, amounting to gigabytes of data. These records were used to improve the LDS genealogy website and provide service to users.
The main problem with the customer service records was the disjointed, disparate sources of these records. The challenge was how to integrate unique data record storage systems that had no obvious associations. These storage systems were commonly obtained from different applications and indeed different countries. Initial attempts to streamline responses were inefficient because the staff discovered much of the user response data was not recorded in the databases. Instances occurred when computationally intensive reports analyzing the user experience with several of the contributing databases, such as LANDesk data or Client Management tools, crippled entire production systems.
These inefficiencies led to an initiative to provide the most complete feedback
possible to FamilySearch.org management by identifying a single point of access for
quality reports analyzing user experiences and reporting in a universally accessible
format (e.g., MS Excel, HTTPS, and Crystal). The solution to the storage and retrieval
of the massive amount of user experience data in separate formats was to gather user
experience data in all its forms and place it into a centralized warehouse in a universally
accessible format. This is the main purpose and task of this Master's project and will be described in full detail in the following sections.
Figure 1: Process chart. Five most common user experience data sources from FamilySearch.org research efforts.
1.3 Need for managing many data sources
User experience data is derived from multiple sources such as the Family History Center (FHC) Profile (ref), Amdocs (ref), LANDesk (ref), Kanisa (ref), and online tracking tools (see Figure 1). Each source of data has its own unique set of metrics for the data that will be tracked. The Family History Center (FHC) Profile tracked personnel usage, software usage, and volunteer usage. LANDesk (http://www.landesk.com) tracked the specific usage of a Family History Center's computer. Amdocs (http://www.amdocs.com/Pages/HomePage.aspx) tracked patron-agent interactions, resources accessed, and how quickly a solution was found. Kanisa (http://crm.consona.com/software/products/knowledge-management.aspx) tracked which documents were accessed, what key words were used, and the approximate time spent on each page. Online tracking tools, such as Omniture (http://www.omniture.com/en/), tracked a user's country of origin, IP address, and where the user would enter and leave the FamilySearch website. Together, these five sources of user experience data, with a combined storage footprint of 60 gigabytes, comprised nearly all of the data inputs and proved extremely challenging to incorporate into a single warehouse. The five data sources contained approximately 100 million records.
1.4 Proposed solutions
The purpose of the five data sources was to improve the user experience by
utilizing each application’s record-keeping and analysis tools. Thus, the goal was to
create a data warehouse derived from five user experience data applications, and
provide an enterprise wide solution where a business user of any expertise could create
a customizable report from the warehouse. The proposed solution to this challenge was
to assess data integration feasibility, design cleansing/standardization procedures,
automate data consumption, and architect and integrate data warehouse schemas.
The process of creating a data warehouse required the use of enterprise level
tools, assessment of databases and programming languages, and incorporation of
custom scripts at the database (DB) level to deliver data manipulations. The enterprise
level tools consisted of Business Objects Data Services (BODS) and other extended
data cleansing tools (http://www.sap-businessobjects.info/data-services.htm).
Additionally, techniques were researched that reduced the time required to query the
data warehouse. Members from the TTS Division were tasked with utilizing the
proposed warehoused data records for creating applications to enable end-user output
that would contain graphs, charts, and raw data.
1.5 Project Objectives
Given data record storage systems in diverse formats with few obvious relational connections, the objectives of this project were to:
Identify the major data record systems
Cleanse and standardize data
Unify the data records into one warehouse
Conduct user accessibility testing to ensure the storage warehouse would
operate properly with each application
The overall goal of the TTS project was to provide a simple graphical interface
that a user of any technical background, with nominal knowledge of SQL, could use to
create reports. The end-users of the proposed data warehouse comprised executives, managers, business analysts, volunteers, and product developers. The
following sections provide an explanation of the research process, techniques, and final
accomplishments of this project.
2 Literature Review
2.1 Database
2.1.1 Definition and Attributes
A database is a "structured collection of data. It may be anything from a simple shopping list to a picture gallery or the vast amounts of information in a corporate network (ref). A relational database stores data in separate tables rather than putting all the data in one big combined table. The database structures are organized into physical files optimized for speed. The logical model, with objects such as databases, tables, views, rows, and columns, offers a flexible programming environment" (What is MySQL, 2012). For the purposes of the data warehousing project, MS Excel spreadsheets and MS Access will be included as database data sources.
2.1.2 Typical use in Businesses
Some of the most common uses of databases in industry are retail customer records, governmental records, large complex computations on statistical data, and medical (patient) records (ref). Retail businesses often use customer records to analyze consumer habits, track inventory usage, or target advertisements to regional products. Examples of government records are a person's Social Security records or tax records. Large, complex queries can be run against a database because databases are optimized to perform these kinds of transactions frequently. Finally, medical patient records allow medical facilities to call up a patient's history or medication usage in an emergency, and large sets of patient records can provide broader insight.
2.2 Data Warehouse
2.2.1 History
Data warehousing is a relatively new technology born out of consumer needs. The warehouse technology was driven by consumers assembling their assets and technologies to accomplish one goal: a single access point from which to mine data from many sources (ref). Consumers, businesses, and organizations needed to analyze data in ways previously impossible or impractical, because single, separate reports could not be combined in a reasonable time period. Further, the computational resources were often inadequate to sustain production performance while generating reports from the data. In an effort to meet their needs, customers combined several pieces of hardware, software, data mining techniques, and analytic tools (ref). As a result, the movement towards data analytics across multiple data sources was created. The following sections detail the attributes and importance of data warehouses.
2.2.2 Definition and Attributes
William Inmon introduced four standards required for the design of a good data warehouse (DWH Concepts and Fundamentals, 2007). First, the warehouse must be subject oriented; second, integrated; third, non-volatile; and fourth, time variant. Each of these qualities allows a business analyst to ask a wide variety of questions about a company and have them answered in a timely, reliable, and focused way. Each standard is described in more detail below.
A data warehouse must be subject oriented. Therefore, the data within the data warehouse has to be organized in such a way that it can answer questions about the company. An example of a business question could be: how many users from a given country are accessing the systems, and at what time of day or night? The data must also be organized in a manner that facilitates many different kinds of questions (ref).
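As an illustration, a subject-oriented warehouse organized around user sessions could answer that question with a single query. This is only a sketch; the table and column names (session_fact, country_dim, session_time) are hypothetical and not taken from the project:

    -- Sessions per country per hour of day, from a hypothetical session fact table.
    SELECT d.country_name,
           TO_CHAR(f.session_time, 'HH24') AS hour_of_day,
           COUNT(*)                        AS session_count
    FROM   session_fact f
    JOIN   country_dim  d ON d.country_key = f.country_key
    GROUP  BY d.country_name, TO_CHAR(f.session_time, 'HH24')
    ORDER  BY d.country_name, hour_of_day;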
For a warehouse to be integrated, all the data in the warehouse should be unified. The data fields have to match formats. Naming conflicts need to be corrected everywhere, such as those in a country field: a country name can be abbreviated, capitalized, or misspelled, and to be unified, one of these variants must become the standard. The units have to match up to guarantee that a report writer will receive accurate results from the warehouse; inaccuracies develop when they are not coordinated. For example, when multiple servers placed around the world have their clocks set to different time zones, the data time stamps will vary.
A non-volatile data warehouse has to ensure that the data already in place never changes. Considering the business questions the data warehouse is designed for, the warehouse provides a historical snapshot of the business and its performance. As a result, the warehouse grows perpetually larger by nature of its design.
Lastly, a data warehouse must be time variant. The purpose of a data warehouse is largely to report on trends, statistics, and the other needs of the business. So, whenever new data is entered into the system, a time stamp or other detail linking to a date needs to be inserted (ref).
A data warehouse has several other features attributed to its design. Oftentimes, the data warehouse must have several indexes placed on all the tables. Indexes are a way to provide the host database quick, pre-calculated access into vast quantities of data. Because they are pre-calculated, every time new data is stored into the warehouse the indexes must be recalculated in order to remain effective. Indexes take up space in proportion to the number of indexes and the quantity of data being indexed.
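A minimal sketch of that trade-off, using hypothetical object names rather than the project's actual tables, is an index on the column most reports filter by, rebuilt after bulk loads (Oracle syntax):

    -- Index the column most reports filter on.
    CREATE INDEX session_fact_time_idx ON session_fact (session_time);

    -- After a bulk load, rebuild the index so its pre-calculated structure stays effective.
    ALTER INDEX session_fact_time_idx REBUILD;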
Pre-calculated metrics are another common feature of a data warehouse. The utilization of pre-calculated metrics is another consequence of the immense amounts of data stored in a data warehouse (ref).
A warehouse is usually de-normalized, which produces duplicate data. This would seem to be a problem in a database; however, in a warehouse, de-normalizing the data structure decreases complexity, increases the search speed of the warehouse, and improves the simplicity of the queries. The overall performance of the queries increases substantially (ref).
2.2.3 Importance and Trends
A data warehouse can provide business executives deep insights into how their business is performing in near real time (Benefits of a Data Warehouse, 2011). In a global, twenty-four-hour market, every business needs an edge over the competition. However, sustaining the required appliances, which include expensive reporting software for analytics services, can be prohibitive. The average business is moving towards analytic tools that do not require specialized technical skill sets. As a result, companies such as WebFOCUS, SAP Business Objects, MicroStrategy, and Microsoft Business Intelligence have all created tools to aid in analytics (List of Business Intelligence (BI) Tools, 2012).
Due to the change in the global market, businesses are trying to reduce costs while achieving the same processing potential delivered by custom-built, in-house appliances. They have turned to cloud analytics, which has been encouraged by Google's BigQuery™, Infinit.e™, and other products (Higginbotham, 2012). In a harsh economy, businesses are trying to save money in as many places as they can; by using cloud computing and the appliances provided by cloud analytics businesses, corporations can make decisions based upon near real time reports and statistics on their products.
2.2.4 Database Vs. Data Warehouse
What is the difference between a database and a data warehouse? A data warehouse can be a database, but a database is not necessarily a data warehouse. A database is optimized for disk writes and is normalized to conserve disk space due to the high volume of data. A data warehouse is built with the intent to do analytics and reporting on joined metrics across several sources. A data warehouse will commonly contain data from several different databases, whereas a standard database will be tuned to handle only one application. Further, a data warehouse will be optimized to handle the analytic business intelligence (BI) questions the business needs answered.
Several enterprise applications are available if the data warehouse has been architected to accommodate a BI tool. WebFOCUS, SAP Business Objects, MicroStrategy, and Microsoft Business Intelligence were the four tools we evaluated to provide an enterprise solution after the warehouse was built. We decided to focus our efforts on SAP Business Objects, which primarily required a fact schema to produce its solution. Microsoft Business Intelligence, which relied on data cubes, was the second choice.
2.2.5 Schemas
Three types of schemas are commonly used in industry: the Star schema, the Snowflake schema, and the fact constellation schema.
2.2.5.1 Star schema
The Star schema has the most parsimonious joins among records. "A Star schema is characterized by one or more very large fact tables that contain the primary information in the data warehouse, and a number of much smaller dimension tables (or lookup tables), each of which contains information about the entries for a particular attribute in the fact table" (Oracle.com) (see Figure 2). Within fact tables there are two kinds of data: aggregates of the dimension tables, and foreign keys to the associated dimension tables. The fact table's foreign keys join to the surrounding dimension tables within the schema. Databases such as MySQL, Oracle, and MS SQL Server recognize Star schema queries and automatically optimize the execution plan to take advantage of the schema architecture (Star-Schema Design, 2010).
Figure 2: Star schema
The straightforward approach the Star schema provides, and the large number of Business Intelligence (BI) tools available to read Star schemas, make it an efficient choice for organizing information for reporting. Because the Star architecture is simple to understand and maintain, it is less expensive to run and therefore more palatable for businesses. Additionally, its simplicity ensures fewer dependencies that could otherwise prohibit system improvements (Star Schema, 2009).
Nevertheless, implementation brings increased costs due to record storage volume, and complex reports are of limited availability. A data warehouse architect must carefully analyze the resources available and plan appropriately for business growth. If data warehouse systems are limited by storage space, or complex reports are required, then the Star schema is not an ideal choice (Star Schema, 2009).
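A minimal sketch of a Star schema, with hypothetical table names rather than the warehouse's real ones, shows a central fact table carrying measures and foreign keys to small dimension tables, and the kind of star-join query a BI tool would issue:

    -- Dimension tables: small lookup tables, one row per attribute value.
    CREATE TABLE date_dim    (date_key NUMBER PRIMARY KEY, calendar_date DATE, month_name VARCHAR2(20));
    CREATE TABLE country_dim (country_key NUMBER PRIMARY KEY, country_name VARCHAR2(100));

    -- Fact table: measures plus foreign keys to the dimensions.
    CREATE TABLE case_fact (
        date_key     NUMBER REFERENCES date_dim (date_key),
        country_key  NUMBER REFERENCES country_dim (country_key),
        case_count   NUMBER,
        minutes_open NUMBER
    );

    -- A typical star-join query: filter and group on dimensions, aggregate the facts.
    SELECT d.month_name, c.country_name, SUM(f.case_count) AS cases
    FROM   case_fact   f
    JOIN   date_dim    d ON d.date_key    = f.date_key
    JOIN   country_dim c ON c.country_key = f.country_key
    GROUP  BY d.month_name, c.country_name;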
2.2.5.2 Snowflake Schema
Figure 3: Snowflake Schema
"In computing, a Snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake in shape. The Snowflake schema is represented by centralized fact tables which are connected to multiple dimensions" (wikipedia.org, 2012).
An advantage of the Snowflake schema, illustrated in Figure 3, over the Star schema is its ability to handle more complex reports and queries. Another advantage rests in its ability to save on storage space if that is a factor for the warehouse. Business needs are often met by these two schemas (wikipedia.org, 2012).
Some disadvantages of a Snowflake schema include the potential for overly complex sets of queries and increased difficulty in reading the schema. The complexity of the queries using the Snowflake schema increases the workload on the host database's CPU, RAM, and I/O transactions. Furthermore, the complexity of the table relationships increases the difficulty of maintaining the system's schema structure as more data is added.
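The structural difference from the Star layout can be sketched by normalizing one dimension. In the hypothetical example below (which assumes the fact table carries a location_key), the country attribute is moved out of the location dimension into its own table, saving repeated storage at the cost of one extra join:

    -- Snowflaked dimensions: the location dimension no longer repeats country names.
    CREATE TABLE country_dim  (country_key NUMBER PRIMARY KEY, country_name VARCHAR2(100));
    CREATE TABLE location_dim (location_key NUMBER PRIMARY KEY, city VARCHAR2(100),
                               country_key NUMBER REFERENCES country_dim (country_key));

    -- Queries now need an extra join to reach the country attribute
    -- (assumes case_fact carries a location_key foreign key in this variant).
    SELECT c.country_name, COUNT(*) AS cases
    FROM   case_fact    f
    JOIN   location_dim l ON l.location_key = f.location_key
    JOIN   country_dim  c ON c.country_key  = l.country_key
    GROUP  BY c.country_name;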
2.2.5.3 Fact Constellation
A fact constellation is a set of Star schemas joined together by common
dimension tables or fact tables. By creating a fact constellation, the complexity of a
schema is increased exponentially, maintenance is costly, and the space usage remains
large due to the size of the dimension tables.
Figure 4: Fact Constellation
Thus, the fact constellation schema is used sparingly for complex and in-depth reports
(Figure 4) (Dimensional Model Schemas- Star, Snow-Flake and Constellation, 2012).
2.3 Data formatting
2.3.1 Data cleansing and Standardization
2.3.2 Definition
Data cleansing is the process of detecting enigmatic values and then removing them or entering values which provide a standard answer. Data cleansing can be one of the most time-consuming and difficult complications in creating a data warehouse. An example is how a database interprets empty, or null, values. A NULL is an expression of an empty field which needs to be present in case a record has need of it. The conflict arises when one database parses a zero as an empty value while a different database construes the value as a literal integer.
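One hedged way to make that decision explicit during loading is to map missing values to a known sentinel in the staging step, so every downstream consumer interprets the field the same way. The table and column names below are illustrative only:

    -- Treat a missing duration as unknown (-1) rather than letting one
    -- system read it as zero and another as empty.
    INSERT INTO warehouse_stage (case_id, call_minutes)
    SELECT case_id,
           NVL(call_minutes, -1)   -- COALESCE(call_minutes, -1) is the portable equivalent
    FROM   source_cases;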
2.3.3 Procedure
Before data can be entered into a database, the records are audited for contradictions such as spelling, formatting, and false entries. Then a process is developed to remove or correct the discrepancies within the data records. The developed cleansing process is accepted and automated for data alteration. Next, the process is implemented and tested in a staging environment. Finally, the data is examined a third time to inspect for irregularities in the records.
2.3.4 Approaches
Common practices used to prepare data records for data warehousing include parsing, duplicate record deletion, and statistical methods. Each practice has its advantages and disadvantages, which apply in different situations and in different combinations.
The parsing approach employs the use of regular expressions and strict word matching. An example of parsing is as follows:

    SELECT SUBSTR(v_content, 0, REGEXP_INSTR(v_content, '<PROBLEM(.*?)>', 1, 1, 0, 'i') - 1)
    INTO   fp
    FROM   document_wvar_mv
    WHERE  documented = '110118' AND draft = 0;

The preceding SQL regular expression looks for an XML tag which contains the word PROBLEM followed by a closing brace. Strict word/phrase matching is useful when the possible values in a particular field or dataset are limited to a small subset. In a database, the unique set of values in the field is queried and a standardized value is agreed upon. Finally, a script or developer executes the alteration devised by the architect or engineers.
Duplicate deletion removes all duplicate records and adjusts all join values to point to the single remaining instance. Data transformation is the approach where a certain value is detected by several means and is then changed into an agreed-upon value. One such example is how a geographical state such as Utah is expressed: Utah can be written as UT, Utah, utah, or even indicated by its zip code.
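A sketch of that transformation, again with hypothetical table and column names, first inspects the distinct values and then collapses the agreed-upon variants to one standard:

    -- Inspect the variants that actually occur in the field.
    SELECT DISTINCT state FROM patron_profile;

    -- Collapse the agreed-upon variants to the standard value.
    UPDATE patron_profile
    SET    state = 'UT'
    WHERE  UPPER(TRIM(state)) IN ('UT', 'UTAH')
       OR  SUBSTR(zip_code, 1, 2) = '84';   -- Utah zip codes begin with 84
    COMMIT;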
Finally, statistical methods can prove useful when data records are too numerous to format into a report. If records are taken for every instance of an event in an application, the records quickly become too extensive to provide timely reports, and often that level of detail is not necessary. Records can instead be summarized using statistical methods such as an average, mean, or deviation over a specified time period.
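For example, rather than warehousing every page-view event, a nightly job could store one summarized row per page per day. The names below are illustrative, not the project's actual tables:

    -- Reduce per-event rows to one daily summary row per page.
    INSERT INTO page_view_daily (view_date, page_name, view_count, avg_seconds_on_page)
    SELECT TRUNC(view_time), page_name, COUNT(*), AVG(seconds_on_page)
    FROM   page_view_events
    GROUP  BY TRUNC(view_time), page_name;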
2.3.5 Challenges
The most common challenges with data cleansing are errors in the error correction procedure and the time required for maintenance. Error correction can be difficult due to the nature of the corrections. If the users assigned to implement the cleansing procedure do not understand the data, a desired value may be altered and skew the results of any report utilizing the field(s). As the data sources are live documents (constantly changing), errors occur regularly; as such, the warehouse needs to be continually maintained. The time allocated to maintenance can be prohibitive.
3 Scrum
Scrum is one of many possible software engineering methodologies used in the development of large projects. Scrum was utilized for this project as the best way to foster communication between everyone involved and to reduce the complexity into manageable tasks. This section describes the Scrum methodology, its background, and its history.
3.1 Definition
The methodology my team used to deliver all of the requested products is based on a process called Scrum. Scrum "[…] is an agile framework for completing complex projects" (Scrum Is an Innovative Approach to Getting Work Done, 2012). An agile framework relies upon the ability to take a large task and break it down into smaller tasks.
3.2 History
In 1986 Hirotaka Takeuchi and Ikujiro Nonaka introduced a new methodology for product development called Scrum. Scrum dictates that team responsibilities overlap, and it encourages teams to work together. Scrum is a term from the sport of rugby: a scrum occurs when a game violation is called by the referee and the two teams need to bring the ball back into play. Each team must work together to get control of the ball. The ball is equated to the problem or project at hand, and in order to reach their goal the team must work together against the opposition.
The Scrum methodology gained further formal attention from the book "Wicked Problems, Righteous Solutions" written by DeGrace and Stahl (ref). This was the first book to term the agile approach as Scrum. Then, in 1995, Jeff Sutherland and Ken Schwaber presented Scrum in "Business Object Design and Implementation: OOPSLA '95 Workshop Proceedings" (ref). This conference emphasized the initial Scrum processes.
Since then, businesses and organizations have enhanced and personalized the process to meet their needs, but the foundation of the approach is owned by Ken Schwaber. Scrum has become widely popular not only in software development, but also in other engineering fields.
3.3 Attributes
In order to understand the Scrum process, we first need to define some terms associated with the Scrum philosophy: Scrum Master, product backlog, sprint backlog, sprint, daily Scrum meeting, product owner, development team, and usable product.
A Scrum Master is a person on a Scrum team who is assigned the responsibility to remove problems, to encourage and enforce adherence to the Scrum process among project members, and to prevent distractions from reaching the development team. Problems which may need to be removed can include, but are not limited to, missing resources, outside teams refusing to do necessary work, and implementation obstacles.
When resources are required, such as guidance from a data expert or a software package, the Scrum Master can dedicate their time to finding the expert and setting up appointments and meetings, or to finding the monetary resources to acquire software or equipment.
Process enforcement is especially important to ensure work progresses: it maintains transparency between the development team and management, communication between the client and the project team, and project progress overall. Transparency between the development team, client, and management prevents constant rework, management misunderstandings, and client impatience. Rework is caused when work is completed that is not what the client or manager wants. A further advantage is that when the client is aware of project problems, progress, and realistic expectations of project completion, they can plan their schedules to meet their own needs.
Distractions can cause serious delays, especially when 'scope creep' enters into a project. Scope creep is when the customer requests additional features for their project which are not in the accepted project outline. Another common distraction can be, for example, when other people in the company ask a developer for 'favors' or tasks to be done 'real quick.' Every time a project team member is distracted by a quick fix, the developer has to spend roughly 15 minutes switching from the current task to the requested task and then back again. If that transition happens once a day all week long, an average team member will lose two and a half hours a week.
Next, a product backlog is a list of features or tasks that need to be completed for an entire project to be finished. The product backlog is a List of Requirements (LoR) negotiated by the Scrum Master and the client, who work together to break it down into a set of manageable tasks.
A sprint is an interval of time, between two and four weeks in length. Sprints are designed to encourage smaller tasks to be assigned, thereby preventing large amounts of rework and emphasizing constant communication between the stakeholders in the project.
Daily Scrums are necessary to promote communication between team members, expedite the problem solving process, and draw attention to potentially time-consuming problems. During the fifteen-minute Scrum meeting, the team members talk about the problems they had and give suggestions on how to solve them. Further, the Scrum Master is informed of potential problems, which they will then attempt to remove.
Figure 5: Basic Scrum Process
The product owner is the person who represents the client and the design of the product. The product owner needs to communicate the needs of the customer and the potential problems that arise during the process.
Finally, a usable product is something that can be given to the customer in working form, no matter how minimal it might be. Something workable can be as simple as a login feature or as complex as a complete security suite built into a custom-made application. However it is defined, the product must be useful to the client.
The basic process of Scrum is as follows and is shown in Figure 5:
1. Project conception: An idea for work is first presented.
2. Project backlog: The problem or idea is then put into a product/project backlog. The backlog is a list of tasks to be completed for a given project.
3. Select tasks: Work with the development team to plan what can be accomplished in the Scrum time period and move those tasks to the sprint backlog.
4. Sprint backlog: Assign tasks to each developer or group of developers according to what they feel they can handle.
5. Communicate with stakeholders and acquire resources: Talk with the stakeholders and negotiate the amount of work to be done during that time frame.
6. Sprint: Work on sprint items and report to the stakeholders when each item is done. The stakeholders can give their approval of the work and of its quality. If a developer finishes all their tasks for the sprint, they are to go to other developers and help them complete their tasks. By helping each other, cross-training occurs and the team becomes stronger.
3.4 Popularity
Scrum, in the beginning, was used to help software development, but it has quickly been adopted into other projects and industries. Scrum has become increasingly popular due to its ability to control a project's new requirements, the relative ease with which the agile framework can manage enormous projects, and the structure's simplicity.
3.5 Application of Scrum
In the first iteration of the Scrum process, the data was analyzed and then a risk assessment of the deliverables was performed. Approximate time intervals were planned for the next two-week interval. The assessment included the requirements set in place for a proper data warehouse.
The first standard was easy to assess: could the data be modeled to answer
business questions? Each of the data sources had business analysts who had worked
with the data already. Data analysts commonly had a set of business questions they
were already answering. Frequently, the analysts had a set of backlogged questions,
which needed to be answered in addition to the current questions. As a result, I was
tasked to figure out how to add those backlogged queries to our ETL and thus into our
system.
Next, the ability to integrate the data into our existing warehouse needed to be assessed. I consulted with the data experts, and they identified common and, all too often, hard-to-solve integration points. Technologies were then engaged to ensure data would come across automatically to the data warehouse.
Next, the data needed to be non-volatile. One of the frustrating problems, in relation to the records' volatility, was that the engineers would report the data was corrupted due to some problem in the production systems. During the two-week Scrum process, the data was first captured in raw format. Because our existing and regularly used reports relied on certain columns of data, we needed to take advantage of materialized views [1], table views [2], or simple SQL queries to increase the chances that I could compensate for the possibility of corruption by adding in additional corruption checks.
[1] A materialized view is a replica of a target master from a single point in time. The master can be either a master table at a master site or a master materialized view at a materialized view site. Whereas in multimaster replication tables are continuously updated by other master sites, materialized views are updated from one or more masters through individual batch updates, known as refreshes, from a single master site or master materialized view site.
[2] A view is a representation of a SQL statement that is stored in memory so that it can be re-used.
Finally, the data needed to be time variant. However, adding a simple time stamp was not sufficient for many of the sources. The business experts wanted the time of an event in the system to be based upon several factors within the system, and sometimes within other systems. I had to apply two timestamps: a time of insertion and a timestamp of the client/system event in the database.
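A sketch of what this looks like at the table level, with hypothetical names rather than the project's actual objects, keeps both the warehouse load time and the original event time, and layers a materialized view over the raw capture so corruption checks can run against a stable snapshot:

    CREATE TABLE user_event_fact (
        event_id      NUMBER PRIMARY KEY,
        event_time    TIMESTAMP,                        -- when the client/system event occurred
        inserted_time TIMESTAMP DEFAULT SYSTIMESTAMP,   -- when the row entered the warehouse
        source_system VARCHAR2(30)
    );

    -- A materialized view over the raw capture gives a stable snapshot,
    -- refreshed on our schedule rather than with every source change.
    CREATE MATERIALIZED VIEW user_event_mv
      REFRESH COMPLETE ON DEMAND
    AS SELECT event_id, event_time, source_system FROM user_event_fact;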
Next in the Scrum process, a burn down chart was used. A burn down chart tracks the progress each team member made during the sprint. Using the burn down chart to estimate the number of tasks we could handle, we would bid for tasks during the sprint. Then, the clients would be informed of what would be possible during the sprint. Often, "no" was an accepted answer to new tasks given to me in the middle of a Scrum period, until the current task was completed. The only exceptions were business needs considered critical to running the business.
The Scrum assignments were tasks to create automated import processes. Specifically, tasks were to create scripts for cleansing records, unifying core tables, adding metrics, building simple reports, and adding to existing reports.
An example of a task performed was dealing with data integrity issues. Data integrity problems were a constant throughout the entire project, and consequently users would question the validity of the reports. Each discrepancy was analyzed; a common reason for a discrepancy was the user's understanding of the data, but in many instances a problem in the data also meant a problem with the source data. The source was occasionally corrupted by engineers or system users trying to correct another problem. Corrections would then need to be made to our views and the reports.
Another part of the development process was the agile piece of the Scrum methodology. Executives requested custom reports or projects which took precedence over the current tasks, resulting in reports to the customer and adjustments to our Scrum timelines. In many cases, I had to negotiate with the executive and ask them to wait for our sprint to end so we could complete our current tasks. One paradigm of Scrum is to eliminate the disturbances created by changing projects.
Delivering products was crucial to the process. Reworking a project takes a lot of
time and effort. By delivering a project in small pieces, frequent client input could be
considered quickly and changes made before the requested adjustments became too
difficult to apply during the data architecting and design phase. Data experts were
also consulted frequently to verify that the current schema queries were accurate.
Through these efforts, and after many reworks, the schema became stable and more
reliable.
Additionally, after the product was delivered, relationships of trust were built
with the clients as problems were addressed. The clients were always aware of
emerging problems and could anticipate delays, adjusting their schedules to fit the
needs of the product.
Even with the product’s delivery and constant communication between the team
and the client, we were still obligated to ask for feedback. The feedback was expected
at the end of the cycle and would be used to improve the cycle for future efforts.
4 Data Modeling and Architecting provide Ad-hoc reporting
In this section the application of methods toward warehousing data in user-friendly
formats is presented. Because each of the five most common data source types
required a distinct approach and attention to different details, this chapter is separated
into five subsections corresponding to those five major input data source types. In the
following subsections each database is described along with the process of
incorporating it into the warehouse. The sources were Clarify/Amdocs, the
Family History Missionary Profile system, the Kanisa knowledge document
management system, Omniture web reporting analytics, and the LANDesk systems
reporting server as it applies to our worldwide infrastructure.
4.1 Clarify/Amdocs
The Clarify/Amdocs (http://www.amdocs.com/Pages/HomePage.aspx) data source is
built on an Oracle 11g database and is called the Case Management System (CMS).
This system is used for patrons calling into the FamilySearch support centers. Clarify
is a legacy name for the current Amdocs system. Clarify still had hooks in many of the
Amdocs components even though, at the time, Clarify was being upgraded and
phased out.
Amdocs is TTS's agent case management tool. The tool was bought under the
assumption that it had the desired reporting capabilities and could handle large case
loads. However, after extensive use and testing, the Amdocs tool lacked many of the
reporting capabilities the business needed to make effective decisions. The FHD
engineers nevertheless knew all of the report metrics were available within the Amdocs
database; in fact, the database held more information than was anticipated. Because
of the reporting deficiencies within Amdocs, the Amdocs database was in the process
of being upgraded, so the current legacy warehouse had to be imported into the new
data warehouse. The existing reports were built in SQL upon an older Oracle 9i
database, which restricted the use of many functions available in newer database
versions.
All the existing reports were converted by working through the more complex reports
and analyzing each report's objectives. Then, by modeling the reports within a
simple time frame, they could be matched against the current legacy report. The next
step was to utilize the database's 'explain plan' to determine which query would be the
cheaper to implement. "A statement's execution plan is the sequence of operations
Oracle performs to run the statement and obtain results" (Oracle.com). Significantly, an
explain/cost plan is similar across databases such as MySQL, PostgreSQL, and MS
SQL. The cost is a numeric representation of the sequence of operations Oracle and
other databases perform in order to complete the query. The process of reading
existing examples, researching how and what they were doing, and implementing
improved and optimized versions taught me which queries to research and why
one SQL query might work better than another.
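As a hedged illustration of how query costs were compared (the view name is the hypothetical one from the earlier sketch, and other engines expose the same idea through their own EXPLAIN commands):

    EXPLAIN PLAN FOR
      SELECT case_status, COUNT(*)
      FROM   case_snapshot_mv
      GROUP  BY case_status;

    -- DBMS_XPLAN prints the chosen operations and their estimated cost.
    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);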
Next, Business Objects Data Services (BODS) was used to do the majority of the
complex ETL operations that did not need custom scripts. Although BODS was not
designed to be a data transfer tool and was considerably slow at it, TTS
utilized BODS for that purpose in many cases. BODS also had a comprehensive
set of built-in ETL operations designed to ease the burden of data validation and
integration.
Importing Amdocs data to the warehouse required its records to be
manipulated and transferred to another database for further manipulation. The records
were sent to a MySQL server and transferred back after certain manipulations were
performed: Oracle 9i, on which Amdocs was based, did not have complex
regular expressions built into the engine while MySQL 5.0 did. As a result
of these requirements, I had to learn how BODS controlled its connections and
manipulations of data records. Ultimately, BODS was used to do the majority of ETL
and data transfers. At this phase of the project, I knew I would be integrating
more databases into the warehouse, so I leveraged the strengths of BODS so I could
concentrate on learning how to architect a warehouse, development techniques for
integrating data records, and better data cleansing techniques.
The integration of the Amdocs data records required interaction with three
different databases: Oracle, MySQL, and MS SQL. Although the TTS Division was
primarily Oracle oriented and usually stayed current on Oracle appliances, the Oracle
databases were of the older version 9i. Oracle 9i only supported simple 'wild card' data
matching and did not support complex regular expressions; Oracle began to support
regular expressions as of version 10g (Goyvaerts, 2010). Oracle 9i therefore required
that the Amdocs data be transferred to a MySQL database to utilize the MySQL server's
regular expression functions. The data would then be transferred back to the data
warehouse in its cleansed form. After our data warehouse was upgraded to Oracle
version 11g, transferring data to the MySQL server was no longer necessary and a
direct database link was established between the two databases.
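To illustrate the gap, assuming a hypothetical staging table contact_stage with a raw phone column, Oracle 9i offered only LIKE-style wildcards while MySQL 5.0 (and Oracle 10g and later, through REGEXP_LIKE) could validate a full pattern:

    -- Oracle 9i: simple wildcard matching only.
    SELECT * FROM contact_stage WHERE phone_raw LIKE '801-%';

    -- MySQL 5.0: full regular-expression matching.
    SELECT * FROM contact_stage WHERE phone_raw REGEXP '^[0-9]{3}-[0-9]{3}-[0-9]{4}$';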
Many of the existing reports related to the Amdocs system also required
statistical data from our LANDesk database servers, which were MS SQL. Database
links from the LANDesk database server were integrated into the Oracle data
warehouse. Once the database links were established, simple data manipulations
could be performed before the data even reached the warehouse. The accepted
reporting tool used in conjunction with the new Oracle 11g was Crystal Reports, a
Business Objects tool used to create reports from multiple systems.
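The following is only a sketch of such a link, assuming a heterogeneous gateway entry (here called LANDESK_GW) has already been configured for the MS SQL server; the credentials and table name are placeholders:

    CREATE DATABASE LINK landesk_link
      CONNECT TO "report_user" IDENTIFIED BY "placeholder_password"
      USING 'LANDESK_GW';

    -- Remote LANDesk tables can then be queried from the warehouse side.
    SELECT COUNT(*) FROM "Computer"@landesk_link;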
The reports that needed to be corrected were SQL based, and Crystal Reports
can be configured to use ODBC connections to individual servers and sources.
However, allowing users access to multiple systems would not have been in alignment
with the objectives of this project. We addressed this by restricting access to only the
data warehouse we were building. Funneling data to one site allowed me to control
what data end-users saw and how they saw it, and to create uniform reports
across the TTS Division.
As a final step in warehousing the Amdocs database, the newly developed
warehouse had to handle an Amdocs database upgrade. Although the data was
supposed to be unaffected, several critical columns were found to have been cleared
and others consolidated. A vast number of incorrect Crystal reports resulting from the
Amdocs database upgrade forced systematic transformation of all the pre-built
materialized views. Additional views were necessary to compensate for the changes
from the upgrade. Despite some nominal data loss, the majority of data was salvaged
and re-integrated into the TTS Division warehouse.
The following snapshot in Figure 6 is of the Amdocs/Clarify portion of the Warehouse.
Figure 6 Amdocs/Clarify portion of the Warehouse
4.2 Family History Center (FHC) Profile
The FHC Profile database has been an evolving system. It was initially an
MS Access database utilizing an MS Access data entry interface. FHC then moved
from the Access database to an Oracle 10g system utilizing a simple Ruby on Rails
interface, drawing upon an additional data source, the Church Directory Online Listing
(CDOL). Finally, the FHC Profile database system was integrated into a complex
Ruby on Rails web interface drawing upon CDOL, LANDesk information, and a custom
missionary application developed by the FHC engineering teams. FHC Profile had
evolved into a completely custom application by the end of the warehousing project.
The FHC Profile was the second major system integrated into the TTS data
warehouse. The FHC missionary profile system had several problems which needed to
be overcome. First, the profile drew on several data sources, of which all but two had no
data validation. Second, the SQL queries were unnecessarily complex. Finally, the
original implementation was poorly built and made debugging extremely difficult.
While addressing these problems, an additional task was to maintain and write
reports for the Clarify/Amdocs management and user agents; these reports would
become increasingly involved and complex. Ultimately, techniques were developed to
reduce the lines of SQL and eliminate the 'bugs' within the data and queries.
Reports generated by the FHC Profile system were constantly in question due to
data integrity issues. The integrity problems stemmed from the lack of data validation
tools in place during usage. A volunteer group and I were assigned to deal with the
integrity problem. The volunteer group was assigned to build a web interface which
would interact with the warehouse and to stay in constant contact with me while they
were building it. At this point the data was gathered, unnecessary sources were
eliminated, data guards were implemented on the database, and finally new tables
were integrated into the data warehouse.
Commonly encountered sources of data were MS Excel spreadsheets used as
databases for entire projects. The spreadsheets accumulated errors because multiple
users entered data at any time and in any format. The greatest problem in phasing out
the spreadsheets was tracking down all the owners and experts of the data, since
many of the sources had complex macros and functions cross-linking to other sites
and sources.
The Data Services data integration tool was utilized to extract, transform, and
load MS Access and Excel spreadsheets into our warehouse. The extraction and
loading were easy, but the transformation was difficult. The largest MS Access
database had recorded historical data by adding columns to a table as needed. A
pivot operation, which reorients data either from rows to columns or vice versa, was
implemented to turn the many columns into rows and separate the names from the
other data (a sketch follows this paragraph). Joins within the MS Access data yielded
duplication, so all the data was extracted and placed into a separate spreadsheet and
an external database to align those fields with the warehouse standards. Dates,
countries, and addresses had to be run through the Data Services address libraries.
Two major corruptions occurred, in the Family History centers' statistical data and in
the employee schedules. The statistical data contained answers like yes, 1, 0, no, not
passed, etc.; a distinct list of all the possible values was extracted, the corrupted
answers were cleansed out, and they were replaced appropriately. The schedules
needed their numbers corrected and then reprocessed to count the hours that the
people listed on the spreadsheet and Access database worked or did not work.
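The pivot step mentioned above can be sketched as follows, assuming a hypothetical staging table fhc_stats_stage in which each reporting period had been added as its own column; Oracle 11g's UNPIVOT turns those columns back into rows:

    SELECT center_id, report_month, visits
    FROM   fhc_stats_stage
    UNPIVOT (visits FOR report_month IN (jan_2010 AS '2010-01',
                                         feb_2010 AS '2010-02',
                                         mar_2010 AS '2010-03'));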
The data sources needed to be further prepared by adding primary keys, foreign
key relationships, indexes, and data type checks. Assigning primary/foreign keys
prevented data duplication, indexes ensured query speed, and data type checks
guaranteed proper formatting in many cases.
By placing primary keys, foreign key relationships, and referential integrity
checks, we were able to control data changes, including deletion of data
(Foreign Key Constraints, 2012). Indexing data columns allowed the warehouse to
query the data much more quickly and to handle more complex queries if necessary.
Enabling data type checks forced the warehouse to conserve memory and ensured
proper data formatting.
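A small, hypothetical example of these guards is shown below; the table and column names are illustrative only:

    CREATE TABLE fhc_center (
      center_id  NUMBER PRIMARY KEY,
      country_cd VARCHAR2(2) NOT NULL CHECK (LENGTH(country_cd) = 2)
    );

    CREATE TABLE fhc_schedule (
      schedule_id NUMBER PRIMARY KEY,
      center_id   NUMBER NOT NULL REFERENCES fhc_center (center_id),  -- foreign key blocks orphan rows
      hours_open  NUMBER(4,1) CHECK (hours_open BETWEEN 0 AND 168)    -- data type / range check
    );

    CREATE INDEX fhc_schedule_center_ix ON fhc_schedule (center_id);  -- speeds joins and lookups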
The data coming from CDOL was produced by existing SQL queries that were
extremely inefficient, so a method had to be designed to optimize them. The queries
were stored in procedures and sometimes just written in custom user scripts. First, a
schema diagram was acquired for the source database. We then researched how to
rewrite the queries so as to greatly reduce the execution costs reported by the explain
plan tool provided by the database engines. Further research was done to
better utilize sub-queries, materialized views, temporary tables, and SQL features such
as "group by" and finding the maximum/minimum values in fields.
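As one small, hypothetical example of the kind of rewrite involved, a per-unit latest-record lookup that had been expressed with a correlated sub-query can often be reduced to a single GROUP BY:

    SELECT unit_id, MAX(effective_date) AS latest_effective_date
    FROM   cdol_unit_stage     -- hypothetical staging table for the CDOL extract
    GROUP  BY unit_id;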
Understanding how the database engines execute queries, and the cost of
placing a process in one spot rather than another, became highly important to the
performance of the overall warehouse. As a result, SQL queries were systematically
reduced in size and their results compared against the existing queries. Many original
queries had flaws in counts, groupings, and sub-queries which were creating
duplication and unintended data elimination.
By the conclusion of this stage of the project we had learned how to reduce
the overall lines of code. We reduced the number of lines of code by 50 percent, and
data guards were implemented such as data validations, key constraints, and
applications which guard against errors. The improved error handling meant the
volunteer group no longer needed to spend time each week removing errors, users no
longer needed to insert data into multiple locations, and report writing no longer had to
be done by a developer and could be shifted to an analyst.
The following is a graph of the Family History Center (FHC) Profile portion of the
Warehouse.
4.3 Kanisa
The Kanisa database (http://crm.consona.com/software/products/knowledge-management.aspx)
was built on an Oracle 9i DB and is a knowledge management system custom
configured to monitor usage of the different knowledge documents. Server logs, which
held a plethora of data on system usage that the Oracle DB was not capturing in its
tables, were also utilized for the data warehouse.
Kanisa was the third major system successfully integrated into our data
warehouse, illustrated by Figure 7. Kanisa is a knowledge document management
system (KDMS) used to support all the patrons of the FamilySearch.com
research sites and software. Kanisa's data helped FamilySearch.com manage and
improve the self-help documentation and reduce support personnel costs.
There were four significant issues that had to be resolved with the Kanisa data.
First, the primary data was constantly being changed by users, which made capturing
historical statistics difficult if not impossible as time progressed; we had to figure out
how and what to capture. Second, the database was not capturing all the necessary
data points we wanted, so a plan was formulated to parse the cache of log files
which held the data we wanted and to make the warehousing process automated.
Third, the cache of logs contained duplicate data that interfered with reports and
other metrics, so a procedure was created to detect when the log cache contained a
duplicate and eliminate the extra data. Finally, we needed to reassess the kinds of
questions the warehouse could answer with all the new data available to the warehouse
users.
Note that there were nine different file types of interest. The files held data
that provided different perspectives on the system, giving us insight
into the user's experience. The data would be extracted from these file types.
To solve the export and import issues, a Ruby script had to be created,
and the Data Services scripting language was used to execute the Ruby script on the
remote system. Data Services then had to be notified when the script was done.
The challenge in creating the script was optimizing the Ruby so the script would be able
to extract thousands of files and append each onto one of the nine master files.
Ruby 1.8.7 does not provide true multi-threading, meaning each thread cannot run on
its own processor core (Mittag, 2008). We discovered that with Ruby 1.8.7's
'green' threads the system would take almost a day and a half to go through just
one of the nine files. Consequently, we switched to Ruby 1.9.2, which supported
concurrency (Ruby, Concurrency, and You, 2011). Though Ruby 1.9.2 does not support
'true parallelism', the concurrency did improve the script's performance. The script
would first extract a row and then detect how many commas were in it.
If the row had too many or too few, the script then checked whether one of
the fields was missing or contained another set of data. Often a column would
contain XML, so the script had to detect the beginning and end of the XML
and replace the commas with another character that did not occur in that
master file; and in the event a comma was missing, the script had to detect that
and insert the comma in the appropriate place.
By definition, a data warehouse must not be volatile or changing. However,
due to the nature of the source, its contents were constantly changing, so we
needed to query the changes continually and document when they happened.
Several materialized views were created, consisting of snapshots of the data at points
in time, to retain data twice a day. Business Objects Data Services was scheduled to run
an Extraction, Transformation, and Load against the Kanisa system to retain valuable
metrics about usage. An interestingly difficult problem to solve was a legacy data
type originating from the Kanisa database: one of the database columns was of type
"long" (Oracle Datatypes Data types for oracle 8 to Oracle 11g, 2012). A LONG is akin
to a binary type for a standard file system; it is neither an integer nor just a character.
LONGs in Oracle cannot be directly queried once extracted, and they cannot be indexed.
We had to extract the column along with the primary key, identifying its place in the
export to the warehouse, and convert it to a Character Large Object (CLOB). A CLOB
then allowed indexing and direct querying, and to optimize for performance we included
the exported data in our materialized view.
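A sketch of the LONG-to-CLOB conversion, with hypothetical table and column names, is shown below; Oracle's TO_LOB function performs the conversion during the copy:

    -- Copy the LONG column into a CLOB alongside its primary key.
    CREATE TABLE kanisa_doc_clob AS
      SELECT doc_id, TO_LOB(doc_text) AS doc_text
      FROM   kanisa_doc;

    -- The CLOB copy can now be searched directly and fed into the materialized views.
    SELECT doc_id
    FROM   kanisa_doc_clob
    WHERE  DBMS_LOB.INSTR(doc_text, 'FamilySearch') > 0;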
Taking the cache of logs and parsing through them was far more difficult than
anticipated and prevented the use of the databases' or BODS's default import tools.
Oracle's flat file extraction tools were tried along with the Data Services data import
functions, but we discovered that many of the columns contained several types of data:
one column would contain the query string the user used to access the system, another
column would retain the XML output from the data server, and another column in the
same file would contain CSV-formatted output from the log files.
Finally, after all the scripts were written, the data was imported, and the business
questions had been defined, the Kanisa data was loaded into our warehouse,
producing ad-hoc reports generated by the user. Along the way many optimization
techniques in Ruby and Data Services were used. Data warehouse data guards were
put in place to prevent information from being corrupted. One of the new guards
was a way to detect historical data duplication through Data Services: Data Services
fed in the current warehouse records and compared them to the information being
passed in from the logs and source databases, using the session identification numbers
and event timestamps carried by the incoming records.
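A hedged sketch of that duplicate guard follows; the staging and fact table names are hypothetical, and a log row is skipped when the warehouse already holds its session id and event timestamp:

    INSERT INTO kanisa_event_fact (session_id, event_ts, event_type)
    SELECT s.session_id, s.event_ts, s.event_type
    FROM   kanisa_log_stage s
    WHERE  NOT EXISTS (
             SELECT 1
             FROM   kanisa_event_fact f
             WHERE  f.session_id = s.session_id
             AND    f.event_ts   = s.event_ts
           );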
The following is a figure of the Kanisa portion of the Warehouse.
Figure 7 Kanisa Portion of the Warehouse
4.4 Omniture web services
Omniture (http://www.omniture.com/en/) is commonly used website analytics
software which was chosen to be integrated through a Business Objects application,
Data Services.
Omniture is similar to Google Analytics and serves as FamilySearch's primary
tool for observing activity on our many sites. The analytics tool has a set of statistics it
can track out-of-the-box; it is also good to note that one of Omniture's more powerful
attributes is the ability to track custom fields which can be built into the tracking code.
Omniture was the fourth major source successfully integrated
into our data warehouse. In order to integrate the data from Omniture the import
process had to be automated, which created new challenges. Three major obstacles
had to be overcome: the web service documentation was poorly written and in many
cases missing, the WSDL was in a non-standard format which BODS could not
understand, and finally we had to figure out how to connect the Omniture data to the
existing data warehouse data.
Interfacing BODS to Omniture was significant at the time because no one had yet
been able to do it. The WSDL, as written, did not conform to the
W3C standards (W3C, 2001) which BODS needed in order to automate the connection
and import. As a result, a custom interface to Omniture had to be written which BODS
could use to extract the queried data into our data warehouse.
Java was chosen as the custom interface language because there were existing
examples of the needed connection. The connection had to be generic enough that
the script could be altered to allow for other queries without the user needing knowledge
of Java programming. The queries needed to answer questions like, "give me all users
from China who entered the site from March 2, 2002 to March 4, 2002, along with each
user's computer type, version, and web browser; and finally, compare the results to the
current systems we have at the FamilySearch centers." Further, I had to custom craft
JavaScript Object Notation (JSON, a lightweight data-interchange format) queries
inside the Java calls so the extraction could be further automated. In order to create a
model for Omniture, I had to research and apply a concept called database polymorphic
associations. By combining all these techniques, truly ad-hoc reporting
cross-joined with Amdocs, Kanisa, LANDesk, and FHC Profile would be achieved.
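A minimal sketch of a polymorphic association as used here, with hypothetical names, is one activity table that points at rows in several source systems through a type column plus an id column:

    CREATE TABLE user_activity (
      activity_id NUMBER PRIMARY KEY,
      source_type VARCHAR2(20) NOT NULL
        CHECK (source_type IN ('AMDOCS','KANISA','OMNITURE','LANDESK','PROFILE')),
      source_id   VARCHAR2(100) NOT NULL,   -- key of the row in the named source
      activity_ts TIMESTAMP NOT NULL
    );

    -- Joining back to one source filters on the type first.
    SELECT a.activity_id, d.title
    FROM   user_activity a
    JOIN   kanisa_document d
      ON   a.source_type = 'KANISA'
     AND   a.source_id   = TO_CHAR(d.doc_id);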
One last hurdle to overcome was the extremely poor and incomplete
documentation for the web services. Omniture support services were contacted
constantly for clarification; because the documentation was incomplete, a great deal of
experimentation was required to create the custom queries.
4.5 LANDesk
LDS FHD LANDesk (http://www.landesk.com) is built on an MS SQL
database (DB) server which monitors all the computer systems throughout the
corporation worldwide. It reports the health of all the systems based on the system
hardware. Data was stored in the Windows registry, where the servers worldwide
could query the systems' registries and then deposit the data into one central location.
“LANDesk Software provides systems management, security management,
service management, asset management, and process management solutions to
organizations. It is one of the oldest companies providing this type of product.”
(LANDesk, 2012). We use LANDesk on all of our FamilySearch center computers.
Each system constantly tracks both hardware and software usage.
LANDesk captures statistics on the state of the hardware and gives us
insight into which centers need more or newer computer systems.
The fifth major data source successfully integrated was connecting LANDesk
records into our data warehouse, illustrated by Figure 8. At the time we had difficulty
acquiring an up-to-date database schema because of how the data was being stored.
Later the database was upgraded and the fields were better defined, but until then we
had to deal with un-modeled data. Un-modeled data fields were data points extracted
from all the custom registry fields inserted into the computers around the world. The
data types had IDs assigned to them by the LANDesk servers, but we needed to
experiment to figure out which IDs belonged to which description. Four types of
problems were encountered: there were multiple data sources; no current database
schemas were available; historical counts were not being captured properly; and,
finally, a data point to link all the other systems to was difficult to find and, once found,
needed complex SQL to use.
The LANDesk project needed to gather its records from two different sources:
the LANDesk servers and Sophos, an anti-virus and firewall software. Sophos
was included in the data warehouse integration project because it was related to
the health of the computer systems housed in the family history centers around the
world. We first needed to figure out how to link LANDesk computers to the computers
Sophos was installed on. Sophos provided insight into how many intrusions were
detected, how many times sites were accessed, and how current the anti-virus system was.
At first we did not know how to join Sophos and LANDesk computers together, and
the join was not obvious. We figured out that LANDesk could query the type of anti-virus
software installed and then use its serial number as the joining point to the LANDesk
system. Once the joins were set, we needed to utilize Profile to connect all the LANDesk
computers to specific FamilySearch centers and libraries around the world. Linking all
the systems together allowed us to get statistics from the country level all the way down
to internal organizations.
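The join described above can be sketched as follows, with hypothetical table and column names; LANDesk reports the installed anti-virus serial number, which is matched against the Sophos records and then tied to a center through Profile:

    SELECT ld.computer_idn, so.last_scan_date, fp.center_name
    FROM   landesk_computer ld
    JOIN   sophos_endpoint  so ON so.av_serial_number = ld.av_serial_number
    JOIN   fhc_profile      fp ON fp.computer_idn     = ld.computer_idn;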
After we had solved all the joins up to this point, we next had to figure out
the database schema within the Sophos antivirus and firewall database. The schema
tracking was done by slow and steady querying of the tables and by joining them
together to produce a schema diagram which could be used to join LANDesk and
Sophos.
Then the SQL used to capture historical data was addressed, and all the
existing queries were translated into optimized queries from MS SQL to Oracle. In
many cases the SQL was rewritten to be faster and shorter. Furthermore, the captured
records could potentially grow very large: we had approximately 55,000 computers
around the world and needed to keep complete statistical records of all the systems
and surrounding networks. Because not all the systems would change every time, an
SQL query was created which would analyze when a system's statistics changed and
update only those which did. To update the records, we leveraged BODS to accomplish
the task. The difficult part was rewriting all the original SQL to improve the database
explain plan costs. A sample piece of code is displayed in the appendix to illustrate
the complexity of this process.
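A hedged sketch of updating only the changed systems is shown below, assuming hypothetical staging (landesk_stats_stage) and history (landesk_stats_hist) tables; rows are rewritten only when a statistic actually differs:

    MERGE INTO landesk_stats_hist h
    USING landesk_stats_stage s
       ON (h.computer_idn = s.computer_idn)
    WHEN MATCHED THEN UPDATE SET
           h.hw_last_scan = s.hw_last_scan,
           h.sw_last_scan = s.sw_last_scan
      WHERE h.hw_last_scan <> s.hw_last_scan
         OR h.sw_last_scan <> s.sw_last_scan
    WHEN NOT MATCHED THEN INSERT (computer_idn, hw_last_scan, sw_last_scan)
      VALUES (s.computer_idn, s.hw_last_scan, s.sw_last_scan);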
As a result, historical reports and trends of our existing FamilySearch centers
were created. Redundant SQL for historical records was removed and the performance
of the reporting system was greatly improved. Most importantly, reports were created
that joined to our four other sources, which enhanced our understanding of our
systems.
The following is a figure of the LANDesk portion of the Warehouse.
Figure 8 LANDesk portion of the Warehouse
5 Operation of the Warehouse
5.1 Now vs. Before
Previously, users were unable to gain access to reports, production systems'
performance was compromised, and report results were conflicting. After a concerted
effort, the engineers, the TTS team members, and I were able to accomplish all
of the goals set forth in this project. All five data sources were unified into one
warehouse and an interface was provided where any user could create custom,
accurate reports.
5.2 User satisfaction
The warehouse has lasted two years and has undergone improvements from
other engineering teams to include new sources. However, the core warehouse is still
intact and in use today. The system has added value to the LDS Center by providing
reports to business members at all levels.
A survey was distributed to the users of the system and many of the responses
were similar. James Ison, a manager of the Family History Department, was asked,
"What aspect of the reporting portal (data warehouse) was most beneficial?" His
response was "Church-wide insight into use of the New.FamilySearch system via the
Area Adviser report," which reflected the value of the data warehouse in providing new
reports.
One of the major report writers, David Lifferth, found added value when he
responded to the question, “What aspect of the reporting portal (Data warehouse) was
most beneficial?” His answer was, “Drag-and-Drop simplicity in creating new, ad-hoc
reports.”
A major business analyst, David Armond Acree, answered the question, "Did you
see cost savings in using the reporting portal (data warehouse)?" by responding, "Yes,
we saw a savings of 5 hours per week * ($30 per hr estimated) * (48 weeks per year) =
$7200 per year."
These are but a few of the users who benefited from the data warehouse.
The users' needs were wide and varied, but each saw a benefit from the
project and was able to improve the business with it.
6 Conclusions
The FHD set out to identify the major data records; align, clean, and standardize
the data; and unify the data records into one warehouse which could be used as a tool
enabling a user to act and make changes. To accomplish the project, proper techniques
in data modeling, architecting, and data warehousing had to be understood and
implemented. A warehouse had to be built from beginning to end, and proper
standards had to be established to accommodate future data sources.
The scrum methodology played a key role in user satisfaction by enhancing the
user experience from beginning to end. Further, scrum improved the overall productivity
of all the team members by encouraging an open environment and reducing costs
across the board.
The results of the project can be seen in the numerous hours saved, the reports built
from the warehouse, and the hundreds of thousands of dollars saved in upgrade and
systems costs. In the appendix, example screen shots of reports made and a few
samples of the scripts necessary to undertake the project are provided.
7 Future Work
FamilySearch's user experience data warehouse has so far taken in only five
different sources of user experience data. To achieve the goals of the business
executives, many other sources need to be integrated into the data warehouse. Further,
to conform more closely to the corporation's Information Communication Systems
standards, table names and column names need to be aligned with the business's
standards. Finally, as the warehouse grows, further system optimizations will be
required.
APPENDIX A – SQL SCRIPTS
AMDOCS/CLARIFY
--###############################################
--###############################################
--Clarify_case_mv_dm
SELECT c.objid, c.creation_time, c.id_number as "CASE_ID", c.title AS "CASE_TITLE",
       c.x_lang AS "LANGUAGE", c.x_routing, h.title AS "CATEGORY1", h1.title as "CATEGORY2",
       h2.title as "CATEGORY3",
--     h3.title as "CATEGORY4",
       con.title as "CASE_CONDITION", st.title as "CASE_STATUS", q.title as "CASE_QUEUE"
FROM table_case c, table_hgbst_elm h, table_hgbst_elm h1, table_hgbst_elm h2,
--   table_hgbst_elm h3,
     table_condition con, table_gbst_elm st, table_queue q
WHERE c.CASE_LVL12HGBST_ELM = h.objid(+)
  AND c.CASE_LVL12HGBST_ELM = h1.objid(+)
  AND c.CASE_LVL12HGBST_ELM = h2.objid(+)
--  AND c.x_case_type42hgbst_elm = h3.objid(+)
  AND c.CASE_CURRQ2QUEUE = q.objid(+)
  AND c.case_state2condition = con.objid(+)
  AND c.casests2gbst_elm = st.objid(+)
--###############################################
--###############################################
--clarify_email_response_mv_dm
SELECT a1.objid, a1.act_entry2case, a1.title, a1.entry_time,
       MIN( CASE WHEN a2.entry_time > a1.entry_time THEN a2.entry_time END ) AS "RESPONSE_DATE",
       ROUND( ( MIN( CASE WHEN a2.entry_time > a1.entry_time THEN a2.entry_time ELSE SYSDATE END ) - a1.entry_time) * 24, 3) AS "EMAIL_SLA"
FROM (SELECT a.objid, eb.title, a.act_code, a.ACT_ENTRY2CASE, a.entry_time, a.addnl_info
      FROM table_act_entry a, table_gbst_elm eb
      WHERE a.act_code = eb.rank AND a.ACT_ENTRY2CASE IS NOT NULL) a1, --clarify_act_all_mv_dm
     (SELECT a.objid, eb.title, a.act_code, a.ACT_ENTRY2CASE, a.entry_time, a.addnl_info
      FROM table_act_entry a, table_gbst_elm eb
      WHERE a.act_code = eb.rank AND a.ACT_ENTRY2CASE IS NOT NULL) a2
WHERE a1.act_entry2case = a2.act_entry2case
  AND a1.ACT_CODE = '3500'
  AND a2.ACT_CODE IN ('500','1700','200','3400')
GROUP BY a1.objid, a1.act_entry2case, a1.title, a1.entry_time
--###############################################
--###############################################
--###############################################
-- clarify_act_all_mv_dm
SELECT a.objid, eb.title, a.act_code, a.ACT_ENTRY2CASE, a.entry_time, a.addnl_info
FROM table_act_entry a, table_gbst_elm eb
WHERE a.act_code = eb.rank
  AND a.ACT_ENTRY2CASE IS NOT NULL;
--###############################################
--###############################################
--###############################################
-- clarify_user_mv_dm
SELECT u.objid, u.login_name, e.first_name, e.last_name, c.NAME AS COUNTRY_NAME,
       e2.first_name AS MANAGER_FNAME, e2.last_name AS MANAGER_LNAME, u2.login_name AS MANAGER_LOGIN,
       u3.login_name AS TOP_MANAGER, hb.title AS WORKGROUP, hb2.title AS TOP_WORKGROUP
FROM table_user u, table_employee e, table_employee e2, table_employee e3, table_site s,
     table_address a, table_country c, table_user u2, table_user u3, table_hgbst_elm hb, table_hgbst_elm hb2
WHERE u.objid = e.employee2user
  AND e.supp_person_off2site = s.objid
  AND s.cust_primaddr2address = a.objid
  AND a.address2country = c.objid
  AND e.emp_supvr2employee = e2.objid(+)
  AND e2.work_group = hb.ref_id(+)
  AND u2.objid = e2.employee2user
  AND e2.emp_supvr2employee = e3.objid(+)
  AND e3.work_group = hb2.ref_id(+)
  AND u3.objid(+) = e3.employee2user
--###############################################
--###############################################
--###############################################
-- CLARIFY_FACT_MV
SELECT c.objid AS "CASE_OBJID", c.case_reporter2site, c.case_reporter2contact, c.case_owner2user,
       COUNT ( CASE WHEN a.ACT_CODE = '3500' THEN e.objid END ) AS "EMAIL_IN",
       COUNT ( CASE WHEN a.ACT_CODE = '3400' THEN e.objid END) AS "EMAIL_OUT",
       (CASE WHEN con.condition <> 4 THEN ((sysdate - c.creation_time)*24*60*60)
             ELSE ((cc.close_date - c.creation_time)*24*60*60) END) AS "CASE_SEC"
FROM table_case c, table_act_entry a, table_email_log e,
     ( SELECT last_close2case, MAX(close_date) AS "CLOSE_DATE" FROM table_close_case GROUP BY last_close2case ) cc,
     table_condition con
WHERE c.objid = a.act_entry2case(+)
  AND a.act_entry2email_log = e.objid(+)
  AND cc.last_close2case(+) = c.OBJID
  AND c.case_state2condition(+) = con.objid
GROUP BY c.objid, c.case_reporter2site, c.case_reporter2contact, c.case_owner2user,
         ( case when con.condition <> 4 then ((sysdate - c.creation_time)*24*60*60)
                else ((cc.close_date - c.creation_time)*24*60*60) end )
--###############################################
LANDesk
USE [DTM_FCH_9]
GO
-- ##########################################################################################
-- Landesk configuration DIM
-- ##########################################################################################
SELECT
ISNULL( CAST(cd.FHCIDNum AS INT), 0) fhcidnum, c.computer_idn, ISNULL( c.HWLastScanDate, CAST('1/1/1830' AS DATETIME)) HWLastScanDate, CAST( CONVERT( VARCHAR( 8 ), ISNULL(c.HWLastScanDate, '1/1/1830'), 112) AS INTEGER) HWLastScanDate_key, ISNULL(c.LastUpdInvSvr, CAST('1/1/1830' AS DATETIME)) LastUpdInvSvr, CAST( CONVERT( VARCHAR( 8 ), ISNULL(c.HWLastScanDate, '1/1/1830'), 112) AS INTEGER) LastUpdInvSvr_key, ISNULL(C.SecurityLastScanDate, CAST('1/1/1830' AS DATETIME)) SecurityLastScanDate, CAST( CONVERT( VARCHAR( 8 ), ISNULL(c.SecurityLastScanDate, '1/1/1830'), 112) AS INTEGER)
SecurityLastScanDate_key, ISNULL(C.SWLastScanDate, CAST('1/1/1830' AS DATETIME)) SWLastScanDate, CAST( CONVERT( VARCHAR( 8 ), ISNULL(c.SecurityLastScanDate, '1/1/1830'), 112) AS INTEGER)
SWLastScanDate_key, ISNULL(CAST(fh.DPCustomCfg_Date AS DATETIME), CAST('1/1/1830' AS DATETIME)) DPCustomCfg_Date, CAST(CONVERT( VARCHAR( 8 ), CAST(ISNULL(fh.DPCustomCfg_Date, '1/1/1830' ) AS DATETIME), 112) AS
INTEGER) DPCustomCfgDate_key, (CASE WHEN localsch_version > 98 OR localsch_version < 7 THEN 1 ELSE 0 END) dpcustomcfg_out_of_date, ISNULL( fh.LDSReconnectVer, '0.0.0.0') LDSReconnectVer, ISNULL( fh.Localsch_Version, -1) Localsch_Version, ISNULL( fh.Policy_Ran, 18300101000000 ) LANDesk_Policy_Checkin, ISNULL( fh.Sophos_Primary, 'http://www.example.com/') Sophos_Primary, ISNULL( fh.Sophos_Secondary, 'http://www.example.com/') Sophos_Secondary, ISNULL( fh.Version_Installed, '0.0.0.0') Version_Installed, ld.FileDate LDAPPL3_File_Date, nt.Language OS_Language, nt.MUILang OS_MUILanguage, ISNULL( pm.AUTmonVer, 0.0) AUTmonVer, ISNULL( pm.GMT_Offset, -99) GMT_Offset, ISNULL( pm.nFSmonVer, '0.0') nFSmonVer, Case
WHEN umd.DATASTRING IS NULL THEN 'Null' WHEN umd.DATASTRING LIKE '5.[0-9.]%' THEN '5.x' WHEN umd.DATASTRING LIKE '6.[0-9.]%' THEN '6.x' WHEN umd.DATASTRING LIKE '7.[0-9.]%' THEN '7.x' WHEN umd.DATASTRING LIKE '8.[0-9.]%' THEN '8.x' WHEN umd.DATASTRING LIKE '9.[0-9.]%' THEN '9.x' ELSE 'Unknown' END ie_version_grp
, umd.DATASTRING ie_versionFROM
dbo.Computer cLEFT OUTER JOIN dbo.CDF cd ON c.computer_idn = cd.computer_idnLEFT OUTER JOIN dbo.Family_History fh ON c.computer_idn = fh.computer_idnLEFT OUTER JOIN dbo.LanDesk ld ON c.computer_idn = ld.computer_idnLEFT OUTER JOIN dbo.osnt nt ON c.computer_idn = nt.computer_idnLEFT OUTER JOIN dbo.pem pm ON c.computer_idn = pm.computer_idn
LEFT OUTER JOIN dbo.UNMODELEDDATA umd ON c.computer_idn = umd.computer_idn AND umd.METAOBJATTRRELATIONS_IDN = 1799ORDER BY
c.computer_idn;
USE [DTM_FCH_9]
GO
-- ##########################################################################################
-- Landesk configuration FACT
-- ood is out of date
-- ##########################################################################################
SELECT
ISNULL( CAST(cd.FHCIDNum AS INT), 0) fhcidnum, COUNT( DISTINCT c.computer_idn ) NUM_OF_COMPUTER, COUNT( CASE WHEN ISNULL( c.HWLastScanDate, CAST('1/1/1830' AS DATETIME)) < GETDATE()-21 THEN 1 ELSE
NULL END) HWLastScan_ood -- gt 21 days, COUNT( CASE WHEN ISNULL( c.LastUpdInvSvr, CAST('1/1/1830' AS DATETIME)) < GETDATE()-21 THEN 1 ELSE
NULL END) LastUpdInvSvr_ood -- gt 21 days, COUNT( CASE WHEN ISNULL( c.SecurityLastScanDate, CAST( '1/1/1830' AS DATETIME ) ) < GETDATE( ) - 30 THEN
1 ELSE NULL END ) SecurityLastScanDate_ood -- gt 30 days, COUNT( CASE WHEN ISNULL( c.SWLastScanDate, CAST( '1/1/1830' AS DATETIME ) ) < GETDATE( ) - 30 THEN 1
ELSE NULL END ) SWLastScanDate_ood -- gt 30 days, COUNT( CASE WHEN ISNULL( fh.DPCustomCfg_Date, CAST( '1/1/1830' AS DATETIME ) ) < GETDATE( ) - 21 THEN
1 ELSE NULL END ) DPCustomCfg_ood -- gt 30 days, COUNT( CASE WHEN fh.localsch_version > 98 OR fh.localsch_version < 7 THEN 1 ELSE NULL END )
cnt_locsch_ver_out_of_date, COUNT( CASE WHEN ISNULL( fh.LDSReconnectVer, '0.0.0.0') LIKE '0.[0-9.]%' THEN 1 ELSE NULL END )
LDSReconnectVer_0x, COUNT( CASE WHEN ISNULL( fh.LDSReconnectVer, '0.0.0.0') LIKE '1.[0-9.]%' THEN 1 ELSE NULL END )
LDSReconnectVer_1x, COUNT( CASE WHEN ISNULL( fh.LDSReconnectVer, '0.0.0.0') LIKE '2.[0-9.]%' THEN 1 ELSE NULL END )
LDSReconnectVer_2x, COUNT( CASE WHEN ISNULL( fh.LDSReconnectVer, '0.0.0.0') LIKE '3.[0-9.]%' THEN 1 ELSE NULL END )
LDSReconnectVer_3x, COUNT( CASE WHEN ISNULL( fh.LDSReconnectVer, '0.0.0.0') LIKE '4.[0-9.]%' THEN 1 ELSE NULL END )
LDSReconnectVer_4x, COUNT( CASE WHEN ISNULL( fh.LDSReconnectVer, '0.0.0.0') LIKE '5.[0-9.]%' THEN 1 ELSE NULL END )
LDSReconnectVer_5x, COUNT( CASE WHEN CONVERT( DATETIME, CAST( ISNULL( fh.Policy_Ran, '18300101000000' ) AS CHAR( 8 ) ) ) <
GETDATE( ) - 21 THEN 1 ELSE NULL END ) LANDesk_Policy_Checkin_ood --problem lies here, COUNT( CASE WHEN ISNULL( fh.Sophos_Primary, 'http://www.example.com/') NOT LIKE '%ldssr3[de]%' THEN 1
ELSE NULL END ) Sophos_Primary_NC, COUNT( CASE WHEN ISNULL( fh.Sophos_Secondary, 'http://www.example.com/') NOT LIKE 'http://es-
web.sophos.com/update/' THEN 1 ELSE NULL END ) Sophos_Secondary_NC, COUNT( CASE WHEN ISNULL( fh.Version_Installed, '0.0.0.0' ) <> '9.0.1.0' THEN 1 ELSE NULL END )
Version_Installed_NC, COUNT( DISTINCT nt.Language) OS_Language_CNT -- number of diff langs, COUNT( CASE WHEN ISNULL( pm.AUTmonVer, 0.0) LIKE '0.[0-9.]%' THEN 1 ELSE NULL END ) AUTmonVer_0x, COUNT( CASE WHEN ISNULL( pm.AUTmonVer, 0.0) LIKE '1.[0-9.]%' THEN 1 ELSE NULL END ) AUTmonVer_1x, COUNT( CASE WHEN ISNULL( pm.AUTmonVer, 0.0) LIKE '2.[0-9.]%' THEN 1 ELSE NULL END ) AUTmonVer_2x, COUNT( CASE WHEN ISNULL( pm.AUTmonVer, 0.0) LIKE '3.[0-9.]%' THEN 1 ELSE NULL END ) AUTmonVer_3x, COUNT( CASE WHEN ISNULL( pm.AUTmonVer, 0.0) LIKE '4.[0-9.]%' THEN 1 ELSE NULL END ) AUTmonVer_4x, COUNT( CASE WHEN ISNULL( pm.nFSmonVer, 0.0) LIKE '0.[0-9.]%' THEN 1 ELSE NULL END ) nFSmonVer_0x, COUNT( CASE WHEN ISNULL( pm.nFSmonVer, 0.0) LIKE '1.[0-9.]%' THEN 1 ELSE NULL END ) nFSmonVer_1x, COUNT( CASE WHEN ISNULL( pm.nFSmonVer, 0.0) LIKE '2.[0-9.]%' THEN 1 ELSE NULL END ) nFSmonVer_2x, COUNT( CASE WHEN umd.DATASTRING IS NULL THEN 1 ELSE NULL END) IE_NULLS, COUNT( CASE WHEN umd.DATASTRING LIKE '5.[0-9.]%' THEN 1 ELSE NULL END) 'IE_5x', COUNT( CASE WHEN umd.DATASTRING LIKE '6.[0-9.]%' THEN 1 ELSE NULL END) 'IE_6x', COUNT( CASE WHEN umd.DATASTRING LIKE '7.[0-9.]%' THEN 1 ELSE NULL END) 'IE_7x', COUNT( CASE WHEN umd.DATASTRING LIKE '8.[0-9.]%' THEN 1 ELSE NULL END) 'IE_8x', COUNT( CASE WHEN umd.DATASTRING LIKE '9.[0-9.]%' THEN 1 ELSE NULL END) 'IE_9x'
FROM
dbo.Computer cLEFT OUTER JOIN dbo.CDF cd ON c.computer_idn = cd.computer_idnLEFT OUTER JOIN dbo.Family_History fh ON c.computer_idn = fh.computer_idnLEFT OUTER JOIN dbo.LanDesk ld ON c.computer_idn = ld.computer_idnLEFT OUTER JOIN dbo.osnt nt ON c.computer_idn = nt.computer_idnLEFT OUTER JOIN dbo.pem pm ON c.computer_idn = pm.computer_idnLEFT OUTER JOIN dbo.UNMODELEDDATA umd ON c.computer_idn = umd.computer_idn AND
umd.METAOBJATTRRELATIONS_IDN = 1799GROUP BY
ISNULL( CAST(cd.FHCIDNum AS INT), 0)ORDER BY
ISNULL( CAST(cd.FHCIDNum AS INT), 0);
APPENDIX B – SCRIPTS
KANISA
=begin
************************************* NOTES SECTION ********************************************
# KSC_authoring-Production-PASK-009-033-00_00_00-2009_05_28.log
# KSC_case_activity-Production-PASK-009-033-00_00_00-2009_05_28.log
# KSC_response_central-Production-PASK-009-033-00_00_00-2009_05_28.log
# KSS_favorites-Production-PASK-009-033-00_00_00-2009_05_28.log
# KSS_forum-Production-PASK-009-033-00_00_00-2009_05_28.log
# KSS_kc_view-Production-PASK-009-033-00_00_00-2009_05_28.log
# KSS_RAR_events-Production-PASK-009-033-00_00_00-2009_05_28.log
# PLATFORM-Kanisa-Build-1242789327-PASK-009-033-2009_05_28.log
DI will load the master files in by updating the tables.
i.e. if the table has data in it, DI will append to the table.
rules:
- delete old master files
- load only with the new lines
get all the file names in the directory
hash the filenames like the following:
@filenames = {
'file_one' => [[one, date_1], [two, date_2], [three, date_3]],
'file_two' => [[one, date_1], [two, date_2], [three, date_3]],
'file_three' => [[one, date_1], [two, date_2], [three, date_3]]
}
delete all entries that are strictly older than current set group of files
loop through all the remaining files appending them together and updating the logs_status
goto the current file of the set
if cur.mtime != f.mtime
open f
load into an array
goto cur.line
add lines to master file
close file
update log_status
end
=end
require 'yaml'
class XmlStuff
###################################################################################################
attr_accessor :keys, :log_config, :category_f_name_arrays, :file_names
LOG_STATS = 'logs_status.yml'
AUTHORING = 0
CASE_ACTIVITY = 1
FAVORITES = 2
FORUM = 3
KC_VIEW = 4
RAR = 5
PLATFORM_KANISA = 6
RESPONSE_CENTRAL = 7
###################################################################################################
###################################################################################################
# Setup all the variables that are need for this transfer
###################################################################################################
def initialize()
@log_config = YAML.load_file( LOG_STATS )
@keys = Array.new( 8, true )
@category_f_name_arrays = nil
end
###################################################################################################
###################################################################################################
# Takes in no parameters, but utilizes a constant to figure out what file you want to use.
# The only dependant variable which is needed is @log_config.
###################################################################################################
def update_logs_status
x = File.open( LOG_STATS, 'w' ) do |out|
YAML.dump( @log_config , out )
end
end
###################################################################################################
###################################################################################################
# loop through all the remaining files appending them together and updating the logs_status
# loop_and_append_to_masters depends on @log_config and what catagory is coming in. Are the
# different file names, ie authoring and case activity. Please see log_status.yml
# to see the different attributes @log_config can have.
###################################################################################################
def loop_and_append_to_masters(options={})
# Each new file will be started with master then the given catagory. i.e. forum, response central
fout = File.open("master_#{ options[ :catagory ] }.log", File::WRONLY|File::TRUNC|File::CREAT )
fout.puts( @log_config[ options[ :catagory ] ][ 'headers' ] )
#Now lets loop through the rest of the files.
@category_f_name_arrays[ options[:catagory] ].each do |l_file|
# sod stands for start of data. I dont want to grab the headers when I append the data.
sod = @log_config[ options[ :catagory ] ][ 'start_of_data' ]
# grab the file by its full name and then split it into an array so, we can skip ahead to parse out the headers.
fin = open_and_load( l_file.first )
# Since the file has data in it, state the new current file and date it was created.
@log_config[ options[ :catagory ] ]['current_file'] = l_file.first
@log_config[ options[ :catagory ] ]['cdate'] = l_file.last
# We don't need to parse through the rest of the file if it the exact length of the start of data.
next if (fin.length - 1) == sod
# sod is based on zero, so since fin.length is the actual length, we need to return one less so i know which index
# to start on.
@log_config[ options[ :catagory ] ]['end_line'] = fin.length - 1
# grab the range of data and work with it.
fin[ sod..fin.length-1 ].each do |item|
# Now lets concatinate the files if the line is not empty.
fout.puts( item.strip ) unless item.empty?
# execute the code unless the data is empty. which is what nil means in this case.
end unless fin[ sod..fin.length-1 ].nil? # end fin[ sod..fin.length-1 ].each |item|
end # end @category_f_name_arrays[:catagory].each do |fname|
# once file processing is done for the range, close and write all the newly aquired data to the output.
fout.close()
end
###################################################################################################
###################################################################################################
# files_with_date takes uses @category_f_name_arrays to catagorize the different files. The
# decision to use @category_f_name_arrays in this manner was due to debuging, file loading,
# and complexity issues. If you are in Linux you can execute ruby make_xml_file.rb `ls *.log`
# Or if you are in windows, just executing the file will look in a default location, which can be changed.
###################################################################################################
def files_with_date
# find out if the platform that the script is running on is windows or linux. If it is linux, just look for the arguments.
# If it is windows, look for the files in a predefined folder.
filenames = []
# Check for the platform and execute the appropreate commands
# filenames = ( RUBY_PLATFORM =~ /mswin32/ ) ? ( %x{ dir /B files\\Logs }.split( "\n" ) ) : ARGV
# %x{ dir /B files\\Logs }.split( "\n" ).each do |f| filenames << "files\\Logs\\" + f end
# Gather all the files from the correct drives and directories. What we are not seeing
# is the given drives are mapped directly to the files we are looking for.
%x{ dir /B w:\\ }.split( "\n" ).each do |f| filenames << "w:\\" + f end
%x{ dir /B x:\\ }.split( "\n" ).each do |f| filenames << "x:\\" + f end
%x{ dir /B y:\\ }.split( "\n" ).each do |f| filenames << "y:\\" + f end
# remove from the list all the names that dont have a datae attached to the
filenames = filenames.delete_if{ | x |
!( x =~ /\d{4}_\d{2}_\d{2}/ )
}
# we used @category_f_name_arrays as a hash, is becuase we wanted versatility.
@category_f_name_arrays = { }
@category_f_name_arrays[ 'authoring' ] ||= []
@category_f_name_arrays[ 'case_activity' ] ||= []
@category_f_name_arrays[ 'favorites' ] ||= []
@category_f_name_arrays[ 'forum' ] ||= []
@category_f_name_arrays[ 'kc_view' ] ||= []
@category_f_name_arrays[ 'rar' ] ||= []
@category_f_name_arrays[ 'platform-kanisa' ] ||=[]
@category_f_name_arrays[ 'response_central' ] ||=[]
# Loop through all the newly aquired filenames and place them in the appropreate hash to be sorted later.
# :i is used for keys on the debug. :category is the first match in the regular expression. :cdate is
# the date that is grabed by the regular expression. :fn is the full name of the file.
for fn in filenames
fin = fn.downcase
if fin =~ /(authoring).*(\d{4}_\d{2}_\d{2})/ and @keys[ 0 ]
load_files( { :fn => fn, :i => 0, :cdate => $2, :category => $1 } )
elsif fin =~ /(case_activity).*(\d{4}_\d{2}_\d{2})/ and @keys[ 1 ]
load_files( { :fn => fn, :i => 1, :cdate => $2, :category => $1 } )
elsif fin =~ /(favorites).*(\d{4}_\d{2}_\d{2})/ and @keys[ 2 ]
load_files( { :fn => fn, :i => 2, :cdate => $2, :category => $1 } )
elsif fin =~ /(forum).*(\d{4}_\d{2}_\d{2})/ and @keys[ 3 ]
load_files( { :fn => fn, :i => 3, :cdate => $2, :category => $1 } )
elsif fin =~ /(kc_view).*(\d{4}_\d{2}_\d{2})/ and @keys[ 4 ]
load_files( { :fn => fn, :i => 4, :cdate => $2, :category => $1 } )
elsif fin =~ /(rar).*(\d{4}_\d{2}_\d{2})/ and @keys[ 5 ]
load_files( { :fn => fn, :i => 5, :cdate => $2, :category => $1 } )
elsif fin =~ /(platform-kanisa).*(\d{4}_\d{2}_\d{2})/ and @keys[ 6 ]
load_files( { :fn => fn, :i => 6, :cdate => $2, :category => $1, :min_length => 4, :depth => 7 }
)
elsif fin =~ /(response_central).*(\d{4}_\d{2}_\d{2})/ and @keys[ 7 ]
load_files( { :fn => fn, :i => 7, :cdate => $2, :category => $1 } )
end # end if fn =~ /authoring/
end # end for fn in filenames
# after we have loaded all the files into @category_f_name_arrays, we need to sort them by date.
@category_f_name_arrays.each do |item, value|
next if value.empty?
value.sort! do |x,y|
x.last <=> y.last
end
end
end
###################################################################################################
###################################################################################################
# Load the files in by using the cdate (the date pulled out by the regulare expression) in the yaml file,
###################################################################################################
def load_files( options={} )
# d is the date in question from the regulare expression
date_from_file_name = Date.parse(options[:cdate].gsub(/_/, '-'))
#dc is the date from the log_status.yaml file. :category should be the name garnered from
# the regular epression or some simplified name for the log_status file.
stored_date_in_yaml_file = @log_config[ options[ :category ] ][ 'cdate' ]
# if the current file is older then the file in question, put it in the queue
# to be processed.
if stored_date_in_yaml_file < date_from_file_name && date_from_file_name != Date.today() # 1-1-2000 < 1-1-2009
@category_f_name_arrays[ options[ :category ] ] << [ options[:fn], date_from_file_name ]
end
end
###################################################################################################
###################################################################################################
# debug_info is used to display information,
###################################################################################################
def debug_info( options={} )
# @keys[ options[:i] ] = false
# d is the date in question
d = Date.parse(options[:cdate].gsub(/_/, '-'))
#dc == currently logged date
dc = @log_config[ options[:category] ]['cdate']
if dc < d
@category_f_name_arrays[ options[ :category ] ] << [ options[:fn], d ]
#@log_config[ options[ :category ] ]['date'] = d.to_s
end
the_file = open_and_parse( options[:fn] )
if the_file.length > options[:min_length] ||= 4
#puts "The name of the file is: #{ options[ :fn ].downcase }"
#puts "The length of the file is: #{ the_file.length }"
the_file.each_with_index do |item, index|
#break if index > options[ :depth ] ||= 5
item.each do |cell|
puts cell
end if item.length > 52
puts "The length of the line[#{index}] is: #{item.length}" if item.length > 52
end
end
#puts @keys.to_yaml
end
###################################################################################################
###################################################################################################
def setup_the_xml
filenames = ( RUBY_PLATFORM =~ /mswin32/ ) ? ( %x{ dir /B files\\Logs }.split( "\n" ) ) : ARGV
filenames = filenames.delete_if{ | x |
!( x =~ /\d{4}_\d{2}_\d{2}/ )
}
@category_f_name_arrays = { }
@category_f_name_arrays[ 'authoring' ] ||= []
@category_f_name_arrays[ 'case_activity' ] ||= []
@category_f_name_arrays[ 'favorites' ] ||= []
@category_f_name_arrays[ 'forum' ] ||= []
@category_f_name_arrays[ 'kc_view' ] ||= []
@category_f_name_arrays[ 'rar' ] ||= []
@category_f_name_arrays[ 'platform-kanisa' ] ||=[]
@category_f_name_arrays[ 'response_central' ] ||=[]
for fn in filenames
fin = fn.downcase
#headers are at line[3] and the length is 17
if fin =~ /(authoring).*(\d{4}_\d{2}_\d{2})/ and @keys[ 0 ]
debug_info( { :fn => fn, :i => 0, :cdate => $2, :category => $1 } )
elsif fin =~ /(case_activity).*(\d{4}_\d{2}_\d{2})/ and @keys[ 1 ]
debug_info( { :fn => fn, :i => 1, :cdate => $2, :category => $1 } )
elsif fin =~ /(favorites).*(\d{4}_\d{2}_\d{2})/ and @keys[ 2 ]
debug_info( { :fn => fn, :i => 2, :cdate => $2, :category => $1 } )
elsif fin =~ /(forum).*(\d{4}_\d{2}_\d{2})/ and @keys[ 3 ]
debug_info( { :fn => fn, :i => 3, :cdate => $2, :category => $1 } )
elsif fin =~ /(kc_view).*(\d{4}_\d{2}_\d{2})/ and @keys[ 4 ]
debug_info( { :fn => fn, :i => 4, :cdate => $2, :category => $1 } )
elsif fin =~ /(rar).*(\d{4}_\d{2}_\d{2})/ and @keys[ 5 ]
debug_info( { :fn => fn, :i => 5, :cdate => $2, :category => $1 } )
elsif fin =~ /(platform-kanisa).*(\d{4}_\d{2}_\d{2})/ and @keys[ 6 ]
debug_info( { :fn => fn, :i => 6, :cdate => $2, :category => $1, :min_length => 4, :depth => 7
} )
elsif fin =~ /(response_central).*(\d{4}_\d{2}_\d{2})/ and @keys[ 7 ]
debug_info( { :fn => fn, :i => 7, :cdate => $2, :category => $1 } )
end # end if fn =~ /authoring/
end # end for fn in filenames
end # end def set_up_xml
###################################################################################################
###################################################################################################
def open_and_load(fn_in=nil)
fin = File.open(fn_in, "r" )
file_array = []
fin.each_line do |line| file_array.push line end
#close the file
fin.close
return file_array
end
###################################################################################################
###################################################################################################
def open_and_parse( fn_in=nil, fn_out="temp.log", start_line=0 )
fin = File.open(fn_in, "r" )
file_array = []
fin.each_line do |line| file_array.push line.strip.split("\t") end
#close the file
fin.close
return file_array
end # end def open_and_parse fn_in=nil, fn_out="temp.log", start_line=0
###################################################################################################
###################################################################################################
end
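For reference, open_and_parse above simply reads a log file and splits each stripped line on tab characters. A minimal stand-alone sketch of that parsing step (the parse_tab_log helper and the file name are illustrative placeholders, not part of the original script) could look like this:

# Stand-alone sketch of what open_and_parse does: split each stripped line on tabs.
# parse_tab_log and the file name below are hypothetical.
def parse_tab_log(fn)
  File.readlines(fn).map { |line| line.strip.split("\t") }
end

rows = parse_tab_log('files/Logs/authoring_2012_01_01.log')
rows.each { |cells| puts "#{cells.length} columns" }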
# Delete all the log files just to be clean.
puts "Deleting all log files"
%x{ erase *.log }
# delete the old alldcn.zip file
puts "Deleting the old alldcn.zip"
%x{ erase alldcn.zip }
# Delete the log_conversion.zip file
puts "Deleting log_conversion.zip"
%x{ erase log_conversion.zip }
puts "creating tapestry"
x = XmlStuff.new
tapestry = []
x.files_with_date
# Let's start to create the tapestry of all the threads which create the files.
x.category_f_name_arrays.each do |key, value|
tapestry << Thread.new {x.loop_and_append_to_masters( {:catagory => key} )}
puts "#{key} has been threaded."
end
# Before we move on, let's wait for all the threads to finish.
puts "waiting to join all the threads"
tapestry.each do |t|
t.join
end
# We need to update the log status file so we have the most up-to-date data. The if statement is used for debugging purposes.
if true
puts "updating logs status file"
x.update_logs_status
end
# Use the 7-zip command line utility (32bit) to compress all the logs after they are made.
# http://sourceforge.net/projects/sevenzip/files/7-Zip/4.65/7za465.zip/download
puts "creating log_conversion.zip"
%x{ 7za a -tzip log_conversion.zip *.log }
puts "creating alldcn.zip"
%x{ 7za a -tzip alldcn.zip "c:/fch/Kanisa/Kanisa Platform/KSM/Archive/dcnFiles/*" }
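The driver above fans the per-category work out to one Ruby thread per log category and then joins them all before zipping the results. A minimal sketch of that fan-out/join pattern, with placeholder category names and a stand-in for loop_and_append_to_masters, could look like this:

# Sketch of the thread-per-category fan-out/join used by the driver above.
# The category list and process_category are placeholders.
categories = ['authoring', 'forum', 'kc_view']

def process_category(name)
  puts "processing #{name}"   # stand-in for loop_and_append_to_masters
end

tapestry = categories.map { |name| Thread.new { process_category(name) } }
tapestry.each { |t| t.join }  # wait for every worker before moving on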
OMNITURE
/*
 * Simple example makes a call to the Omniture API to get a company's report suites.
 *
 * Requires the following libraries:
 *   jakarta commons-lang 2.4
 *   jakarta commons-beanutils 1.7.0
 *   jakarta commons-collections 3.2
 *   jakarta commons-logging 1.1.1
 *   ezmorph 1.0.6
 *   json-lib-2.3-jdk13
 *
 * @author Lamont Crook
 * @email [email protected]
 *
 * @edited Kaleb J. Albee
 * @email [email protected]
 */
//package com.omniture.security;
import java.io.*;
import java.util.regex.*;
import java.net.URL;
import java.net.URLConnection;
import java.security.MessageDigest;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import net.sf.json.JSONArray;
import net.sf.json.JSONObject;
import java.text.DateFormat;
import java.text.Format;
import java.text.ParseException;

public class OMTR_REST {
    private static String USERNAME = "albeekj:LDS";
    private static String PASSWORD = "849a07ace5ad6451ac861f158d77dd05";
    private static String LOGOUTPUT = "F:\\fh_share\\fhd_tts\\wiki\\omniture";
    private static String DEVELOPMENT_FOLDER = "F:\\fh_share\\fhd_tts\\dev";
    private static Integer WAIT_TIME = 30;
    //private static String ENDPOINT = "https://sc.omniture.com/p/am/1.2/rest-api.html"; //san jose endpoint
    private static String ENDPOINT = "https://api.omniture.com/admin/1.2/rest/"; //san jose endpoint
    public static final String DATE_FORMAT_NOW = "yyyy-MM-dd";

    private OMTR_REST() {}

    public static String now(int back) {
        Calendar cal = Calendar.getInstance();
        cal.add(Calendar.DATE, -back); // a back value of 1 will put in yesterday's date
        SimpleDateFormat sdf = new SimpleDateFormat(DATE_FORMAT_NOW);
        return sdf.format(cal.getTime());
    }
    //#############################################################################################
    //#############################################################################################
    public static void loadCsv(String msg, String rptName) {
        // Set up the variable to catch the json message.
        String response = msg;
        String fout = "";
        // Assign the DateFormat class the formatting of the string coming in.
        DateFormat df = new SimpleDateFormat("EEE d MMM yyyy");
        // Set up the output formatting.
        Format f = new SimpleDateFormat("yyyyMMdd");
        // I needed to know what date I was going to access and this is its placeholder.
        Date given_date = null;
        // Reach in and grab the report object.
        JSONObject jsonObj = JSONObject.fromObject(response).getJSONObject("report");
        // Now pass the array data to the json array.
        JSONArray jsonArry = JSONArray.fromObject(jsonObj.get("data"));
        JSONArray jtmp = null;
        // The headers to the csv being pumped out.
        fout = "date\tpageViews\tvisits\tunique_visitors\n";
        // I need another date holder.
        String jdate = null;
        // Now we are going to go through the report => data.
        for (int i = 0; i < jsonArry.size(); i++) {
            // jtmp stands for json temporary.
            // For the first item in the array I am looking for the element counts.
            jtmp = JSONArray.fromObject(JSONObject.fromObject(jsonArry.get(i)).get("counts"));
            // Now I need a way to store the date, which comes from the element name.
            jdate = JSONObject.fromObject(jsonArry.get(i)).get("name").toString();
            try {
                jdate = jdate.replaceAll("(?i)\\.", "");
                jdate = jdate.replaceAll("(?i)\\s{2,}", " ");
            } catch (PatternSyntaxException ex) {
                // Syntax error in the regular expression
                ex.printStackTrace();
            } catch (IllegalArgumentException ex) {
                // Syntax error in the replacement text (unescaped $ signs?)
                ex.printStackTrace();
            } catch (IndexOutOfBoundsException ex) {
                // Non-existent backreference used in the replacement text
                ex.printStackTrace();
            }
            // Now try to parse out jdate and format it to a Date type.
            try {
                given_date = df.parse(jdate);
            } catch (ParseException e) {
                e.printStackTrace();
            } // end catch (ParseException e)
            // Once it is formatted we can concatenate it to fout, which later
            // will be written to the log file on the f: drive.
            fout += f.format(given_date) // get the date from the array
                 + "\t" + jtmp.get(0) + "\t" + jtmp.get(1) + "\t" + jtmp.get(2) + "\n";
        } // end for (int i = 0; i < jsonArry.size(); i++)
        // Since fout might have some trailing whitespace characters, we are
        // going to trim off those characters just in case.
        toLogs(fout.trim(), rptName);
        // I am not totally sure if this is useful, but at this point it is just displaying
        // to the screen/console.
        jsonArry = JSONArray.fromObject(jsonObj.get("totals"));
        System.out.println(jsonArry.get(1));
    } // end public static void loadCsv(String msg)
    //#############################################################################################
    //#############################################################################################
    public static void processReport(Map map, String rptName) throws IOException {
        // Now we need to ask the question to the Omniture web services server.
        String response = OMTR_REST.callMethod("Report.QueueOvertime", JSONObject.fromObject(map).toString());
        // on 2/18/2011 we had 1942 tokens
        // The following line is used to parse out the json formatted response. We could put this in
        // XML format, but I don't want to do that yet till we get more complicated queries.
        JSONObject jsonObj = JSONObject.fromObject(response);
        // I cast the number to an int so it is easier to ask questions about it later.
        int rptid = (Integer) jsonObj.get("reportID");
        // Check to see if we have a report.
        if (rptid > 0) {
            // We need to wait WAIT_TIME seconds so Omniture has time to generate the report.
            waiting(WAIT_TIME);
            // Print to the user the report id we received from the Omniture servers.
            System.out.println("the value we got was: " + rptid);
            // Set map to null so we can clear out the value. I ran into problems with this earlier.
            map = null;
            // Now assign the map variable to a new hash map. Once that is done we can assign the parameters to it.
            map = new HashMap();
            // Assign the key reportID the report id. We have to cast it as a string so the mapping will put quotes around it.
            map.put("reportID", "" + rptid);
            // Tell the user the value of the hash map.
            System.out.println(JSONObject.fromObject(map).toString());
            // We need to check the status of the report. If the status is done we can move on.
            response = OMTR_REST.callMethod("Report.GetStatus", JSONObject.fromObject(map).toString());
            // Pass the response to the json parser so we can quickly check what the value was.
            jsonObj = JSONObject.fromObject(response);
            // Finally, let's get the actual report from the Omniture server and then parse it later.
            response = OMTR_REST.callMethod("Report.GetReport", JSONObject.fromObject(map).toString());
            jsonObj = JSONObject.fromObject(response);
            int trys = 10;
            while (jsonObj.get("status").toString().compareTo("not ready") == 0 && trys > 0) {
                trys -= 1;
                System.out.println("not ready waiting for 10 seconds");
                System.out.println(trys + " trys left");
                waiting(10);
                response = OMTR_REST.callMethod("Report.GetReport", JSONObject.fromObject(map).toString());
                jsonObj = JSONObject.fromObject(response);
            } // end while (status is "not ready" && trys > 0)
            // We need to wait x seconds so it has time to pass in all the data.
            waiting(WAIT_TIME);
            if (trys > 1) // if the trys are greater than one, then it must have succeeded
                loadCsv(response, rptName);
        } // end if (rptid > 0)
        else {
            System.out.println("Report id was " + rptid + ". That was not acceptable.");
        } // end if (rptid > 0) else
    } // end public static void processReport()
    //#############################################################################################
    //#############################################################################################
    public static void toLogs(String msg, String rptName) {
        BufferedWriter fout = null;
        try {
            // If I want this to append I need to pass true to the FileWriter constructor as a
            // second parameter.
            fout = new BufferedWriter(new FileWriter(LOGOUTPUT + "_" + rptName + ".csv"));
            fout.write(msg);
            fout.flush();
            fout.close();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (fout != null)
                try {
                    fout.close();
                } catch (IOException ioe2) {
                    // ignore
                }
        }
    }
    //#############################################################################################
    //#############################################################################################
    public static void debug_ouput(String msg) {
        BufferedWriter fout = null;
        try {
            // If I want this to append I need to pass true to the FileWriter constructor as a
            // second parameter.
            fout = new BufferedWriter(new FileWriter(DEVELOPMENT_FOLDER + "\\output.txt"));
            fout.write(msg);
            fout.flush();
            fout.close();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (fout != null)
                try {
                    fout.close();
                } catch (IOException ioe2) {
                    // ignore
                }
        }
    }
//#############################################################################################
    //#############################################################################################
    public static String callMethod(String method, String data) throws IOException {
        URL url = new URL(ENDPOINT + "?method=" + method);
        System.out.println(url);
        URLConnection connection = url.openConnection();
        connection.addRequestProperty("X-WSSE", getHeader());
        connection.setDoOutput(true);
        OutputStreamWriter wr = new OutputStreamWriter(connection.getOutputStream());
        wr.write(data);
        wr.flush();
        InputStream in = connection.getInputStream();
        BufferedReader res = new BufferedReader(new InputStreamReader(in, "UTF-8"));
        StringBuffer sBuffer = new StringBuffer();
        String inputLine;
        while ((inputLine = res.readLine()) != null)
            sBuffer.append(inputLine);
        res.close();
        return sBuffer.toString();
    }
    //#############################################################################################

    //#############################################################################################
    private static String getHeader() throws UnsupportedEncodingException {
        byte[] nonceB = generateNonce();
        String nonce = base64Encode(nonceB);
        String created = generateTimestamp();
        String password64 = getBase64Digest(nonceB, created.getBytes("UTF-8"), PASSWORD.getBytes("UTF-8"));
        StringBuffer header = new StringBuffer("UsernameToken Username=\"");
        header.append(USERNAME);
        header.append("\", ");
        header.append("PasswordDigest=\"");
        header.append(password64.trim());
        header.append("\", ");
        header.append("Nonce=\"");
        header.append(nonce.trim());
        header.append("\", ");
        header.append("Created=\"");
        header.append(created);
        header.append("\"");
        return header.toString();
    }
    //#############################################################################################

    //#############################################################################################
    private static byte[] generateNonce() {
        String nonce = Long.toString(new Date().getTime());
        return nonce.getBytes();
    }
    //#############################################################################################

    //#############################################################################################
    private static String generateTimestamp() {
        SimpleDateFormat dateFormatter = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        return dateFormatter.format(new Date());
    }
    //#############################################################################################

    //#############################################################################################
    private static synchronized String getBase64Digest(byte[] nonce, byte[] created, byte[] password) {
        try {
            MessageDigest messageDigester = MessageDigest.getInstance("SHA-1");
            // SHA-1 ( nonce + created + password )
            messageDigester.reset();
            messageDigester.update(nonce);
            messageDigester.update(created);
            messageDigester.update(password);
            return base64Encode(messageDigester.digest());
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }
    //#############################################################################################

    //#############################################################################################
    // waiting was taken from a website, which I can't remember the URL to.
    public static void waiting(int n) {
        long t0, t1;
        t0 = System.currentTimeMillis();
        do {
            t1 = System.currentTimeMillis();
        } while ((t1 - t0) < (n * 1000));
    }
    //#############################################################################################

    //#############################################################################################
    // Base64Coder is referenced without an import, so it is expected to live in the same package.
    private static String base64Encode(byte[] bytes) {
        return Base64Coder.encodeLines(bytes);
    }
    //#############################################################################################
    //#############################################################################################
    public static void main(String[] args) throws IOException {
        // Declare the base map so we can eventually ask the question to the Omniture server.
        Map map = new HashMap();
        // desc is used to pass parameters to the Omniture server.
        Map desc = new HashMap();
        // a is used to state the type of report we are asking for. We will be asking for an overtime report later.
        Map a = new HashMap();
        // We need to ask for the page views report.
        a.put("id", "pageViews");
        // And we are asking specifically for visits.
        Map b = new HashMap();
        b.put("id", "visits");
        Map c = new HashMap();
        c.put("id", "visitorsdaily");
        // Now we ask for the starting date. The formatting does matter at this point.
        // I am going to have DI check for duplicate dates. If one does exist, then ignore the entry.
        // The integer parameter is a time in days back. If you really want now, you need to enter 0.
        desc.put("dateFrom", OMTR_REST.now(14).toString());
        // The formatting of the date matters to Omniture.
        // Since we have a 1, that means we are one day back, yesterday.
        desc.put("dateTo", OMTR_REST.now(1).toString()); // now is misleading, it is actually yesterday
        // If we look into the api we have several options, but we have chosen to look at the day.
        desc.put("dateGranularity", "day");
        // Now we need to pass the metrics set up earlier to the metrics portion of the json request.
        desc.put("metrics", new Map[]{a, b, c});
        // We need to loop through all the arguments passed in to the application.
        for (int i = 0; i < args.length; i++) {
            // Now we ask for the specific site we want to look at.
            desc.put("reportSuiteID", args[i]);
            // We have the description assembled; we need to pass it to the reportDescription key.
            map.put("reportDescription", desc);
            // The following line is primarily used for debug purposes. It is useful to see the format of the json
            // passed to the Omniture web services server.
            processReport(map, args[i]);
        }
    }
}
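The getHeader and getBase64Digest methods above build Omniture's X-WSSE UsernameToken header, where the password digest is Base64(SHA-1(nonce + created + password)). A rough Ruby sketch of that digest construction follows; the username and password values are placeholders, and it is only meant to illustrate the hashing order, not to replace the Java client:

require 'digest/sha1'
require 'base64'

username = 'user:company'   # placeholder
password = 'secret'         # placeholder

nonce   = Time.now.to_i.to_s                            # same idea as generateNonce()
created = Time.now.utc.strftime('%Y-%m-%dT%H:%M:%SZ')   # same format as generateTimestamp()

# Base64( SHA-1( nonce + created + password ) ), matching getBase64Digest above.
digest = Base64.strict_encode64(Digest::SHA1.digest(nonce + created + password))

header = "UsernameToken Username=\"#{username}\", PasswordDigest=\"#{digest}\", " \
         "Nonce=\"#{Base64.strict_encode64(nonce)}\", Created=\"#{created}\""
puts header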
Family History Center (FHC) Profile
SELECT
ot.fhc_no,
ot.fhc_unit_no,
ot.fhc_name,
DECODE(ot.org_type_id, 1211, 'AHC', NVL(cl.fhc_type, 'FHC') ) AS fhc_type,
ot.fhc_area,
cfdt.cfar_no AS cfar_no,
good.spons_unit_no,
good.sponsoring_unit,
ot.parent_unit_no,
ot.parent_unit,
cfdt.hrs_open AS hrs_open,
ot.center_hours AS fhc_hrs,
cfdt.closed AS closed,
cfdt.area_advisor AS area_advisor,
ot.bill_to_unit_no,
ot.bill_to_unit_name,
cfdt.supp_stks AS supp_stks,
ot.temple_district,
cfdt.visitor_ctr AS visitor_ctr,
cfdt.corr_fac AS corr_fac,
ta.assgn_person_name AS dir_name,
ta.home_phone_number AS dir_phone,
ta.work_phone_number AS dir_work_phone,
ot.fhc_phone,
ta.assgn_email_address AS dir_email,
ot.fhc_email,
l.lang_name AS fhc_lang,
cfdt.NETWK_TYPE AS NETWK_TYPE,
ot.fhc_loc_add1,
ot.fhc_loc_add2,
ot.fhc_loc_add3,
ot.fhc_loc_add4,
ot.fhc_loc_country,
ot.fhc_loc_postal,
ot.fhc_loc_city,
ot.fhc_loc_state,
cfdt.fhc_loc_county AS FHC_LOC_COUNTY,
ot.approval_date,
cfdt.meetinghouse AS MEETINGHOUSE,
cfdt.BLDG_TYPE AS BLDG_TYPE,
cfdt.FM_PROPERTY_NO AS FM_PROPERTY_NO,
cfdt.FM_GROUP_UNIT_NO AS FM_GROUP_UNIT_NO,
cfdt.FM_GROUP AS FM_GROUP,
cfdt.FM_GROUP_PHONE AS FM_GROUP_PHONE,
cfdt.fm_same_bldg AS FM_SAME_BLDG,
cfdt.notes AS NOTES,
cfdt.admin_notes AS ADMIN_NOTES,
cfdt.attention_notes AS ATTENTION_NOTES,
cfdt.INITIAL_FHC_NO AS INITIAL_FHC_NO,
cfdt.CO_FLAG AS CO_FLAG,
cfdt.NET_NO_SHOW AS NET_NO_SHOW,
ot.fhc_name AS fhc_mail_ctr_name,
ta.ASSGN_PERSON_NAME AS fhc_mail_name,
ta.MAILING_STREET_4 AS fhc_mail_add1,
ta.MAILING_STREET_3 as fhc_mail_add2,
ta.MAILING_STREET_2 as fhc_mail_add3,
ta.MAILING_STREET_1 as fhc_mail_add4,
ta.MAILING_POSTAL_CODE as fhc_mail_postal,
ta.MAILING_COUNTRY_COMMON_NAME as fhc_mail_country,
ta.MAILING_STATE_PROV_COMMON_NAME as fhc_mail_state,
cfdt.SUPPORT_OFFICE as SUPPORT_OFFICE,
cfdt.COUNTRY_ADVISOR as COUNTRY_ADVISOR,
ot.mission,
cfdt.no_hrs_open as NO_HRS_OPEN,
cfdt.rept_to_country as REPT_TO_COUNTRY,
ta.cell_phone_number as dir_cell_phone,
cfdt.film_circulation_q as FILM_CIRCULATION_Q,
ot.fhc_loc_add_comp,
ta.MAILING_ADDRESS_COMPOSED as fhc_mail_comp_addr,
ta.MAILING_CITY as fhc_mail_city,
cfdt.FHC_HISTORICAL_NOTES as FHC_HISTORICAL_NOTES,
ta.ASSIGNMENT_ACTIVE_DATE as dir_start_date,
ot.ORG_STATUS_CODE,
ot.fax,
ot.date_loaded updated,
cfdt.XP_LICENSES as XP_LICENSES,
cfdt.NODE as NODE
FROM
orgs_temp ot,
(
--selecting director
select * from tmp_asst sub_ta where sub_ta.POSITION_TYPE_ID = 97
) ta,
(
--########################
--sponsoring unit and number
--########################
select
fhc_stuff.fhc_unit_no, fhc_stuff.fhc_sponsoring_unit_type, fhc_stuff.spons_unit_no, fhc_stuff.Sponsoring_Unit
FROM
( SELECT
nvl(CASE fhc.PARENT_ORG_TYPE_ID
WHEN 5 THEN par.org_name
WHEN 6 THEN par.org_name
WHEN 3 THEN par.org_name
WHEN 1 THEN par.org_name
ELSE gpar.org_name
END, fhc.PARENT_ORG_NAME) AS Sponsoring_Unit,
nvl(CASE fhc.PARENT_ORG_TYPE_ID
WHEN 5 THEN par.unit_number
WHEN 6 THEN par.unit_number
WHEN 3 THEN par.unit_number
WHEN 1 THEN par.unit_number
ELSE gpar.unit_number
END, fhc.PARENT_UNIT_NUMBER) AS spons_unit_no,
nvl(CASE fhc.PARENT_ORG_TYPE_ID
WHEN 5 THEN par.org_type
WHEN 6 THEN par.org_type
WHEN 3 THEN par.org_type
WHEN 1 THEN par.org_type
ELSE gpar.org_type
END, fhc.PARENT_ORG_type) AS fhc_sponsoring_unit_type,
fhc.UNIT_NUMBER AS fhc_unit_no
FROM
mdmr.mdm_org_association moa,
mdmr.mdm_org fhc,
mdmr.mdm_org sup_stake,
mdmr.mdm_org par,
mdmr.mdm_org gpar
WHERE
moa.association_type_code(+)=78
and fhc.org_id=moa.CONSUMER_ORG_ID(+)
and fhc.ORG_TYPE_ID in (44, 49)
and sup_stake.org_id(+)=moa.ASSOCIATED_PROVIDER_ORG_ID
and fhc.ORG_STATUS_CODE=1
and par.org_id(+)=fhc.PARENT_ORG_ID
and gpar.org_id(+)=par.parent_org_id) fhc_stuff
group by
fhc_stuff.fhc_unit_no, fhc_stuff.fhc_sponsoring_unit_type, fhc_stuff.spons_unit_no, fhc_stuff.Sponsoring_Unit) good,
(--###########################
--language subquery join
select
ol.org_id,l.lang_name
from
mdmr.mdm_org_language ol,
mdmr.mdm_language l
where
ol.LANGUAGE_CODE = l.LANGUAGE_CODE and
ol.ORG_SPOKEN_LANGUAGE_RANK = 1
) l, --#########################
(
SELECT
case oc.org_subclass_id
when 91 THEN 'RGN'
when 89 THEN 'CO'
when 90 THEN 'FHC'
ELSE
'FHC'
end AS FHC_TYPE,
oc.ORG_ID
FROM
mdmr.mdm_org_classification oc
WHERE
upper(oc.ORG_CLASSIFICATION) like upper('%family%')
) cl,
DEBUG_TMP_CFHCD4 cfdt
WHERE
ot.org_id = l.org_id(+)
AND ot.FHC_UNIT_NO = cfdt.fhc_unit_no(+)
AND ot.org_id = cl.org_id(+)
AND good.fhc_unit_no(+) = ot.fhc_unit_no
AND ta.org_id(+) = ot.ORG_ID
AND ot.org_type_id in (44, 48, 47)
ORDER BY
ot.fhc_unit_no desc
APPENDIX C – REPORTS
Works Cited
regular-expressions.info. (2002, December 2). Retrieved March 2, 2012, from regular-expressions.info: http://www.regular-expressions.info/oracle.html
DWH Concepts and Fundamentals. (2007). Retrieved June 01, 2012, from dwhinfo.com: http://www.dwhinfo.com/Concepts/DWHConceptsMain.html
Star Schema. (2009). Retrieved May 12, 2012, from Datawarehouse4u.info: http://datawarehouse4u.info/Data-warehouse-schema-architecture-star-schema.html
Star-Schema Design. (2010, January 26). Retrieved July 31, 2012, from Stack Overflow: http://stackoverflow.com/questions/110032/star-schema-design
Benefits of a Data Warehouse. (2011, Jul 31). Retrieved July 10, 2012, from BI-INSIDER.COM: http://bi-insider.com/portfolio/benefits-of-a-data-warehouse/
Ruby, Concurrency, and You. (2011, October 14). Retrieved March 3, 2012, from engine yard: http://www.engineyard.com/blog/2011/ruby-concurrency-and-you/
Dimensional Model Schemas- Star, Snow-Flake and Constellation. (2012). Retrieved July 31, 2012, from Execution-MiH: http://www.executionmih.com/data-warehouse/star-snowflake-schema.php
Foreign key Constraints. (2012). Retrieved March 2, 2012, from msdn.microsoft.com: http://msdn.microsoft.com/en-us/library/ms175464.aspx
LANDesk. (2012, July 20). Retrieved July 31, 2012, from Wikipedia: http://en.wikipedia.org/wiki/LANDesk
List of Business Intelligence (BI) Tools. (2012). Retrieved July 31, 2012, from Business Intelligence Tool Box: http://www.businessintelligencetoolbox.com/list-of-business-intelligence-bi-tools/
Oracle Datatypes Data types for oracle 8 to Oracle 11g. (2012). Retrieved July 31, 2012, from ss64.com: http://ss64.com/ora/syntax-datatypes.html
Scrum Is an Innovative Approach to Getting Work Done. (2012). Retrieved February 20, 2012, from ScrumAlliance: http://www.scrumalliance.org/learn_about_scrum
Top 10 U.S. Websites to Search for Your Ancestors. (2012). Retrieved May 22, 2012, from EasyFamilyHistory.com: http://www.easyfamilyhistory.com/best-of-internet/top-10-websites
What is MySQL? (2012). Retrieved June 1, 2012, from dev.mysql.com: http://dev.mysql.com/doc/refman/5.0/en/what-is-mysql.html
Goyvaerts, J. (2010, December 2). Oracle Database 10g Regular Expressions. Retrieved July 31, 2012, from Regular-expressions.info: http://www.regular-expressions.info/oracle.html
Higginbotham, S. (2012, May 1). Google opens up its BigQuery data analytics service to all. Retrieved July 31, 2012, from gigaom.com: http://gigaom.com/cloud/google-opens-up-its-biq-query-data-analytics-service-to-all/
Mittag, J. W. (2008, September 11). Does ruby have real multithreading? Retrieved March 3, 2012, from Stackoverflow.com: http://stackoverflow.com/questions/56087/does-ruby-have-real-multithreading
Oracle.com. (n.d.). Oracle9i Data Warehousing Guide. Retrieved May 6, 2010, from docs.oracle.com: http://docs.oracle.com/cd/B10501_01/server.920/a96520/concept.htm
ScrumAlliance. (n.d.). ScrumAlliance transforming the world of work. Retrieved February 20, 2012, from ScrumAlliance: http://www.scrumalliance.org/learn_about_scrum
ss64.com. (2011). ss64. Retrieved March 2, 2012, from ss64: http://ss64.com/ora/syntax-datatypes.html
Taylor, S. (2010, April 29). Mormon church's storied Granite Mountain vault opened for virtual tour. Retrieved May 1, 2012, from Deseret News: http://www.deseretnews.com/article/700028045/Mormon-churchs-storied-Granite-Mountain-vault-opened-for-virtual-tour.html
W3C. (2001, March 15). Web Services Description Language (WSDL) 1.1. Retrieved March 3, 2012, from World Wide Web Consortium: http://www.w3.org/TR/wsdl
wikipedia.org. (2012, February 15). wikipedia.org. Retrieved February 1, 2012, from Wikipedia: http://en.wikipedia.org/wiki/Snowflake_schema