
Data Mining 5 Levels


Data Mining Introduction

    By Daniel Calbimonte, 2012/11/12

    Introduction

In Classical Antiquity, an oracle was a person considered to be a source of wise counsel, with prophetic predictions or precognition of the future inspired by the gods. They had the gift of predicting the future and advising people with their wisdom. Today we do not have those oracles. It would be nice to have one to tell us whether our business is going to make a profit, how much we are going to earn in the next two years, and the answers to other questions about the future.

Since we do not have oracles (at least not good ones), data mining was created to help us analyze our information and predict the future.

    Data Mining

Data mining is a process to discover patterns in a large data set. It works like an expert system that uses its historical experience (stored in relational databases or cubes) to predict the future. Let me explain what you can do with data mining using an example:

Imagine that you own a company named AdventureWorks. The company manufactures and sells bikes. You want to predict whether a customer will buy a bike based on the customer's information. How can you accomplish this?

The answer is data mining. It finds the patterns and describes the characteristics of the customers with a higher (or lower) probability of buying a bike. Microsoft includes a nice data mining tool in SQL Server Analysis Services. You do not need to create a cube first; you can work with relational databases directly.

    Example

In this example we are going to work with the AdventureWorksDW database. If you do not have it installed, you can download it from http://msftdbprodsamples.codeplex.com/.

Once AdventureWorksDW is installed, run a SELECT to verify the existing information in the vTargetMail view:

SELECT *

FROM [AdventureWorksDW].[dbo].[vTargetMail]

    If you review the results you will find a lot of information about the customers like:

The customer key, the title, the age, the birthdate, Name, LastName, MaritalStatus, Suffix, Gender, EmailAddress, YearlyIncome, TotalChildren, NumberChildrenAtHome, EnglishEducation, SpanishEducation, FrenchEducation, EnglishOccupation, SpanishOccupation, FrenchOccupation, HouseOwnerFlag, NumberCarsOwned, AddressLine1, AddressLine2, Phone, DateFirstPurchase, CommuteDistance, Region, Age and BikeBuyer.

All this information is important, but it is a lot! How can I find patterns? For example, whether a person is married (the MaritalStatus column) may affect the decision to buy a bike. The age is important as well; depending on their age, people may or may not want to buy a bike. How do you know which column is important? Which characteristic has the most impact on the decision to buy a bike?

As you may notice, it is pretty hard to find which attributes affect the decision because there are 32 columns in the table. There are too many combinations, so it is hard to find patterns. If you create a cube with all the information it will be easier, but even with cubes we may miss some patterns because of the number of possible combinations.

That's why we use data mining: to organize, analyze and prioritize all the columns.

Notice that there is a special column named BikeBuyer (the last column). This column has the value 1 if the customer bought bikes and 0 if he didn't. This is the value that we want to predict. We want to know whether a customer will buy bikes based on our experience (in this case, the experience is the vTargetMail view).
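Before building the model, you can get a feel for the experience the model will learn from with a quick T-SQL query. This is only a sanity check I like to run against the relational database (it is not part of the data mining wizard); it counts how many customers bought a bike and how many did not:

SELECT BikeBuyer, COUNT(*) AS Customers
FROM dbo.vTargetMail
GROUP BY BikeBuyer
ORDER BY BikeBuyer;
-- BikeBuyer = 1: customers that bought a bike
-- BikeBuyer = 0: customers that did not buy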

    Getting started

    In this example, I will show how to create a Data Mining project using the view vTargetMail.

There are four steps here.

1. Create a Data Source

2. Create a Data Source View

    3. Create a Data Mining Project

    4. Predict information using the Mining Model

Create a Data Source

    First, we are going to select the SQL Server and the connection properties. This is the Data Source.

1. To start a Data Mining project we will use the SQL Server Business Intelligence Development Studio included with the SQL Server installation.

2. Go to File > New Project and select the Analysis Services Project.

3. In the Solution Explorer, right-click Data Sources and select New Data Source.

4. In the Data Source Wizard, press Next.

5. We are going to create a new data connection. Press New.

6. In the Connection Manager, specify the SQL Server name and the database. In this scenario we are going to use the AdventureWorksDW database.

7. In the Data Source Wizard, press Next.

8. Press Next and then Finish.

You have created a Data Source pointing to the AdventureWorksDW database.

    Create a Data Source View

Now we are going to add the vTargetMail view. In order to add it, we are going to use a Data Source View. In short, a Data Source View lets us add the tables and views used in the project.

1. In the Solution Explorer, right-click Data Source Views and select New Data Source View.

2. In the Welcome to the Data Source View Wizard page, press Next.

3. In the Select a Data Source window, select the Data Source just created.

4. In Select Tables and Views, select the vTargetMail view and press the > button.

5. In the Completing the Wizard window, press Finish.

We just created a Data Source View containing the view that will give experience to our data mining model. vTargetMail contains historical data about the customers. Using that experience, our mining model will predict the future.

    Data Mining Model

    Now we are going to create the Mining Model using the Data Source and Data Source View created before.

1. In the Solution Explorer, right-click the Mining Structures folder and select New Mining Structure.

2. In the Welcome to the Data Mining Wizard page, press Next.

3. In the Select the Definition Method step, select the option From existing relational database or data warehouse and press Next. As you can see, we can use relational databases, data warehouses or cubes.

4. In the Create the Data Mining Structure window, select Create mining structure with a mining model, select Microsoft Decision Trees and press Next. I am going to explain the details of the mining techniques in another article. For the moment, let's say that we are using the Decision Trees algorithm for this example.

5. In Select Data Source View, select the Data Source View created earlier and press Next.

6. In Specify Table Types, select the vTargetMail view.

7. In the Bike Buyer row, check the Predict checkbox and press the Suggest button.

    In this option, we are selecting which information we want to predict. In this scenario we want to predict if the person is a bike buyer or not.

8. In the Input column, check all the column names with a score different from 0. What we are doing is choosing which columns are relevant to the decision to buy a bike.

9. In the left column, select the first name, last name and email (these will be used to drill through the information) and press Next.

10. In the Specify Columns' Content and Data Type step, press Detect and then Next.

I am going to explain content types and data types in future articles. For the moment, let's say that we are detecting the data types.

11. In the Create Testing Set step, set the Maximum number of cases in testing data set to 100 and press Next.

This window is used to hold back data to test the model. I will explain more details in later articles.

12. In the Completing the Wizard window, write the mining structure name and mining model name, check Allow drill through and press Finish.

13. Now click the Mining Model Viewer tab and you will receive a message asking to deploy the project. Press the Yes button.

14. You will get a message asking to process the model. Press Yes.

15. In the Process Mining Model window, press Process.

16. In the Process Progress window, once it has finished successfully, press Close.

17. Click the Mining Model Viewer tab. The following decision tree should be displayed.


We just created a data mining project using Decision Trees, and it is ready to test. Our final task is to use it: I will create some queries to predict whether a user will buy a bike using the mining model.
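By the way, everything the wizard just did can also be written in DMX, the data mining query language of Analysis Services. The statement below is only a simplified, hand-written sketch of a roughly equivalent model definition (I list just a few of the input columns and assume the same names used in the wizard); the wizard builds the full definition for you, so you never have to type this:

CREATE MINING MODEL [DTModel]
(
    [Customer Key] LONG KEY,
    [Age] LONG CONTINUOUS,
    [Commute Distance] TEXT DISCRETE,
    [English Education] TEXT DISCRETE,
    [Number Cars Owned] LONG DISCRETE,
    [Yearly Income] DOUBLE CONTINUOUS,
    [Bike Buyer] LONG DISCRETE PREDICT  -- the column we want to predict
)
USING Microsoft_Decision_Trees
WITH DRILLTHROUGH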

    Predict the future

Now that we have our mining model, let's ask our oracle whether a customer with specific characteristics will buy our bikes. We will create two queries.

The first query will ask our oracle whether a 45-year-old customer with a commute distance of 5-10 miles and a high school education will buy a bike. The second query will ask whether a 65-year-old customer with a commute distance of 1-2 miles and missing education will buy a bike.

    1. First of all we need to move to the Mining Model Prediction Tab. Click there.

2. In the Mining Model window, click the Select Model button.

3. In the Select Model window, expand Data Mining > DTStructure, select the DTModel and click OK.

4. In the Select Input Table(s) window, click Select Case Table.

5. In the Select Table window, select the vTargetMail view and press OK.

6. Right-click the Select Input Table(s) window and select Singleton Query.

7. In the Singleton Query Input, specify the following information:

Age: 45, Commute Distance: 5-10 Miles, English Education: High School, English Occupation: Professional, Marital Status: S, Number Cars Owned: 5, Number Children At Home: 3. In this step we are specifying the customer characteristics.

8. In the Source combo box, select the DTModel mining model.

9. In the second row of the Source column, click the combo box and select Prediction Function.

    10. In the second row of the field column, select the PredictHistogram

    11. In the Criteria Argument column write: [DTModel].[Bike Buyer]

What we are doing is asking for the probability that this user will buy a bike, using the PredictHistogram function. (A DMX sketch of the equivalent query appears after these steps.)

    12. Now, click on the switch icon and select Result to verify the results of the query.

13. If you look at the results you will find that the probability of buying a bike is 0.6213, which means 62%. So now we have our oracle ready to predict the future!

14. Finally, we are going to ask if another user with the following characteristics will buy a bike:

Age: 65, Commute Distance: 1-2 Miles, English Education: Missing, English Occupation: Clerical, Marital Status: S, Number Cars Owned: 1, Number Children At Home: 0. Again, we are specifying the customer characteristics.

15. Once this is done, let's click the switch icon and select Result.

The probability of buying for a customer with these characteristics is 57%.
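The switch icon used in step 12 also has a Query option that shows the DMX statement the designer builds for you. The query below is a hand-written sketch of a similar singleton query, assuming the model and column names used above; you could run it against the Analysis Services database in SQL Server Management Studio:

SELECT
    PredictHistogram([DTModel].[Bike Buyer])  -- each possible value of Bike Buyer with its probability
FROM
    [DTModel]
NATURAL PREDICTION JOIN
(
    SELECT
        45 AS [Age],
        '5-10 Miles' AS [Commute Distance],
        'High School' AS [English Education],
        'Professional' AS [English Occupation],
        'S' AS [Marital Status],
        5 AS [Number Cars Owned],
        3 AS [Number Children At Home]
) AS t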

    Summary

In this article we described how to predict the future using data mining. There are many different scenarios where data mining can be applied. In this example we used a Decision Tree algorithm to predict the future.

We used a view to feed our mining model and then we asked the model whether two customers would buy bikes. The first one has a probability of 62% and the second one 57%.

Now that you have your mining model ready, you can ask it about the future.

    Good luck.

    References

    http://nocreceenlosarboles.blogspot.com/2011/11/al-oraculo-de-delfos-no-le-dejan-votar.html

    http://msdn.microsoft.com/en-us/library/ms167167(v=sql.105).aspx

Data Mining Introduction Part 2

    By Daniel Calbimonte, 2012/12/31

In my first article about data mining we introduced data mining with a classic example based on AdventureWorks. In this article I am going to complement the first one and talk about decision trees. Let me summarize in a few words how the data mining model worked.

Data mining is an expert system. It learns from experience, and the experience can be obtained from a table, a view or a cube. In our example the data mining model learned from the view named dbo.vTargetMail. That view contains information about the customers.

People usually think that they need to use cubes to work with data mining. We worked with the Business Intelligence Development Studio or SQL Server Data Tools (in SQL Server 2012), but we did not use cubes, dimensions or hierarchies (we could use them, but it is not mandatory). We simply used a view.

    If we run the following query we will notice that we have 18484 rows in the view used.

SELECT COUNT(1) FROM dbo.vTargetMail

Something important to point out about data mining is that we need a lot of data to predict the future. If we have only a few rows in the view, the mining model will be inaccurate. The more data you have, the more accurate the model will be.

Another problem in data mining is choosing the input data. How can we determine which information is important for the mining model? We can guess a little bit.

Let's return to the AdventureWorks company and think about the customers that may want to buy bikes. The salary may be important: if you do not have money to buy a bike, you will not buy it. The number of cars is important as well; if you have 5 cars you may not want a bike because you prefer to drive your cars.

So there is some data that may be useful as input to predict whether the customer is going to buy a bike. How can we determine which columns are important and which ones are not? To start, we can simply think about it. Are the address or the email of the customers important for the model?

They may not be, especially the email. Does someone with Hotmail have a lower chance of buying a bike than a person with Gmail? I guess not. There are some input columns that we could remove from the model intuitively. However, the data mining tool lets you determine which columns actually affect the decision to buy a new bike.
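Before asking the mining model, you can also eyeball a suspected relationship with plain T-SQL. The query below is just a quick sketch against the relational AdventureWorksDW database: it shows how bike buyers are distributed by the number of cars owned, which is exactly the kind of pattern the dependency network will formalize for us:

SELECT NumberCarsOwned, BikeBuyer, COUNT(*) AS Customers
FROM dbo.vTargetMail
GROUP BY NumberCarsOwned, BikeBuyer
ORDER BY NumberCarsOwned, BikeBuyer;
-- Compare the counts for BikeBuyer = 1 and 0 within each number of cars to see the trend by hand.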

    The Dependency Network

In the data mining model, go to the Mining Model Viewer tab and then to the Dependency Network tab. The Bike Buyer oval is the analysis that we are doing: we want to analyze whether a person X is a possible buyer. The Number of Children, Yearly Income, Region and the other ovals are the columns of the view. With the dependency network we can analyze which columns influence the decision to buy a bike.

If you adjust the links slider, you can see which column has the most influence on the decision to buy a bike.

In this example the number of cars owned is the most important factor in the decision to buy a bike.

The second factor is the yearly income. This information is very important for business analysts and the marketing team.

In my first article we used Decision Trees. Decision Trees are one of the different algorithms used by Microsoft to predict the future, in this case to predict whether a customer X is going to buy a bike. In the viewer combo box we can select the option Microsoft Generic Content Tree Viewer. This option lets you see some technical details about the algorithm.

For more information about nodes and cardinality, visit this link: http://msdn.microsoft.com/en-us/library/cc645772(v=110).aspx
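The node information shown by the Generic Content Tree Viewer can also be queried with DMX. The sketch below assumes the model is named DTModel, as in part 1; NODE_CAPTION, NODE_SUPPORT and NODE_TYPE are standard columns of the mining model content schema:

SELECT NODE_CAPTION,  -- for example 'Number Cars Owned = 2'
       NODE_SUPPORT,  -- number of cases that reach the node
       NODE_TYPE
FROM [DTModel].CONTENT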

    About Decision Trees

Decision Trees is the first basic algorithm that we used in this series. This data mining algorithm divides the population into different nodes in order to predict whether the customers want to buy a bike. The nodes have branches and child nodes.

The first node contains all the cases. If you click on the node, there is a Mining Legend at the right with all the cases used. The value 0 is the number of customers that did not buy bikes, and the value 1 is the group of customers that bought bikes. Colors let you see the percentage of customers in each category graphically.

The second level of nodes divides the cases by the number of cars owned.

You can see that the colors of the nodes are different: the darker nodes contain more cases. If you click Number Cars Owned = 2 you will notice that the number of cases is 6457; if you click Number Cars Owned = 3, the number of cases is 1645.

    The other nodes are related to the Yearly Income and Age. There is a lot of information that can be analyzed here.

I am going to talk about the mining accuracy chart in future articles. To end this article, we are going to take a list of prospective buyers and predict whether they will buy a bike.

For this example we are going to use the dbo.ProspectiveBuyer table that is included in the AdventureWorksDW database. Let's move to the Mining Model Prediction tab.

Let's select a model. In this case, select the decision tree model.

In Select Input Table(s), click Select Case Table.

In the Select Table window, select the ProspectiveBuyer table. This table contains all the prospective buyers. We are going to determine their probability of buying bikes.

In the Source column select the TM Decision Tree model. Also select the following fields from ProspectiveBuyer:

ProspectiveBuyerKey, FirstName, LastName and Email. Finally select Prediction Function as the source and choose PredictProbability.

What we are doing is showing the first name, last name, email and the probability of buying a bike. PredictProbability returns a value from 0 to 1: the closer the value is to 1, the more likely the customer is to buy a bike.

To verify the results, select the Result option.

Now you have the information of the prospective buyers and their probability of buying bikes. You predicted the future again!

For example, Adam Alexander has a probability of 65% while Adrienne Alonso has a probability of 50%. We should focus on the customers with higher probabilities and find out why they prefer to buy bikes. The main reason is the number of cars and, after that, the yearly income.
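Behind the scenes, the designer generates a DMX PREDICTION JOIN for this batch prediction. The statement below is a simplified, hypothetical sketch: it assumes a data source named [Adventure Works DW] and maps only a few of the input columns, while the designer maps every matching column for you:

SELECT
    t.FirstName,
    t.LastName,
    t.EmailAddress,
    PredictProbability([TM Decision Tree].[Bike Buyer], 1) AS [Probability To Buy]
FROM
    [TM Decision Tree]
PREDICTION JOIN
    OPENQUERY([Adventure Works DW], 'SELECT * FROM dbo.ProspectiveBuyer') AS t
ON
    [TM Decision Tree].[Gender] = t.Gender AND
    [TM Decision Tree].[Marital Status] = t.MaritalStatus AND
    [TM Decision Tree].[Number Cars Owned] = t.NumberCarsOwned
    -- map the remaining input columns the same way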

    Conclusion

In this article we talked a little more about data mining and explained how the decision tree works. Finally, we predicted the future with a list of possible customers and found which of them are more likely to buy bikes.

Data Mining Introduction Part 3: The Cluster Algorithm

    By Daniel Calbimonte, 2013/03/12

This is part 3 of a series on data mining. If you want parts 1 and 2, you can find them here:

Data Mining Introduction part 1

Data Mining Introduction part 2

    In the last chapter I talked about the decision tree algorithm. The decision tree is the first algorithm that we used to explain the behavior of the customers using data mining.

We found and predicted some results using that algorithm, but sometimes there are other algorithms that are better predictors of the future.

    In this new article I will introduce a new algorithm.

    The Microsoft Cluster Algorithm

The Microsoft Clustering algorithm is a technique to group the objects under study according to different patterns. It is different from decision trees because the decision tree uses branches to classify the information, while Microsoft Clustering is a segmentation technique that divides the customers into different groups. These segments are not always intuitive for humans.

For example, a data mining algorithm once detected that young men usually buy beer and diapers together at the supermarket. The clustering algorithm groups the customers according to different characteristics like age, salary, number of cars, etc.

The figure displayed above shows a cluster: a segment of 7 customers grouped together.

In this tutorial we are going to create a clustering model that creates different groups of people according to their characteristics. The image below is a sample of how it groups them:

You may ask yourself: when should I use the decision tree and when should I use the clustering algorithm? There is a nice accuracy chart that SQL Server Analysis Services (SSAS) uses to measure that. I will explain that chart in another article.

Now, let's start working with the clustering algorithm and verify how it works.

    Requirements

    For this example, I am using the Adventureworks Multidimensional project and the AdventureworksDW Database. You can download the project and the database here:

    http://msftdbprodsamples.codeplex.com/releases/view/55330

    Getting started

    Open the AdventureWorksDW Multidimensional project. If it is not processed, process it.

Open the Targeted Mailing.dmm structure.

In this sample we are going to work with the Targeted Mailing.dmm structure. Double-click it, then click the Mining Models tab and you will get the image below.

The Mining Models tab contains all the models used to simulate the behavior of the customers. In this example we already have Decision Trees (explained in part 2 of this series). The decision tree and the cluster models receive the same input. This input is a view named dbo.vTargetMail, which contains customer information like the email, name, age, salary and so on.

    In Data Mining Part 1 in the Data Mining Model Section you will find the steps to create a data mining structure. That structure can be used by other algorithms. In other words, once you have a structure created as an input for the model, you do not need to create it again for other algorithms.

In this sample the clustering model is already created. If it were not, you would only need to click the Create a related mining model icon.

You only need to specify a name and choose the algorithm name; in this case, choose Microsoft Clustering. Note that you do not need to specify the input and prediction columns again, because that was already done when you created the structure in parts 1 and 2 of the series.
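For reference, adding a model to an existing structure is also a single statement in DMX. The sketch below is an assumed equivalent of what the icon does: it reuses a few of the columns already defined in the Targeted Mailing structure (the real structure has more columns) and keeps Bike Buyer as the predictable column:

ALTER MINING STRUCTURE [Targeted Mailing]
ADD MINING MODEL [MyClusterModel]
(
    [Customer Key],
    [Age],
    [Number Cars Owned],
    [Yearly Income],
    [Region],
    [Bike Buyer] PREDICT  -- same predictable column as the decision tree model
)
USING Microsoft_Clustering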

    You will receive a message to reprocess the model, press Yes.

In the next window press Run to process the model.

Once finished, the mining structure will show the start time and the duration of the process.

Go to the Mining Model Viewer tab and select the MyClusterModel just created to visualize the clustering algorithm. As you can see, it is an algorithm that creates different groups for all the customers. The groups are named Cluster 1, Cluster 2 and so on. Each cluster groups people based on their characteristics.

For example, cluster 1 contains people from Europe with a salary between 10,000 and 35,000 USD, while cluster 2 contains people from North America with a salary between 40,000 and 1,700,000 USD. In the picture below you will find the different clusters created:

There are also different colors for the nodes. The darker colors are used for higher-density clusters. In this case the colors correspond to the population; it is the shading variable. You can change the shading variable and the colors will change according to the value selected.

If you click the Cluster Profiles tab, you will find the different variables and the population for each cluster. The total population is 18,484. Cluster 1 is the most populated cluster and cluster 2 is the second one; in other words, the cluster numbers are ordered according to the population.

The variables show the customers' characteristics like age and salary, and you can see the population with different colors for each characteristic. You can find interesting information here.

You can also click the Cluster Characteristics tab and see the characteristics per cluster. In this example we are going to select cluster 1.

You will find here that the main characteristic of cluster 1 is that the people are from Europe. That means that an important segment of the people that buy bikes are European. The second characteristic is the yearly income; the salary is really important as well.

    Note and compare the information from the decision tree (in chapter 2) and the cluster. The information provided is really different. We cannot say that the information from the decision tree is better than the cluster model. We can say that the information is complementary.

    We also have the Cluster Discrimination tab. With this information you can visually find the differences between two clusters. For example, select Cluster 1 and Cluster 2.

As you can see, the yearly income is a big difference between these two clusters. Cluster 2 earns more money than cluster 1. The same goes for the region: cluster 2 members do not necessarily live in Europe like cluster 1; they are mainly from North America and they earn more money.

As you can see, you can target the different clusters with specific promotions and strategies.

Finally, let's predict the probability of a customer buying a bike. The prediction section works the same way as with decision trees; we can say that the data mining model can be used like a black box to predict probabilities. In this example we are going to find a customer's probability of buying a bike.

Click the Mining Model Prediction tab. In the Mining Model window, press the Select Model button.

In the Select Mining Model window, select the model created at the beginning of this article (MyClusterModel).

I am not going to explain in detail the steps to create a singleton query because they were already explained in part 1 (see the "Predict the future" section).

In part 1 we used the decision tree algorithm to predict the probability that one customer with specific characteristics would buy a bike.

In this sample we are going to repeat the same steps, but using the new cluster model. In step 7 we are going to use different characteristics:

What we are doing here is asking the clustering algorithm the probability that someone with a commute distance of 5-10 miles, a high school education, female, a house owner, single, with 3 cars, one child, a professional occupation and from North America will buy a bike. We are using the cluster model named MyClusterModel and the PredictHistogram function, a function that returns probabilities from 0 to 1.

We will finally look at the results of the query:

In the results we will see that the probability of buying a bike is 0.544 (54%) and the probability that the customer will not buy is about 0.46 (46%).
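Clustering models also expose a couple of dedicated DMX functions. The singleton query below is a sketch with assumed column names matching the structure: it returns the cluster the case falls into, how strongly it belongs to it, and the bike buyer probability in one shot:

SELECT
    Cluster() AS [Assigned Cluster],  -- most likely cluster for this case
    ClusterProbability() AS [Cluster Probability],
    PredictProbability([MyClusterModel].[Bike Buyer], 1) AS [Probability To Buy]
FROM
    [MyClusterModel]
NATURAL PREDICTION JOIN
(
    SELECT
        '5-10 Miles' AS [Commute Distance],
        'High School' AS [English Education],
        'F' AS [Gender],
        'S' AS [Marital Status],
        3 AS [Number Cars Owned],
        1 AS [Number Children At Home],
        'North America' AS [Region]
) AS t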

Conclusion

In this chapter we used a new algorithm named Microsoft Clustering. The way it organizes the information is different, but the input used is the same as for the decision tree.

The output using the Mining Model Prediction tab is the same no matter which algorithm is used. The results will differ according to the accuracy of the algorithm; we will talk about accuracy in later chapters.

    References

    http://msdn.microsoft.com/en-us/library/ms174879.aspx

    Images

    http://userwww.sfsu.edu/art511_h/acmaster/Project1/project1.html

    http://www.iglesiadedios.info/maranatha/2012/julio/eligiendo_c01.html

Data Mining Introduction Part 4: The Naïve Bayes Algorithm

    By Daniel Calbimonte, 2013/04/15

This is the fourth article about data mining. If you want to find the other parts, they are:

Data Mining Introduction part 1 - Getting started with the basics

Data Mining Introduction part 2 - Decision Trees

Data Mining Introduction part 3 - The Cluster Algorithm

In the last chapter we created a data mining model using the clustering algorithm. In this new article I will introduce a new algorithm: the Naïve Bayes algorithm.

The Microsoft Naïve Bayes Algorithm

The Microsoft Naïve Bayes algorithm is based on Bayes' theorem, which was formulated by Thomas Bayes, an English Presbyterian minister (and mathematician).

In fact, Bayes' theorem was presented to the Royal Society only after Thomas Bayes' death (at least he is famous now and we are talking about him on the most visited SQL Server site in the world!).

Microsoft created an algorithm based on this theorem. We call this algorithm naïve because it does not consider dependencies between attributes. I am not going to show you Bayesian formulas because Microsoft provides an easy-to-use interface that does not require knowledge of the underlying mathematics; that part is transparent to the user.

In a few words, what the algorithm does is compute, for each attribute separately, the probability of a certain outcome.

In the AdventureWorks example used in this tutorial we have a list of prospective customers who may buy a bike. With the algorithm we show the percentage of people who will buy a bike according to isolated characteristics.

The algorithm classifies the customers by age and shows the probability of buying a bike for each age range. It does the same for each attribute. It is naïve because it does not consider the dependencies between attributes.

As you may notice from the information just provided, it is a simple algorithm (that's why we call it naïve) and it requires fewer resources to compute the results. This is a great algorithm to quickly start researching relationships between the attributes and the outcome.

For example, the address attribute may or may not affect the probability of buying a bike. In fact, there is arguably a relationship between the address and the probability of buying a bike, because in some neighborhoods you can easily ride a bike and in others you cannot because of the traffic. Unfortunately, it is really hard to group all the addresses, so it is a good question whether we need to include the attribute in the model.

    If you are not sure which attributes are relevant, you could test the attribute using this algorithm.

    Requirements

    We are still using the Adventureworks databases and projects from the Data Mining Part 3.

In that chapter we already created a model to predict the probability of a customer buying bikes using decision trees and clusters. The model used a view as its input, and we already created that input to feed the algorithm. We will use the same information to create the Naïve Bayes model.

    Getting started

1. We are going to open the AdventureWorks project used in earlier chapters with SQL Server Data Tools (SSDT).

2. In the Solution Explorer, we are going to move to the Mining Structures folder.

3. In the Mining Structures folder, double-click Targeted Mailing. It is the sample used to find which customers are prospective buyers so we can email them.

4. Click the Mining Models tab and click the Create a related mining model icon.

5. In the New Mining Model window, select a name for the model. You can specify any name. Also select the Microsoft Naïve Bayes algorithm.

6. This is very important: Naïve Bayes does not support continuous data. In this sample, it does not support the Yearly Income column. That is why you will receive the following message:

7. Discrete data means that the number of values is finite (for example, the gender can be male or female). Continuous data can take an infinite number of values (for example, the weight or the size of something). Naïve Bayes supports only discrete data.

8. It cannot work with data like the salary, taxes, extra income and other continuous values. The algorithm classifies the attribute values into groups; if an attribute has an infinite number of possible values, it cannot classify them. That is why it excludes this type of attribute.

In the Mining Structure tab you can optionally ignore some inputs in the model. By default, Yearly Income is ignored if you press Yes to the question about ignoring the column. (A DMX sketch of adding such a model by hand appears after these steps.)

9. Now click the Mining Model Viewer tab and select the Naïve Bayes model just created.

10. You will receive a message asking to process the new model. The model will be loaded with the data when you process it. Click Yes.

11. In the Process Mining Model window, press the Run button. This will start processing the information from the view and computing the model with the algorithm.

12. At the end of the process you will receive a message with the start time and the duration of the process.

13. The first tab is the Dependency Network. This is similar to the decision tree viewer: you will find what the main factors for buying a bike are. At the beginning you will see that all the attributes have some influence on whether a customer buys a bike.

14. However, if you move the dependency slider, you will find that the main factor is the number of cars. That means that depending on how many cars you own, the probability of buying a bike changes a lot.

15. The second most important factor is the age.

16. The third factor is the number of children. Maybe if you have many children you will want to buy more bikes (or maybe none, because it is too dangerous).

17. The next tab is the Attribute Profiles tab. This is nice graphical information that helps you profile the attribute values. For example, most of the people that buy bikes are 45-52 years old, and their commute distance is from 0 to 1 mile. You can analyze the whole population, the people who buy bikes (value 1) and the people who do not buy (value 0).

18. There is another tab named Attribute Characteristics. Here you can see the probability of buying a bike according to specific attribute values. For example, people are likely to buy a bike if they do not have children at home and they are male and single.

19. The Attribute Discrimination tab shows you the attributes and the attribute values that most favor buying or not buying a bike.

For example, people without a car may want to buy a bike, but people with 2 cars are unlikely to buy one. People with 5 children won't buy a bike (because controlling 5 kids in the street would drive them crazy) while people with 1 child are likely to buy a bike.

20. Finally, we will ask our model the probability that a prospective customer who is 40-45 years old, with a commute distance of 5-10 miles, a high school education, female, single, a house owner, with 3 cars and 3 children, will buy a bike. For this purpose we will go to the Mining Model Prediction tab.

21. In the Mining Model window, make sure that the Naïve Bayes model is selected and press the Select Case Table button.

22. In the Select Table window, select the table that you will use to feed your Naïve Bayes query. In this case, dbo.TargetMail contains a list of prospective customers to send mails to.

23. In the Select Input Table(s) window, select the Singleton Query option. This option lets you create a query for a single customer with specific characteristics.

24. Now enter the characteristics specified in step 20 (age, number of children, education and so on).

25. Also add the prediction function named PredictHistogram. This function returns a histogram with the probability of buying a bike. Use the Naïve Bayes model as the source and [Bike Buyer] as the criteria/argument.

26. Finally, go to the results.

You will see that the probability of buying a bike for the customer with the information provided is 0.5915 (59%).
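The key point of this chapter, that Naïve Bayes only accepts discrete inputs, also shows up if you add the model with DMX instead of the designer. The statement below is an assumed sketch with column names matching the Targeted Mailing structure; the continuous Yearly Income column is simply left out of the list (the structure still contains it, this model just ignores it):

ALTER MINING STRUCTURE [Targeted Mailing]
ADD MINING MODEL [MyNaiveBayesModel]
(
    [Customer Key],
    [Commute Distance],
    [English Education],
    [Gender],
    [Marital Status],
    [Number Cars Owned],
    [Number Children At Home],
    -- [Yearly Income] is continuous, so it is not listed for Naive Bayes
    [Bike Buyer] PREDICT
)
USING Microsoft_Naive_Bayes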

    Conclusion

In this chapter we used a new algorithm named Microsoft Naïve Bayes. We learned that it is a simple algorithm that does not accept continuous values for its attributes; it only accepts discrete values. This algorithm is used to get fast results and, in particular, to analyze individual attributes. In the next chapter we will explain more data mining algorithms.

    The way to predict data is similar no matter the algorithm used.

    References

    http://msdn.microsoft.com/en-us/library/ms174806%28v=sql.110%29.aspx

http://en.wikipedia.org/wiki/Thomas_Bayes

http://en.wikipedia.org/wiki/Bayesian_probability

http://msdn.microsoft.com/en-us/library/ms174572.aspx

Data Mining Introduction Part 5: The Neural Network Algorithm

    By Daniel Calbimonte, 2013/05/22

In earlier articles I explained the following Microsoft data mining algorithms:

Decision Trees

Clusters

Naïve Bayes

    There is also an introduction to this series if you are interested.

Using these algorithms, we examined a view in SQL Server and predicted the probability that customers would buy a bike from the fictitious company AdventureWorks. In this new chapter we will talk about the Neural Network algorithm. This one is my favorite.

As the name says, the Neural Network is a pretty nice algorithm based on the way we think the brain works. Let's start by comparing a human being with the Microsoft Neural Network using a simple example: the baby example.

When babies come into the world, they experiment with the environment. They eat dirt, flies and paper. They learn from these experiences. They receive the dirt as input, and if they like it, it becomes part of their menu. In their brain, using that input, the neural network creates connections, and babies learn what is best for them and which food should be rejected.

The Microsoft Neural Network is similar to the babies and the human brain:

There are three layers: the input layer, the hidden layer and the output layer.

    The Input Layer

If we think about the baby, the input would be the dirt. The baby eats the dirt, tastes it and decides whether he likes it. In Microsoft Data Mining we use a view with the past experience of customers who bought a bike or not. With that input, the Neural Network can make some inferences and predict. The more data it has, the more precise the prediction is.

    The Hidden Layer

In the baby example, the brain creates different connections and sends electricity through different paths. When a baby eats dirt, the brain sends a bad electrical sensation and the baby learns that dirt does not taste good (for some babies, anyway).

In our example, the Microsoft algorithm tests different combinations of possibilities. It analyzes whether people 30-45 years old have a high probability of buying a bike. If the result is positive, it keeps the result and continues comparing the different attributes of the customer (gender, salary, cars, etc.).

    The Output Layer

The output is the result of the experience: whether the baby likes the dirt or not. He experiences the taste of the food with his mouth and determines what is best for himself.

Neural networks can be applied to OCR, speech recognition, image analysis and other artificial intelligence tasks. In this case we are going to use neural networks for our data mining example.

In the Microsoft Neural Network, the system tests different combinations of states and finds the option that best suits the needs. The output is the result of the different tests made by the algorithm.

    Getting started

    In the part 2 and part 3 of these articles I explained how to create the other algorithms based on a simple View with the customer information. Based on that information, we created a Data Mining Model and added the different Algorithms.

    We are going to continue using the model of earlier chapters and add the new Neural Network Algorithm. Follow these steps.

1. Open the AdventureWorks project used in earlier chapters and double-click the Targeted Mailing structure.

2. In that project we already added the views and inputs to the data mining project; now we are going to add the Neural Network algorithm. In the Mining Models tab, press the Create a related mining model icon.

3. Write any name in the Model Name textbox and choose Microsoft Neural Network as the algorithm name.

4. If everything is OK, a new model should be created:

5. In the Mining Models tab, click the Process the mining structure icon.

6. In the Process Mining Model window, press the Run button.

7. Once the process is done, close the window.

8. In order to see the model, go to the Mining Model Viewer tab and select My neural network.

9. You will find that customers older than 88 would not buy a bike (Favors 0). This is because they are too old to ride a bike. The same goes for people 74-79 years old or 79-88. On the other hand, people from the Pacific region will likely buy a bike, and they are potential customers (Favors 1). If the customer has 4 cars, he may not buy a bike.

If the customers have 3 children they may not want to buy a bike, and if the age is between 40 and 45 they may want to buy one.

As in the previous chapter, we will ask the model the probability that a prospective customer who is 40-45 years old, with a commute distance of 5-10 miles, a high school education, female, single, a house owner, with 3 cars and 3 children, will buy a bike.

10. Finally, in order to test the model we are going to apply the same steps used in earlier chapters. If you did not read the earlier chapters, refer to the article about Naïve Bayes, steps 20 to 26: http://www.sqlservercentral.com/articles/Data+Mining/97948/

11. We will select the neural network model using the Select Model button.

12. Choose the My neural network model.

13. Using the Singleton Query option, specify the customer characteristics (age, gender, marital status, etc.) and use the PredictHistogram function to get the probability of buying a bike.

14. Verify the results.

The probability of buying a bike for a single female aged 40-45 with these characteristics is 40% (0.4085014051).
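As in the earlier chapters, the designer is simply building a DMX singleton query for us. The sketch below is the same kind of query we used for the decision tree and the cluster, with only the model name swapped (here I assume the model was named My Neural Network in step 3), which is why the prediction workflow feels identical regardless of the algorithm:

SELECT
    PredictHistogram([My Neural Network].[Bike Buyer])  -- probabilities for Bike Buyer = 1 and 0
FROM
    [My Neural Network]
NATURAL PREDICTION JOIN
(
    SELECT
        '5-10 Miles' AS [Commute Distance],
        'High School' AS [English Education],
        'F' AS [Gender],
        'S' AS [Marital Status],
        3 AS [Number Cars Owned],
        3 AS [Number Children At Home]
) AS t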

    Conclusion

In this chapter we used a new algorithm named the Neural Network. The neural network is one of the most exciting algorithms and it can be used for complex prediction problems.

Even though the algorithm is complex, using it with Microsoft Data Mining is very simple. In the next chapter we will talk about

    References and images

http://msdn.microsoft.com/en-us/library/ms174806%28v=sql.110%29.aspx

    http://en.wikipedia.org/wiki/Neural_network

    http://pijamasurf.com/2010/05/comer-tierra-aumenta-la-inteligencia-te-pone-de-buenas/

    http://msdn.microsoft.com/en-us/library/ms174572.aspx