Monday, March 26, 2012

I can't understand the meaning of a prediction query

Dear friends,
I'm reading Wiley's Data Mining with SQL Server 2005. There are many things I can't understand about the MovieClick example (Chapter 3).
I hope someone can help me with these problems.

WARNING (1): I'm a dummy with both SQL Server and data mining.
WARNING (2): My English is not good at all.

Just two questions for now:

1) When I create the model to predict the number of bedrooms for homeowners, the book says to check BEDROOMS as Predictable... question: is it also an INPUT for the model, or PREDICTABLE only?

2) I'd like to keep this model (number of bedrooms.......) and make a prediction query.

- Query builder
- select case table -> Homeowners
- Drag the Customer ID column from the Homeowners table and drop it on the grid
- Drag the BEDROOMS column from the mining model and drop it on the grid.
- On the last row: Source=PredictionFunction, Field=PredictProbability
- Drag the BEDROOMS column from the mining model and drop it into Criteria/Argument
- Add (e.g.) 'Two or Three' to the Criteria/Argument field

I execute the query and I get a table with many rows and the following columns: CustomerID, BEDROOMS and Expression. WHAT DOES THIS MEAN?
WHICH INFO DO I GET FROM THOSE NUMBERS? WHAT CAN I LEARN FROM THEM?

Thanx a lot in advance, please help me!

1: It won't actually matter if you make it INPUT as well. When you mark a column as PREDICTABLE, it generally means that the algorithm will try to learn about the data in that column. When you mark it as INPUT, it generally means that the algorithm will treat the column as information about the predictable columns. For trees, for example, the algorithm will create a tree for each predictable column, and the trees can have splits on the input columns. Marking a column as both INPUT and PREDICTABLE generally only matters if you have more than one PREDICTABLE column, since a column doesn't act as an input for itself. For example, the tree for # of bedrooms isn't going to have a split on # of bedrooms, but if you also make Home Ownership predictable, then # of bedrooms could be a condition in the tree for Home Ownership. (The one exception is clustering, but that's a longer explanation.)

2: What do the numbers mean? The first column is just the customer ID from the input - this could also be the customer name, for example. The second column is the # of bedrooms predicted by the algorithm - that is, the bedroom value that has the highest probability of all options given the input. The last column is the probability that the # of bedrooms would be 'Two or Three', whether or not 'Two or Three' is the value predicted in the second column.
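To make the difference between the second and third columns concrete, here is a small Python sketch. It is not DMX and it is not how Analysis Services computes anything internally; it only mimics the idea for one customer. The probabilities and the state names other than 'Two or Three' are invented for illustration.

```python
# Hypothetical probability distribution over BEDROOMS for ONE customer.
# The numbers and the states "One" and "Four or more" are made up.
bedroom_distribution = {
    "One": 0.15,
    "Two or Three": 0.35,
    "Four or more": 0.50,
}


def predicted_state(distribution):
    """Return the most probable state (what the BEDROOMS column shows)."""
    return max(distribution, key=distribution.get)


def probability_of(distribution, state):
    """Return the probability of one chosen state (what Expression shows)."""
    return distribution.get(state, 0.0)


print(predicted_state(bedroom_distribution))
print(probability_of(bedroom_distribution, "Two or Three"))
```

In this made-up case the BEDROOMS column would show the most likely value, while Expression would show the probability of 'Two or Three' only, so the two columns do not have to refer to the same value.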

What can you use this for? Suppose, for example, you were building a model to predict whether people would buy a product. You would want to predict which people would buy and which would not. In addition, you would probably want to contact the people with the highest probability of buying, so you need to use the PredictProbability function with a specified value. As you learn to use the product more, you will see that you can use the Profit Chart to determine the maximum profit for a marketing campaign using a Mining Model. The chart will give you a probability threshold. You then use this threshold in a DMX query to select all customers who have a probability higher than that threshold in order to achieve maximum profit.
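The thresholding step described above can be sketched in Python as follows. This is only an illustration of the filtering logic, not a DMX query; the customer IDs, the probabilities and the 0.5 threshold are all hypothetical, and in practice the threshold would come from the Profit Chart.

```python
# Hypothetical scored rows: (customer id, probability of buying).
scored_customers = [(101, 0.82), (102, 0.40), (103, 0.65), (104, 0.12)]


def select_targets(rows, threshold):
    """Keep customers whose probability exceeds the threshold, highest first."""
    selected = [row for row in rows if row[1] > threshold]
    return sorted(selected, key=lambda row: row[1], reverse=True)


# 0.5 is an assumed threshold, chosen only for this example.
targets = select_targets(scored_customers, 0.5)
for customer_id, probability in targets:
    print(customer_id, probability)
```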

Hope this helps

-Jamie
