Solved: Re: Proc compare function?

Mairam2345 · Posted 04-30-2018 11:45 AM

Hello all,

I have a transaction data set with two different columns. One is Customer_ID and the other ist Good_ID. What I would like to do is a comparison of those two attributes group by every single customer.

For example:

Cust_ID = 1 and Good_ID = 1

Cust_ID = 1 and Good_ID = 2

Cust_ID = 2 and Good_ID = 3

Cust_ID = 2 and Good_ID = 4

I want to create a comparison for every single customer where each Good_ID is compare to other Good_ID's of the Customer

While searching for a solution I find the compare function in SAS. Is the compare function in this case applicable?

Thanks and regards,

Mariam

TomKari · Posted 05-03-2018 03:38 PM

Here's what I hope will work for you.

The first step is the query that I described in my first post. getting the count by region and product. Let's say the result dataset is named "Have1", and the count field is named "Count_Field".

Now create a query based on Have1, and pull region into the "Select Data" columns, then pull it in a second time. On the second column, again, set the Summary to "count", and the summary groups box should show that you're summarizing on region. Let's say the result dataset is named "Have2", and the count field is "Count_Field1". Next, on the Filter tab, drag the "computed column", which is your Count_Field1, to the "Filter the summarized data" area, and set the filter to select cases "Greater than" and 1. This will identify the region values that have more than one value of product.

Now a third query, on Have2. Add a table, your orignal dataset. Click the "Join Tables" button, and change the join from an inner join to a left or right join, whichever your original dataset is (so if it's on the right, make it a right join).

Pull all the variables from "Have" onto the "Select Data" tab. Now do "Computed Columns" > "New" > "Advanced Expression", and use a "Case" function to set a variable (I'll call it "Flag") to 0 if the region field in the Have2 dataset is null, or 1 otherwise.

This will identify the records that were joined, which are the ones that had more than one value of product.

Tom

View solution in original post

ballardw · Posted 04-30-2018 11:48 AM

Proc compare is likely not what you want if I understand what you are asking.

Best would be to provide short examples of the two data sets with the variables of concern and a desired result.

Mairam2345 · Posted 04-30-2018 11:57 AM

Hi,

thanks for your quick reply.

The transaction data set looks like this:

Column 1: Customer_ID | Column 2: Good_ID

CustomerID_1 | Good_Banana

CustomerID_2 | Good_Banana

CustomerID_2 | Good_Orange

In this case I have 2 different customers. I have 2 transactions for customer 1 where the customer buys bananas. For customer 2 I have 4 transaction record. This customer 2 usually buys bananas but in one transaction he buys oranges. I want to find those transactions which have different goods than the usual transactions for one customer.

Is that understandable?

Thanks

Mar

ballardw · Posted 04-30-2018 12:11 PM

I might try something like:

proc freq data=have order=freq;
tables customer_id*good_id/list  nopercent ;
run;

Which would create ordered output by decreasing frequency.

I am not going to guess at a rule for "different goods than the usual" because I don't know your customers or what typical orders might look like.

You might send the output to a data set with the OUT= option on the tables statement for your further processing.

Reeza · Posted 04-30-2018 12:22 PM

Defining what's different is actually quite hard at scale but easy for a single example. If I regularly buy 6 bananas and 6 apples but buy 10 is that different? If I buy 3 socks is that different?

At any rate, it sounds like Market Basket Analysis is what you're doing. There's a SAS macro or if you have SAS EM that helps to enable that type of analysis.

Mairam2345 · Posted 04-30-2018 12:29 PM

Hi,

in my transaction data one customer usually buys the same product for all transactions. So the scale is usually based on the one product that the customer has bought in the past. If customer1 buys bananas in 3 different transactions then he will also buys in further transactions. I want to find those outliers where customer1 suddenly buys oranges instead of bananas.

I'm using the SAS enterprise guide.

Reeza · Posted 04-30-2018 12:54 PM

So, if customer primarily purchases a single item, I'd probably subset my data on those that purchased multiple items.

You can use the HAVING clause in a query to select customers or records where they have multiple products.

proc sql;
create table want as
select *
group by customer_ID
having count(distinct productID) > 1;
quit;

There's some other examples here of counting distinct by groups and then you can filter based on those who have more than one group.

https://github.com/statgeek/SAS-Tutorials/blob/master/count_distinct_by_group.sas

Mairam2345 · Posted 05-01-2018 08:49 AM

Hi,

is there no possibility to compare the the Good_IDs of one customer to other Good_Ids of the same customer??

Something like: If Good_ID of transaction1 is not equal to Good_ID of transaction 2 then new column = 1 ??

It's about comparing the results between different rows in one column...

Reeza · Posted 05-01-2018 10:56 AM

@Mairam2345 Sure, there's lots of ways.

But it's still not clear what you want. At this point its probably easiest if you post sample data and the exact output in the format you want.

TomKari · Posted 05-03-2018 03:38 PM

Here's what I hope will work for you.

The first step is the query that I described in my first post. getting the count by region and product. Let's say the result dataset is named "Have1", and the count field is named "Count_Field".

Now create a query based on Have1, and pull region into the "Select Data" columns, then pull it in a second time. On the second column, again, set the Summary to "count", and the summary groups box should show that you're summarizing on region. Let's say the result dataset is named "Have2", and the count field is "Count_Field1". Next, on the Filter tab, drag the "computed column", which is your Count_Field1, to the "Filter the summarized data" area, and set the filter to select cases "Greater than" and 1. This will identify the region values that have more than one value of product.

Now a third query, on Have2. Add a table, your orignal dataset. Click the "Join Tables" button, and change the join from an inner join to a left or right join, whichever your original dataset is (so if it's on the right, make it a right join).

Pull all the variables from "Have" onto the "Select Data" tab. Now do "Computed Columns" > "New" > "Advanced Expression", and use a "Case" function to set a variable (I'll call it "Flag") to 0 if the region field in the Have2 dataset is null, or 1 otherwise.

This will identify the records that were joined, which are the ones that had more than one value of product.

Tom

Reeza · Posted 04-30-2018 11:48 AM

No, compare is not designed for what you're trying to do here. I'm not sure there is a defined procedure here in fact.

Could you provide a data example of what you're trying to do, ie sample data and sample expected output? A small example is all that's needed.It's not exactly clear what output you'd want that would be the following:

I want to create a comparison for every single customer where each Good_ID is compare to other Good_ID's of the Customer

And, are you using the GUI or programming?

@Mairam2345 wrote:

Hello all,

I have a transaction data set with two different columns. One is Customer_ID and the other ist Good_ID. What I would like to do is a comparison of those two attributes group by every single customer.

For example:

Cust_ID = 1 and Good_ID = 1

Cust_ID = 1 and Good_ID = 1

Cust_ID = 1 and Good_ID = 1

Cust_ID = 1 and Good_ID = 2

Cust_ID = 2 and Good_ID = 3

Cust_ID = 2 and Good_ID = 3

Cust_ID = 2 and Good_ID = 4

I want to create a comparison for every single customer where each Good_ID is compare to other Good_ID's of the Customer

While searching for a solution I find the compare function in SAS. Is the compare function in this case applicable?

Thanks and regards,

Mariam

TomKari · Posted 05-01-2018 10:51 AM

I don't find it completely clear what you're after, but here are a couple of queries that might produce some useful data for you.

1. Add the counts to your original data

Set up a new query, with the results of the query I described in the previous post. Add your original table, and do an inner join between Customer_ID and Good_ID. Include all of the columns from the result table, and set the sort order to Customer_ID, the count field descending, and Good_ID. Run the query.

You should see that the results are grouped by customer, with the ID with the highest counts at the start.

2. How many different products per customer?

Set up a new query, with the results of the query I described in the previous post. Pull Customer_ID and Good_ID into the "Select Data" columns, and set the Summary to "count" for Good_ID.

The result will have one record per customer, with the count of how many different fruits they had.

Tom

Registration is open

SAS Training: Just a Click Away