12-30-2015 11:56 AM
Hello:
I have a dataset containing customer #, debit card # and associated transactions. Problem is a customer may have more than one debit card in use and I need to select the debit card with the greatest number of transactions. My example:
Before data:
Cust# Debit card# Transaction Amount
123 456 $7
123 456 6
123 456 1
123 789 2
123 789 4
123 789 5
123 789 3
123 222 1
123 222 9
After data should look like this:
Cust# Debit card# Transaction Amount
123 789 2
123 789 4
123 789 5
123 789 3
Hence the debit card with the greatest number of transactions should be outputted to a new data set. Problem is how???!!!
Many thanks for your assistance!
12-30-2015 02:53 PM
To this point, SQL is still a viable solution, however other approaches, such as data step may be more efficient. Since we are already far into the SQL, here it is:
proc sql;
create table new as
select * from (select * from
(select *, count(*) as ct from old
group by cust, debitcard
)
group by cust
having ct=max(ct))
group by cust
having debitcard=max(debitcard)
;
quit;
Don't worry, it should work even debitcard is char, as char is stored in binary anyway, so they do have a hiearchy order.
12-30-2015 12:29 PM
I see this to be easier if using Proc SQL:
proc sql;
create table new as
select * from
(select *, count(*) as ct from old
group by cust, debitcard
)
group by cust
having ct=max(ct)
;
quit;
12-30-2015 01:31 PM
INGENIOUS! Thank you Haikuo.
12-30-2015 01:59 PM - edited 12-30-2015 02:03 PM
Sorry; but just discovered another variation in my data. In the case where a customer has multiple debit cards but only 1 transaction on each card the above code will select these as well. Hence, all of customer 666 gets included too which I do not want in final output. I would only need any one record from customer 666. Apologies for not clarifying earlier.
before:
Cust# Debit card# Transaction Amount
123 456 $7
123 456 6
123 456 1
123 789 2
123 789 4
123 789 5
123 789 3
123 222 1
123 222 9
666 111 3
666 112 4
666 113 5
After:
Obs | cust | debitcard | transaction | ct |
---|---|---|---|---|
1 | 123 | 789 | 3 | 4 |
2 | 123 | 789 | 5 | 4 |
3 | 123 | 789 | 4 | 4 |
4 | 123 | 789 | 2 | 4 |
5 | 666 | 111 | 3 | 1 |
6 | 666 | 112 | 4 | 1 |
7 | 666 | 113 | 5 | 1 |
12-30-2015 02:27 PM
Well, the plot always thickens. Consider this:
666 111 3 666 112 4 666 113 5
And this:
666 111 3 666 111 3 666 112 4 666 112 4 666 113 5 666 113 5
And this:
666 111 3 666 111 3 666 111 3 666 111 3 666 112 4 666 112 4 666 112 4 666 112 4 666 113 5 666 113 5
What do you want from above scenarios? This will be more a business decision than a coding requirement. A good business decision will be something like:
1. Consistent
2. Make sense
3. Cover as much ground as posible
4. Easy to convert to computer language.
12-30-2015 02:40 PM
Yes, you are correct as it is more of a business decision. In my particular case the final output must contain 1 customer to 1 debit card (but choose the debit card # with the most transactions when multiple transactions occur). In case of a tie choose any one.
In your examples - the first and second blocks would be a matter of selecting any one of the records since the customer has used each card once or twice. In the last example, card 111 and 112 were utilized four times; hence, choose either 111 or 112.
Many thanks again!
12-30-2015 02:53 PM
To this point, SQL is still a viable solution, however other approaches, such as data step may be more efficient. Since we are already far into the SQL, here it is:
proc sql;
create table new as
select * from (select * from
(select *, count(*) as ct from old
group by cust, debitcard
)
group by cust
having ct=max(ct))
group by cust
having debitcard=max(debitcard)
;
quit;
Don't worry, it should work even debitcard is char, as char is stored in binary anyway, so they do have a hiearchy order.
12-30-2015 03:20 PM
Worked like a dream! Thank you Haikuo once again.
01-01-2016 01:33 AM
Need further help from the community? Please ask a new question.