Solved: Re: Delete observations in a by variable group on condition

SASEnthusiast · Posted 11-09-2015 04:23 PM

Hi,

My data looks like this,

Customer_Id	Transaction_No	Transaction_Date	Amount	Sales_Credit
4699	80902288	22-Sep-15	55.3	2
4699	80686478	12-Sep-15	62.3	1
4699	80687203	12-Sep-15	55.3	1
4919	81735223	12-Oct-15	15.99	2
4919	80917293	22-Sep-15	15.99	1
4919	80917294	22-Sep-15	15.99	1
4919	80938352	22-Sep-15	23.99	1
6320	80926085	23-Sep-15	51.75	2
6320	80517713	09-Sep-15	51.75	1
6320	80512633	09-Sep-15	51.75	1

I need to find only the data with actual sales. Like in customer_id "4699" only the second observation remains. However with customer_id= "4919" I need to randomly delete either of the two observations having amount 15.99. Could you help me how to proceed with this? If statements with first.variable concept or proc sql?

Haikuo · Posted 11-09-2015 10:39 PM

Here could be solution using Hash:

data have;
infile cards dlm='09'x truncover;
input (Customer_Id	Transaction_No)	(:$20.) Transaction_Date :anydtdte20.	Amount	Sales_Credit ;
format Transaction_Date date9.;
cards;
4699	80902288	22-Sep-15	55.3	2
4699	80686478	12-Sep-15	62.3	1
4699	80687203	12-Sep-15	55.3	1
4919	81735223	12-Oct-15	15.99	2
4919	80917293	22-Sep-15	15.99	1
4919	80917294	22-Sep-15	15.99	1
4919	80938352	22-Sep-15	23.99	1
6320	80926085	23-Sep-15	51.75	2
6320	80517713	09-Sep-15	51.75	1
6320	80512633	09-Sep-15	51.75	1
;

data want;
if _n_=1 then do;
dcl hash h(dataset:'have(drop=Transaction_No rename=(Transaction_Date=_date Sales_Credit=_credit) where=(_Credit=2))', multidata&colon;'y');
h.definekey('customer_id', 'amount');
h.definedata(all: 'y');
h.definedone();
call missing (_credit, _date);
end;
set have(where=(Sales_Credit=1));
if h.find()=0 and _date >= Transaction_Date then do;
h.removedup();
delete;
end;
drop _:;
run;

View solution in original post

LinusH · Posted 11-09-2015 04:40 PM

Can you describe why you wish to perform those actions?

And define "actual sale", it isn't self explanatory.

Reducing records be randomly deletes, AND keep details (in your case transaction number) - why? It's "dangerous, since the details it's not a fact any more.

I would simply combine a filter with select distinct.

Data never sleeps

SASEnthusiast · Posted 11-09-2015 04:47 PM

I am sorry to miss that information. When sales_credit= 2, it refers to a "Product return". Therefore by removing that observation and the observation with similar amount within the customer_id I am trying to achieve only those observations that show actual sales.

And yes, I agree the transaction_id is not required once I randomly select data.

Reeza · Posted 11-09-2015 05:18 PM

Can you guarantee that there will always be an exact match when it's a 2?

And do you need sale dates?

SASEnthusiast · Posted 11-09-2015 05:24 PM

Yes, I can. Although there can be multiple 2's for one customer_id with its respective purchase transaction.

Like this:

16759	80902332	22-Sep-15	19.99	2
16759	80902327	22-Sep-15	19.99	2
16759	80902328	22-Sep-15	19.99	2
16759	80902326	22-Sep-15	19.99	2
16759	80902335	22-Sep-15	19.99	2
16759	80159162	05-Sep-15	19.99	1
16759	80161144	05-Sep-15	19.99	1
16759	80160655	05-Sep-15	19.99	1
16759	80191135	06-Sep-15	19.99	1
16759	80191307	06-Sep-15	19.99	1

Would it be easier if I skip the the transaction_date?

Reeza · Posted 11-09-2015 05:31 PM

Sort of. I would consider simply marking any purchase that was 2 a negative (mulitply by negative 1) and then sum by ID. THat would give total purchase by user...not sure that's what you're after though.

Haikuo · Posted 11-09-2015 10:39 PM

Here could be solution using Hash:

data have;
infile cards dlm='09'x truncover;
input (Customer_Id	Transaction_No)	(:$20.) Transaction_Date :anydtdte20.	Amount	Sales_Credit ;
format Transaction_Date date9.;
cards;
4699	80902288	22-Sep-15	55.3	2
4699	80686478	12-Sep-15	62.3	1
4699	80687203	12-Sep-15	55.3	1
4919	81735223	12-Oct-15	15.99	2
4919	80917293	22-Sep-15	15.99	1
4919	80917294	22-Sep-15	15.99	1
4919	80938352	22-Sep-15	23.99	1
6320	80926085	23-Sep-15	51.75	2
6320	80517713	09-Sep-15	51.75	1
6320	80512633	09-Sep-15	51.75	1
;

data want;
if _n_=1 then do;
dcl hash h(dataset:'have(drop=Transaction_No rename=(Transaction_Date=_date Sales_Credit=_credit) where=(_Credit=2))', multidata&colon;'y');
h.definekey('customer_id', 'amount');
h.definedata(all: 'y');
h.definedone();
call missing (_credit, _date);
end;
set have(where=(Sales_Credit=1));
if h.find()=0 and _date >= Transaction_Date then do;
h.removedup();
delete;
end;
drop _:;
run;

SASEnthusiast · Posted 11-10-2015 12:45 AM

This works great!! Thank you so much.

Patrick · Posted 11-10-2015 03:21 AM

@Haikuo

The two things I would change in your code:

- A keep statement for the data set loaded into the hash so that you're sure not to load variables you don't want to load

- Loop over the "duplicates" in the hash and chose the oldest date where "_date >= Transaction_Date" is true. You can't be sure that the first match is the correct one, and it could also be possible that the condition is true for more than one return.

Haikuo · Posted 11-10-2015 10:52 AM

@Patrick wrote:
@Haikuo
The two things I would change in your code:
- A keep statement for the data set loaded into the hash so that you're sure not to load variables you don't want to load
- Loop over the "duplicates" in the hash and chose the oldest date where "_date >= Transaction_Date" is true. You can't be sure that the first match is the correct one, and it could also be possible that the condition is true for more than one return.

@Patrick

1. Absolutely Agreed.

2. Not so sure about that. Customer behaves randomly, your suggestion is based on the assumption that 'first in - first out', which is not always necessarily the case. Without more information to identify the paired transactions, this is as far as I would go. But I do believe you have raised a valid point to OP.

Haikuo

Patrick · Posted 11-11-2015 03:22 AM

@Haikuo

I believe if you don't select the earliest return date which satifies condition "_date >= Transaction_Date" then you are at risk to miss some returns. Consider the following:

Rec_No	Customer_Id	Transaction_Date	Amount	Sales_Credit
1	1	1-Nov-15	5	1
2	1	2-Nov-15	5	2
3	1	3-Nov-15	5	1
4	1	4-Nov-15	5	2

If you use record 4 to remove record 1 then the pair record 2/record 3 won't satisfy condition "_date >= Transaction_Date" and you'll miss the removal of record 3.

Haikuo · Posted 11-11-2015 10:35 AM

@Patrick wrote:
@Haikuo
I believe if you don't select the earliest return date which satifies condition "_date >= Transaction_Date" then you are at risk to miss some returns. Consider the following:
Rec_No Customer_Id Transaction_Date Amount Sales_Credit
1 1 1-Nov-15 5 1
2 1 2-Nov-15 5 2
3 1 3-Nov-15 5 1
4 1 4-Nov-15 5 2

If you use record 4 to remove record 1 then the pair record 2/record 3 won't satisfy condition "_date >= Transaction_Date" and you'll miss the removal of record 3.

@Patrick,

Thank you for your following up and your point is well taken. However, if OP unfortunately has data like that, the problem is beyond a simple code fix. Surely you will have your 'balance sheet', but you will never know if paired transations are the actual pairs without more information. If the purpose is just to get the net 'balance', then the condition of '_date >= Transaction_Date' can be completely ignored. Given the data as is, my approach is to get as far as the 'known' can stand, and stop right at the point where 'unknown' starts.

Just my 2cents,

Haikuo

Registration is open

SAS Training: Just a Click Away