DATA Step, Macro, Functions and more

Delete observations in a by variable group on condition

Accepted Solution Solved
Reply
Contributor
Posts: 24
Accepted Solution

Delete observations in a by variable group on condition

Hi, 

My data looks like this,

Customer_IdTransaction_NoTransaction_DateAmountSales_Credit
46998090228822-Sep-1555.32
46998068647812-Sep-1562.31
46998068720312-Sep-1555.31
49198173522312-Oct-1515.992
49198091729322-Sep-1515.991
49198091729422-Sep-1515.991
49198093835222-Sep-1523.991
63208092608523-Sep-1551.752
63208051771309-Sep-1551.751
63208051263309-Sep-1551.751

 

I need to find only the data with actual sales. Like in customer_id "4699" only the second observation remains. However with customer_id= "4919" I need to randomly delete either of the two observations having amount 15.99. Could you help me how to proceed with this? If statements with first.variable concept or proc sql? 


Accepted Solutions
Solution
‎11-10-2015 11:01 AM
Respected Advisor
Posts: 3,124

Re: Delete observations in a by variable group on condition

[ Edited ]

Here could be solution using Hash:

 

data have;
infile cards dlm='09'x truncover;
input (Customer_Id	Transaction_No)	(:$20.) Transaction_Date :anydtdte20.	Amount	Sales_Credit ;
format Transaction_Date date9.;
cards;
4699	80902288	22-Sep-15	55.3	2
4699	80686478	12-Sep-15	62.3	1
4699	80687203	12-Sep-15	55.3	1
4919	81735223	12-Oct-15	15.99	2
4919	80917293	22-Sep-15	15.99	1
4919	80917294	22-Sep-15	15.99	1
4919	80938352	22-Sep-15	23.99	1
6320	80926085	23-Sep-15	51.75	2
6320	80517713	09-Sep-15	51.75	1
6320	80512633	09-Sep-15	51.75	1
;

data want;
if _n_=1 then do;
dcl hash h(dataset:'have(drop=Transaction_No rename=(Transaction_Date=_date Sales_Credit=_credit) where=(_Credit=2))', multidata:'y');
h.definekey('customer_id', 'amount');
h.definedata(all: 'y');
h.definedone();
call missing (_credit, _date);
end;
set have(where=(Sales_Credit=1));
if h.find()=0 and _date >= Transaction_Date then do;
h.removedup();
delete;
end;
drop _:;
run;

 

View solution in original post


All Replies
Super User
Posts: 5,256

Re: Delete observations in a by variable group on condition

Can you describe why you wish to perform those actions?

And define "actual sale", it isn't self explanatory.

Reducing records be randomly deletes, AND keep details (in your case transaction number) - why? It's "dangerous, since the details it's not a fact any more.

 

I would simply combine a filter with select distinct.

Data never sleeps
Contributor
Posts: 24

Re: Delete observations in a by variable group on condition

[ Edited ]

I am sorry to miss that information. When sales_credit= 2, it refers to a "Product return". Therefore by removing that observation and the observation with similar amount within the customer_id I am trying to achieve only those observations that show actual sales.

 

And yes, I agree the transaction_id is not required once I randomly select data.

Super User
Posts: 17,813

Re: Delete observations in a by variable group on condition

[ Edited ]

Can you guarantee that there will always be an exact match when it's a 2?

And do you need sale dates?

Contributor
Posts: 24

Re: Delete observations in a by variable group on condition

[ Edited ]

Yes, I can. Although there can be multiple 2's for one customer_id with its respective purchase transaction.

Like this:

16759 80902332 22-Sep-15 19.99 2
16759 80902327 22-Sep-15 19.99 2
16759 80902328 22-Sep-15 19.99 2
16759 80902326 22-Sep-15 19.99 2
16759 80902335 22-Sep-15 19.99 2
16759 80159162 05-Sep-15 19.99 1
16759 80161144 05-Sep-15 19.99 1
16759 80160655 05-Sep-15 19.99 1
16759 80191135 06-Sep-15 19.99 1
16759 80191307 06-Sep-15 19.99 1

 

Would it be easier if I skip the the transaction_date?

Super User
Posts: 17,813

Re: Delete observations in a by variable group on condition

Sort of. I would consider simply marking any purchase that was 2 a negative (mulitply by negative 1) and then sum by ID. THat would give total purchase by user...not sure that's what you're after though.
Solution
‎11-10-2015 11:01 AM
Respected Advisor
Posts: 3,124

Re: Delete observations in a by variable group on condition

[ Edited ]

Here could be solution using Hash:

 

data have;
infile cards dlm='09'x truncover;
input (Customer_Id	Transaction_No)	(:$20.) Transaction_Date :anydtdte20.	Amount	Sales_Credit ;
format Transaction_Date date9.;
cards;
4699	80902288	22-Sep-15	55.3	2
4699	80686478	12-Sep-15	62.3	1
4699	80687203	12-Sep-15	55.3	1
4919	81735223	12-Oct-15	15.99	2
4919	80917293	22-Sep-15	15.99	1
4919	80917294	22-Sep-15	15.99	1
4919	80938352	22-Sep-15	23.99	1
6320	80926085	23-Sep-15	51.75	2
6320	80517713	09-Sep-15	51.75	1
6320	80512633	09-Sep-15	51.75	1
;

data want;
if _n_=1 then do;
dcl hash h(dataset:'have(drop=Transaction_No rename=(Transaction_Date=_date Sales_Credit=_credit) where=(_Credit=2))', multidata:'y');
h.definekey('customer_id', 'amount');
h.definedata(all: 'y');
h.definedone();
call missing (_credit, _date);
end;
set have(where=(Sales_Credit=1));
if h.find()=0 and _date >= Transaction_Date then do;
h.removedup();
delete;
end;
drop _:;
run;

 

Contributor
Posts: 24

Re: Delete observations in a by variable group on condition

This works great!! Thank you so much.

Respected Advisor
Posts: 3,887

Re: Delete observations in a by variable group on condition

@Haikuo

The two things I would change in your code:

- A keep statement for the data set loaded into the hash so that you're sure not to load variables you don't want to load

- Loop over the "duplicates" in the hash and chose the oldest date where "_date >= Transaction_Date" is true. You can't be sure that the first match is the correct one, and it could also be possible that the condition is true for more than one return.

 

 

Respected Advisor
Posts: 3,124

Re: Delete observations in a by variable group on condition


Patrick wrote:

@Haikuo

The two things I would change in your code:

- A keep statement for the data set loaded into the hash so that you're sure not to load variables you don't want to load

- Loop over the "duplicates" in the hash and chose the oldest date where "_date >= Transaction_Date" is true. You can't be sure that the first match is the correct one, and it could also be possible that the condition is true for more than one return.

 

 


@Patrick

1. Absolutely Agreed.

2. Not so sure about that. Customer behaves randomly, your suggestion is based on the assumption that 'first in - first out', which is not always necessarily the case. Without more information  to identify the paired transactions, this is as far as I would go. But I do believe  you have raised a valid point to OP.

 

Haikuo

Respected Advisor
Posts: 3,887

Re: Delete observations in a by variable group on condition

@Haikuo

I believe if you don't select the earliest return date which satifies condition "_date >= Transaction_Date" then you are at risk to miss some returns. Consider the following:

Rec_No Customer_Id Transaction_Date Amount Sales_Credit
1 1 1-Nov-15 5 1
2 1 2-Nov-15 5 2
3 1 3-Nov-15 5 1
4 1 4-Nov-15 5 2

 

If you use record 4 to remove record 1 then the pair record 2/record 3 won't satisfy condition "_date >= Transaction_Date" and you'll miss the removal of record 3.

Respected Advisor
Posts: 3,124

Re: Delete observations in a by variable group on condition

[ Edited ]

 

Patrick wrote:

@Haikuo

I believe if you don't select the earliest return date which satifies condition "_date >= Transaction_Date" then you are at risk to miss some returns. Consider the following:

Rec_NoCustomer_IdTransaction_DateAmountSales_Credit
111-Nov-1551
212-Nov-1552
313-Nov-1551
414-Nov-1552

 

If you use record 4 to remove record 1 then the pair record 2/record 3 won't satisfy condition "_date >= Transaction_Date" and you'll miss the removal of record 3.


@Patrick,

Thank you for your following up and your point is well taken. However, if OP unfortunately has data like that, the problem is beyond a simple code fix. Surely you will have your 'balance sheet', but you will never know if paired transations are the actual pairs without more information. If the purpose is just to get the net 'balance', then the condition of '_date >= Transaction_Date' can be completely ignored. Given the data as is, my approach is to get as far as the 'known' can stand, and stop right at the point where 'unknown' starts.

 

Just my 2cents,

Haikuo

 

 

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 11 replies
  • 369 views
  • 2 likes
  • 5 in conversation