Having more than 1 record for the same staff on the same date with the same expense type is a new use case.
The purpose of i.date ne o.date is to not match a record with itself as else all records will get flagged with 'Y' as you've observed.
If we can't use the date for such an exclusion then we need another column (or columns) which uniquely identify a record. I can's see anything suitable in the data you've posted but: Do you have any other column in source which you could map for this purpose - for example a transaction key or a record id?
If yes then map this column and replace i.date ne o.date with i.<key column> ne o.<key column>
IF you can take above approach then you still can implement the process in DIS without any user written code - which is the best practice as then data lineage and impact analysis is fully maintained.
Hi @Patrick
Fully understood. Here are the 4 columns that we can include together to identify the distinct rows:
LocCurrAmt + Doc + DocumentNumber + Description
I have modifed the SQL code to these (But I need your advice to make it work)
proc sql; create table &_OUTPUT1 as select o.*, ( select distinct 'Y' from &_INPUT1 as i where i.StaffNo=o.Staffno and i.ExpTy=o.ExpTy and i.date ne o.date and i.Date between o.Date+30 and o.Date-30 and i.ExpTy IN (6039,6081,6070) and (i.LocCurrAmt ne o.LocCurrAmt and i.DocumentNumber ne o.DocumentNumber and i.Description ne o.Description and i.Doc ne o.Doc) ) as flag length=1 from &_INPUT1 as o ; quit;
Currently I am still getting the same StaffID + ExpTy + Date record that doesnt flag as Y by using command above.
Seeking for your kind advice again Patrick. I think we are very close to there.
EDIT V2:
I have also tried with below SQL Command:
proc sql; create table &_OUTPUT1 as select o.*, ( select distinct 'Y' from &_INPUT1 as i where i.StaffNo=o.Staffno and i.ExpTy=o.ExpTy and i.Date between o.Date+30 and o.Date-30 and i.ExpTy IN (6039,6081,6070) and (i.LocCurrAmt ne o.LocCurrAmt and i.date ne o.date and i.DocumentNumber ne o.DocumentNumber and i.Description ne o.Description and i.Doc ne o.Doc) ) as flag length=1 from &_INPUT1 as o ; quit;
It gives me this result set 😞
Remove: and i.date ne o.date
Not sure about condition: and i.ExpTy IN (6039,6081,6070)
In case you want the whole logic for the Flag column only to be applied on above condition then eventually you will have to extend the condition to: and i.ExpTy IN (6039,6081,6070) and o.ExpTy IN (6039,6081,6070)
...and as long as you've got only AND in your where clause you don't need to bother about brackets.
Hi Patrick.
Thanks for the excellent advice. I removed the date = date and now the result sets are MORE ACCURATE.
However, i still can see data that are in same Date are still not getting flagged as Y.
You may have a look at result below:
As you can see, row 88 to 91 are accurate. So is 93 & 94. However, for Row 95 to 99, it is not flagged as Y.
Please have a look at my SQL below again:
proc sql; create table &_OUTPUT1 as select o.*, ( select distinct 'Y' from &_INPUT1 as i where i.StaffNo=o.Staffno and i.ExpTy=o.ExpTy and i.Date between o.Date+30 and o.Date-30 and i.ExpTy IN (6039,6081,6070) and o.ExpTy IN (6039,6081,6070) and (i.LocCurrAmt ne o.LocCurrAmt and i.DocumentNumber ne o.DocumentNumber and i.Description ne o.Description and i.Doc ne o.Doc) ) as flag length=1 from &_INPUT1 as o ; quit;
Do you have any clue on why row 95 to 99 are not flagged as Y based on the SQL command above?
And i just want these logic apply to ExpenseTy(ExpTy) to that 3 category as per those in SQL command.
You've defined the record identifier as: LocCurrAmt + Doc + DocumentNumber + Description
That means we only exclude a record from matching if ALL columns are the same.
In your SQL you've got:
and (i.LocCurrAmt ne o.LocCurrAmt OR ... OR ... i.Doc ne o.Doc) and i.DocumentNumber ne o.DocumentNumber and i.Description ne o.Description and i.Doc ne o.Doc)
That means we already exclude a record from matching as soon as at least ONE column has the same value.
In your case DOC has always a value of 1 and that's why your logic believes that all these records are the same record.
For what you want to achieve the logic needs to be:
Option 1: and (i.LocCurrAmt ne o.LocCurrAmt OR ... OR ... i.Doc ne o.Doc) Option 2: and NOT ( i.LocCurrAmt = o.LocCurrAmt and i.DocumentNumber = o.DocumentNumber and i.Description = o.Description and i.Doc = o.Doc )
Peoples brains work differently. For me option 2 was always easier to understand.
And last but not least: Looking into your data I believe columns {DOC, DocumentNumber} would be sufficient as record identifiers. If so the your SQL could look like:
proc sql;
create table &_OUTPUT1 as
select
o.*,
( select distinct 'Y'
from &_INPUT1 as i
where
i.StaffNo=o.Staffno
and i.ExpTy=o.ExpTy
and i.Date between o.Date+30 and o.Date-30
and i.ExpTy IN (6039,6081,6070)
and NOT
(
i.DocumentNumber = o.DocumentNumber
and i.Doc = o.Doc
)
)
as flag length=1
from &_INPUT1 as o
;
quit;
BTW: I've also removed and o.ExpTy IN (6039,6081,6070) because looking into the where clause again I've realized that the combination of below already covers this filter condition.
i.ExpTy=o.ExpTy and i.ExpTy IN (6039,6081,6070)
Awesome!
That works 100% perfectly fine @Patrick.
I really appreciate your effort in helping me all this while. I shall buy you a coffee or something if i ever met you.
Once again thanks a lot Patrick!
And i thank everyone that gave me so much feedback and advice with sample codes to help. You guys are awesome.!
And now you can implement all of this using the SQL Join Transformation so it's fully SAS Metadata driven.
This program outputs both dates for any interval <=7 days for the same staffid/expensetype:
proc sort data=have out=need;
by staffno expensetype date;
run;
data want (drop=nxt_date);
set need (drop=date);
by staffno expensetype;
merge need (keep=date)
need (firstobs=2 keep=date rename=(date=nxt_date));
if (last.expensetype=0 and nxt_date-date <=7) or
(first.expensetype=0 and dif(date)<=7);
run;
The downside is if you have 3 dates each within 7 days of its neighbor, the middle record is output twice. That can be fixed by following the above by a proc sort with a nodupkey option, and it puts the data into chronological order:
proc sort data=want nodupkey;
by staffno date expensetype;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.