BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
imdickson
Quartz | Level 8

Hi Everyone here, Good Day.

 

I am using SAS Data Integration Studio(Can be SAS Code or SQL Code) trying to solve one of the business rules given. The business rule is: To find out the Staff claimed for the same expense type, AND date purchase within 30 days(For example, one person buy toner every week, buying the same)(expense type)

I have data in this format:

StaffNo|Date      |Trip    |Desc  |  InvoiceNumber  | ExpenseType  | Amount----------------------------------------------------------------------------
1      |01JAN2015 |A       |XX    | ZXC             |100           |50
1      |05JAN2015 |A       |XX    | ZXC             |100           |50
1      |02FEB2015 |A       |XX    | ZXC             |100           |50
1      |03APR2015 |B       |YY    | PPP             |90            |1000
1      |02MAY2015 |B       |YY    | PPP             |90            |1000
2      |01OCT2017 |C       |LO    | 123             |55            |777
2      |02JAN2018 |D       |HI    | 456             |66            |888
2      |02MAY2018 |E       |LL    | 789             |44            |11
3      |11MAR2016 |F       |PO    | 99AA            |122           |88.5
3      |13MAR2016 |G       |PO    | 99AA            |122           |88.5
3      |31DEC2016 |H       |PO    | 99AA            |122           |88.5

 

Here is the thing, based on the sample record, i need to find out for Staff Number 1 to 3(In actual data, more than 100k Staff) that purchased the same ExpenseType item within 30 days. If found out, create a flag with value 'Y' with column named called 'RepeatPurchase' for all records that have this pattern.

 

In layman term, i want to find out the same StaffNo that purchases the same ExpenseType within 30 days and there is no rules stating to compare the first record against the 2nd or the last record, but preferably to always compare the next row of record that tie to the same StaffNo and ExpenseType.

 

Do take note that the Date might range from 1 day difference to 10 years difference for next row record, not necessarily to be in above Date pattern as above is just a sample.

 

I couldn't think of the way to do this.I was thinking to loop but couldn't think of the logic of it. Furthermore, if it can be solved without the use of loop or just a simple SQL statement with transpose or etc, it would be great either. However, whichever works is fine.

 

Any advice/guide given is very much appreciated.

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

@imdickson

You've defined the record identifier as: LocCurrAmt + Doc + DocumentNumber + Description

 

That means we only exclude a record from matching if ALL columns are the same.

 

In your SQL you've got:

and (i.LocCurrAmt ne o.LocCurrAmt OR ... OR ... i.Doc ne o.Doc)
and i.DocumentNumber ne o.DocumentNumber
and i.Description ne o.Description
and i.Doc ne o.Doc)

That means we already exclude a record from matching as soon as at least ONE column has the same value.

In your case DOC has always a value of 1 and that's why your logic believes that all these records are the same record.

 

For what you want to achieve the logic needs to be:

Option 1:
and (i.LocCurrAmt ne o.LocCurrAmt OR ... OR ... i.Doc ne o.Doc)

Option 2:
and NOT
  (
    i.LocCurrAmt = o.LocCurrAmt
    and i.DocumentNumber = o.DocumentNumber
    and i.Description = o.Description
    and i.Doc = o.Doc
  )

Peoples brains work differently. For me option 2 was always easier to understand.

 

And last but not least: Looking into your data I believe columns {DOC, DocumentNumber} would be sufficient as record identifiers. If so the your SQL could look like:

proc sql;
  create table &_OUTPUT1 as
  select 
    o.*,
    ( select distinct 'Y' 
      from &_INPUT1 as i 
      where 
        i.StaffNo=o.Staffno
        and i.ExpTy=o.ExpTy
        and i.Date between o.Date+30 and o.Date-30
        and i.ExpTy IN (6039,6081,6070)
        and NOT
        (
i.DocumentNumber = o.DocumentNumber
and i.Doc = o.Doc ) ) as flag length=1 from &_INPUT1 as o ; quit;

BTW: I've also removed and o.ExpTy IN (6039,6081,6070) because looking into the where clause again I've realized that the combination of below already covers this filter condition.

i.ExpTy=o.ExpTy
and i.ExpTy IN (6039,6081,6070)

 

View solution in original post

22 REPLIES 22
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Sort the data, then check the lag e.g:

proc sort data=have;
  by staffno expensetype date;
run;

data want;
  set have;
  by staffno;
  if not first.staffno then do;
    if lag(expensetype)=expensetype and date-lag(date) between 0 and 30 then flag="Y";
  end;
run;

Do note, please post test data in the form of a datastep in future.

PaigeMiller
Diamond | Level 26

@RW9 wrote:

Sort the data, then check the lag e.g:

proc sort data=have;
  by staffno expensetype date;
run;

data want;
  set have;
  by staffno;
  if not first.staffno then do;
    if lag(expensetype)=expensetype and date-lag(date) between 0 and 30 then flag="Y";
  end;
run;

Do note, please post test data in the form of a datastep in future.


I thought of something like this, but as I understand the problem, you need to compare all records within a 30 day period, not just the next record if it is in a 30 day period, which apparently is what this code is doing.

 

My next thought was that you need to transpose the data somehow so that all dates and expensetype for a given staffno are in the same row, and then looping allows you to compare every date to every other date for that employee, then see if the two dates are within 30 days and if the expensetype is the same.

--
Paige Miller
RW9
Diamond | Level 26 RW9
Diamond | Level 26

If that is the case, then no need to transpose, just left join a list of dates within 30 days using SQL, something like (and just pulling this out of the air):

proc sql;
  ...
  from have a
  left join (select distinct staffno from have where expensedata between a.expensedate and a.expensedate-30) n
  on   a.id=b.id;
quit;

 

imdickson
Quartz | Level 8

Thanks RW9. I was trying with your code but unfortunately, i get syntax error highlighting between statement.

I want to try the code that you gave first before going to the 2nd post. May I know what could be wrong for between statement?

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Mmm, it should work, anyways you can change to just logic:

proc sort data=have;
  by staffno expensetype date;
run;

data want;
  set have;
  by staffno;
  if not first.staffno then do;
    if lag(expensetype)=expensetype and (0 <= date-lag(date) <= 30) then flag="Y";
  end;
run;
s_lassen
Meteorite | Level 14

@RW9: I think there is an error in your code: If you execute the LAG function contidionally, it will (normally) not work correctly. Also, you can simplify your code a slight bit by using the DIF function, DIF(X) is equivalent to X-LAG(X).

 

I think what you mean to do is something like this:

proc sort data=have;

  by staffno expensetype date;

run;

data want;

  set have;

  by staffno;

  if not first.staffno and lag(expensetype)=expensetype and (0 <= dif(date) <= 30) then

    flag="Y";

run;

With this code, the data from the second record for a given staffno will be compared to the first, and not to the last record from the previous staffno.

imdickson
Quartz | Level 8

Hi,

 

thanks for your assistance. However, what if i want to also show the first record for 2nd records onwards that have repeated purchase?

currently the flag does not apply to first filter but i want the first record mark as Y if 2nd record is marked as Y.

 

I tried second.staffno but SAS doesnt have second. Only first and last.

Patrick
Opal | Level 21

@imdickson wrote:

Hi,

 

thanks for your assistance. However, what if i want to also show the first record for 2nd records onwards that have repeated purchase?

currently the flag does not apply to first filter but i want the first record mark as Y if 2nd record is marked as Y.

 

I tried second.staffno but SAS doesnt have second. Only first and last.


@imdickson

The SQL code I've posted will do what you're asking for AND it can be implemented in DIS without the need of user written code.

 

If using DIS User Written Code should always be your last resort (loss of metadata driven code and column level data lineage) and though any data step solution provided here as smart as it might be is not how you should implement with DIS.

novinosrin
Tourmaline | Level 20

Hi @imdickson  Can you help us with a sample of your expected output for the sample input you have in place. This would help in testing the code and the results before somebody posts. Thank you

imdickson
Quartz | Level 8

Hi @novinosrin, Kindly refer to my reply to @Patrick for my expected result set.

Patrick
Opal | Level 21

@imdickson

With DI Studio I'd go for the SQL transformation as this allows to implement without any user written code.

 

It shouldn't be too hard to implement something which generates a SQL as below (using a subquery in the expression column for the Select clause).

Make sure that you also create the index and that you set-up your where clause as in the code below (so selection order the same than the order in the index). 

 

data have;
  infile datalines dlm='|';
  input StaffNo Date:date9. (Trip Desc InvoiceNumber) ($) ExpenseType Amount;
  format date date9.;
  datalines;
1|01JAN2015|A|XX|ZXC|100|50
1|05JAN2015|A|XX|ZXC|100|50
1|02FEB2015|A|XX|ZXC|100|50
1|03FEB2015|A|XX|ZXC|100|50
1|03APR2015|B|YY|PPP|90|1000
1|02MAY2015|B|YY|PPP|90|1000
2|01OCT2017|C|LO|123|55|777
2|02JAN2018|D|HI|456|66|888
2|02MAY2018|E|LL|789|44|11
3|11MAR2016|F|PO|99AA|122|88.5
3|13MAR2016|G|PO|99AA|122|88.5
3|31DEC2016|H|PO|99AA|122|88.5
;
run;

proc sql;
  create index StaffExp_idx
    on have(StaffNo, ExpenseType);
quit;

proc sql;
  create table want as
  select 
    o.*,
    ( select distinct 'Y' 
      from have as i 
      where 
        i.StaffNo=o.Staffno
        and i.ExpenseType=o.ExpenseType
        and i.Date between o.Date-1 and o.Date-30
    ) 
    as flag length=1
  from have as o
  ;
quit;

 

 

imdickson
Quartz | Level 8

Hi @Patrick

 

I just tried to put in your code to User Written Code transformation for testing purpose before implementing it usual 'Join' or 'Extract' node to do subquery without writing all the code.

 

This is the thing, the first round of execution, the index works but stopped at Proc SQL because of my mistake of entering wrong syntax when replacing the source & target table name. When i re-run, the code shows error at index because the index existed.

 

So what i did was comment/hide the index code and only run the Proc SQL.

 

I noticed that the result is not what i was expecting. Hence, i swap the 3rd condition in WHEREwith the 1st condition from your code.

However, it is still not giving me the result that i want.

 

Let me show you the data that i am getting. Please take note on the highlighted records.

Wrong Result repeated purchase.PNG

 

As you can see, ExpTy and Date(within 30days) and StaffNo are the drive. Whenever Same ExpTy, Date within 30 days and belongs to the same StaffNo, it should be flagged as 'Y'. However, i do not see all highlighted records that i am expecting to see 'Y' under flag. Moreover, for row no.5 with StaffNo 3753110, i tried to further filter the record for this Staff. Below is the result.

Wrong Result repeated purchase 2.PNG

 

I am expecting row no.1 and row no.2 to be flagged as Y for result above.

 

Is there a way to further enhance or slight tweak to your code for the result that i want?

Patrick
Opal | Level 21

@imdickson

Then for every record with the same StaffNo you need to look 30 days back and 30 days forward to search for another record in date range. You need to exclude the same date for this search to not match a record with itself.

proc sql;
  create table want as
  select 
    o.*,
    ( select distinct 'Y' 
      from have as i 
      where 
        i.StaffNo=o.Staffno
        and i.ExpenseType=o.ExpenseType
        and i.date ne o.date
        and i.Date between o.Date+30 and o.Date-30
    ) 
    as flag length=1
  from have as o
  ;
quit;

As for the challenge you had with the index: That won't be an issue anymore once you implement index creation via DIS and not user written.

imdickson
Quartz | Level 8

Hey Patrick!

 

Thanks for your answer. The result set is much much better now. The first record is marked as Y if the second record is a repeated purchase based on the first row of StaffNo and ExpenseType within 30days. Awesome.

 

Everything is good except for 1 thing. If StaffNo and Expense are the same but both records have the same Date.

ExpTy question.PNG

Notice row number 16 and 17 as per highlighted. They have the same StaffNo and ExpTY and SAME DATE.

 

For records no.20 to 26 are working perfectly fine! I really appreciate that Patrick.

 

In your subquery, you filter out date ne date. I tried to remove this out but the result set marked every records as Y.

 

Seeking for your further kind advice.

 

 

 

 

 

 

 

 

 

EDIT NO.2:

 

Hey Patrick, i further filter the list and found below result:

ExpTy question 2.PNG

 

 

Row 3&4 and Row 10&11 wouldnt marked as Y whereas row 12&13 marked as Y eventhough they have the same DATE.

 

I will further update again if i found another set of results for your reference.

 

However, I am still seeking for your kind advice for situation above.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 22 replies
  • 1768 views
  • 0 likes
  • 7 in conversation