Solved: find observations exist based on conditions

sas112 · Posted 02-09-2021 12:39 PM

My data set is

data have;
input EFAMID $ SvcDate :mmddyy10. OpCode :$2. Pay;
format svcdate yymmdd10.;
datalines;
101 01/13/2013 19 30.7
101 01/14/2013 13 28.0
101 01/16/2013 12 60.2
101 01/16/2014 13 10.1
102 02/10/2013 12 9.2
102 02/11/2013 13 32.1
102 02/12/2013 14 17.5
102 02/13/2013 15 10.2
102 02/18/2013 14 12.3
;

At data step, I get its EFAMID and SvcDate values when process OpCode=12. Then I like to check whether the same EFAMID and SvcDate have OpCode = 13 observations. If yes, a new column AssociatedOp = 1. Otherwise, AssociatedOp = 0. For example, for EFAMID=101, its OpCode=12 has SvcDate='01/16/2013'. Using EFAMID=101 and SvcDate='01/16/2013', we find a observation (101 '01/16/2014' 13 10.1) OpCode=13 at the same day.

Thus, AssociatedOp=1. The final data set would be

EFAMID SvcDate OpCode Pay AssociatedOp

101 01/13/2013 19 30.7 0
101 01/14/2013 13 28.0 0

101 01/16/2013 12 60.2 1
101 01/16/2014 13 10.1 0
102 02/10/2013 12 9.2 0
102 02/11/2013 13 32.1 0
102 02/12/2013 14 17.5 0
102 02/13/2013 15 10.2 0
102 02/18/2013 14 12.3 0

How do I do that at data step? Thanks in advance.

Tom · Posted 02-09-2021 05:24 PM

It should not be that bad. Since you are reading the same disk file in the same order the blocks of data should already be cached. 10 Million is not a large number and a data step does not require that the whole dataset be loaded into memory.

View solution in original post

Tom · Posted 02-09-2021 12:57 PM

I don't get the algorithm.

It looks like your result is true only when OPCODE='12' and there was a previous observation for this EFAMID that had OPCODE='13'.

data want ;
  any13=0;
  do until(last.efamid);
    set have;
    by efamid svcdate ;
    if opcode='13' then any13=1;
    AssociatedOp = opcode='12' and any13 ;
    output;
  end;
run;

                                         Op             Associated
Obs    any13    EFAMID       SvcDate    Code     Pay        Op

 1       0       101      2013-01-13     19     30.7         0
 2       1       101      2013-01-14     13     28.0         0
 3       1       101      2013-01-16     12     60.2         1
 4       1       101      2014-01-16     13     10.1         0
 5       0       102      2013-02-10     12      9.2         0
 6       1       102      2013-02-11     13     32.1         0
 7       1       102      2013-02-12     14     17.5         0
 8       1       102      2013-02-13     15     10.2         0
 9       1       102      2013-02-18     14     12.3         0

What about OPCODE='12' and OPCODE='13' on the same date? Should the sort order be:

    by efamid svcdate descending opcode;

What if there were multiple OPCODE='12' observations and only one earlier OPCODE='13' observation? Are they all true?

What about if instead there was an OPCODE='13' observation on a future date? Would that mean the earlier OPCODE='12' record is also flagged with true?

sas112 · Posted 02-09-2021 01:10 PM

I only care about the AssociatedOp column when OpCode=12. For each EFAMID, there is only one observation with OpCode=12. When OpCode=12, I need to use its EFAMID and SvcDate at that observation and then find whether there are OpCode=13. If observations with OpCode=13 has different EFAMID or SvcDate, the AssociatedOp should be zero.

Tom · Posted 02-09-2021 01:42 PM

Then your current example answer is wrong. The one OPCODE='12' you have flagged as true has an OPCODE='13' on a different date.

sas112 · Posted 02-09-2021 02:08 PM

When process EFAMID=101, SvcDate='01/16/2013' and OpCode=12, we see the dataset has
101 01/14/2013 13 28.0
101 01/16/2013 12 60.2
101 01/16/2014 13 10.1
Since we found one OpCode=13 has same EFAMID=101 and SvcDate='01/16/2013'. Thus AssociatedOp=1.
For EFAMID=102, SvcDate='02/10/2013', OpCode=12, we see
102 02/10/2013 12 9.2
102 02/11/2013 13 32.1
Since SvcDate='02/11/2013' and OpCode=13. Thus AssociatedOp=0.
The input dataset (have at the example) is quite large. The solution needs to be efficient.

Tom · Posted 02-09-2021 02:51 PM

So which of the two SVDATES has the typo in yoru example? Is the day of month wrong in the first one or is the year wrong in the last one?

101 01/14/2013 13 28.0
101 01/16/2013 12 60.2
101 01/16/2014 13 10.1

sas112 · Posted 02-09-2021 03:18 PM

Thanks point out. You can change either one to match 01/16/2013. I am sorry I didn't notice that.

sas112 · Posted 02-09-2021 03:01 PM

Since I am quite new to SAS programming, my question could be summarized to

Data Want;
set have;
AssociatedOp=0;
if OpCode=12 then AssociatedOp = hasAssociatedOp(EFAMID, SvcDate);
run;

How do I create a function 'hasAssociatedOp' which accepts two arguments, EFAMID and SvcDate? If there is one observation at dataset 'have' which matches the EFAMID, SvcDate and OpCode=13, it will return 1. Otherwise return 0.

Tom · Posted 02-09-2021 03:53 PM

@sas112 wrote:

Since I am quite new to SAS programming, my question could be summarized to
Data Want;
set have;
AssociatedOp=0;
if OpCode=12 then AssociatedOp = hasAssociatedOp(EFAMID, SvcDate);
run;
How do I create a function 'hasAssociatedOp' which accepts two arguments, EFAMID and SvcDate? If there is one observation at dataset 'have' which matches the EFAMID, SvcDate and OpCode=13, it will return 1. Otherwise return 0.

If you literally wanted to make something like that then look at using HASH object. But that is overly complicated and requires that you load data into the hash object.

You could adapt my original suggested data step to your clarified requirements.

data have;
  input EFAMID $ SvcDate :mmddyy10. OpCode :$2. Pay;
  format svcdate yymmdd10.;
datalines;
101 01/13/2013 19 30.7
101 01/16/2013 13 28.0
101 01/16/2013 12 60.2
101 01/16/2014 13 10.1
102 02/10/2013 12 9.2
102 02/11/2013 13 32.1
102 02/12/2013 14 17.5
102 02/13/2013 15 10.2
102 02/18/2013 14 12.3
;

proc sort data=have;
  by efamid svcdate descending opcode;
run;
data want ;
  any13=0;
  do until(last.svcdate);
    set have;
    by efamid svcdate descending opcode;
    if opcode='13' then any13=1;
    AssociatedOp = opcode='12' and any13 ;
    output;
  end;
  drop any13;
run;

Results:

                                Op             Associated
Obs    EFAMID       SvcDate    Code     Pay        Op

 1      101      2013-01-13     19     30.7         0
 2      101      2013-01-16     13     28.0         0
 3      101      2013-01-16     12     60.2         1
 4      101      2014-01-16     13     10.1         0
 5      102      2013-02-10     12      9.2         0
 6      102      2013-02-11     13     32.1         0
 7      102      2013-02-12     14     17.5         0
 8      102      2013-02-13     15     10.2         0
 9      102      2013-02-18     14     12.3         0

Or you could pull out the OPCODE='13' records and remerge those records with the original data by EFAMID and SVCDATE. Then you could test whether you have both 12 and 13 on the same date.

data want ;
  merge have
     have(keep=efamid svcdate opcode
          rename=(opcode=opcode13) 
          where=(opcode13='13')
         ) 
  ;
  by efamid svcdate;
  AssociatedOp = (opcode='12' and opcode13='13');
  drop opcode13;
run;

sas112 · Posted 02-09-2021 04:58 PM

The logics should work. But dataset 'have' may have around 10 million observations. If I merge, does it perform well?

Tom · Posted 02-09-2021 05:24 PM

It should not be that bad. Since you are reading the same disk file in the same order the blocks of data should already be cached. 10 Million is not a large number and a data step does not require that the whole dataset be loaded into memory.

qoit · Posted 02-09-2021 03:54 PM

Hi, does the below fulfil your purpose, for every obs where opcode =12, if the Svcdate and efamid are the same for opcode=13, the flag should turn 1

data have;
infile datalines;
input EFAMID $ SvcDate :mmddyy10. OpCode :$2. Pay;
format svcdate yymmdd10.;
datalines;
101 01/13/2013 19 30.7
101 01/16/2013 13 28.0
101 01/16/2013 12 60.2
101 01/16/2014 13 10.1
102 02/10/2013 12 9.2
102 02/11/2013 13 32.1
102 02/12/2013 14 17.5
102 02/13/2013 15 10.2
102 02/18/2013 14 12.3
;
run;

proc sort data=have out=have1;
by EFAMID opcode;
run;

data have2 (drop=efamid_12 SvcDate_12);
set have1;
by efamid opcode;
retain efamid_12 SvcDate_12;
format SvcDate_12 yymmdd10.;

if first.opcode then
do;
if opcode = '12' then
do;
efamid_12 = efamid;
SvcDate_12 = SvcDate;
end;
end;

if opcode = 13 then
do;
if efamid_12 = efamid and SvcDate_12 = SvcDate then
AssociatedOp = 1;
else AssociatedOp = 0;
end;
else
do;
AssociatedOp = 0;
end;
run;

find observations exist based on conditions

Re: find observations exist based on conditions

Re: find observations exist based on conditions

Re: find observations exist based on conditions

Re: find observations exist based on conditions

Re: find observations exist based on conditions

Re: find observations exist based on conditions

Re: find observations exist based on conditions

Re: find observations exist based on conditions

Re: find observations exist based on conditions

Re: find observations exist based on conditions

Re: find observations exist based on conditions

Re: find observations exist based on conditions

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away