BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
sas112
Calcite | Level 5

My data set is 

 

data have;
input EFAMID $ SvcDate :mmddyy10. OpCode :$2. Pay;
format svcdate yymmdd10.;
datalines;
101 01/13/2013 19 30.7
101 01/14/2013 13 28.0
101 01/16/2013 12 60.2
101 01/16/2014 13 10.1
102 02/10/2013 12 9.2
102 02/11/2013 13 32.1
102 02/12/2013 14 17.5
102 02/13/2013 15 10.2
102 02/18/2013 14 12.3
;

 

At data step, I get its EFAMID and SvcDate values when process OpCode=12. Then I like to check whether the same EFAMID and SvcDate have OpCode = 13 observations. If yes, a new column AssociatedOp = 1. Otherwise, AssociatedOp = 0. For example, for EFAMID=101, its OpCode=12 has SvcDate='01/16/2013'. Using EFAMID=101 and SvcDate='01/16/2013', we find a observation (101 '01/16/2014' 13 10.1) OpCode=13 at the same day. 

Thus, AssociatedOp=1. The final data set would be 

EFAMID   SvcDate   OpCode   Pay   AssociatedOp

101        01/13/2013    19         30.7     0
101        01/14/2013    13         28.0     0

101        01/16/2013    12         60.2     1
101        01/16/2014    13         10.1     0
102        02/10/2013    12           9.2     0
102        02/11/2013     13        32.1     0
102        02/12/2013     14        17.5     0
102        02/13/2013     15        10.2     0
102        02/18/2013     14        12.3     0

 

How do I do that at data step? Thanks in advance.

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

It should not be that bad.  Since you are reading the same disk file in the same order the blocks of data should already be cached. 10 Million is not a large number and a data step does not require that the whole dataset be loaded into memory.

View solution in original post

11 REPLIES 11
Tom
Super User Tom
Super User

I don't get the algorithm.  

It looks like your result is true only when OPCODE='12' and there was a previous observation for this EFAMID that had OPCODE='13'.  

data want ;
  any13=0;
  do until(last.efamid);
    set have;
    by efamid svcdate ;
    if opcode='13' then any13=1;
    AssociatedOp = opcode='12' and any13 ;
    output;
  end;
run;
                                         Op             Associated
Obs    any13    EFAMID       SvcDate    Code     Pay        Op

 1       0       101      2013-01-13     19     30.7         0
 2       1       101      2013-01-14     13     28.0         0
 3       1       101      2013-01-16     12     60.2         1
 4       1       101      2014-01-16     13     10.1         0
 5       0       102      2013-02-10     12      9.2         0
 6       1       102      2013-02-11     13     32.1         0
 7       1       102      2013-02-12     14     17.5         0
 8       1       102      2013-02-13     15     10.2         0
 9       1       102      2013-02-18     14     12.3         0

What about OPCODE='12' and OPCODE='13' on the same date?  Should the sort order be:

    by efamid svcdate descending opcode;

What if there were multiple OPCODE='12' observations and only one earlier OPCODE='13' observation? Are they all true?

What about if instead there was an OPCODE='13' observation on a future date?  Would that mean the earlier OPCODE='12' record is also flagged with true?

sas112
Calcite | Level 5

I only care about the AssociatedOp column when OpCode=12. For each EFAMID, there is only one observation with OpCode=12. When OpCode=12, I need to use its EFAMID and SvcDate at that observation and then find whether there are OpCode=13. If observations with OpCode=13 has different EFAMID or SvcDate, the AssociatedOp should be zero.

Tom
Super User Tom
Super User

Then your current example answer is wrong.  The one OPCODE='12' you have flagged as true has an OPCODE='13' on a different date.

sas112
Calcite | Level 5
When process EFAMID=101, SvcDate='01/16/2013' and OpCode=12, we see the dataset has
101 01/14/2013 13 28.0
101 01/16/2013 12 60.2
101 01/16/2014 13 10.1
Since we found one OpCode=13 has same EFAMID=101 and SvcDate='01/16/2013'. Thus AssociatedOp=1.
For EFAMID=102, SvcDate='02/10/2013', OpCode=12, we see
102 02/10/2013 12 9.2
102 02/11/2013 13 32.1
Since SvcDate='02/11/2013' and OpCode=13. Thus AssociatedOp=0.
The input dataset (have at the example) is quite large. The solution needs to be efficient.
Tom
Super User Tom
Super User

So which of the two SVDATES has the typo in yoru example?  Is the day of month wrong in the first one or is the year wrong in the last one?

101 01/14/2013 13 28.0
101 01/16/2013 12 60.2
101 01/16/2014 13 10.1

sas112
Calcite | Level 5
Thanks point out. You can change either one to match 01/16/2013. I am sorry I didn't notice that.
sas112
Calcite | Level 5

Since I am quite new to SAS programming, my question could be summarized to 

 

Data Want;
set have;
AssociatedOp=0;
if OpCode=12 then AssociatedOp = hasAssociatedOp(EFAMID, SvcDate);
run;

How do I create a function 'hasAssociatedOp' which accepts two arguments, EFAMID and SvcDate? If there is one observation at dataset 'have' which matches the EFAMID, SvcDate and OpCode=13, it will return 1. Otherwise return 0. 

Tom
Super User Tom
Super User

@sas112 wrote:

Since I am quite new to SAS programming, my question could be summarized to 

 

Data Want;
set have;
AssociatedOp=0;
if OpCode=12 then AssociatedOp = hasAssociatedOp(EFAMID, SvcDate);
run;

How do I create a function 'hasAssociatedOp' which accepts two arguments, EFAMID and SvcDate? If there is one observation at dataset 'have' which matches the EFAMID, SvcDate and OpCode=13, it will return 1. Otherwise return 0. 


If you literally wanted to make something like that then look at using HASH object.  But that is overly complicated and requires that you load data into the hash object. 

You could adapt my original suggested data step to your clarified requirements.

data have;
  input EFAMID $ SvcDate :mmddyy10. OpCode :$2. Pay;
  format svcdate yymmdd10.;
datalines;
101 01/13/2013 19 30.7
101 01/16/2013 13 28.0
101 01/16/2013 12 60.2
101 01/16/2014 13 10.1
102 02/10/2013 12 9.2
102 02/11/2013 13 32.1
102 02/12/2013 14 17.5
102 02/13/2013 15 10.2
102 02/18/2013 14 12.3
;

proc sort data=have;
  by efamid svcdate descending opcode;
run;
data want ;
  any13=0;
  do until(last.svcdate);
    set have;
    by efamid svcdate descending opcode;
    if opcode='13' then any13=1;
    AssociatedOp = opcode='12' and any13 ;
    output;
  end;
  drop any13;
run;

Results:

                                Op             Associated
Obs    EFAMID       SvcDate    Code     Pay        Op

 1      101      2013-01-13     19     30.7         0
 2      101      2013-01-16     13     28.0         0
 3      101      2013-01-16     12     60.2         1
 4      101      2014-01-16     13     10.1         0
 5      102      2013-02-10     12      9.2         0
 6      102      2013-02-11     13     32.1         0
 7      102      2013-02-12     14     17.5         0
 8      102      2013-02-13     15     10.2         0
 9      102      2013-02-18     14     12.3         0

Or you could pull out the OPCODE='13' records and remerge those records with the original data by EFAMID and SVCDATE.  Then you could test whether you have both 12 and 13 on the same date.

data want ;
  merge have
     have(keep=efamid svcdate opcode
          rename=(opcode=opcode13) 
          where=(opcode13='13')
         ) 
  ;
  by efamid svcdate;
  AssociatedOp = (opcode='12' and opcode13='13');
  drop opcode13;
run;

 

sas112
Calcite | Level 5
The logics should work. But dataset 'have' may have around 10 million observations. If I merge, does it perform well?
Tom
Super User Tom
Super User

It should not be that bad.  Since you are reading the same disk file in the same order the blocks of data should already be cached. 10 Million is not a large number and a data step does not require that the whole dataset be loaded into memory.

qoit
Pyrite | Level 9

Hi, does the below fulfil your purpose, for every obs where opcode =12, if the Svcdate and efamid are the same for opcode=13, the flag should turn 1

 

data have;
infile datalines;
input EFAMID $ SvcDate :mmddyy10. OpCode :$2. Pay;
format svcdate yymmdd10.;
datalines;
101 01/13/2013 19 30.7
101 01/16/2013 13 28.0
101 01/16/2013 12 60.2
101 01/16/2014 13 10.1
102 02/10/2013 12 9.2
102 02/11/2013 13 32.1
102 02/12/2013 14 17.5
102 02/13/2013 15 10.2
102 02/18/2013 14 12.3
;
run;

proc sort data=have out=have1;
by EFAMID opcode;
run;

data have2 (drop=efamid_12 SvcDate_12);
set have1;
by efamid opcode;
retain efamid_12 SvcDate_12;
format SvcDate_12 yymmdd10.;

if first.opcode then
do;
if opcode = '12' then
do;
efamid_12 = efamid;
SvcDate_12 = SvcDate;
end;
end;

if opcode = 13 then
do;
if efamid_12 = efamid and SvcDate_12 = SvcDate then
AssociatedOp = 1;
else AssociatedOp = 0;
end;
else
do;
AssociatedOp = 0;
end;
run;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 874 views
  • 0 likes
  • 3 in conversation