Solved: Re: Comparing rows within a table based on ID and create a flag

Sangita1 · Posted 12-02-2020 02:50 AM

Hi,

I have following dataset (contains three different Ids) for students. I want to find out if they are retained or not based on if their record exists in the previous term.

DATA work.sample;
format id termid prevtermid;

INPUT termid prevtermid id $;
DATALINES;
201508 201501 U100
201601 201508 U100
201608 201601 U100
201701 201608 U100
201705 201701 U100
201708 201701 U100

201508 201501 U200
201608 201601 U200
201701 201608 U200
201705 201701 U200
201708 201701 U200

201508 201501 U300
201601 201508 U300
201701 201608 U300
201705 201701 U300
201708 201701 U300

;
RUN;

proc sort data=work.sample;
by id termid ;
quit;

So the sorted dataset will look like below:

ID Termid prevtermid

U100 201508 201501
U100 201601 201508
U100 201608 201601
U100 201701 201608
U100 201705 201701
U100 201708 201701
U200 201508 201501
U200 201608 201601
U200 201701 201608
U200 201705 201701
U200 201708 201701
U300 201508 201501
U300 201601 201508
U300 201701 201608
U300 201705 201701
U300 201708 201701

If for the same ID, prevTermID exists as a Termid in the previous rows, then I want to create a RetainFlag as Yes otherwise No. In the second iteration, if it is beginning of new Id (which will always have the "retainFlag = "no"), I want to mark this row as "New" to identify it as new student record.

So the output should look like below:

UID Termid prevtermid RetainFlag

U100 201508 201501 New
U100 201601 201508 Yes
U100 201608 201601 Yes
U100 201701 201608   Yes
U100 201705 201701 Yes
U100 201708 201701   Yes
U200 201508 201501 New
U200 201608 201601 No
U200 201701 201608   Yes
U200 201705 201701 Yes
U200 201708 201701 Yes
U300 201508 201501   New
U300 201601 201508 Yes
U300 201701 201608 No
U300 201705 201701 Yes
U300 201708 201701 Yes

I am trying to use Lag and retain but am not being successful how to accomplish it.

I will really appreciate your help. Thank you so much for your time and willingness to share your expertise.

PeterClemmensen · Posted 12-02-2020 03:07 AM

Anyways, this should do it

DATA work.sample;
INPUT termid prevtermid id $;
DATALINES;
201508 201501 U100
201601 201508 U100
201608 201601 U100
201701 201608 U100
201705 201701 U100
201708 201701 U100
201508 201501 U200
201608 201601 U200
201701 201608 U200
201705 201701 U200
201708 201701 U200
201508 201501 U300
201601 201508 U300
201701 201608 U300
201705 201701 U300
201708 201701 U300
;

proc sort data = sample;
   by id termid;
run;

data want;

   dcl hash h ();
   h.definekey('termid');
   h.definedone();

   do until (last.id);
      set sample;
      by id;
      if first.id then RetainFlag = 'New';
      else if h.check(key : prevtermid) = 0 then RetainFlag = 'Yes';
      else RetainFlag = 'No';
      h.ref();
      output;
   end;

   h.clear();
run;

Result:

termid  prevtermid  id    RetainFlag 
201508  201501      U100  New 
201601  201508      U100  Yes 
201608  201601      U100  Yes 
201701  201608      U100  Yes 
201705  201701      U100  Yes 
201708  201701      U100  Yes 
201508  201501      U200  New 
201608  201601      U200  No 
201701  201608      U200  Yes 
201705  201701      U200  Yes 
201708  201701      U200  Yes 
201508  201501      U300  New 
201601  201508      U300  Yes 
201701  201608      U300  No 
201705  201701      U300  Yes 
201708  201701      U300  Yes

The DATA to DATA Step Macro
Blog: SASnrd

View solution in original post

PeterClemmensen · Posted 12-02-2020 02:57 AM

So in reality, you only have ID and termID, right? The variable prevtermid is a variable you created yourself?

The DATA to DATA Step Macro
Blog: SASnrd

Sangita1 · Posted 12-02-2020 03:55 AM

yes, I am new to programming, so I created the prevtermid column. Then I somehow was able to work it out by

DATA work.identified;
set work.sample;
by ID;
retain lagtermid;
if first.id then new_student= 'New';
y = lagtermid;
if prevtermid = lagtermid then retain_Student = 'Yes'; else retain_Student = 'No';
lagtermid = termid;
If Substr (termid, 5, 2) = '05' then lagtermid = prevtermid;
RUN;

I was going to drop all the extra columns - this is call "running around the block to walk one step".

Thank you so much for your elegant code. I am also going to read up the functions that you have used. Appreciate the super fast response.

PeterClemmensen · Posted 12-02-2020 03:07 AM

Anyways, this should do it

DATA work.sample;
INPUT termid prevtermid id $;
DATALINES;
201508 201501 U100
201601 201508 U100
201608 201601 U100
201701 201608 U100
201705 201701 U100
201708 201701 U100
201508 201501 U200
201608 201601 U200
201701 201608 U200
201705 201701 U200
201708 201701 U200
201508 201501 U300
201601 201508 U300
201701 201608 U300
201705 201701 U300
201708 201701 U300
;

proc sort data = sample;
   by id termid;
run;

data want;

   dcl hash h ();
   h.definekey('termid');
   h.definedone();

   do until (last.id);
      set sample;
      by id;
      if first.id then RetainFlag = 'New';
      else if h.check(key : prevtermid) = 0 then RetainFlag = 'Yes';
      else RetainFlag = 'No';
      h.ref();
      output;
   end;

   h.clear();
run;

Result:

termid  prevtermid  id    RetainFlag 
201508  201501      U100  New 
201601  201508      U100  Yes 
201608  201601      U100  Yes 
201701  201608      U100  Yes 
201705  201701      U100  Yes 
201708  201701      U100  Yes 
201508  201501      U200  New 
201608  201601      U200  No 
201701  201608      U200  Yes 
201705  201701      U200  Yes 
201708  201701      U200  Yes 
201508  201501      U300  New 
201601  201508      U300  Yes 
201701  201608      U300  No 
201705  201701      U300  Yes 
201708  201701      U300  Yes

The DATA to DATA Step Macro
Blog: SASnrd

Kurt_Bremser · Posted 12-02-2020 03:16 AM

Same technique (hash), slightly different code structure:

data want;
set sample;
by id;
if _N_ = 1
then do;
  declare hash look ();
  look.definekey("termid");
  look.definedone();
end;
if first.id
then do;
  retainflag = "new";
  rc = look.clear();
end;
else do;
  if look.check(key:prevtermid) = 0
  then retainflag = "yes";
  else retainflag = "no";
end;
rc = look.add();
drop rc;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Sangita1 · Posted 12-02-2020 04:05 AM

Thank you.

Your code is working great also, but I think I can only accept one example as solution. I will spend time learning definekey, definedone and how elegantly you have written the solution. I really appreciate your help. Thank you very much for sharing your knowledge with the newbees like me.

Regards.

Kurt_Bremser · Posted 12-02-2020 04:38 AM

The HASH object is a relatively new addition to the DATA step language, but it has proven itself to be a very powerful tool that enables rather simple solutions to seemingly complex issues, and can (and WILL) speed up processes by moving the sorting to a pure in-memory process when doing lookups.

In the meantime, I have shaved hours of runtime from our existing programs by replacing sort/merge lookups with hash lookups wherever feasible (there are datasets that simply won't fit into available memory).

Regarding coding styles:

Everyone develops their own method of writing code, you can see two different styles between @PeterClemmensen and me. Both have their merits, the common ground is that we use consistent indentation for logical blocks, make use of whitespace (blanks) to easier identify words (keywords, variable names etc), and try to not pack too much into a single line, as that makes the code easier to read and easier to edit. Just an example:

set incoming (in=inc where=(xx1="Y") rename=(x1=xx1 x2=xx2) keep=x1 x2 x3);

compare to

set incoming (
  in=inc
  keep=x1 x2 x3
  rename=(
    x1=xx1
    x2=xx2
  )
  where=(xx1="Y")
);

In the second example, it is much easier to move a code element somewhere else by simply grabbing a complete line.

And the order of the options follows the logical sequence in which the data step compiler will process them.

When working in teams, the team members should develop and follow a single coding style.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Sangita1 · Posted 12-02-2020 03:25 AM

When I write the following code:

It does not work properly because for the Fall term, I have the previous termid as the Spring term (which can be two rows above as some students take summer term and some do not take summer term). So I have to do something when it is summer term, as that can set the lag variable incorrectly.

DATA work.identified;
set work.sample;
by ID;
retain lagtermid;
if first.id then X= 'New';
if prevtermid = lagtermid then retain_Student = 'Yes'; else retain_Student = 'No';
lagtermid = termid;
RUN;

PeterClemmensen · Posted 12-02-2020 03:29 AM

Did you take a look at the replies above?

The DATA to DATA Step Macro
Blog: SASnrd

Sangita1 · Posted 12-02-2020 04:09 AM

Yes, I did. Both solutions are great. Now I have the code, and I am going to read the function calls so I know how/where to use them in future. Thankful for the help, as I was struggling for couple hours and reading how to use retain/lag. You made my day (or rather morning as it is very early morning for me).

Sangita1 · Posted 12-02-2020 05:44 AM

Hi, I have one more question please. When I run the code on my big dataset, I am getting the type mismatch error
ERROR: Type mismatch for key variable TERMID at line 68 column 6.
I think it is because TERMID is a char. I am not sure if I should do a length statement or convert it (as some posts suggested). What might be the best way to resolve this? Thank you.

PeterClemmensen · Posted 12-02-2020 05:49 AM

Are both TermID and PrevTermID character?

Also, post your log please.

The DATA to DATA Step Macro
Blog: SASnrd

Sangita1 · Posted 12-02-2020 05:57 AM

Actually, I was able to fix that error. Thanks

Kurt_Bremser · Posted 12-02-2020 06:06 AM

Both variables must be of the same type, that's a given. Also, since these are date-related values, they should be stored as such:

data sample;
format id $8 termid prevtermid yymmn6.;
input (termid prevtermid) (:yymmn6.) id $;
datalines;
201508 201501 U100
201601 201508 U100
201608 201601 U100
201701 201608 U100
201705 201701 U100
201708 201701 U100
201508 201501 U200
201608 201601 U200
201701 201608 U200
201705 201701 U200
201708 201701 U200
201508 201501 U300
201601 201508 U300
201701 201608 U300
201705 201701 U300
201708 201701 U300
;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away