BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Sangita1
Obsidian | Level 7

Hi,

I have following dataset (contains three different Ids) for students.   I want to find out if they are retained or not based on if their record exists in the previous term.

 

DATA work.sample;
format id termid prevtermid;

INPUT termid prevtermid id $;
DATALINES;
201508 201501 U100
201601 201508 U100
201608 201601 U100
201701 201608 U100
201705 201701 U100
201708 201701 U100

201508 201501 U200
201608 201601 U200
201701 201608 U200
201705 201701 U200
201708 201701 U200


201508 201501 U300
201601 201508 U300
201701 201608 U300
201705 201701 U300
201708 201701 U300

;
RUN;

proc sort data=work.sample;
by id termid ;
quit;

 

So the sorted dataset will look like below:

ID      Termid     prevtermid

U100   201508    201501
U100   201601    201508
U100   201608    201601
U100   201701    201608
U100   201705    201701
U100   201708    201701
U200   201508    201501
U200   201608    201601
U200   201701   201608
U200   201705   201701
U200   201708   201701
U300   201508   201501
U300   201601   201508
U300   201701   201608
U300   201705    201701
U300   201708   201701

 

If for the same ID, prevTermID exists as a Termid in the previous rows, then I want to create a RetainFlag as Yes otherwise No. In the second iteration, if it is beginning of new Id (which will always have the "retainFlag = "no"), I want to mark this row as "New" to identify it as new student record.  

So the output should look like below:

 

UID    Termid  prevtermid   RetainFlag

U100 201508 201501        New
U100 201601 201508        Yes
U100 201608 201601        Yes
U100 201701 201608        Yes
U100 201705 201701       Yes
U100 201708 201701        Yes
U200 201508 201501        New
U200 201608 201601       No
U200 201701 201608      Yes
U200 201705 201701     Yes
U200 201708 201701     Yes
U300 201508 201501      New
U300 201601 201508     Yes
U300 201701 201608     No
U300 201705 201701     Yes
U300 201708 201701     Yes

 

I am trying to use Lag and retain but am not being successful how to accomplish it. 

I will really appreciate your help.  Thank you so much for your time and willingness to share your expertise. 

1 ACCEPTED SOLUTION

Accepted Solutions
PeterClemmensen
Tourmaline | Level 20

Anyways, this should do it

 

DATA work.sample;
INPUT termid prevtermid id $;
DATALINES;
201508 201501 U100
201601 201508 U100
201608 201601 U100
201701 201608 U100
201705 201701 U100
201708 201701 U100
201508 201501 U200
201608 201601 U200
201701 201608 U200
201705 201701 U200
201708 201701 U200
201508 201501 U300
201601 201508 U300
201701 201608 U300
201705 201701 U300
201708 201701 U300
;

proc sort data = sample;
   by id termid;
run;

data want;

   dcl hash h ();
   h.definekey('termid');
   h.definedone();

   do until (last.id);
      set sample;
      by id;
      if first.id then RetainFlag = 'New';
      else if h.check(key : prevtermid) = 0 then RetainFlag = 'Yes';
      else RetainFlag = 'No';
      h.ref();
      output;
   end;

   h.clear();
run;

Result:

 

termid  prevtermid  id    RetainFlag 
201508  201501      U100  New 
201601  201508      U100  Yes 
201608  201601      U100  Yes 
201701  201608      U100  Yes 
201705  201701      U100  Yes 
201708  201701      U100  Yes 
201508  201501      U200  New 
201608  201601      U200  No 
201701  201608      U200  Yes 
201705  201701      U200  Yes 
201708  201701      U200  Yes 
201508  201501      U300  New 
201601  201508      U300  Yes 
201701  201608      U300  No 
201705  201701      U300  Yes 
201708  201701      U300  Yes 

View solution in original post

13 REPLIES 13
PeterClemmensen
Tourmaline | Level 20

So in reality, you only have ID and termID, right? The variable prevtermid is a variable you created yourself?

Sangita1
Obsidian | Level 7

yes, I am new to programming, so I created the prevtermid column. Then I somehow was able to work it out by

DATA work.identified;
set work.sample;
by ID;
retain lagtermid;
if first.id then new_student= 'New';
y = lagtermid;
if prevtermid = lagtermid then retain_Student = 'Yes'; else retain_Student = 'No';
lagtermid = termid;
If Substr (termid, 5, 2) = '05' then lagtermid = prevtermid;
RUN;

I was going to drop all the extra columns - this is call "running around the block to walk one step".

 

Thank you so much for your elegant code.  I am also going to read up the functions that you have used. Appreciate the super fast response.

PeterClemmensen
Tourmaline | Level 20

Anyways, this should do it

 

DATA work.sample;
INPUT termid prevtermid id $;
DATALINES;
201508 201501 U100
201601 201508 U100
201608 201601 U100
201701 201608 U100
201705 201701 U100
201708 201701 U100
201508 201501 U200
201608 201601 U200
201701 201608 U200
201705 201701 U200
201708 201701 U200
201508 201501 U300
201601 201508 U300
201701 201608 U300
201705 201701 U300
201708 201701 U300
;

proc sort data = sample;
   by id termid;
run;

data want;

   dcl hash h ();
   h.definekey('termid');
   h.definedone();

   do until (last.id);
      set sample;
      by id;
      if first.id then RetainFlag = 'New';
      else if h.check(key : prevtermid) = 0 then RetainFlag = 'Yes';
      else RetainFlag = 'No';
      h.ref();
      output;
   end;

   h.clear();
run;

Result:

 

termid  prevtermid  id    RetainFlag 
201508  201501      U100  New 
201601  201508      U100  Yes 
201608  201601      U100  Yes 
201701  201608      U100  Yes 
201705  201701      U100  Yes 
201708  201701      U100  Yes 
201508  201501      U200  New 
201608  201601      U200  No 
201701  201608      U200  Yes 
201705  201701      U200  Yes 
201708  201701      U200  Yes 
201508  201501      U300  New 
201601  201508      U300  Yes 
201701  201608      U300  No 
201705  201701      U300  Yes 
201708  201701      U300  Yes 
Kurt_Bremser
Super User

Same technique (hash), slightly different code structure:

data want;
set sample;
by id;
if _N_ = 1
then do;
  declare hash look ();
  look.definekey("termid");
  look.definedone();
end;
if first.id
then do;
  retainflag = "new";
  rc = look.clear();
end;
else do;
  if look.check(key:prevtermid) = 0
  then retainflag = "yes";
  else retainflag = "no";
end;
rc = look.add();
drop rc;
run; 
Sangita1
Obsidian | Level 7

Thank you.

Your code is working great also, but I think I can only accept one example as solution. I will spend time learning definekey, definedone and how elegantly you have written the solution. I really appreciate your help. Thank you very much for sharing your knowledge with the newbees like me. 

Regards.

Kurt_Bremser
Super User

The HASH object is a relatively new addition to the DATA step language, but it has proven itself to be a very powerful tool that enables rather simple solutions to seemingly complex issues, and can (and WILL) speed up processes by moving the sorting to a pure in-memory process when doing lookups.

In the meantime, I have shaved hours of runtime from our existing programs by replacing sort/merge lookups with hash lookups wherever feasible (there are datasets that simply won't fit into available memory).

 

Regarding coding styles:

Everyone develops their own method of writing code, you can see two different styles between @PeterClemmensen and me. Both have their merits, the common ground is that we use consistent indentation for logical blocks, make use of whitespace (blanks) to easier identify words (keywords, variable names etc), and try to not pack too much into a single line, as that makes the code easier to read and easier to edit. Just an example:

set incoming (in=inc where=(xx1="Y") rename=(x1=xx1 x2=xx2) keep=x1 x2 x3);

compare to

set incoming (
  in=inc
  keep=x1 x2 x3
  rename=(
    x1=xx1
    x2=xx2
  )
  where=(xx1="Y")
);

In the second example, it is much easier to move a code element somewhere else by simply grabbing a complete line.

And the order of the options follows the logical sequence in which the data step compiler will process them.

 

When working in teams, the team members should develop and follow a single coding style.

Sangita1
Obsidian | Level 7

When I write the following code:

It does not work properly because for the Fall term, I have the previous termid as the Spring term (which can be two rows above as some students take summer term and some do not take summer term).  So I have to do something when it is summer term, as that can set the lag variable incorrectly. 


DATA work.identified;
set work.sample;
by ID;
retain lagtermid;
if first.id then X= 'New';
if prevtermid = lagtermid then retain_Student = 'Yes'; else retain_Student = 'No';
lagtermid = termid;
RUN;

PeterClemmensen
Tourmaline | Level 20

Did you take a look at the replies above?

Sangita1
Obsidian | Level 7

Yes, I did. Both solutions are great. Now I have the code, and I am going to read the function calls so I know how/where to use them in future. Thankful for the help, as I was struggling for couple hours and reading how to use retain/lag. You made my day (or rather morning as it is very early morning for me). 

Sangita1
Obsidian | Level 7
Hi, I have one more question please. When I run the code on my big dataset, I am getting the type mismatch error
ERROR: Type mismatch for key variable TERMID at line 68 column 6.
I think it is because TERMID is a char. I am not sure if I should do a length statement or convert it (as some posts suggested). What might be the best way to resolve this? Thank you.
PeterClemmensen
Tourmaline | Level 20

Are both TermID and PrevTermID character?


Also, post your log please.

Sangita1
Obsidian | Level 7

Actually, I was able to fix that error. Thanks

Kurt_Bremser
Super User

Both variables must be of the same type, that's a given. Also, since these are date-related values, they should be stored as such:

data sample;
format id $8 termid prevtermid yymmn6.;
input (termid prevtermid) (:yymmn6.) id $;
datalines;
201508 201501 U100
201601 201508 U100
201608 201601 U100
201701 201608 U100
201705 201701 U100
201708 201701 U100
201508 201501 U200
201608 201601 U200
201701 201608 U200
201705 201701 U200
201708 201701 U200
201508 201501 U300
201601 201508 U300
201701 201608 U300
201705 201701 U300
201708 201701 U300
;

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 13 replies
  • 1402 views
  • 8 likes
  • 3 in conversation