Hi,
I have following dataset (contains three different Ids) for students. I want to find out if they are retained or not based on if their record exists in the previous term.
DATA work.sample;
format id termid prevtermid;
INPUT termid prevtermid id $;
DATALINES;
201508 201501 U100
201601 201508 U100
201608 201601 U100
201701 201608 U100
201705 201701 U100
201708 201701 U100
201508 201501 U200
201608 201601 U200
201701 201608 U200
201705 201701 U200
201708 201701 U200
201508 201501 U300
201601 201508 U300
201701 201608 U300
201705 201701 U300
201708 201701 U300
;
RUN;
proc sort data=work.sample;
by id termid ;
quit;
So the sorted dataset will look like below:
ID Termid prevtermid
U100 201508 201501
U100 201601 201508
U100 201608 201601
U100 201701 201608
U100 201705 201701
U100 201708 201701
U200 201508 201501
U200 201608 201601
U200 201701 201608
U200 201705 201701
U200 201708 201701
U300 201508 201501
U300 201601 201508
U300 201701 201608
U300 201705 201701
U300 201708 201701
If for the same ID, prevTermID exists as a Termid in the previous rows, then I want to create a RetainFlag as Yes otherwise No. In the second iteration, if it is beginning of new Id (which will always have the "retainFlag = "no"), I want to mark this row as "New" to identify it as new student record.
So the output should look like below:
UID Termid prevtermid RetainFlag
U100 201508 201501 New
U100 201601 201508 Yes
U100 201608 201601 Yes
U100 201701 201608 Yes
U100 201705 201701 Yes
U100 201708 201701 Yes
U200 201508 201501 New
U200 201608 201601 No
U200 201701 201608 Yes
U200 201705 201701 Yes
U200 201708 201701 Yes
U300 201508 201501 New
U300 201601 201508 Yes
U300 201701 201608 No
U300 201705 201701 Yes
U300 201708 201701 Yes
I am trying to use Lag and retain but am not being successful how to accomplish it.
I will really appreciate your help. Thank you so much for your time and willingness to share your expertise.
Anyways, this should do it
DATA work.sample;
INPUT termid prevtermid id $;
DATALINES;
201508 201501 U100
201601 201508 U100
201608 201601 U100
201701 201608 U100
201705 201701 U100
201708 201701 U100
201508 201501 U200
201608 201601 U200
201701 201608 U200
201705 201701 U200
201708 201701 U200
201508 201501 U300
201601 201508 U300
201701 201608 U300
201705 201701 U300
201708 201701 U300
;
proc sort data = sample;
by id termid;
run;
data want;
dcl hash h ();
h.definekey('termid');
h.definedone();
do until (last.id);
set sample;
by id;
if first.id then RetainFlag = 'New';
else if h.check(key : prevtermid) = 0 then RetainFlag = 'Yes';
else RetainFlag = 'No';
h.ref();
output;
end;
h.clear();
run;
Result:
termid prevtermid id RetainFlag 201508 201501 U100 New 201601 201508 U100 Yes 201608 201601 U100 Yes 201701 201608 U100 Yes 201705 201701 U100 Yes 201708 201701 U100 Yes 201508 201501 U200 New 201608 201601 U200 No 201701 201608 U200 Yes 201705 201701 U200 Yes 201708 201701 U200 Yes 201508 201501 U300 New 201601 201508 U300 Yes 201701 201608 U300 No 201705 201701 U300 Yes 201708 201701 U300 Yes
So in reality, you only have ID and termID, right? The variable prevtermid is a variable you created yourself?
yes, I am new to programming, so I created the prevtermid column. Then I somehow was able to work it out by
DATA work.identified;
set work.sample;
by ID;
retain lagtermid;
if first.id then new_student= 'New';
y = lagtermid;
if prevtermid = lagtermid then retain_Student = 'Yes'; else retain_Student = 'No';
lagtermid = termid;
If Substr (termid, 5, 2) = '05' then lagtermid = prevtermid;
RUN;
I was going to drop all the extra columns - this is call "running around the block to walk one step".
Thank you so much for your elegant code. I am also going to read up the functions that you have used. Appreciate the super fast response.
Anyways, this should do it
DATA work.sample;
INPUT termid prevtermid id $;
DATALINES;
201508 201501 U100
201601 201508 U100
201608 201601 U100
201701 201608 U100
201705 201701 U100
201708 201701 U100
201508 201501 U200
201608 201601 U200
201701 201608 U200
201705 201701 U200
201708 201701 U200
201508 201501 U300
201601 201508 U300
201701 201608 U300
201705 201701 U300
201708 201701 U300
;
proc sort data = sample;
by id termid;
run;
data want;
dcl hash h ();
h.definekey('termid');
h.definedone();
do until (last.id);
set sample;
by id;
if first.id then RetainFlag = 'New';
else if h.check(key : prevtermid) = 0 then RetainFlag = 'Yes';
else RetainFlag = 'No';
h.ref();
output;
end;
h.clear();
run;
Result:
termid prevtermid id RetainFlag 201508 201501 U100 New 201601 201508 U100 Yes 201608 201601 U100 Yes 201701 201608 U100 Yes 201705 201701 U100 Yes 201708 201701 U100 Yes 201508 201501 U200 New 201608 201601 U200 No 201701 201608 U200 Yes 201705 201701 U200 Yes 201708 201701 U200 Yes 201508 201501 U300 New 201601 201508 U300 Yes 201701 201608 U300 No 201705 201701 U300 Yes 201708 201701 U300 Yes
Same technique (hash), slightly different code structure:
data want;
set sample;
by id;
if _N_ = 1
then do;
declare hash look ();
look.definekey("termid");
look.definedone();
end;
if first.id
then do;
retainflag = "new";
rc = look.clear();
end;
else do;
if look.check(key:prevtermid) = 0
then retainflag = "yes";
else retainflag = "no";
end;
rc = look.add();
drop rc;
run;
Thank you.
Your code is working great also, but I think I can only accept one example as solution. I will spend time learning definekey, definedone and how elegantly you have written the solution. I really appreciate your help. Thank you very much for sharing your knowledge with the newbees like me.
Regards.
The HASH object is a relatively new addition to the DATA step language, but it has proven itself to be a very powerful tool that enables rather simple solutions to seemingly complex issues, and can (and WILL) speed up processes by moving the sorting to a pure in-memory process when doing lookups.
In the meantime, I have shaved hours of runtime from our existing programs by replacing sort/merge lookups with hash lookups wherever feasible (there are datasets that simply won't fit into available memory).
Regarding coding styles:
Everyone develops their own method of writing code, you can see two different styles between @PeterClemmensen and me. Both have their merits, the common ground is that we use consistent indentation for logical blocks, make use of whitespace (blanks) to easier identify words (keywords, variable names etc), and try to not pack too much into a single line, as that makes the code easier to read and easier to edit. Just an example:
set incoming (in=inc where=(xx1="Y") rename=(x1=xx1 x2=xx2) keep=x1 x2 x3);
compare to
set incoming (
in=inc
keep=x1 x2 x3
rename=(
x1=xx1
x2=xx2
)
where=(xx1="Y")
);
In the second example, it is much easier to move a code element somewhere else by simply grabbing a complete line.
And the order of the options follows the logical sequence in which the data step compiler will process them.
When working in teams, the team members should develop and follow a single coding style.
When I write the following code:
It does not work properly because for the Fall term, I have the previous termid as the Spring term (which can be two rows above as some students take summer term and some do not take summer term). So I have to do something when it is summer term, as that can set the lag variable incorrectly.
DATA work.identified;
set work.sample;
by ID;
retain lagtermid;
if first.id then X= 'New';
if prevtermid = lagtermid then retain_Student = 'Yes'; else retain_Student = 'No';
lagtermid = termid;
RUN;
Did you take a look at the replies above?
Yes, I did. Both solutions are great. Now I have the code, and I am going to read the function calls so I know how/where to use them in future. Thankful for the help, as I was struggling for couple hours and reading how to use retain/lag. You made my day (or rather morning as it is very early morning for me).
Are both TermID and PrevTermID character?
Also, post your log please.
Actually, I was able to fix that error. Thanks
Both variables must be of the same type, that's a given. Also, since these are date-related values, they should be stored as such:
data sample;
format id $8 termid prevtermid yymmn6.;
input (termid prevtermid) (:yymmn6.) id $;
datalines;
201508 201501 U100
201601 201508 U100
201608 201601 U100
201701 201608 U100
201705 201701 U100
201708 201701 U100
201508 201501 U200
201608 201601 U200
201701 201608 U200
201705 201701 U200
201708 201701 U200
201508 201501 U300
201601 201508 U300
201701 201608 U300
201705 201701 U300
201708 201701 U300
;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.