I'm having issues with string comparison with when it comes to lag function.
Say I have the following dataset:
DATA test;
input id code $;
cards;
1 abc
1 abc
2 xyz
2 xyy
2 xyy
2 zzz
3 ccc
3 ccc
;
PROC SORT;
by id code;
run;
Then I want to calculate how many distinct code values there are within each ID values, I use this program:
DATA test2;
SET test;
by id code;
RETAIN code_cnt 1;
ff = first.id;
lagged = lag1(code);
if first.id then code_cnt = 1;
else do;
if code ~= lag1(code) then code_cnt = code_cnt + 1;
end;
PROC PRINT;
run;
The resulting dataset looks like this:
Obs id code code_cnt ff lagged
1 | abc | 1 | 1 | |
1 | abc | 2 | 0 | abc |
2 | xyy | 1 | 1 | abc |
2 | xyy | 2 | 0 | xyy |
2 | xyz | 3 | 0 | xyy |
2 | zzz | 4 | 0 | xyz |
3 | ccc | 1 | 1 | zzz |
3 | ccc | 2 | 0 | ccc |
What is wrong?
You can't use lag in IF/THEN conditions, the behaviour isn't what you expect. It's a common issue when first starting out with the lag function. If you google the terms you'll find many paper/posts on the issue.
DATA test2;
SET test;
by id code;
RETAIN code_cnt 1;
ff = first.id;
lagged = lag1(code);
if first.id then code_cnt = 1;
else do;
if code ne lagged then code_cnt = code_cnt + 1;
end;
PROC PRINT;
run;
As mentioned PROC SQL is a better method here:
proc sql;
create table want as
select id, count(distinct code) as code_cnt
from test;
quit;
Or a SAS proc method:
proc sort data=test out=test2 nodupkey;
by id code;
run;
proc freq data=test2;
table ID/out=want_freq;
run;
proc print data=want_freq;
run;
EDIT: One more option - based on using FIRST since you've sorted and are using BY variables:
PROC SORT;
by id code;
run;
data want;
set test;
by id code;
if first.id then code_cnt=0;
if first.code then code_cnt+1;
*if last.id then output;
run;
I agree that proc sql is generally a better solution for most types of aggregation. But, if it must be in a data step for some reason, then it would look something like this:
DATA test;
input id code $;
cards;
1 abc
1 abc
2 xyz
2 xyy
2 xyy
2 zzz
3 ccc
3 ccc
;
PROC SORT;
by id code;
run;
DATA test2;
SET test;
by id code;
retain code_cnt;
if first.id then code_cnt = 0;
code_cnt = code_cnt + 1;
if last.id then output;
run;
PROC PRINT;
run;
This is probably what the proc sql will be doing in the background as well.
You can't use lag in IF/THEN conditions, the behaviour isn't what you expect. It's a common issue when first starting out with the lag function. If you google the terms you'll find many paper/posts on the issue.
DATA test2;
SET test;
by id code;
RETAIN code_cnt 1;
ff = first.id;
lagged = lag1(code);
if first.id then code_cnt = 1;
else do;
if code ne lagged then code_cnt = code_cnt + 1;
end;
PROC PRINT;
run;
As mentioned PROC SQL is a better method here:
proc sql;
create table want as
select id, count(distinct code) as code_cnt
from test;
quit;
Or a SAS proc method:
proc sort data=test out=test2 nodupkey;
by id code;
run;
proc freq data=test2;
table ID/out=want_freq;
run;
proc print data=want_freq;
run;
EDIT: One more option - based on using FIRST since you've sorted and are using BY variables:
PROC SORT;
by id code;
run;
data want;
set test;
by id code;
if first.id then code_cnt=0;
if first.code then code_cnt+1;
*if last.id then output;
run;
Thanks, your solutions (mostly) worked. In the PROC SQL you wanna do GROUP BY id as well.
The lag() function builds a FIFO chain that receives a value everytime the function is called. So calling the function in a conditional branch or a do loop will lead to unexpected behaviour.
@CJ_Jackson wrote:
thanks. This is a weird behavior. So if I first saved the lag1(code) as a new variable then I would be able to use that value in a conditional then, correct?
Yes, exactly as @Reeza did with the variable LAGGED in the first data step of the solution.
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.