BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
CJ_Jackson
Fluorite | Level 6

I'm having issues with string comparison with when it comes to lag function.

 

Say I have the following dataset:

 

DATA test;
	input id code $;
	cards;
	1 abc
	1 abc
	2 xyz
	2 xyy
	2 xyy
	2 zzz
	3 ccc
	3 ccc
	;
PROC SORT;
	by id code;
run;

 

Then I want to calculate how many distinct code values there are within each ID values, I use this program:

 

DATA test2;
	SET test;
	by id code;
	RETAIN code_cnt 1;
	ff = first.id;
	lagged = lag1(code);
	if first.id then code_cnt = 1;
	else do;
		if code ~= lag1(code) then code_cnt = code_cnt + 1;
		end;
PROC PRINT;
run;

The resulting dataset looks like this:

 

 

Obs id code code_cnt ff lagged

1abc11 
1abc20abc
2xyy11abc
2xyy20xyy
2xyz30xyy
2zzz40xyz
3ccc11zzz
3ccc20ccc

 

What is wrong?

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

You can't use lag in IF/THEN conditions, the behaviour isn't what you expect. It's a common issue when first starting out with the lag function. If you google the terms you'll find many paper/posts on the issue. 

 

DATA test2;
	SET test;
	by id code;
	RETAIN code_cnt 1;
	ff = first.id;
	lagged = lag1(code);
	if first.id then code_cnt = 1;
	else do;
		if code ne lagged then code_cnt = code_cnt + 1;
		end;
PROC PRINT;
run;

As mentioned PROC SQL is a better method here:

 

proc sql;
create table want as
select id, count(distinct code) as code_cnt
from test;
quit;

Or a SAS proc method:

 

proc sort data=test out=test2 nodupkey;
by id code;
run;

proc freq data=test2;
table ID/out=want_freq;
run;

proc print data=want_freq;
run;

 

EDIT: One more option - based on using FIRST since you've sorted and are using BY variables:

 

PROC SORT;
	by id code;
run;

data want;
set test;
by id code;
if first.id then code_cnt=0;
if first.code then code_cnt+1;
*if last.id then output;
run;

 

View solution in original post

7 REPLIES 7
LinusH
Tourmaline | Level 20
The count (distinct code) construct in SQL seems better suited for this kind of calculation.
Data never sleeps
DanZ
Obsidian | Level 7

I agree that proc sql is generally a better solution for most types of aggregation. But, if it must be in a data step for some reason, then it would look something like this:

DATA test;
	input id code $;
	cards;
	1 abc
	1 abc
	2 xyz
	2 xyy
	2 xyy
	2 zzz
	3 ccc
	3 ccc
	;
PROC SORT;
	by id code;
run;

DATA test2;
	SET test;
	by id code;
	retain code_cnt;
	if first.id then code_cnt = 0;
	code_cnt = code_cnt + 1;
	if last.id then output;
run;	

PROC PRINT;
run;

 

This is probably what the proc sql will be doing in the background as well.

Reeza
Super User

You can't use lag in IF/THEN conditions, the behaviour isn't what you expect. It's a common issue when first starting out with the lag function. If you google the terms you'll find many paper/posts on the issue. 

 

DATA test2;
	SET test;
	by id code;
	RETAIN code_cnt 1;
	ff = first.id;
	lagged = lag1(code);
	if first.id then code_cnt = 1;
	else do;
		if code ne lagged then code_cnt = code_cnt + 1;
		end;
PROC PRINT;
run;

As mentioned PROC SQL is a better method here:

 

proc sql;
create table want as
select id, count(distinct code) as code_cnt
from test;
quit;

Or a SAS proc method:

 

proc sort data=test out=test2 nodupkey;
by id code;
run;

proc freq data=test2;
table ID/out=want_freq;
run;

proc print data=want_freq;
run;

 

EDIT: One more option - based on using FIRST since you've sorted and are using BY variables:

 

PROC SORT;
	by id code;
run;

data want;
set test;
by id code;
if first.id then code_cnt=0;
if first.code then code_cnt+1;
*if last.id then output;
run;

 

CJ_Jackson
Fluorite | Level 6

Thanks, your solutions (mostly) worked. In the PROC SQL you wanna do GROUP BY id as well.

Kurt_Bremser
Super User

The lag() function builds a FIFO chain that receives a value everytime the function is called. So calling the function in a conditional branch or a do loop will lead to unexpected behaviour.

CJ_Jackson
Fluorite | Level 6
thanks. This is a weird behavior. So if I first saved the lag1(code) as a new variable then I would be able to use that value in a conditional then, correct?
ballardw
Super User

@CJ_Jackson wrote:
thanks. This is a weird behavior. So if I first saved the lag1(code) as a new variable then I would be able to use that value in a conditional then, correct?

Yes, exactly as @Reeza did with the variable LAGGED in the first data step of the solution.

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 4048 views
  • 1 like
  • 6 in conversation