DATA Step, Macro, Functions and more

String comparison problems with lag functions

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 7
Accepted Solution

String comparison problems with lag functions

[ Edited ]

I'm having issues with string comparison with when it comes to lag function.

 

Say I have the following dataset:

 

DATA test;
	input id code $;
	cards;
	1 abc
	1 abc
	2 xyz
	2 xyy
	2 xyy
	2 zzz
	3 ccc
	3 ccc
	;
PROC SORT;
	by id code;
run;

 

Then I want to calculate how many distinct code values there are within each ID values, I use this program:

 

DATA test2;
	SET test;
	by id code;
	RETAIN code_cnt 1;
	ff = first.id;
	lagged = lag1(code);
	if first.id then code_cnt = 1;
	else do;
		if code ~= lag1(code) then code_cnt = code_cnt + 1;
		end;
PROC PRINT;
run;

The resulting dataset looks like this:

 

 

Obs id code code_cnt ff lagged

1abc11 
1abc20abc
2xyy11abc
2xyy20xyy
2xyz30xyy
2zzz40xyz
3ccc11zzz
3ccc20ccc

 

What is wrong?


Accepted Solutions
Solution
‎01-26-2016 01:12 PM
Super User
Posts: 17,745

Re: String comparison problems with lag functions

[ Edited ]

You can't use lag in IF/THEN conditions, the behaviour isn't what you expect. It's a common issue when first starting out with the lag function. If you google the terms you'll find many paper/posts on the issue. 

 

DATA test2;
	SET test;
	by id code;
	RETAIN code_cnt 1;
	ff = first.id;
	lagged = lag1(code);
	if first.id then code_cnt = 1;
	else do;
		if code ne lagged then code_cnt = code_cnt + 1;
		end;
PROC PRINT;
run;

As mentioned PROC SQL is a better method here:

 

proc sql;
create table want as
select id, count(distinct code) as code_cnt
from test;
quit;

Or a SAS proc method:

 

proc sort data=test out=test2 nodupkey;
by id code;
run;

proc freq data=test2;
table ID/out=want_freq;
run;

proc print data=want_freq;
run;

 

EDIT: One more option - based on using FIRST since you've sorted and are using BY variables:

 

PROC SORT;
	by id code;
run;

data want;
set test;
by id code;
if first.id then code_cnt=0;
if first.code then code_cnt+1;
*if last.id then output;
run;

 

View solution in original post


All Replies
Super User
Posts: 5,254

Re: String comparison problems with lag functions

The count (distinct code) construct in SQL seems better suited for this kind of calculation.
Data never sleeps
Contributor
Posts: 38

Re: String comparison problems with lag functions

I agree that proc sql is generally a better solution for most types of aggregation. But, if it must be in a data step for some reason, then it would look something like this:

DATA test;
	input id code $;
	cards;
	1 abc
	1 abc
	2 xyz
	2 xyy
	2 xyy
	2 zzz
	3 ccc
	3 ccc
	;
PROC SORT;
	by id code;
run;

DATA test2;
	SET test;
	by id code;
	retain code_cnt;
	if first.id then code_cnt = 0;
	code_cnt = code_cnt + 1;
	if last.id then output;
run;	

PROC PRINT;
run;

 

This is probably what the proc sql will be doing in the background as well.

Solution
‎01-26-2016 01:12 PM
Super User
Posts: 17,745

Re: String comparison problems with lag functions

[ Edited ]

You can't use lag in IF/THEN conditions, the behaviour isn't what you expect. It's a common issue when first starting out with the lag function. If you google the terms you'll find many paper/posts on the issue. 

 

DATA test2;
	SET test;
	by id code;
	RETAIN code_cnt 1;
	ff = first.id;
	lagged = lag1(code);
	if first.id then code_cnt = 1;
	else do;
		if code ne lagged then code_cnt = code_cnt + 1;
		end;
PROC PRINT;
run;

As mentioned PROC SQL is a better method here:

 

proc sql;
create table want as
select id, count(distinct code) as code_cnt
from test;
quit;

Or a SAS proc method:

 

proc sort data=test out=test2 nodupkey;
by id code;
run;

proc freq data=test2;
table ID/out=want_freq;
run;

proc print data=want_freq;
run;

 

EDIT: One more option - based on using FIRST since you've sorted and are using BY variables:

 

PROC SORT;
	by id code;
run;

data want;
set test;
by id code;
if first.id then code_cnt=0;
if first.code then code_cnt+1;
*if last.id then output;
run;

 

Occasional Contributor
Posts: 7

Re: String comparison problems with lag functions

Thanks, your solutions (mostly) worked. In the PROC SQL you wanna do GROUP BY id as well.

Super User
Posts: 6,927

Re: String comparison problems with lag functions

The lag() function builds a FIFO chain that receives a value everytime the function is called. So calling the function in a conditional branch or a do loop will lead to unexpected behaviour.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Occasional Contributor
Posts: 7

Re: String comparison problems with lag functions

thanks. This is a weird behavior. So if I first saved the lag1(code) as a new variable then I would be able to use that value in a conditional then, correct?
Super User
Posts: 10,462

Re: String comparison problems with lag functions


CJ_Jackson wrote:
thanks. This is a weird behavior. So if I first saved the lag1(code) as a new variable then I would be able to use that value in a conditional then, correct?

Yes, exactly as @Reeza did with the variable LAGGED in the first data step of the solution.

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 425 views
  • 0 likes
  • 6 in conversation