Hello! I am using SAS ver9.3 and I am attempting to replace subsequent duplicate observations with blanks. Here is what I am looking for:
ID Var1 Var2 Var3 ===> ID Var1 Var2 Var3
001 AAA 111 XXX 001 AAA 111 XXX
001 BBB 111 YYY 001 BBB . YYY
002 AAA 111 XXX 002 AAA 111 XXX
002 AAA 222 XXX 002 . 222 .
002 BBB 222 XXX 002 BBB . .
003 AAA 111 XXX 003 AAA 111 XXX
003 BBB 222 YYY 003 BBB 222 YYY
003 CCC 222 YYY 003 CCC . .
Var1, Var2, and Var3 are grouped by the ID number and duplicates of that variable is fine. Here is the code I've tried without success:
proc sort data=have out=want nodupkey;
by Var1 Var2 Var3;
run;
proc sort data=want out=want;
by id;
run;
Is there another way to do this? I am a beginner at SAS programming, so I apologize if there is a simple solution I am overlooking. Thank you in advance.
Like this?
data HAVE;
input (ID VAR1 VAR2 VAR3) (: $) ;
cards;
001 AAA 111 XXX
001 BBB 111 YYY
002 AAA 111 XXX
002 AAA 222 XXX
002 BBB 222 XXX
003 AAA 111 XXX
003 BBB 222 YYY
003 CCC 222 YYY
run;
data WANT;
set HAVE;
if lag(ID)=ID and lag(VAR1)=VAR1 then call missing(VAR1);
if lag(ID)=ID and lag(VAR2)=VAR2 then call missing(VAR2);
if lag(ID)=ID and lag(VAR3)=VAR3 then call missing(VAR3);
run;
ID | VAR1 | VAR2 | VAR3 |
---|---|---|---|
001 | AAA | 111 | XXX |
001 | BBB | YYY | |
002 | AAA | 111 | XXX |
002 | 222 | ||
002 | BBB | ||
003 | AAA | 111 | XXX |
003 | BBB | 222 | YYY |
003 | CCC |
What you are asking is no so simple at all. Before we go there, though, let's double-check:
At any rate, here's an attempt:
data want;
if 5=4 then set have;
prior_var1 = var1;
prior_var2 = var2;
prior_var3 = var3;
set have;
by id;
if first.id=0 then do;
if var1 = prior_var1 then var1=' ';
if var2 = prior_var2 then var2=' ';
if var3 = prior_var3 then var3=' ';
end;
run;
Out of all this, the most important piece for a beginner in SAS is to understand the function of a BY statement in a DATA step. It will take a little studying, but it's well worth it.
Like this?
data HAVE;
input (ID VAR1 VAR2 VAR3) (: $) ;
cards;
001 AAA 111 XXX
001 BBB 111 YYY
002 AAA 111 XXX
002 AAA 222 XXX
002 BBB 222 XXX
003 AAA 111 XXX
003 BBB 222 YYY
003 CCC 222 YYY
run;
data WANT;
set HAVE;
if lag(ID)=ID and lag(VAR1)=VAR1 then call missing(VAR1);
if lag(ID)=ID and lag(VAR2)=VAR2 then call missing(VAR2);
if lag(ID)=ID and lag(VAR3)=VAR3 then call missing(VAR3);
run;
ID | VAR1 | VAR2 | VAR3 |
---|---|---|---|
001 | AAA | 111 | XXX |
001 | BBB | YYY | |
002 | AAA | 111 | XXX |
002 | 222 | ||
002 | BBB | ||
003 | AAA | 111 | XXX |
003 | BBB | 222 | YYY |
003 | CCC |
Is a case like below possible and if yes how should the result look like?
003 AAA 111 XXX 003 AAA 111 XXX 003 BBB 222 YYY 003 BBB 222 YYY 003 CCC 222 XXX ---> 003 CCC . ??? 003 CCC 222 YYY ---> 003 . . ???
And should you want a result where - per ID - each value per variable only occurs once then below code should do the job.
data HAVE;
input ID VAR1 $ VAR2 VAR3 $;
cards;
001 AAA 111 XXX
001 BBB 111 YYY
002 AAA 111 XXX
002 AAA 222 XXX
002 BBB 222 XXX
003 AAA 111 XXX
003 BBB 222 YYY
003 CCC 222 XXX
003 CCC 222 YYY
;
run;
proc sql noprint;
select max(length) into :max_length
from dictionary.columns
where libname='WORK' and memname='HAVE' and type='char'
;
quit;
data want(drop=_:);
set have;
by id notsorted;
array ac_val _character_;
array an_val _numeric_;
if _n_=1 then
do;
length _vname $32 _n_val 8 _c_val $&max_length;
dcl hash hc_val(hashexp:3);
hc_val.defineKey('_vname', '_c_val');
hc_val.defineDone();
dcl hash hn_val(hashexp:3);
hn_val.defineKey('_vname', '_n_val');
hn_val.defineDone();
end;
if first.id then
do;
hc_val.clear();
hn_val.clear();
end;
/* character vars */
do _i=1 to dim(ac_val);
_vname=vname(ac_val[_i]);
_c_val=ac_val[_i];
if upcase(_vname) ne 'ID' then
do;
if hc_val.check()=0 then call missing(ac_val[_i]);
else hc_val.add();
end;
end;
/* numeric vars */
do _i=1 to dim(an_val);
_vname=vname(an_val[_i]);
_n_val=an_val[_i];
if upcase(_vname) ne 'ID' then
do;
if hn_val.check()=0 then call missing(an_val[_i]);
else hn_val.add();
end;
end;
run;
Are you doing this for reporting purposes? If so, I believe both PROC TABULATE and REPORT will group variables together if defined appropriately to appear this way. Once you've changed it you're losing the ability to identify unique groups within your data.
not attempting to justify the action
first thought is BY-GROUP processing, but the "want" report indicates the "blanking" of values is independent of other columns
To generalise into one loop is awkward as array elements must be of the same type.
Here is a partly generalised two-loop solution
data want ; set have ; array chrs _character_ ; array nums _numeric_ ; array prevn(2000) _temporary_ ; array prevc(2000) $100 _temporary_ ; do over chrs ; if prevc(_i_)= chrs then call missing( chrs) ; else prevc(_i_)= chrs ; end ; do over nums ; if prevn(_i_)= nums then call missing( nums) ; else prevn(_i_)= nums ; end ; run ;
if this is worth doing (i.e. not just an interview question), you might put some effort (proc sql into macro variables) into replacing the 2000 with the numbers of character and numeric variables, and ensuring the temporary character array elements are no wider than the widest character variable.
I wanted to show a data step which looked as simple as this "customer request"
of course that routine also processes the ID column
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.