Solved: Re: How to Replace Subsequent Duplicate Observations with Blanks

yawenyu · Posted 05-18-2017 07:23 PM

Hello! I am using SAS ver9.3 and I am attempting to replace subsequent duplicate observations with blanks. Here is what I am looking for:

ID Var1 Var2 Var3 ===> ID Var1 Var2 Var3

001 AAA 111 XXX 001 AAA 111 XXX

001 BBB 111 YYY 001 BBB . YYY

002 AAA 111 XXX 002 AAA 111 XXX

002 AAA 222 XXX 002 . 222 .

002 BBB 222 XXX 002 BBB . .

003 AAA 111 XXX 003 AAA 111 XXX

003 BBB 222 YYY 003 BBB 222 YYY

003 CCC 222 YYY 003 CCC . .

Var1, Var2, and Var3 are grouped by the ID number and duplicates of that variable is fine. Here is the code I've tried without success:

proc sort data=have out=want nodupkey;

by Var1 Var2 Var3;

run;

proc sort data=want out=want;

by id;

run;

Is there another way to do this? I am a beginner at SAS programming, so I apologize if there is a simple solution I am overlooking. Thank you in advance.

ChrisNZ · Posted 05-18-2017 11:20 PM

Like this?

data HAVE;
 input (ID VAR1 VAR2 VAR3) (: $) ;
cards; 
001       AAA      111       XXX    
001       BBB      111       YYY    
002       AAA      111       XXX    
002       AAA      222       XXX    
002       BBB      222       XXX    
003       AAA      111       XXX    
003       BBB      222       YYY    
003       CCC     222       YYY     
run;
data WANT;
  set HAVE;
  if lag(ID)=ID and lag(VAR1)=VAR1 then call missing(VAR1);
  if lag(ID)=ID and lag(VAR2)=VAR2 then call missing(VAR2);
  if lag(ID)=ID and lag(VAR3)=VAR3 then call missing(VAR3);
run;

ID	VAR1	VAR2	VAR3
001	AAA	111	XXX
001	BBB		YYY
002	AAA	111	XXX
002		222
002	BBB
003	AAA	111	XXX
003	BBB	222	YYY
003	CCC

High-Performance SAS Coding - Third Edition

View solution in original post

LinusH · Posted 05-18-2017 08:15 PM

Proc sort can remove duplicates, but it can only remove observations, not individual values.
You need to use the data step with RETAIN and BY processing. With this you can compare values between observations and potentially assign missing values.

Data never sleeps

Astounding · Posted 05-18-2017 08:17 PM

What you are asking is no so simple at all. Before we go there, though, let's double-check:

Is this a wise thing to attempt? If you described more of where you ultimately want to head with this, there might be other ways to get there (other than actually changing the data).

At any rate, here's an attempt:

data want;

if 5=4 then set have;

prior_var1 = var1;

prior_var2 = var2;

prior_var3 = var3;

set have;

by id;

if first.id=0 then do;

if var1 = prior_var1 then var1=' ';

if var2 = prior_var2 then var2=' ';

if var3 = prior_var3 then var3=' ';

end;

run;

Out of all this, the most important piece for a beginner in SAS is to understand the function of a BY statement in a DATA step. It will take a little studying, but it's well worth it.

ChrisNZ · Posted 05-18-2017 11:20 PM

Like this?

data HAVE;
 input (ID VAR1 VAR2 VAR3) (: $) ;
cards; 
001       AAA      111       XXX    
001       BBB      111       YYY    
002       AAA      111       XXX    
002       AAA      222       XXX    
002       BBB      222       XXX    
003       AAA      111       XXX    
003       BBB      222       YYY    
003       CCC     222       YYY     
run;
data WANT;
  set HAVE;
  if lag(ID)=ID and lag(VAR1)=VAR1 then call missing(VAR1);
  if lag(ID)=ID and lag(VAR2)=VAR2 then call missing(VAR2);
  if lag(ID)=ID and lag(VAR3)=VAR3 then call missing(VAR3);
run;

ID	VAR1	VAR2	VAR3
001	AAA	111	XXX
001	BBB		YYY
002	AAA	111	XXX
002		222
002	BBB
003	AAA	111	XXX
003	BBB	222	YYY
003	CCC

High-Performance SAS Coding - Third Edition

Patrick · Posted 05-19-2017 12:05 AM

@yawenyu

Is a case like below possible and if yes how should the result look like?

003 AAA 111 XXX        003 AAA 111 XXX
003 BBB 222 YYY        003 BBB 222 YYY
003 CCC 222 XXX  --->  003 CCC .   ???
003 CCC 222 YYY  --->  003 .   .   ???

Patrick · Posted 05-19-2017 06:43 AM

@yawenyu

And should you want a result where - per ID - each value per variable only occurs once then below code should do the job.

data HAVE;
 input ID VAR1 $ VAR2 VAR3 $;
cards; 
001 AAA 111 XXX 
001 BBB 111 YYY 
002 AAA 111 XXX 
002 AAA 222 XXX 
002 BBB 222 XXX 
003 AAA 111 XXX
003 BBB 222 YYY
003 CCC 222 XXX
003 CCC 222 YYY
;
run;

proc sql noprint;
  select max(length) into :max_length
  from dictionary.columns
  where libname='WORK' and memname='HAVE' and type='char'
  ;
quit;

data want(drop=_:);
  set have;
  by id notsorted;

  array ac_val _character_;
  array an_val _numeric_;

  if _n_=1 then
    do;
      length _vname $32 _n_val 8 _c_val $&max_length;
      dcl hash hc_val(hashexp:3);
      hc_val.defineKey('_vname', '_c_val');
      hc_val.defineDone();
      dcl hash hn_val(hashexp:3);
      hn_val.defineKey('_vname', '_n_val');
      hn_val.defineDone();
    end;
  

  if first.id then
    do;
      hc_val.clear();
      hn_val.clear();
    end;

  /* character vars */
  do _i=1 to dim(ac_val);
    _vname=vname(ac_val[_i]);
    _c_val=ac_val[_i];
    if upcase(_vname) ne 'ID' then
      do;
        if hc_val.check()=0 then call missing(ac_val[_i]);
        else hc_val.add();
      end;
  end;

  /* numeric vars */
  do _i=1 to dim(an_val);
    _vname=vname(an_val[_i]);
    _n_val=an_val[_i];
    if upcase(_vname) ne 'ID' then
      do;
        if hn_val.check()=0 then call missing(an_val[_i]);
        else hn_val.add();
      end;
  end;

run;

Reeza · Posted 05-19-2017 03:53 AM

Are you doing this for reporting purposes? If so, I believe both PROC TABULATE and REPORT will group variables together if defined appropriately to appear this way. Once you've changed it you're losing the ability to identify unique groups within your data.

Peter_C · Posted 05-19-2017 05:42 AM

not attempting to justify the action

first thought is BY-GROUP processing, but the "want" report indicates the "blanking" of values is independent of other columns

To generalise into one loop is awkward as array elements must be of the same type.

Here is a partly generalised two-loop solution

data want ;
   set have ;
   array chrs _character_ ;
   array nums _numeric_  ;
   array prevn(2000) _temporary_ ;
   array prevc(2000) $100 _temporary_ ;
   do over chrs ;
      if   prevc(_i_)= chrs  then call missing( chrs) ;
      else prevc(_i_)= chrs ;
   end ;
   do over nums ;
      if   prevn(_i_)= nums  then call missing( nums) ;
      else prevn(_i_)= nums ;
   end ;
run ;

if this is worth doing (i.e. not just an interview question), you might put some effort (proc sql into macro variables) into replacing the 2000 with the numbers of character and numeric variables, and ensuring the temporary character array elements are no wider than the widest character variable.

I wanted to show a data step which looked as simple as this "customer request"

Peter_C · Posted 05-19-2017 06:31 AM

of course that routine also processes the ID column

ID	VAR1	VAR2	VAR3
001	AAA	111	XXX
001	BBB		YYY
002	AAA	111	XXX
002		222
002	BBB
003	AAA	111	XXX
003	BBB	222	YYY
003	CCC

ID	VAR1	VAR2	VAR3
001	AAA	111	XXX
001	BBB		YYY
002	AAA	111	XXX
002		222
002	BBB
003	AAA	111	XXX
003	BBB	222	YYY
003	CCC

ID	VAR1	VAR2	VAR3
001	AAA	111	XXX
001	BBB		YYY
002	AAA	111	XXX
002		222
002	BBB
003	AAA	111	XXX
003	BBB	222	YYY
003	CCC

ID	VAR1	VAR2	VAR3
001	AAA	111	XXX
001	BBB		YYY
002	AAA	111	XXX
002		222
002	BBB
003	AAA	111	XXX
003	BBB	222	YYY
003	CCC

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away

ID	VAR1	VAR2	VAR3
001	AAA	111	XXX
001	BBB		YYY
002	AAA	111	XXX
002		222
002	BBB
003	AAA	111	XXX
003	BBB	222	YYY
003	CCC

ID	VAR1	VAR2	VAR3
001	AAA	111	XXX
001	BBB		YYY
002	AAA	111	XXX
002		222
002	BBB
003	AAA	111	XXX
003	BBB	222	YYY
003	CCC