How to Replace Subsequent Duplicate Observations with Blanks

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 14
Accepted Solution

How to Replace Subsequent Duplicate Observations with Blanks

[ Edited ]

Hello! I am using SAS ver9.3 and I am attempting to replace subsequent duplicate observations with blanks. Here is what I am looking for:

 

 ID        Var1     Var2      Var3           ===>          ID        Var1     Var2      Var3

001       AAA      111       XXX                             001       AAA      111       XXX 

001       BBB      111       YYY                             001       BBB         .         YYY

002       AAA      111       XXX                             002       AAA      111       XXX

002       AAA      222       XXX                             002                 222          .

002       BBB      222       XXX                             002       BBB        .             .

003       AAA      111       XXX                             003       AAA      111       XXX

003       BBB      222       YYY                             003       BBB      222       YYY

003       CCC     222       YYY                             003       CCC        .             .

 

Var1, Var2, and Var3 are grouped by the ID number and duplicates of that variable is fine. Here is the code I've tried without success:

 

 

proc sort data=have out=want nodupkey;

by Var1 Var2 Var3;

run;

 

proc sort data=want out=want;

by id;

run;

 

 

Is there another way to do this? I am a beginner at SAS programming, so I apologize if there is a simple solution I am overlooking. Thank you in advance. 


Accepted Solutions
Solution
‎05-23-2017 07:55 PM
PROC Star
Posts: 1,674

Re: How to Replace Subsequent Duplicate Observations with Blanks

Like this?

data HAVE;
 input (ID VAR1 VAR2 VAR3) (: $) ;
cards; 
001       AAA      111       XXX    
001       BBB      111       YYY    
002       AAA      111       XXX    
002       AAA      222       XXX    
002       BBB      222       XXX    
003       AAA      111       XXX    
003       BBB      222       YYY    
003       CCC     222       YYY     
run;
data WANT;
  set HAVE;
  if lag(ID)=ID and lag(VAR1)=VAR1 then call missing(VAR1);
  if lag(ID)=ID and lag(VAR2)=VAR2 then call missing(VAR2);
  if lag(ID)=ID and lag(VAR3)=VAR3 then call missing(VAR3);
run;
ID VAR1 VAR2 VAR3
001 AAA 111 XXX
001 BBB   YYY
002 AAA 111 XXX
002   222  
002 BBB    
003 AAA 111 XXX
003 BBB 222 YYY
003 CCC    

View solution in original post


All Replies
Super User
Posts: 5,391

Re: How to Replace Subsequent Duplicate Observations with Blanks

Proc sort can remove duplicates, but it can only remove observations, not individual values.
You need to use the data step with RETAIN and BY processing. With this you can compare values between observations and potentially assign missing values.
Data never sleeps
Super User
Posts: 5,371

Re: How to Replace Subsequent Duplicate Observations with Blanks

What you are asking is no so simple at all.  Before we go there, though, let's double-check:

 

  • Is this a wise thing to attempt?  If you described more of where you ultimately want to head with this, there might be other ways to get there (other than actually changing the data).

At any rate, here's an attempt:

 

data want;

if 5=4 then set have;

prior_var1 = var1;

prior_var2 = var2;

prior_var3 = var3;

set have;

by id;

if first.id=0 then do;

   if var1 = prior_var1 then var1=' ';

   if var2 = prior_var2 then var2=' ';

   if var3 = prior_var3 then var3=' ';

end;

run;

 

Out of all this, the most important piece for a beginner in SAS is to understand the function of a BY statement in a DATA step.  It will take a little studying, but it's well worth it.

Solution
‎05-23-2017 07:55 PM
PROC Star
Posts: 1,674

Re: How to Replace Subsequent Duplicate Observations with Blanks

Like this?

data HAVE;
 input (ID VAR1 VAR2 VAR3) (: $) ;
cards; 
001       AAA      111       XXX    
001       BBB      111       YYY    
002       AAA      111       XXX    
002       AAA      222       XXX    
002       BBB      222       XXX    
003       AAA      111       XXX    
003       BBB      222       YYY    
003       CCC     222       YYY     
run;
data WANT;
  set HAVE;
  if lag(ID)=ID and lag(VAR1)=VAR1 then call missing(VAR1);
  if lag(ID)=ID and lag(VAR2)=VAR2 then call missing(VAR2);
  if lag(ID)=ID and lag(VAR3)=VAR3 then call missing(VAR3);
run;
ID VAR1 VAR2 VAR3
001 AAA 111 XXX
001 BBB   YYY
002 AAA 111 XXX
002   222  
002 BBB    
003 AAA 111 XXX
003 BBB 222 YYY
003 CCC    
Respected Advisor
Posts: 4,137

Re: How to Replace Subsequent Duplicate Observations with Blanks

@yawenyu

Is a case like below possible and if yes how should the result look like?

003 AAA 111 XXX        003 AAA 111 XXX
003 BBB 222 YYY        003 BBB 222 YYY
003 CCC 222 XXX  --->  003 CCC .   ???
003 CCC 222 YYY  --->  003 .   .   ???
Respected Advisor
Posts: 4,137

Re: How to Replace Subsequent Duplicate Observations with Blanks

@yawenyu

 

And should you want a result where - per ID - each value per variable only occurs once then below code should do the job.

data HAVE;
 input ID VAR1 $ VAR2 VAR3 $;
cards; 
001 AAA 111 XXX 
001 BBB 111 YYY 
002 AAA 111 XXX 
002 AAA 222 XXX 
002 BBB 222 XXX 
003 AAA 111 XXX
003 BBB 222 YYY
003 CCC 222 XXX
003 CCC 222 YYY
;
run;

proc sql noprint;
  select max(length) into :max_length
  from dictionary.columns
  where libname='WORK' and memname='HAVE' and type='char'
  ;
quit;

data want(drop=_:);
  set have;
  by id notsorted;

  array ac_val _character_;
  array an_val _numeric_;

  if _n_=1 then
    do;
      length _vname $32 _n_val 8 _c_val $&max_length;
      dcl hash hc_val(hashexp:3);
      hc_val.defineKey('_vname', '_c_val');
      hc_val.defineDone();
      dcl hash hn_val(hashexp:3);
      hn_val.defineKey('_vname', '_n_val');
      hn_val.defineDone();
    end;
  

  if first.id then
    do;
      hc_val.clear();
      hn_val.clear();
    end;

  /* character vars */
  do _i=1 to dim(ac_val);
    _vname=vname(ac_val[_i]);
    _c_val=ac_val[_i];
    if upcase(_vname) ne 'ID' then
      do;
        if hc_val.check()=0 then call missing(ac_val[_i]);
        else hc_val.add();
      end;
  end;

  /* numeric vars */
  do _i=1 to dim(an_val);
    _vname=vname(an_val[_i]);
    _n_val=an_val[_i];
    if upcase(_vname) ne 'ID' then
      do;
        if hn_val.check()=0 then call missing(an_val[_i]);
        else hn_val.add();
      end;
  end;

run;

Super User
Posts: 19,181

Re: How to Replace Subsequent Duplicate Observations with Blanks

Are you doing this for reporting purposes? If so, I believe both PROC TABULATE and REPORT will group variables together if defined appropriately to appear this way. Once you've changed it you're losing the ability to identify unique groups within your data. 

Valued Guide
Posts: 2,177

Re: How to Replace Subsequent Duplicate Observations with Blanks

not attempting to justify the action

 

first thought is BY-GROUP processing, but the "want" report indicates the "blanking" of values is independent of other columns

 

To generalise into one loop is awkward as array elements must be of the same type.

Here is a partly generalised two-loop solution

 

data want ;
   set have ;
   array chrs _character_ ;
   array nums _numeric_  ;
   array prevn(2000) _temporary_ ;
   array prevc(2000) $100 _temporary_ ;
   do over chrs ;
      if   prevc(_i_)= chrs  then call missing( chrs) ;
      else prevc(_i_)= chrs ;
   end ;
   do over nums ;
      if   prevn(_i_)= nums  then call missing( nums) ;
      else prevn(_i_)= nums ;
   end ;
run ;

if this is worth doing (i.e. not just an interview question), you might put some effort (proc sql into macro variables) into replacing the 2000 with the numbers of character and numeric variables, and ensuring the temporary character array elements are no wider than the widest character variable. 

 

I wanted to show a data step which looked as simple as this "customer request"

Valued Guide
Posts: 2,177

Re: How to Replace Subsequent Duplicate Observations with Blanks

of course that routine also processes the ID column

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 222 views
  • 1 like
  • 7 in conversation