BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
yawenyu
Obsidian | Level 7

Hello! I am using SAS ver9.3 and I am attempting to replace subsequent duplicate observations with blanks. Here is what I am looking for:

 

 ID        Var1     Var2      Var3           ===>          ID        Var1     Var2      Var3

001       AAA      111       XXX                             001       AAA      111       XXX 

001       BBB      111       YYY                             001       BBB         .         YYY

002       AAA      111       XXX                             002       AAA      111       XXX

002       AAA      222       XXX                             002                 222          .

002       BBB      222       XXX                             002       BBB        .             .

003       AAA      111       XXX                             003       AAA      111       XXX

003       BBB      222       YYY                             003       BBB      222       YYY

003       CCC     222       YYY                             003       CCC        .             .

 

Var1, Var2, and Var3 are grouped by the ID number and duplicates of that variable is fine. Here is the code I've tried without success:

 

 

proc sort data=have out=want nodupkey;

by Var1 Var2 Var3;

run;

 

proc sort data=want out=want;

by id;

run;

 

 

Is there another way to do this? I am a beginner at SAS programming, so I apologize if there is a simple solution I am overlooking. Thank you in advance. 

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

Like this?

data HAVE;
 input (ID VAR1 VAR2 VAR3) (: $) ;
cards; 
001       AAA      111       XXX    
001       BBB      111       YYY    
002       AAA      111       XXX    
002       AAA      222       XXX    
002       BBB      222       XXX    
003       AAA      111       XXX    
003       BBB      222       YYY    
003       CCC     222       YYY     
run;
data WANT;
  set HAVE;
  if lag(ID)=ID and lag(VAR1)=VAR1 then call missing(VAR1);
  if lag(ID)=ID and lag(VAR2)=VAR2 then call missing(VAR2);
  if lag(ID)=ID and lag(VAR3)=VAR3 then call missing(VAR3);
run;
ID VAR1 VAR2 VAR3
001 AAA 111 XXX
001 BBB   YYY
002 AAA 111 XXX
002   222  
002 BBB    
003 AAA 111 XXX
003 BBB 222 YYY
003 CCC    

View solution in original post

8 REPLIES 8
LinusH
Tourmaline | Level 20
Proc sort can remove duplicates, but it can only remove observations, not individual values.
You need to use the data step with RETAIN and BY processing. With this you can compare values between observations and potentially assign missing values.
Data never sleeps
Astounding
PROC Star

What you are asking is no so simple at all.  Before we go there, though, let's double-check:

 

  • Is this a wise thing to attempt?  If you described more of where you ultimately want to head with this, there might be other ways to get there (other than actually changing the data).

At any rate, here's an attempt:

 

data want;

if 5=4 then set have;

prior_var1 = var1;

prior_var2 = var2;

prior_var3 = var3;

set have;

by id;

if first.id=0 then do;

   if var1 = prior_var1 then var1=' ';

   if var2 = prior_var2 then var2=' ';

   if var3 = prior_var3 then var3=' ';

end;

run;

 

Out of all this, the most important piece for a beginner in SAS is to understand the function of a BY statement in a DATA step.  It will take a little studying, but it's well worth it.

ChrisNZ
Tourmaline | Level 20

Like this?

data HAVE;
 input (ID VAR1 VAR2 VAR3) (: $) ;
cards; 
001       AAA      111       XXX    
001       BBB      111       YYY    
002       AAA      111       XXX    
002       AAA      222       XXX    
002       BBB      222       XXX    
003       AAA      111       XXX    
003       BBB      222       YYY    
003       CCC     222       YYY     
run;
data WANT;
  set HAVE;
  if lag(ID)=ID and lag(VAR1)=VAR1 then call missing(VAR1);
  if lag(ID)=ID and lag(VAR2)=VAR2 then call missing(VAR2);
  if lag(ID)=ID and lag(VAR3)=VAR3 then call missing(VAR3);
run;
ID VAR1 VAR2 VAR3
001 AAA 111 XXX
001 BBB   YYY
002 AAA 111 XXX
002   222  
002 BBB    
003 AAA 111 XXX
003 BBB 222 YYY
003 CCC    
Patrick
Opal | Level 21

@yawenyu

Is a case like below possible and if yes how should the result look like?

003 AAA 111 XXX        003 AAA 111 XXX
003 BBB 222 YYY        003 BBB 222 YYY
003 CCC 222 XXX  --->  003 CCC .   ???
003 CCC 222 YYY  --->  003 .   .   ???
Patrick
Opal | Level 21

@yawenyu

 

And should you want a result where - per ID - each value per variable only occurs once then below code should do the job.

data HAVE;
 input ID VAR1 $ VAR2 VAR3 $;
cards; 
001 AAA 111 XXX 
001 BBB 111 YYY 
002 AAA 111 XXX 
002 AAA 222 XXX 
002 BBB 222 XXX 
003 AAA 111 XXX
003 BBB 222 YYY
003 CCC 222 XXX
003 CCC 222 YYY
;
run;

proc sql noprint;
  select max(length) into :max_length
  from dictionary.columns
  where libname='WORK' and memname='HAVE' and type='char'
  ;
quit;

data want(drop=_:);
  set have;
  by id notsorted;

  array ac_val _character_;
  array an_val _numeric_;

  if _n_=1 then
    do;
      length _vname $32 _n_val 8 _c_val $&max_length;
      dcl hash hc_val(hashexp:3);
      hc_val.defineKey('_vname', '_c_val');
      hc_val.defineDone();
      dcl hash hn_val(hashexp:3);
      hn_val.defineKey('_vname', '_n_val');
      hn_val.defineDone();
    end;
  

  if first.id then
    do;
      hc_val.clear();
      hn_val.clear();
    end;

  /* character vars */
  do _i=1 to dim(ac_val);
    _vname=vname(ac_val[_i]);
    _c_val=ac_val[_i];
    if upcase(_vname) ne 'ID' then
      do;
        if hc_val.check()=0 then call missing(ac_val[_i]);
        else hc_val.add();
      end;
  end;

  /* numeric vars */
  do _i=1 to dim(an_val);
    _vname=vname(an_val[_i]);
    _n_val=an_val[_i];
    if upcase(_vname) ne 'ID' then
      do;
        if hn_val.check()=0 then call missing(an_val[_i]);
        else hn_val.add();
      end;
  end;

run;

Reeza
Super User

Are you doing this for reporting purposes? If so, I believe both PROC TABULATE and REPORT will group variables together if defined appropriately to appear this way. Once you've changed it you're losing the ability to identify unique groups within your data. 

Peter_C
Rhodochrosite | Level 12

not attempting to justify the action

 

first thought is BY-GROUP processing, but the "want" report indicates the "blanking" of values is independent of other columns

 

To generalise into one loop is awkward as array elements must be of the same type.

Here is a partly generalised two-loop solution

 

data want ;
   set have ;
   array chrs _character_ ;
   array nums _numeric_  ;
   array prevn(2000) _temporary_ ;
   array prevc(2000) $100 _temporary_ ;
   do over chrs ;
      if   prevc(_i_)= chrs  then call missing( chrs) ;
      else prevc(_i_)= chrs ;
   end ;
   do over nums ;
      if   prevn(_i_)= nums  then call missing( nums) ;
      else prevn(_i_)= nums ;
   end ;
run ;

if this is worth doing (i.e. not just an interview question), you might put some effort (proc sql into macro variables) into replacing the 2000 with the numbers of character and numeric variables, and ensuring the temporary character array elements are no wider than the widest character variable. 

 

I wanted to show a data step which looked as simple as this "customer request"

Peter_C
Rhodochrosite | Level 12

of course that routine also processes the ID column

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 5031 views
  • 1 like
  • 7 in conversation