BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
cardinull
Fluorite | Level 6

Hello!

 

I want to create a list of partial duplicates - observations that have the same values for ID, date, and time, but different values in all other variables. The only condition is that I want to ignore differences due to missing values. For example, let's say I have the dataset:

 

IDVAR1_VAR2_VAR3_VAR4_datetime
11112915
12..2915
12221915
211221020
21.231020
2322.1020
211231020
3.134915
321.4915

 

Edit: I just realized I framed this wrong. I'd ultimately want to keep only one of row 5 or 7 as they are partial duplicates to the other observations of ID=2, date=10, and time=20. However, I'd want to remove both rows with ID=3 as when one is absorbed by the other, it no longer has partial duplicates. Sorry for the confusion! Smiley Embarassed :

IDVAR1_VAR2_VAR3_VAR4_datetime
11112915
12..2915
12221915
211221020
2322.1020
211231020

 

Is this possible in SAS? Is this not practical to code?

 

Below I have the SAS code to make the data:

data dt;
input ID VAR1 VAR2 VAR3 VAR4 date time;
datalines;
1 1 1 1 2 9 15
1 2 . . 2 9 15
1 2 2 2 1 9 15
2 1 1 2 2 10 20
2 1 . 2 3 10 20
2 3 2 2 . 10 20
2 1 1 2 3 10 20
3 . 1 3 4 9 15
3 2 1 . 4 9 15
;
run;

 

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

You apparently want to exclude all records that can be construed as a "child" of another record.   And by "child" I mean a record such that

  1. all its non-missing values match the "parent" record
  2. has at least one missing value for which the parent has a non-missing value.

 

In that case, if you have a table of child records, you could then keep all of the original records EXCEPT for those child records.  So the real task is to take each original record, and generate all its child records.  Then do a proc SQL (using the EXCEPT operator) or a data step with merge.

 

Here's the SQL solution:

 

data dt;
  retain ID DATE TIME;
  input ID var1 var2 var3 var4 date time;
datalines;
1 1 1 1 2  9 15
1 2 . . 2  9 15
1 2 2 2 1  9 15
1 2 2 2 .  9 15   Extra record
2 1 1 2 2 10 20
2 1 . 2 3 10 20
2 3 2 2 . 10 20
2 1 1 2 3 10 20
run;


data child (drop=_: v i);
  set dt;
  length _strng $8; 
  _strng=catx('!',of var:);
  array var{4} var1-var4;
  array _v{4};
  do v=1 to 4; 
    _v{v}=var{v};
  end;

  do i=1 to 2**4 -1;
    call missing(of var:);
    if mod(i,2)>=1  then var{1}=_v{1};
	if mod(i,4)>=2  then var{2}=_v{2};
	if mod(i,8)>=4  then var{3}=_v{3};
	if mod(i,16)>=8 then var{4}=_v{4};
	if catx('!',of var:)^=_strng then output;
  end;
run;

proc sort data=child nodupkey;
  by id date time var: ;
run;

proc sql;
  create table want as
  select * from dt
  except select * from child;
quit;

Now when the data sets are large, I suspect the SQL solution can be expensive because each record in the original dataset has to be compare to possible ALL the records in the child dataset, which is much larger.  So the DATA step approach might be better, as in:

 

data dt;
  retain ID DATE TIME;
  input ID var1 var2 var3 var4 date time;
datalines;
1 1 1 1 2  9 15
1 2 . . 2  9 15
1 2 2 2 1  9 15
1 2 2 2 .  9 15   Extra record
2 1 1 2 2 10 20
2 1 . 2 3 10 20
2 3 2 2 . 10 20
2 1 1 2 3 10 20
run;

proc sort data=dt out=have; by id date time var: ; run;

data child (drop=_: v i);
  set dt;
  length _strng $8; 
  _strng=catx('!',of var:);
  array var{4} var1-var4;
  array _v{4};
  do v=1 to 4; 
    _v{v}=var{v};
  end;

  do i=1 to 2**4 -1;
    call missing(of var:);
    if mod(i,2)>=1  then var{1}=_v{1};
	if mod(i,4)>=2  then var{2}=_v{2};
	if mod(i,8)>=4  then var{3}=_v{3};
	if mod(i,16)>=8 then var{4}=_v{4};
	if catx('!',of var:)^=_strng then output;
  end;
run;

proc sort data=child nodupkey;
  by id date time var: ;
run;

data want2;
merge have  child (in=inchild);
  by id date time var: ;
  if inchild=0;
run;

Regards,

Mark

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

15 REPLIES 15
novinosrin
Tourmaline | Level 20

Hi @cardinull   I went with some guesses and trying to shame myself by attempting to be creative lol

 


data dt;
input ID VAR1 VAR2 VAR3 VAR4 date time;
datalines;
1 1 1 1 2 9 15
1 2 . . 2 9 15
1 2 2 2 1 9 15
2 1 1 2 2 10 20
2 1 . 2 3 10 20
2 3 2 2 . 10 20
2 1 1 2 3 10 20
;
run;

data temp;
 update dt(obs=0) dt;
 by id;
 rownum+1;
 output;
run;

proc sql;
create table want as
select b.*
from
(select *
from temp
group by ID, VAR1, VAR2, VAR3, VAR4, date, time
having count(*)=1) a
left join 
(select *,monotonic() as rownum from dt) b
on a.rownum=b.rownum
order by b.rownum;
quit;

 

 

 

 

cardinull
Fluorite | Level 6

Lol! No shame! Being creative is the funnest part of figuring out SAS 🙂 Although with my clarification of mkeintz's quesiton, idk if this would change anything lol!

novinosrin
Tourmaline | Level 20

Hi @cardinull  Saturday lazy me will yet again go for another silly creative(just to show off) one using LOCF guesses. 🙂 Bear with me turning on some humor.

 


data dt;
input ID VAR1 VAR2 VAR3 VAR4 date time;
datalines;
1 1 1 1 2 9 15
1 2 . . 2 9 15
1 2 2 2 1 9 15
2 1 1 2 2 10 20
2 1 . 2 3 10 20
2 3 2 2 . 10 20
2 1 1 2 3 10 20
3	.	1	3	4	9	15
3	2	1	.	4	9	15
;
run;


data temp;
set dt;
array t var1--time;
call sortn(of t(*));
run;

data temp1;
 update temp(obs=0) temp;
 by id;
 rownum+1;
 if cmiss(of _all_)=0 then output;
run;

proc sql;
create table want as
select b.*
from
(select *
from temp1
group by ID, VAR1, VAR2, VAR3, VAR4, date, time
having count(*)=1) a
left join 
(select *,monotonic() as rownum from dt) b
on a.rownum=b.rownum
order by b.rownum;
quit;

I do have a linear clean approach in my mind, but I am sure that the giant seniors are most likely to offer a perfect solution

 

 

 

 

mkeintz
PROC Star

@novinosrin 

 

I don't think you suggestion works for much more than the data presented.   Current, in the data

 

1 1 1 1 2 9 15
1 2 . . 2 9 15
1 2 2 2 1 9 15
2 1 1 2 2 10 20
2 1 . 2 3 10 20
2 3 2 2 . 10 20
2 1 1 2 3 10 20

your program keeps all three records with ID=1.

 

But add a 4th original record, as in

 

data dt;
input ID VAR1 VAR2 VAR3 VAR4 date time;
datalines;
1 1 1 1 2 9 15
1 2 . . 2 9 15
1 2 2 2 1 9 15
1 2 2 2 . 9 15     This is the extra record
2 1 1 2 2 10 20
2 1 . 2 3 10 20
2 3 2 2 . 10 20
2 1 1 2 3 10 20
;
run;

the additional record (#4) should be considered absorbed by the 3rd record, so WANT should be unchanged.   But the result is that it actually REDUCES the number of ID=1 records, i.e. neither record 4 nor its "container" record 3 is kept.

 

regards,

Mark

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
cardinull
Fluorite | Level 6

@novinosrin@mkeintz

Woops! Mark is right. I just updated my post/questions to account for that.... Sorry for the confusion... Long day Smiley Embarassed

To add to Mark's case, I would want to keep one record if it still is a partial duplicate to other existing records. However, if only the two records exist for that combination of ID, date, and time, then I wouldn't keep those

 

In the case of the example by Mark, if we have data:

 

data dt;
input ID VAR1 VAR2 VAR3 VAR4 date time;
datalines;
1 1 1 1 2 9 15
1 2 . . 2 9 15
1 2 2 2 1 9 15
1 2 2 2 . 9 15     This is the extra record by Mark
2 1 1 2 2 10 20
2 1 . 2 3 10 20
2 3 2 2 . 10 20
2 1 1 2 3 10 20
3 . 1 3 4 9 15
3 2 1 . 4 9 15
;
run;

 

 

Then I would want a result of this where one of row 3 or 4 is kept, one of row 5 or 7 is kept, and neither of rows 9 and 10 are kept:

data dt;
input ID VAR1 VAR2 VAR3 VAR4 date time;
datalines;
1 1 1 1 2 9 15
1 2 . . 2 9 15
1 2 2 2 1 9 15
2 1 1 2 2 10 20
2 3 2 2 . 10 20
2 1 1 2 3 10 20
;
run;

 

Sorry for that Smiley Embarassed!

 

Jameson

novinosrin
Tourmaline | Level 20

Thank you @mkeintz  Hmm Got it. I sit corrected. Also where have you been in a while? I was looking to PM you anyway to say hello?

 

Hello @cardinull   Enough of fun earlier, this is more a sincere effort though I wish to test more effectively(feeling very lazy) but should get very close. I would appreciate you test and let us know where it is missing. If Mark is fresher this evening, He would of course charm you. But until then, you have something to play with

 


data dt;
input ID VAR1 VAR2 VAR3 VAR4 date time;
datalines;
1 1 1 1 2 9 15
1 2 . . 2 9 15
1 2 2 2 1 9 15
2 1 1 2 2 10 20
2 1 . 2 3 10 20
2 3 2 2 . 10 20
2 1 1 2 3 10 20
3	.	1	3	4	9	15
3	2	1	.	4	9	15
;
run;

data want;
if _n_=1 then do;
  dcl hash H () ;
  h.definekey  ("_cn") ;
  h.definedata ("_cn") ;
  h.definedone () ;
  dcl hiter hi('h');
end;
 do until(last.id);
  set dt;
  by id date time notsorted;
  array t var1-var4;
  _cn=cats(of t(*));
  _rc=h.add();
 end;
 do until(last.id);
  set dt end=z;
   by id date time notsorted;
   array temp(999999)$20 _temporary_;
   __cn=cats(of t(*));
   call missing(_cn);
   if __cn in temp then continue;
   do while(hi.next()=0);
   _c=.;
   if __cn ne _cn then do;
    do _n_=1 to length(__cn);
	 if char(_cn,_n_) ne '.' and  char(__cn,_n_) ne '.' then
	 if char(_cn,_n_) ne  char(__cn,_n_) then _c+1;
	end;
	if _c=. then do;n+1;temp(n)=_cn;temp(n+1)=__cn;end;
   end;
  end;
  if __cn not in temp then output;
  end;
 h.clear();
 call missing(of temp(*),n);
 drop _: n;
run;

  

cardinull
Fluorite | Level 6
@novinosrin

Thanks a bunch. I'll try and play with this and keep you both updated :)!
novinosrin
Tourmaline | Level 20

Yes sorry, didn't even shower/lunch today, all day bed and music/laptop. lol VERY Very Lazy day. I hate it.My apologies for not being sincere earlier.  🙂

 

I just removed the 2nd do until(last.id) to make it faster.

 

data dt;
input ID VAR1 VAR2 VAR3 VAR4 date time;
datalines;
1 1 1 1 2 9 15
1 2 . . 2 9 15
1 2 2 2 1 9 15
2 1 1 2 2 10 20
2 1 . 2 3 10 20
2 3 2 2 . 10 20
2 1 1 2 3 10 20
3	.	1	3	4	9	15
3	2	1	.	4	9	15
;
run;

data want;
if _n_=1 then do;
  dcl hash H () ;
  h.definekey  ("_cn") ;
  h.definedata ("_cn") ;
  h.definedone () ;
  dcl hiter hi('h');
end;
 do _iorc_=1 by 1 until(last.id);
  set dt;
  by id date time notsorted;
  array t var1-var4;
  _cn=cats(of t(*));
  _rc=h.add();
 end;
 do _iorc_=1 to _iorc_;
  set dt ;
   array temp(999999)$20 _temporary_;
   __cn=cats(of t(*));
   call missing(_cn);
   if __cn in temp then continue;
   do while(hi.next()=0);
   _c=.;
   if __cn ne _cn then do;
    do _n_=1 to length(__cn);
	 if char(_cn,_n_) ne '.' and  char(__cn,_n_) ne '.' then
	 if char(_cn,_n_) ne  char(__cn,_n_) then _c+1;
	end;
	if _c=. then do;n+1;temp(n)=_cn;temp(n+1)=__cn;end;
   end;
  end;
  if __cn not in temp then output;
  end;
 h.clear();
 call missing(of temp(*),n);
 drop _: n;
run;

EDIT:

Oh well, shouldn't the BY GROUP be ID DATE TIME with time being lowest(child group)

 

In that case you would change the BY id;

to BY ID DATE TIME; 

and so DO UNTIL(LAST.TIME);

 

I am a NUT. 

mkeintz
PROC Star

Yes, it can be done in SAS.  But first, what if you have these two observations?  Their non-missing values match exactly, and each has one missing value.  Do you keep both, neither, one of them?

 

ID VAR1_ VAR2_ VAR3_ VAR4_ date time
3 . 1 3 4 9 15
3 2 1 . 4 9 15
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
cardinull
Fluorite | Level 6

Thanks for a reply! I would keep neither.

mkeintz
PROC Star

You apparently want to exclude all records that can be construed as a "child" of another record.   And by "child" I mean a record such that

  1. all its non-missing values match the "parent" record
  2. has at least one missing value for which the parent has a non-missing value.

 

In that case, if you have a table of child records, you could then keep all of the original records EXCEPT for those child records.  So the real task is to take each original record, and generate all its child records.  Then do a proc SQL (using the EXCEPT operator) or a data step with merge.

 

Here's the SQL solution:

 

data dt;
  retain ID DATE TIME;
  input ID var1 var2 var3 var4 date time;
datalines;
1 1 1 1 2  9 15
1 2 . . 2  9 15
1 2 2 2 1  9 15
1 2 2 2 .  9 15   Extra record
2 1 1 2 2 10 20
2 1 . 2 3 10 20
2 3 2 2 . 10 20
2 1 1 2 3 10 20
run;


data child (drop=_: v i);
  set dt;
  length _strng $8; 
  _strng=catx('!',of var:);
  array var{4} var1-var4;
  array _v{4};
  do v=1 to 4; 
    _v{v}=var{v};
  end;

  do i=1 to 2**4 -1;
    call missing(of var:);
    if mod(i,2)>=1  then var{1}=_v{1};
	if mod(i,4)>=2  then var{2}=_v{2};
	if mod(i,8)>=4  then var{3}=_v{3};
	if mod(i,16)>=8 then var{4}=_v{4};
	if catx('!',of var:)^=_strng then output;
  end;
run;

proc sort data=child nodupkey;
  by id date time var: ;
run;

proc sql;
  create table want as
  select * from dt
  except select * from child;
quit;

Now when the data sets are large, I suspect the SQL solution can be expensive because each record in the original dataset has to be compare to possible ALL the records in the child dataset, which is much larger.  So the DATA step approach might be better, as in:

 

data dt;
  retain ID DATE TIME;
  input ID var1 var2 var3 var4 date time;
datalines;
1 1 1 1 2  9 15
1 2 . . 2  9 15
1 2 2 2 1  9 15
1 2 2 2 .  9 15   Extra record
2 1 1 2 2 10 20
2 1 . 2 3 10 20
2 3 2 2 . 10 20
2 1 1 2 3 10 20
run;

proc sort data=dt out=have; by id date time var: ; run;

data child (drop=_: v i);
  set dt;
  length _strng $8; 
  _strng=catx('!',of var:);
  array var{4} var1-var4;
  array _v{4};
  do v=1 to 4; 
    _v{v}=var{v};
  end;

  do i=1 to 2**4 -1;
    call missing(of var:);
    if mod(i,2)>=1  then var{1}=_v{1};
	if mod(i,4)>=2  then var{2}=_v{2};
	if mod(i,8)>=4  then var{3}=_v{3};
	if mod(i,16)>=8 then var{4}=_v{4};
	if catx('!',of var:)^=_strng then output;
  end;
run;

proc sort data=child nodupkey;
  by id date time var: ;
run;

data want2;
merge have  child (in=inchild);
  by id date time var: ;
  if inchild=0;
run;

Regards,

Mark

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
cardinull
Fluorite | Level 6
Thanks so much Mark! I was able to get something to work by tweaking this a bit :)!
hashman
Ammonite | Level 13

@mkeintz:

Mark, I'd suggest that it can be done much simpler (not to mention in a single step and a single pass) by using a hash table as a dynamic medium. Basically, what the program below does is adding an item to the table if the current record isn't a dupe relative to the current state of the table. Of course, below, the input is assumed sorted by [id,date,time]. 

data have ;                                                                                                                             
  input id date time v1-v4 ;                                                                                                            
  list ;                                                                                                                                
  cards ;                                                                                                                               
1   9  15  1  1  1  2                                                                                                                   
1   9  15  2  .  .  2                                                                                                                   
1   9  15  2  2  2  1                                                                                                                   
1   9  15  2  2  2  .                                                                                                                   
2  10  20  1  1  2  2                                                                                                                   
2  10  20  1  .  2  3                                                                                                                   
2  10  20  3  2  2  .                                                                                                                   
2  10  20  1  1  2  3                                                                                                                   
3   9  15  .  1  3  4                                                                                                                   
3   9  15  2  1  .  4                                                                                                                   
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data want (keep = id date time v:) ;                                                                                                    
  if _n_ = 1 then do ;                                                                                                                  
    dcl hash h (multidata:"y") ;                                                                                                        
    h.definekey ("_n_") ;                                                                                                               
    h.definedata ("v1", "v2", "v3", "v4") ;                                                                                             
    h.definedone () ;                                                                                                                   
    dcl hiter i ("h") ;                                                                                                                 
  end ;                                                                                                                                 
  do until (last.time) ;                                                                                                                
    set have ;                                                                                                                          
    by id date time ;                                                                                                                   
    array v v1-v4 ;                                                                                                                     
    array t t1-t4 ;                                                                                                                     
    do over v ;                                                                                                                         
      t = v ;                                                                                                                           
    end ;                                                                                                                               
    do while (i.next() = 0) ;                                                                                                           
      do over v ;                                                                                                                       
        if nmiss (v, t) then continue ;                                                                                                 
        if v ne t then leave ;                                                                                                          
      end ;                                                                                                                             
      if _i_ > dim (v) then _dup = 1 ;                                                                                                  
    end ;                                                                                                                               
    if _dup then continue ;                                                                                                             
    do over v ;                                                                                                                         
      v = t ;                                                                                                                           
    end ;                                                                                                                               
    h.add() ;                                                                                                                           
  end ;                                                                                                                                 
  if h.num_items > 1 then do while (i.next() = 0) ;                                                                                     
    output ;                                                                                                                            
  end ;                                                                                                                                 
  h.clear() ;                                                                                                                           
run ;                                               

Note that proper unduplication (as would be done, say, by proc SORT) would keep the key groups with a single record in the output. The only reason the last IF is used above is the weird nature of the OP's request dictating that if the V-variables in a key group are "same" from the standpoint of the specs, the entire group should be removed. 

 

That becomes a problem for a more concise hash solution not relying on the sorted order at all:

data _null_ ;                                                                                                                           
  if _n_ = 1 then do ;                                                                                                                  
    dcl hash h (dataset:"have(obs=0)", multidata:"y", ordered:"y") ;                                                                    
    h.definekey ("id", "date", "time") ;                                                                                                
    h.definedata (all:"y") ;                                                                                                            
    h.definedone () ;                                                                                                                   
  end ;                                                                                                                                 
  if z then h.output (dataset:"want") ;                                                                                                 
  set have end = z ;                                                                                                                    
  array v v1-v4 ;                                                                                                                       
  array t t1-t4 ;                                                                                                                       
  do over v ;                                                                                                                           
    t = v ;                                                                                                                             
  end ;                                                                                                                                 
  do while (h.do_over() = 0) ;                                                                                                          
    do over v ;                                                                                                                         
      if nmiss (v, t) then continue ;                                                                                                   
      if v ne t then leave ;                                                                                                            
    end ;                                                                                                                               
    if _i_ > dim (v) then _dup = 1 ;                                                                                                    
  end ;                                                                                                                                 
  if _dup then delete ;                                                                                                                 
  do over v ;                                                                                                                           
    v = t ;                                                                                                                             
  end ;                                                                                                                                 
  h.add() ;                                                                                                                             
run ;            

Evidently, the group with ID=3 is kept in the output, as "normal" unduplication would prescribe but contrary to the OP's specs. Of course, it can be handled in the step above; but it hardly worth complicating the code. Instead, it can be just post-processed either via SQL or another DATA step (since the hash output is sorted), e.g.:

data want ;                                    
  set want (keep = id date time in = _1) want ;
  by id date time ;                            
  if first.time then _iorc_ = 1 ;              
  else if _1    then _iorc_ + 1 ;              
  else if _iorc_ > 1 then output ;             
run ;                                          

Kind regards

Paul D. 

mkeintz
PROC Star

@hashman 

 

Very nice indeed ... but there's something I am not seeing.

 

What if an early observation in a BY group  (say {var1,var2,var3,var4}={1,.,3,.} gets put into the hash object H, because it is not preceded by a "parent".  And then later a "parent"  (say {1,2,3,.} is encountered.  I see where the new parent gets added to the hash object, but I don't see where the now-erroneous inclusion of {1,.,3,.} is deleted from the object.

 

Edited addition:  And I did think there would be a single-step hash solution, but I didn't think enough to think up one.  I'm glad you did.

 

Regards

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 15 replies
  • 1224 views
  • 5 likes
  • 4 in conversation