topic Re: Keeping Partial Duplicates but Ignoring Missings in SAS Programming

Keeping Partial Duplicates but Ignoring Missings

cardinull — Sat, 28 Sep 2019 22:39:59 GMT

Hello!

I want to create a list of partial duplicates - observations that have the same values for ID, date, and time, but different values in all other variables. The only condition is that I want to ignore differences due to missing values. For example, let's say I have the dataset:

ID	VAR1_	VAR2_	VAR3_	VAR4_	date	time
1	1	1	1	2	9	15
1	2	.	.	2	9	15
1	2	2	2	1	9	15
2	1	1	2	2	10	20
2	1	.	2	3	10	20
2	3	2	2	.	10	20
2	1	1	2	3	10	20
3	.	1	3	4	9	15
3	2	1	.	4	9	15

Edit: I just realized I framed this wrong. I'd ultimately want to keep only one of row 5 or 7 as they are partial duplicates to the other observations of ID=2, date=10, and time=20. However, I'd want to remove both rows with ID=3 as when one is absorbed by the other, it no longer has partial duplicates. Sorry for the confusion! :

ID	VAR1_	VAR2_	VAR3_	VAR4_	date	time
1	1	1	1	2	9	15
1	2	.	.	2	9	15
1	2	2	2	1	9	15
2	1	1	2	2	10	20
2	3	2	2	.	10	20
2	1	1	2	3	10	20

Is this possible in SAS? Is this not practical to code?

Below I have the SAS code to make the data:

data dt;
input ID VAR1 VAR2 VAR3 VAR4 date time;
datalines;
1 1 1 1 2 9 15
1 2 . . 2 9 15
1 2 2 2 1 9 15
2 1 1 2 2 10 20
2 1 . 2 3 10 20
2 3 2 2 . 10 20
2 1 1 2 3 10 20
3 . 1 3 4 9 15
3 2 1 . 4 9 15
;
run;

Thanks!

Re: Keeping Partial Duplicates but Ignoring Missings

novinosrin — Sat, 28 Sep 2019 18:06:22 GMT

Hi @cardinull I went with some guesses and trying to shame myself by attempting to be creative lol


data dt;
input ID VAR1 VAR2 VAR3 VAR4 date time;
datalines;
1 1 1 1 2 9 15
1 2 . . 2 9 15
1 2 2 2 1 9 15
2 1 1 2 2 10 20
2 1 . 2 3 10 20
2 3 2 2 . 10 20
2 1 1 2 3 10 20
;
run;

data temp;
 update dt(obs=0) dt;
 by id;
 rownum+1;
 output;
run;

proc sql;
create table want as
select b.*
from
(select *
from temp
group by ID, VAR1, VAR2, VAR3, VAR4, date, time
having count(*)=1) a
left join 
(select *,monotonic() as rownum from dt) b
on a.rownum=b.rownum
order by b.rownum;
quit;

Re: Keeping Partial Duplicates but Ignoring Missings

mkeintz — Sat, 28 Sep 2019 18:10:34 GMT

Yes, it can be done in SAS. But first, what if you have these two observations? Their non-missing values match exactly, and each has one missing value. Do you keep both, neither, one of them?

ID	VAR1_	VAR2_	VAR3_	VAR4_	date	time
3	.	1	3	4	9	15
3	2	1	.	4	9	15

Re: Keeping Partial Duplicates but Ignoring Missings

cardinull — Sat, 28 Sep 2019 18:11:30 GMT

Thanks for a reply! I would keep neither.

Re: Keeping Partial Duplicates but Ignoring Missings

cardinull — Sat, 28 Sep 2019 18:19:04 GMT

Lol! No shame! Being creative is the funnest part of figuring out SAS 🙂 Although with my clarification of mkeintz's quesiton, idk if this would change anything lol!

Re: Keeping Partial Duplicates but Ignoring Missings

novinosrin — Sat, 28 Sep 2019 19:16:26 GMT

Hi @cardinull Saturday lazy me will yet again go for another silly creative(just to show off) one using LOCF guesses. 🙂 Bear with me turning on some humor.


data dt;
input ID VAR1 VAR2 VAR3 VAR4 date time;
datalines;
1 1 1 1 2 9 15
1 2 . . 2 9 15
1 2 2 2 1 9 15
2 1 1 2 2 10 20
2 1 . 2 3 10 20
2 3 2 2 . 10 20
2 1 1 2 3 10 20
3	.	1	3	4	9	15
3	2	1	.	4	9	15
;
run;


data temp;
set dt;
array t var1--time;
call sortn(of t(*));
run;

data temp1;
 update temp(obs=0) temp;
 by id;
 rownum+1;
 if cmiss(of _all_)=0 then output;
run;

proc sql;
create table want as
select b.*
from
(select *
from temp1
group by ID, VAR1, VAR2, VAR3, VAR4, date, time
having count(*)=1) a
left join 
(select *,monotonic() as rownum from dt) b
on a.rownum=b.rownum
order by b.rownum;
quit;

I do have a linear clean approach in my mind, but I am sure that the giant seniors are most likely to offer a perfect solution

Re: Keeping Partial Duplicates but Ignoring Missings

mkeintz — Sat, 28 Sep 2019 20:32:13 GMT

@novinosrin

I don't think you suggestion works for much more than the data presented. Current, in the data

1 1 1 1 2 9 15
1 2 . . 2 9 15
1 2 2 2 1 9 15
2 1 1 2 2 10 20
2 1 . 2 3 10 20
2 3 2 2 . 10 20
2 1 1 2 3 10 20

your program keeps all three records with ID=1.

But add a 4th original record, as in

data dt;
input ID VAR1 VAR2 VAR3 VAR4 date time;
datalines;
1 1 1 1 2 9 15
1 2 . . 2 9 15
1 2 2 2 1 9 15
1 2 2 2 . 9 15     This is the extra record
2 1 1 2 2 10 20
2 1 . 2 3 10 20
2 3 2 2 . 10 20
2 1 1 2 3 10 20
;
run;

the additional record (#4) should be considered absorbed by the 3rd record, so WANT should be unchanged. But the result is that it actually REDUCES the number of ID=1 records, i.e. neither record 4 nor its "container" record 3 is kept.

regards,

Mark

Re: Keeping Partial Duplicates but Ignoring Missings

cardinull — Sat, 28 Sep 2019 22:03:28 GMT

@novinosrin @mkeintz

Woops! Mark is right. I just updated my post/questions to account for that.... Sorry for the confusion... Long day

To add to Mark's case, I would want to keep one record if it still is a partial duplicate to other existing records. However, if only the two records exist for that combination of ID, date, and time, then I wouldn't keep those

In the case of the example by Mark, if we have data:

data dt;
input ID VAR1 VAR2 VAR3 VAR4 date time;
datalines;
1 1 1 1 2 9 15
1 2 . . 2 9 15
1 2 2 2 1 9 15
1 2 2 2 . 9 15     This is the extra record by Mark
2 1 1 2 2 10 20
2 1 . 2 3 10 20
2 3 2 2 . 10 20
2 1 1 2 3 10 20
3 . 1 3 4 9 15
3 2 1 . 4 9 15
;
run;

Then I would want a result of this where one of row 3 or 4 is kept, one of row 5 or 7 is kept, and neither of rows 9 and 10 are kept:

data dt;
input ID VAR1 VAR2 VAR3 VAR4 date time;
datalines;
1 1 1 1 2 9 15
1 2 . . 2 9 15
1 2 2 2 1 9 15
2 1 1 2 2 10 20
2 3 2 2 . 10 20
2 1 1 2 3 10 20
;
run;

Sorry for that !

Jameson

Re: Keeping Partial Duplicates but Ignoring Missings

novinosrin — Sat, 28 Sep 2019 22:03:21 GMT

Thank you @mkeintz Hmm Got it. I sit corrected. Also where have you been in a while? I was looking to PM you anyway to say hello?

Hello @cardinull Enough of fun earlier, this is more a sincere effort though I wish to test more effectively(feeling very lazy) but should get very close. I would appreciate you test and let us know where it is missing. If Mark is fresher this evening, He would of course charm you. But until then, you have something to play with


data dt;
input ID VAR1 VAR2 VAR3 VAR4 date time;
datalines;
1 1 1 1 2 9 15
1 2 . . 2 9 15
1 2 2 2 1 9 15
2 1 1 2 2 10 20
2 1 . 2 3 10 20
2 3 2 2 . 10 20
2 1 1 2 3 10 20
3	.	1	3	4	9	15
3	2	1	.	4	9	15
;
run;

data want;
if _n_=1 then do;
  dcl hash H () ;
  h.definekey  ("_cn") ;
  h.definedata ("_cn") ;
  h.definedone () ;
  dcl hiter hi('h');
end;
 do until(last.id);
  set dt;
  by id date time notsorted;
  array t var1-var4;
  _cn=cats(of t(*));
  _rc=h.add();
 end;
 do until(last.id);
  set dt end=z;
   by id date time notsorted;
   array temp(999999)$20 _temporary_;
   __cn=cats(of t(*));
   call missing(_cn);
   if __cn in temp then continue;
   do while(hi.next()=0);
   _c=.;
   if __cn ne _cn then do;
    do _n_=1 to length(__cn);
	 if char(_cn,_n_) ne '.' and  char(__cn,_n_) ne '.' then
	 if char(_cn,_n_) ne  char(__cn,_n_) then _c+1;
	end;
	if _c=. then do;n+1;temp(n)=_cn;temp(n+1)=__cn;end;
   end;
  end;
  if __cn not in temp then output;
  end;
 h.clear();
 call missing(of temp(*),n);
 drop _: n;
run;

Re: Keeping Partial Duplicates but Ignoring Missings

cardinull — Sat, 28 Sep 2019 22:06:46 GMT

@novinosrin

Thanks a bunch. I'll try and play with this and keep you both updated :)!

Re: Keeping Partial Duplicates but Ignoring Missings

novinosrin — Sat, 28 Sep 2019 22:13:34 GMT

Yes sorry, didn't even shower/lunch today, all day bed and music/laptop. lol VERY Very Lazy day. I hate it.My apologies for not being sincere earlier. 🙂

I just removed the 2nd do until(last.id) to make it faster.

data dt;
input ID VAR1 VAR2 VAR3 VAR4 date time;
datalines;
1 1 1 1 2 9 15
1 2 . . 2 9 15
1 2 2 2 1 9 15
2 1 1 2 2 10 20
2 1 . 2 3 10 20
2 3 2 2 . 10 20
2 1 1 2 3 10 20
3	.	1	3	4	9	15
3	2	1	.	4	9	15
;
run;

data want;
if _n_=1 then do;
  dcl hash H () ;
  h.definekey  ("_cn") ;
  h.definedata ("_cn") ;
  h.definedone () ;
  dcl hiter hi('h');
end;
 do _iorc_=1 by 1 until(last.id);
  set dt;
  by id date time notsorted;
  array t var1-var4;
  _cn=cats(of t(*));
  _rc=h.add();
 end;
 do _iorc_=1 to _iorc_;
  set dt ;
   array temp(999999)$20 _temporary_;
   __cn=cats(of t(*));
   call missing(_cn);
   if __cn in temp then continue;
   do while(hi.next()=0);
   _c=.;
   if __cn ne _cn then do;
    do _n_=1 to length(__cn);
	 if char(_cn,_n_) ne '.' and  char(__cn,_n_) ne '.' then
	 if char(_cn,_n_) ne  char(__cn,_n_) then _c+1;
	end;
	if _c=. then do;n+1;temp(n)=_cn;temp(n+1)=__cn;end;
   end;
  end;
  if __cn not in temp then output;
  end;
 h.clear();
 call missing(of temp(*),n);
 drop _: n;
run;

EDIT:

Oh well, shouldn't the BY GROUP be ID DATE TIME with time being lowest(child group)

In that case you would change the BY id;

to BY ID DATE TIME;

and so DO UNTIL(LAST.TIME);

I am a NUT.

Re: Keeping Partial Duplicates but Ignoring Missings

mkeintz — Sun, 29 Sep 2019 02:52:23 GMT

You apparently want to exclude all records that can be construed as a "child" of another record. And by "child" I mean a record such that

all its non-missing values match the "parent" record
has at least one missing value for which the parent has a non-missing value.

In that case, if you have a table of child records, you could then keep all of the original records EXCEPT for those child records. So the real task is to take each original record, and generate all its child records. Then do a proc SQL (using the EXCEPT operator) or a data step with merge.

Here's the SQL solution:

data dt;
  retain ID DATE TIME;
  input ID var1 var2 var3 var4 date time;
datalines;
1 1 1 1 2  9 15
1 2 . . 2  9 15
1 2 2 2 1  9 15
1 2 2 2 .  9 15   Extra record
2 1 1 2 2 10 20
2 1 . 2 3 10 20
2 3 2 2 . 10 20
2 1 1 2 3 10 20
run;


data child (drop=_: v i);
  set dt;
  length _strng $8; 
  _strng=catx('!',of var:);
  array var{4} var1-var4;
  array _v{4};
  do v=1 to 4; 
    _v{v}=var{v};
  end;

  do i=1 to 2**4 -1;
    call missing(of var:);
    if mod(i,2)>=1  then var{1}=_v{1};
	if mod(i,4)>=2  then var{2}=_v{2};
	if mod(i,8)>=4  then var{3}=_v{3};
	if mod(i,16)>=8 then var{4}=_v{4};
	if catx('!',of var:)^=_strng then output;
  end;
run;

proc sort data=child nodupkey;
  by id date time var: ;
run;

proc sql;
  create table want as
  select * from dt
  except select * from child;
quit;

Now when the data sets are large, I suspect the SQL solution can be expensive because each record in the original dataset has to be compare to possible ALL the records in the child dataset, which is much larger. So the DATA step approach might be better, as in:

data dt;
  retain ID DATE TIME;
  input ID var1 var2 var3 var4 date time;
datalines;
1 1 1 1 2  9 15
1 2 . . 2  9 15
1 2 2 2 1  9 15
1 2 2 2 .  9 15   Extra record
2 1 1 2 2 10 20
2 1 . 2 3 10 20
2 3 2 2 . 10 20
2 1 1 2 3 10 20
run;

proc sort data=dt out=have; by id date time var: ; run;

data child (drop=_: v i);
  set dt;
  length _strng $8; 
  _strng=catx('!',of var:);
  array var{4} var1-var4;
  array _v{4};
  do v=1 to 4; 
    _v{v}=var{v};
  end;

  do i=1 to 2**4 -1;
    call missing(of var:);
    if mod(i,2)>=1  then var{1}=_v{1};
	if mod(i,4)>=2  then var{2}=_v{2};
	if mod(i,8)>=4  then var{3}=_v{3};
	if mod(i,16)>=8 then var{4}=_v{4};
	if catx('!',of var:)^=_strng then output;
  end;
run;

proc sort data=child nodupkey;
  by id date time var: ;
run;

data want2;
merge have  child (in=inchild);
  by id date time var: ;
  if inchild=0;
run;

Regards,

Mark

Re: Keeping Partial Duplicates but Ignoring Missings

cardinull — Tue, 01 Oct 2019 00:51:55 GMT

Thanks so much Mark! I was able to get something to work by tweaking this a bit :)!

Re: Keeping Partial Duplicates but Ignoring Missings

hashman — Thu, 03 Oct 2019 05:06:09 GMT

@mkeintz:

Mark, I'd suggest that it can be done much simpler (not to mention in a single step and a single pass) by using a hash table as a dynamic medium. Basically, what the program below does is adding an item to the table if the current record isn't a dupe relative to the current state of the table. Of course, below, the input is assumed sorted by [id,date,time].

data have ;                                                                                                                             
  input id date time v1-v4 ;                                                                                                            
  list ;                                                                                                                                
  cards ;                                                                                                                               
1   9  15  1  1  1  2                                                                                                                   
1   9  15  2  .  .  2                                                                                                                   
1   9  15  2  2  2  1                                                                                                                   
1   9  15  2  2  2  .                                                                                                                   
2  10  20  1  1  2  2                                                                                                                   
2  10  20  1  .  2  3                                                                                                                   
2  10  20  3  2  2  .                                                                                                                   
2  10  20  1  1  2  3                                                                                                                   
3   9  15  .  1  3  4                                                                                                                   
3   9  15  2  1  .  4                                                                                                                   
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data want (keep = id date time v:) ;                                                                                                    
  if _n_ = 1 then do ;                                                                                                                  
    dcl hash h (multidata:"y") ;                                                                                                        
    h.definekey ("_n_") ;                                                                                                               
    h.definedata ("v1", "v2", "v3", "v4") ;                                                                                             
    h.definedone () ;                                                                                                                   
    dcl hiter i ("h") ;                                                                                                                 
  end ;                                                                                                                                 
  do until (last.time) ;                                                                                                                
    set have ;                                                                                                                          
    by id date time ;                                                                                                                   
    array v v1-v4 ;                                                                                                                     
    array t t1-t4 ;                                                                                                                     
    do over v ;                                                                                                                         
      t = v ;                                                                                                                           
    end ;                                                                                                                               
    do while (i.next() = 0) ;                                                                                                           
      do over v ;                                                                                                                       
        if nmiss (v, t) then continue ;                                                                                                 
        if v ne t then leave ;                                                                                                          
      end ;                                                                                                                             
      if _i_ > dim (v) then _dup = 1 ;                                                                                                  
    end ;                                                                                                                               
    if _dup then continue ;                                                                                                             
    do over v ;                                                                                                                         
      v = t ;                                                                                                                           
    end ;                                                                                                                               
    h.add() ;                                                                                                                           
  end ;                                                                                                                                 
  if h.num_items > 1 then do while (i.next() = 0) ;                                                                                     
    output ;                                                                                                                            
  end ;                                                                                                                                 
  h.clear() ;                                                                                                                           
run ;

Note that proper unduplication (as would be done, say, by proc SORT) would keep the key groups with a single record in the output. The only reason the last IF is used above is the weird nature of the OP's request dictating that if the V-variables in a key group are "same" from the standpoint of the specs, the entire group should be removed.

That becomes a problem for a more concise hash solution not relying on the sorted order at all:

data _null_ ;                                                                                                                           
  if _n_ = 1 then do ;                                                                                                                  
    dcl hash h (dataset:"have(obs=0)", multidata:"y", ordered:"y") ;                                                                    
    h.definekey ("id", "date", "time") ;                                                                                                
    h.definedata (all:"y") ;                                                                                                            
    h.definedone () ;                                                                                                                   
  end ;                                                                                                                                 
  if z then h.output (dataset:"want") ;                                                                                                 
  set have end = z ;                                                                                                                    
  array v v1-v4 ;                                                                                                                       
  array t t1-t4 ;                                                                                                                       
  do over v ;                                                                                                                           
    t = v ;                                                                                                                             
  end ;                                                                                                                                 
  do while (h.do_over() = 0) ;                                                                                                          
    do over v ;                                                                                                                         
      if nmiss (v, t) then continue ;                                                                                                   
      if v ne t then leave ;                                                                                                            
    end ;                                                                                                                               
    if _i_ > dim (v) then _dup = 1 ;                                                                                                    
  end ;                                                                                                                                 
  if _dup then delete ;                                                                                                                 
  do over v ;                                                                                                                           
    v = t ;                                                                                                                             
  end ;                                                                                                                                 
  h.add() ;                                                                                                                             
run ;

Evidently, the group with ID=3 is kept in the output, as "normal" unduplication would prescribe but contrary to the OP's specs. Of course, it can be handled in the step above; but it hardly worth complicating the code. Instead, it can be just post-processed either via SQL or another DATA step (since the hash output is sorted), e.g.:

data want ;                                    
  set want (keep = id date time in = _1) want ;
  by id date time ;                            
  if first.time then _iorc_ = 1 ;              
  else if _1    then _iorc_ + 1 ;              
  else if _iorc_ > 1 then output ;             
run ;

Kind regards

Paul D.

Re: Keeping Partial Duplicates but Ignoring Missings

mkeintz — Thu, 03 Oct 2019 19:49:21 GMT

@hashman

Very nice indeed ... but there's something I am not seeing.

What if an early observation in a BY group (say {var1,var2,var3,var4}={1,.,3,.} gets put into the hash object H, because it is not preceded by a "parent". And then later a "parent" (say {1,2,3,.} is encountered. I see where the new parent gets added to the hash object, but I don't see where the now-erroneous inclusion of {1,.,3,.} is deleted from the object.

Edited addition: And I did think there would be a single-step hash solution, but I didn't think enough to think up one. I'm glad you did.

Regards

Re: Keeping Partial Duplicates but Ignoring Missings

hashman — Fri, 04 Oct 2019 00:30:41 GMT

@mkeintz:

Mark, I may be overlooking something, but to my mind the logic is no different in principle from the following (not using FIND/C function deliberately):

data _null_ ;                                           
  array rec $1 r1-r8 ("1","2","1","3","4","3","5","2") ;
  length hash $ 8 ;                                     
  do over rec ;                                         
    dup = . ;                                           
    do i = 1 to length (hash) ;                         
      if char (hash, i) = rec then dup = 1 ;            
    end ;                                               
    if not dup then hash = cats (hash, rec) ;           
  end ;                                                 
  put hash= ;                                           
run ;

Which results in:

hash=12345

The differences are that the real comparison condition is more complex than:

char (hash, i) = rec

and the hash table is used instead of the string HASH. So, basically, we're adding the hash items while eliminating the post-coming dupes in the process, so there's no need to delete anything from the hash when there're no more records.

Kind regards

Paul D.