Re: flagging duplicate data

Rcrowder · Posted 08-28-2019 02:08 PM

I have data where subjects come in for three separate visits during a study to track health issues.

I need to flag data that repeats the same issues for the same visit date of each subject.

EX:

Sub Visit Issue

1 1 A

1 2 B

1 2 C

1 3 A

2 1 A

2 2 C

2 2 B

2 2 C

2 3 D

Subject 2's second visit should be flagged because there are 2 issues C's posted in the data set.

novinosrin · Posted 08-28-2019 02:16 PM

Can you please post the expected output sample for the input sample? It leads me/perhaps some of us to assumptions that we can avoid

ballardw · Posted 08-28-2019 02:36 PM

@Rcrowder wrote:

I have data where subjects come in for three separate visits during a study to track health issues.

I need to flag data that repeats the same issues for the same visit date of each subject.

EX:

Sub Visit Issue

1 1 A

1 2 B

1 2 C

1 3 A

2 1 A

2 2 C

2 2 B

2 2 C

2 3 D

Subject 2's second visit should be flagged because there are 2 issues C's posted in the data set.

Best is to provide example data in the form of a data step so we test code. Example:

data work.have;
input Sub       Visit    Issue  $;
datalines;
1            1         A
1            2         B
1            2         C
1            3         A
2            1         A
2            2         C
2            2         B
2            2         C
2            3         D
;

Paste into a code box opened with the forum's {I} or "running man" icons to preserve formatting. the message windows will reformat text and sometimes the reformatting will change the data enough that the data step will not run.

One easy way to get just the problem records:

proc freq data=work.have noprint;
   tables sub*visit*issue/out=work.issuecount(where=(count>1));
run;

Or to set a flag, 1=True 0=False for duplicates:

proc sort data=work.have;
   by sub visit issue;
run;
data work.want;
   set work.have;
   by sub visit issue;
   flag=  not first.issue ;
run;

A By statement in the data set creates a number of Boolean variables named First. and Last. for each variable on the by statement. If the record is the first at a given level then it is True (or 1). So if something is not the first record with the same Issue value then negating that variable gets the flag.

hashman · Posted 08-28-2019 03:52 PM

@Rcrowder:

Methinks the simplest is to use the SAS SQL's self-merge facility:

data have ;                                               
  input Sub Visit Issue :$1. ;                            
  cards ;                                                 
1  1  A                                                   
1  2  B                                                   
1  2  C                                                   
1  3  A                                                   
2  1  A                                                   
2  2  C                                                   
2  2  B                                                   
2  2  C                                                   
2  3  D                                                   
;                                                         
run ;                                                     
                                                          
proc sql ;                                                
  create table want as                                       
  select *, count(issue) > count(distinct issue) as flag
  from   have                                             
  group  sub, visit                                       
  ;                                                       
quit ;

Note: The original order of the records within (sub,visit) may change, for example:

Sub    Visit    Issue    flag
-----------------------------
 1       1        A        0 
 1       2        C        0 
 1       2        B        0 
 1       3        A        0 
 2       1        A        0 
 2       2        B        1 
 2       2        C        1 
 2       2        C        1 
 2       3        D        0

But since it's SQL, this is an expected behavior.

OTOH, since your data appear to be sorted by (sub,visit), you can get what you want and preserve the original record order using a simple double DoW loop:

data want (drop = _:) ;                               
  do _n_ = 1 by 1 until (last.visit) ;    
    set have ;                            
    by sub visit ;                        
    if _n_ = 1 then _issue = issue ;      
    else if issue = _issue then flag = 1 ;
  end ;                                   
  do _n_ = 1 to _n_ ;                     
    set have ;                            
    flag = ^^ flag ;                      
    output ;                              
  end ;                                   
run ;

What you get as a result in the output is:

Sub    Visit    Issue    flag
-----------------------------
 1       1        A        0 
 1       2        B        0 
 1       2        C        0 
 1       3        A        0 
 2       1        A        0 
 2       2        C        1 
 2       2        B        1 
 2       2        C        1 
 2       3        D        0

However, your specs are rather vague, so there's a chance you want to flag only the records with the duplicate values of ISSUE rather than the entire (sub,visit) group where they are found. If this is the case, the hash object comes in handy, especially since its use doesn't assume that the input file is sorted - and it also preserves the original record order:

data want (drop = _:) ;                     
  if _n_ = 1 then do ;                      
    dcl hash h () ;                         
    h.definekey  ("sub", "visit", "issue") ;
    h.definedata ("_q") ;                   
    h.definedone () ;                       
    do until (z) ;                          
      set have end = z ;                    
      if h.find() ne 0 then _q = 1 ;        
      else _q + 1 ;                         
      h.replace() ;                         
    end ;                                   
  end ;                                     
  set have ;                                
  h.find() ;                                
  flag = _q > 1 ;                           
run ;

As a result, you'll get:

Sub    Visit    Issue    flag
-----------------------------
 1       1        A        0 
 1       2        B        0 
 1       2        C        0 
 1       3        A        0 
 2       1        A        0 
 2       2        C        1 
 2       2        B        0 
 2       2        C        1 
 2       3        D        0

Kind regards

Paul D.

flagging duplicate data