DATA Step, Macro, Functions and more

Flagging Duplicates across multiple Rows

Reply
Contributor
Posts: 22

Flagging Duplicates across multiple Rows

Hi,

I have a dataset as follows:

 

ID  Var1  Var2 Var3 Var4 Var5

1      Y       Y      Y      Y     N

1      Y       Y      Y      Y     N

1      N       N      Y      N      

2      N       N      N      N    Y

2      Y       Y      N      N     Y

 

I want to identify the similarities and differences between each variable (Var1-Var5) by ID.

eg: For ID1, obs 1 and 2 are same for all variables,  but obs3 is different in terms of Var1, Var2, Var4 and Var5.

I am not sure how I will be able to flag each of these differences without making the code too long.

 

Thank you for your help.

Super User
Posts: 11,343

Re: Flagging Duplicates across multiple Rows

Posted in reply to nehuvarma

Please show what you  expect the output for that input to look like. With 5 variables and apparenlty 3 values (including missing) you have 243 possible value combinations to consider.

 

Also if an Id has 10 records are each of the 10 to compared to all of the other records for the id, only the first or some other rule? You potentially are dealing with factorial numbers of comparisons which depending on how many records per ID can lead to extremely large numbers of comparisons.

Trusted Advisor
Posts: 1,022

Re: Flagging Duplicates across multiple Rows

Posted in reply to nehuvarma

 

I assume the dataset is already sorted by ID. 

 

Are duplicates always contiguous?  If so, then this simple program is all you need:

 

data want;
  set have;
  by ID var1-var5 notsorted;
  if first.var5 then first_obs_this_pattern=1;
  else first_obs_this_pattern=0;
run;

 

Notes:

  1. The dataset has to be sorted by ID, and grouped (but not neccessarily in ascending order) by var1-var5 within each ID.  That's why the NOTSORTED parameter is used in the BY statement.
  2. The program creates a dummy var FIRST_OBS_THIS_PATTERN as 1 for the fist instance of a given patten, and 0 otherwise.  The 0's are your duplicates.
  3. I use "first.var5" as the indicator of a new group starting, because VAR5 is the declared as the "least-significant" variable in the BY-group definition.   If any of the more significant by-vars change then first.var5=1 (even if var5 does not change).

 

If the data are not pre-grouped by VAR1-VAR5 and sorting is not expsensive (i.e. data set is not big), then this works:

 

proc sort data=have out=need;
  by ID var1-var5;
run;

data want;
  set need;
  by ID var1-var5 ;
  if first.var5 then first_obs_this_pattern=1;
  else first_obs_this_pattern=0;
run;

 

NOtes:

  1. This approach throws away the original order of records within each ID.
  2. Don't need the NOTSORTED parameter this time.

 

 

If (1) the data set is so big that you want to avoid sorting, and (2) you don't have too many records per ID, then a little more programming as in:

 

data want;

  array ny_list {10} $20 _temporary_;

  set have;
  by id;

  length ny_test $20 ;
  ny_test=cat(of var:);

  if first.id then N_patterns=0;
  retain N_patterns;
  
  do I=1 by 1 while (I<=N_patterns);
    if ny_test=ny_list{I} then leave;
  end;

  if I>N_patterns then do;
    first_obs_this_pattern=1;
    N_patterns+1;
    ny_list{N_patterns}=ny_test;
  end;
  else first_obs_this_pattern=0;

  drop ny_test N_patterns I;
run;

 

NOtes:

  1. The $20 length assigned to each element of the NY_TEST array, and also to the NY_TEST variable, is chosen assuming that 20 bytes is as long as the concatenation of VAR1 through VAR5.  If each of VAR1 through VAR5 is, say, 10 bytes, then use $50.
  2. The NY_LIST array is set to have a maximum of 10 entries, so up to 10 unique patterns per id can be accomodated.  If you have, say 20 obs for some ID's then declare the array to have 20 elements.  But at some point, as @ballardw points out, you'll be doing too many comparions, in which cases I'd suggest the SORT solution above.
  3. NY_LIST array is declared as _TEMPORARY_, meaning (a) it is not output to data want, and (b) its values are retained from record to record - i.e. a history is maintained.  The program doesn't blank out the NY_LIST array at the beginning of each ID group, because setting N=0 is good enough to prevent contamination from preceding ID's.
Ask a Question
Discussion stats
  • 2 replies
  • 83 views
  • 0 likes
  • 3 in conversation