BookmarkSubscribeRSS Feed
analyst_work
Obsidian | Level 7

Hi,

I have a dataset as follows:

 

ID  Var1  Var2 Var3 Var4 Var5

1      Y       Y      Y      Y     N

1      Y       Y      Y      Y     N

1      N       N      Y      N      

2      N       N      N      N    Y

2      Y       Y      N      N     Y

 

I want to identify the similarities and differences between each variable (Var1-Var5) by ID.

eg: For ID1, obs 1 and 2 are same for all variables,  but obs3 is different in terms of Var1, Var2, Var4 and Var5.

I am not sure how I will be able to flag each of these differences without making the code too long.

 

Thank you for your help.

2 REPLIES 2
ballardw
Super User

Please show what you  expect the output for that input to look like. With 5 variables and apparenlty 3 values (including missing) you have 243 possible value combinations to consider.

 

Also if an Id has 10 records are each of the 10 to compared to all of the other records for the id, only the first or some other rule? You potentially are dealing with factorial numbers of comparisons which depending on how many records per ID can lead to extremely large numbers of comparisons.

mkeintz
PROC Star

 

I assume the dataset is already sorted by ID. 

 

Are duplicates always contiguous?  If so, then this simple program is all you need:

 

data want;
  set have;
  by ID var1-var5 notsorted;
  if first.var5 then first_obs_this_pattern=1;
  else first_obs_this_pattern=0;
run;

 

Notes:

  1. The dataset has to be sorted by ID, and grouped (but not neccessarily in ascending order) by var1-var5 within each ID.  That's why the NOTSORTED parameter is used in the BY statement.
  2. The program creates a dummy var FIRST_OBS_THIS_PATTERN as 1 for the fist instance of a given patten, and 0 otherwise.  The 0's are your duplicates.
  3. I use "first.var5" as the indicator of a new group starting, because VAR5 is the declared as the "least-significant" variable in the BY-group definition.   If any of the more significant by-vars change then first.var5=1 (even if var5 does not change).

 

If the data are not pre-grouped by VAR1-VAR5 and sorting is not expsensive (i.e. data set is not big), then this works:

 

proc sort data=have out=need;
  by ID var1-var5;
run;

data want;
  set need;
  by ID var1-var5 ;
  if first.var5 then first_obs_this_pattern=1;
  else first_obs_this_pattern=0;
run;

 

NOtes:

  1. This approach throws away the original order of records within each ID.
  2. Don't need the NOTSORTED parameter this time.

 

 

If (1) the data set is so big that you want to avoid sorting, and (2) you don't have too many records per ID, then a little more programming as in:

 

data want;

  array ny_list {10} $20 _temporary_;

  set have;
  by id;

  length ny_test $20 ;
  ny_test=cat(of var:);

  if first.id then N_patterns=0;
  retain N_patterns;
  
  do I=1 by 1 while (I<=N_patterns);
    if ny_test=ny_list{I} then leave;
  end;

  if I>N_patterns then do;
    first_obs_this_pattern=1;
    N_patterns+1;
    ny_list{N_patterns}=ny_test;
  end;
  else first_obs_this_pattern=0;

  drop ny_test N_patterns I;
run;

 

NOtes:

  1. The $20 length assigned to each element of the NY_TEST array, and also to the NY_TEST variable, is chosen assuming that 20 bytes is as long as the concatenation of VAR1 through VAR5.  If each of VAR1 through VAR5 is, say, 10 bytes, then use $50.
  2. The NY_LIST array is set to have a maximum of 10 entries, so up to 10 unique patterns per id can be accomodated.  If you have, say 20 obs for some ID's then declare the array to have 20 elements.  But at some point, as @ballardw points out, you'll be doing too many comparions, in which cases I'd suggest the SORT solution above.
  3. NY_LIST array is declared as _TEMPORARY_, meaning (a) it is not output to data want, and (b) its values are retained from record to record - i.e. a history is maintained.  The program doesn't blank out the NY_LIST array at the beginning of each ID group, because setting N=0 is good enough to prevent contamination from preceding ID's.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 1054 views
  • 0 likes
  • 3 in conversation