Solved: Compare rows in a dataset

Anita_n · Posted 08-06-2024 09:13 AM

Dear all,

I have a table that looks like this

data have;
input id $3 pnr pid time cn var1 var2 var3 var4 var5;
datalines;
1 12 2 1 1 5 6 7 3 4
1 12 2 1 2 5 7 9 3 5
1 12 2 2 2 3 8 5 1 5
1 12 2 2 2 3 8 7 1 6
;
run;

I want to first compare the values in the rows to see if id, pnr, pid and time are equal. If yes then compare var1 to var5 if the are the same.

Tom · Posted 08-07-2024 12:29 PM

So trying to pull key information from your post it looks like groups in the data is identified by the values of three variables: id pnr pid

Each group could appear once or twice. (Are you sure there is never a third or fourth observation in a group?)

So split the data into three files. The singletons. The first and the second.

data single first second;
  set have;
  by id pnr pid;
  if first.pid and last.pid then output single;
  else if first.pid then output first;
  else output second;
run;

So the dataset SINGLE is the answer to which ones have not had what we used to call second pass data entry.

You can then PROC COMPARE with the other two datasets to see which records had differences.

View solution in original post

ballardw · Posted 08-06-2024 09:31 AM

What would the result of "comparing" multiple observations (data sets have observations, not rows) for 5 variables look like? What information do you expect to see in the result?

And in your "real" data are all of your Var1 through Var5 variables actually numeric? With numeric variables the Range statistic is useful as a result of 0 means "all non-missing values are the same".

For example

proc summary data=have nway;
   class id pnr pid time;
   var var1-var5;
   output out=useful (drop=_type_) range=;
run;

pr id=1 pnr=2 pid=2 and time=1 then the range of Var1=0 and Var4=0 meaning the same for all the observations with those values of your grouping variable.

For time=2 then var1, var2 and var4 have range of 0 meaning the same.

IF you might have missing values for Var1 - Var5 you would want to look at the NMISS statistic as well as missing values are excluded from range calculations.

Anita_n · Posted 08-06-2024 09:57 AM

Sorry, var1 to var5 are not numeric

data have;
 input id $3 pnr pid time cn var1 $ var2 $ var3 $ var4 $ var5 $ ;
datalines;
1   12  2  1 1 5 6 7 3 4
1   12  2  1 2 5 7 9 3 5
1   12  2  2 2 3 8 5 1 5
1   12  2  2 2 3 8 7 1 6
;
run;

and the results should look like this

data want;
 input id $3 pnr pid time cn var1 $ flag1 var2 $ flag2 var3 $ flag3  var4 $ flag4 var5 $ flag5 ;
datalines;
1   12  2  1 1 5 1 6 0 7 0 3 1 4 0
1   12  2  1 2 5 1 7 0 9 0 3 1 5 0
1   12  2  2 2 3 1 8 1 5 0 1 1 5 0
1   12  2  2 2 3 1 8 1 7 0 1 1 6 0
;
run;

It's just that two different people entered this data at different times. And the values they entered should be same. So, the idea behind it is to check if the values entered by the two different people are the same, if not then the data should be corrected. If values have not been entered at time point(time) then this should be done. Time is either 1 or 2 , cn is the control which is also 1 or 2. So at the end I will like to print all id where error occured during entry. And also all ids with missing entries. That is why I wanted to use the flags (1=same, 0=not the same)

ballardw · Posted 08-06-2024 10:28 AM

Since you have a requirement " all ids with missing entries" then your example data should include at least one example of what "missing entry" means to you, point it or them out so we can see them and indicate what the output looks like for a missing entry.

If your example data included a "who entered the data" variable that might help such as identifying which combination of ID variables were only entered by one person...

Tom · Posted 08-06-2024 12:38 PM

The easiest way to compare variable values is between datasets.

If sounded from your description like the data is entered by two different individuals. So if you can split the data into two groups you can use PROC COMPARE to compare them. Does one of those variables represent the person that entered the data? If so then us it to split the data.

Otherwise you could just split between the first and second records for each set of ID variables.

Which of those variables are the ID (or BY) variables that uniquely identify the observations you want to compare? To me it looks like the first 4 variables from ID to TIME.

Anita_n · Posted 08-06-2024 03:15 PM

@Tom I don't want to split the data. The data shouldn't be splitted. Am only suppose to compare what person A and B entered if they are the same or not

Tom · Posted 08-06-2024 03:19 PM

Which variable indicates if it was PERSON A or PERSON B?

Note you don't have to make a second copy of the dataset if you can use a WHERE= dataset option to select the appropriate observations.

proc compare 
  data=have(where=(person='A'))
  compare=have(where=(person='B'))
;
  id id--time;
run;

Tom · Posted 08-06-2024 03:26 PM

If you are POSITIVE there are always two observations per BY group then this code will compare the second to the first for all 5 of your variables and then remerge the results back onto both observations.

data diff;
do until (last.time);
  set have;
  by id--time ;
  array x var1-var5 ;
  array y flag1-flag5 ;
  do index=1 to dim(x);
    y[index] = x[index] = lag(x[index]);
  end;
end;
do until (last.time);
  set have;
  by id--time ;
  output;
end;
  drop index;
run;

Anita_n · Posted 08-06-2024 04:25 PM

@Tom This is the error message I get when using this code on the original data

ERROR: Array subscript out of range at row 411 column 5.

Tom · Posted 08-06-2024 07:09 PM

@Anita_n wrote:

@Tom This is the error message I get when using this code on the original data

ERROR: Array subscript out of range at row 411 column 5.

Learn how to read the SAS log. Look at what statement was shown on LINE number 411 of your SAS log. What part of the statement was in the fifth column on that line of code?

If you ran the code I provided the only way that the INDEX could be invalid is if you did not define the array of FLAG variables with the same number (or more) of members as the array of original variables.

Astounding · Posted 08-06-2024 02:16 PM

First, get rid of the "3" in the INPUT statement:

input id $3

Just a dollar sign would be sufficient. The "3" tells SAS to take the contents of column 3, which it always blank.

Then sort your data:

proc sort data=have;
   by id pnr pid time cn var1 var2 var3 var4 var5;
run;

Then pick out the observations that need to be investigated:

data want;
   set have;
   by id pnr pid time cn var1 var2 var3 var4 var5;
   if first.var5 and last.var5 then delete;
run;

It's untested code, but easy enough to test since you have the data. This just picks out the observations, but doesn't tell you where the differences lie.

Follow-up: right idea but wrong logic will post an update later.

Anita_n · Posted 08-06-2024 03:12 PM

Sorry, this did not work even using the test data

Quentin · Posted 08-06-2024 03:49 PM

While I would go with PROC COMPARE, PROC SQL may be another option. For numeric values, you can use the RANGE() function to make sure the values match.

proc sql ;
  select id, pnr, pid, time, cn
        ,var1
        ,range(input(var1,8.))>0 as flag1
        ,var3
        ,range(input(var3,8.))>0 as flag3
  from have 
  group by id, pnr, pid, time
  ;
quit ;

You'd need to think about how to handle any missing values.

I don't know if there is an easy way to compare character values the same way. Possibly you could try using the RANK() function if your values are only one character. Maybe someone else has a better idea. A hashing function, perhaps?

The Boston Area SAS Users Group (BASUG) is hosting our in person SAS Blowout on Oct 18!
This full-day event in Cambridge, Mass features four presenters from SAS, presenting on a range of SAS 9 programming topics. Pre-registration by Oct 15 is required.
Full details and registration info at https://www.basug.org/events.

Anita_n · Posted 08-06-2024 04:28 PM

@Quentin Is there any short way to list the variables. I have more than 300 variables to compare. To list each single variable will not be funny

Quentin · Posted 08-07-2024 06:56 AM

@Anita_n wrote:

@Quentin Is there any short way to list the variables. I have more than 300 variables to compare. To list each single variable will not be funny

Are you familiar with the macro language? That's the approach I use most often for code generation. So I would write a macro which has a parameter for the list of variables, then the macro would return a block of code like:

 ,range(input(var1,8.))>0 as var1flag
 ,range(input(var2,8.))>0 as var2flag
 ,range(input(var3,8.))>0 as var3flag

If you're familiar with the macro language, please try writing a macro like that, and if you have problems, post your macro and people can help.

Or if you're familiar with other code generation approaches (e.g. CALL EXECUTE, or using DATA step to write a .sas file that you %include), those approaches could work as well. There are lots of different ways to generate code in SAS.

That said, I think probably PROC COMPARE is the better way to go. If I decided to do it with SQL or DATA step, I would probably validate my results with PROC COMPARE, just because I trust PROC COMPARE to do its job correctly.

Did you try PROC COMPARE? What went wrong with that approach?

The Boston Area SAS Users Group (BASUG) is hosting our in person SAS Blowout on Oct 18!
This full-day event in Cambridge, Mass features four presenters from SAS, presenting on a range of SAS 9 programming topics. Pre-registration by Oct 15 is required.
Full details and registration info at https://www.basug.org/events.

Compare rows in a dataset

Re: Compare rows in a dataset

Re: Compare rows in a dataset

Re: Compare rows in a dataset

Re: Compare rows in a dataset

Re: Compare rows in a dataset

Re: Compare rows in a dataset

Re: Compare rows in a dataset

Re: Compare rows in a dataset

Re: Compare rows in a dataset

Re: Compare rows in a dataset

Re: Compare rows in a dataset

Re: Compare rows in a dataset

Re: Compare rows in a dataset

Re: Compare rows in a dataset

Re: Compare rows in a dataset

SAS Innovate 2025: Call for Content

Classroom Training Available!