BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Anita_n
Pyrite | Level 9

Dear all, 

I have a table that looks like this 

data have;
input id $3 pnr pid time cn var1 var2 var3 var4 var5;
datalines;
1 12 2 1 1 5 6 7 3 4
1 12 2 1 2 5 7 9 3 5
1 12 2 2 2 3 8 5 1 5
1 12 2 2 2 3 8 7 1 6
;
run;

I want to first compare the values in the rows to see if id, pnr, pid and time are equal. If yes then  compare var1 to var5 if the are the same. 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

So trying to pull key information from your post it looks like groups in the data is identified by the values of three variables:  id pnr pid

 

Each group could appear once or twice. (Are you sure there is never a third or fourth observation in a group?)

 

So split the data into three files.  The singletons.  The first and the second.

data single first second;
  set have;
  by id pnr pid;
  if first.pid and last.pid then output single;
  else if first.pid then output first;
  else output second;
run;

So the dataset SINGLE is the answer to which ones have not had what we used to call second pass data entry.

 

You can then PROC COMPARE with the other two datasets to see which records had differences.

 

View solution in original post

20 REPLIES 20
ballardw
Super User

What would the result of "comparing" multiple observations (data sets have observations, not rows) for 5 variables look like? What information do you expect to see in the result?

 

And in your "real" data are all of your Var1 through Var5 variables actually numeric? With numeric variables the Range statistic is useful as a result of 0 means "all non-missing values are the same".

For example

proc summary data=have nway;
   class id pnr pid time;
   var var1-var5;
   output out=useful (drop=_type_) range=;
run;

pr id=1 pnr=2 pid=2 and time=1 then the range of Var1=0 and Var4=0 meaning the same for all the observations with those values of your grouping variable.

For time=2 then var1, var2 and var4 have range of 0 meaning the same.

 

IF you might have missing values for Var1 - Var5 you would want to look at the NMISS statistic as well as missing values are excluded from range calculations.

 

Anita_n
Pyrite | Level 9

Sorry, var1 to var5 are not numeric 

data have;
 input id $3 pnr pid time cn var1 $ var2 $ var3 $ var4 $ var5 $ ;
datalines;
1   12  2  1 1 5 6 7 3 4
1   12  2  1 2 5 7 9 3 5
1   12  2  2 2 3 8 5 1 5
1   12  2  2 2 3 8 7 1 6
;
run;

and the results should look like this

data want;
 input id $3 pnr pid time cn var1 $ flag1 var2 $ flag2 var3 $ flag3  var4 $ flag4 var5 $ flag5 ;
datalines;
1   12  2  1 1 5 1 6 0 7 0 3 1 4 0
1   12  2  1 2 5 1 7 0 9 0 3 1 5 0
1   12  2  2 2 3 1 8 1 5 0 1 1 5 0
1   12  2  2 2 3 1 8 1 7 0 1 1 6 0
;
run;

It's just that two different people entered this data at different times. And the values they entered should be same. So, the idea behind it is to check if the values entered by the two different people are the same, if not then the data should be corrected. If values have not been entered at time point(time) then this should be done. Time is either 1 or 2 , cn is the control which is also 1 or 2. So at the end I will like to print all id where error occured during entry. And also all ids with missing entries. That is why I wanted to use the flags (1=same, 0=not the same)

ballardw
Super User

Since you have a requirement " all ids with missing entries" then your example data should include at least one example of what "missing entry" means to you, point it or them out so we can see them and indicate what the output looks like for a missing entry.

 

If your example data included a "who entered the data" variable that might help such as identifying which combination of ID variables were only entered by one person...

Tom
Super User Tom
Super User

The easiest way to compare variable values is between datasets.  

 

If sounded from your description like the data is entered by two different individuals. So if you can split the data into two groups you can use PROC COMPARE to compare them.  Does one of those variables represent the person that entered the data?  If so then us it to split the data.

 

Otherwise you could just split between the first and second records for each set of ID variables.

 

Which of those variables are the ID (or BY) variables that uniquely identify the observations you want to compare?  To me it looks like the first 4 variables from ID to TIME.

Tom_0-1722962322550.png

 

Anita_n
Pyrite | Level 9

@Tom I don't want to split the data. The data  shouldn't be splitted. Am only suppose to compare what person A and B entered if they are the same or not

Tom
Super User Tom
Super User

Which variable indicates if it was PERSON A or PERSON B?

 

Note you don't have to make a second copy of the dataset if you can use a WHERE= dataset option to select the appropriate observations.

proc compare 
  data=have(where=(person='A'))
  compare=have(where=(person='B'))
;
  id id--time;
run;

 

Tom
Super User Tom
Super User

If you are POSITIVE there are always two observations per BY group then this code will compare the second to the first for all 5 of your variables and then remerge the results back onto both observations.

data diff;
do until (last.time);
  set have;
  by id--time ;
  array x var1-var5 ;
  array y flag1-flag5 ;
  do index=1 to dim(x);
    y[index] = x[index] = lag(x[index]);
  end;
end;
do until (last.time);
  set have;
  by id--time ;
  output;
end;
  drop index;
run;

Tom_0-1722972354762.png

 

Anita_n
Pyrite | Level 9

@Tom This is the error message I get when using this code on the original data

ERROR: Array subscript out of range at row 411 column 5.

Tom
Super User Tom
Super User

@Anita_n wrote:

@Tom This is the error message I get when using this code on the original data

ERROR: Array subscript out of range at row 411 column 5.


Learn how to read the SAS log.  Look at what statement was shown on LINE number 411 of your SAS log.  What part of the statement was in the fifth column on that line of code?

 

If you ran the code I provided the only way that the INDEX could be invalid is if you did not define the array of FLAG variables with the same number (or more) of members as the array of original variables.

Astounding
PROC Star

First, get rid of the "3" in the INPUT statement:

input id $3 

Just a dollar sign would be sufficient.  The "3" tells SAS to take the contents of column 3, which it always blank.

Then sort your data:

proc sort data=have;
   by id pnr pid time cn var1 var2 var3 var4 var5;
run;

Then pick out the observations that need to be investigated:

data want;
   set have;
   by id pnr pid time cn var1 var2 var3 var4 var5;
   if first.var5 and last.var5 then delete;
run;

It's untested code, but easy enough to test since you have the data.  This just picks out the observations, but doesn't tell you where the differences lie.

Follow-up:  right idea but wrong logic  will post an update later.

 

Anita_n
Pyrite | Level 9
Sorry, this did not work even using the test data
Quentin
Super User

While I would go with PROC COMPARE, PROC SQL may be another option.  For numeric values, you can use the RANGE() function to make sure the values match.

 

proc sql ;
  select id, pnr, pid, time, cn
        ,var1
        ,range(input(var1,8.))>0 as flag1
        ,var3
        ,range(input(var3,8.))>0 as flag3
  from have 
  group by id, pnr, pid, time
  ;
quit ;


You'd need to think about how to handle any missing values.

 

I don't know if there is an easy way to compare character values the same way.  Possibly you could try using the RANK() function if your values are only one character.  Maybe someone else has a better idea.   A hashing function, perhaps?


The Boston Area SAS Users Group (BASUG) is hosting our in person SAS Blowout on Oct 18!
This full-day event in Cambridge, Mass features four presenters from SAS, presenting on a range of SAS 9 programming topics. Pre-registration by Oct 15 is required.
Full details and registration info at https://www.basug.org/events.
Anita_n
Pyrite | Level 9

@Quentin Is there any short way to list the variables. I have more than 300 variables to compare. To list each single variable will not be funny

Quentin
Super User

@Anita_n wrote:

@Quentin Is there any short way to list the variables. I have more than 300 variables to compare. To list each single variable will not be funny


Are you familiar with the macro language?  That's the approach I use most often for code generation.  So I would write a macro which has a parameter for the list of variables, then the macro would return a block of code like:

 ,range(input(var1,8.))>0 as var1flag
 ,range(input(var2,8.))>0 as var2flag
 ,range(input(var3,8.))>0 as var3flag

If you're familiar with the macro language, please try writing a macro like that, and if you have problems, post your macro and people can help.

Or if you're familiar with other code generation approaches (e.g. CALL EXECUTE, or using DATA step to write a .sas file that you %include), those approaches could work as well.  There are lots of different ways to generate code in SAS.

 

That said, I think probably PROC COMPARE is the better way to go.  If I decided to do it with SQL or DATA step, I would probably validate my results with PROC COMPARE, just because I trust PROC COMPARE to do its job correctly.

 

Did you try PROC COMPARE? What went wrong with that approach?

The Boston Area SAS Users Group (BASUG) is hosting our in person SAS Blowout on Oct 18!
This full-day event in Cambridge, Mass features four presenters from SAS, presenting on a range of SAS 9 programming topics. Pre-registration by Oct 15 is required.
Full details and registration info at https://www.basug.org/events.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 20 replies
  • 1597 views
  • 10 likes
  • 5 in conversation