Solved: Re: Compare rows in a dataset - Page 2

Anita_n · Posted 08-07-2024 10:35 AM

@Tom @Quentin @Astounding @ballardw Thankyou all for your contribution to find a solution to this problem

After consultation with the customer. Here is a detailed explanation of his intentions with the data. I have adjusted (added the variable that contains the person who entered the data) the data a little.

data have;
input id $ pnr pid time cn editor var1 var2 var3 var4 var5;
datalines;
1 12 2 1 1 1 5 6 7 3 4
1 12 2 1 2 2 5 7 9 3 5
1 12 2 2 2 3 3 8 5 1 5
1 12 2 2 2 4 3 8 7 1 6

15 6 3 1 1 1 3 5 8 3 4
15 6 3 2 2 2 3 6 8 3 5
15 6 3 2 2 3 1 7 5 1 5

25 7 3 1 1 1 3 5 8 3 4
25 7 3 2 2 2 3 6 8 3 5
25 7 3 1 2 3 1 7 5 1 5
25 7 3 1 2 4 3 8 7 1 6

;
run;

a) Identify the matching pairs of the 1st and 2nd input
- there should always be 2 observations that match in the variables with match in the id, pnr and pid
- these observations differ by the number of the control input in the variable 'control' (1 or 2)

b) Check all data records to see whether they already have a control entry or not; these two groups should be outputted separately
- a group with observations that do not yet have a 2nd input
- a group with observations for which there is a 2nd input

c) Compare the observations that exist twice with each other, prerequisite as described above with match in id, pnr, pid
- then compare whether the var1-var5 match (by rows) or not,
- observations that do not match should be output so that we know what needs to be corrected

So these are the notes I made. I hope someone can give me a helping hand. Even if it's just the start because am little bit confused of where to start from. Thanks

ballardw · Posted 08-07-2024 11:01 AM

Editor values of 1,2,3, 4 sure don't look like two people entering data which was part of the prior description.

Please restrict use of "data set" to discussion about SAS data sets. You are apparently talking about rows of values in the text file as a data set and that doesn't make much sense (at least not to me) and confuses where a data set may actually be needed.

You may need to provide a bit more detail about this

b) Check all data records to see whether they already have a control entry or not; these two groups should be outputted                   separately
      - a group with data records that do not yet have a 2nd input
       - a group with data records for which there is a 2nd input

As in which records in your example meet each requirement and it would be nice for consistency sake to provide something like the name of an output data set to hold which "separate" outputs

Anita_n · Posted 08-07-2024 12:07 PM

@ballardw Thanks for your remarks:

Q: Editor values of 1,2,3, 4 sure don't look like two people entering data which was part of the prior description.

A: I did not want to use 1, 2 for that this doesn't confuse with time and control

Q: Please restrict use of "data set" to discussion about SAS data sets. You are apparently talking about rows of values in the text file as a data set and that doesn't make much sense (at least not to me) and confuses where a data set may actually be needed.

A: Sorry I have editted this above

Q:

You may need to provide a bit more detail about this.

b) Check all data records to see whether they already have a control entry or not; these two groups should be outputted                   separately
      - a group with data records that do not yet have a 2nd input
       - a group with data records for which there is a 2nd input

As in which records in your example meet each requirement and it would be nice for consistency sake to provide something like the name of an output data set to hold which "separate" outputs

A: Identify the matching pairs of the 1st and 2nd input
- there should always be 2 observations that match in the variables with match in the id, pnr, pid and time

- The values of id, pnr, pid should always be the same to become a matching pair. In variable "control" there should be a first and second entry to see that two different people entered the data

- Like in observation one and two id=1 , since id, pnr, pid match they are pairs and there exist control 1 and 2 for the values of var1 to var5 could be checked to if if they are the same or not. If they are the same I output them in a dataset "all_values_match". If they do not match I output them in a data set "values_doesnt_match"

-Then I check for the next pair with the same id, pnr, pid but I have only "control"=2 so I know control one is definitely missing. I then check if these are duplicates if not I output them in dataset "missing_entry" etc

I hope I could answer your questions

Tom · Posted 08-07-2024 12:29 PM

So trying to pull key information from your post it looks like groups in the data is identified by the values of three variables: id pnr pid

Each group could appear once or twice. (Are you sure there is never a third or fourth observation in a group?)

So split the data into three files. The singletons. The first and the second.

data single first second;
  set have;
  by id pnr pid;
  if first.pid and last.pid then output single;
  else if first.pid then output first;
  else output second;
run;

So the dataset SINGLE is the answer to which ones have not had what we used to call second pass data entry.

You can then PROC COMPARE with the other two datasets to see which records had differences.

Quentin · Posted 08-07-2024 02:01 PM

Let's make a simpler example dataset. The data has one row per PatientID. The data for each PatienID was entered by two different people (EntryID):

data have;
input ptid entryid var1 var2 var3;
datalines;
1 1 2 4 6
1 2 2 5 6
2 1 10 12 14
2 2 10 12 14
3 1 2 4 6
;
run;

So that is basically your data, but with fewer ID variables. When asking questions, it's helpful to make the HAVE data as simple as possible. Anything that works for one ID variable can work for multiple.

With that, I would use PROC COMPARE:

proc compare base=have(where=(entryid=1)) compare=have(where=(entryid=2)) listobs;
  var var1-var3 ;
  id ptid ;
run ;

That will show you where are are differences in the value of var1-var3. It will also show you when data for a patient was only entered once. If data for a patient was entered 3 times, you'll get a warning in your log about duplicate data. It will work for numeric data and character data. You don't even have to list all of the variables. PROC COMPARE is fabulous!

If you want output data instead of a report, you can explore the options in PROC COMPARE to create output datasets.

The Boston Area SAS Users Group (BASUG) is hosting our in person SAS Blowout on Oct 18!
This full-day event in Cambridge, Mass features four presenters from SAS, presenting on a range of SAS 9 programming topics. Pre-registration by Oct 15 is required.
Full details and registration info at https://www.basug.org/events.

Anita_n · Posted 08-07-2024 05:44 PM

@Tom @Quentin @ballardw @Astounding You guys are really great. Thanks a lot for the great job. It worked, after a little changes

Re: Compare rows in a dataset

Re: Compare rows in a dataset

Re: Compare rows in a dataset

Re: Compare rows in a dataset

Re: Compare rows in a dataset

Re: Compare rows in a dataset

SAS Innovate 2025: Call for Content

Click image to register for webinar

Classroom Training Available!