DATA Step, Macro, Functions and more

Identify and Print/create an output for duplicate records

Reply
Occasional Contributor
Posts: 7

Identify and Print/create an output for duplicate records

Hello All,

 

So I have a program where we randomly sample patients each month. I want to create a code where I can merge sample_patients1 (say this is the sample for month of April) merge it sample_pateients2 (this is for month of May) to ensure not same patients get sampled again in May. And if there are any duplicates then I want to know which patients are those.

 

FYI - I am merging the 2 datasets by using 'PatientFileNo' which is unique for each patient.

 

I am using this now but it is not working -

data work.eligible;

merge work.eligible (in=new) reference1 (in=ref keep=PRAMSFileNo);

by PRAMSFileNo; 

if (new and (not ref));   

run; */  

Super User
Posts: 17,840

Re: Identify and Print/create an output for duplicate records

[ Edited ]

You can use by processing in a few more steps:

 

/*generate sample data*/
data class;
set sashelp.class;
run;
 
data class2;
set sashelp.class(obs=5);
run;
 
data merged;
set class class2;
run;
 
/*Sort*/
proc sort data=merged;
by name;
run;
 
/*ID*/
data duplicate unique;
set merged;
by name;
if first.name and last.name then output unique;
else output duplicate;
run;

 

EDIT: This may work for your situation. I would also recommend NOT overwriting your dataset.

 

data work.eligible2 duplicate;
merge work.eligible (in=new) reference1 (in=ref keep=PRAMSFileNo);
by PRAMSFileNo; 
if first.pramsfileno and last.pramsfileno then output eligible2;
else output duplicate;
run;

 

PROC Star
Posts: 1,232

Re: Identify and Print/create an output for duplicate records

What do you mean when you say it is not working?  Are you getting errors, or not the desired result?  You have an extra */ at the end of your code that woudl cause problems if you submitted it (an unclosed comment).

 

Generally it's a bad idea to overwrite a dataset (you are reading work.eligible and writing work.eligible), but it's allowable.  I don't see anything wrong in the code. Below quick test looks like it's working as expected:

 

226  data eligible;
227    do id=1 to 5;
228      output;
229    end;
230  run;

NOTE: The data set WORK.ELIGIBLE has 5 observations and 1 variables.

231
232  data reference1;
233    do id=1,3,5;
234      output;
235    end;
236  run;

NOTE: The data set WORK.REFERENCE1 has 3 observations and 1 variables.

237
238  data eligible;
239    merge eligible (in=new)
240          reference1 (in=ref)
241    ;
242    by id;
243    if new=1 and ref=0;
244  run;

NOTE: There were 5 observations read from the data set WORK.ELIGIBLE.
NOTE: There were 3 observations read from the data set WORK.REFERENCE1.
NOTE: The data set WORK.ELIGIBLE has 2 observations and 1 variables.

Valued Guide
Posts: 858

Re: Identify and Print/create an output for duplicate records

I would write the same thing like this to prevent error in syntax:

 

data new_eligible;
merge work.eligible (in=new)  
          reference1    (in=ref keep=PRAMSFileNo);
by PRAMSFileNo;
if new and not ref;   
run; 

 

BUT, I think you would be better off giving example data and an example of the desired output.  You can possibly learn a completely different and possibly better way to get what you are looking for.

Frequent Contributor
Posts: 130

Re: Identify and Print/create an output for duplicate records

Here's a way to display only the matches between sample_patients1 (eligible) and sample_patients2 (reference1) using the SQL procedure:

proc sql;
create table eligible_duplicates as
select A.*,
       B.PRAMSFileNo as Dup_PRAMSFileNo
from eligible A inner join reference1 B
on (A.PRAMSFileNo=B.PRAMSFileNo);
quit;
Super User
Posts: 10,500

Re: Identify and Print/create an output for duplicate records

When I have had somewhat similar sample management issues my approach has been to maintain a separate data table of the selected sample with the sample time frame (month and year perhaps). Then when the next round comes up start with the sample frame, exclude any records/subjects from the previously samples (proc sql with EXCEPT works well) and then APPEND the selected sample afterwards.

 

The append with the sample period means that I can easily include previously sampled if sufficient time has expired assuming that is allowed in the protocol for a study. Or if you have separate groups such as by facility or treatment groupings you include that information as well so that a patient that might have been sampled for group x but is also eligible for group y is available for sample in that part of the process.

 

 

Ask a Question
Discussion stats
  • 5 replies
  • 188 views
  • 0 likes
  • 6 in conversation