Dear all,
I have the below dataset and I would like to make sure that the values (viallab1 to viallab10) are unique within my entire dataset
So for example the values in red would be marked as duplicates
Can you please give a suggestion on how to solve that?
Thank you
SUBJID | VISITT | VISITD | VIALLAB1 | VIALLAB2 | VIALLAB3 | VIALLAB4 | VIALLAB5 | VIALLAB6 | VIALLAB7 | VIALLAB8 | VIALLAB9 | VIALLAB10 | |
DUM-002-001 | Cycle 1 | 1 | 117003 | 170068 | 172472 | . | . | . | . | . | . | . | |
DUM-002-001 | Unscheduled visit during combination treatment 1 | 201 | . | . | . | . | . | . | . | . | . | . | |
DUM-002-001 | Unscheduled visit during combination treatment 2 | 202 | 236547 | 236548 | 236549 | . | . | . | . | . | . | . | |
CAM-002-001 | Unscheduled visit during combination treatment 3 | 203 | . | . | . | . | . | . | . | . | . | . | |
CAM-002-001 | Unscheduled visit during combination treatment 4 | 204 | 117003 | . | 170068 | . | . | . | . | . | . |
|
Please clarify ... when you determine if something is a duplicate, are you just looking in one column, or is something a duplicate if it appears once in column VIALLAB1 and it appears again in column VIALLAB2? Okay, I see that the value can be in different columns
When we do find duplicates, then what? Remove them? Or flag them? What action should be taken when duplicates are found?
Can you please provide a portion of your data as SAS data step code (instructions) instead of as a screen capture, which we can't really work with.
One approach, hope the result would be satisfactory:
data vials;
set have;
array viallab {10};
do _n_=1 to 10;
if viallab{_n_} ne . then do;
vial = viallab{_n_};
output;
end;
end;
keep vial subjid;
run;
proc freq data=vials noprint;
tables vial / out = counts;
run;
proc print data=counts;
where count > 1;
var vial count;
run;
The report gives you a list of the vials that were recorded more than once, and the data set VIALS tells you which subjects were matched with which vials.
This prints the duplicate vials but I cannot see on which subjects the duplicate values occur
see below the result
Obs | vial | COUNT |
799 | 108291 | 2 |
1067 | 111011 | 2 |
1264 | 112968 | 2 |
1349 | 113794 | 2 |
1362 | 113909 | 2 |
1604 | 116195 | 2 |
Looking at the updated description for the output, I would keep the duplicates all in one data set:
data vials;
set have;
array viallab {10};
do whichtest=1 to 10;
if viallab{whichtest} ne . then do;
vial = viallab{whichtest};
output;
end;
end;
keep vial subjid visitt visitd whichtest;
run;
proc sort data=vials;
by vial subjid;
run;
data want;
set vials;
by vial;
if first.vial=0 or last.vial=0;
run;
/* optionally */
proc print data=want;
by vial;
id vial;
var subjid visitt visitd whichtest;
run;
It's untested code, but should do the trick.
Also, provide an example of what your output data set should look like. If we a flagging values in multiple variables there has to be something that indicates which what is a duplicate and potentially you have have 10 variables per record that can be duplicates, so are you expecting to have 10 additional flag variables added?
Transpose the data so you have one record per VIALLAB and it is easy to count those to see which are duplicated anywhere in the set. The fun part will be getting whatever the required output is.
This is how my output should look like for the example given below
So I will have 2 separate tables in my result.
SUBJID | VISITT | VISITD | VIALLAB1 | |
DUM-002-001 | Cycle 1 | 1 | 117003 | |
CAM-002-001 | Unscheduled visit during combination treatment 4 | 204 | 117003 | |
SUBJID | VISITT | VISITD | VIALLAB2 | VIALLAB3 |
DUM-002-001 | Cycle 1 | 1 | 170068 | |
CAM-002-001 | Unscheduled visit during combination treatment 4 | 204 | . | 170068 |
proc transpose
data=have
out=long (where=(col1 ne ""))
;
by subjid visitd visitt;
var viallab:;
run;
proc sort
data=long
out=nondup
dupout=dup
nodupkey
;
by _name_;
run;
Untested; for tested code, please provide data in a data step with datalines, posted in a code box ("little running man" button).
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.