Hi,
I have a dataset about the blood sample withdrawal with more information than I need. I want to reduce the duplicates and change the way data are presented. Here is an example of the format I have data in (table 1) below and I want to change it as is shown in table 2 (both tables are not real datasets).
Thanks for your guidance!
Table 1: Existing format
| subject_id | label | sample_address | sample_id | date | subj_date | visit | sample type |
| 12 | 2 | 2 | 2 | 7/23/2004 | 2 040723 | 1 | Plasma |
| 12 | 2 | q | 4 | 7/23/2004 | 2 040723 | 1 | Plasma |
| 12 | 2 | e | 6 | 2/10/2005 | 2 050210 | 2 | Plasma |
| 12 | 2 | D,11 | 8 | 2/10/2005 | 2 050210 | 2 | Plasma |
| 12 | 2 | s | 10 | 7/23/2004 | 2 040723 | 1 | Plasma |
| 25 | 6 | c | 20 | 4/3/2007 | 6 070403 | 1 | Plasma |
| 25 | 6 | x | 14 | 4/3/2007 | 6 070403 | 1 | Plasma |
| 25 | 6 | D,11 | 16 | 12/10/2007 | 6 071210 | 2 | Plasma |
| 25 | 6 | y | 18 | 12/10/2007 | 6 071210 | 2 | Plasma |
Table 2: The format I need
| subject_id | label | date of 1st sample | date of 2nd sample | samp_typ |
| 12 | 2 | 7/23/2004 | 2/10/2005 | Plasma |
| 25 | 6 | 4/3/2007 | 12/10/2007 | Plasma |
What are your key variables that define a duplicate? It looks like ID, Label and Type?
So your proc sort would be something like:
proc sort data=have nodupkey out=deduped;
by id label type Date;
run;
*then run a proc transpose;
proc transpose data=deduped out=want prefix=date;
by id label type;
var date;
run;
Is 2 the maximum number of duplicates? Or will there be an indeterminate number of samples?
You could do a proc sort, with the nodupkey option, and a proc transpose.
No, there are more than 2 duplicates. Here is some problem. The whole row is not duplicate; as you can see the column C and D are different. I do not want them. I want those variables which I showed in table 2.
I tried to use sort procedure with no duplicates but it does not remove the duplicates, probably these are not 'real' duplicates as I mentioned above that one or two columns are different.
What are your key variables that define a duplicate? It looks like ID, Label and Type?
So your proc sort would be something like:
proc sort data=have nodupkey out=deduped;
by id label type Date;
run;
*then run a proc transpose;
proc transpose data=deduped out=want prefix=date;
by id label type;
var date;
run;
Hi Reeza,
I could reorganized the dat variables. However, when I looked at the data found some IDs were with both dates and some were with one date. How can I remove the IDs with one date?
thanks
Would not something along the lines of (and no test data provided in a useable format to test this - i.e. a datastep so I don't have to type all that in) this work:
data want (keep=subject_id label date1 date2 samp_type);
set have;
by subject_id date;
retain cnt date1 date2;
if first.subject_id then do;
cnt=1;
date1=date;
end;
else if cnt=1 then do;
cnt=2;
date2=date;
end;
if last.subject_id then output;
format date1 date2 date9.;
run;
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.