Hi,
I have a dataset about the blood sample withdrawal with more information than I need. I want to reduce the duplicates and change the way data are presented. Here is an example of the format I have data in (table 1) below and I want to change it as is shown in table 2 (both tables are not real datasets).
Thanks for your guidance!
Table 1: Existing format
subject_id | label | sample_address | sample_id | date | subj_date | visit | sample type |
12 | 2 | 2 | 2 | 7/23/2004 | 2 040723 | 1 | Plasma |
12 | 2 | q | 4 | 7/23/2004 | 2 040723 | 1 | Plasma |
12 | 2 | e | 6 | 2/10/2005 | 2 050210 | 2 | Plasma |
12 | 2 | D,11 | 8 | 2/10/2005 | 2 050210 | 2 | Plasma |
12 | 2 | s | 10 | 7/23/2004 | 2 040723 | 1 | Plasma |
25 | 6 | c | 20 | 4/3/2007 | 6 070403 | 1 | Plasma |
25 | 6 | x | 14 | 4/3/2007 | 6 070403 | 1 | Plasma |
25 | 6 | D,11 | 16 | 12/10/2007 | 6 071210 | 2 | Plasma |
25 | 6 | y | 18 | 12/10/2007 | 6 071210 | 2 | Plasma |
Table 2: The format I need
subject_id | label | date of 1st sample | date of 2nd sample | samp_typ |
12 | 2 | 7/23/2004 | 2/10/2005 | Plasma |
25 | 6 | 4/3/2007 | 12/10/2007 | Plasma |
What are your key variables that define a duplicate? It looks like ID, Label and Type?
So your proc sort would be something like:
proc sort data=have nodupkey out=deduped;
by id label type Date;
run;
*then run a proc transpose;
proc transpose data=deduped out=want prefix=date;
by id label type;
var date;
run;
Is 2 the maximum number of duplicates? Or will there be an indeterminate number of samples?
You could do a proc sort, with the nodupkey option, and a proc transpose.
No, there are more than 2 duplicates. Here is some problem. The whole row is not duplicate; as you can see the column C and D are different. I do not want them. I want those variables which I showed in table 2.
I tried to use sort procedure with no duplicates but it does not remove the duplicates, probably these are not 'real' duplicates as I mentioned above that one or two columns are different.
What are your key variables that define a duplicate? It looks like ID, Label and Type?
So your proc sort would be something like:
proc sort data=have nodupkey out=deduped;
by id label type Date;
run;
*then run a proc transpose;
proc transpose data=deduped out=want prefix=date;
by id label type;
var date;
run;
Hi Reeza,
I could reorganized the dat variables. However, when I looked at the data found some IDs were with both dates and some were with one date. How can I remove the IDs with one date?
thanks
Would not something along the lines of (and no test data provided in a useable format to test this - i.e. a datastep so I don't have to type all that in) this work:
data want (keep=subject_id label date1 date2 samp_type); set have; by subject_id date; retain cnt date1 date2; if first.subject_id then do; cnt=1; date1=date; end; else if cnt=1 then do; cnt=2; date2=date; end; if last.subject_id then output; format date1 date2 date9.; run;
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.