Hi,
I have a dataset about the blood sample withdrawal with more information than I need. I want to reduce the duplicates and change the way data are presented. Here is an example of the format I have data in (table 1) below and I want to change it as is shown in table 2 (both tables are not real datasets).
Thanks for your guidance!
Table 1: Existing format
| subject_id | label | sample_address | sample_id | date | subj_date | visit | sample type | 
| 12 | 2 | 2 | 2 | 7/23/2004 | 2 040723 | 1 | Plasma | 
| 12 | 2 | q | 4 | 7/23/2004 | 2 040723 | 1 | Plasma | 
| 12 | 2 | e | 6 | 2/10/2005 | 2 050210 | 2 | Plasma | 
| 12 | 2 | D,11 | 8 | 2/10/2005 | 2 050210 | 2 | Plasma | 
| 12 | 2 | s | 10 | 7/23/2004 | 2 040723 | 1 | Plasma | 
| 25 | 6 | c | 20 | 4/3/2007 | 6 070403 | 1 | Plasma | 
| 25 | 6 | x | 14 | 4/3/2007 | 6 070403 | 1 | Plasma | 
| 25 | 6 | D,11 | 16 | 12/10/2007 | 6 071210 | 2 | Plasma | 
| 25 | 6 | y | 18 | 12/10/2007 | 6 071210 | 2 | Plasma | 
Table 2: The format I need
| subject_id | label | date of 1st sample | date of 2nd sample | samp_typ | 
| 12 | 2 | 7/23/2004 | 2/10/2005 | Plasma | 
| 25 | 6 | 4/3/2007 | 12/10/2007 | Plasma | 
What are your key variables that define a duplicate? It looks like ID, Label and Type?
So your proc sort would be something like:
proc sort data=have nodupkey out=deduped;
by id label type Date;
run;
*then run a proc transpose;
proc transpose data=deduped out=want prefix=date;
by id label type;
var date;
run;Is 2 the maximum number of duplicates? Or will there be an indeterminate number of samples?
You could do a proc sort, with the nodupkey option, and a proc transpose.
No, there are more than 2 duplicates. Here is some problem. The whole row is not duplicate; as you can see the column C and D are different. I do not want them. I want those variables which I showed in table 2.
I tried to use sort procedure with no duplicates but it does not remove the duplicates, probably these are not 'real' duplicates as I mentioned above that one or two columns are different.
What are your key variables that define a duplicate? It looks like ID, Label and Type?
So your proc sort would be something like:
proc sort data=have nodupkey out=deduped;
by id label type Date;
run;
*then run a proc transpose;
proc transpose data=deduped out=want prefix=date;
by id label type;
var date;
run;Hi Reeza,
I could reorganized the dat variables. However, when I looked at the data found some IDs were with both dates and some were with one date. How can I remove the IDs with one date?
thanks
Would not something along the lines of (and no test data provided in a useable format to test this - i.e. a datastep so I don't have to type all that in) this work:
data want (keep=subject_id label date1 date2 samp_type);
  set have;
  by subject_id date;
  retain cnt date1 date2;
  if first.subject_id then do;
    cnt=1;
    date1=date;
  end;
  else if cnt=1 then do;
    cnt=2;
    date2=date;
  end;
  if last.subject_id then output;
  format date1 date2 date9.;
run;
					
				
			
			
				
			
			
			
			
			
			
			
		It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
