Solved: Re: Refining dataset

wajmsu · Posted 03-06-2016 06:58 PM

Hi,

I have a dataset about the blood sample withdrawal with more information than I need. I want to reduce the duplicates and change the way data are presented. Here is an example of the format I have data in (table 1) below and I want to change it as is shown in table 2 (both tables are not real datasets).

Thanks for your guidance!

Table 1: Existing format

subject_id	label	sample_address	sample_id	date	subj_date	visit	sample type
12	2	2	2	7/23/2004	2 040723	1	Plasma
12	2	q	4	7/23/2004	2 040723	1	Plasma
12	2	e	6	2/10/2005	2 050210	2	Plasma
12	2	D,11	8	2/10/2005	2 050210	2	Plasma
12	2	s	10	7/23/2004	2 040723	1	Plasma
25	6	c	20	4/3/2007	6 070403	1	Plasma
25	6	x	14	4/3/2007	6 070403	1	Plasma
25	6	D,11	16	12/10/2007	6 071210	2	Plasma
25	6	y	18	12/10/2007	6 071210	2	Plasma

Table 2: The format I need

subject_id	label	date of 1st sample	date of 2nd sample	samp_typ
12	2	7/23/2004	2/10/2005	Plasma
25	6	4/3/2007	12/10/2007	Plasma

Reeza · Posted 03-06-2016 09:33 PM

What are your key variables that define a duplicate? It looks like ID, Label and Type?

So your proc sort would be something like:

proc sort data=have nodupkey out=deduped;
by id label type Date;
run;


*then run a proc transpose;

proc transpose data=deduped out=want prefix=date;
by id label type;
var date;
run;

View solution in original post

Reeza · Posted 03-06-2016 07:08 PM

Is 2 the maximum number of duplicates? Or will there be an indeterminate number of samples?

You could do a proc sort, with the nodupkey option, and a proc transpose.

wajmsu · Posted 03-06-2016 09:13 PM

No, there are more than 2 duplicates. Here is some problem. The whole row is not duplicate; as you can see the column C and D are different. I do not want them. I want those variables which I showed in table 2.

I tried to use sort procedure with no duplicates but it does not remove the duplicates, probably these are not 'real' duplicates as I mentioned above that one or two columns are different.

Reeza · Posted 03-06-2016 09:33 PM

What are your key variables that define a duplicate? It looks like ID, Label and Type?

So your proc sort would be something like:

proc sort data=have nodupkey out=deduped;
by id label type Date;
run;


*then run a proc transpose;

proc transpose data=deduped out=want prefix=date;
by id label type;
var date;
run;

wajmsu · Posted 03-07-2016 12:33 PM

Hi Reeza,

I could reorganized the dat variables. However, when I looked at the data found some IDs were with both dates and some were with one date. How can I remove the IDs with one date?

thanks

RW9 · Posted 03-07-2016 04:29 AM

Would not something along the lines of (and no test data provided in a useable format to test this - i.e. a datastep so I don't have to type all that in) this work:

data want (keep=subject_id label date1 date2 samp_type);
  set have;
  by subject_id date;
  retain cnt date1 date2;
  if first.subject_id then do;
    cnt=1;
    date1=date;
  end;
  else if cnt=1 then do;
    cnt=2;
    date2=date;
  end;
  if last.subject_id then output;
  format date1 date2 date9.;
run;

Classroom Training Available!