DATA Step, Macro, Functions and more

Refining dataset

Accepted Solution Solved
Reply
Contributor
Posts: 61
Accepted Solution

Refining dataset

Hi,

I have a dataset about the blood sample withdrawal with more information than I need.  I want to reduce the duplicates and change the way data are presented.  Here is an example of the format I have data in (table 1) below and I want to change it as is shown in table 2 (both tables are not real datasets).

 

Thanks for your guidance!

Table 1: Existing format

subject_idlabelsample_addresssample_iddatesubj_datevisitsample type
122227/23/20042 0407231Plasma
122q47/23/20042 0407231Plasma
122e62/10/20052 0502102Plasma
122D,1182/10/20052 0502102Plasma
122s107/23/20042 0407231Plasma
256c204/3/20076 0704031Plasma
256x144/3/20076 0704031Plasma
256D,111612/10/20076 0712102Plasma
256y1812/10/20076 0712102Plasma

 

Table 2: The format I need

subject_idlabeldate of 1st sampledate of 2nd samplesamp_typ
1227/23/20042/10/2005Plasma
2564/3/200712/10/2007Plasma
     

Accepted Solutions
Solution
‎03-07-2016 12:30 PM
Super User
Posts: 19,789

Re: Refining dataset

What are your key variables that define a duplicate? It looks like ID, Label and Type?

 

So your proc sort would be something like:

 

proc sort data=have nodupkey out=deduped;
by id label type Date;
run;


*then run a proc transpose;

proc transpose data=deduped out=want prefix=date;
by id label type;
var date;
run;

View solution in original post


All Replies
Super User
Posts: 19,789

Re: Refining dataset

Is 2 the maximum number of duplicates? Or will there be an indeterminate number of samples? 

 

You could do a proc sort, with the nodupkey option, and a proc transpose. 

Contributor
Posts: 61

Re: Refining dataset

No, there are more than 2 duplicates.  Here is some problem.  The whole row is not duplicate; as you can see the column C and D are different.  I do not want them.  I want those variables which I showed in table 2. 

I tried to use sort procedure with no duplicates but it does not remove the duplicates, probably these are not 'real' duplicates as I mentioned above that one or two columns are different.

 

Solution
‎03-07-2016 12:30 PM
Super User
Posts: 19,789

Re: Refining dataset

What are your key variables that define a duplicate? It looks like ID, Label and Type?

 

So your proc sort would be something like:

 

proc sort data=have nodupkey out=deduped;
by id label type Date;
run;


*then run a proc transpose;

proc transpose data=deduped out=want prefix=date;
by id label type;
var date;
run;
Contributor
Posts: 61

Re: Refining dataset

Hi Reeza,

I could reorganized the dat variables.  However, when I looked at the data found some IDs were with both dates and some were with one date.  How can I remove the IDs with one date?

 

thanks

 

Super User
Super User
Posts: 7,950

Re: Refining dataset

Would not something along the lines of (and no test data provided in a useable format to test this - i.e. a datastep so I don't have to type all that in) this work:

data want (keep=subject_id label date1 date2 samp_type);
  set have;
  by subject_id date;
  retain cnt date1 date2;
  if first.subject_id then do;
    cnt=1;
    date1=date;
  end;
  else if cnt=1 then do;
    cnt=2;
    date2=date;
  end;
  if last.subject_id then output;
  format date1 date2 date9.;
run;
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 5 replies
  • 281 views
  • 1 like
  • 3 in conversation