topic Re: Keep all variables from dataset when using proc summary in New SAS User

Keep all variables from dataset when using proc summary

monsterpie — Sun, 16 Jan 2022 23:26:26 GMT

Hello all,

I have a long dataset and am using proc summary and sql to transform it to a wide dataset. I'm having trouble adapting the following code to retain all of my variables in my long dataset in the new wide dataset. Example of my code is below. For my age and count variables, I would like to retain them but if I include them in the IDGROUP statement, then SAS creates multiple age and count columns in my wide dataset (I would just like one, see desired output below).


data have;
input id visit_num diag$ age treatment count;
cards;
01 1 diabetes 42 1 1
02 1 stroke 78 1 2
02 2 cancer . 2 2
02 3 stroke . . 2
03 1 stroke 66 2 2
03 2 copd . 1 2
03 3 . . . 2
;
run;

proc sql noprint;
select max(n) into :n
 from (select count(*) as n from have group by id);
quit;
proc summary data=have;
by id;
output out=want idgroup(out[&n] (visit_num diag treatment)=);
run;

Desired output:(Note, in my actual dataset I have ~40 variables similar to count and age, where I would like just one column for each of these variables in my wide dataset).

ID	Visit_Num1	Visit_Num2	Visit_Num3	Age	Treatment1	Treatment2	Treatment3	Count
01	diabetes			42	1			1
02	stroke	cancer	stroke	78	1	2		2
03	stroke	copd		66	2	1		2

Re: Keep all variables from dataset when using proc summary

r_behata — Mon, 17 Jan 2022 00:29:40 GMT

Without re-writing the existing code ( which works just fine for most part). I would suggest adding a step to re-merging the age and count variables.


proc sql noprint;
	select max(n) into :n from (select count(*) as n from have group by id);
quit;

proc summary data=have;
	by id;
	output out=want idgroup(out[&n] (visit_num diag treatment)=);
run;


data final;
	merge want have(where=(age ne .) keep=id age  count );
	by id;
run;

Re: Keep all variables from dataset when using proc summary

Tom — Mon, 17 Jan 2022 01:39:59 GMT

If AGE and/or COUNT is constant for an ID then you could include it in the BY statement. But that does not seem to be the case. Why?

To drop variables use the DROP= dataset option.

proc summary data=have;
  by id;
  output
    out=want(drop=age2-age&n count2-count&n
             rename=(age1=age count1=count))
    idgroup(out[&n] (visit_num age diag treatment count)=)
  ;
run;

If AGE and COUNT a numeric you could calculate a statistic for them instead. Perhaps the minimum age and the sum of the counts?

proc summary data=have;
  by id;
  var age count;
  output
    out=want
   min(age)= sum(count)=
    idgroup(out[&n] (visit_num diag treatment )=)
  ;
run;

Re: Keep all variables from dataset when using proc summary

Reeza — Mon, 17 Jan 2022 17:23:02 GMT

This gets you pretty close, except it's not coded as VISIT_NUM but DIAG1 - DIAG3

data have;
input id visit_num diag$ age treatment count;
cards;
01 1 diabetes 42 1 1
02 1 stroke 78 1 2
02 2 cancer . 2 2
02 3 stroke . . 2
03 1 stroke 66 2 2
03 2 copd . 1 2
03 3 . . . 2
;
run;

proc sql noprint;
select max(n) into :n
 from (select count(*) as n from have group by id);
quit;
proc summary data=have;
by id;
ID Age COUNT;
output out=want (drop=_:) idgroup(out[&n] (diag treatment)=);
run;

Results:

@monsterpie wrote:

Hello all,

I have a long dataset and am using proc summary and sql to transform it to a wide dataset. I'm having trouble adapting the following code to retain all of my variables in my long dataset in the new wide dataset. Example of my code is below. For my age and count variables, I would like to retain them but if I include them in the IDGROUP statement, then SAS creates multiple age and count columns in my wide dataset (I would just like one, see desired output below).
data have;
input id visit_num diag$ age treatment count;
cards;
01 1 diabetes 42 1 1
02 1 stroke 78 1 2
02 2 cancer . 2 2
02 3 stroke . . 2
03 1 stroke 66 2 2
03 2 copd . 1 2
03 3 . . . 2
;
run;

proc sql noprint;
select max(n) into :n
 from (select count(*) as n from have group by id);
quit;
proc summary data=have;
by id;
output out=want idgroup(out[&n] (visit_num diag treatment)=);
run;
Desired output:(Note, in my actual dataset I have ~40 variables similar to count and age, where I would like just one column for each of these variables in my wide dataset).

ID Visit_Num1 Visit_Num2 Visit_Num3 Age Treatment1 Treatment2 Treatment3 Count

01 diabetes 42 1 1

02 stroke cancer stroke 78 1 2 2

03 stroke copd 66 2 1 2