I have a dataset :
ID var1 var2 ... var1000 Y
1 a b f 1111
2 c m g 222
3 a m g 4444
4 d m h 7777
Then I want an output like:
varName Levels Freq Avg_Y
var1 a 2 xx
var1 c 1 yy
var1 d 1 ww
var2 m 3 zz
var2 b 1 qq
The dataset is huge with lots of levels, so PROC FREQ fails to run. I'm wondering if PROC SQL can help me run the code.
Any help would be greatly appreciated!
Hi @mh2t If you are ever interested in some advanced solutions like using Hash for your stated requirement, I encourage you to read Guru Paul D @hashman / @DonH book that has an example of all these kinds.
By Paul Dorfman and Don Henderson
Anyways, I wanted to have some fun-
data have;
input id (var1 var2) ($) y;
datalines;
1 a b 1111
2 c m 2222
3 a m 4444
4 d m 7777
;
data _null_ ;
if _n_=1 then do;
dcl hash H (ordered: "A") ;
h.definekey ("varname","levels") ;
h.definedata ("varname","levels","freq","sum","avg_y") ;
h.definedone () ;
end;
set have end=z;
array t var1-var2;
do over t;
varname=vname(t);
levels=t;
if h.find()=0 then do;
freq=sum(freq,1);
sum=sum(y,sum);
avg_y=sum/freq;
end;
else do;
freq=1;
sum=y;
avg_y=sum;
end;
h.replace();
end;
if z;
h.output(dataset:'want(drop=sum)');
run;
PROC FREQ will do this for the levels. (Although I'm skeptical what value there is in creating this table for 1000 variables...)
PROC MEANS/PROC SUMMARY will compute the mean for Y. (Although I'm skeptical what value there is in creating this table for 1000 variables...)
You can merge the tables together if you really need the output in the exact format shown.
How do you calculate the value for the average of Y?
Anyway, this is the method for calculating the frequencies:
data have;
input id (var1 var2) ($);
datalines;
1 a b
2 c m
3 a m
4 d m
;
proc transpose data=have out=long name=varname;
by id;
var var:;
run;
proc freq data=long noprint;
tables varname*col1 / nopercent out=want (drop=percent);
run;
Thank you @Kurt_Bremser .
Avg_Y is calcualted within levels. So for Var1, level "a", the Avg_Y = (1111 + 4444)/2
Then this would be my suggestion:
data have;
input id (var1 var2) ($) y;
datalines;
1 a b 1111
2 c m 2222
3 a m 4444
4 d m 7777
;
proc transpose
data=have
out=long (rename=(col1=levels))
name=varname
;
by id;
var var:;
run;
data long2;
merge
long
have (keep=id y)
;
by id;
run;
proc summary data=long2 nway;
class varname levels;
output
out=want (drop=_type_ rename=(_freq_=freq))
mean(y)=
;
run;
For Huge data sets, I would avoid Proc Transpose. Using Data Step for fixed data structure can be faster and uses less data passes
/* Transpose the data using Data Step */
data want(keep=varname levels y);
length varname $7 levels $1;
array vars (*) $1 var1-var3;
set have;
do i=1 to dim(vars);
varname =vname(vars(i));
levels =vvalue(vars(i));
output;
end;
run;
/* Summarize to produce your desired means */
proc summary data=want nway;
class varname levels;
var y;
output out=want_sums(drop=_type_ Rename=(_freq_=freq)) mean=y;
run;
/* Remove unwanted data */
proc delete data=want; run;
Hope this helps,
Ahmed
Hi @mh2t If you are ever interested in some advanced solutions like using Hash for your stated requirement, I encourage you to read Guru Paul D @hashman / @DonH book that has an example of all these kinds.
By Paul Dorfman and Don Henderson
Anyways, I wanted to have some fun-
data have;
input id (var1 var2) ($) y;
datalines;
1 a b 1111
2 c m 2222
3 a m 4444
4 d m 7777
;
data _null_ ;
if _n_=1 then do;
dcl hash H (ordered: "A") ;
h.definekey ("varname","levels") ;
h.definedata ("varname","levels","freq","sum","avg_y") ;
h.definedone () ;
end;
set have end=z;
array t var1-var2;
do over t;
varname=vname(t);
levels=t;
if h.find()=0 then do;
freq=sum(freq,1);
sum=sum(y,sum);
avg_y=sum/freq;
end;
else do;
freq=1;
sum=y;
avg_y=sum;
end;
h.replace();
end;
if z;
h.output(dataset:'want(drop=sum)');
run;
data have; input id (var1 var2) ($) y; datalines; 1 a b 1111 2 c m 2222 3 a m 4444 4 d m 7777 ; proc transpose data=have(obs=0) out=temp; var _all_; run; proc delete data=want;run; data _null_; set temp(where=(lowcase(_name_) like 'var%')); call execute(cat('proc sql;create table x as select "',_name_,'" as vname length=40,', _name_,' as levels,count(',_name_,') as freq,mean(y) as avg_y from have group by ',_name_, ';quit;proc append base=want data=x force;run;')); run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.