- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have a dataset :
ID var1 var2 ... var1000 Y
1 a b f 1111
2 c m g 222
3 a m g 4444
4 d m h 7777
Then I want an output like:
varName Levels Freq Avg_Y
var1 a 2 xx
var1 c 1 yy
var1 d 1 ww
var2 m 3 zz
var2 b 1 qq
The dataset is huge with lots of levels, so PROC FREQ fails to run. I'm wondering if PROC SQL can help me run the code.
Any help would be greatly appreciated!
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @mh2t If you are ever interested in some advanced solutions like using Hash for your stated requirement, I encourage you to read Guru Paul D @hashman / @DonH book that has an example of all these kinds.
Data Management Solutions Using SAS® Hash Table Operations: A Business Intelligence Case Study
By Paul Dorfman and Don Henderson
Anyways, I wanted to have some fun-
data have;
input id (var1 var2) ($) y;
datalines;
1 a b 1111
2 c m 2222
3 a m 4444
4 d m 7777
;
data _null_ ;
if _n_=1 then do;
dcl hash H (ordered: "A") ;
h.definekey ("varname","levels") ;
h.definedata ("varname","levels","freq","sum","avg_y") ;
h.definedone () ;
end;
set have end=z;
array t var1-var2;
do over t;
varname=vname(t);
levels=t;
if h.find()=0 then do;
freq=sum(freq,1);
sum=sum(y,sum);
avg_y=sum/freq;
end;
else do;
freq=1;
sum=y;
avg_y=sum;
end;
h.replace();
end;
if z;
h.output(dataset:'want(drop=sum)');
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
PROC FREQ will do this for the levels. (Although I'm skeptical what value there is in creating this table for 1000 variables...)
PROC MEANS/PROC SUMMARY will compute the mean for Y. (Although I'm skeptical what value there is in creating this table for 1000 variables...)
You can merge the tables together if you really need the output in the exact format shown.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
How do you calculate the value for the average of Y?
Anyway, this is the method for calculating the frequencies:
data have;
input id (var1 var2) ($);
datalines;
1 a b
2 c m
3 a m
4 d m
;
proc transpose data=have out=long name=varname;
by id;
var var:;
run;
proc freq data=long noprint;
tables varname*col1 / nopercent out=want (drop=percent);
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you @Kurt_Bremser .
Avg_Y is calcualted within levels. So for Var1, level "a", the Avg_Y = (1111 + 4444)/2
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Then this would be my suggestion:
data have;
input id (var1 var2) ($) y;
datalines;
1 a b 1111
2 c m 2222
3 a m 4444
4 d m 7777
;
proc transpose
data=have
out=long (rename=(col1=levels))
name=varname
;
by id;
var var:;
run;
data long2;
merge
long
have (keep=id y)
;
by id;
run;
proc summary data=long2 nway;
class varname levels;
output
out=want (drop=_type_ rename=(_freq_=freq))
mean(y)=
;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
For Huge data sets, I would avoid Proc Transpose. Using Data Step for fixed data structure can be faster and uses less data passes
/* Transpose the data using Data Step */
data want(keep=varname levels y);
length varname $7 levels $1;
array vars (*) $1 var1-var3;
set have;
do i=1 to dim(vars);
varname =vname(vars(i));
levels =vvalue(vars(i));
output;
end;
run;
/* Summarize to produce your desired means */
proc summary data=want nway;
class varname levels;
var y;
output out=want_sums(drop=_type_ Rename=(_freq_=freq)) mean=y;
run;
/* Remove unwanted data */
proc delete data=want; run;
Hope this helps,
Ahmed
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @mh2t If you are ever interested in some advanced solutions like using Hash for your stated requirement, I encourage you to read Guru Paul D @hashman / @DonH book that has an example of all these kinds.
Data Management Solutions Using SAS® Hash Table Operations: A Business Intelligence Case Study
By Paul Dorfman and Don Henderson
Anyways, I wanted to have some fun-
data have;
input id (var1 var2) ($) y;
datalines;
1 a b 1111
2 c m 2222
3 a m 4444
4 d m 7777
;
data _null_ ;
if _n_=1 then do;
dcl hash H (ordered: "A") ;
h.definekey ("varname","levels") ;
h.definedata ("varname","levels","freq","sum","avg_y") ;
h.definedone () ;
end;
set have end=z;
array t var1-var2;
do over t;
varname=vname(t);
levels=t;
if h.find()=0 then do;
freq=sum(freq,1);
sum=sum(y,sum);
avg_y=sum/freq;
end;
else do;
freq=1;
sum=y;
avg_y=sum;
end;
h.replace();
end;
if z;
h.output(dataset:'want(drop=sum)');
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
data have; input id (var1 var2) ($) y; datalines; 1 a b 1111 2 c m 2222 3 a m 4444 4 d m 7777 ; proc transpose data=have(obs=0) out=temp; var _all_; run; proc delete data=want;run; data _null_; set temp(where=(lowcase(_name_) like 'var%')); call execute(cat('proc sql;create table x as select "',_name_,'" as vname length=40,', _name_,' as levels,count(',_name_,') as freq,mean(y) as avg_y from have group by ',_name_, ';quit;proc append base=want data=x force;run;')); run;