Solved: levels average and frequency for 1000 variables by proc sql

mh2t · Posted 06-25-2020 08:21 AM

I have a dataset :

ID var1 var2 ... var1000 Y

1 a b f 1111

2 c m g 222

3 a m g 4444

4 d m h 7777

Then I want an output like:

varName Levels Freq Avg_Y

var1 a 2 xx

var1 c 1 yy

var1 d 1 ww

var2 m 3 zz

var2 b 1 qq

The dataset is huge with lots of levels, so PROC FREQ fails to run. I'm wondering if PROC SQL can help me run the code.

Any help would be greatly appreciated!

novinosrin · Posted 06-25-2020 09:37 AM

Hi @mh2t If you are ever interested in some advanced solutions like using Hash for your stated requirement, I encourage you to read Guru Paul D @hashman / @DonH book that has an example of all these kinds.

Data Management Solutions Using SAS^® Hash Table Operations: A Business Intelligence Case Study

By Paul Dorfman and Don Henderson

Anyways, I wanted to have some fun-

data have;
input id (var1 var2) ($) y;
datalines;
1 a b 1111
2 c m 2222
3 a m 4444
4 d m 7777
;
data _null_ ;
 if _n_=1 then do;
  dcl hash H (ordered: "A") ;
  h.definekey  ("varname","levels") ;
  h.definedata ("varname","levels","freq","sum","avg_y") ;
  h.definedone () ;
 end;
 set have end=z;
 array t var1-var2;
 do over t;
  varname=vname(t);
  levels=t;
  if h.find()=0 then do;
   freq=sum(freq,1);
   sum=sum(y,sum);
   avg_y=sum/freq;
  end;
  else do;
   freq=1;
   sum=y;
   avg_y=sum;
  end;
  h.replace();
 end;
 if z;
 h.output(dataset:'want(drop=sum)');
run;

View solution in original post

PaigeMiller · Posted 06-25-2020 08:28 AM

PROC FREQ will do this for the levels. (Although I'm skeptical what value there is in creating this table for 1000 variables...)

PROC MEANS/PROC SUMMARY will compute the mean for Y. (Although I'm skeptical what value there is in creating this table for 1000 variables...)

You can merge the tables together if you really need the output in the exact format shown.

--
Paige Miller

Kurt_Bremser · Posted 06-25-2020 08:55 AM

How do you calculate the value for the average of Y?

Anyway, this is the method for calculating the frequencies:

data have;
input id (var1 var2) ($);
datalines;
1        a       b
2        c       m
3        a       m
4        d       m
;

proc transpose data=have out=long name=varname;
by id;
var var:;
run;

proc freq data=long noprint;
tables varname*col1 / nopercent out=want (drop=percent);
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mh2t · Posted 06-25-2020 09:02 AM

Thank you @Kurt_Bremser .

Avg_Y is calcualted within levels. So for Var1, level "a", the Avg_Y = (1111 + 4444)/2

Kurt_Bremser · Posted 06-25-2020 09:14 AM

Then this would be my suggestion:

data have;
input id (var1 var2) ($) y;
datalines;
1 a b 1111
2 c m 2222
3 a m 4444
4 d m 7777
;

proc transpose
  data=have
  out=long (rename=(col1=levels))
  name=varname
;
by id;
var var:;
run;

data long2;
merge
  long
  have (keep=id y)
;
by id;
run;

proc summary data=long2 nway;
class varname levels;
output
  out=want (drop=_type_ rename=(_freq_=freq))
  mean(y)=
;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

AhmedAl_Attar · Posted 06-25-2020 09:36 AM

For Huge data sets, I would avoid Proc Transpose. Using Data Step for fixed data structure can be faster and uses less data passes

/* Transpose the data using Data Step */
data want(keep=varname levels y);
	length varname $7 levels $1;
	array vars (*) $1 var1-var3;
	set have;
	do i=1 to dim(vars);
		varname =vname(vars(i));
		levels =vvalue(vars(i));
		output;
	end;
run;
/* Summarize to produce your desired means */
proc summary data=want nway;
	class varname levels;
	var y;
	output out=want_sums(drop=_type_ Rename=(_freq_=freq)) mean=y;
run;
/* Remove unwanted data */
proc delete data=want; run;

Hope this helps,

Ahmed

novinosrin · Posted 06-25-2020 09:37 AM

Hi @mh2t If you are ever interested in some advanced solutions like using Hash for your stated requirement, I encourage you to read Guru Paul D @hashman / @DonH book that has an example of all these kinds.

Data Management Solutions Using SAS^® Hash Table Operations: A Business Intelligence Case Study

By Paul Dorfman and Don Henderson

Anyways, I wanted to have some fun-

data have;
input id (var1 var2) ($) y;
datalines;
1 a b 1111
2 c m 2222
3 a m 4444
4 d m 7777
;
data _null_ ;
 if _n_=1 then do;
  dcl hash H (ordered: "A") ;
  h.definekey  ("varname","levels") ;
  h.definedata ("varname","levels","freq","sum","avg_y") ;
  h.definedone () ;
 end;
 set have end=z;
 array t var1-var2;
 do over t;
  varname=vname(t);
  levels=t;
  if h.find()=0 then do;
   freq=sum(freq,1);
   sum=sum(y,sum);
   avg_y=sum/freq;
  end;
  else do;
   freq=1;
   sum=y;
   avg_y=sum;
  end;
  h.replace();
 end;
 if z;
 h.output(dataset:'want(drop=sum)');
run;

Ksharp · Posted 06-26-2020 09:52 AM

data have;
input id (var1 var2) ($) y;
datalines;
1 a b 1111
2 c m 2222
3 a m 4444
4 d m 7777
;
proc transpose data=have(obs=0) out=temp;
var _all_;
run; 
proc delete data=want;run;
data _null_;
 set temp(where=(lowcase(_name_) like 'var%'));
call execute(cat('proc sql;create table x as select "',_name_,'" as vname length=40,',
_name_,' as levels,count(',_name_,') as freq,mean(y) as avg_y from have group by ',_name_,
';quit;proc append base=want data=x force;run;'));
run;

levels average and frequency for 1000 variables by proc sql

Re: levels average and frequency for 1000 variables by proc sql

Data Management Solutions Using SAS^® Hash Table Operations: A Business Intelligence Case Study

Re: levels average and frequency for 1000 variables by proc sql

Re: levels average and frequency for 1000 variables by proc sql

Re: levels average and frequency for 1000 variables by proc sql

Re: levels average and frequency for 1000 variables by proc sql

Re: levels average and frequency for 1000 variables by proc sql

Re: levels average and frequency for 1000 variables by proc sql

Data Management Solutions Using SAS^® Hash Table Operations: A Business Intelligence Case Study

Re: levels average and frequency for 1000 variables by proc sql

Catch up on SAS Innovate 2026

Data Management Solutions Using SAS® Hash Table Operations: A Business Intelligence Case Study

Data Management Solutions Using SAS® Hash Table Operations: A Business Intelligence Case Study

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away

Data Management Solutions Using SAS^® Hash Table Operations: A Business Intelligence Case Study

Data Management Solutions Using SAS^® Hash Table Operations: A Business Intelligence Case Study