Solved: calculate frequencies from multiple tables and merge into one final t...

CHL0320 · Posted 04-08-2025 09:19 AM

Hi,

I have more than 200 tables like the sample table below and need to generate a final table with the frequency in percentage of records (7 in the sample table). The simple sample table is like

data temp;
input activity $ location $ item $;
datalines;
A IC V
A ID R
B IR W
C IC R
A IR W
C IC V
B ID V
;
RUN;

What I want is like

I use proc freq to get the percentage, but I am struggling to put them all into one final big table with the Filename variable as the dataset name.

Thank you in advance for your suggestion.

PaigeMiller · Posted 04-08-2025 09:46 AM

How about this example? I only use two data sets, temp1 and temp2, and in fact the data and percents are identical, but the point is to illustrate the method.

data temp1;
input activity $ location $ item $;
datalines;
A IC V
A ID R
B IR W
C IC R
A IR W
C IC V
B ID V
;
RUN;

data temp2;
input activity $ location $ item $;
datalines;
A IC V
A ID R
B IR W
C IC R
A IR W
C IC V
B ID V
;
RUN;

data all;
    /* Note: if you really have 200 data sets, just use temp1-temp200 below */
    /* or whatever the real names are */
    set temp1-temp2 indsname=indsname;
    file=propcase(scan(indsname,2,'.'));
run;
proc freq data=all;
    by file;
    tables activity/out=act;
    tables location/out=loc;
    tables item/out=item;
run;
data all2;
    length characteristic activity location item $30;
    set act loc item;
    if not missing(activity) then activity=catx('~',"Activity",activity);
    if not missing(location) then location=catx('~','Location',location);
    if not missing(item) then item=catx('~','Item',item);
    characteristic=coalescec(activity,location,item);
run;

proc report data=all2 split='~';
    columns ("File~Name" file) ("Percent" percent),characteristic;
    define file/group ' ';
    define characteristic/across ' ';
    define percent/sum ' ' format=8.2;
run;

Please note that it is always a good idea to make the table easily readable by your intended audience. This includes using capital letters and English words. Using per_ic, which is not an English word and is not capitalized, is not as clear and readable as Percent Location IC which uses English words and has capitalized words. Note also File Name has capitalized file names and English words (filename is not really an English word). Furthermore, if R V W for Items have actual names or values, these should appear in the table as well. Activity should not be A B C but rather the actual English word names that A B C represent. And so on. Make the table readable!!

--
Paige Miller

View solution in original post

PaigeMiller · Posted 04-08-2025 09:46 AM

How about this example? I only use two data sets, temp1 and temp2, and in fact the data and percents are identical, but the point is to illustrate the method.

data temp1;
input activity $ location $ item $;
datalines;
A IC V
A ID R
B IR W
C IC R
A IR W
C IC V
B ID V
;
RUN;

data temp2;
input activity $ location $ item $;
datalines;
A IC V
A ID R
B IR W
C IC R
A IR W
C IC V
B ID V
;
RUN;

data all;
    /* Note: if you really have 200 data sets, just use temp1-temp200 below */
    /* or whatever the real names are */
    set temp1-temp2 indsname=indsname;
    file=propcase(scan(indsname,2,'.'));
run;
proc freq data=all;
    by file;
    tables activity/out=act;
    tables location/out=loc;
    tables item/out=item;
run;
data all2;
    length characteristic activity location item $30;
    set act loc item;
    if not missing(activity) then activity=catx('~',"Activity",activity);
    if not missing(location) then location=catx('~','Location',location);
    if not missing(item) then item=catx('~','Item',item);
    characteristic=coalescec(activity,location,item);
run;

proc report data=all2 split='~';
    columns ("File~Name" file) ("Percent" percent),characteristic;
    define file/group ' ';
    define characteristic/across ' ';
    define percent/sum ' ' format=8.2;
run;

Please note that it is always a good idea to make the table easily readable by your intended audience. This includes using capital letters and English words. Using per_ic, which is not an English word and is not capitalized, is not as clear and readable as Percent Location IC which uses English words and has capitalized words. Note also File Name has capitalized file names and English words (filename is not really an English word). Furthermore, if R V W for Items have actual names or values, these should appear in the table as well. Activity should not be A B C but rather the actual English word names that A B C represent. And so on. Make the table readable!!

--
Paige Miller

CHL0320 · Posted 04-08-2025 06:36 PM

@PaigeMiller Thank you for the solution. It works very neatly. There is an error to request that I need to do proc sort by file before proceeding to the step of proc freq. The order in the final table is like temp1, temp10.........temp2, temp 20....... Any suggestions to make the order like temp 1, temp2, temp3......? Thank you so much.

Ksharp · Posted 04-08-2025 09:22 PM

Try this one :

proc sort data=all out=want sortseq=linguistic(numeric_collation=on);
by file;
run;

PaigeMiller · Posted 04-09-2025 06:02 AM

@CHL0320 wrote:

@PaigeMiller Thank you for the solution. It works very neatly. There is an error to request that I need to do proc sort by file before proceeding to the step of proc freq. The order in the final table is like temp1, temp10.........temp2, temp 20....... Any suggestions to make the order like temp 1, temp2, temp3......? Thank you so much.

This raises another question. Is the intended audience for your table output familiar with what is in data set temp1 and temp2 and ... and temp200? Maybe, but I think this is unlikely. Meaningful descriptions of these data sets ought to be used in the table instead of file1 file2 etc. (The goal isn't to get the programming to work; the goal is to create a table that people will understand, and programming is a part of the steps needed to obtain that table)

--
Paige Miller

quickbluefish · Posted 04-09-2025 07:58 AM

I have somehow never seen the INDSNAME option before - very useful!

quickbluefish · Posted 04-08-2025 10:01 AM

data temp1;
input activity $ location $ item $;
datalines;
A IC V
A ID R
B IR W
C IC V
A IR W
C IC V
B ID V
;
RUN;

data temp2;
input activity $ location $ item $;
datalines;
A IC V
A ID R
B IR W
C IC R
A IR W
A IC W
B ID V
;
RUN;

data temp3;
input activity $ location $ item $;
datalines;
A IC R
B ID R
B IR W
C IC R
A IR W
C IC V
B ID V
;
RUN;

proc sql noprint;
select compbl(memname || compress(' (in=ds_' || memname || ')')) into :tlist separated by ' '
from dictionary.tables where libname='WORK' and upcase(memname) like 'TEMP%';
quit;

%put confirm this looks right: &tlist;

data all (rename=(lastfn=fname));
set &tlist end=last;
length lastfn fname $32;
array T {*} ds_:;
do i=1 to dim(T);
	if T[i] then do;
		fname=substr(vname(T[i]),4);
		leave;
	end;
end;
lastfn=lag(fname);
length per_A per_B per_C per_IC per_ID per_IR per_V per_R per_W 8;
format per_: percent8.2;
array p1 {*} per_A per_B per_C;
array p2 {*} per_IC per_ID per_IR;
array p3 {*} per_V per_R per_W;
if fname^=lastfn then do;
	if _N_>1 then do;
		do i=1 to dim(p1);
			p1[i]=p1[i]/nrecs;
		end;
		do i=1 to dim(p2);
			p2[i]=p2[i]/nrecs;
		end;
		do i=1 to dim(p3);
			p3[i]=p3[i]/nrecs;
		end;
		output;
	end;
	call missing(of p1[*], of p2[*], of p3[*]);
	nrecs=0;
end;
nrecs+1;
do i=1 to dim(p1);
	p1[i]+(activity=scan(vname(p1[i]),2,'_'));
end;
do i=1 to dim(p2);
	p2[i]+(location=scan(vname(p2[i]),2,'_'));
end;
do i=1 to dim(p3);
	p3[i]+(item=scan(vname(p3[i]),2,'_'));
end;
if last then do;
	lastfn=fname;
	do i=1 to dim(p1);
		p1[i]=p1[i]/nrecs;
	end;
	do i=1 to dim(p2);
		p2[i]=p2[i]/nrecs;
	end;
	do i=1 to dim(p3);
		p3[i]=p3[i]/nrecs;
	end;
	output;
end;
keep lastfn per_:;
run;

proc print data=all heading=v width=min; run;
run;

CHL0320 · Posted 04-08-2025 10:57 AM

Thank you. The values in the final table are the percentage of observation. The value is (the count of the original value in the column/total record number)*100.

quickbluefish · Posted 04-08-2025 11:09 AM

ah, misread your original post - I edited the code above.

PaigeMiller · Posted 04-08-2025 11:16 AM

NRECS is a variable that is not defined before you use it in this statement

p1[i]=p1[i]/nrecs;

The original question said there are over 200 data sets, how do you up-scale this code to that case? My code handles this case easily, as the comment in my code indicates.

--
Paige Miller

quickbluefish · Posted 04-08-2025 11:23 AM

good point - it should be in the LENGTH statement - the reason it's working is because of the IF _N_>1 THEN... part.

It's the SQL part that's dealing with any input datasets starting with TEMP. I think it should work for any reasonable number (within the max allowed size of a macro var).

calculate frequencies from multiple tables and merge into one final table without key ID

Re: calculate frequencies from multiple tables and merge into one final table without key ID

Re: calculate frequencies from multiple tables and merge into one final table without key ID

Re: calculate frequencies from multiple tables and merge into one final table without key ID

Re: calculate frequencies from multiple tables and merge into one final table without key ID

Re: calculate frequencies from multiple tables and merge into one final table without key ID

Re: calculate frequencies from multiple tables and merge into one final table without key ID

Re: calculate frequencies from multiple tables and merge into one final table without key ID

Re: calculate frequencies from multiple tables and merge into one final table without key ID

Re: calculate frequencies from multiple tables and merge into one final table without key ID

Re: calculate frequencies from multiple tables and merge into one final table without key ID

Re: calculate frequencies from multiple tables and merge into one final table without key ID

Registration is open

Registration is open

SAS Training: Just a Click Away