Hello SAS community,
First, I want to say how grateful I am to find this discussion board. I think it is so valuable there is a community of people who are willing to help one another with their SAS programming questions and I want to thank anyone in advance who comments and replies. I hope one day I am a good enough SAS programmer to return the favor and help others.
My goal is to take a dataset of multiple variables and find how many times unique combinations of pairs of variables exist in the dataset. First, I created a fake dataset with three variables called y1, y2, y3. Then, I have a macro that uses proc freq on each pair of combinations. The last step is to combine the frequencies (and convert it to a fraction) into one data set with the original variables.
Here is my SAS program:
data test_data;
do i = 1 to 10;
cod = i;
call streaminit(123);
u1 = rand("Uniform");
y1 = ceil(5*u1);
call streaminit(456);
u2 = rand("Uniform");
y2 = ceil(5*u2);
call streaminit(789);
u3 = rand("Uniform");
y3 = ceil(5*u3);
output;
end;
drop u1 u2 u3 i;
run;
/*proc print data = test_data; run;*/
%macro pairedfreqs(dataset, order, k, vars);
proc freq data = &dataset order=data noprint;
tables y1*y2/ out = pfreq1 list nopct nocum;
run;
proc freq data = &dataset order=data noprint;
tables y1*y3/ out = pfreq2 list nopct nocum;
run;
proc freq data= &dataset order=data noprint;
tables y2*y3/ out = pfreq3 list nopct nocum;
run;
%do i=1 %to &k;
proc sort data=&dataset;
by %scan(&vars,&i);
run;
proc sort data=pfreq&i;
by %scan(&vars,&i);
run;
data f&i(drop=count percent);
merge &dataset pfreq&i;
by %scan(&vars,&i);
pfreq&i=percent/100;
run;
proc sort;
by ℴ
run;
%end;
data all_freqs;
merge %do i=1 %to &k; f&i %end;;
by ℴ
run;
%mend;
%pairedfreqs(dataset=test_data,order=cod,k = 3, vars= y1 y2 y3);
The problem arises with my do-loop, specifically where I create dataset "f2" because there is no variable y2 in this dataset. So the final data set of interest, all_freqs, has missing values for pfreq2 even though the actual data set pfreq2 does exist and is correct.
I have searched the internet and these discussion boards, and I have tried many different things but I am stuck. Any advice on fixing what I have so the correct data set results is appreciated.
The second step, or my second question, is then how to edit my macro so that the idea behind this procedure can extend to datasets with any number of variables. Meaning, change the macro contents so that if someone has a dataset with 5 variables (and is using the macro with the argument vars = var1 var2 var3 var4 var5), the macro will find all the combinations of pairs of two variables, find the frequencies (convert to a decimal), and then merge together in a final data set.
If further clarification is needed, let me know. Thanks!!
This is an expanded version of Tom's suggested method. It is probably more convoluted than it needs to be, but appears to be a way of accomplishing what you want and should be easy to generalize to any number of variables:
data test_data;
do i = 1 to 10;
cod = i;
call streaminit(123);
u1 = rand("Uniform");
y1 = ceil(5*u1);
call streaminit(456);
u2 = rand("Uniform");
y2 = ceil(5*u2);
call streaminit(789);
u3 = rand("Uniform");
y3 = ceil(5*u3);
output;
end;
drop u1 u2 u3 i;
run;
PROC MEANS DATA=WORK.TEST_DATA NOPRINT;
WAYS 0,2;
VAR cod;
CLASS y1 y2 y3;
OUTPUT OUT=WORK.TEST_DATA_STATS (drop=_freq_) N()= / AUTONAME;
RUN;
proc sql noprint;
select count(_type_)-1
into :total_types
from (select distinct _type_
from test_data_stats)
;
select cod_n
into :total
from test_data_stats
where _type_ eq 0
;
quit;
proc sort data=test_data_stats;
by descending _type_;
run;
data test_data_stats (drop=_: cod_N);
set test_data_stats (where=(_type_ ne 0));
array pfreq(&total_types);
if _type_ ne lag(_type_) then types+1;
pfreq(types)=cod_N/&total;
run;
data want;
set test_data;
run;
proc sql noprint;
create table test_data_stats_plus
as select *
from test_data_stats,
test_data (keep=cod)
order by cod
;
select name
into :vars separated by " "
from dictionary.columns
where libname="WORK" and
memname="WANT" and
name like 'y%'
;
quit;
%macro finish;
%do i=1 %to &total_types;
data type_&i (drop=types);
set test_data_stats_plus
(where=(types eq &i));
run;
%end;
%let type=0;
%do i=1 %to %eval(&total_types.-1);
%do j=%eval(&i+1) %to &total_types.;
%let type=%eval(&type+1);
proc sort data=type_&type;
by cod %scan(&vars,&i) %scan(&vars,&j);
run;
proc sort data=want;
by cod %scan(&vars,&i) %scan(&vars,&j);
run;
data want;
update want (in=ina) type_&type;
by cod %scan(&vars,&i) %scan(&vars,&j);
if ina;
run;
%end;
%end;
%mend;
%finish
Given your sample dataset, can you post a file that looks like the final dataset you want to achieve?
Hey Arthur,
This is the current output:
Obs cod y1 y2 y3 pfreq1 pfreq2 pfreq3
1 1 3 1 1 0.1 . 0.3
2 2 2 2 2 0.1 . 0.1
3 3 2 1 1 0.1 . 0.3
4 4 1 5 2 0.1 . 0.1
5 5 3 5 4 0.1 . 0.1
6 6 5 2 5 0.1 . 0.1
7 7 1 3 5 0.1 . 0.2
8 8 1 1 1 0.1 . 0.3
9 9 4 4 5 0.1 . 0.1
10 10 4 3 5 0.1 . 0.1
This is the desired output:
Obs cod y1 y2 y3 pfreq1 pfreq2 pfreq3
1 1 3 1 1 0.1 0.1 0.3
2 2 2 2 2 0.1 0.1 0.1
3 3 2 1 1 0.1 0.1 0.3
4 4 1 5 2 0.1 0.1 0.1
5 5 3 5 4 0.1 0.1 0.1
6 6 5 2 5 0.1 0.1 0.1
7 7 1 3 5 0.1 0.1 0.2
8 8 1 1 1 0.1 0.1 0.3
9 9 4 4 5 0.1 0.2 0.1
10 10 4 3 5 0.1 0.2 0.1
What are pfreq1 thru pfreq3 supposed to represent?
Taking your problem out of macro land for the moment, the following produces the counts you want:
proc freq data= test_data order=data noprint;
tables cod*y1*y2*y3/ out = pfreq list nopct nocum;
run;
proc sort data=pfreq;
by cod;
run;
Hey Arthur,
Your suggest code is only giving me one column of counts (which all display the value 1) not the three separate pairs of counts which should have the values I posted above (pfreq1, pfreq2, pfreq3) * 100.
pfreq1 - 3 are the pairs of frequencies (well, converted into decimals).
For example:
y1 y2
3 1
1 2
3 1
The pfreq values would be 2/3, 1/3, and 2/3 because the combination of y1 = 3 and y2 = 1 occurs twice and the combination y1 = 1 and y2 = 2 occurs once.
I hope that makes sense but if further clarification is needed let me know.
I'm not sure about your post-processing, but you might find this a little easier way to get all of your two-way counts than your macro loop:
PROC MEANS DATA=WORK.TEST_DATA NOPRINT;
WAYS 2;
VAR cod;
CLASS y1 y2 y3;
OUTPUT OUT=WORK.TEST_DATA_STATS N()= / AUTONAME;
RUN;
Tom
Hey Tom,
I like this idea because I think I know how to extend it for my second question (how to adjust the program for any sized data set).
Any idea on how to adjust the proc means output so the desired output I posted above is displayed? Instead, your proc means currently outputs 26 rows but I want to collapse the information back to the original 10 observations.
Thanks!
This is an expanded version of Tom's suggested method. It is probably more convoluted than it needs to be, but appears to be a way of accomplishing what you want and should be easy to generalize to any number of variables:
data test_data;
do i = 1 to 10;
cod = i;
call streaminit(123);
u1 = rand("Uniform");
y1 = ceil(5*u1);
call streaminit(456);
u2 = rand("Uniform");
y2 = ceil(5*u2);
call streaminit(789);
u3 = rand("Uniform");
y3 = ceil(5*u3);
output;
end;
drop u1 u2 u3 i;
run;
PROC MEANS DATA=WORK.TEST_DATA NOPRINT;
WAYS 0,2;
VAR cod;
CLASS y1 y2 y3;
OUTPUT OUT=WORK.TEST_DATA_STATS (drop=_freq_) N()= / AUTONAME;
RUN;
proc sql noprint;
select count(_type_)-1
into :total_types
from (select distinct _type_
from test_data_stats)
;
select cod_n
into :total
from test_data_stats
where _type_ eq 0
;
quit;
proc sort data=test_data_stats;
by descending _type_;
run;
data test_data_stats (drop=_: cod_N);
set test_data_stats (where=(_type_ ne 0));
array pfreq(&total_types);
if _type_ ne lag(_type_) then types+1;
pfreq(types)=cod_N/&total;
run;
data want;
set test_data;
run;
proc sql noprint;
create table test_data_stats_plus
as select *
from test_data_stats,
test_data (keep=cod)
order by cod
;
select name
into :vars separated by " "
from dictionary.columns
where libname="WORK" and
memname="WANT" and
name like 'y%'
;
quit;
%macro finish;
%do i=1 %to &total_types;
data type_&i (drop=types);
set test_data_stats_plus
(where=(types eq &i));
run;
%end;
%let type=0;
%do i=1 %to %eval(&total_types.-1);
%do j=%eval(&i+1) %to &total_types.;
%let type=%eval(&type+1);
proc sort data=type_&type;
by cod %scan(&vars,&i) %scan(&vars,&j);
run;
proc sort data=want;
by cod %scan(&vars,&i) %scan(&vars,&j);
run;
data want;
update want (in=ina) type_&type;
by cod %scan(&vars,&i) %scan(&vars,&j);
if ina;
run;
%end;
%end;
%mend;
%finish
Happy, happy holidays to you Arthur!
Everything is fine and dandy with your code and I was able to successfully include it in my macro. I have one more question, if it isn't too much. I don't understand this part of your code:
select name
into :vars separated by " "
from dictionary.columns
where libname="WORK" and
memname="WANT" and
name like 'y%'
;
Specifically, the last like "name like 'y%'. I am trying to allow users to have a different name for their variables other than y#. Is there another line that I could use instead in case the names of their variables are var1, var2, var3, ... etc? Or if the names were height, weight, age, etc.?
Thank you!!
using:
select name
into :vars separated by " "
from dictionary.columns
where libname="WORK" and
memname="WANT" ;
Happy holidays to you and everyone else as well. That part of the code was simply automating the creation of macro variable &vars so, in your case, it would create a variable that was equal to "y1 y2 y3".
You could always just allow the user to enter the variables in either a %let statement, a macro call, or by indicating which variables should be excluded from the list. However, you would have to ensure that a relationship was created between that variable list and the resulting combinations (i.e., pfreq1, pfreq2 and pfreq3). Conversely, what might be more manageable, would be to recode such a variable list into y1, y2, etc.
However, realize that I'm a Psychologist NOT a programmer, thus there could easily be a better way to address what you are trying to do. I, typically, just try to find a good solution, then leave it to others to optimize the concept.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.