Hello,
I am trying to compute a Cochran's Q using proc freq with 24 indexed variables, but SAS freezes as I think the line of code is too long.
Is there a trick to allow SAS to compute the line of code?
My code is:
proc freq data=mydata;
tables var1*var2*var3*var4*var5*var6*var7*var8*var9*var10*var11*var12*var13*var14*var15*var16*var17*var18*var19*var20*var21*var22*var23*var24 /agree;
run;
Many thanks in advance!
Hello @docfak,
This is interesting! Even with
ods select CochransQ;
which is more restrictive than
ods exclude crosstabfreqs;
PROC FREQ would probably take a while to run your code with this 1000-obs. sample dataset HAVE (in place of your MYDATA):
%let m=24;
data have;
call streaminit(27182818);
array var[&m];
do i=1 to 1000;
do _n_=1 to dim(var);
var[_n_]=rand('bern',0.5);
end;
output;
end;
run;
I haven't actually run your PROC FREQ step with m=24, but observed that for m=17, 18, 19, 20 the run times approximately doubled with each additional variable (like 5, 9, 19, 39 seconds).
Then I looked up the formula for Cochran's Q (and the associated p-value) in the most recent SAS 9.4 documentation (https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/statug/statug_freq_details78.htm). After some more research and manual calculations I realized that the formula for Cochran's Q on that page (and in my local SAS 9.4M5 help files) is wrong! I will report this to SAS. Comparing older versions of the documentation, starting at SAS version 8 (http://v8doc.sas.com/sashtml/proc/zeq-comp.htm), it appears that the error was introduced in the documentation of SAS 9.2 (https://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_freq_a0...) and then copied to versions 9.3 and 9.4. In version 8 and also as "recently" as v9.1.3 (https://support.sas.com/documentation/onlinedoc/91pdf/index_913.html, see Procedures Guide base_proc_8977.pdf linked there) the formula was correct:
Using the correct formula (more precisely: a formula equivalent to the correct one) I was able to reproduce the results from PROC FREQ for m=17, 18, 19, 20, but with run times <1 second, and the case m=24 went about as fast as the others. Here's the code I used:
ods select none;
ods output onewayfreqs=freqs;
proc freq data=have;
tables var:;
run;
ods select all;
data freqs1(keep=n);
set freqs(rename=(frequency=n));
if sum(of var:);
run;
proc summary data=freqs1;
var n;
output out=stats(drop=_:) css=c sum=t;
run;
data want(keep=s c t Q DF p);
do until(last);
set have end=last;
s+(sum(of var:)**2);
end;
set stats;
Q=(&m-1)*c/(t-s/&m);
DF=&m-1;
p=1-cdf('chisq',Q,DF);
format p pvalue6.4;
label Q="Cochran's Q"
p="p-value of Cochran's Q test";
run;
Log (incl. creation of sample data):
1 %let m=24; 2 3 data have; 4 call streaminit(27182818); 5 array var[&m]; 6 do i=1 to 1000; 7 do _n_=1 to dim(var); 8 var[_n_]=rand('bern',0.5); 9 end; 10 output; 11 end; 12 run; NOTE: The data set WORK.HAVE has 1000 observations and 25 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.00 seconds 13 14 ods select none; 15 ods output onewayfreqs=freqs; 16 proc freq data=have; 17 tables var:; 18 run; NOTE: The data set WORK.FREQS has 48 observations and 53 variables. NOTE: There were 1000 observations read from the data set WORK.HAVE. NOTE: PROCEDURE FREQ used (Total process time): real time 0.09 seconds cpu time 0.09 seconds 19 ods select all; 20 21 data freqs1(keep=n); 22 set freqs(rename=(frequency=n)); 23 if sum(of var:); 24 run; NOTE: There were 48 observations read from the data set WORK.FREQS. NOTE: The data set WORK.FREQS1 has 24 observations and 1 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.00 seconds 25 26 proc summary data=freqs1; 27 var n; 28 output out=stats(drop=_:) css=c sum=t; 29 run; NOTE: Multiple concurrent threads will be used to summarize data. NOTE: There were 24 observations read from the data set WORK.FREQS1. NOTE: The data set WORK.STATS has 1 observations and 2 variables. NOTE: PROCEDURE SUMMARY used (Total process time): real time 0.03 seconds cpu time 0.07 seconds 30 31 data want(keep=s c t Q DF p); 32 do until(last); 33 set have end=last; 34 s+(sum(of var:)**2); 35 end; 36 set stats; 37 Q=(&m-1)*c/(t-s/&m); 38 DF=&m-1; 39 p=1-cdf('chisq',Q,DF); 40 format p pvalue6.4; 41 label Q="Cochran's Q" 42 p="p-value of Cochran's Q test"; 43 run; NOTE: There were 1000 observations read from the data set WORK.HAVE. NOTE: There were 1 observations read from the data set WORK.STATS. NOTE: The data set WORK.WANT has 1 observations and 6 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds
Result (dataset WANT):
s c t Q DF p 152459 4314.96 12097 17.2762 23 0.7955
So, if you only need Cochran's Q and the associated p-value, it seems that the manual calculation beats PROC FREQ.
How many levels are involved with each of those variables? IF you have 2 levels per variable you are looking at 2**22 (4,194,304) 2-by-2 tables.
"Freeze" in this case, I would guess means you don't see anything because SAS is making all those html tables to display all of the frequencies involved.
I might suggest placing this immediately before your Proc Freq code to suppress the humoungous
ods exclude crosstabfreqs;
Are you sure that your data meets the requirements for Agreement calculations as in " AGREE statistics are computed only for tables where the number of rows equals the number of columns." ?
I have two levels per variable.
When I use 12 variables (var1-var12), I can compute Cochran's Q.
Using your line of code, I get this error
ERROR: An exception has been encountered.
Veuillez contacter le support technique et transmettez-lui les informations suivantes :
Le nom de la tâche SAS est [FREQ]
ERROR: Violation de l'accès en lecture FREQ
Exception en (0753146E)
Task Traceback
Address Frame (DBGHELP API Version 4.0 rev 5)
000000000753146E 00000000007F9F70 sasods:tkvercn1+0x10042E
000000000753CD31 00000000007FA2E0 sasods:tkvercn1+0x10BCF1
000000000753C30A 00000000007FA370 sasods:tkvercn1+0x10B2CA
000000000F604DB5 00000000007FA378 sasfreq:tkvercn1+0x73D75
000000000F5B177D 00000000007FCC60 sasfreq:tkvercn1+0x2073D
000000000F59450B 00000000007FFB20 sasfreq:tkvercn1+0x34CB
00000000035BA366 00000000007FFB28 sashost:Main+0x11EA6
00000000035C0574 00000000007FFF20 sashost:Main+0x180B4
00007FFCCAD07034 00000000007FFF28 KERNEL32:BaseThreadInitThunk+0x14
00007FFCCB422651 00000000007FFF58 ntdll:RtlUserThreadStart+0x21
Hello @docfak,
This is interesting! Even with
ods select CochransQ;
which is more restrictive than
ods exclude crosstabfreqs;
PROC FREQ would probably take a while to run your code with this 1000-obs. sample dataset HAVE (in place of your MYDATA):
%let m=24;
data have;
call streaminit(27182818);
array var[&m];
do i=1 to 1000;
do _n_=1 to dim(var);
var[_n_]=rand('bern',0.5);
end;
output;
end;
run;
I haven't actually run your PROC FREQ step with m=24, but observed that for m=17, 18, 19, 20 the run times approximately doubled with each additional variable (like 5, 9, 19, 39 seconds).
Then I looked up the formula for Cochran's Q (and the associated p-value) in the most recent SAS 9.4 documentation (https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/statug/statug_freq_details78.htm). After some more research and manual calculations I realized that the formula for Cochran's Q on that page (and in my local SAS 9.4M5 help files) is wrong! I will report this to SAS. Comparing older versions of the documentation, starting at SAS version 8 (http://v8doc.sas.com/sashtml/proc/zeq-comp.htm), it appears that the error was introduced in the documentation of SAS 9.2 (https://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_freq_a0...) and then copied to versions 9.3 and 9.4. In version 8 and also as "recently" as v9.1.3 (https://support.sas.com/documentation/onlinedoc/91pdf/index_913.html, see Procedures Guide base_proc_8977.pdf linked there) the formula was correct:
Using the correct formula (more precisely: a formula equivalent to the correct one) I was able to reproduce the results from PROC FREQ for m=17, 18, 19, 20, but with run times <1 second, and the case m=24 went about as fast as the others. Here's the code I used:
ods select none;
ods output onewayfreqs=freqs;
proc freq data=have;
tables var:;
run;
ods select all;
data freqs1(keep=n);
set freqs(rename=(frequency=n));
if sum(of var:);
run;
proc summary data=freqs1;
var n;
output out=stats(drop=_:) css=c sum=t;
run;
data want(keep=s c t Q DF p);
do until(last);
set have end=last;
s+(sum(of var:)**2);
end;
set stats;
Q=(&m-1)*c/(t-s/&m);
DF=&m-1;
p=1-cdf('chisq',Q,DF);
format p pvalue6.4;
label Q="Cochran's Q"
p="p-value of Cochran's Q test";
run;
Log (incl. creation of sample data):
1 %let m=24; 2 3 data have; 4 call streaminit(27182818); 5 array var[&m]; 6 do i=1 to 1000; 7 do _n_=1 to dim(var); 8 var[_n_]=rand('bern',0.5); 9 end; 10 output; 11 end; 12 run; NOTE: The data set WORK.HAVE has 1000 observations and 25 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.00 seconds 13 14 ods select none; 15 ods output onewayfreqs=freqs; 16 proc freq data=have; 17 tables var:; 18 run; NOTE: The data set WORK.FREQS has 48 observations and 53 variables. NOTE: There were 1000 observations read from the data set WORK.HAVE. NOTE: PROCEDURE FREQ used (Total process time): real time 0.09 seconds cpu time 0.09 seconds 19 ods select all; 20 21 data freqs1(keep=n); 22 set freqs(rename=(frequency=n)); 23 if sum(of var:); 24 run; NOTE: There were 48 observations read from the data set WORK.FREQS. NOTE: The data set WORK.FREQS1 has 24 observations and 1 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.00 seconds 25 26 proc summary data=freqs1; 27 var n; 28 output out=stats(drop=_:) css=c sum=t; 29 run; NOTE: Multiple concurrent threads will be used to summarize data. NOTE: There were 24 observations read from the data set WORK.FREQS1. NOTE: The data set WORK.STATS has 1 observations and 2 variables. NOTE: PROCEDURE SUMMARY used (Total process time): real time 0.03 seconds cpu time 0.07 seconds 30 31 data want(keep=s c t Q DF p); 32 do until(last); 33 set have end=last; 34 s+(sum(of var:)**2); 35 end; 36 set stats; 37 Q=(&m-1)*c/(t-s/&m); 38 DF=&m-1; 39 p=1-cdf('chisq',Q,DF); 40 format p pvalue6.4; 41 label Q="Cochran's Q" 42 p="p-value of Cochran's Q test"; 43 run; NOTE: There were 1000 observations read from the data set WORK.HAVE. NOTE: There were 1 observations read from the data set WORK.STATS. NOTE: The data set WORK.WANT has 1 observations and 6 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds
Result (dataset WANT):
s c t Q DF p 152459 4314.96 12097 17.2762 23 0.7955
So, if you only need Cochran's Q and the associated p-value, it seems that the manual calculation beats PROC FREQ.
This is very interesting, thank you very much!
Why do you think those two different ways of writing the formula are different?
Is of the form A*((B*C)/D)
Has the same terms re-arranged into: B*A*C/D
Thanks, @Tom, for chiming in. I see what you mean, but my understanding is that the term T² is not part of the sum (j=1, ..., m) because it doesn't contain j and it's not enclosed in parentheses either. (Also, I think that my "manual" calculation is consistent with this interpretation and its results matched those from PROC FREQ in the few cases I checked.) So, multiplying m (from the second form of the equation) into the numerator of the fraction would result in mSTj² − mT², but the correct term is mSTj² − T² (as in the first equation).
I have sent a notification to SAS via the "Feedback" link on the documentation page and I will report their response (if any) in this thread.
Addendum:
Another strong argument supporting my point is:
We have T=STj and, by Cauchy-Schwarz inequality, T² ≤ mSTj² (with equality holding in the case T1=...=Tm), so the numerator of the fraction in the old (correct) version of the formula (as I read it) is non-negative. The denominator is non-negative as well because T=SSk and 0 ≤ Sk ≤ m for all k=1, ..., N, which implies mT ≥SSk² (with equality holding in the case S1, ..., SN Î {0, m}). Hence, QC cannot be negative, which is plausible as it is a chi-square statistic. However, the numerator of the new (incorrect) formula is non-positive (since STj² ≤ (STj)² = T²), even regardless of T² being part of the sum or not. Together with the non-negative denominator QC could never be positive, which would be absurd for a chi-square statistic.
Update 2022-10-13
I have not received any response from SAS to my notifications regarding the Cochran's Q formula in question (emails to yourturn@sas.com sent on October 10, 2021 and June 11, 2022), and the documentation has not been revised either. This must be a rare exception because in Re: SAS Certifiication & Documentation Questions (March 2022) it says: "Our team responds to every request within 2 business days."
Hi @free. You're right...unanswered feedback from our team is a rare exception. 🙂 I'm on the team that monitors the yourturn@sas.com inbox and we're committed to addressing each response we receive. I don't see your feedback emails on this topic from the dates you mention (10/10/21 and 6/11/22). I wonder if there was an email delivery snafu.
Regardless, I'd like to be sure we address your feedback. I will inform our team about this Communities thread to investigate the doc. In the meantime, please feel free to send another email to yourturn@sas.com about this same issue so we can be sure your emails are delivering accurately.
Many thanks, @MarkBright, for your prompt response. Indeed, lost emails were my only explanation for this. I'm going to re-send the email from last June.
Update 2022-10-17
The typo in Cochran's Q formula in the current PROC FREQ documentation was confirmed by SAS Technical Support earlier today (and indeed within two business days after I had re-sent the lost email). The correction (replacement of T² by T²/m) will be made "in the next release of the documentation."
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.