BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
TrueTears
Obsidian | Level 7

I have a datafile called 'original' and in it contains 4 variables, call them a, b, c, d, with n observations each. I then use proc surveyselect to draw 1000 resamples from the 'original' dataset with sample size n_b = n/4, the code is as follows:

 

 

proc sql noprint;
select ceil(count(*)/4) into :record_count
from original;
quit;


%put &record_count;
%let rep = 1000;
proc surveyselect data= original out=bootsample
     seed = 1234 method = urs
	 sampsize=&record_count outhits rep = &rep;
run;
ods listing close;

 

 

This produces a datafile named 'bootsample' which contains 1000 samples with sample size n_b of each variable (a, b, c, and d) from the 'original' dataset. Each observation's replication ID is given by the variable "Replicate" (ranging from 1 to 1000).

 

What I need to do is this:Take Replicate = 1 (i.e., the first replication sample) and the variable a as an example. I want to calculate the following value of t: (if the picture below doesn't show, please see attachment of the picture titled "formula")

 

expression.png

 

where mean(a) is the sample average of the variable a for replication sample 1, std(a) is the sample standard deviation of the variable a for replication sample 1, a_i represents each individual observation of the variable a for replication sample 1.

 

Then, I want to repeat the above procedure and calculate the value of t for all 1000 replication samples and all four variables: a, b, c, and d. I want to store the final result in a datafile called "result" that has 4 variables called a_t, b_t, c_t, and d_t (i.e., 4 columns) and the 1000 values of t of each variable in each row. So, graphically, a datafile structured like this: (if the picture below doesn't show, please see attachment of the picture titled "result")

 

result.png

 

Can anyone show me a template code that can achieve what I described above? I'm thinking maybe proc sql can do the trick, but I'm quite new to SAS and still don't really know the syntax very well. Thanks.

 


formula.pngresult.png
1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Sure, PROC SQL can do the trick, but PROC MEANS (or PROC SUMMARY for that matter) can compute more statistics than PROC SQL, e.g. skewness (thanks, @Reeza, for the hint!). Please note that your S statistic can be derived from the coefficient of variation: S=100/CV. Your g statistic can be calculated directly as skewness with the VARDEF=N option of the PROC MEANS statement, whereas your sample standard deviation would require VARDEF=DF (for the denominator n-1, which I assume is what you want). Instead of merging two datasets with summary statistics (one for each setting of VARDEF), I decided to convert the default "DF skewness" to "N skewness" by multiplying with the appropriate conversion factor := (n-1)(n-2)/n² = 1 - 3/+ 2/n².

 

 

proc summary data=bootsample;
by replicate;
var a b c d;
output out=stats(drop=_:) cv= skew= / autoname;
run;

%let f=%sysevalf(1-3/&record_count+2/&record_count**2);

data want;
set stats;
a_t=100/a_CV+&f*a_Skew;
b_t=100/b_CV+&f*b_Skew;
c_t=100/c_CV+&f*c_Skew;
d_t=100/d_CV+&f*d_Skew;
drop a_CV--d_Skew;
run;

ods html file="C:\Temp\t_stat.html";
ods listing close;
title 'The t Statistic';

proc print data=want label noobs;
run;

ods html close;
ods listing;
title;

 

This could be the PROC SQL code (perhaps for validation purposes or for comparison of run times and numerical accuracy):

proc sql;
create table want as
select replicate,
       m_a/s_a+sum(x_a)/(&record_count*s_a**3) as a_t,
       m_b/s_b+sum(x_b)/(&record_count*s_b**3) as b_t,
       m_c/s_c+sum(x_c)/(&record_count*s_c**3) as c_t,
       m_d/s_d+sum(x_d)/(&record_count*s_d**3) as d_t from
(select replicate, mean(a) as m_a, std(a) as s_a, (a-calculated m_a)**3 as x_a, 
                   mean(b) as m_b, std(b) as s_b, (b-calculated m_b)**3 as x_b,
                   mean(c) as m_c, std(c) as s_c, (c-calculated m_c)**3 as x_c,
                   mean(d) as m_d, std(d) as s_d, (d-calculated m_d)**3 as x_d
 from bootsample
 group by replicate)
group by replicate;
quit;

Minor differences (like 1E-13, but depending on your a, b, c, d values) between the two approaches are likely to occur.

 

View solution in original post

4 REPLIES 4
Reeza
Super User
Look into proc means. Using BY processing in SAS tells proc how to define your groups.
Here's some sample code.


Proc means data=have stackods n mean median std max min;
By replicate;
Var a b c d t;
Ods table summary=Results;
Run;

Proc print data=results;
Run;
Reeza
Super User
Sorry, didn't see your formulas in attachment. It helps to include or mention them in your question. Is your metric related to skewness?
TrueTears
Obsidian | Level 7
I included them as a picture in the post itself, maybe it doesn't show for you for some reason. I have attached both pictures as an attachment.

Basically, the trouble I am having is how to code the formula and also how to output the results.

Thanks for your help.
FreelanceReinh
Jade | Level 19

Sure, PROC SQL can do the trick, but PROC MEANS (or PROC SUMMARY for that matter) can compute more statistics than PROC SQL, e.g. skewness (thanks, @Reeza, for the hint!). Please note that your S statistic can be derived from the coefficient of variation: S=100/CV. Your g statistic can be calculated directly as skewness with the VARDEF=N option of the PROC MEANS statement, whereas your sample standard deviation would require VARDEF=DF (for the denominator n-1, which I assume is what you want). Instead of merging two datasets with summary statistics (one for each setting of VARDEF), I decided to convert the default "DF skewness" to "N skewness" by multiplying with the appropriate conversion factor := (n-1)(n-2)/n² = 1 - 3/+ 2/n².

 

 

proc summary data=bootsample;
by replicate;
var a b c d;
output out=stats(drop=_:) cv= skew= / autoname;
run;

%let f=%sysevalf(1-3/&record_count+2/&record_count**2);

data want;
set stats;
a_t=100/a_CV+&f*a_Skew;
b_t=100/b_CV+&f*b_Skew;
c_t=100/c_CV+&f*c_Skew;
d_t=100/d_CV+&f*d_Skew;
drop a_CV--d_Skew;
run;

ods html file="C:\Temp\t_stat.html";
ods listing close;
title 'The t Statistic';

proc print data=want label noobs;
run;

ods html close;
ods listing;
title;

 

This could be the PROC SQL code (perhaps for validation purposes or for comparison of run times and numerical accuracy):

proc sql;
create table want as
select replicate,
       m_a/s_a+sum(x_a)/(&record_count*s_a**3) as a_t,
       m_b/s_b+sum(x_b)/(&record_count*s_b**3) as b_t,
       m_c/s_c+sum(x_c)/(&record_count*s_c**3) as c_t,
       m_d/s_d+sum(x_d)/(&record_count*s_d**3) as d_t from
(select replicate, mean(a) as m_a, std(a) as s_a, (a-calculated m_a)**3 as x_a, 
                   mean(b) as m_b, std(b) as s_b, (b-calculated m_b)**3 as x_b,
                   mean(c) as m_c, std(c) as s_c, (c-calculated m_c)**3 as x_c,
                   mean(d) as m_d, std(d) as s_d, (d-calculated m_d)**3 as x_d
 from bootsample
 group by replicate)
group by replicate;
quit;

Minor differences (like 1E-13, but depending on your a, b, c, d values) between the two approaches are likely to occur.

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 1514 views
  • 4 likes
  • 3 in conversation