BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Season
Lapis Lazuli | Level 10

Hello, everyone. I am currently attempting to compute descriptive statistics, generate statistical plots and outputting them to datasets. My task is to perform these procedures on the very same variable using different grouping methods repetitively. I wish to use PROC UNIVARIATE to do the job, but the CLASS statement in PROC UNIVARITATE only accepts putting one variable in. Therefore, one way to achieve my goal is to write a code, copy that code and paste it somewhere else, change the variable name in the CLASS statement and the names of the output datasets, and do that job over and over again.

I think that method is a feasible yet somehow "dumb" one. Therefore, I tried to compose a SAS macro to help me deal with the problem. Please take a look at my code first:

 

data a;
input var$ @@;
cards;
aa ab ac ad ae af ag ah ai aj
;
run;
data b;
set c a;/*c is the dataset in which all of the data are stored*/
merge c a;
call symput ('i'||left(_n_),strip(var));
run;
%put _user_;
%macro st;
%do j=1 %to 10;
proc univariate data=b outtable=d&j. freq normal;
class i&j.;
var k;/*k is the variable I wish to analyze*/
ppplot;
cdfplot;
qqplot/normal(mu=est sigma=est);
output out=n&j. normaltest=n probn=p;
run;
%end;
%mend;
%st

The following code does not contain dataset printing/outputting arguments (e.g. PROC EXPORT), as error has already been reported during the process of running the aforementioned code.

 

The log said: "Variable Ix (x=1, 2, 3, ..., 7) unfound." So I think that the problem lies in the CLASS statement of the UNIVARIATE procedure of the SAS macro, since the initial names of the variables are aa, ab, ac, ad, ae, af, ag, ah, ai, aj; not i1, i2, ..., i7.

But before I run the macro, I have already used call symput and %put _user_ to "link" each and every one of the variables I input in dataset a with a macro variable whose prefix is i. SAS log also did not report any error of the "linking" process. The log was displayed as follows:

14 %put _user_;
GLOBAL I1 aa
GLOBAL I10  
GLOBAL I100
...  
GLOBAL I2 ab
...
GLOBAL I3 ac
...
/*Some of the logs were replaced with "..." manually for the sake of brevity*/

So, what is wrong with my code? How can I use SAS macro to achieve my goal I mentioned in the very beginning of my post?

Thank you!

1 ACCEPTED SOLUTION

Accepted Solutions
Season
Lapis Lazuli | Level 10

Hello, I attached notes and made some changes to your code around a week ago. I later tried with this code and found out that further improvement could still be made. The code I improved is as follows:

/*Type in all of the grouping variables in dataset a*/
data a;
input var$ @@;
cards;
aa ab ac ad ae af ag ah ai aj
;
run;
libname d "";/*Assign a pseudo- (or false) SAS library to store all of the dataset generated by outtable= option in PROC UNIVARIATE*/
libname n "";/*Assign another pseudo-SAS library*/
/*Run PROC UNIVARIATE repetitively using CALL EXECUTE*/
ods pdf file="";/*Type in the entire name of the pdf file between the quotation marks*/
data _null_;
set a; /*Dataset a serves as a variable list for data step*/
call execute ('proc univariate data=C outtable=');/*The user can use the original dataset (dataset c) here, as c contained all of the information except the variable list, which is not needed here. In addition, I ran the code successfully with dataset c*/
call execute ('d'||var);/*CALL EXECUTE stores all of the datasets generated from outtable=option into the temporary SAS library (i.e. Work), with a prefix of "d" (e.g. daa, dab); yet the CALL EXECUTE argument itself treats what is does as putting the datasets in a SAS library (e.g. d.aa), which is not the case. That's why I call the SAS library ("d") a "pseudo-SAS library" in advance*/
call execute ('freq normal;');
call execute ('class '||var||';');/*Don't forget to type in the space after 'class'*/
call execute ('var k; ppplot; cdfplot; qqplot/normal(mu=est sigma=est); output out=');
call execute ('n'||var);
call execute ('normaltest=n probn=p; run;');
run;
ods pdf close;

I tried with the code I improved a week ago and found out that a redundant dataset named "n" in the temporary dataset (Work) after I ran the code. I opened the dataset and found out that it did contain the statistic of the normality test as well as the P-value of it. Yet it did not contain a grouping variable. I think this was caused by missing values of the variable var in the combined dataset (dataset b), as the number of observations in the original dataset significant outnumber that of grouping variables, leaving the cells that follow the last variable name in var nothing to fill in.

So I tried directly using dataset a in the SET statement of the data step, thus making dataset a in place of b being the dataset the serve as the variable list dataset. It worked perfectly, as it not only produced the results I wished to have, but also avoided the problem of producing the redundant datasets named n and b. The dataset named "normaltest" also disappeared, so I deleted the description about that in my code. In short, it is a more concise code. I hope this code will benefit those who are confronted with the same problem I encountered a week ago, liberating them from repetitive "copy-paste" work.

View solution in original post

17 REPLIES 17
ballardw
Super User

There is a serious logic issue with this bit:

data b;
set c a;
merge c a;
run;

I am not sure what you were attempting to do with SET and MERGE on the same statement and not even needed. There is no need to combine the two set, assuming that C actually has values for variables whose names are represented by Var in set A.

 

If you have a variable with the names such as A you can use a data step and Call Execute to create the calls.

Untested example as we don't have your C data set (and B would likely be a terrible mess to use);

data _null_;
  set a;
  /* note using set C for data*/
  call execute ('proc univariate data=C outtable=');
  call execute ('d '||var);
  call execute ('freq normal;');
  call execcute ('class '||var||';');
  call execute ('var k;
     ppplot;
     cdfplot;
     qqplot/normal(mu=est sigma=est);
     output out=');
  call execute ('n_'||var);
  call execute ('normaltest=n probn=p; run;');
run;

Call execute places code into an execution buffer that starts running after the data set finishes. You can see the "boilerplate" part of your code in the quotes and the bits that use VAR to create to create data set names that start with d_ and n_ and the class variable appended instead of not very clear numbered but that could be done if you use the _n_ instead in the output names. If you want to use _n_ you need to take a bit more control to create a string that is doesn't have spaces in it as "d||_n_" default will have a bunch of spaces between d and the number so would use cats('d',_n_) or cats('n',_n_).

Do pay attention to where the semicolons that end statements in the generated code are as opposed to the ones for the Call Execute statement, they tend to be lost easily.

 

Another option is to use code similar to this to write text lines into a file with the .SAS extension and then after verifying that the lines wrote correctly to use %include to execute the generated code. This has the advantage of documenting what was executed and have it rerun later if needed.

 

Proc print or other procedures could be set up using the same data step that generates the proc univariate calls.

 

When you have an ERROR then you should turn on options MPRINT to get more information in the log about the generated code. If you cannot determine the issue then copy that entire log and paste it into a text box on the forum so we have all the information.

Season
Lapis Lazuli | Level 10

Thank you for your detailed reply and suggestion! I admit that merging datasets b and c could be unnecessary, and that I did not look up SAS Help to figure out the correct usage of MERGE statement before I posted my question. I did this because as I have stated in my original post, my method failed. I didn't know if that was caused by generating macro variables from a dataset in which only variable names and nothing else was included (i.e. dataset a). So, I merged the datasets to exclude the possibility of that circumstance.

The second reason why I did this was that: by merging the two datasets, I get a dataset (b) with the variable names that I wish to use as grouping factors when it comes to analyzing variable k appended in the last column of the combined dataset. In this way, I can use the CALL SYMPUT routine to generate macro variables. I read the part of CALL SYMPUT of SAS Help carefully before I posted my question online and found that this argument is suitable for producing multiple macro variables at one time, with the variables aligned in a column. Now that each and every of the grouping variables themselves were a column in the original dataset (dataset c), I typed the variable names on my own and appended them to the original dataset so that I can make sure that CALL SYMPUT works successfully (please refer to the example of generating macro variables POS1, POS2 and POS3 in CALL SYMPUT routine, if you need).

Currently, I have one question regarding the code you composed: the dataset name of your code in the first sentence is "_null_", I previously learnt from my teacher that this name is not suitable for being a dataset name in SAS, for it is specially designated as being a dataset with nothing inside. Therefore, is there any special consideration regarding the nomenclature of the dataset name in your first sentence of the code?

I will take a closer look at CALL EXECUTE, since I haven't heard about that argument before.

Many thanks for your explicit explanation!

ballardw
Super User

@Season wrote:

Currently, I have one question regarding the code you composed: the dataset name of your code in the first sentence is "_null_", I previously learnt from my teacher that this name is not suitable for being a dataset name in SAS, for it is specially designated as being a dataset with nothing inside. Therefore, is there any special consideration regarding the nomenclature of the dataset name in your first sentence of the code?

I will take a closer look at CALL EXECUTE, since I haven't heard about that argument before.

Many thanks for your explicit explanation!


The purpose of the data step in the Data _null_ is to generate code that is submitted by Call execute. No need to create a data set.

There are many uses of data _null_ to allow use of data step functions without creating data. One of them is creating macro variables where the code using the macro processor gets complex and ugly. Another is to write output text files with FILE and PUT statements.

Season
Lapis Lazuli | Level 10

Thank you for the tip you offer regarding the usage of _null_ in the data step!

Season
Lapis Lazuli | Level 10

Hello, I attached notes and made some changes to your code around a week ago. I later tried with this code and found out that further improvement could still be made. The code I improved is as follows:

/*Type in all of the grouping variables in dataset a*/
data a;
input var$ @@;
cards;
aa ab ac ad ae af ag ah ai aj
;
run;
libname d "";/*Assign a pseudo- (or false) SAS library to store all of the dataset generated by outtable= option in PROC UNIVARIATE*/
libname n "";/*Assign another pseudo-SAS library*/
/*Run PROC UNIVARIATE repetitively using CALL EXECUTE*/
ods pdf file="";/*Type in the entire name of the pdf file between the quotation marks*/
data _null_;
set a; /*Dataset a serves as a variable list for data step*/
call execute ('proc univariate data=C outtable=');/*The user can use the original dataset (dataset c) here, as c contained all of the information except the variable list, which is not needed here. In addition, I ran the code successfully with dataset c*/
call execute ('d'||var);/*CALL EXECUTE stores all of the datasets generated from outtable=option into the temporary SAS library (i.e. Work), with a prefix of "d" (e.g. daa, dab); yet the CALL EXECUTE argument itself treats what is does as putting the datasets in a SAS library (e.g. d.aa), which is not the case. That's why I call the SAS library ("d") a "pseudo-SAS library" in advance*/
call execute ('freq normal;');
call execute ('class '||var||';');/*Don't forget to type in the space after 'class'*/
call execute ('var k; ppplot; cdfplot; qqplot/normal(mu=est sigma=est); output out=');
call execute ('n'||var);
call execute ('normaltest=n probn=p; run;');
run;
ods pdf close;

I tried with the code I improved a week ago and found out that a redundant dataset named "n" in the temporary dataset (Work) after I ran the code. I opened the dataset and found out that it did contain the statistic of the normality test as well as the P-value of it. Yet it did not contain a grouping variable. I think this was caused by missing values of the variable var in the combined dataset (dataset b), as the number of observations in the original dataset significant outnumber that of grouping variables, leaving the cells that follow the last variable name in var nothing to fill in.

So I tried directly using dataset a in the SET statement of the data step, thus making dataset a in place of b being the dataset the serve as the variable list dataset. It worked perfectly, as it not only produced the results I wished to have, but also avoided the problem of producing the redundant datasets named n and b. The dataset named "normaltest" also disappeared, so I deleted the description about that in my code. In short, it is a more concise code. I hope this code will benefit those who are confronted with the same problem I encountered a week ago, liberating them from repetitive "copy-paste" work.

Season
Lapis Lazuli | Level 10

Hello, I've already tried your code featured on CALL EXECUTE routine. It is absolutely fantastic! It is concise (only a few lines longer than you perform one UNIVARIATE procedure), efficient and easier to understand than SAS macro, at least for people like me, who are green hands on SAS macros.

By the way, I tried with my own code once again and used "options mprint;" argument to see what is wrong. SAS log stated in CLASS statement of PROC UNIVARIATE that the variable (e.g. I1, I2) does not exist. So I guess that the values I typed in as observations of "VAR" in dataset a were still not transformed into macro variables, despite the presence of CALL SYMPUT. I still don't know why they are not transformed, but never mind, your code works well.

Still, please allow me to attach notes to your code, so that other reader of our conversation can understand the code thoroughly. I found out simply reading the part of SAS Help regarding CALL EXECUTE is not enough for understanding that code. What is more, I will add arguments on outputting my results to a PDF file. Here is the entire code of how to achieve parts of the goal I mentioned in my original post. I will explain why I add "parts of" in the preceding sentence later.

/*Type in all of the grouping variables in dataset a*/
data a;
input var$ @@;
cards;
aa ab ac ad ae af ag ah ai aj
;
run;
data b;
merge c a;/*c is the dataset in which all of the data are stored*/
run;
libname d "";/*Assign a pseudo- (or false) SAS library to store all of the dataset generated by outtable= option in PROC UNIVARIATE*/
libname n "";/*Assign another pseudo-SAS library*/
/*Run PROC UNIVARIATE repetitively using CALL EXECUTE*/
ods pdf file="";/*Type in the entire name of the pdf file between the quotation marks*/
data _null_;
set b; /*I think that using b (the combined dataset) instead of c (the original dataset) is necessary, as you can see in the following steps, CALL EXECUTE runs PROC UNIVARIATE repetitively, with each observation of VAR being the grouping variable (variable following the CLASS statement) every time. Therefore, you should provide a list of grouping variables for CALL EXECUTE to read from*/
call execute ('proc univariate data=C outtable=');/*The user can use the original dataset (dataset c) here, as c contained all of the information except the variable list, which is not needed here. In addition, I ran the code successfully with dataset c*/
call execute ('d'||var);/*CALL EXECUTE stores all of the datasets generated from outtable=option into the temporary SAS library (i.e. Work), with a prefix of "d" (e.g. daa, dab); yet the CALL EXECUTE argument itself treats what is does as putting the datasets in a SAS library (e.g. d.aa), which is not the case. That's why I call the SAS library ("d") a "pseudo-SAS library" in advance*/
call execute ('freq normal;');
call execute ('class '||var||';');/*Don't forget to type in the space after 'class'*/
call execute ('var k; ppplot; cdfplot; qqplot/normal(mu=est sigma=est); output out=');
call execute ('n'||var);
call execute ('normaltest=n probn=p; run;');/*There is a subtle difference here compared to the situation in the outtable= option above. Despite the pseudo-SAS library nature of SAS library "n", CALL EXECUTE does create a SAS dataset entitiled "normaltest" in the file path designated in "libname n "";" statement after the entire process, yet it is an empty one. All of the noramlity test results are still in Work (temporary SAS dataset), with the prefix being "n" (e.g. naa, nbb)*/
run;
ods pdf close;

So here is my question: I wish to output all of the dataset generated by outtable= option and output out= option into an Excel file, with each dataset being a sheet of the Excel. Any suggestion?

Thank you very much!

PaigeMiller
Diamond | Level 26

Why do you need to know if all these different slices of your original data set produce normally distributed data? This is not making sense to me. If you are doing this to see if some sort of modeling is appropriate ... please understand that most modeling does not require independent variables OR the dependent variable to be normally distributed.

--
Paige Miller
Season
Lapis Lazuli | Level 10

In a nutshell, to conduct t-tests or analysis of variance (ANOVA).

Variable k is a continuous variable, while all of the grouping variables (e.g. aa, ab) are discrete (categorical) variables. I am building a multivariate linear regression model and is selecting potential independent variables. Therefore, I want to know that if there is a statistically significant difference between (among) the different groups. I need t-tests and ANOVA to help me figure out the problems. But normality of k in each and every group of each and every grouping variable is a prerequisite for conducting t-tests and ANOVA. So I am doing the work I described in my original post.

PaigeMiller
Diamond | Level 26

I am building a multivariate linear regression model and is selecting potential independent variables.

 

But normality of k in each and every group of each and every grouping variable is a prerequisite for conducting t-tests and ANOVA.

 

Normality of the ERRORs is a requirement, not normality of the actual data. And if you are building a multiple linear regression model, it is the normality of the ERRORs after you fit the entire model that is the requirement; not normality of the residuals against each predictor one-at-a-time.

 

The usual approach is to fit the model and then test the normality of the residuals (residuals being an estimate of the errors), not to pre-test for normality before fitting the model.

--
Paige Miller
Season
Lapis Lazuli | Level 10

Yes, you are correct when it comes to building multivariate linear regression models. It is the residual (or the independent variable, as some books on statistics have stated; but they are essentially the same, since only the independent variable and the residual are random variables, while all of the independent variables do not have to be random variables).

But I am working on the process of selecting potential variables prior to model building, which is also called as "univariate analysis" in epidemiology. I need to test for statistical significance of the differences of each and every variable one by one.

It is true that whether or not univariate analysis is a must before regression model building (also called as "multivariate analysis") remains controversial. When I was a student, different teachers gave different opinions on that topic: some say that this process is unnecessary, since the variables that have been proven to be statistically significant in univariate analysis may become insignificant in multivariate analysis, and vice versa; while some say that this is a must, without citing any reason.

For me, I choose to perform the so-called "univariate analysis" for the following reasons: (1) the teachers who suggested doing that process outnumber those who were against doing that process; (2) this process is essentially a model selection process beforehand. My data contains some 400 variables. I cannot simply put all of the variables into my regression model and let the P-values of the model generated after the model selection process help me pick out the variables I put in the regression model, since I do not have that many sample required to make each and every P-value "believable" (in a statistical language, I cannot guarantee the power of the test of significance of each independent variable; not to mention that collinearity may also have an impact on P-values). Variable selection based totally on professional knowledge is also not pragmatic, since there may be surprises (I have found several statistically significant variables that I deemed as insignificant based on my professional knowledge before I start the entire data analysis process). Therefore, performing univariate analysis by methods including t-tests and ANOVA is necessary for me.

So, in one word, I need to conduct t-tests and ANOVA, so testing for normality is necessary.

PaigeMiller
Diamond | Level 26

Ok, then I disagree with your reasons. You are selecting variables for inclusion in the model based upon their distribution, while I would select variables for inclusion in the model based upon whether or not they are good predictors. Also: https://blogs.sas.com/content/iml/2018/08/27/on-the-assumptions-and-misconceptions-of-linear-regress...

--
Paige Miller
Season
Lapis Lazuli | Level 10

Thank you for your information you provided. But I am afraid that you still had not get what I mean.

I use t-test or nonparametric tests like Wilcoxon to conduct intergroup comparison of the variable I am interested in, namely variable k. Normality only helps me decide which type of test I should choose, not whether or not I should include the candidate variable as the independent variable in the regression model.

Season
Lapis Lazuli | Level 10

I would also like to make some comments on Rick's blog you have mentioned. Rick mentioned that he didn't know where the misconception that the variables (including the independent variables and the dependent variables) has to be normally distributed came from. Neither do I. Still, the following two statements are essentially the same: (1) the residual follows a normal distribution, (2) the dependent variable follows a normal distribution.

As I had stated earlier, the independent variables do not have to be random variables, as you can assign the values of the independent variables, which means that the values of the independent variables are not always results of random experiments. Therefore, the independent variables can be deemed as constants. Now that the residual follows a normal distribution, the dependent variable follows a normal distribution as well. So, in theory, when expressed as a binary outcome (follow or not follow), the normality testing results of the dependent variable and the residual should be identical.

However, it is plausible that in practice, the results may be different, as errors may have an impact. I use the word "plausible" because I never test the dependent variable in my data analysis, so I do not know that if such a circumstance could really happen.

Whatever my speculation is correct, one thing is certain: the residual follows a normal distribution with a mean of 0 is the hypothesis of multivariate linear regression, whereas the dependent variable follows a normal distribution is a theorem. It should be noted that statistics is a discipline dealing with data, with variation being the everlasting topic. Therefore, definiteness that something holds true only on few occasions in statistics. This is a major difference between statistics and mathematics. That is why I speculate that in practice, the normality results of the dependent variable and the residual could be different. If that is really the case, normality testing on the residual should be trusted, as the residual, not the dependent variable follows a normal distribution with a mean of 0 is the hypothesis of multivariate linear regression is the hypothesis of multivariate linear regression.

Finally, I would like to point out that normality testing results (like Shapiro-Wilk and Kolmogorov-Smirnov tests) may not be reliable in large samples, as these tests tend to produce small P-values. Therefore, judgement of normality should not be solely based on the results of the normality tests. Statistical plots like histogram, P-P plot and Q-Q plot provide valuable information when the sample size is large. The information they provide deserve a larger weight in the judgement of normality when sample size is large.

PaigeMiller
Diamond | Level 26

the following two statements are essentially the same: (1) the residual follows a normal distribution, (2) the dependent variable follows a normal distribution.

 

Not true, and Rick's blog gives examples.

--
Paige Miller

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 17 replies
  • 2812 views
  • 7 likes
  • 3 in conversation