How to run the same code for different data?

kellychan84 — Wed, 25 Aug 2021 18:30:01 GMT

Hello, if I have several hundred data that need to run through the same code, how can I analyze them without writing the same code over and over? For example, I have family 1 to family 800. Thank you very much in advance for your help!

proc glimmix data= cecal_family_taxonomy; 
  class treatment block;
  model Family1P = treatment; 
  random block; 
  lsmeans treatment/tdiff lines;
  ods output LSMeans = myparmdataset;
  output out=second predicted=pred residual=resid residual(noblup)=mresid student=sresid student(noblup)=smresid;
  title "1-Taxonomy Data Cecal Family 1 Percentage ANOVA Results";
run;
proc print data=myparmdataset;
format estimate D8.6
       stderr D8.6;
run;
proc print data=second;
run;
proc sgplot data=second;
  scatter y=smresid x=pred;
  refline 0;
run;
proc sgplot data=second;
  scatter y=smresid x=treatment;
  refline 0;
run;
proc sgplot data=second;
  vbox smresid/group=treatment datalabel;
run;
proc sgscatter data=second;
  plot sresid*(pred treatment block);
run;
proc univariate data=second normal plot;
  var sresid;
  histogram sresid / normal kernel;
run;

Re: How to run the same code for different data?

SteveDenham — Wed, 25 Aug 2021 18:55:33 GMT

Use a BY statement on a sorted dataset. If family takes on values from 1 to 800, then just add

by family;

after you invoke PROC GLIMMIX.
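
A minimal sketch of what that could look like, assuming the data are in long form with one row per pig per family. The data set name cecal_long and the variables family and percent are placeholders, not names from the original post:

```sas
/* Sketch only: cecal_long, family and percent are assumed names. */
/* BY processing requires the data to be sorted by the BY variable. */
proc sort data=cecal_long;
  by family;
run;

proc glimmix data=cecal_long;
  by family;                        /* one model fit per family */
  class treatment block;
  model percent = treatment;
  random block;
  lsmeans treatment / tdiff lines;
  /* All BY groups land in one data set, with family as a column */
  ods output LSMeans=lsm_by_family;
run;
```

If the data are stored wide (one column per family), they would need to be reshaped to long form before this works.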

SteveDenham

Re: How to run the same code for different data?

PaigeMiller — Wed, 25 Aug 2021 19:01:44 GMT

Can we get a clarification?

You say: "if I have several hundred data..."

Does this mean several hundred data sets, or several hundred variables in one data set, or one variable but with an ID indicating several hundred families? Or something else?

Regardless of your answer, what are you going to do with 800 model fits, and 800 plots (times several different versions of the plots)? That makes me shiver. Of course you can ask the computer to produce all of that, but the problem is really how can a human look at and absorb all of that information?

Reminds me of a quote from Gomez Addams as he is about to saw his wife Morticia in half. Morticia asks: "Gomez, do you really know how to saw a woman in half?" And Gomez replies, "Of course, it's putting her back together that is the problem".

Instead of plotting, you might want to put all the tests for normality (and any other tests) into one large data set, but even then the question remains ... and then what?
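
One way to collect the normality tests into a single data set is an ODS OUTPUT statement in PROC UNIVARIATE. This is a sketch that assumes the residual data set second was produced by a GLIMMIX run with BY family, so it contains a family variable (an assumption, not something shown in the original code); normality_all is a made-up name:

```sas
/* Assumes SECOND is sorted by an assumed variable FAMILY and holds
   the studentized residual sresid from the original OUTPUT statement. */
ods exclude all;                    /* suppress printed output */
proc univariate data=second normal;
  by family;
  var sresid;
  /* Capture the normality tests for every BY group in one data set */
  ods output TestsForNormality=normality_all;
run;
ods exclude none;                   /* restore printed output */

/* Inspect one test per family; check the column names with
   PROC CONTENTS if they differ in your SAS release. */
proc print data=normality_all;
  where Test = "Shapiro-Wilk";
run;
```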

Re: How to run the same code for different data?

Reeza — Wed, 25 Aug 2021 19:16:01 GMT

Few different ways to get at this, usually involving either BY group processing or macros.

@SteveDenham has suggested the BY option, I'll show the macro solutions.

One big difference is how you'll interpret or parse the output. With BY, you'll have the output of each step together but not all the results for one model together, which may be a better-formatted output. In general, though, I agree with @PaigeMiller: this isn't a particularly practical approach, especially the multiple prints and plots.

UCLA introductory tutorial on macro variables and macros
https://stats.idre.ucla.edu/sas/seminars/sas-macros-introduction/

Tutorial on converting a working program to a macro <- this will illustrate how to run it many times for different data sets
This method is pretty robust and helps prevent errors and makes it much easier to debug your code. Obviously biased, because I wrote it 🙂 https://github.com/statgeek/SAS-Tutorials/blob/master/Turning%20a%20program%20into%20a%20macro.md

Examples of common macro usage

https://communities.sas.com/t5/SAS-Communities-Library/SAS-9-4-Macro-Language-Reference-Has-a-New-Appendix/ta-p/291716

Re: How to run the same code for different data?

SteveDenham — Wed, 25 Aug 2021 19:25:58 GMT

It would be pretty easy to loop this up in a macro - you would only need a where=(family=&fam_no) sort of addition to the data=cecal_family_taxonomy statement.
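
A rough sketch of that idea, assuming a long data set with a numeric family variable. The names cecal_long, percent and one_family are hypothetical and do not come from the original code:

```sas
/* Hypothetical macro: fits one family from an assumed long data set. */
%macro one_family(fam_no);
  title "Cecal Family &fam_no Percentage Results";
  /* WHERE= data set option restricts the fit to a single family */
  proc glimmix data=cecal_long(where=(family=&fam_no));
    class treatment block;
    model percent = treatment;
    random block;
    lsmeans treatment / tdiff lines;
    ods output LSMeans=lsm_&fam_no;   /* one output data set per family */
  run;
  title;                              /* clear the title */
%mend one_family;

/* Example calls for the first two families */
%one_family(1)
%one_family(2)
```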

However, I worry about something else entirely. In the title, you refer to Percentage ANOVA results. Is your dependent variable a percentage? If so, looking at all of the residuals, etc. could be dispensed with if you used a generalized mixed model, and specified that your response variable was binomial (assuming it is counts of family X/total count). PROC GLIMMIX is specifically designed for this. The diagnostic plots generated with the plots=all option should enable you to see if there was any significant overdispersion (extra variability) in your results. As we have pointed out before, the assumptions of homogeneous variance, independence and normality of residuals are not critical to mixed models, and even less important to generalized mixed models.
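
A minimal sketch of the binomial version, using the events/trials syntax of the MODEL statement. It assumes the data set holds the raw count for the family in Family1 and the total count in TotalR; those roles are guesses from the variable names and should be checked against the data:

```sas
/* Sketch only: Family1 (family count) and TotalR (total count) are
   assumed to be raw counts; verify before use. */
proc glimmix data=cecal_family_taxonomy plots=all;
  class treatment block;
  /* events/trials response with a binomial distribution */
  model Family1/TotalR = treatment / dist=binomial link=logit;
  random block;
  /* ILINK reports the estimates on the proportion scale */
  lsmeans treatment / ilink diff;
run;
```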

SteveDenham

Re: How to run the same code for different data?

kellychan84 — Wed, 25 Aug 2021 19:45:27 GMT

@PaigeMiller It is several hundred variables in one data set.

Re: How to run the same code for different data?

Reeza — Wed, 25 Aug 2021 20:07:05 GMT

This is likely a good quick read for you as well.
https://blogs.sas.com/content/iml/2017/02/13/run-1000-regressions.html

Re: How to run the same code for different data?

kellychan84 — Wed, 25 Aug 2021 20:29:23 GMT

Hello @SteveDenham, you are right, my data are presented as percentages (counts of family X/total count). But I don't have any experience with the BY statement or with macros. If possible, would you mind giving me an example of how to insert the relevant code into my procedure? My supervisor always wants to see my homogeneity test results before the comparisons of treatments. That's why I keep these several lines of code here.

Re: How to run the same code for different data?

PaigeMiller — Wed, 25 Aug 2021 21:35:39 GMT

I know you're getting lots of advice, but in this case I would recommend (as @Reeza did) the link to a method of running 1000 regressions.

This also accomplishes a conversion of a wide data set to a long data set, and so this is almost always a good thing to do (see Maxim 19). If you created this data set, next time create a long data set, which is superior to a wide data set; and even if you received the data this way (i.e. you didn't create it), converting from wide to long is almost always a very good thing to do.
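
As a sketch of the wide-to-long conversion, assuming the percentage columns are named Family1P, Family2P, and so on, and that each row is one pig. The output names wide_sorted, long_all, cecal_long, varname, family and percent are made up for illustration:

```sas
/* Sort so the identifying variables can be used in a BY statement. */
proc sort data=cecal_family_taxonomy out=wide_sorted;
  by treatment block pen pig_number;
run;

/* Transpose every variable whose name starts with Family. */
proc transpose data=wide_sorted out=long_all name=varname;
  by treatment block pen pig_number;   /* assumed unique per row */
  var Family: ;
run;

data cecal_long;
  set long_all(rename=(col1=percent));
  /* Keep only the percentage columns, whose names end in P */
  if upcase(substr(varname, length(varname), 1)) = 'P';
  /* Extract the family number, e.g. Family12P -> 12 */
  family = input(compress(varname, , 'kd'), 8.);
  drop varname;
run;

proc sort data=cecal_long;
  by family;                           /* ready for BY-group analysis */
run;
```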

But you seem to keep ignoring other advice: what are you going to do after you saw the woman in half? You really ought to think about that BEFORE you perform all of these analyses and generate all of the outputs and plots.

Re: How to run the same code for different data?

kellychan84 — Wed, 25 Aug 2021 21:51:04 GMT

Hello @PaigeMiller. Yes, there is a lot of good advice and I really appreciate it! But I still have difficulty adapting that advice to my own code. I need some time to digest it. Thank you very much again for your kind help.

Re: How to run the same code for different data?

PaigeMiller — Thu, 26 Aug 2021 11:47:49 GMT

Let me ask a simpler question. Suppose you have only 2 families. You can fit the two regressions and do all the plots, there's no real programming difficulty. And then you have two regressions and two sets of plots (one for each family), then what?

Are you going to do some sort of statistical test? What test?

Re: How to run the same code for different data?

kellychan84 — Thu, 26 Aug 2021 12:42:24 GMT

Hello @PaigeMiller, the code I attached at the beginning includes the tests I need to run. Basically I need to have homogeneity test results, then I do an LSmeans comparison (F-test) between my two treatment groups. But now I have many families for which I need to repeat the same steps. I know I can now choose a BY statement or a macro, but so far I don't know specifically how to write either into my code.

proc glimmix data= cecal_family_taxonomy; 
  class treatment block;
  model Family1P = treatment; /*here I have to repeat family2P, 3P...until 800P*/
  random block; 
  random _residual_/group=treatment; 
  lsmeans treatment/tdiff lines;
  covtest homogeneity;
  ods output LSMeans = myparmdataset;
  output out=second predicted=pred residual=resid residual(noblup)=mresid student=sresid student(noblup)=smresid;
  title "1-Taxonomy Data Cecal Family 1 Percentage ANOVA Results";
run;
proc print data=myparmdataset;
format estimate D8.6
       stderr D8.6;
run;
proc print data=second;
run;
proc sgplot data=second;
  scatter y=smresid x=pred;
  refline 0;
run;
proc sgplot data=second;
  scatter y=smresid x=treatment;
  refline 0;
run;
proc sgplot data=second;
  vbox smresid/group=treatment datalabel;
run;
proc sgscatter data=second;
  plot sresid*(pred treatment block);
run;
proc univariate data=second normal plot;
  var sresid;
  histogram sresid / normal kernel;
run;

Re: How to run the same code for different data?

PaigeMiller — Thu, 26 Aug 2021 12:45:06 GMT

I guess none of this answers my questions, which are not about code, but about how you are going to use the output.

With regards to the code, the link from @Reeza to an article explaining how to do 1000 regressions gives you a clear example of code on how to do this.

Re: How to run the same code for different data?

kellychan84 — Thu, 26 Aug 2021 12:57:33 GMT

My analyses are about the gut microbiome. Even though there is a large amount of output, I have to look through it one by one to see how my positive treatment affects specific bacteria at the family or species level compared to the control group. Even though there are over 800 families, I might only find, let's say, 20 or so that have significant differences. I don't know yet.

Re: How to run the same code for different data?

SteveDenham — Thu, 26 Aug 2021 13:26:22 GMT

I think I may have been writing code when I should have been asking design questions. Let me know if this is how the study was designed:

There are N subjects, grouped into M blocks. Subjects within a block receive one of two treatments - treated or control. A cecal sample is taken from each subject, and a microbiological survey is done on the sample. There could be up to 800 different species/families counted, although it is likely that not all are present in every sample. The counts per family are converted to a percentage of total counts. So for any given species/family, you would have N/2 observations in the treated group and N/2 observations in the control group.

Is that a fair summary of the experimental design? If so, then we need to account for the following in the analysis: (1) Non-independence of the family counts, as they must sum to the total count. (2) Repeated measurement on the cecal samples - counts for each family. (3) A non-normal response variable - either family count/total count (binomial) or family count with an offset of total count (generalized Poisson). I would think that since your denominator is on the order of 10^9 per ml, those two distributions are asymptotically the same. (4) There are up to 800 comparisons between treatment and control, so there is a multiple comparison issue to be handled as well.
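
For the multiple comparison issue only, one option is to collect the raw p-values from all the treatment comparisons and adjust them with PROC MULTTEST. This sketch assumes a data set diffs_by_family (a made-up name) captured from a BY-family GLIMMIX run, with the raw p-value in the Probt column. False discovery rate is used purely as an example adjustment, not as a recommendation from this thread:

```sas
/* Assumes diffs_by_family was created with
     ods output Diffs=diffs_by_family;
   in a GLIMMIX run that used BY family and
     lsmeans treatment / diff;
   so it has one row per family with the raw p-value in Probt. */
proc multtest inpvalues(Probt)=diffs_by_family fdr;
run;
```

This does not address the non-independence or the repeated measurement points above.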

Am I missing anything critical in the design?

SteveDenham

Re: How to run the same code for different data?

PaigeMiller — Thu, 26 Aug 2021 13:42:19 GMT

@kellychan84 wrote:

My analyses are about the gut microbiome. Even though there is a large amount of output, I have to look through it one by one to see how my positive treatment affects specific bacteria at the family or species level compared to the control group. Even though there are over 800 families, I might only find, let's say, 20 or so that have significant differences. I don't know yet.

If you are comparing 800 different families to a control group, you might want to include family in the model itself instead of running 800 separate models, one family at a time.

Re: How to run the same code for different data?

kellychan84 — Thu, 26 Aug 2021 15:12:57 GMT

Hello @SteveDenham, your description of my study design is completely correct. I think my experimental design is very simple: two treatments (1 control and 1 antibiotic), with animals randomly assigned to the two treatments according to an RCBD. Both cecal and fecal samples are collected and analyzed for microbiota at the kingdom, phylum, class, order, family, genus, and species levels. The family, genus, and species levels have so many bacteria that I have to compare them one by one between the two treatments. That is the tricky thing.

I converted the data to percentages in Excel, and they can be uploaded to SAS. I also include my data input code here for your reference.

data cecal_family_taxonomy;
  length treatment $20;
  Infile "/home/u39233094/sasuser.v94/Thesis/CSV file/5-family cecal taxonomy for SAS.csv" dlm="," firstobs=5;
  input diet$ treatment$ block pen pig_number Family1 Family1P Family2 Family2P Family3 Family3P Family4 Family4P....Family121P Family122 Family122P TotalR TotalP;     /*species have over 800 bacteria that have to be compared one by one*/
run;

Re: How to run the same code for different data?

MaryA_Marion — Thu, 26 Aug 2021 15:26:51 GMT

Prepare a macro:
%macro macroname(parameters);
  /* ... your code, with the parameters referred to as &param_1, &param_2, etc. */
%mend macroname;
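
As a concrete sketch of that pattern applied to the wide data set in this thread, a macro can loop over the family number and build each response variable name (Family1P, Family2P, ...). The macro name fit_family, its parameters and the lsm_ output names are made up for illustration, and only the model step is shown:

```sas
/* Hypothetical macro: one GLIMMIX fit per FamilyNP column. */
%macro fit_family(first=1, last=122);
  %do i = &first %to &last;
    title "Taxonomy Data Cecal Family &i Percentage ANOVA Results";
    proc glimmix data=cecal_family_taxonomy;
      class treatment block;
      /* The period ends the macro variable name: Family&i.P -> Family1P */
      model Family&i.P = treatment;
      random block;
      lsmeans treatment / tdiff lines;
      ods output LSMeans=lsm_&i;      /* one LS-means data set per family */
    run;
  %end;
  title;                              /* clear the title */
%mend fit_family;

/* Try it on the first three families before running all of them */
%fit_family(first=1, last=3)
```

Running a small range first makes it easier to check the log before generating hundreds of fits.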

Re: How to run the same code for different data?

kellychan84 — Thu, 26 Aug 2021 15:31:37 GMT

I thought GLIMMIX could not include different families in one model. Am I wrong?

Re: How to run the same code for different data?

SteveDenham — Fri, 27 Aug 2021 17:06:23 GMT

Yes, GLIMMIX can handle multiple levels of family. There will be missing values, as not every pig will have every taxon group, but that is the sort of thing mixed models are good at handling. However, your main effect means will not be estimable, so you would look at the simple effect of treatment for each taxon group using the SLICE option of the LSMEANS statement for treatment*family.
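
A bare-bones sketch of such a model, assuming a long data set with one row per pig per family. The names cecal_long, family and percent are placeholders, and the covariance structure for the repeated measurements on each pig is deliberately left out:

```sas
/* Sketch only: assumed long data set; no repeated-measures
   covariance structure is specified here. */
proc glimmix data=cecal_long;
  class treatment block family;
  model percent = treatment family treatment*family;
  random block;
  /* SLICE= tests the treatment effect separately within each family */
  lsmeans treatment*family / slice=family;
run;
```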

SteveDenham