Re: How to run the same code for different data?

kellychan84 · Posted 08-25-2021 02:30 PM

Hello, if I have several hundred data need to run the same code, how could I analyze them without repeating writing the same codes? For example, I have family 1 to family 800. Thank you very much in advance for your help!

proc glimmix data= cecal_family_taxonomy; 
  class treatment block;
  model Family1P = treatment; 
  random block; 
  lsmeans treatment/tdiff lines;
  ods output LSMeans = myparmdataset;
  output out=second predicted=pred residual=resid residual(noblup)=mresid student=sresid student(noblup)=smresid;
  title "1-Taxonomy Data Cecal Family 1 Percentage ANOVA Results";
run;
proc print data=myparmdataset;
format estimate D8.6
       stderr D8.6;
run;
proc print data=second;
run;
proc sgplot data=second;
  scatter y=smresid x=pred;
  refline 0;
run;
proc sgplot data=second;
  scatter y=smresid x=treatment;
  refline 0;
run;
proc sgplot data=second;
  vbox smresid/group=treatment datalabel;
run;
proc sgscatter data=second;
  plot sresid*(pred treatment block);
run;
proc univariate data=second normal plot;
  var sresid;
  histogram sresid / normal kernel;
run;

SteveDenham · Posted 08-25-2021 02:55 PM

Use a BY statement on a sorted dataset. If family takes on values from 1 to 800, then just add

by family;

after you invoke PROC GLIMMIX.

SteveDenham

PaigeMiller · Posted 08-25-2021 02:58 PM

Can we get a clarification?

You say: "if I have several hundred data..."

Does this mean several hundred data sets, or several hundred variables in one data set, or one variable but with an ID indicating several hundred families? Or something else?

Regardless of your answer, what are you going to do with 800 model fits, and 800 plots (times several different versions of the plots)? That makes me shiver. OF course you can ask the computer to produce all of that, but the problem is really how can a human look at and absorb all of that information?

Reminds me of a quote from Gomez Addams as he is about to saw his wife Morticia in half. Morticia asks: "Gomez, do you really know how to saw a woman in half?" And Gomez replies, "Of course, its putting her back together that is the problem".

You might want to instead of plotting put all the tests for normality (and any other tests) into one large data set, but even then the question remains ... and then what?

--
Paige Miller

kellychan84 · Posted 08-25-2021 03:45 PM

@PaigeMiller It is several hundred variables in one data set.

Reeza · Posted 08-25-2021 04:07 PM

This is likely a good quick read for you as well.
https://blogs.sas.com/content/iml/2017/02/13/run-1000-regressions.html

Reeza · Posted 08-25-2021 03:16 PM

Few different ways to get at this, usually involving either BY group processing or macros.

@SteveDenham has suggested the BY option, I'll show the macro solutions.

One big difference is how you're interpret or parse the output. For the BY you'll have each step together but not all the results for one model, which may be a better formatted output. In general, though I agree with @PaigeMiller, this isn't particularily a practical approach, especially the multiple prints and plots.

UCLA introductory tutorial on macro variables and macros
https://stats.idre.ucla.edu/sas/seminars/sas-macros-introduction/

Tutorial on converting a working program to a macro <- this will illustrate how to run it many times for different data sets
This method is pretty robust and helps prevent errors and makes it much easier to debug your code. Obviously biased, because I wrote it 🙂 https://github.com/statgeek/SAS-Tutorials/blob/master/Turning%20a%20program%20into%20a%20macro.md

Examples of common macro usage

https://communities.sas.com/t5/SAS-Communities-Library/SAS-9-4-Macro-Language-Reference-Has-a-New-Ap...

@kellychan84 wrote:

Hello, if I have several hundred data need to run the same code, how could I analyze them without repeating writing the same codes? For example, I have family 1 to family 800. Thank you very much in advance for your help!

proc glimmix data= cecal_family_taxonomy; 
  class treatment block;
  model Family1P = treatment; 
  random block; 
  lsmeans treatment/tdiff lines;
  ods output LSMeans = myparmdataset;
  output out=second predicted=pred residual=resid residual(noblup)=mresid student=sresid student(noblup)=smresid;
  title "1-Taxonomy Data Cecal Family 1 Percentage ANOVA Results";
run;
proc print data=myparmdataset;
format estimate D8.6
       stderr D8.6;
run;
proc print data=second;
run;
proc sgplot data=second;
  scatter y=smresid x=pred;
  refline 0;
run;
proc sgplot data=second;
  scatter y=smresid x=treatment;
  refline 0;
run;
proc sgplot data=second;
  vbox smresid/group=treatment datalabel;
run;
proc sgscatter data=second;
  plot sresid*(pred treatment block);
run;
proc univariate data=second normal plot;
  var sresid;
  histogram sresid / normal kernel;
run;

SteveDenham · Posted 08-25-2021 03:25 PM

It would be pretty easy to loop this up in a macro - you would only need a where=(family=&fam_no) sort of addition to the data=cecal_family_taxonomy statement.

However, I worry about something else entirely. In the title, you refer to Percentage ANOVA results. Is your dependent variable a percentage? If so, looking at all of the residuals, etc. could be dispensed with if you used a generalized mixed model, and specified that your response variable was binomial (assuming it is counts of family X/total count). PROC GLIMMIX is specifically designed for this. The diagnostic plots generated with the plots=all option should enable you to see if there was any significant overdispersion (extra variability) in your results. As we have pointed out before, the assumptions of homogeneous variance, independence and normality of residuals is not critical to mixed models, and even less important to generalized mixed models.

SteveDenham

kellychan84 · Posted 08-25-2021 04:29 PM

Hello @SteveDenham You are right, my data are presented as percentages (counts of family X/total count). But I don't have any experience with the BY statement and macro statement. If possible, do you mind giving me an example of how to insert relevant codes into my procedure? My supervisor always wants to see my homogeneity test results before the comparisons of treatments. That's why I keep this several lines of codes here.

PaigeMiller · Posted 08-25-2021 05:32 PM

I know you're getting lots of advice, but in this case I would recommend (as @Reeza did) the link to a method of running 1000 regressions.

This also accomplishes a conversion of a wide data set to a long data set, and so this is almost always a good thing to do (see Maxim 19). If you created this data set, next time create a long data set which is superior to a wide data set; and even if you received the data this way (i.e. you didn't create it), converting to wide to long is almost always a very good thing to do.

But you seem to keep ignoring other advice: what are you going to do after you saw the woman in half? You really ought to think about that BEFORE you perform all of these analyses and generate all of the outputs and plots.

--
Paige Miller

kellychan84 · Posted 08-25-2021 05:51 PM

Hello @PaigeMiller. Yes, there are many good advice and I really appreciate it! But I still have difficulty in adapting those advice into my own codes. Need some time to digest them. Thank you very much again for your kind help.

PaigeMiller · Posted 08-26-2021 07:47 AM

Let me ask a simpler question. Suppose you have only 2 families. You can fit the two regressions and do all the plots, there's no real programming difficulty. And then you have two regressions and two sets of plots (one for each family), then what?

Are you going to do some sort of statistical test? What test?

--
Paige Miller

kellychan84 · Posted 08-26-2021 08:42 AM

Hello @PaigeMiller, the codes I attached at the beginning include the tests I need to run. Basically I need to have homogeneity test results, then I do an Lsmean comparison (F-test) between my two treatment groups. But now I have many families that I need to repeat the same. I know there is now BY or macro statement I can choose, but so far, I don't know how to write into my codes to be specific.

proc glimmix data= cecal_family_taxonomy; 
  class treatment block;
  model Family1P = treatment; /*here I have to repeat family2P, 3P...until 800P*/
  random block; 
  random _residual_/group=treatment; 
  lsmeans treatment/tdiff lines;
  covtest homogeneity;
  ods output LSMeans = myparmdataset;
  output out=second predicted=pred residual=resid residual(noblup)=mresid student=sresid student(noblup)=smresid;
  title "1-Taxonomy Data Cecal Family 1 Percentage ANOVA Results";
run;
proc print data=myparmdataset;
format estimate D8.6
       stderr D8.6;
run;
proc print data=second;
run;
proc sgplot data=second;
  scatter y=smresid x=pred;
  refline 0;
run;
proc sgplot data=second;
  scatter y=smresid x=treatment;
  refline 0;
run;
proc sgplot data=second;
  vbox smresid/group=treatment datalabel;
run;
proc sgscatter data=second;
  plot sresid*(pred treatment block);
run;
proc univariate data=second normal plot;
  var sresid;
  histogram sresid / normal kernel;
run;

PaigeMiller · Posted 08-26-2021 08:45 AM

I guess none of this answers my questions, which are not about code, but about how you are going to use the output.

With regards to the code, the link from @Reeza to an article explaining how to do 1000 regressions gives you a clear example of code on how to do this.

--
Paige Miller

kellychan84 · Posted 08-26-2021 08:57 AM

My analyses are about gut microbiome. Even though there are large amount of outputs, I have to look through them one by one to see how my positive treatment has an impact on specific bacteria that is under family or species level compared to control group. Even there are over 800 family, I might only find let is say 20 something that have significant differences. Don't know yet.

SteveDenham · Posted 08-26-2021 09:26 AM

I think I may have been writing code when I should have been asking design questions. Let me know if this is how the study was designed:

There are N subjects, grouped into M blocks. Subjects within block receive one of two treatments - treated or control. A cecal sample is taken from each subject, and microbiological survey is done on the sample. There could be up to 800 different species/families counted, although it is likely that not all are present in every sample. The counts per family are converted to percentage of total counts. So for any given species/family, you would have N/2 observations in the treated group and N/2 observations in the control group..

Is that a fair summary of the experimental design? If so, then we need to account for the following in the analysis: Non-independence of the family counts, as they must sum to the total count. Repeated measurement on the cecal samples - counts for each family. A non-normal response variable - either family count/total count (binomial) or family count with an offset of total count (generalized Poisson). I would think that since your denominator is on the order of 10^9 per ml, those two distributions are asymptotically the same. There are up to 800 comparisons between treatment and control, so there is a multiple comparison issue to be handled as well.

Am I missing anything critical in the design?

SteveDenham

Ready to join fellow brilliant minds for the SAS Hackathon?