DATA Step, Macro, Functions and more

Code optimization

Accepted Solution Solved
Reply
Contributor
Posts: 45
Accepted Solution

Code optimization

[ Edited ]

Well... this may be kind of a bad question, but basically I have some code written which runs a lot of data steps, it creates tables in which I extract certain pieces of it, take it, run proc mixed, merge the results with other tables, then recombine everything into one big table. It runs correctly but the problem with this is in terms of SAS it is very inefficient. I have 600+ different variables to test through and currently I only told it to run 5. Running 5 variables already takes a good ~2 minutes. You can imagine how long it would take to run it through 600+. Is there a way or some link or some tips on how I can try to optimize my code? I've tried searching online but some of the things doesn't seem to apply to my code. http://support.sas.com/resources/papers/proceedings12/257-2012.pdf this for example... I never ran an instance where the data file I "set" is the exact same, I have a lot of data steps but I always "set" different tables. The thing is my code runs through a lot of loops, i, j, and k.

 

Edit: Ok I posted "half" of my code. I wrote another macro before this and that one runs pretty quickly, it's this 2nd one that is running really slow.

 


/* Creating a 2nd Macro which finds the Fvalue of the main effects and also does the comparisons, input name of dataset, the variables that belong in class, the variables in the model with interactions included, and total number of metabolites*/
%MACRO Comparison(data, Class, Model, totalvars);
ods graphics off;

/* looping through each metabolite, running proc mixed on the log transformed data*/
%do i = 1 %to &totalvars;
data &data.metabolite_m&i;
            set &data;
            logm&i=log(m&i);            
            keep sample_name &Class logm&i m&i;
        run;
/* Tests3 shows the F values with main effects, Diffs shows the numbers for comparisons*/
		    proc mixed data=&data.metabolite_m&i;
            class &Class;
            model logm&i=  &Model / residual;   
			lsmeans &Model / diff alpha = 0.05;
            ods output Tests3  = &data.overall&i;
			ods output diffs = &data.comparison&i;
        run;
ods output close;

/* Counting the number of effects there are */
data &data.overall&i;
set &data.overall&i nobs = total;
count+1;
MetaboliteLabel = "m&i";
totalnum = total;
run;
/*counting the number of comparisons there are*/
data &data.comparison&i;
set &data.comparison&i nobs = total;
count+1;
MetaboliteLabel = "m&i";
totalnum = total;
run;

/*creating a macro variable that represents the total number of comparison*/
data _null_;
set &Data.comparison1;
if count ne 1 then delete;
call symput('totalnumber', totalnum);
run;

/*creating a macro variable taht represents the total number of effects*/
data _null_;
set &Data.overall1;
if count ne 1 then delete;
call symput('totaleffect', totalnum);
run;

/*sorting*/
proc sort data = &data.overall&i;
by MetaboliteLabel;
run;

/*sorting*/
proc sort data = &data.comparison&i;
by MetaboliteLabel;
run;

/* looping through the number of comparisons and separating each metabolite based on their comparison*/
%do j=1 %to &totalnumber;
data DiffComparison&j._&i;
set &data.comparison&i;
drop count totalnum;
if count = &j;
run;

/*looping through each separate main effect and also separating each metabolite based on each main effect*/
%do k=1 %to &totaleffect;
data DiffEffect&k._&i;
set &data.overall&i;
keep Effect Fvalue ProbF MetaboliteLabel;
if count = &k;
run;

/* Recombining all the metabolites together with the same comparisons */
data CombinedComparisons&j;
length MetaboliteLabel $100;
set DiffComparison&j._1-DiffComparison&j._&i ;
run;

/* Recombining the metabolites together based on the main effect*/
data CombinedMainE&k;
length MetaboliteLabel $100;
set DiffEffect&k._1-DiffEffect&k._&i;
run;

/*Sorting*/
proc sort data = CombinedComparisons&j;
by MetaboliteLabel;
run;

/*Sorting*/
proc sort data = CombinedMainE&k;
by MetaboliteLabel;
run;

/*Sorting the label file */
proc sort data = &data.label;
by MetaboliteLabel;
run;

/*Merging the Comparison table with labels */
data Step2Table;
merge CombinedComparisons&j &data.label;
by MetaboliteLabel;
run;

/* Merging the Main effect table with labels*/
data Step3Table;
merge CombinedMainE&k &data.label;
by MetaboliteLabel;
run;

/*Sorting*/
proc sort data = Step2Table;
by Biochemical;
run;

/*Sorting*/
proc sort data = Step3Table;
by Biochemical;
run;

/*Sorting the annotation file*/
proc sort data = &data.annotation;
by Biochemical;
run;

/*combining the Comparison table with annotations*/
data CompleteComparisonTable&j;
merge Step2Table &data.annotation;
by Biochemical;
run;

/*combining the Main effect table's Fvalues with annotations*/
data CompleteMainTable&k;
merge Step3Table &data.annotation;
by Biochemical;
run;

/*sorting*/
proc sort data = CompleteMainTable&k;
by Biochemical;
run;

/*Combining the Main effect Fvalues with the Main effect means*/
data FinalMainEffectTable&k;
merge Completemaintable&k Finalmeanstablefactor&k;
by Biochemical;
run;

/* Renaming probF to raw_p for PFDR adjustment*/
data FinalMainEffectTable&k;
set FinalMainEffectTable&k;
rename probF = RAW_P;
run;

/* Renaming probT to raw_p for PFDR adjustment*/
data CompleteComparisonTable&j;
set CompleteComparisonTable&j;
rename probT = RAW_P;
run;

/*Doing the PFDR adjustment on each Comparison Table*/
proc multtest PFDR pdata = CompleteComparisonTable&j out = data.CompleteComparisonTableFDR&j;
run;

/*Doing the PFDR adjustment on each Main effect Table*/
proc multtest PFDR pdata = FinalMainEffectTable&k out = data.FinalMainEffectTableFDR&k;
run;

/*Reordering the list of variables in the comparison table to make it look nicer */
data data.CompleteComparisonTableFDR&j;
retain Biochemical Super_Pathway sub_pathway MetaboliteLabel Estimate StdErr DF tValue Alpha Lower Upper raw_p pfdr_p;
set data.CompleteComparisonTableFDR&j;
run;

/*Reordering the list of variables in the Main effect table to make it look nicer */
data data.Finalmaineffecttablefdr&k;
retain Biochemical Super_pathway sub_pathway Metabolitelabel N LowerCL UpperCL Mean StdDev StdErr Fvalue raw_P pfdr_p;
set data.Finalmaineffecttablefdr&k;
run;

/*Ending the i/j/k loops*/
%end;
%end;
%end;

/*Cleaning up the files, can comment this out if needed, to debug 
proc datasets lib = work nolist kill;
run;
quit;*/

/*Ending the macro creation*/
%mend Comparison;

/*Running through the comparison macro*/
%comparison(test, Fructose MGO, Fructose MGO Fructose*MGO, 5);

Accepted Solutions
Solution
‎12-21-2016 03:05 PM
Super User
Posts: 5,085

Re: Code optimization

Without code, we're limited to speaking in very general terms.  It sounds like there's a possibility of running through the loops many fewer times, if you were to switch to using a BY statement.  If that's feasible, it would be much faster.  A simple example of the long way:

 

data ny;

set usa;

if state='ny';

run;

proc means data=ny;

run;

data me;

set usa;

if state='me';

run;

proc means data=me;

run;

data vt;

set usa;

if state='vt';

run;

proc means data=vt;

run;

 

The shorter way just to illustrate a BY statement:

 

proc sort data=usa (where=(state in ('ny', 'me', 'vt'))) out=temp;

by state;

run;

proc means data=temp;

by state;

run;

 

A single PROC MEANS can calculate statistics for all three states.

 

Since you are working with macro loops, there might be significant rewrite of the code required.

View solution in original post


All Replies
Super User
Posts: 10,516

Re: Code optimization

One question that may give a clue is what is the nature of the 600 variables? If they are things like Score_for_client_x, Score_for_client_y, Score_for_client_z, or price_for_product_a, Price_for_product_b or even Value_at_Jan2015, Value_at_Feb2015, Value_at_Mar2015 then the data structure is part of the problem. Often it works out better to have identification as a different variable and then use the SAS BY processing with that variable get separate analysis for each level of the variable.

Contributor
Posts: 45

Re: Code optimization

I'm dealing with Metabolomics data so the name of the variables are a bunch of metabolites, I have renamed them to m1-m&i in my macro. In order to merge data sets together I have sorted every data set using the BY statement and then merging them with that BY.
Solution
‎12-21-2016 03:05 PM
Super User
Posts: 5,085

Re: Code optimization

Without code, we're limited to speaking in very general terms.  It sounds like there's a possibility of running through the loops many fewer times, if you were to switch to using a BY statement.  If that's feasible, it would be much faster.  A simple example of the long way:

 

data ny;

set usa;

if state='ny';

run;

proc means data=ny;

run;

data me;

set usa;

if state='me';

run;

proc means data=me;

run;

data vt;

set usa;

if state='vt';

run;

proc means data=vt;

run;

 

The shorter way just to illustrate a BY statement:

 

proc sort data=usa (where=(state in ('ny', 'me', 'vt'))) out=temp;

by state;

run;

proc means data=temp;

by state;

run;

 

A single PROC MEANS can calculate statistics for all three states.

 

Since you are working with macro loops, there might be significant rewrite of the code required.

Contributor
Posts: 45

Re: Code optimization

[ Edited ]

Ok so I basically wrote 2 macros, 1 in which the runtime isn't that long, but the 2nd one is taking a long time, I'll post it here I guess to make it easier for you guys. I edited my original post and included the code.

 

 

Edit: OK nvm I figured out yeah... I was doing unnecessary loops. In my merge/set statements I was merging 1 with 2, then 1-3, 1-4, 1-5, etc... until 1-final number. Rather than just ONLY doing merge from 1 to the final number.

Super User
Posts: 5,085

Re: Code optimization

I can't tell if this would be applicable or not, but here are some starting points on eliminating the looping using a BY variable ... food for thought anyway.

 

data &data.metabolites;

   set &data;

   %do i=1 %to &totalvars;

      metabolite_num=&i;

      original_value = m&i;

      log_value = log(original_value);

      output;

   %end;

run;

 

proc sort data=&data.metabolites;

   by metabolite_num;

run;

 

proc mixed data=&data.metabolites;

   by metabolite_num;

   class &class;

   model log_value = 

   ... changes to match the new data structure ...

 

You might need to examine the structure of the output data sets when using a BY statement, but one PROC MIXED with a BY statement will be faster than many PROC MIXEDs.

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 5 replies
  • 258 views
  • 0 likes
  • 3 in conversation