Hello Experts,
I would like to apply a log-ratio transformation to compositional variables before performing PCA. Is there a dedicated procedure (PROC) in SAS for this?
Thank you!
Here's an example of the centered log-ratio (CLR) transformation.
Note that the CLR requires strictly positive (non-zero) values for every part of the composition.
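For reference, this is the formula the program computes, written out as a sketch. The symbols are my own notation and do not come from any SAS documentation: x is one row of the composition, D is the number of parts (5 in the sample data), and g(x) is the geometric mean of that row.

```latex
\operatorname{clr}(x)_i = \ln\frac{x_i}{g(x)},
\qquad
g(x) = \Bigl(\prod_{j=1}^{D} x_j\Bigr)^{1/D},
\qquad
i = 1,\dots,D
```

By construction the D CLR values of a row sum to zero, which is why every part must be strictly positive: a single zero makes g(x) zero and the logarithm undefined.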
/* 1. Define sample data (5 parts: aa, bb, cc, dd, ee) */
data raw_data0;
  input id aa bb cc dd ee;
  datalines;
1 0.1 0.3 0.3 0.1 0.2
2 0.1 0.1 0.4 0.2 0.2
3 0.2 0.3 0.2 0.1 0.2
;
run;

/* 2. Compute the geometric mean of the parts for each row */
data raw_data1;
  set raw_data0;
  array parts[5] aa bb cc dd ee; /* List all components */
  mygeomean = geomean(of parts[*]);
  mysum = sum(of parts[*]); /* closure check only: equals 1 for each sample row */
run;

/* 3. CLR: log of each part divided by the row's geometric mean */
data raw_data2;
  set raw_data1;
  array parts[5] aa bb cc dd ee; /* List all components */
  array clr_parts[5]
        clr_aa clr_bb clr_cc clr_dd clr_ee;
  do i = 1 to dim(parts);
    clr_parts[i] = log(parts[i] / mygeomean);
    /* log = ln (natural log with base e) */
  end;
  drop i;
run;
/* end of program */
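Since the goal is PCA, the CLR variables can then be passed to PROC PRINCOMP. This is a minimal sketch, assuming the data set raw_data2 and the clr_ variables created by the program above. The output data set name pca_scores is a name I made up, and using the COV option (covariance rather than correlation matrix) is my choice here, not a requirement.

```sas
/* Assumes raw_data2 with clr_aa ... clr_ee from the program above.      */
/* pca_scores is a hypothetical name for the output data set.            */
proc princomp data=raw_data2 cov out=pca_scores;
  var clr_aa clr_bb clr_cc clr_dd clr_ee;
run;
```

Because the CLR values of each row sum to zero, expect at least one eigenvalue to be zero: five CLR variables give at most four informative components.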
BR, Koen
Compositional data are nonnegative multivariate data where the absolute values of the data carry only relative meaning.
Compositional data are often data with a constant-sum constraint: that is, ...
Because compositional data lie on a constrained simplex (the parts sum to a constant such as 1 or 100), standard linear methods should not be applied to the raw parts directly. Use a SAS DATA step to create transformed variables, such as log-ratios of the parts.
Those ratios may be a by-product of some procedure (I am not aware of one), but they are easy to calculate in a DATA step.
If you encounter difficulties in the calculation, please let us know... we will try to help you further with sample code.
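To give an idea of what such a DATA step could look like, here is a minimal sketch of one possible log-ratio, the additive log-ratio (ALR), which divides each part by one reference part before taking the log. The data set names comp and comp_alr, the alr_ variable names, and the choice of ee as the reference part are all assumptions made for illustration.

```sas
/* Hypothetical sample data with five parts: aa, bb, cc, dd, ee */
data comp;
  input id aa bb cc dd ee;
  datalines;
1 0.1 0.3 0.3 0.1 0.2
2 0.1 0.1 0.4 0.2 0.2
3 0.2 0.3 0.2 0.1 0.2
;
run;

/* ALR: log of each part relative to the reference part ee (an assumption) */
data comp_alr;
  set comp;
  array parts[4] aa bb cc dd;                       /* numerator parts */
  array alr_parts[4] alr_aa alr_bb alr_cc alr_dd;   /* transformed parts */
  do i = 1 to dim(parts);
    /* the log-ratio is only defined when both values are strictly positive */
    if parts[i] > 0 and ee > 0 then alr_parts[i] = log(parts[i] / ee);
    else alr_parts[i] = .;
  end;
  drop i;
run;
```

The ALR yields one variable fewer than the number of parts, and the result depends on which part is chosen as the reference.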
BR, Koen
By the way, if you can reduce the dimensionality to three using PCA, you can then use a ternary plot (also called a trilinear plot or triplot) to visualize the result.
BR, Koen