Dear SAS community members,
I work in scientific research. I would like to run correlations between one variable (e.g. blood cholesterol) and thousands of others (gene names) in a single step.
The variable names look like this (and continue for several thousand):
TC01000005_hg_1, TC01000006_hg_1, TC01000008_hg_1, TC01000009_hg_1,
TC01000010_hg_1, TC01000011_hg_1, TC01000012_hg_1, TC01000013_hg_1,
TC01000014_hg_1, TC01000015_hg_1, TC01000016_hg_1, TC01000017_hg_1,
TC01000018_hg_1, TC01000019_hg_1
My first question is how to write the statement so that it includes all those thousands of variables. I have seen syntax like the following:
proc corr data=myData;
var Var1;
with var2-var99;
run;
But how should I adapt it to my variable names?
My second question is how to use the BEST= option so that, after running the thousands of correlations, I get a list of the top 50 results. This should include the 50 variable names with the highest (or lowest) R values and the lowest p-values.
My third question: is it possible to run this type of analysis (one variable against 30,000 variables) in SAS University Edition? I have assigned 7 GB of my 8 GB of RAM to the VirtualBox VM.
Thank you so much in advance!!
Regards
Apo
Another form of variable list is the name prefix list: the starting characters shared by a group of similarly named variables, followed by a colon.
With TC: ;
would compare the variable on your VAR statement with all variables whose names start with the characters TC.
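As a minimal sketch, the prefix list slots into the PROC CORR step from the question like this. The names myData and Var1 are taken from the example code in the question and stand in for your own data set and your cholesterol variable.

```sas
/* Correlate one variable with every variable whose name starts with TC */
proc corr data=myData;
var Var1;      /* placeholder for the single variable of interest */
with TC: ;     /* prefix list: all variables whose names begin with TC */
run;
```

Note that the prefix list picks up every variable starting with TC, so it only works cleanly if no unrelated variables share that prefix.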
It helps to show the entire code when you get an error. Some things that may help: use the NOPRINT option on the PROC CORR statement to suppress the printed output, since formatting it takes memory, and direct the statistics you want to data sets instead.
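A sketch of that suggestion, assuming the data set myData and the variable Var1 from the question. The output data set names corr_out and corr_only are made up for illustration. Keep in mind that the OUTP= data set holds the correlations (plus means, standard deviations and counts) but not the p-values.

```sas
/* Suppress printed output and write the Pearson statistics to a data set */
proc corr data=myData noprint outp=corr_out;
var Var1;
with TC: ;
run;

/* Keep only the correlation rows; _NAME_ identifies the WITH variable */
data corr_only;
set corr_out;
where _TYPE_ = 'CORR';
run;
```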
You may have to break the data into groups. If you use the -- operator (two dashes) in a variable list, the variables are selected by their position in the data set:
with TC01000005_hg_1 -- TC01000019_hg_1; would select the adjacent columns of the data set, from the first variable named (the leftmost column of the group) through the last variable named (the rightmost).
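One way to split the work into groups could look like the sketch below. It assumes the variables are stored in the data set in the same order as in the list in the question, so that the two ranges do not overlap. The data set names corr_part1, corr_part2 and corr_all are hypothetical.

```sas
/* First group of adjacent columns */
proc corr data=myData noprint outp=corr_part1;
var Var1;
with TC01000005_hg_1 -- TC01000012_hg_1;
run;

/* Second group of adjacent columns */
proc corr data=myData noprint outp=corr_part2;
var Var1;
with TC01000013_hg_1 -- TC01000019_hg_1;
run;

/* Stack the correlation rows from both runs */
data corr_all;
set corr_part1 corr_part2;
where _TYPE_ = 'CORR';
run;
```

With real data you would pick the group boundaries from the actual column order and use as many groups as memory requires.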
I have to say that when you said you had 1000s of variables I was afraid there might be a memory issue.
Answer to first question: run proc contents on the file (with the VARNUM option, so the variables are listed in data set order rather than alphabetically) to identify the first and last variables in the set. Then you can simply specify them in a list like: TC01000005_hg_1--TC01000019_hg_1
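With thousands of variables the printed listing gets long, so it may be easier to write the variable names and positions to a data set and sort them. This is only a sketch; var_list is a made-up data set name.

```sas
/* Write variable names and their positions to a data set instead of printing */
proc contents data=myData out=var_list(keep=name varnum) noprint;
run;

/* Sort by position so the first and last variables of the block are easy to find */
proc sort data=var_list;
by varnum;
run;
```

The first and last TC names in var_list are then the two endpoints to put on either side of the -- operator.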
Second question:
proc corr data=myData best=50;
var Var1;
with var2--var99;
run;
Third question: I don't know, but I don't see why it wouldn't work.
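If you want the top 50 as a data set, with the p-values alongside the correlations, an alternative is to capture the ODS table and sort it yourself. This is a sketch under assumptions: as far as I know, the PearsonCorr table has a column named after the VAR variable (here Var1) holding r and a column with a P prefix (here PVar1) holding the p-value, which is worth confirming by running proc contents on the captured data set. The names corr_p, corr_ranked and top50 are made up.

```sas
/* Suppress the display but still capture the ODS table (NOPRINT would block it) */
ods exclude all;
ods output PearsonCorr=corr_p;
proc corr data=myData;
var Var1;
with TC: ;
run;
ods exclude none;

/* Rank by absolute correlation; the column Var1 is assumed to hold r */
data corr_ranked;
set corr_p;
abs_r = abs(Var1);
run;

proc sort data=corr_ranked;
by descending abs_r;
run;

/* Keep the 50 strongest correlations, positive or negative */
data top50;
set corr_ranked(obs=50);
run;
```

Sorting on abs_r keeps strong negative correlations in the list. Sort on Var1 itself if you only want the highest or only the lowest values.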
Art, CEO, AnalystFinder.com
Dear art297
Thanks for your reply!
The code works perfectly! The only problem is that, when I try to run the correlation analysis of my variable against around 10,000 variables, SAS stops and shows the following message:
SAS/IML is the best way to do that. @Rick_SAS might be interested.
Post your data and the output you want to see here.