
correlation with thousand variables

New Contributor Apo
Posts: 3

Dear SAS community members,

 

I work in scientific research. I would like to run correlations between one variable (e.g. blood cholesterol) and thousands of others (gene names) in a single step.

The variable names look like this (and continue into the thousands):

TC01000005_hg_1, TC01000006_hg_1, TC01000008_hg_1, TC01000009_hg_1, 
TC01000010_hg_1, TC01000011_hg_1, TC01000012_hg_1, TC01000013_hg_1, 
TC01000014_hg_1, TC01000015_hg_1, TC01000016_hg_1, TC01000017_hg_1, 
TC01000018_hg_1, TC01000019_hg_1

My first question is how I should write the statements so that they include all of those thousands of variables. I have seen syntax like the following:

proc corr data=myData;
  var Var1;
  with var2-var99;
run;

 

But how should I adapt it to cover variables named like mine?

 

 

My second question is how I should write the "BEST" option so that, after running the thousands of correlations, I get a list of the top 50 results. This should include the 50 variable names with the highest (or lowest) R value and the lowest p value.

 

My third question: is it possible to run this type of correlation (one variable against 30,000 variables) in SAS University Edition? I have assigned 7 GB of the 8 GB of total RAM to the VM.

 

Thank you so much in advance!!

 

Regards

Apo


Accepted Solutions
Solution
‎03-29-2017 03:59 AM
Super User
Posts: 11,343

Re: correlation with thousand variables

Another form of variable list uses the starting characters shared by a group of similarly named variables, followed by a colon (:). For example,

 

With TC: ;

would correlate the variable on the VAR statement with all variables whose names start with the characters TC.
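
Put together, the prefix list might look like the sketch below. The data set name myData comes from the question; cholesterol is a hypothetical name for the single variable of interest.

```sas
/* Sketch only: "cholesterol" is an assumed variable name. */
proc corr data=myData;
  var cholesterol;   /* the one variable of interest */
  with TC: ;         /* every variable whose name starts with TC */
run;
```

Note that the prefix list picks up every variable starting with TC, so it only works cleanly if no unrelated variable shares that prefix.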



New Contributor Apo
Posts: 3

Re: correlation with thousand variables

Thank you so much for this reply!
However, I run into problems when I try to correlate my variable (200 obs) with 10,000 variables (200 obs each): SAS reports insufficient memory.

ERROR: The SAS System stopped processing this step because of insufficient memory.
WARNING: The data set WORK.NEW4 may be incomplete. When this step was stopped there were 0 observations and 3 variables.
WARNING: Data set WORK.NEW4 was not replaced because this step was stopped.
NOTE: PROCEDURE CORR used (Total process time):
real time 0.47 seconds
cpu time 0.39 seconds

Super User
Posts: 11,343

Re: correlation with thousand variables

It helps to show the entire code when you get an error. Some things that may help: use the NOPRINT option on the PROC CORR statement to suppress the printed output, which takes memory to format, and direct the desired statistics to data sets instead.
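
A minimal sketch of that suggestion, assuming the single variable is called cholesterol (a hypothetical name) and using the OUTP= option to write the Pearson statistics to a data set (work.corr_out is also a made-up name):

```sas
/* Sketch only: "cholesterol" and work.corr_out are assumed names. */
proc corr data=myData noprint outp=work.corr_out;
  var cholesterol;
  with TC: ;
run;

/* The OUTP= data set also holds MEAN, STD and N rows;
   keep only the correlation rows. _NAME_ identifies the WITH variable. */
data work.corr_only;
  set work.corr_out;
  where _TYPE_ = 'CORR';
run;
```

The OUTP= data set holds correlations but not p-values. Whether this alone avoids the memory error depends on the environment, so treat it as something to try rather than a guaranteed fix.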

 

You may have to break the data into groups. If you use the -- operator (two dashes) in a variable list, all variables positioned between the two named variables in the data set are selected:

with TC01000005_hg_1 -- TC01000019_hg_1;

This selects the adjacent columns of the data set, from the first-named variable on the left through the last-named variable on the right.
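
Breaking the work into groups could look something like the sketch below. It assumes the variables listed in the question sit next to each other in the data set in that order, that the single variable is called cholesterol (hypothetical), and that two groups are enough to illustrate the idea; the work.part1, work.part2 and work.all_corr names are made up.

```sas
/* Sketch only: variable order and "cholesterol" are assumptions. */
proc corr data=myData noprint outp=work.part1;
  var cholesterol;
  with TC01000005_hg_1 -- TC01000012_hg_1;   /* first group of columns */
run;

proc corr data=myData noprint outp=work.part2;
  var cholesterol;
  with TC01000013_hg_1 -- TC01000019_hg_1;   /* second group of columns */
run;

/* Stack the correlation rows from both groups into one data set. */
data work.all_corr;
  set work.part1 work.part2;
  where _TYPE_ = 'CORR';
run;
```

With thousands of variables the same pattern would simply be repeated over more, larger groups.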

 

I have to say that when you said you had 1000s of variables I was afraid there might be a memory issue.

 

 

 

PROC Star
Posts: 7,492

Re: correlation with thousand variables


Answer to first question: run PROC CONTENTS on the file to identify the first and last variables in the set. Then you can simply specify them in a list like: TC01000005_hg_1--TC01000019_hg_1
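
One way to find the first and last variables, sketched here with a hypothetical output data set work.varlist, is to write the PROC CONTENTS results to a data set and list the names in column order:

```sas
/* Sketch only: work.varlist is an assumed name. */
proc contents data=myData noprint out=work.varlist(keep=name varnum);
run;

/* VARNUM is the position of each variable in the data set. */
proc sort data=work.varlist;
  by varnum;
run;

proc print data=work.varlist noobs;
  where upcase(name) like 'TC%';   /* show only the TC variables */
run;
```

Because the -- list selects by position, any variable that happens to sit between the first and last TC columns is included too, so it is worth checking that the VARNUM values run without gaps.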

 

Second question:

proc corr data=myData best=50;
  var Var1;
  with var2--var99;
run;
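
If a sortable table is preferred over printed output, a possible alternative is to write the correlations to a data set and rank them by absolute value. This is a sketch under assumptions: cholesterol is a hypothetical name for the single variable, and the work.* data set names are made up.

```sas
/* Sketch only: "cholesterol" and the work.* names are assumed. */
proc corr data=myData noprint outp=work.corr_out;
  var cholesterol;
  with TC: ;
run;

/* Keep the correlation rows and compute |r| for ranking. */
data work.ranked;
  set work.corr_out(where=(_TYPE_ = 'CORR'));
  abs_r = abs(cholesterol);   /* the column named after the VAR variable holds r */
run;

proc sort data=work.ranked;
  by descending abs_r;
run;

/* The 50 strongest correlations, positive or negative. */
proc print data=work.ranked(obs=50) noobs;
  var _NAME_ cholesterol abs_r;
run;
```

The OUTP= data set does not contain p-values. If no values are missing, every pair has the same N, so ranking by |r| gives the same order as ranking by p-value.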

 

Third question: I don't know, but I don't see why it wouldn't.

 

Art, CEO, AnalystFinder.com

 

New Contributor Apo
Posts: 3

Re: correlation with thousand variables

Dear art297

Thanks for your reply!

 

The code works perfectly! The only problem is that when I try to run the correlation analysis of my variable against around 10,000 variables, SAS stops and shows the following message:

ERROR: The SAS System stopped processing this step because of insufficient memory.
WARNING: The data set WORK.NEW4 may be incomplete. When this step was stopped there were 0 observations and 3 variables.
WARNING: Data set WORK.NEW4 was not replaced because this step was stopped.
NOTE: PROCEDURE CORR used (Total process time):
real time 0.47 seconds
cpu time 0.39 seconds
 
My SAS University Edition is running on 2 cores with 7 GB of RAM. Do you think I should use a faster computer or a standard (non-University) edition of SAS?
 
Thank you once again
 
Super User
Posts: 10,046

Re: correlation with thousand variables

SAS/IML is the best way to do that. @Rick_SAS might be interested.

Post your data and the output you want to see here.
