Howdy folks,
I am running both Pearson and Spearman correlations for a large dataset (approx. 7,000 sets of data), and am wondering whether there is a way to program PROC CORR to only run the analysis up to the first instance of "0" for each participant. All sets of data begin at a non-zero number and theoretically drop to zero. However, the point at which a set of data reaches zero differs between participants, so I can't simply delete or replace all columns following the first-occurring zero. For example, see the following three example sets of data (note that each cell indicates the number of items purchased at that price):
Participant | $1 | $2 | $3 | $4 | $5 | $6 | $7 | $8 | $9 | $10
1 | 10 | 8 | 6 | 4 | 2 | 0 | 0 | 0 | 0 | 0
2 | 25 | 25 | 25 | 20 | 18 | 15 | 10 | 0 | 0 | 0
3 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Notice that as price goes up, the number of purchases decreases. What I need is for PROC CORR to read each set of observations and analyze only the observations up to and including the first zero in each row, ignoring any zeroes after it. This needs to work for both the basic Pearson PROC CORR and the SPEARMAN-enabled PROC CORR.
Theoretically, I could simply work through the code and delete (or replace) all zeroes after the first-occurring zero, but it would be difficult to do so for 7,000 sets of observations, and I believe that SAS is capable of executing this task.
I had previously used an ARRAY statement to replace all instances of zero with "." to mark them as missing. However, the analysis requires that the first-occurring zero be included, and the ARRAY code I was using would remove the first-occurring zero as well.
Any advice? Many thanks for your time!
Which two variables are you going to use to calculate the correlation coefficient? Two participants? Or two prices?
And why not just set the zeros that follow the first zero to missing?
data have;
infile cards expandtabs truncover;
input Participant _1-_10;
cards;
1 10 8 6 4 2 0 0 0 0 0
2 25 25 25 20 18 15 10 0 0 0
3 2 0 0 0 0 0 0 0 0 0
;
run;
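data want;
  set have;
  array x{*} _1-_10;
  /* found is not retained, so it resets to missing on each new row */
  do i=1 to dim(x);
    if found then x{i}=.;      /* blank out everything after the first zero */
    if x{i}=0 then found=1;    /* the first zero itself is kept */
  end;
  drop i found;
run;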
proc print;run;
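If the goal is a price-versus-purchases correlation for each participant (an assumption, since the question above is still open), one possible follow-up is to stack the WANT dataset into one row per participant and price, skip the missing cells, and run PROC CORR with BY processing. This is only a sketch: the names long, price and number are made up for illustration, and price is taken to be the column position (1 to 10), which matches the $1 to $10 layout in the example.

```sas
/* Sketch: one row per participant and price, dropping the cells set to missing */
data long;
  set want;
  array x{*} _1-_10;
  do price = 1 to dim(x);
    number = x{price};
    if not missing(number) then output;
  end;
  keep participant price number;
run;

/* BY processing needs the data sorted by the BY variable */
proc sort data=long;
  by participant;
run;

/* PEARSON and SPEARMAN can be requested in the same call */
proc corr data=long pearson spearman;
  by participant;
  var price;
  with number;
run;
```

A participant like number 3 in the example ends up with only two rows (2 and 0), so the coefficient for such cases carries very little information.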
What are you correlating? Price with number of purchases, or number of purchases between participants?
One way to do this is to stack all variables' names and values then use by processing in proc corr for values greater than 0. Something like this:
data have;
input Price Participant1 Participant2 Participant3;
datalines;
1 10 25 0
2 8 25 0
3 6 25 0
4 4 20 0
5 2 18 0
6 0 15 0
7 0 10 0
8 0 0 0
9 0 0 0
10 0 0 0
;
data want(keep=variable price value);
set have;
array p(*) Participant:;
do i=1 to dim(p);
value=p(i);
variable=vname(p(i));
output;
end;
run;
proc sort data=want;
by variable;
run;
proc corr data=want(where=(value>0));
by variable;
var price;
with value;
run;
Please try the following syntax that will maintain the first instance of zero for proc corr:
data want(keep=variable price value);
set have;
array p(*) Participant:;
do i=1 to dim(p);
value=p(i);
variable=vname(p(i));
output;
end;
run;
proc sort data=want;
by variable;
run;
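data corr(drop=flag);
  /* DOW loop: each pass of the data step reads one BY group, so flag starts out missing for every participant */
  do until(last.variable);
    set want;
    by variable;
    if not flag then output;   /* keeps rows up to and including the first zero */
    if value=0 then flag = 1;
  end;
run;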
proc corr data=corr;
by variable;
var price;
with value;
run;
Make your data long instead of wide:
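data long;
  set wide;
  array A n1-n10;
  /* UNTIL is checked at the bottom of the loop, so the first zero is output before the loop stops */
  do price = 1 to dim(A) until(A{price}=0);
    number = A{price};
    output;
  end;
  keep participant price number;
run;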
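With roughly 7,000 participants, printing a PROC CORR table per participant gets unwieldy, so it may be more practical to write the coefficients to datasets instead. The sketch below assumes the LONG dataset from the step above; pearson_out and spearman_out are hypothetical output names.

```sas
/* BY processing needs the data sorted by participant */
proc sort data=long;
  by participant;
run;

/* NOPRINT suppresses the printed tables; OUTP= and OUTS= save the
   Pearson and Spearman statistics to datasets */
proc corr data=long pearson spearman noprint
          outp=pearson_out outs=spearman_out;
  by participant;
  var price;
  with number;
run;

/* Rows with _TYPE_='CORR' hold the coefficients, one per participant */
proc print data=pearson_out(where=(_type_='CORR'));
run;
```

The output datasets also carry rows for the mean, standard deviation and N of each BY group, which is handy for spotting participants with too few observations.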
@Ksharp wrote:
Which two variables are you going to use to calculate the correlation coefficient? Two participants? Or two prices?
And why not just set the zeros that follow the first zero to missing?
data have;
infile cards expandtabs truncover;
input Participant _1-_10;
cards;
1 10 8 6 4 2 0 0 0 0 0
2 25 25 25 20 18 15 10 0 0 0
3 2 0 0 0 0 0 0 0 0 0
;
run;
data want;
set have;
array x{*} _1-_10;
do i=1 to dim(x);
if found then x{i}=.;
if x{i}=0 then found=1;
end;
drop i found;
run;
proc print;run;
This looks like exactly what I'm looking for! Thanks for sending this. Will this syntax work with datasets that have already been imported into a SAS library? I.e., can I just replace the dataset names below with my libname.refname?
data want;
set have;
array x{*} _1-_10;
do i=1 to dim(x);
if found then x{i}=.;
if x{i}=0 then found=1;
end;
drop i found;
run;
proc print;run;
Yes. You can use that as long as your data has been turned into a SAS dataset.
data yourlib.want;
set yourlib.have;
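Spelled out in full, that might look something like the sketch below. The library path is a placeholder and yourlib, have and want are stand-in names; the array list also has to name your own purchase columns, in price order, rather than _1-_10.

```sas
/* Placeholder path and names: substitute your own library and datasets */
libname yourlib "/path/to/your/sas/data";

data yourlib.want;
  set yourlib.have;
  array x{*} _1-_10;         /* replace with your own purchase columns, in price order */
  do i=1 to dim(x);
    if found then x{i}=.;    /* zeros after the first become missing */
    if x{i}=0 then found=1;  /* the first zero is kept */
  end;
  drop i found;
run;
```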