Hi,
I am working on a data set similar to the one in the table below. I would like to create a a fifth variable, v5, where if 3 or more the other variables (V1, V2, V3, V4) have the same value then V5 would be assigned that value as well.
I know how to do this by brute force:
DATA work.eg;
SET work.eg;
IF v1=1 and v2=1 and v4=1 THEN v5=1;
ELSE IF v1=1 and v3=1 and v4=1 THEN v5 =1;
etc...
However, I have to do this repeatedly for multiple values and variables and I am afraid that I will miss a possible combination. I want like SAS to do this step for me. How would I do this?
Thanks in advance!
|
EDIT: Sorry everyone, I was got very ill and am now checking this. Thanks for all the answers.
If the variables were all categorical, how would I modify the codes to make them do the same thing?
So 3 "or" more same value eh?
data have;
input v1 v2 v3 v4;
cards;
1 1 3 1
1 3 1 1
1 1 2 2
1 3 2 4
2 3 1 4
1 2 1 3
;
data want;
set have;
array v v:;
i_=1;
call sortn(of v1-v4);
do i=2 to dim(v);
if v(i)=v(i-1) then i_=sum(i_,1);
else i_=1;
if i_=3 then do; v5=v(i);leave;end;
end;
drop i:;
run;
/*The call sortn in the previous would have changed the order of v1-v4,
so this modified one with set have point=_n_ will
reset v1-v4 in its original order*/
data want1;
set have;
array v v:;
i_=1;
call sortn(of v1-v4);
do i=2 to dim(v);
if v(i)=v(i-1) then i_=sum(i_,1);
else i_=1;
if i_=3 then do; v5=v(i);leave;end;
end;
set have point=_n_;
drop i:;
run;
Novinosrin:
data need;
set have;
array v v1 - v4;
array k[4] _temporary_;
do i = 1 to dim(v);
k[v[i]] + 1;
end;
j = max(of k[*]);
if j = 3 then v5 = k[j];
call missing(of k[*]);
drop i j;
run;
The program simplifies if we count the times each V1 to V4 appears in a row. This will avoid comparison of Vs, Sorting and double passing of the Data Set. But this begs for the maximum value of V1 to V4. In this example it is 4. This maximum value is required to size a _temporary_ array. One can use big number like 1000 which does not affect the results or its speed but it would be a waste of memory.
Here comes the program:
You can loop from the miniumum value of v1-v4 to the maximum, counting the number of each value until you get 3 or more:
data have;
input v1 v2 v3 v4;
cards;
1 1 3 1
1 3 1 1
1 1 2 2
1 3 2 4
2 3 1 4
1 2 1 3
1 2 2 3
;
data want (drop=i);
set have;
do i=min(of v:) to max(of v:) until (countw(catx(',',of v:),i,'k')>=3);
end;
if i<=max(of v:) then v5=i;
run;
Notes:
Most elegant and great notes for the OP. et al. Priceless!!!
Given that you want t match at least 3 out of 4, you can skip the in between values. Just check the minimum value and the maximum value.
Good point - I hadn't thought of that. Yes, checking min and max is all that's needed, since there's no way for a non-extreme value to have a count of 3 out of 4.
But if the user progresses to 3 out of 5, interior values could not be skipped (i.e. 1,2,2,2,3 ==> newvar=2).
Your solution is most elegant. I was curious as to how good/bad my solution was. I simulated 100,000 rows similar to HAVE and re-ran
my program. It took about:
real time 0.04 seconds
user cpu time 0.04 seconds
system cpu time 0.00 seconds
memory 648.43k
Your solution took about:
real time 0.22 seconds
user cpu time 0.21 seconds
system cpu time 0.00 seconds
memory 680.15k
In terms of run-time, some difference is found but it will not make a big difference. So my solution is not slower compared to yours.
Regards,
DataSP
Hi @ajb,
This works for arbitrary non-missing numeric values v1 - v4:
data want(drop=t);
set have;
t=largest(2,of v:);
if smallest(2,of v:)=max(of v:) | min(of v:)=t then v5=t;
run;
And this allows for missing (incl. special missing) values:
data want(drop=t);
set have;
t=ordinal(3,of v:);
if ordinal(2,of v:)=ordinal(4,of v:) | ordinal(1,of v:)=t then v5=t;
run;
Here is a shorter version (for arbitrary non-missing numeric values v1 - v4):
data want;
set have;
if ~mad(of v:) then v5=median(of v:);
run;
Edit: As an additional advantage (over the suggestion involving LARGEST() etc.) it also works for five values, e.g. v0 - v4. Not for more than five, though.
Thanks for all the answers. If the variables were all categorical, how would I modify the codes to make them do the same thing? I am also realizing that I may have to do this for up to 16 categorical variables. Does this make the coding much more difficult?
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.