BookmarkSubscribeRSS Feed
ajb
Obsidian | Level 7 ajb
Obsidian | Level 7

Hi,

 

I am working on a data set similar to the one in the table below.  I would like to create a a fifth variable, v5, where if 3 or more the other variables (V1, V2, V3, V4) have the same value then V5 would be assigned that value as well.

 

I know how to do this by brute force:

 

DATA work.eg;

     SET work.eg;

     

            IF v1=1 and v2=1 and v4=1 THEN v5=1;

  ELSE IF v1=1 and v3=1 and v4=1 THEN v5 =1;

etc...

 

However, I have to do this repeatedly for multiple values and variables and I am afraid that I will miss a possible combination.  I want like SAS to do this step for me.  How would I do this?

 

Thanks in advance!

 

v1 v2 v3 v4
1 1 3 1
1 3 1 1
1 1 2 2
1 3 2 4
2 3 1 4
2 2 1 3

EDIT: Sorry everyone, I was got very ill and am now checking this. Thanks for all the answers.

If the variables were all categorical, how would I modify the codes to make them do the same thing?

11 REPLIES 11
novinosrin
Tourmaline | Level 20

So 3 "or" more same value eh?

 


data have;
input v1	v2	v3	v4;
cards;
1	1	3	1
1	3	1	1
1	1	2	2
1	3	2	4
2	3	1	4
1	2	1	3
;


data want;
set have;
array v v:;	
i_=1;
call sortn(of v1-v4);
do i=2 to dim(v);
if v(i)=v(i-1) then i_=sum(i_,1);
else i_=1;
if i_=3 then do; v5=v(i);leave;end;
end;
drop i:;
run;

 

 

novinosrin
Tourmaline | Level 20
/*The call sortn in the previous would have changed the order of v1-v4,
so this modified one with set have point=_n_ will
reset v1-v4 in its original order*/
data want1;
set have;
array v v:;	
i_=1;
call sortn(of v1-v4);
do i=2 to dim(v);
if v(i)=v(i-1) then i_=sum(i_,1);
else i_=1;
if i_=3 then do; v5=v(i);leave;end;
end;
set have point=_n_;
drop i:;
run;
KachiM
Rhodochrosite | Level 12

Novinosrin:

data need;
   set have;
   array v v1 - v4;
   array k[4] _temporary_;
   do i = 1 to dim(v);
      k[v[i]] + 1;
   end;
   j = max(of k[*]);
   if j = 3 then v5 = k[j];
   call missing(of k[*]);
drop i j;
run;

The program simplifies if we count  the times each V1 to V4 appears in a row. This will avoid comparison of Vs, Sorting and double passing of the Data Set. But this begs for the maximum value of  V1 to V4. In this example it is 4. This maximum value is required to size a _temporary_ array. One can use big number like 1000 which does not affect the results or its speed but it would be a waste of memory.

Here comes the program:

mkeintz
PROC Star

You can loop from the miniumum value of v1-v4 to the maximum, counting the number of each value until you get 3 or more:

 

data have;
  input v1 v2 v3 v4;
cards;
1	1	3	1
1	3	1	1
1	1	2	2
1	3	2	4
2	3	1	4
1	2	1	3
1	2	2	3
;
data want (drop=i);
  set have;
  do i=min(of v:) to max(of v:) until (countw(catx(',',of v:),i,'k')>=3);
  end;
  if i<=max(of v:) then v5=i;
run;

Notes:

 

  1. The catx(',', of v:) function tells sas to make a comma-separated list of values from all the "v" variables.  It is embedded in a countw function.

  2. The 'k' option in the countw function says to consider every character except the specified value (i) as a word separator.  So for a string="1,2,2,2,1,3,4,4,2", the first argument to countw(string,2,'k') would effectively look like "2 2 2 2" and return a value of 4.

  3. The loop says to iterate through all the integers from the minimum observed value to the maximum, but to stop if any count is >=3.  If none of the values meet this stopping criterion, then the loop index i will be 1+max(of v:), and you will no to not assign a value to V5  (or perhaps to assign a value of V5=0 if appropriate).
  4. Also don't use the varname V5 before the loop.  Otherwise the expression (of v:) would include it.
  5. And of course the loop specification (i=min(of v:) to max(of v:) would have to be much more involved if the values were not integers.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
novinosrin
Tourmaline | Level 20

Most elegant and great notes for the OP. et al.  Priceless!!!

Astounding
PROC Star

Given that you want t match at least 3 out of 4, you can skip the in between values.  Just check the minimum value and the maximum value.

mkeintz
PROC Star

@Astounding

 

Good point - I hadn't thought of that.  Yes, checking min and max is all that's needed, since there's no way for a non-extreme value to have a count of 3 out of 4.

 

But if the user progresses to 3 out of 5,  interior values could not be skipped (i.e. 1,2,2,2,3  ==> newvar=2).

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
KachiM
Rhodochrosite | Level 12

Your solution is most elegant. I was curious as to how good/bad my solution was. I simulated 100,000 rows similar to HAVE and re-ran

my program. It took about:

 


real time 0.04 seconds
user cpu time 0.04 seconds
system cpu time 0.00 seconds
memory 648.43k

 

Your solution took about:

real time 0.22 seconds
user cpu time 0.21 seconds
system cpu time 0.00 seconds
memory 680.15k

 

In terms of run-time, some difference is found but it will not make a big difference. So my solution is not slower compared to yours.

 

Regards,

DataSP

FreelanceReinh
Jade | Level 19

Hi @ajb,

 

This works for arbitrary non-missing numeric values v1 - v4:

data want(drop=t);
set have;
t=largest(2,of v:);
if smallest(2,of v:)=max(of v:) | min(of v:)=t then v5=t;
run;

And this allows for missing (incl. special missing) values:

data want(drop=t);
set have;
t=ordinal(3,of v:);
if ordinal(2,of v:)=ordinal(4,of v:) | ordinal(1,of v:)=t then v5=t;
run;
FreelanceReinh
Jade | Level 19

Here is a shorter version (for arbitrary non-missing numeric values v1 - v4):

data want;
set have;
if ~mad(of v:) then v5=median(of v:);
run;

 

Edit: As an additional advantage (over the suggestion involving LARGEST() etc.) it also works for five values, e.g. v0 - v4. Not for more than five, though.

ajb
Obsidian | Level 7 ajb
Obsidian | Level 7

Thanks for all the answers. If the variables were all categorical, how would I modify the codes to make them do the same thing? I am also realizing that I may have to do this for up to 16 categorical variables.  Does this make the coding much more difficult?

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 4078 views
  • 3 likes
  • 6 in conversation