Solved: Best method for comparing multiple variables

supp · Posted 09-11-2019 09:08 AM

Hello SAS community. I am hopeful someone might have a better method for comparing three character variables within the same observations such that it identifies when two or more of the variables are different. Also, a null value in one of the variables should not be seen as different.

hypothetical data:

data have;
infile datalines truncover;
input id $ var1 $ var2 $ var3 $;
datalines;
1 . A A
2 A A A
3 A . B
4 . A B
5 A B C
6 A A .
;

Data want would return ID's 3, 4 and 5 (there is at least one difference between the non missing values). It would not return ID's 1,2 or 6 (all non missing values are the same).

I tried a series of nested if-then do statements. For example:

data want;
set have;
if var1 ne '.' then do;
  if var2 ne '.' then do;
    if var1 ne var2;
  end;
end;
run;

I might have gotten a variation (not exactly like the example above) to maybe work. Given the volume of data it is hard to be confident in my solution. I believe there is a much more elegant solution to this problem, any suggestions would be appreciated.

hashman · Posted 09-12-2019 10:44 PM

@supp:

1. After the DO loop, the flag is either 1 or missing. The double NOT (i.e. ^^) turns 1 into 1 but it turns the missing into 0, quod erat faciendum.

2. COALCESCEC to find the first nonmissing value is a great idea efficiency-wise, as it runs much faster than examining one item at a time in a loop till a nonmissing value is found - especially for a large array where the first nonmissing is closer to the end. However, to take advantage of this efficiency, the ensuing DO loop should begin with the index up by 1 from where the first nonmissing is found, while DO OVER starts at 1. To change that, (a) the implicit array has to be recoded as explicit and (b) the WHICHC function should be used to find the index of the first nonmissing item. Also, COALCESCEC returns a $200 result if not force-sized, so assuming that all V-variables have the same length, it can be preset to the length of the first one.

With the above in mind, the program takes on the form:

data have ;                                                                                                                                                                                                                                                     
  input id (V1-V3) ($) ;                                                                                                                                                                                                                                        
  cards ;                                                                                                                                                                                                                                                       
0 .  .  AB                                                                                                                                                                                                                                                      
1 .  AB AB                                                                                                                                                                                                                                                      
2 AA AA AA                                                                                                                                                                                                                                                      
3 AA .  BB                                                                                                                                                                                                                                                      
4 .  AA BB                                                                                                                                                                                                                                                      
5 AA BB CC                                                                                                                                                                                                                                                      
6 AA AA .                                                                                                                                                                                                                                                       
7 .  AA .                                                                                                                                                                                                                                                       
8 .  .  .                                                                                                                                                                                                                                                       
9 AB A  B                                                                                                                                                                                                                                                       
;                                                                                                                                                                                                                                                               
run ;                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                
data want (drop = _:) ;                                                                                                                                                                                                                                         
  set have ;                                                                                                                                                                                                                                                    
  array vv[*] v: ;                                                                                                                                                                                                                                              
  _k = v1 ; * set length ;                                                                                                                                                                                                                                      
  _k = coalescec (of vv[*]) ;                                                                                                                                                                                                                                   
  _x = whichc (_k, of vv[*]) + 1 ;                                                                                                                                                                                                                              
  unequal_flag = 0 ;                                                                                                                                                                                                                                            
  do _i = _x to dim (vv) ;                                                                                                                                                                                                                                      
    if cmiss (vv[_i]) or vv[_i] = _k then continue ;                                                                                                                                                                                                            
    unequal_flag = 1 ;                                                                                                                                                                                                                                          
    leave ;                                                                                                                                                                                                                                                     
  end ;                                                                                                                                                                                                                                                         
run ;

Kind regards

Paul D.

View solution in original post

novinosrin · Posted 09-11-2019 09:16 AM


data have;
infile datalines truncover;
input id $ var1 $ var2 $ var3 $;
datalines;
1 . A A
2 A A A
3 A . B
4 . A B
5 A B C
6 A A .
;
data want;
set have;
array v var:;
k=cats(of v(*));
if count(k,strip(coalescec(of v(*))))=dim(v)-cmiss(of v(*)) then flag=1;
else flag=0;
drop k;
run;

PaigeMiller · Posted 09-11-2019 09:27 AM

This is interesting logic, using COALESCEC in a way I would not have though of.

If the real problem allows variables VAR1-VAR3 to take on arbitrary length character strings instead of character strings that have only a single character, I don't think this works any more. Of course, this is me reading between the lines; the original question didn't say that character strings VAR1-VAR3 could be more than one character.

In the more general case, I think you would need some sort of ARRAY to loop over all VAR1-VARn variables, and if a mismatch is found the flag is zero; if a mismatch is not found then the flag is one.

--
Paige Miller

gamotte · Posted 09-11-2019 10:26 AM

Hello,

For arbitrary length strings, i think a slightly modified version of @novinosrin's program

should work :

if countw(catx(' ',of v(*)),strip(coalescec(of v(*))))=dim(v)-cmiss(of v(*)) then flag=1;

supp · Posted 09-11-2019 12:54 PM

Can you elaborate on why the solution wouldn't work for character strings more than one character?

novinosrin · Posted 09-11-2019 01:19 PM

Hi @supp Requesting you to please test with a small sample. I am intrigued too coz that solution seem not to have anything to do with length rather deals with value. However, I didn't test extensively, and I might later when I have some time.

The logic is simple:

1. concatenate the contents of the array. Cats would have removed leading and trailing blanks

2. Identify the 1st non-missing value with coalescec within the contents of the array

3. dim(v)-cmiss(of v(*)) gives us the number of non-missing elements in the array

4. Logic: Check if the count of the 1st non-missing value is equal to the number of non-missing elements in the array with count function. If yes , the elements are the same, and if false obviously otherwise

5. Hence, i am not sure what has this got to do with length of any sort. Sometimes, i code too fast and I tend to miss out some intricacies, so test and let me know plz.

supp · Posted 09-11-2019 01:23 PM

Thanks for all the responses. Based on what people posted I read this paper to get a basic understanding of arrays:

https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/242-30.pdf

I applied a couple of the solutions presented here:

Method #1 - find first non missing variable then loop through all (non-missing) variables to check if they have a difference:

data want ;
  set have ;
  array var var1-var3 ;
  first_value=coalescec(of var[*]);
  flag=0;
  do _n_=1 to dim(var) while (not flag);
    if not missing(var[_n_]) then flag = first_value ne var[_n_];
  end;
run;

Method # 2 - Create a new variable with all variables concatenated together. Then count the number of times the first non missing value occurs in the concatenated variable. Compare this count to the total number of non missing variables. If the counts are different then there is a difference among the variables:

data want;
set have;
array v var:;
k=cats(of v(*));
if count(k,strip(coalescec(of v(*))))=dim(v)-cmiss(of v(*)) then flag=1;
else flag=0;
drop k;
run;

I applied both methods to my data, about 35,000 rows, and got the same observations returned. Method #2 seem to have slightly better performance. Less real time, user cpu time, system cpu time and memory used. For this reason method # 2 is my preferred solution.

hashman · Posted 09-11-2019 11:41 PM

@novinosrin:

Very clever indeed.

However, I can't help but note that CATS in situations like this or similar is dangerous because of its scrambling effect. For example, your solution will readily break on something as simple as:

data have ;                 
  input id (var1-var3) ($) ;
  cards ;                   
1 AB A B                    
;                           
run ;

The obvious reason is that (AB A B) is scrambled to (ABAB), and any algorithm looking for something different from AB will say that all VAR values are equal. Of course, you can modify your code using CATX instead. But there's another caveat: The buffer of any feline function (and also others like REPEAT) is limited to 32767; and so if the sum of the non-blank lengths of the VAR: should exceed 32767, it can break the code because of the truncation.

Methinks in this situation (and situations similar to this), a general solution devoid of the above pitfalls is possible only via array processing following the logic:

(1) Find the first (or last) non-missing value.

(2) See if any subsequent (or preceding) non-missing value is different; and if so, break the loop.

For example:

data have ;                      
  input id (V1-V3) ($) ;      
  cards ;                        
1 .  AB AB                        
2 AA AA AA                        
3 AA .  BB                        
4 .  AA BB                        
5 AA BB CC                        
6 AA AA .                        
7 .  AA .                        
8 .  .  .                        
9 AB A  B
;                                
run ;                            
                                 
data want (drop = _:) ;          
  set have ;                      
  array vv v: ;                  
  do over vv ;                    
    if cmiss (vv) then continue ;
    _n + 1 ;                      
    if _n = 1 then _v = vv ;      
    else if vv ne _v then do ;    
      unequal_flag = 1 ;          
      leave ;                    
    end ;                        
  end ;                          
  unequal_flag = ^^ unequal_flag ;
run ;

It's by far not as elegant as your offering, but it's not going to break on scrambling or insufficient buffer length.

Kind regards

Paul D.

novinosrin · Posted 09-12-2019 07:50 AM

Guru @hashman Very interesting insight and rationale. Honestly the buffer thing never clicked to me. I guess the tendency of me treating each one like a fun video game rather a real technical solution shows that I lack maturity. 🙂 Nonetheless, that specialist professional demeanor is something I am gradually learning from you.

PS Though both of us live in the same time zone, I request you to kindly bear with delays in me acknowledging your post as it seems I am missing out quite a bit on some threads while I am sleeping and you are wide awake as I noticed this while brushing my teeth this morning. Also you are aware I try to follow your posts in the ones I didn't participate, and indeed the value is priceless.

novinosrin · Posted 09-12-2019 09:19 AM

Guru @hashman Requesting your time and help if and when you can. Now i am a little confused as the solution still seems to work on the sample you provided barring the buffer limitations which I understood squeaky clean.

data have ;                 
  input id (v1-v3) ($) ;
  cards ;                   
1 AB A B                    
;                           
run ; 
data want;
set have;
array v v1-v3;
k=cats(of v(*));
if count(k,strip(coalescec(of v(*))))=dim(v)-cmiss(of v(*)) then flag=1;
else flag=0;
*drop k;
run;

proc print noobs;run;

id	v1	v2	v3	k	flag
1	AB	A	B	ABAB	0


data have ; 
input id (V1-V3) ($) ; 
cards ; 
1 . AB AB 
2 AA AA AA 
3 AA . BB 
4 . AA BB 
5 AA BB CC 
6 AA AA . 
7 . AA . 
8 . . . 
9 AB A B
; 
run ;

data want;
set have;
array v v1-v3;
k=cats(of v(*));
if count(k,strip(coalescec(of v(*))))=dim(v)-cmiss(of v(*)) then flag=1;
else flag=0;
*drop k;
run;

proc print noobs;run;

id	V1	V2	V3	k	flag
1		AB	AB	ABAB	1
2	AA	AA	AA	AAAAAA	1
3	AA		BB	AABB	0
4		AA	BB	AABB	0
5	AA	BB	CC	AABBCC	0
6	AA	AA		AAAA	1
7		AA		AA	1
8					1
9	AB	A	B	ABAB	0

Now, i am seriously confused how can that work right. It's not about right or wrong, I would like to know stuff works-mechanic here. Jeez! help me out

Addendum : EDITED: at 9:38 EDT Bridgeport time, I think i got it, it's the count of non missing's( dim-cmiss)has made that work. Oh well:) does the scrambling effect gets addressed by the fact when two or more variables contribute to making the 1st non missing value, there is bound to be a mismatch with the count of non missing values in the array

hashman · Posted 09-13-2019 12:30 PM

@novinosrin:

You've nailed it!

In sum, I think that the simplest and most reliable scheme for this task is:

Find the first non-missing value as fast as possible - perhaps COALESCEC is best for it.
Find its index using WHICHC.
Starting from the next higher index, examine the rest of non-missing values and stop if a mismatch is found.

This way, it is a O(N) problem. What I wish we had is the a modifier argument for WHICHC which could be set to "K" to indicate we are looking for the index of the first value that is not in the list (in this case, the list would contain just " ").

Kind regards

Paul D.

supp · Posted 09-12-2019 09:49 AM

Paul, your bring up excellent points, thank you for the thoughtful response! First, thanks for pointing out the scrambling effect. I have already updated my solution to use catx in replace of cats. I don't think it would have been an issue for my application, but there really is no reason not to do it.

Can you explain this line of code?

unequal_flag = ^^ unequal_flag ;

I like your solution. I also like the coalescec() approach to setting a variable equal to the first non-missing value of the array. I modified your solution as such:

data have ;                      
  input id (V1-V3) ($) ;      
  cards ;                        
1 .  AB AB                        
2 AA AA AA                        
3 AA .  BB                        
4 .  AA BB                        
5 AA BB CC                        
6 AA AA .                        
7 .  AA .                        
8 .  .  .                        
9 AB A  B
;                                
run ;                            
                                 
data want (drop = _:) ;          
  set have ;                      
  array vv v: ;
  _k = coalescec(of vv(*)) ;
  do over vv ;                    
    if cmiss (vv) then continue ;
    if vv ne _k then do ;    
      unequal_flag = 1 ;          
      leave ;                    
    end ;                        
  end ;                          
  unequal_flag = ^^ unequal_flag ;
run ;

I think it achieves the same thing with less code?!?

hashman · Posted 09-12-2019 10:44 PM

@supp:

1. After the DO loop, the flag is either 1 or missing. The double NOT (i.e. ^^) turns 1 into 1 but it turns the missing into 0, quod erat faciendum.

2. COALCESCEC to find the first nonmissing value is a great idea efficiency-wise, as it runs much faster than examining one item at a time in a loop till a nonmissing value is found - especially for a large array where the first nonmissing is closer to the end. However, to take advantage of this efficiency, the ensuing DO loop should begin with the index up by 1 from where the first nonmissing is found, while DO OVER starts at 1. To change that, (a) the implicit array has to be recoded as explicit and (b) the WHICHC function should be used to find the index of the first nonmissing item. Also, COALCESCEC returns a $200 result if not force-sized, so assuming that all V-variables have the same length, it can be preset to the length of the first one.

With the above in mind, the program takes on the form:

data have ;                                                                                                                                                                                                                                                     
  input id (V1-V3) ($) ;                                                                                                                                                                                                                                        
  cards ;                                                                                                                                                                                                                                                       
0 .  .  AB                                                                                                                                                                                                                                                      
1 .  AB AB                                                                                                                                                                                                                                                      
2 AA AA AA                                                                                                                                                                                                                                                      
3 AA .  BB                                                                                                                                                                                                                                                      
4 .  AA BB                                                                                                                                                                                                                                                      
5 AA BB CC                                                                                                                                                                                                                                                      
6 AA AA .                                                                                                                                                                                                                                                       
7 .  AA .                                                                                                                                                                                                                                                       
8 .  .  .                                                                                                                                                                                                                                                       
9 AB A  B                                                                                                                                                                                                                                                       
;                                                                                                                                                                                                                                                               
run ;                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                
data want (drop = _:) ;                                                                                                                                                                                                                                         
  set have ;                                                                                                                                                                                                                                                    
  array vv[*] v: ;                                                                                                                                                                                                                                              
  _k = v1 ; * set length ;                                                                                                                                                                                                                                      
  _k = coalescec (of vv[*]) ;                                                                                                                                                                                                                                   
  _x = whichc (_k, of vv[*]) + 1 ;                                                                                                                                                                                                                              
  unequal_flag = 0 ;                                                                                                                                                                                                                                            
  do _i = _x to dim (vv) ;                                                                                                                                                                                                                                      
    if cmiss (vv[_i]) or vv[_i] = _k then continue ;                                                                                                                                                                                                            
    unequal_flag = 1 ;                                                                                                                                                                                                                                          
    leave ;                                                                                                                                                                                                                                                     
  end ;                                                                                                                                                                                                                                                         
run ;

Kind regards

Paul D.

supp · Posted 09-13-2019 08:15 AM

@hashman,Thanks for explaining the double not and the revised code.

novinosrin · Posted 09-11-2019 09:44 AM

data have;
infile datalines truncover;
input id $ var1 $ var2 $ var3 $;
datalines;
1 . A A
2 A A A
3 A . B
4 . A B
5 A B C
6 A A .
;

data want;
set have;
array v var:;
array temp(99999)$ _temporary_;
do _n_=1 to dim(v);
 if dim(temp)-cmiss(of temp(*))>0 then if  v(_n_) not in temp then do;
 flag=0;
 return;
 end;
 if v(_n_)>' ' then temp(_n_)=v(_n_);
end;
Flag=1;
run;

Best method for comparing multiple variables

Re: Best method for comparing multiple variables

Re: Best method for comparing multiple variables

Re: Best method for comparing multiple variables

Re: Best method for comparing multiple variables

Re: Best method for comparing multiple variables

Re: Best method for comparing multiple variables

Re: Best method for comparing multiple variables

Re: Best method for comparing multiple variables

Re: Best method for comparing multiple variables

Re: Best method for comparing multiple variables

Re: Best method for comparing multiple variables

Re: Best method for comparing multiple variables

Re: Best method for comparing multiple variables

Re: Best method for comparing multiple variables

Re: Best method for comparing multiple variables

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away