BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
supp
Pyrite | Level 9

Hello SAS community. I am hopeful someone might have a better method for comparing three character variables within the same observations such that it identifies when two or more of the variables are different. Also, a null value in one of the variables should not be seen as different.

 

hypothetical data:

 

data have;
infile datalines truncover;
input id $ var1 $ var2 $ var3 $;
datalines;
1 . A A
2 A A A
3 A . B
4 . A B
5 A B C
6 A A .
;

 

 

Data want would return ID's 3, 4 and 5 (there is at least one difference between the non missing values). It would not return ID's 1,2 or 6 (all non missing values are the same).

 

I tried a series of nested if-then do statements. For example:

 

data want;
set have;
if var1 ne '.' then do;
  if var2 ne '.' then do;
    if var1 ne var2;
  end;
end;
run;

I might have gotten a variation (not exactly like the example above) to maybe work. Given the volume of data it is hard to be confident in my solution. I believe there is a much more elegant solution to this problem, any suggestions would be appreciated.

 

1 ACCEPTED SOLUTION

Accepted Solutions
hashman
Ammonite | Level 13

@supp:

1. After the DO loop, the flag is either 1 or missing. The double NOT (i.e. ^^) turns 1 into 1 but it turns the missing into 0, quod erat faciendum.  

 

2. COALCESCEC to find the first nonmissing value is a great idea efficiency-wise, as it runs much faster than examining one item at a time in a loop till a nonmissing value is found - especially for a large array where the first nonmissing is closer to the end. However, to take advantage of this efficiency, the ensuing DO loop should begin with the index up by 1 from where the first nonmissing is found, while DO OVER starts at 1. To change that, (a) the implicit array has to be recoded as explicit and (b) the WHICHC function should be used to find the index of the first nonmissing item. Also, COALCESCEC returns a $200 result if not force-sized, so assuming that all V-variables have the same length, it can be preset to the length of the first one. 

 

With the above in mind, the program takes on the form:

data have ;                                                                                                                                                                                                                                                     
  input id (V1-V3) ($) ;                                                                                                                                                                                                                                        
  cards ;                                                                                                                                                                                                                                                       
0 .  .  AB                                                                                                                                                                                                                                                      
1 .  AB AB                                                                                                                                                                                                                                                      
2 AA AA AA                                                                                                                                                                                                                                                      
3 AA .  BB                                                                                                                                                                                                                                                      
4 .  AA BB                                                                                                                                                                                                                                                      
5 AA BB CC                                                                                                                                                                                                                                                      
6 AA AA .                                                                                                                                                                                                                                                       
7 .  AA .                                                                                                                                                                                                                                                       
8 .  .  .                                                                                                                                                                                                                                                       
9 AB A  B                                                                                                                                                                                                                                                       
;                                                                                                                                                                                                                                                               
run ;                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                
data want (drop = _:) ;                                                                                                                                                                                                                                         
  set have ;                                                                                                                                                                                                                                                    
  array vv[*] v: ;                                                                                                                                                                                                                                              
  _k = v1 ; * set length ;                                                                                                                                                                                                                                      
  _k = coalescec (of vv[*]) ;                                                                                                                                                                                                                                   
  _x = whichc (_k, of vv[*]) + 1 ;                                                                                                                                                                                                                              
  unequal_flag = 0 ;                                                                                                                                                                                                                                            
  do _i = _x to dim (vv) ;                                                                                                                                                                                                                                      
    if cmiss (vv[_i]) or vv[_i] = _k then continue ;                                                                                                                                                                                                            
    unequal_flag = 1 ;                                                                                                                                                                                                                                          
    leave ;                                                                                                                                                                                                                                                     
  end ;                                                                                                                                                                                                                                                         
run ;                                                   

Kind regards

Paul D.

View solution in original post

24 REPLIES 24
novinosrin
Tourmaline | Level 20

data have;
infile datalines truncover;
input id $ var1 $ var2 $ var3 $;
datalines;
1 . A A
2 A A A
3 A . B
4 . A B
5 A B C
6 A A .
;
data want;
set have;
array v var:;
k=cats(of v(*));
if count(k,strip(coalescec(of v(*))))=dim(v)-cmiss(of v(*)) then flag=1;
else flag=0;
drop k;
run;
PaigeMiller
Diamond | Level 26

This is interesting logic, using COALESCEC in a way I would not have though of.

 

If the real problem allows variables VAR1-VAR3 to take on arbitrary length character strings instead of character strings that have only a single character, I don't think this works any more. Of course, this is me reading between the lines; the original question didn't say that character strings VAR1-VAR3 could be more than one character.

 

In the more general case, I think you would need some sort of ARRAY to loop over all VAR1-VARn variables, and if a mismatch is found the flag is zero; if a mismatch is not found then the flag is one.

--
Paige Miller
gamotte
Rhodochrosite | Level 12

Hello,

 

For arbitrary length strings, i think a slightly modified version of @novinosrin's program

should work :

 

if countw(catx(' ',of v(*)),strip(coalescec(of v(*))))=dim(v)-cmiss(of v(*)) then flag=1;
supp
Pyrite | Level 9
Can you elaborate on why the solution wouldn't work for character strings more than one character?
novinosrin
Tourmaline | Level 20

Hi @supp  Requesting you to please test with a small sample. I am intrigued too coz that solution seem not to have anything to do with length rather deals with value. However, I didn't test extensively, and I might later when I have some time. 

 

The logic is simple:

1. concatenate the contents of the array. Cats would have removed leading and trailing blanks

2. Identify the 1st non-missing value with coalescec within the contents of the array

3. dim(v)-cmiss(of v(*)) gives us the number of non-missing elements in the array

4. Logic: Check if the count of the 1st non-missing value is equal to the number of non-missing elements in the array with count function. If yes , the elements are the same, and if false obviously otherwise

5. Hence, i am not sure what has this got to do with length of any sort. Sometimes, i code too fast and I tend to miss out some intricacies, so test and let me know plz. 

supp
Pyrite | Level 9

Thanks for all the responses. Based on what people posted I read this paper to get a basic understanding of arrays:

https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/242-30.pdf

 

I applied a couple of the solutions presented here:

 

Method #1 - find first non missing variable then loop through all (non-missing) variables to check if they have a difference:

data want ;
  set have ;
  array var var1-var3 ;
  first_value=coalescec(of var[*]);
  flag=0;
  do _n_=1 to dim(var) while (not flag);
    if not missing(var[_n_]) then flag = first_value ne var[_n_];
  end;
run;

 

Method # 2 - Create a new variable with all variables concatenated together. Then count the number of times the first non missing value occurs in the concatenated variable. Compare this count to the total number of non missing variables. If the counts are different then there is a difference among the variables:

data want;
set have;
array v var:;
k=cats(of v(*));
if count(k,strip(coalescec(of v(*))))=dim(v)-cmiss(of v(*)) then flag=1;
else flag=0;
drop k;
run;

I applied both methods to my data, about 35,000 rows, and got the same observations returned. Method #2 seem to have slightly better performance. Less real time, user cpu time, system cpu time and memory used. For this reason method # 2 is my preferred solution.

hashman
Ammonite | Level 13

@novinosrin:

Very clever indeed.

However, I can't help but note that CATS in situations like this or similar is dangerous because of its scrambling effect. For example, your solution will readily break on something as simple as:

data have ;                 
  input id (var1-var3) ($) ;
  cards ;                   
1 AB A B                    
;                           
run ;                       

The obvious reason is that (AB A B) is scrambled to (ABAB), and any algorithm looking for something different from AB will say that all VAR values are equal. Of course, you can modify your code using CATX instead. But there's another caveat: The buffer of any feline function (and also others like REPEAT) is limited to 32767; and so if the sum of the non-blank lengths of the VAR: should exceed 32767, it can break the code because of the truncation. 

 

Methinks in this situation (and situations similar to this), a general solution devoid of the above pitfalls  is possible only via array processing following the logic:

(1) Find the first (or last) non-missing value.
(2) See if any subsequent (or preceding) non-missing value is different; and if so, break the loop.
 
For example:
data have ;                      
  input id (V1-V3) ($) ;      
  cards ;                        
1 .  AB AB                        
2 AA AA AA                        
3 AA .  BB                        
4 .  AA BB                        
5 AA BB CC                        
6 AA AA .                        
7 .  AA .                        
8 .  .  .                        
9 AB A  B
; run ; data want (drop = _:) ; set have ; array vv v: ; do over vv ; if cmiss (vv) then continue ; _n + 1 ; if _n = 1 then _v = vv ; else if vv ne _v then do ; unequal_flag = 1 ; leave ; end ; end ; unequal_flag = ^^ unequal_flag ; run ;
It's by far not as elegant as your offering, but it's not going to break on scrambling or insufficient buffer length.
 
Kind regards
Paul D.
novinosrin
Tourmaline | Level 20

Guru @hashman  Very interesting insight and rationale. Honestly the buffer thing never clicked to me. I guess the tendency of me treating each one like a fun video game rather a  real technical solution shows that I lack maturity. 🙂  Nonetheless, that specialist professional demeanor  is something I am gradually learning from you. 

 

PS  Though both of us live in the same time zone, I request you to kindly bear with  delays in me acknowledging your post as it seems I am missing out quite a bit on some threads while I am sleeping and you are wide awake as I noticed this while brushing my teeth this morning. Also you are aware I try to follow your posts in the ones I didn't participate, and indeed the value is priceless.  

novinosrin
Tourmaline | Level 20

Guru @hashman  Requesting your time and help if and when you can. Now i am a little confused as  the solution still seems to work on the sample you provided barring the buffer limitations which I understood squeaky clean. 

data have ;                 
  input id (v1-v3) ($) ;
  cards ;                   
1 AB A B                    
;                           
run ; 
data want;
set have;
array v v1-v3;
k=cats(of v(*));
if count(k,strip(coalescec(of v(*))))=dim(v)-cmiss(of v(*)) then flag=1;
else flag=0;
*drop k;
run;

proc print noobs;run;
id v1 v2 v3 k flag
1 AB A B ABAB 0

 

 


data have ; 
input id (V1-V3) ($) ; 
cards ; 
1 . AB AB 
2 AA AA AA 
3 AA . BB 
4 . AA BB 
5 AA BB CC 
6 AA AA . 
7 . AA . 
8 . . . 
9 AB A B
; 
run ;

data want;
set have;
array v v1-v3;
k=cats(of v(*));
if count(k,strip(coalescec(of v(*))))=dim(v)-cmiss(of v(*)) then flag=1;
else flag=0;
*drop k;
run;

proc print noobs;run;

 

id V1 V2 V3 k flag
1   AB AB ABAB 1
2 AA AA AA AAAAAA 1
3 AA   BB AABB 0
4   AA BB AABB 0
5 AA BB CC AABBCC 0
6 AA AA   AAAA 1
7   AA   AA 1
8         1
9 AB A B ABAB 0

 

Now, i am seriously confused how can that work right. It's not about right or wrong, I would like to know stuff works-mechanic here. Jeez! help me out

 

Addendum : EDITED: at 9:38 EDT Bridgeport time, I think i got it, it's the count of non missing's( dim-cmiss)has made that work. Oh well:) does the scrambling effect gets addressed by the fact when two or more variables contribute to making the 1st non missing value, there is bound to be a mismatch with the count of non missing values in the array

hashman
Ammonite | Level 13

@novinosrin

You've nailed it!

In sum, I think that the simplest and most reliable scheme for this task is:

  1. Find the first non-missing value as fast as possible - perhaps COALESCEC is best for it.
  2. Find its index using WHICHC.
  3. Starting from the next higher index, examine the rest of non-missing values and stop if a mismatch is found.

This way, it is a O(N) problem. What I wish we had is the a modifier argument for WHICHC which could be set to "K" to indicate we are looking for the index of the first value that is not in the list (in this case, the list would contain just " ").

 

Kind regards 

Paul D.

 

 

    

supp
Pyrite | Level 9

Paul, your bring up excellent points, thank you for the thoughtful response! First, thanks for pointing out the scrambling effect. I have already updated my solution to use catx in replace of cats. I don't think it would have been an issue for my application, but there really is no reason not to do it.

 

Can you explain this line of code?

  • unequal_flag = ^^ unequal_flag ;

 

I like your solution. I also like the coalescec() approach to setting a variable equal to the first non-missing value of the array. I modified your solution as such:

data have ;                      
  input id (V1-V3) ($) ;      
  cards ;                        
1 .  AB AB                        
2 AA AA AA                        
3 AA .  BB                        
4 .  AA BB                        
5 AA BB CC                        
6 AA AA .                        
7 .  AA .                        
8 .  .  .                        
9 AB A  B
;                                
run ;                            
                                 
data want (drop = _:) ;          
  set have ;                      
  array vv v: ;
  _k = coalescec(of vv(*)) ;
  do over vv ;                    
    if cmiss (vv) then continue ;
    if vv ne _k then do ;    
      unequal_flag = 1 ;          
      leave ;                    
    end ;                        
  end ;                          
  unequal_flag = ^^ unequal_flag ;
run ; 

I think it achieves the same thing with less code?!?

hashman
Ammonite | Level 13

@supp:

1. After the DO loop, the flag is either 1 or missing. The double NOT (i.e. ^^) turns 1 into 1 but it turns the missing into 0, quod erat faciendum.  

 

2. COALCESCEC to find the first nonmissing value is a great idea efficiency-wise, as it runs much faster than examining one item at a time in a loop till a nonmissing value is found - especially for a large array where the first nonmissing is closer to the end. However, to take advantage of this efficiency, the ensuing DO loop should begin with the index up by 1 from where the first nonmissing is found, while DO OVER starts at 1. To change that, (a) the implicit array has to be recoded as explicit and (b) the WHICHC function should be used to find the index of the first nonmissing item. Also, COALCESCEC returns a $200 result if not force-sized, so assuming that all V-variables have the same length, it can be preset to the length of the first one. 

 

With the above in mind, the program takes on the form:

data have ;                                                                                                                                                                                                                                                     
  input id (V1-V3) ($) ;                                                                                                                                                                                                                                        
  cards ;                                                                                                                                                                                                                                                       
0 .  .  AB                                                                                                                                                                                                                                                      
1 .  AB AB                                                                                                                                                                                                                                                      
2 AA AA AA                                                                                                                                                                                                                                                      
3 AA .  BB                                                                                                                                                                                                                                                      
4 .  AA BB                                                                                                                                                                                                                                                      
5 AA BB CC                                                                                                                                                                                                                                                      
6 AA AA .                                                                                                                                                                                                                                                       
7 .  AA .                                                                                                                                                                                                                                                       
8 .  .  .                                                                                                                                                                                                                                                       
9 AB A  B                                                                                                                                                                                                                                                       
;                                                                                                                                                                                                                                                               
run ;                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                
data want (drop = _:) ;                                                                                                                                                                                                                                         
  set have ;                                                                                                                                                                                                                                                    
  array vv[*] v: ;                                                                                                                                                                                                                                              
  _k = v1 ; * set length ;                                                                                                                                                                                                                                      
  _k = coalescec (of vv[*]) ;                                                                                                                                                                                                                                   
  _x = whichc (_k, of vv[*]) + 1 ;                                                                                                                                                                                                                              
  unequal_flag = 0 ;                                                                                                                                                                                                                                            
  do _i = _x to dim (vv) ;                                                                                                                                                                                                                                      
    if cmiss (vv[_i]) or vv[_i] = _k then continue ;                                                                                                                                                                                                            
    unequal_flag = 1 ;                                                                                                                                                                                                                                          
    leave ;                                                                                                                                                                                                                                                     
  end ;                                                                                                                                                                                                                                                         
run ;                                                   

Kind regards

Paul D.

supp
Pyrite | Level 9
@hashman,Thanks for explaining the double not and the revised code.
novinosrin
Tourmaline | Level 20
data have;
infile datalines truncover;
input id $ var1 $ var2 $ var3 $;
datalines;
1 . A A
2 A A A
3 A . B
4 . A B
5 A B C
6 A A .
;

data want;
set have;
array v var:;
array temp(99999)$ _temporary_;
do _n_=1 to dim(v);
 if dim(temp)-cmiss(of temp(*))>0 then if  v(_n_) not in temp then do;
 flag=0;
 return;
 end;
 if v(_n_)>' ' then temp(_n_)=v(_n_);
end;
Flag=1;
run;

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 24 replies
  • 12944 views
  • 19 likes
  • 10 in conversation