Solved: Re: Best method for comparing multiple variables - Page 2

novinosrin · Posted 09-11-2019 10:05 AM

/*Or COUNTW is Simpler*/

data want;
set have;
array v var:;
_k=coalescec(of v(*));
_n=dim(v)-cmiss(of v(*));
_k1=catx(' ',of v(*));
Flag=_n=countw(_k1,strip(_k));
drop _:;
run;

PaigeMiller · Posted 09-11-2019 09:47 AM

data WANT;
set have;
array v var:;
flag=1;
prev_non_missing=v(1);
do i=2 to dim(v);
	if not missing(v(i)) and not missing(prev_non_Missing) and v(i)^=v(i-1) then do;
		flag=0;
		leave;
	end;
	else if not missing(prev_non_missing) and missing(v(i)) then v(i)=prev_non_missing;
	else if missing(prev_non_missing) and not missing(v(i)) then prev_non_missing=v(i);
end;
drop i;
run;

--
Paige Miller

Tom · Posted 09-11-2019 09:48 AM

You can just take the first non-missing value and then test to see if any other non-missing values exist.

data have;
  infile datalines truncover;
  input id $ var1 $ var2 $ var3 $;
datalines;
1 . A A
2 A A A
3 A . B
4 . A B
5 A B C
6 A A .
7 . . .
;

data want ;
  set have ;
  array var var1-var3 ;
  first_value=coalescec(of var[*]);
  flag=0;
  do _n_=1 to dim(var) while (not flag);
    if not missing(var[_n_]) then flag = first_value ne var[_n_];
  end;
run;

proc print;
run;

                                     first_
Obs    id    var1    var2    var3    value     flag

 1     1              A       A        A         0
 2     2      A       A       A        A         0
 3     3      A               B        A         1
 4     4              A       B        A         1
 5     5      A       B       C        A         1
 6     6      A       A                A         0
 7     7                                         0

PeterClemmensen · Posted 09-11-2019 09:50 AM

data want;
    set have;
    call sortc(of var:);
    flag=ifn(length(prxchange('s/(.)\1+/$1/', -1, cats(of var:))) > 1, 1, 0);
run;

The DATA to DATA Step Macro
Blog: SASnrd

Tom · Posted 09-11-2019 09:53 AM

Note that using CALL SORTC() will change the order of the original values.

Also I don't think that method works when the strings are longer than a single character.

Try setting the values to the same multiple character values and see what happens.

Also what value is the use of the IFN() doing in that code?

It looks like you are using it to convert the results of the test for greater than one. So you are using it to convert 1 to 1 and 0 to 0.

PeterClemmensen · Posted 09-11-2019 10:13 AM

Ah yes. Once again thank your for the sharp eye @Tom.

The code should rather be

data want;
    set have;
    call sortc(of var:);
    flag=length(prxchange('s/(.)\1+/$1/', -1, cats(of var:))) > 1;
    set have;
run;

This way the order of var1-var3 will be intact in want and the flag is set correctly. Don't need the ifn function.

Of course, this approach will not work if the variables are not single letters no 🙂

The DATA to DATA Step Macro
Blog: SASnrd

PeterClemmensen · Posted 09-11-2019 11:00 AM

It is easily extendable to handle more than one-character strings though

data have;
  infile datalines truncover;
  input id $ var1 $ var2 $ var3 $;
datalines;
1 .  AB AB
2 AB AB AB
3 AB . BA 
4 .  AB BA
5 AB BA CD
6 AB AB . 
7 .  .  . 
;

data want;
    set have;
    call sortc(of var:);
    flag=countw(prxchange('s/\b(\w+)(?:\W+\1\b)+/$1/', -1, catx(' ', of var:)), ' ') > 1;
    set have;
run;

The DATA to DATA Step Macro
Blog: SASnrd

Meghna14 · Posted 09-11-2019 10:15 AM

data have;
infile datalines truncover;
input id $ var1 $ var2 $ var3 $;

datalines;
1 . A A
2 A A A
3 A . B
4 . A B
5 A B C
6 A A .
;
run;

data want;
set have;
array x {3} var1 var2 var3 ;
do i= 1 to 1;
if i ne . then do;
if x{i} ne x{i+1} ne x{i+2} then output;
end;end;
run;

you might modify it as per the amount of variables you have to check .

ballardw · Posted 09-11-2019 10:38 AM

@supp wrote:

Hello SAS community. I am hopeful someone might have a better method for comparing three character variables within the same observations such that it identifies when two or more of the variables are different. Also, a null value in one of the variables should not be seen as different.

hypothetical data:
data have;
infile datalines truncover;
input id $ var1 $ var2 $ var3 $;
datalines;
1 . A A
2 A A A
3 A . B
4 . A B
5 A B C
6 A A .
;
Data want would return ID's 3, 4 and 5 (there is at least one difference between the non missing values). It would not return ID's 1,2 or 6 (all non missing values are the same).

I tried a series of nested if-then do statements. For example:
data want;
set have;
if var1 ne '.' then do;
  if var2 ne '.' then do;
    if var1 ne var2;
  end;
end;
run;
I might have gotten a variation (not exactly like the example above) to maybe work. Given the volume of data it is hard to be confident in my solution. I believe there is a much more elegant solution to this problem, any suggestions would be appreciated.

You are missing a couple pieces of potentially extremely critical information. Such as, are your character values single characters or longer? And are they something from a fixed list of values (you know what they are expected to be) or just sort of randomly entered text?

Also is this case sensitive? Is "A" equal to "a" for this purpose?

If you have values that are fixed and case sensitive then this approach of creating a custom informat to assign a numeric value will work as what you are in effect requesting is "is the range of values = 0 or not". The following code could be adjusted to case insensitive by adding (UPCASE) after the name of the informat in the Invalue statement of the format. The variable dif will be 1 when there is a difference and 0 otherwise. The numeric values assigned to the values aren't critical as long as each is different though if you have a largish number of possible values a sequential assignment might be easier with a CNTLIN data set for Proc Format.

proc format library=work;
invalue mychars
'A' = 1
'B' = 2
'C' = 3
;
run;

data have;
   infile datalines truncover;
   input id $ var1 $ var2 $ var3 $;
   array vc var1 - var3;
   array vn{3};
   do i=1 to dim(vc);
      vn[i]= input(vc[i],mychars.);
   end;
   drop I vn: ;
   dif = (range(of vn(*)) > 0);
datalines;
1 . A A
2 A A A
3 A . B
4 . A B
5 A B C
6 A A .
;

Note that if you have larger number of variables with the same value list this is easily extendable to more variables. Just increase the array definitions.

Ksharp · Posted 09-12-2019 08:54 AM

data have;
  infile datalines truncover;
  input id $ var1 $ var2 $ var3 $;
datalines;
1 . A A
2 A A A
3 A . B
4 . A B
5 A B C
6 A A .
7 . . .
;
data want;
 if _n_=1 then do;
   length k $ 80;
   declare hash h();
   h.definekey('k');
   h.definedone();
 end;
set have;
array x{*} $ var: ;
do i=1 to dim(x);
  if not missing(x{i}) then do;k=x{i};h.ref(); end;
end;
if h.num_items>1 ;
h.clear();
drop i k;
run;

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!