Help using Base SAS procedures

Comparing two string variables and outputting the difference

Reply
Contributor
Posts: 49

Comparing two string variables and outputting the difference

Hello there. Some help if I may.

I have two variables with addresses in (no spaces) some contain counties and some don't and i'm trying to extract the differences with no solution.

I have even tried using proc compare by putting the variables I want to compare into two datasets but to no avail

example

Variable1                                     Variable2

1REDSTREETMANCHESTER     1REDSTREETGORTONMANCHESTER

Im trying to output GORTON to another variable but cant seem to get it.

Any help greatly appreciated.

Super User
Super User
Posts: 7,401

Re: Comparing two string variables and outputting the difference

TBH I don't see how you are going to achieve that.  There is no discernible delimiter, length, or algorithm which can identify street names from places from Cities.  Perhaps if you had the tech behind Google maps you might be able to do some good guesses, but even then you would have similar place names/street names in many places.

Go back to where you got the data from and see if there is a delimited version.

Contributor
Posts: 49

Re: Comparing two string variables and outputting the difference

I'm not too interested in the content just the comparison of the strings

so variable 1 could be

"thebigbad"

and variable 2

"thebigbadwolf"

I would want the output to be   "wolf"

Is that possible

Super User
Super User
Posts: 7,401

Re: Comparing two string variables and outputting the difference

Oh, I see.  Then tranwrd would probably be helpful here:

data want;

     set have;

     length result $2000;

     if length(strip(var1)) > length(strip(var2)) then result=tranwrd(var1,var2,"");

     else tranwrd(var2,var1,"");

run;

So first check which string is longer, then from the longest string remove any occurrences of the shorter string.

Respected Advisor
Posts: 3,887

Re: Comparing two string variables and outputting the difference

Below code won't give you 100% what you're after. You need to test it with your real data and then decide whether you only will remove sub-strings where at least the first n-characters match. For this the generated regex would then need to look something like 1RED?S?....

The code will be quite resource hungry as for every single iteration of the data step two regular expressions need to be compiled.

I hope someone else comes up with another more straightforward approach. I don't believe that a simple tranwrd() will cut it though.

data have /*(drop=_Smiley Happy*/;

  length variable1 variable2 result _short _long $ 100 _prx_string $ 200;

  Variable1='1REDSTREETMANCHESTER';

  Variable2='1REDSTREETGORTONMANCHESTER';

  if lengthn(variable1) < lengthn(variable2) then

    do;

      _short=variable1;

      _long =variable2;

    end;

  else

    do;

      _short=variable2;

      _long =variable1;

    end;

  /* remove string from beginning */

  _prx_string=prxchange('s/(.)/\1?/o',-1,strip(_short));

  _prx_string=cats("s/",_prx_string,"//oi");

  _prx_id=prxparse(strip(_prx_string));

  result=prxchange(_prx_id,1,strip(_long));

  /* free up memory */

  call prxfree(_prx_id);

  /* remove string from end */

  _prx_string=prxchange('s/(.)/\1?/o',-1,reverse(strip(_short)));

  _prx_string=cats("s/",_prx_string,"//oi");

  _prx_id=prxparse(strip(_prx_string));

  result=prxchange(_prx_id,1,reverse(strip(result)));

  /* free up memory */

  call prxfree(_prx_id);

  /* get result string into right order */

  result=reverse(strip(result));

  output;

  stop;

run;

SAS Super FREQ
Posts: 683

Re: Comparing two string variables and outputting the difference

The sample code below assumes that the beginning of the two strings is the same.

Maybe this will help or at least provide a start.

data want;
  infile cards dlm=",";
 
input
    var1 :
$64.
    var2 :
$64.
  ;
  length result $ 64;
  loopLen = max( length(var1), length(var2) );

 
*
  * compare varying length if there is a nonmatch
  * end loop
  *;

 
do i = 1 to loopLen;
    if  substr(var1, 1, i) ne  substr(var2, 1, i) then do;
     
leave;
   
end;
 
end;

 
*
  * fill remaining vars
  *;

  remain1 = substr(var1, i);
  remain2 = substr(var2, i);
  lr1 = lengthn(remain1);
  lr2 = lengthn(remain2);

 
if  lr1 > 1 and lr1 < lr2 then do;
    result = tranwrd(remain2, strip(remain1),
"");
  end;
 
else do;
    result = remain2;
 
end;
  cards;
thebigbad,thebigbadwolf
1REDSTREETMANCHESTER,1REDSTREETGORTONMANCHESTER
ab1234,abcöäü1234
abc,abc
;
Super User
Posts: 9,676

Re: Comparing two string variables and outputting the difference

I want know if there is

Variable1                                    Variable2

1REDSTREETMANCHESTER    1REDGORTSTREETMANCHESTERON

If not.

Xia Keshan


data have;
input var1 : $40. var2 : $40.;
cards;
1REDSTREETMANCHESTER     1REDSTREETGORTONMANCHESTER
1REDSTREETMANCHESTER     GORTON1REDSTREETMANCHESTER
1REDSTREETMANCHESTER     1REDSTREETMANCHESTERGORTON

;
run;
data want;
 set have;
 length var3 $ 40;
 position=find(var2,strip(var1));
 if position then do;
   if position eq 1 then var3=substr(var2,length(var1)+1);
    else  var3=substr(var2,1,position-1) ;
 end;
 if not position then do;
   do i=2 to length(var1);
    one=find(var2,strip(substr(var1,1,i-1))) ;
    two=find(var2,strip(substr(var1,i))) ; 
    if one and two then var3=substr(var2,i,two-i);
   end;
 end;
run;
Respected Advisor
Posts: 4,644

Re: Comparing two string variables and outputting the difference

Similar to Bruno's proposal but works also when difference is at beginning:

data have;

length str1 str2 $64;

input str1 str2;

datalines;

thebigbad thebigbadwolf

1REDSTREETMANCHESTER 1REDSTREETGORTONMANCHESTER

ab1234 abcöäü1234

abc abc

abc wolfabc

;

data want;

length short long diff $64;

set have;

if length(str1)<length(str2) then do;

    short=str1; long=str2; end;

else do;

    short=str2; long=str1; end;

do i = 1 to length(short);

    if char(short, i) ne char(long, i) then leave; end;

do j = 1 to length(short);

    if char(short, length(short)-j+1) ne char(long, length(long)-j+1) then leave; end;

if i <= length(long) - j + 1 then diff = substr(long, i, length(long) - j + 1 - i + 1);

drop short long i j;

run;

proc print data=want noobs; var str1 str2 diff; run;

            str1                    str2                           diff

            thebigbad               thebigbadwolf                 wolf

            1REDSTREETMANCHESTER    1REDSTREETGORTONMANCHESTER    GORTON

            ab1234                  abcöäü1234                    cöäü

            abc                     abc

            abc                     wolfabc                       wolf

PG

PG
Contributor
Posts: 22

Re: Comparing two string variables and outputting the difference

Hi ,

How to remove the difference from the variable2?

Ex:

var1="world cup"

var2="Woorlld ccup"

i want to make var2 as var1

thanks....

Regards

Karthik.

Super User
Super User
Posts: 7,401

Re: Comparing two string variables and outputting the difference

How do you know "Woorld ccup" = "world cup"?  Could it not be "world coup"?  This is called data cleaning.  Produce a list of distinct values, then clean those which do not match.  E.g.:

data coded;

     category="world cup"; values="Woorlld cup"; output;

                                         values="world cup"; output;

                                         values="Cup of Worlds"; output;

run;

proc sql;

     select     A.VAR1,

                   A.VAR2,

                    B.CATEGORY

     from       HAVE A

     left join CODED B

     on            A.VAR2=VALUES;

quit;

Respected Advisor
Posts: 4,644

Re: Comparing two string variables and outputting the difference

For non trivial spelling problems like that, look at the spelling distance functions in SAS : COMPGED, COMPLEV, SPEDIS, and also SOUNDEX.

PG

PG
Respected Advisor
Posts: 3,887

Re: Comparing two string variables and outputting the difference

And here another coding option using the compare() function.

data have;

  input var1 : $40. var2 : $40.;

  length result $40;

  if lengthn(var1) < lengthn(var2) then

    do;

      _short=var1;

      _long =var2;

    end;

  else

    do;

      _short=var2;

      _long =var1;

    end;

  _pos=compare(_short, _long,'i:');

  result=substrn(_long,abs(_pos));

 

  result=reverse(strip(result));

  _short=reverse(strip(_short));

 

  _pos=compare(_short,result,'i:');

  result=substrn(result,abs(_pos));

 

  result=reverse(strip(result));

 

  cards;

1REDSTREETMANCHESTER     1REDSTREETGORTONMANCHESTER

1REDSTREETGORTONMANCHESTER     1REDSTREETMANCHESTER

1REDSTREETMANCHESTER     GORTON1REDSTREETMANCHESTER

1REDSTREETMANCHESTER     1REDSTREETMANCHESTERGORTON

;

run;

PROC Star
Posts: 7,362

Re: Comparing two string variables and outputting the difference

: First, thanks for teaching me about a function I wasn't aware of. I never originally responded to this question, as I had misunderstood the requirements and was looking for a solution that identified everything that was different between two strings.

However, that was too complicated without the compare function, and I quickly realized that I had misunderstood the requirements.

Regardless, the following accomplishes both the requirements and my original interpretation of the problem. It only looks for characters in the 2nd string that aren't present in the 1st string, as that was the OP's need. However, like you did, it wouldn't be hard at all to change the code to be able to go both ways:

data have;

  input var1 : $40. var2 : $40.;

  cards;

1REDSTREETMANCHESTER     1REDSTREETGORTONMANCHESTER

1REDSTREETMANCHESTER     1REDSTREETxxMANCHESTER

1REDSTREETMANCHESTER     GORTON1REDSTREETMANCHESTER

1REDSTREETMANCHESTER     1REDSTREETMANCHESTERGORTON

1REDSTREETMANCHESTER     1REDGSTROEETMANRCHETOSTERN

1REDSTREETMANCHESTER     1REDSTREETMANCHESTER

;

data want (drop=_Smiley Happy;

  set have;

  length extra $40;

  _right=var2;

  do while (compare(var1, _right,'il') ne 0);

    if abs(compare(var1, _right,'il')) eq 1 then do;

      extra=catt(extra,substr(_right,1,1));

      _right=substr(_right,2);

    end;

    else do;

      extra=catt(extra,substr(_right,abs(compare(var1, _right,'il')),1));

      _right=catt(substr(_right,1,abs(compare(var1, _right,'il'))-1),

                  substr(_right,abs(compare(var1, _right,'il'))+1));

    end;

  end;

run;

Respected Advisor
Posts: 3,887

Re: Comparing two string variables and outputting the difference

I'm so spoiled by the richness of the SAS language that I just couldn't believe that there is no function which does the job for me. So I "scanned" through the docu, got reminded of a lot of good stuff and then also found compare().

Now I hope that future SAS releases will spoil me even further and add "startpos" as a parameter to allow for a comparison in reverse order if set to "-1".

Ask a Question
Discussion stats
  • 13 replies
  • 585 views
  • 2 likes
  • 8 in conversation