Re: Comparing two string variables and outputting the difference

Stretlow · Posted 11-27-2014 10:26 AM

Hello there. Some help if I may.

I have two variables with addresses in (no spaces) some contain counties and some don't and i'm trying to extract the differences with no solution.

I have even tried using proc compare by putting the variables I want to compare into two datasets but to no avail

example

Variable1 Variable2

1REDSTREETMANCHESTER 1REDSTREETGORTONMANCHESTER

Im trying to output GORTON to another variable but cant seem to get it.

Any help greatly appreciated.

RW9 · Posted 11-27-2014 10:30 AM

TBH I don't see how you are going to achieve that. There is no discernible delimiter, length, or algorithm which can identify street names from places from Cities. Perhaps if you had the tech behind Google maps you might be able to do some good guesses, but even then you would have similar place names/street names in many places.

Go back to where you got the data from and see if there is a delimited version.

Stretlow · Posted 11-27-2014 10:36 AM

I'm not too interested in the content just the comparison of the strings

so variable 1 could be

"thebigbad"

and variable 2

"thebigbadwolf"

I would want the output to be "wolf"

Is that possible

RW9 · Posted 11-27-2014 01:30 PM

Oh, I see. Then tranwrd would probably be helpful here:

data want;

set have;

length result $2000;

if length(strip(var1)) > length(strip(var2)) then result=tranwrd(var1,var2,"");

else tranwrd(var2,var1,"");

run;

So first check which string is longer, then from the longest string remove any occurrences of the shorter string.

Patrick · Posted 11-27-2014 04:30 PM

Below code won't give you 100% what you're after. You need to test it with your real data and then decide whether you only will remove sub-strings where at least the first n-characters match. For this the generated regex would then need to look something like 1RED?S?....

The code will be quite resource hungry as for every single iteration of the data step two regular expressions need to be compiled.

I hope someone else comes up with another more straightforward approach. I don't believe that a simple tranwrd() will cut it though.

data have /*(drop=_:)*/;

length variable1 variable2 result _short _long $ 100 _prx_string $ 200;

Variable1='1REDSTREETMANCHESTER';

Variable2='1REDSTREETGORTONMANCHESTER';

if lengthn(variable1) < lengthn(variable2) then

do;

_short=variable1;

_long =variable2;

end;

else

do;

_short=variable2;

_long =variable1;

end;

/* remove string from beginning */

_prx_string=prxchange('s/(.)/\1?/o',-1,strip(_short));

_prx_string=cats("s/",_prx_string,"//oi");

_prx_id=prxparse(strip(_prx_string));

result=prxchange(_prx_id,1,strip(_long));

/* free up memory */

call prxfree(_prx_id);

/* remove string from end */

_prx_string=prxchange('s/(.)/\1?/o',-1,reverse(strip(_short)));

_prx_string=cats("s/",_prx_string,"//oi");

_prx_id=prxparse(strip(_prx_string));

result=prxchange(_prx_id,1,reverse(strip(result)));

/* free up memory */

call prxfree(_prx_id);

/* get result string into right order */

result=reverse(strip(result));

output;

stop;

run;

BrunoMueller · Posted 11-27-2014 04:36 PM

The sample code below assumes that the beginning of the two strings is the same.

Maybe this will help or at least provide a start.

data want;
  infile cards dlm=","; 
  input
    var1 : $64.
    var2 : $64.
  ;
  length result $ 64; 
  loopLen = max( length(var1), length(var2) );

  *
  * compare varying length if there is a nonmatch
  * end loop
  *; 
  do i = 1 to loopLen;
    if  substr(var1, 1, i) ne  substr(var2, 1, i) then do; 
      leave; 
    end; 
  end; 

  *
  * fill remaining vars
  *;
  remain1 = substr(var1, i);
  remain2 = substr(var2, i);
  lr1 = lengthn(remain1);
  lr2 = lengthn(remain2);

  if  lr1 > 1 and lr1 < lr2 then do; 
    result = tranwrd(remain2, strip(remain1), "");
  end; 
  else do; 
    result = remain2;
  end; 
  cards; 
thebigbad,thebigbadwolf
1REDSTREETMANCHESTER,1REDSTREETGORTONMANCHESTER
ab1234,abcöäü1234
abc,abc
;

Ksharp · Posted 11-28-2014 10:55 AM

I want know if there is

Variable1 Variable2

1REDSTREETMANCHESTER 1REDGORTSTREETMANCHESTERON

If not.

Xia Keshan


data have;
input var1 : $40. var2 : $40.;
cards;
1REDSTREETMANCHESTER     1REDSTREETGORTONMANCHESTER
1REDSTREETMANCHESTER     GORTON1REDSTREETMANCHESTER
1REDSTREETMANCHESTER     1REDSTREETMANCHESTERGORTON

;
run;
data want;
 set have;
 length var3 $ 40;
 position=find(var2,strip(var1));
 if position then do;
   if position eq 1 then var3=substr(var2,length(var1)+1);
    else  var3=substr(var2,1,position-1) ;
 end;
 if not position then do;
   do i=2 to length(var1);
    one=find(var2,strip(substr(var1,1,i-1))) ;
    two=find(var2,strip(substr(var1,i))) ; 
    if one and two then var3=substr(var2,i,two-i);
   end;
 end;
run;

PGStats · Posted 11-28-2014 12:10 PM

Similar to Bruno's proposal but works also when difference is at beginning:

data have;

length str1 str2 $64;

input str1 str2;

datalines;

thebigbad thebigbadwolf

1REDSTREETMANCHESTER 1REDSTREETGORTONMANCHESTER

ab1234 abcöäü1234

abc abc

abc wolfabc

;

data want;

length short long diff $64;

set have;

if length(str1)<length(str2) then do;

short=str1; long=str2; end;

else do;

short=str2; long=str1; end;

do i = 1 to length(short);

if char(short, i) ne char(long, i) then leave; end;

do j = 1 to length(short);

if char(short, length(short)-j+1) ne char(long, length(long)-j+1) then leave; end;

if i <= length(long) - j + 1 then diff = substr(long, i, length(long) - j + 1 - i + 1);

drop short long i j;

run;

proc print data=want noobs; var str1 str2 diff; run;

str1 str2 diff

thebigbad thebigbadwolf wolf

1REDSTREETMANCHESTER 1REDSTREETGORTONMANCHESTER GORTON

ab1234 abcöäü1234 cöäü

abc abc

abc wolfabc wolf

PG

KarthikSrivasthav · Posted 12-05-2014 09:38 AM

Hi ,

How to remove the difference from the variable2?

Ex:

var1="world cup"

var2="Woorlld ccup"

i want to make var2 as var1

thanks....

Regards

Karthik.

RW9 · Posted 12-05-2014 10:05 AM

How do you know "Woorld ccup" = "world cup"? Could it not be "world coup"? This is called data cleaning. Produce a list of distinct values, then clean those which do not match. E.g.:

data coded;

category="world cup"; values="Woorlld cup"; output;

values="world cup"; output;

values="Cup of Worlds"; output;

run;

proc sql;

select A.VAR1,

A.VAR2,

B.CATEGORY

from HAVE A

left join CODED B

on A.VAR2=VALUES;

quit;

PGStats · Posted 12-05-2014 10:55 AM

For non trivial spelling problems like that, look at the spelling distance functions in SAS : COMPGED, COMPLEV, SPEDIS, and also SOUNDEX.

PG

Patrick · Posted 11-28-2014 11:05 PM

And here another coding option using the compare() function.

data have;

input var1 : $40. var2 : $40.;

length result $40;

if lengthn(var1) < lengthn(var2) then

do;

_short=var1;

_long =var2;

end;

else

do;

_short=var2;

_long =var1;

end;

_pos=compare(_short, _long,'i:');

result=substrn(_long,abs(_pos));

result=reverse(strip(result));

_short=reverse(strip(_short));

_pos=compare(_short,result,'i:');

result=substrn(result,abs(_pos));

result=reverse(strip(result));

cards;

1REDSTREETMANCHESTER 1REDSTREETGORTONMANCHESTER

1REDSTREETGORTONMANCHESTER 1REDSTREETMANCHESTER

1REDSTREETMANCHESTER GORTON1REDSTREETMANCHESTER

1REDSTREETMANCHESTER 1REDSTREETMANCHESTERGORTON

;

run;

art297 · Posted 12-05-2014 02:16 PM

: First, thanks for teaching me about a function I wasn't aware of. I never originally responded to this question, as I had misunderstood the requirements and was looking for a solution that identified everything that was different between two strings.

However, that was too complicated without the compare function, and I quickly realized that I had misunderstood the requirements.

Regardless, the following accomplishes both the requirements and my original interpretation of the problem. It only looks for characters in the 2nd string that aren't present in the 1st string, as that was the OP's need. However, like you did, it wouldn't be hard at all to change the code to be able to go both ways:

data have;

input var1 : $40. var2 : $40.;

cards;

1REDSTREETMANCHESTER 1REDSTREETGORTONMANCHESTER

1REDSTREETMANCHESTER 1REDSTREETxxMANCHESTER

1REDSTREETMANCHESTER GORTON1REDSTREETMANCHESTER

1REDSTREETMANCHESTER 1REDSTREETMANCHESTERGORTON

1REDSTREETMANCHESTER 1REDGSTROEETMANRCHETOSTERN

1REDSTREETMANCHESTER 1REDSTREETMANCHESTER

;

data want (drop=_:);

set have;

length extra $40;

_right=var2;

do while (compare(var1, _right,'il') ne 0);

if abs(compare(var1, _right,'il')) eq 1 then do;

extra=catt(extra,substr(_right,1,1));

_right=substr(_right,2);

end;

else do;

extra=catt(extra,substr(_right,abs(compare(var1, _right,'il')),1));

_right=catt(substr(_right,1,abs(compare(var1, _right,'il'))-1),

substr(_right,abs(compare(var1, _right,'il'))+1));

end;

run;

Patrick · Posted 12-05-2014 02:37 PM

I'm so spoiled by the richness of the SAS language that I just couldn't believe that there is no function which does the job for me. So I "scanned" through the docu, got reminded of a lot of good stuff and then also found compare().

Now I hope that future SAS releases will spoil me even further and add "startpos" as a parameter to allow for a comparison in reverse order if set to "-1".

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!