BookmarkSubscribeRSS Feed
Stretlow
Obsidian | Level 7

Hello there. Some help if I may.

I have two variables with addresses in (no spaces) some contain counties and some don't and i'm trying to extract the differences with no solution.

I have even tried using proc compare by putting the variables I want to compare into two datasets but to no avail

example

Variable1                                     Variable2

1REDSTREETMANCHESTER     1REDSTREETGORTONMANCHESTER

Im trying to output GORTON to another variable but cant seem to get it.

Any help greatly appreciated.

13 REPLIES 13
RW9
Diamond | Level 26 RW9
Diamond | Level 26

TBH I don't see how you are going to achieve that.  There is no discernible delimiter, length, or algorithm which can identify street names from places from Cities.  Perhaps if you had the tech behind Google maps you might be able to do some good guesses, but even then you would have similar place names/street names in many places.

Go back to where you got the data from and see if there is a delimited version.

Stretlow
Obsidian | Level 7

I'm not too interested in the content just the comparison of the strings

so variable 1 could be

"thebigbad"

and variable 2

"thebigbadwolf"

I would want the output to be   "wolf"

Is that possible

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Oh, I see.  Then tranwrd would probably be helpful here:

data want;

     set have;

     length result $2000;

     if length(strip(var1)) > length(strip(var2)) then result=tranwrd(var1,var2,"");

     else tranwrd(var2,var1,"");

run;

So first check which string is longer, then from the longest string remove any occurrences of the shorter string.

Patrick
Opal | Level 21

Below code won't give you 100% what you're after. You need to test it with your real data and then decide whether you only will remove sub-strings where at least the first n-characters match. For this the generated regex would then need to look something like 1RED?S?....

The code will be quite resource hungry as for every single iteration of the data step two regular expressions need to be compiled.

I hope someone else comes up with another more straightforward approach. I don't believe that a simple tranwrd() will cut it though.

data have /*(drop=_:)*/;

  length variable1 variable2 result _short _long $ 100 _prx_string $ 200;

  Variable1='1REDSTREETMANCHESTER';

  Variable2='1REDSTREETGORTONMANCHESTER';

  if lengthn(variable1) < lengthn(variable2) then

    do;

      _short=variable1;

      _long =variable2;

    end;

  else

    do;

      _short=variable2;

      _long =variable1;

    end;

  /* remove string from beginning */

  _prx_string=prxchange('s/(.)/\1?/o',-1,strip(_short));

  _prx_string=cats("s/",_prx_string,"//oi");

  _prx_id=prxparse(strip(_prx_string));

  result=prxchange(_prx_id,1,strip(_long));

  /* free up memory */

  call prxfree(_prx_id);

  /* remove string from end */

  _prx_string=prxchange('s/(.)/\1?/o',-1,reverse(strip(_short)));

  _prx_string=cats("s/",_prx_string,"//oi");

  _prx_id=prxparse(strip(_prx_string));

  result=prxchange(_prx_id,1,reverse(strip(result)));

  /* free up memory */

  call prxfree(_prx_id);

  /* get result string into right order */

  result=reverse(strip(result));

  output;

  stop;

run;

BrunoMueller
SAS Super FREQ

The sample code below assumes that the beginning of the two strings is the same.

Maybe this will help or at least provide a start.

data want;
  infile cards dlm=",";
 
input
    var1 :
$64.
    var2 :
$64.
  ;
  length result $ 64;
  loopLen = max( length(var1), length(var2) );

 
*
  * compare varying length if there is a nonmatch
  * end loop
  *;

 
do i = 1 to loopLen;
    if  substr(var1, 1, i) ne  substr(var2, 1, i) then do;
     
leave;
   
end;
 
end;

 
*
  * fill remaining vars
  *;

  remain1 = substr(var1, i);
  remain2 = substr(var2, i);
  lr1 = lengthn(remain1);
  lr2 = lengthn(remain2);

 
if  lr1 > 1 and lr1 < lr2 then do;
    result = tranwrd(remain2, strip(remain1),
"");
  end;
 
else do;
    result = remain2;
 
end;
  cards;
thebigbad,thebigbadwolf
1REDSTREETMANCHESTER,1REDSTREETGORTONMANCHESTER
ab1234,abcöäü1234
abc,abc
;
Ksharp
Super User

I want know if there is

Variable1                                    Variable2

1REDSTREETMANCHESTER    1REDGORTSTREETMANCHESTERON

If not.

Xia Keshan


data have;
input var1 : $40. var2 : $40.;
cards;
1REDSTREETMANCHESTER     1REDSTREETGORTONMANCHESTER
1REDSTREETMANCHESTER     GORTON1REDSTREETMANCHESTER
1REDSTREETMANCHESTER     1REDSTREETMANCHESTERGORTON

;
run;
data want;
 set have;
 length var3 $ 40;
 position=find(var2,strip(var1));
 if position then do;
   if position eq 1 then var3=substr(var2,length(var1)+1);
    else  var3=substr(var2,1,position-1) ;
 end;
 if not position then do;
   do i=2 to length(var1);
    one=find(var2,strip(substr(var1,1,i-1))) ;
    two=find(var2,strip(substr(var1,i))) ; 
    if one and two then var3=substr(var2,i,two-i);
   end;
 end;
run;
PGStats
Opal | Level 21

Similar to Bruno's proposal but works also when difference is at beginning:

data have;

length str1 str2 $64;

input str1 str2;

datalines;

thebigbad thebigbadwolf

1REDSTREETMANCHESTER 1REDSTREETGORTONMANCHESTER

ab1234 abcöäü1234

abc abc

abc wolfabc

;

data want;

length short long diff $64;

set have;

if length(str1)<length(str2) then do;

    short=str1; long=str2; end;

else do;

    short=str2; long=str1; end;

do i = 1 to length(short);

    if char(short, i) ne char(long, i) then leave; end;

do j = 1 to length(short);

    if char(short, length(short)-j+1) ne char(long, length(long)-j+1) then leave; end;

if i <= length(long) - j + 1 then diff = substr(long, i, length(long) - j + 1 - i + 1);

drop short long i j;

run;

proc print data=want noobs; var str1 str2 diff; run;

            str1                    str2                           diff

            thebigbad               thebigbadwolf                 wolf

            1REDSTREETMANCHESTER    1REDSTREETGORTONMANCHESTER    GORTON

            ab1234                  abcöäü1234                    cöäü

            abc                     abc

            abc                     wolfabc                       wolf

PG

PG
KarthikSrivasthav
Calcite | Level 5

Hi ,

How to remove the difference from the variable2?

Ex:

var1="world cup"

var2="Woorlld ccup"

i want to make var2 as var1

thanks....

Regards

Karthik.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

How do you know "Woorld ccup" = "world cup"?  Could it not be "world coup"?  This is called data cleaning.  Produce a list of distinct values, then clean those which do not match.  E.g.:

data coded;

     category="world cup"; values="Woorlld cup"; output;

                                         values="world cup"; output;

                                         values="Cup of Worlds"; output;

run;

proc sql;

     select     A.VAR1,

                   A.VAR2,

                    B.CATEGORY

     from       HAVE A

     left join CODED B

     on            A.VAR2=VALUES;

quit;

PGStats
Opal | Level 21

For non trivial spelling problems like that, look at the spelling distance functions in SAS : COMPGED, COMPLEV, SPEDIS, and also SOUNDEX.

PG

PG
Patrick
Opal | Level 21

And here another coding option using the compare() function.

data have;

  input var1 : $40. var2 : $40.;

  length result $40;

  if lengthn(var1) < lengthn(var2) then

    do;

      _short=var1;

      _long =var2;

    end;

  else

    do;

      _short=var2;

      _long =var1;

    end;

  _pos=compare(_short, _long,'i:');

  result=substrn(_long,abs(_pos));

 

  result=reverse(strip(result));

  _short=reverse(strip(_short));

 

  _pos=compare(_short,result,'i:');

  result=substrn(result,abs(_pos));

 

  result=reverse(strip(result));

 

  cards;

1REDSTREETMANCHESTER     1REDSTREETGORTONMANCHESTER

1REDSTREETGORTONMANCHESTER     1REDSTREETMANCHESTER

1REDSTREETMANCHESTER     GORTON1REDSTREETMANCHESTER

1REDSTREETMANCHESTER     1REDSTREETMANCHESTERGORTON

;

run;

art297
Opal | Level 21

: First, thanks for teaching me about a function I wasn't aware of. I never originally responded to this question, as I had misunderstood the requirements and was looking for a solution that identified everything that was different between two strings.

However, that was too complicated without the compare function, and I quickly realized that I had misunderstood the requirements.

Regardless, the following accomplishes both the requirements and my original interpretation of the problem. It only looks for characters in the 2nd string that aren't present in the 1st string, as that was the OP's need. However, like you did, it wouldn't be hard at all to change the code to be able to go both ways:

data have;

  input var1 : $40. var2 : $40.;

  cards;

1REDSTREETMANCHESTER     1REDSTREETGORTONMANCHESTER

1REDSTREETMANCHESTER     1REDSTREETxxMANCHESTER

1REDSTREETMANCHESTER     GORTON1REDSTREETMANCHESTER

1REDSTREETMANCHESTER     1REDSTREETMANCHESTERGORTON

1REDSTREETMANCHESTER     1REDGSTROEETMANRCHETOSTERN

1REDSTREETMANCHESTER     1REDSTREETMANCHESTER

;

data want (drop=_:);

  set have;

  length extra $40;

  _right=var2;

  do while (compare(var1, _right,'il') ne 0);

    if abs(compare(var1, _right,'il')) eq 1 then do;

      extra=catt(extra,substr(_right,1,1));

      _right=substr(_right,2);

    end;

    else do;

      extra=catt(extra,substr(_right,abs(compare(var1, _right,'il')),1));

      _right=catt(substr(_right,1,abs(compare(var1, _right,'il'))-1),

                  substr(_right,abs(compare(var1, _right,'il'))+1));

    end;

  end;

run;

Patrick
Opal | Level 21

I'm so spoiled by the richness of the SAS language that I just couldn't believe that there is no function which does the job for me. So I "scanned" through the docu, got reminded of a lot of good stuff and then also found compare().

Now I hope that future SAS releases will spoil me even further and add "startpos" as a parameter to allow for a comparison in reverse order if set to "-1".

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 11631 views
  • 2 likes
  • 8 in conversation