Hello there. Some help if I may.
I have two variables with addresses in (no spaces) some contain counties and some don't and i'm trying to extract the differences with no solution.
I have even tried using proc compare by putting the variables I want to compare into two datasets but to no avail
example
Variable1 Variable2
1REDSTREETMANCHESTER 1REDSTREETGORTONMANCHESTER
Im trying to output GORTON to another variable but cant seem to get it.
Any help greatly appreciated.
TBH I don't see how you are going to achieve that. There is no discernible delimiter, length, or algorithm which can identify street names from places from Cities. Perhaps if you had the tech behind Google maps you might be able to do some good guesses, but even then you would have similar place names/street names in many places.
Go back to where you got the data from and see if there is a delimited version.
I'm not too interested in the content just the comparison of the strings
so variable 1 could be
"thebigbad"
and variable 2
"thebigbadwolf"
I would want the output to be "wolf"
Is that possible
Oh, I see. Then tranwrd would probably be helpful here:
data want;
set have;
length result $2000;
if length(strip(var1)) > length(strip(var2)) then result=tranwrd(var1,var2,"");
else tranwrd(var2,var1,"");
run;
So first check which string is longer, then from the longest string remove any occurrences of the shorter string.
Below code won't give you 100% what you're after. You need to test it with your real data and then decide whether you only will remove sub-strings where at least the first n-characters match. For this the generated regex would then need to look something like 1RED?S?....
The code will be quite resource hungry as for every single iteration of the data step two regular expressions need to be compiled.
I hope someone else comes up with another more straightforward approach. I don't believe that a simple tranwrd() will cut it though.
data have /*(drop=_:)*/;
length variable1 variable2 result _short _long $ 100 _prx_string $ 200;
Variable1='1REDSTREETMANCHESTER';
Variable2='1REDSTREETGORTONMANCHESTER';
if lengthn(variable1) < lengthn(variable2) then
do;
_short=variable1;
_long =variable2;
end;
else
do;
_short=variable2;
_long =variable1;
end;
/* remove string from beginning */
_prx_string=prxchange('s/(.)/\1?/o',-1,strip(_short));
_prx_string=cats("s/",_prx_string,"//oi");
_prx_id=prxparse(strip(_prx_string));
result=prxchange(_prx_id,1,strip(_long));
/* free up memory */
call prxfree(_prx_id);
/* remove string from end */
_prx_string=prxchange('s/(.)/\1?/o',-1,reverse(strip(_short)));
_prx_string=cats("s/",_prx_string,"//oi");
_prx_id=prxparse(strip(_prx_string));
result=prxchange(_prx_id,1,reverse(strip(result)));
/* free up memory */
call prxfree(_prx_id);
/* get result string into right order */
result=reverse(strip(result));
output;
stop;
run;
The sample code below assumes that the beginning of the two strings is the same.
Maybe this will help or at least provide a start.
I want know if there is
Variable1 Variable2
1REDSTREETMANCHESTER 1REDGORTSTREETMANCHESTERON
If not.
Xia Keshan
data have; input var1 : $40. var2 : $40.; cards; 1REDSTREETMANCHESTER 1REDSTREETGORTONMANCHESTER 1REDSTREETMANCHESTER GORTON1REDSTREETMANCHESTER 1REDSTREETMANCHESTER 1REDSTREETMANCHESTERGORTON ; run; data want; set have; length var3 $ 40; position=find(var2,strip(var1)); if position then do; if position eq 1 then var3=substr(var2,length(var1)+1); else var3=substr(var2,1,position-1) ; end; if not position then do; do i=2 to length(var1); one=find(var2,strip(substr(var1,1,i-1))) ; two=find(var2,strip(substr(var1,i))) ; if one and two then var3=substr(var2,i,two-i); end; end; run;
Similar to Bruno's proposal but works also when difference is at beginning:
data have;
length str1 str2 $64;
input str1 str2;
datalines;
thebigbad thebigbadwolf
1REDSTREETMANCHESTER 1REDSTREETGORTONMANCHESTER
ab1234 abcöäü1234
abc abc
abc wolfabc
;
data want;
length short long diff $64;
set have;
if length(str1)<length(str2) then do;
short=str1; long=str2; end;
else do;
short=str2; long=str1; end;
do i = 1 to length(short);
if char(short, i) ne char(long, i) then leave; end;
do j = 1 to length(short);
if char(short, length(short)-j+1) ne char(long, length(long)-j+1) then leave; end;
if i <= length(long) - j + 1 then diff = substr(long, i, length(long) - j + 1 - i + 1);
drop short long i j;
run;
proc print data=want noobs; var str1 str2 diff; run;
str1 str2 diff
thebigbad thebigbadwolf wolf
1REDSTREETMANCHESTER 1REDSTREETGORTONMANCHESTER GORTON
ab1234 abcöäü1234 cöäü
abc abc
abc wolfabc wolf
PG
Hi ,
How to remove the difference from the variable2?
Ex:
var1="world cup"
var2="Woorlld ccup"
i want to make var2 as var1
thanks....
Regards
Karthik.
How do you know "Woorld ccup" = "world cup"? Could it not be "world coup"? This is called data cleaning. Produce a list of distinct values, then clean those which do not match. E.g.:
data coded;
category="world cup"; values="Woorlld cup"; output;
values="world cup"; output;
values="Cup of Worlds"; output;
run;
proc sql;
select A.VAR1,
A.VAR2,
B.CATEGORY
from HAVE A
left join CODED B
on A.VAR2=VALUES;
quit;
For non trivial spelling problems like that, look at the spelling distance functions in SAS : COMPGED, COMPLEV, SPEDIS, and also SOUNDEX.
PG
And here another coding option using the compare() function.
data have;
input var1 : $40. var2 : $40.;
length result $40;
if lengthn(var1) < lengthn(var2) then
do;
_short=var1;
_long =var2;
end;
else
do;
_short=var2;
_long =var1;
end;
_pos=compare(_short, _long,'i:');
result=substrn(_long,abs(_pos));
result=reverse(strip(result));
_short=reverse(strip(_short));
_pos=compare(_short,result,'i:');
result=substrn(result,abs(_pos));
result=reverse(strip(result));
cards;
1REDSTREETMANCHESTER 1REDSTREETGORTONMANCHESTER
1REDSTREETGORTONMANCHESTER 1REDSTREETMANCHESTER
1REDSTREETMANCHESTER GORTON1REDSTREETMANCHESTER
1REDSTREETMANCHESTER 1REDSTREETMANCHESTERGORTON
;
run;
: First, thanks for teaching me about a function I wasn't aware of. I never originally responded to this question, as I had misunderstood the requirements and was looking for a solution that identified everything that was different between two strings.
However, that was too complicated without the compare function, and I quickly realized that I had misunderstood the requirements.
Regardless, the following accomplishes both the requirements and my original interpretation of the problem. It only looks for characters in the 2nd string that aren't present in the 1st string, as that was the OP's need. However, like you did, it wouldn't be hard at all to change the code to be able to go both ways:
data have;
input var1 : $40. var2 : $40.;
cards;
1REDSTREETMANCHESTER 1REDSTREETGORTONMANCHESTER
1REDSTREETMANCHESTER 1REDSTREETxxMANCHESTER
1REDSTREETMANCHESTER GORTON1REDSTREETMANCHESTER
1REDSTREETMANCHESTER 1REDSTREETMANCHESTERGORTON
1REDSTREETMANCHESTER 1REDGSTROEETMANRCHETOSTERN
1REDSTREETMANCHESTER 1REDSTREETMANCHESTER
;
data want (drop=_:);
set have;
length extra $40;
_right=var2;
do while (compare(var1, _right,'il') ne 0);
if abs(compare(var1, _right,'il')) eq 1 then do;
extra=catt(extra,substr(_right,1,1));
_right=substr(_right,2);
end;
else do;
extra=catt(extra,substr(_right,abs(compare(var1, _right,'il')),1));
_right=catt(substr(_right,1,abs(compare(var1, _right,'il'))-1),
substr(_right,abs(compare(var1, _right,'il'))+1));
end;
end;
run;
I'm so spoiled by the richness of the SAS language that I just couldn't believe that there is no function which does the job for me. So I "scanned" through the docu, got reminded of a lot of good stuff and then also found compare().
Now I hope that future SAS releases will spoil me even further and add "startpos" as a parameter to allow for a comparison in reverse order if set to "-1".
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.