11-27-2014 10:26 AM
Hello there. Some help if I may.
I have two variables with addresses in (no spaces) some contain counties and some don't and i'm trying to extract the differences with no solution.
I have even tried using proc compare by putting the variables I want to compare into two datasets but to no avail
Im trying to output GORTON to another variable but cant seem to get it.
Any help greatly appreciated.
11-27-2014 10:30 AM
TBH I don't see how you are going to achieve that. There is no discernible delimiter, length, or algorithm which can identify street names from places from Cities. Perhaps if you had the tech behind Google maps you might be able to do some good guesses, but even then you would have similar place names/street names in many places.
Go back to where you got the data from and see if there is a delimited version.
11-27-2014 10:36 AM
I'm not too interested in the content just the comparison of the strings
so variable 1 could be
and variable 2
I would want the output to be "wolf"
Is that possible
11-27-2014 01:30 PM
Oh, I see. Then tranwrd would probably be helpful here:
length result $2000;
if length(strip(var1)) > length(strip(var2)) then result=tranwrd(var1,var2,"");
So first check which string is longer, then from the longest string remove any occurrences of the shorter string.
11-27-2014 04:30 PM
Below code won't give you 100% what you're after. You need to test it with your real data and then decide whether you only will remove sub-strings where at least the first n-characters match. For this the generated regex would then need to look something like 1RED?S?....
The code will be quite resource hungry as for every single iteration of the data step two regular expressions need to be compiled.
I hope someone else comes up with another more straightforward approach. I don't believe that a simple tranwrd() will cut it though.
data have /*(drop=_*/;
length variable1 variable2 result _short _long $ 100 _prx_string $ 200;
if lengthn(variable1) < lengthn(variable2) then
/* remove string from beginning */
/* free up memory */
/* remove string from end */
/* free up memory */
/* get result string into right order */
11-27-2014 04:36 PM
The sample code below assumes that the beginning of the two strings is the same.
Maybe this will help or at least provide a start.
11-28-2014 10:55 AM
I want know if there is
data have; input var1 : $40. var2 : $40.; cards; 1REDSTREETMANCHESTER 1REDSTREETGORTONMANCHESTER 1REDSTREETMANCHESTER GORTON1REDSTREETMANCHESTER 1REDSTREETMANCHESTER 1REDSTREETMANCHESTERGORTON ; run; data want; set have; length var3 $ 40; position=find(var2,strip(var1)); if position then do; if position eq 1 then var3=substr(var2,length(var1)+1); else var3=substr(var2,1,position-1) ; end; if not position then do; do i=2 to length(var1); one=find(var2,strip(substr(var1,1,i-1))) ; two=find(var2,strip(substr(var1,i))) ; if one and two then var3=substr(var2,i,two-i); end; end; run;
11-28-2014 12:10 PM
Similar to Bruno's proposal but works also when difference is at beginning:
length str1 str2 $64;
input str1 str2;
length short long diff $64;
if length(str1)<length(str2) then do;
short=str1; long=str2; end;
short=str2; long=str1; end;
do i = 1 to length(short);
if char(short, i) ne char(long, i) then leave; end;
do j = 1 to length(short);
if char(short, length(short)-j+1) ne char(long, length(long)-j+1) then leave; end;
if i <= length(long) - j + 1 then diff = substr(long, i, length(long) - j + 1 - i + 1);
drop short long i j;
proc print data=want noobs; var str1 str2 diff; run;
str1 str2 diff
thebigbad thebigbadwolf wolf
1REDSTREETMANCHESTER 1REDSTREETGORTONMANCHESTER GORTON
ab1234 abcöäü1234 cöäü
abc wolfabc wolf
12-05-2014 09:38 AM
How to remove the difference from the variable2?
i want to make var2 as var1
12-05-2014 10:05 AM
How do you know "Woorld ccup" = "world cup"? Could it not be "world coup"? This is called data cleaning. Produce a list of distinct values, then clean those which do not match. E.g.:
category="world cup"; values="Woorlld cup"; output;
values="world cup"; output;
values="Cup of Worlds"; output;
from HAVE A
left join CODED B
12-05-2014 10:55 AM
For non trivial spelling problems like that, look at the spelling distance functions in SAS : COMPGED, COMPLEV, SPEDIS, and also SOUNDEX.
11-28-2014 11:05 PM
And here another coding option using the compare() function.
input var1 : $40. var2 : $40.;
length result $40;
if lengthn(var1) < lengthn(var2) then
12-05-2014 02:16 PM
: First, thanks for teaching me about a function I wasn't aware of. I never originally responded to this question, as I had misunderstood the requirements and was looking for a solution that identified everything that was different between two strings.
However, that was too complicated without the compare function, and I quickly realized that I had misunderstood the requirements.
Regardless, the following accomplishes both the requirements and my original interpretation of the problem. It only looks for characters in the 2nd string that aren't present in the 1st string, as that was the OP's need. However, like you did, it wouldn't be hard at all to change the code to be able to go both ways:
input var1 : $40. var2 : $40.;
data want (drop=_;
length extra $40;
do while (compare(var1, _right,'il') ne 0);
if abs(compare(var1, _right,'il')) eq 1 then do;
12-05-2014 02:37 PM
I'm so spoiled by the richness of the SAS language that I just couldn't believe that there is no function which does the job for me. So I "scanned" through the docu, got reminded of a lot of good stuff and then also found compare().
Now I hope that future SAS releases will spoil me even further and add "startpos" as a parameter to allow for a comparison in reverse order if set to "-1".