Good day,
I am using the VERIFY function to check the similarity of my data, but the result looks odd to me.
Obs | name | lag_name | checking |
---|---|---|---|
1 | Harry | | 1 |
2 | Harry | Harry | 6 |
3 | Henry | Harry | 2 |
4 | Ben | Henry | 1 |
It seems a blank also counts as one character.
Does the sequence matter here? I expected:
obs 3 to have 3 matching letters
obs 4 to have 2 matching letters
Can anyone help me with that?
Below is my program:
data testing;
input name $40.;
infile datalines dlm=',';
datalines;
Harry
Harry
Henry
Ben
;
run;
data testing2;
set testing;
lag_name=compress(lag(name));
name2=compress(name);
run;
data checking;
set testing2;
checking=VERIFY(name2,lag_name);
run;
Thanks in advance,
Harry
Hi @harrylui
Here is the output I get when I run the code using the VERIFY() function:
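VERIFY returns the position of the first character in the first argument that does not occur anywhere in the second argument (0 if every character occurs), so the results look correct:
- Record 1: H is not in lag_name, which is all blanks, so position 1 is returned
- Record 2: the strings are identical, so every character is found and 0 is returned
- Record 3: e, the second character of Henry, does not occur anywhere in Harry, so position 2 is returned
- Record 4: B, the first character of Ben, does not occur in Henry, so position 1 is returned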
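To see how trailing blanks affect VERIFY, here is a minimal sketch. The data set name verify_demo and the variables source, excerpt, padded, mixed and trimmed are made up for illustration and are not part of the original program. It assumes both variables are stored with length 10, so the values are blank-padded.

```sas
data verify_demo;
   length source excerpt $10;
   infile datalines dlm=',';
   input source $ excerpt $;
   /* Both arguments keep their trailing blanks, so a blank in source is found in excerpt */
   padded  = verify(source, excerpt);
   /* Only excerpt is trimmed: a trailing blank in source is no longer found in excerpt */
   mixed   = verify(source, trim(excerpt));
   /* Both arguments trimmed: only the visible characters take part in the search */
   trimmed = verify(trim(source), trim(excerpt));
datalines;
Harry,Harry
Henry,Harry
Ben,Henry
;
run;

proc print data=verify_demo;
run;
```

Under these assumptions, the Harry/Harry row should show 0 for padded and trimmed but 6 for mixed, because the blank in position 6 of source is not in the trimmed excerpt. That may be where a value of 6 for identical names comes from, although this cannot be confirmed from the posted program alone.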
Depending on what you're trying to achieve, you could explore the COMPGED() function. It returns the generalized edit distance between two strings: the larger the distance, the less similar the strings are. Zero indicates that the strings are strictly identical. It is up to you to choose the threshold that counts as "acceptable similarity".
data checking;
set testing2;
checking=compged(name2,lag_name);
run;
Hope this helps,
Best,
The VERIFY function returns the position of the first character in string1 that does not appear anywhere in string2.
So in your example:
For Henry and Harry, the letter H appears in Harry, but the 2nd letter e does not, so VERIFY returns 2.
For Ben and Henry, the 1st letter B does not appear in Henry, so the function returns 1.
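If the goal is the number of matching letters that the question expected (3 for Henry vs Harry, 2 for Ben vs Henry), VERIFY is not the right tool, but a small loop can count them. This is only a sketch of one possible approach: the data set name shared_letters and the variables shared and i are illustrative, and it reuses testing2 from the original program.

```sas
data shared_letters;
   set testing2;
   shared = 0;
   /* LENGTHN ignores trailing blanks, so only the visible characters of name2 are visited */
   do i = 1 to lengthn(name2);
      /* INDEXC returns 0 when the i-th character of name2 does not occur in lag_name */
      if indexc(lag_name, char(name2, i)) > 0 then shared = shared + 1;
   end;
   drop i;
run;
```

Note that this count is case-sensitive (B in Ben does not match any letter in Henry) and counts repeated letters each time they occur, so Harry vs Harry gives 5. Whether that is the right definition of similarity depends on what you need.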
Thanks, all, for the explanations.