This code is giving me false positives.
I have character data containing ID numbers. Each ID is a 9-character string made up of the digits 0 to 9. I am trying to identify whether the same character appears 5 or more times consecutively. If it does, I will create a flag.
I have the code below. It works most of the time but also gives false positives. For example, it picks up something like 121341111, where the '1' appears in the string 5 or more times in total but not in a row.
I want to flag a value only if a character appears 5 or more times consecutively. 121341111 should not be flagged, because the 1 is repeated consecutively only 4 times.
Any ideas?
data want(drop = i) ;
set have ;
length ssn_char ssn_rept_chars $9;
ssn_char = ssn;
do i=1 to 6 until (flag=1);
if substr(ssn_char, i, 1) = substr(ssn_char, i+1, 1) = substr(ssn_char, i+2, 1) = substr(ssn_char, i+3, 1)
then flag=1;
if flag = 1 then ssn_rept_chars = ssn_char;
end;
run;
This works, though I believe there is a slicker, more elegant way to assign the value of checkit.
data have;
ssn = 123456789;
output;
ssn = 111116789;
output;
ssn = 123455555;
output;
ssn = 123333339;
output;
run;
data want(drop = i) ;
set have ;
length ssn_char ssn_rept_chars $9;
ssn_char = ssn;
do i=1 to 5 until (flag=1);
checkit = substr(ssn_char, i, 1)||substr(ssn_char, i, 1)||substr(ssn_char, i, 1)||
substr(ssn_char, i, 1)||substr(ssn_char, i, 1) ;
if checkit = substr(ssn_char, i, 1)||substr(ssn_char, i+1, 1)||
substr(ssn_char, i+2, 1)||substr(ssn_char, i+3, 1)||substr(ssn_char, i+4, 1)
then do;
flag=1;
put i= checkit= flag=;
ssn_rept_chars = ssn_char;
end;
end;
run;
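One possibly slicker way to build checkit is the REPEAT function, which returns its argument followed by n further copies, so the five-way concatenation collapses into a single call. The following is only a sketch, assuming the same have data set as above; using SUBSTR(ssn_char, i, 5) for the comparison window and giving checkit an explicit length of $5 are substitutions made here, not part of the code above.

```sas
data want(drop = i);
   set have;
   length ssn_char ssn_rept_chars $9 checkit $5;
   ssn_char = ssn;
   do i = 1 to 5 until (flag = 1);
      /* REPEAT(x, 4) returns x plus 4 more copies: 5 characters in total */
      checkit = repeat(substr(ssn_char, i, 1), 4);
      /* Compare against the 5-character window starting at position i */
      if checkit = substr(ssn_char, i, 5) then do;
         flag = 1;
         ssn_rept_chars = ssn_char;
      end;
   end;
run;
```

With the four sample values above, this should flag 111116789, 123455555 and 123333339 and leave 123456789 unflagged.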
data want;
set have;
array a{10}$5 _temporary_ ('00000' '11111' '22222' '33333' '44444' '55555' '66666' '77777' '88888' '99999');
_i_=1;
do until (flag=1 or _i_=11);
flag= (index(ssn,a[_i_])>0);
_i_+1;
end;
run;
Perhaps.
You would have to get slicker if you were looking for any repeated character, though, rather than just digits.
It's giving you false positives because you are comparing only 4 characters, not 5. To compare 5 characters, two changes are needed. First, i should go from 1 to 5, not 1 to 6:
do i=1 to 5 until (flag=1);
Second, add another character to the list of comparisons:
... = substr(ssn_char, i+4, 1) then flag=1;
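Put together, the two changes might look like the sketch below. It assumes the same have data set and variable names as the original step. The comparisons are spelled out with explicit AND operators here instead of the chained = form; that is a stylistic choice made for this sketch, and the chained form with the extra term should behave the same way.

```sas
data want(drop = i);
   set have;
   length ssn_char ssn_rept_chars $9;
   ssn_char = ssn;
   /* A run of 5 in a 9-character string can start no later than position 5 */
   do i = 1 to 5 until (flag = 1);
      /* All five characters in the window must match the first one */
      if  substr(ssn_char, i, 1) = substr(ssn_char, i+1, 1)
      and substr(ssn_char, i, 1) = substr(ssn_char, i+2, 1)
      and substr(ssn_char, i, 1) = substr(ssn_char, i+3, 1)
      and substr(ssn_char, i, 1) = substr(ssn_char, i+4, 1)
      then flag = 1;
   end;
   if flag = 1 then ssn_rept_chars = ssn_char;
run;
```

With this version, 121341111 is no longer flagged, because no window of five characters consists of a single repeated digit.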
Good luck.
data want;
set have;
flag=prxmatch('/.*(\d)\1{4,4}.*/',ssn);
run;
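To unpack the pattern: (\d) captures one digit, \1 is a backreference to that captured digit, and {4,4} requires exactly four more copies of it, so the expression matches any run of five identical digits (a longer run also contains one). Because of the leading .*, a successful match always starts at position 1, so PRXMATCH returns 1 or 0. Below is a sketch of a variant without the .* wrappers. PRXMATCH then returns the starting position of the run, so the result is compared with 0 to get a 0/1 flag. The sample data and the character variable ssn are assumptions for illustration.

```sas
/* Hypothetical sample data; ssn is stored as a 9-character variable */
data have;
   length ssn $9;
   input ssn $;
   datalines;
123456789
111116789
121341111
123333339
;
run;

data want;
   set have;
   /* PRXMATCH returns the position where the match starts, or 0 if none */
   flag = (prxmatch('/(\d)\1{4}/', ssn) > 0);
run;
```

Here only 111116789 and 123333339 should be flagged. Replacing \d with . would extend the check to any repeated character, with the caveat that trailing blanks in a padded character variable would then also count as a repeated character.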
Perl Regular Expressions are not SAS specific, so I'm sure there is a lot of material around. I don't know of anything specific I could recommend.
Within SAS:
SAS(R) 9.4 Functions and CALL Routines: Reference, Third Edition
...and once you understand which SAS functions allow you to use Perl Regular Expressions (functions starting with "prx..") then the most important page is: SAS(R) 9.4 Functions and CALL Routines: Reference, Third Edition
Because Perl Regular Expressions are not SAS specific, a lot of expressions have been published, and searching the Internet will very often turn up something close to what you need.
Oh, and the Tip Sheet can also be useful in the beginning: https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf
Thanks very much. So does that mean it is generally aimed at people who are already proficient in the Perl scripting language? Hmm, if so, I wonder how many languages a person like me, with average to below-average intelligence, can learn. I appreciate your very quick response. Cheers
I could learn it with "Googling" and "trial and error" - so you can too!
You don't need to learn Perl for RegEx - Perl just implemented a syntax for regular expressions that became a quasi-standard.