Dear SAS experts,
I have a dataset whose variables include a surname and a first name: NACHNAME and VORNAME. Both are character variables of length $35.
Some of their values are not correct: for instance, you can find:
Is there a method to clean them (delete the rows where NACHNAME or VORNAME has this kind of irrelevant value)?
Thanks,
regards
PY
What are the rules that determine if a value is not acceptable?
We are in Germany, so the rules are as follows (the same for NACHNAME and VORNAME):
- all names must be written with Latin letters, of any European language (not possible to select this, I guess)
- a name can contain as many words as needed: 'Du Taxi du Pouet de la Valse Folle' is correct
- the sign - between 2 words is accepted, as well as a space, or both together (people also make typing errors); an apostrophe ' is accepted too
- accents on vowels are OK: ` ´ ^ ~ ö ä ü
- the _ will be tolerated
- the following signs always indicate a false name: ? , = ; ( ) / & % $ § ! " * + # @ < > | €
- test-case names such as cccc, ccc, Test, test... will be eliminated (I noticed some)
My purpose here is to get help with the method for achieving this goal; I would like to make progress and understand how to do it.
false_name=findc(name,'?,=;()/&%$§!"*+#@<>>|€')>0;
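Since the question asks to delete rows where either NACHNAME or VORNAME is bad, the same FINDC test can be applied to both variables. This is a minimal sketch: HAVE and WANT are assumed dataset names, and the character list is the one from the rules above. One caveat worth testing: FINDC appears to compare byte by byte, so in a UTF-8 session multi-byte characters such as § and € may produce unexpected matches and might need separate handling.

```sas
/* Sketch only: HAVE and WANT are assumed dataset names.          */
/* Drops every row where either name contains a forbidden sign.   */
data want;
  set have;
  if findc(nachname, '?,=;()/&%$§!"*+#@<>|€') > 0
     or findc(vorname, '?,=;()/&%$§!"*+#@<>|€') > 0
  then delete;
run;
```

The list is in single quotes so that & and % are not treated as macro triggers.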
As for the other types of false names, you would need to define general rules that can be programmed, for example: if the same letter appears 4 times consecutively, the name is false. Naturally, many such general rules would have to be defined and then programmed.
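Adding a rule rejecting repeated characters:
data want;
  set have;
  /* invalid = 1 if name contains a forbidden sign, or the same  */
  /* non-blank character three or more times in a row ('ccc')    */
  invalid =
    findc(name,'?,=;()/&%$§!"*+#@<>>|€') > 0
    or
    prxmatch("/(\S)\1{2,}/", name);
run;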
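To delete rows rather than only flag them, the checks can be looped over both name variables. The following is a rough sketch under assumptions: HAVE and WANT are placeholder dataset names, and the list of test names contains only 'TEST' as an illustration (extend it with whatever placeholders actually occur in the data).

```sas
/* Sketch only: HAVE/WANT and the placeholder-name list are assumptions. */
data want;
  set have;
  array names {2} nachname vorname;   /* both variables follow the same rules */
  drop i;
  do i = 1 to dim(names);
    if findc(names{i}, '?,=;()/&%$§!"*+#@<>|€') > 0   /* forbidden sign          */
       or prxmatch('/(\S)\1{2,}/', names{i}) > 0       /* 3+ repeats, e.g. ccc    */
       or upcase(strip(names{i})) in ('TEST')          /* known test name         */
    then delete;                                       /* drop the whole row      */
  end;
run;
```

Using an array keeps the rules in one place, so a rule added later applies to NACHNAME and VORNAME alike.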