Dear SAS experts,
Among my variables are a surname and a first name: NACHNAME and VORNAME. Both are character variables of length $35.
Some of their values are not correct; for instance, you can find:
Is there a way to clean them, that is, to delete the rows where NACHNAME or VORNAME has this kind of invalid value?
Thanks,
regards
PY
What are the rules that determine if a value is not acceptable?
We are in Germany, so the rules are as follows (the same for NACHNAME and VORNAME):
- all names must be written with Latin letters, of any European language (I guess it is not possible to select this)
- a name can contain as many words as needed: 'Du Taxi du Pouet de la Valse Folle' is correct
- a hyphen (-) between two words is accepted, as is a space, or both together (people make typing errors), and also an apostrophe (')
- accents on vowels are OK: ` ´ ^ ~ ö ä ü
- the underscore (_) will be tolerated
- the following signs always indicate a false name: ? , = ; ( ) / & % $ § ! " * + # @ < > | €
- test entries such as cccc, ccc, Test, test... will be eliminated (I noticed some)
My purpose here is to get help with the method for achieving this goal; I would like to make progress and understand how to do it.
false_name=findc(name,'?,=;()/&%$§!"*+#@<>>|€')>0;
As for the other types of false names, you would need to create general rules that can be programmed, for example: if the same letter appears 4 times consecutively, it is a false name. Naturally, many such general rules would have to be defined and then programmed.
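The FINDC test tests a single variable called name and only sets a flag. To actually delete the rows, as asked in the question, the same test could be run over both NACHNAME and VORNAME. What follows is a minimal sketch, not a definitive solution. It assumes the input data set is named HAVE and that the session uses a single-byte encoding such as WLATIN1; § and € are multi-byte characters in UTF-8, where this test may misfire.

```sas
data want;
    set have;                            /* HAVE is an assumed data set name */
    array names {*} nachname vorname;    /* the two variables from the question */
    do i = 1 to dim(names);
        /* any of the forbidden signs marks the whole row as a false name */
        if findc(names{i}, '?,=;()/&%$§!"*+#@<>|€') > 0 then delete;
    end;
    drop i;                              /* loop counter is not needed in WANT */
run;
```

The character list is written in single quotes so that & and % are not interpreted by the macro processor. To review the rejected rows before dropping them, the DELETE statement could be replaced by a flag assignment.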
Adding a rule rejecting repeated characters:
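data want;
    set have;
    /* invalid = 1 when the name breaks either rule, otherwise 0 */
    invalid =
        /* rule 1: the name contains one of the forbidden signs */
        findc(name, '?,=;()/&%$§!"*+#@<>|€') > 0
        or
        /* rule 2: the same non-blank character 3 or more times in a row */
        prxmatch("/(\S)\1{2,}/", name) > 0;
run;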
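The test entries mentioned in the rules (cccc, ccc, Test, test) could be handled in the same spirit. The sketch below is only an illustration under stated assumptions: the data set name HAVE and the helper variables i and u are hypothetical, and the list of test values contains only TEST, to be extended with whatever entries are actually found in the data.

```sas
data want;
    set have;                            /* HAVE is an assumed data set name */
    array names {*} nachname vorname;
    length u $ 35;                       /* both variables are $35 per the question */
    do i = 1 to dim(names);
        u = upcase(names{i});            /* compare without regard to case */
        /* known test entries: the whole value must match */
        if u in ('TEST') then delete;
        /* same non-blank character 3 or more times in a row, e.g. ccc or Cccc */
        if prxmatch('/(\S)\1{2,}/', u) > 0 then delete;
    end;
    drop i u;                            /* helper variables are not kept */
run;
```

Upper-casing first means that Test, test and TEST are all caught by one list entry. Note that the repeated-character rule is a heuristic and could, in principle, reject a genuine name that contains a tripled letter, so it may be worth flagging such rows for review instead of deleting them outright.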