Dear SAS experts,
I have a set of variables, among them surname and first name: NACHNAME and VORNAME. Both are character variables of length $35.
Some of their values are not correct. For instance, you can find:
- cccc
- ccc
- ?0???
- !-§4$
- Test
Is there a method to clean them, that is, to delete the rows where NACHNAME or VORNAME has this kind of irrelevant value?
Thanx,
regards
PY
Accepted Solutions
What are the rules that determine if a value is not acceptable?
Paige Miller
We are in Germany, so the rules are as follows (the same for NACHNAME and VORNAME):
- all names must be written in Latin letters, from any European language (I guess it is not possible to select on this)
- a name can contain as many words as needed: 'Du Taxi du Pouet de la Valse Folle' is correct
- a hyphen (-) between two words is accepted, as is a space, or both together (people make typing errors), and so is an apostrophe (')
- accents on vowels are OK: ` ´ ^ ~ ö ä ü
- the underscore (_) will be tolerated
- the following signs always indicate a false name: ? , = ; ( ) / & % $ § ! " * + # @ < > | €
- test-case names such as cccc, ccc, Test, test ... will be eliminated (I noticed some)
I would like help with the method for achieving this goal; I want to learn and understand how to do it.
false_name=findc(name,'?,=;()/&%$§!"*+#@<>|€')>0;
As for the other types of false names, you would need to define general rules that can be programmed, for example: if the same letter appears 4 times consecutively, the name is false. Naturally, there are many such rules that would have to be defined and then programmed.
Paige Miller
Thank you.
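Adding a rule that rejects the same character repeated three or more times in a row:
data want;
  set have;
  /* invalid = 1 if name contains a forbidden sign, or any      */
  /* non-blank character three or more times consecutively      */
  invalid =
    findc(name, '?,=;()/&%$§!"*+#@<>|€') > 0
    or
    prxmatch("/(\S)\1{2,}/", name);
run;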
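The step above only sets a flag on a single variable called name. Here is a minimal sketch of how the same checks could be applied to both NACHNAME and VORNAME and the bad rows actually removed. Some assumptions on my side: HAVE, WANT and REJECTED are placeholder data set names, the list of test words ('test') is only an illustration and should be extended with whatever you find in your data, and the session encoding is assumed to be single-byte (for example WLATIN1). In a UTF-8 session, § and € are multi-byte characters, and a byte-based function like FINDC may flag legitimate accented letters, so please check the result there.

```sas
/* Sketch only: HAVE, WANT and REJECTED are placeholder names. */
data want rejected;
  set have;
  /* Both variables follow the same rules, so loop over them. */
  array nm {2} nachname vorname;
  invalid = 0;
  do i = 1 to dim(nm);
    /* Rule 1: any forbidden sign makes the name false. */
    if findc(nm{i}, '?,=;()/&%$§!"*+#@<>|€') > 0 then invalid = 1;
    /* Rule 2: same non-blank character 3 or more times in a row. */
    else if prxmatch('/(\S)\1{2,}/', nm{i}) > 0 then invalid = 1;
    /* Rule 3: known test entries, compared case-insensitively.  */
    /* The list is hypothetical; extend it as needed.            */
    else if lowcase(strip(nm{i})) in ('test') then invalid = 1;
  end;
  drop i;
  /* Keep rejected rows in a separate data set for review. */
  if invalid then output rejected;
  else output want;
run;
```

Writing the rejected rows to their own data set instead of deleting them outright makes it easy to review what was thrown away. The repeated-character rule could in principle reject a rare legitimate name containing a triple letter, so a look at REJECTED before discarding it seems worthwhile.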