AmitParmar
Obsidian | Level 7

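data a;
/* The & modifier lets name contain single blanks; */
/* the value ends at two or more consecutive blanks. */
input id name & $40. age salary;
cards;
1 Andy Murray  28 1500000
2 Stewart Christen  31 2500000
3 Adam Levine  35 800000
4 Bill White  40 1500000
5 Army Grey  20 300000
6 Dawson Robert  30 500000
;
run;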

 

Let's say this is the data and there are 10k such entries. In entries 2 and 6 the first and last names are swapped: the actual names are Christen Stewart and Robert Dawson.

 

How can we identify which names are swapped, and how can we correct them?

 

9 REPLIES
andreas_lds
Jade | Level 19

This task is unsolvable without having a list of "allowed" first names and last names.

Assume you would have "George Michael" in your data, both words in his name could be first name and last name.
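
A minimal sketch of what such a lookup could look like, assuming reference lists of allowed first and last names exist. The datasets `first_names`, `last_names` and `people`, and every variable name in them, are made up for illustration. Only two-word names are handled, and a name like "George Michael" whose words appear in both lists is flagged as ambiguous rather than corrected.

```sas
/* Hypothetical reference lists; in practice they would come from a curated source */
data first_names;
  length fname $20;
  input fname $;
  datalines;
Andy
Christen
Adam
Robert
George
Michael
;
run;

data last_names;
  length lname $20;
  input lname $;
  datalines;
Murray
Stewart
Levine
Dawson
George
Michael
;
run;

/* Small sample of names to check (comma-delimited so names may contain blanks) */
data people;
  infile datalines dsd;
  length name $40;
  input id name $;
  datalines;
1,Andy Murray
2,Stewart Christen
3,Adam Levine
6,Dawson Robert
7,George Michael
;
run;

data flagged;
  length fname lname $20;
  if _n_ = 1 then do;
    /* Load both reference lists into hash objects for fast key lookups */
    declare hash f(dataset: 'first_names');
    f.defineKey('fname');
    f.defineDone();
    declare hash l(dataset: 'last_names');
    l.defineKey('lname');
    l.defineDone();
    call missing(fname, lname);
  end;
  set people;
  length word1 word2 $20 fixed_name $40;
  word1 = scan(name, 1, ' ');
  word2 = scan(name, 2, ' ');
  /* check() returns 0 when the key is found */
  looks_ok      = (f.check(key: word1) = 0 and l.check(key: word2) = 0);
  looks_swapped = (f.check(key: word2) = 0 and l.check(key: word1) = 0);
  ambiguous     = (looks_ok and looks_swapped);
  /* Only swap when the reversed order is the sole plausible reading */
  if looks_swapped and not looks_ok then fixed_name = catx(' ', word2, word1);
  else fixed_name = name;
  drop fname lname;
run;
```

With these sample lists, ids 2 and 6 get a corrected `fixed_name`, while id 7 is left unchanged with `ambiguous = 1`, which is exactly the case the lookup cannot resolve.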

AmitParmar
Obsidian | Level 7

Actually, this was one of the interviewer's questions at BARCLAYS, and they insisted that it can be done and that it's very basic.

Kurt_Bremser
Super User

This is impossible to do. Think of a guy named Paul Carl (just type that into Google, and immediately there's a LinkedIn profile for someone of that name).

There are gazillions of funny first names (e.g. Moon Zappa), and lots of surnames that are also used as first names.

AmitParmar
Obsidian | Level 7

Yeah, right. I wish they had provided me the code.

ballardw
Super User

@AmitParmar wrote:

Actually, this was one of the interviewer's questions at BARCLAYS, and they insisted that it can be done and that it's very basic.


One strongly suspects they would be scrubbing against a current client list, i.e. the "allowed list" that @andreas_lds mentions.

Then it could be doable.

Until they have two clients, one with the name "John Smith" and the other "Smith John" or any similar pairing.
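
A rough sketch of that kind of scrub, assuming a trusted client list is available. The datasets `clients` and `incoming` and their variables are hypothetical. A name counts as swapped only when it is not on the list but its reversed form is; when both orders are on the list, it is reported as ambiguous.

```sas
/* Hypothetical trusted client list */
data clients;
  infile datalines dsd;
  length client_name $40;
  input client_name $;
  datalines;
Andy Murray
Christen Stewart
Robert Dawson
John Smith
Smith John
;
run;

/* Hypothetical incoming records to scrub */
data incoming;
  infile datalines dsd;
  length name $40;
  input id name $;
  datalines;
1,Andy Murray
2,Stewart Christen
6,Dawson Robert
7,Smith John
;
run;

data checked;
  length client_name $40;
  if _n_ = 1 then do;
    /* Load the client list into a hash object keyed on the full name */
    declare hash c(dataset: 'clients');
    c.defineKey('client_name');
    c.defineDone();
    call missing(client_name);
  end;
  set incoming;
  length swapped $40 status $12;
  /* Reverse the two words of the name */
  swapped = catx(' ', scan(name, 2, ' '), scan(name, 1, ' '));
  as_is   = (c.check(key: name) = 0);     /* name found as written   */
  flipped = (c.check(key: swapped) = 0);  /* reversed name was found */
  if as_is and flipped then status = 'ambiguous';
  else if as_is then status = 'ok';
  else if flipped then status = 'swapped';
  else status = 'unknown';
  drop client_name;
run;
```

On this sample, id 1 is `ok`, ids 2 and 6 are `swapped`, and id 7 is `ambiguous` because both "John Smith" and "Smith John" are clients.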

Patrick
Opal | Level 21

The SAS Data Quality Server / DataFlux provides out-of-the-box functionality for splitting names into their components, such as first name, middle name and last name.

The result of such a process will be better than what you can reasonably code for but it will never be perfect (i.e. George Michael and Michael Jordan).

DataFlux uses a QKB (Quality Knowledge Base) provided as part of the product.

 

Using DQ functions like DQPARSE is not that hard, but one also needs to regularly verify the quality of the results and have a data steward role in place for maintaining and updating the QKB (which often gets missed or done badly).
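
As an illustration only, a parse step might look roughly like the sketch below. It assumes SAS Data Quality Server is licensed, a QKB containing the ENUSA locale is installed, and the dataset `a` from the question exists. The `dqsetuploc=` value is a placeholder, and the token names 'Given Name' and 'Family Name' are those of the ENUSA 'Name' parse definition as far as I know; check them against your own QKB.

```sas
/* Load the ENUSA locale from the QKB into memory.          */
/* 'your-qkb-location' is a placeholder, not a real path.   */
%dqload(dqlocale=(ENUSA), dqsetuploc='your-qkb-location');

data parsed;
  set a;
  length parsed_name $200 given family $40;
  /* Parse the full name into a delimited string of tokens */
  parsed_name = dqParse(name, 'Name', 'ENUSA');
  /* Pull individual tokens out of the parsed string */
  given  = dqParseTokenGet(parsed_name, 'Given Name',  'Name', 'ENUSA');
  family = dqParseTokenGet(parsed_name, 'Family Name', 'Name', 'ENUSA');
run;
```

How well the tokens come out for reversed names depends entirely on the QKB, so the results still need the regular review described above.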

 

You can see from the answers given by others that DataFlux and the DQ functions are not that widely used. I guess they could become a bit more common in Viya.

Looks like your interviewers didn't understand that DataFlux is a beast on its own and not just part of foundation SAS.

AmitParmar
Obsidian | Level 7

Thanks
