data a;
/* The & modifier allows single blanks inside name; the value ends at two or more blanks */
input id name & $40. age salary;
cards;
1 Andy Murray  28 1500000
2 Stewart Christen  31 2500000
3 Adam Levine  35 800000
4 Bill White  40 1500000
5 Army Grey  20 300000
6 Dawson Robert  30 500000
;
run;
Let's say this is the data and there are 10k such entries. In records 2 and 6 the first and last names are swapped: the actual names are Christen Stewart and Robert Dawson.
How can we identify which names are swapped, and how can we correct them?
This task is unsolvable without having a list of "allowed" first names and last names.
Suppose you have "George Michael" in your data: both words in that name could be either a first name or a last name.
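To make the "allowed list" idea concrete, here is a minimal sketch. The tables first_names and last_names and their contents are hypothetical, made up purely for illustration; real lists would be far larger. It assumes the dataset a from the question, two-word names only, and exact, case-sensitive matching.

```sas
/* Hypothetical reference lists: assumptions for illustration only */
data first_names;
  input fname :$20.;
  cards;
Andy
Adam
Bill
Army
Christen
Robert
George
Michael
;
run;

data last_names;
  input lname :$20.;
  cards;
Murray
Levine
White
Grey
Stewart
Dawson
George
Michael
;
run;

data checked;
  length w1 w2 fname lname $20 status $9;
  if _n_ = 1 then do;
    /* Load both lists into hash tables once, for fast lookups */
    declare hash f(dataset: 'first_names');
    f.defineKey('fname');
    f.defineDone();
    declare hash l(dataset: 'last_names');
    l.defineKey('lname');
    l.defineDone();
    call missing(fname, lname);
  end;
  set a;
  w1 = scan(name, 1, ' ');
  w2 = scan(name, 2, ' ');
  /* Does the name fit the lists as written, and does it fit when flipped? */
  as_is   = (f.check(key: w1) = 0) and (l.check(key: w2) = 0);
  swapped = (f.check(key: w2) = 0) and (l.check(key: w1) = 0);
  if as_is and swapped then status = 'ambiguous';
  else if swapped then do;
    status = 'swapped';
    name = catx(' ', w2, w1);   /* correct only the unambiguous cases */
  end;
  else if as_is then status = 'ok';
  else status = 'unknown';
  drop fname lname as_is swapped;
run;
```

With these made-up lists, records 2 and 6 get flagged and flipped, while a name like "George Michael" would come out as 'ambiguous' and has to be left alone. The result is only ever as good as the lists.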
Actually, this was one of the interviewers' questions at Barclays, and they insisted that they do this themselves and that it's very basic.
This is impossible to do. Think of a guy named Paul Carl (just type that into Google, and immediately there's a LinkedIn profile for someone of that name).
There are gazillions of funny first names (e.g. Moon Zappa), and lots of surnames that are also used as first names.
@AmitParmar wrote:
Actually, this was one of the interviewers' questions at Barclays, and they insisted that they do this themselves and that it's very basic.
Let them show you their code, then feed it names that will make it fail, then charge them for revealing the problem.
Yeah, right. I wish they had provided me the code.
@AmitParmar wrote:
Actually, this was one of the interviewers' questions at Barclays, and they insisted that they do this themselves and that it's very basic.
One strongly suspects they would be scrubbing against a current client list, i.e. the "allowed list" that @andreas_lds mentions.
Then it could be doable.
Until they have two clients, one with the name "John Smith" and the other "Smith John" or any similar pairing.
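If the scrubbing really is done against a client list, a rough sketch could look like the following. The table clients and its column client_name are hypothetical names, and the sketch assumes each client name appears only once in that table (otherwise the joins would duplicate rows). It also assumes the dataset a from the question and two-word names.

```sas
/* Hypothetical client master list: an assumption for illustration only */
data clients;
  input client_name & $40.;
  cards;
Andy Murray
Christen Stewart
Adam Levine
Bill White
Army Grey
Robert Dawson
;
run;

proc sql;
  create table client_check as
  select t.id,
         t.name,
         t.flipped,
         case
           /* both orders exist as clients: cannot decide automatically */
           when c1.client_name is not null and c2.client_name is not null then 'ambiguous'
           /* only the flipped order is a known client */
           when c2.client_name is not null then 'swapped'
           when c1.client_name is not null then 'ok'
           else 'unknown'
         end as status
  from (select id,
               name,
               catx(' ', scan(name, 2, ' '), scan(name, 1, ' ')) as flipped length=40
        from a) as t
       left join clients as c1 on t.name    = c1.client_name
       left join clients as c2 on t.flipped = c2.client_name
  order by t.id;
quit;
```

The 'ambiguous' branch is exactly the "John Smith" / "Smith John" case: if both are clients, no lookup can tell you which one a given row was meant to be.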
The SAS Data Quality Server / DataFlux provides OOTB functionality for splitting names into their components, such as first name, middle name and last name.
The result of such a process will be better than what you can reasonably code yourself, but it will never be perfect (e.g. George Michael and Michael Jordan).
DataFlux uses a QKB (Quality Knowledge Base) provided as part of the product.
Using DQ functions like DQPARSE is not that hard, BUT one also needs to regularly verify the quality of the results and have some data steward role in place for maintaining and updating the QKB (which often gets missed or done badly).
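For reference, a call to the DQ functions might look roughly like this. It is only a sketch: it assumes SAS Data Quality Server is licensed, a QKB with the ENUSA locale is installed, and the dataset a from the question exists. The DQSETUPLOC path is a placeholder, and the token names ('Given Name', 'Family Name') depend on the QKB's Name parse definition, so check them against your own QKB.

```sas
/* Assumption: the path below is a placeholder, not a real QKB location */
%dqload(dqlocale=(ENUSA), dqsetuploc='/path/to/qkb');

data parsed;
  set a;
  length parsed_name $200 given family $40;
  /* Parse the full name into a delimited string of tokens */
  parsed_name = dqParse(name, 'Name', 'ENUSA');
  /* Pull individual tokens out of the parsed string */
  given  = dqParseTokenGet(parsed_name, 'Given Name',  'Name', 'ENUSA');
  family = dqParseTokenGet(parsed_name, 'Family Name', 'Name', 'ENUSA');
run;
```

What ends up in given and family for a name like "Stewart Christen" depends entirely on the QKB, which is why the results still need to be reviewed.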
You can see from the answers given by others that DataFlux and the DQ functions are not that widely used. I guess they could become a bit more common in Viya.
Looks like your interviewers didn't understand that DataFlux is a beast on its own and not just part of foundation SAS.
Thanks