Ok. This should do it
The overall logic is this:
By appropriately close, I used the Complev Function and simply chose a value that works for this example. Ie here, the distance should be less than ten to be considered 'similar'. You may have to adjust that part.
@fcf Let me know if it works for you 🙂
Code:
data have;
infile datalines missover;
input num_A num_B $ name $ 11-36 birth_date :ddmmyy10. id;
format birth_date ddmmyy10.;
datalines;
1234 abcd M Rita Costa Santos 01/01/2000 1
3333 uvwx M Rita Costa Santos 01/01/2000 1
5678 efgh Maria Rita C Santos 01/01/2000
9101 ijkl Rita Costa Santos 01/01/2000 1
1111 mnop Maria Leonor Santos Silva 02/03/2001
2222 qrst Leonor Santos Silva 02/03/2001 2
4444 yzab Leonor Santos Silva 30/08/1999
;
data want(keep = num_A num_B name birth_date id);
format num_A num_B name birth_date id;
if _N_ = 1 then do;
dcl hash h1 ();
h1.definekey("name", "birth_date");
h1.definedata("i");
h1.definedone();
dcl hash h2 (multidata : "Y");
h2.definekey("birth_date");
h2.definedata("n", "i");
h2.definedone();
do until (z);
set have(rename=(id=i name=n) where = (i)) end = z;
h1.ref();
h2.ref();
maxid = max(maxid, i);
end;
end;
set have;
if id = . then do;
if h1.find() ne 0 then do;
do while (h2.do_over() = 0);
if complev(name, n) < 10 then do;
id = i;
h1.ref(key : n, key : birth_date, data : id);
end;
end;
end;
end;
if id = . then do;
maxid + 1;
id = maxid;
h1.ref(key : name, key : birth_date, data : id);
end;
run;
Result
num_A num_B name birth_date id 1234 abcd M Rita Costa Santos 01/01/2000 1 3333 uvwx M Rita Costa Santos 01/01/2000 1 5678 efgh Maria Rita C Santos 01/01/2000 1 9101 ijkl Rita Costa Santos 01/01/2000 1 1111 mnop Maria Leonor Santos Silva 02/03/2001 2 2222 qrst Leonor Santos Silva 02/03/2001 2 4444 yzab Leonor Santos Silva 30/08/1999 3
@fcf you have to be more specific than "It did not work". What did not work? How did the result differ from what you expect?
My code works regardless of the order on your data. That is not why.
Post a larger sample of your actual data and let me take a look.
I cannot post the actual data. It is private. It did not work - it always returned a different id.
Did you try tweaking the threshold (10) ?
Also, you can simply post a larger example of what your data may look like. Not the actual data.
That way, we can find a solution that works.
Sorry the delay again and thank you for the consideration and help.
So, imagine I have this:
num_A | num_B | name | birth_date | id |
5785 | fbff | João Simões Marques | 12/05/2000 | 7 |
1234 | abcd | M Rita Costa Santos | 01/01/2020 | 1 |
3333 | uvwx | M Rita Costa Santos | 01/01/2020 | 1 |
5678 | efgh | Maria Rita C Santos | 01/01/2020 | |
9101 | ijkl | Rita Costa Santos | 01/01/2020 | 1 |
1111 | mnop | Maria Leonor Santos Silva | 02/03/2001 | 2 |
2222 | qrst | Leonor Santos Silva | 02/03/2001 | |
4444 | yzab | Leonor Santos Silva | 30/08/1999 | |
6565 | afgg | Donald J Trump | 01/01/1960 | |
2423 | sgty | Donald J Trump | 01/01/1960 | |
9876 | hgvb | Pedro Costa Santos | 05/09/1990 | 9 |
7865 | jnbv | Luís Miguel Silva | 05/09/1990 |
And let's try another approach, I want this:
num_A | num_B | name | birth_date | id |
5785 | fbff | João Simões Marques | 12/05/2000 | 7 |
1234 | abcd | M Rita Costa Santos | 01/01/2020 | 1 |
3333 | uvwx | M Rita Costa Santos | 01/01/2020 | 1 |
5678 | efgh | Maria Rita C Santos | 01/01/2020 | 1 |
9101 | ijkl | Rita Costa Santos | 01/01/2020 | 1 |
1111 | mnop | Maria Leonor Santos Silva | 02/03/2001 | 2 |
2222 | qrst | Leonor Santos Silva | 02/03/2001 | 2 |
4444 | yzab | Leonor Santos Silva | 30/08/1999 | 10 |
6565 | afgg | Donald J Trump | 01/01/1960 | 11 |
2423 | sgty | Donald J Trump | 01/01/1960 | 11 |
9876 | hgvb | Pedro Costa Santos | 05/09/1990 | 9 |
7865 | jnbv | Luís Miguel Silva | 05/09/1990 | 12 |
Working on it.
@fcf A small correction to the code gives you whay you want.
data have;
infile datalines missover;
input num_A num_B $ name $ 11-36 birth_date :ddmmyy10. id;
format birth_date ddmmyy10.;
datalines;
5785 fbff João Simões Marques 12/05/2000 7
1234 abcd M Rita Costa Santos 01/01/2020 1
3333 uvwx M Rita Costa Santos 01/01/2020 1
5678 efgh Maria Rita C Santos 01/01/2020
9101 ijkl Rita Costa Santos 01/01/2020 1
1111 mnop Maria Leonor Santos Silva 02/03/2001 2
2222 qrst Leonor Santos Silva 02/03/2001
4444 yzab Leonor Santos Silva 30/08/1999
6565 afgg Donald J Trump 01/01/1960
2423 sgty Donald J Trump 01/01/1960
9876 hgvb Pedro Costa Santos 05/09/1990 9
7865 jnbv Luís Miguel Silva 05/09/1990
;
data want(keep = num_A num_B name birth_date id);
format num_A num_B name birth_date id;
if _N_ = 1 then do;
dcl hash h1 ();
h1.definekey("name", "birth_date");
h1.definedata("i");
h1.definedone();
dcl hash h2 (multidata : "Y");
h2.definekey("birth_date");
h2.definedata("n", "i");
h2.definedone();
do until (z);
set have(rename=(id=i name=n) where = (i)) end = z;
h1.ref();
h2.ref();
maxid = max(maxid, i);
end;
end;
set have;
if id = . then do;
if h1.find() ne 0 then do;
do while (h2.do_over() = 0);
if complev(name, n) < 10 then do;
id = i;
h1.ref(key : n, key : birth_date, data : id);
end;
end;
end;
else id = i;
end;
if id = . then do;
maxid + 1;
id = maxid;
h1.ref(key : name, key : birth_date, data : id);
end;
run;
Result:
num_A num_B name birth_date id 5785 fbff João Simões Marques 12/05/2000 7 1234 abcd M Rita Costa Santos 01/01/2020 1 3333 uvwx M Rita Costa Santos 01/01/2020 1 5678 efgh Maria Rita C Santos 01/01/2020 1 9101 ijkl Rita Costa Santos 01/01/2020 1 1111 mnop Maria Leonor Santos Silva 02/03/2001 2 2222 qrst Leonor Santos Silva 02/03/2001 2 4444 yzab Leonor Santos Silva 30/08/1999 10 6565 afgg Donald J Trump 01/01/1960 11 2423 sgty Donald J Trump 01/01/1960 11 9876 hgvb Pedro Costa Santos 05/09/1990 9 7865 jnbv Luís Miguel Silva 05/09/1990 12
Thank you so much for the help.
Just one more question, if possible, what could I do to also consider, for example, a wrong written name? For example, the same id for 'M Rita Costa Santos' and 'Rita Costa Sants'?
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.