num_A | num_B | name | birth_date | id |
1234 | abcd | M Rita Costa Santos | 01/01/2000 | 1 |
3333 | uvwx | M Rita Costa Santos | 01/01/2000 | 1 |
5678 | efgh | Maria Rita C Santos | 01/01/2000 | |
9101 | ijkl | Rita Costa Santos | 01/01/2000 | 1 |
1111 | mnop | Maria Leonor Santos Silva | 02/03/2001 | |
2222 | qrst | Leonor Santos Silva | 02/03/2001 | 2 |
4444 | yzab | Leonor Santos Silva | 30/08/1999 |
Imagine I have this table, but in a large scale. I want to find similar and equal values in the name column and if they are similar/equal, go to the column birth_date and see if they are equal. If yes, create the same id for this cases. So i want the final output to be:
num_A | num_B | name | birth_date | id |
1234 | abcd | M Rita Costa Santos | 01/01/2000 | 1 |
3333 | uvwx | M Rita Costa Santos | 01/01/2000 | 1 |
5678 | efgh | Maria Rita C Santos | 01/01/2000 | 1 |
9101 | ijkl | Rita Costa Santos | 01/01/2000 | 1 |
1111 | mnop | Maria Leonor Santos Silva | 02/03/2001 | 2 |
2222 | qrst | Leonor Santos Silva | 02/03/2001 | 2 |
4444 | yzab | Leonor Santos Silva | 30/08/1999 | 3 |
Thank you in advance.
@fcf A small correction to the code gives you whay you want.
data have;
infile datalines missover;
input num_A num_B $ name $ 11-36 birth_date :ddmmyy10. id;
format birth_date ddmmyy10.;
datalines;
5785 fbff João Simões Marques 12/05/2000 7
1234 abcd M Rita Costa Santos 01/01/2020 1
3333 uvwx M Rita Costa Santos 01/01/2020 1
5678 efgh Maria Rita C Santos 01/01/2020
9101 ijkl Rita Costa Santos 01/01/2020 1
1111 mnop Maria Leonor Santos Silva 02/03/2001 2
2222 qrst Leonor Santos Silva 02/03/2001
4444 yzab Leonor Santos Silva 30/08/1999
6565 afgg Donald J Trump 01/01/1960
2423 sgty Donald J Trump 01/01/1960
9876 hgvb Pedro Costa Santos 05/09/1990 9
7865 jnbv Luís Miguel Silva 05/09/1990
;
data want(keep = num_A num_B name birth_date id);
format num_A num_B name birth_date id;
if _N_ = 1 then do;
dcl hash h1 ();
h1.definekey("name", "birth_date");
h1.definedata("i");
h1.definedone();
dcl hash h2 (multidata : "Y");
h2.definekey("birth_date");
h2.definedata("n", "i");
h2.definedone();
do until (z);
set have(rename=(id=i name=n) where = (i)) end = z;
h1.ref();
h2.ref();
maxid = max(maxid, i);
end;
end;
set have;
if id = . then do;
if h1.find() ne 0 then do;
do while (h2.do_over() = 0);
if complev(name, n) < 10 then do;
id = i;
h1.ref(key : n, key : birth_date, data : id);
end;
end;
end;
else id = i;
end;
if id = . then do;
maxid + 1;
id = maxid;
h1.ref(key : name, key : birth_date, data : id);
end;
run;
Result:
num_A num_B name birth_date id 5785 fbff João Simões Marques 12/05/2000 7 1234 abcd M Rita Costa Santos 01/01/2020 1 3333 uvwx M Rita Costa Santos 01/01/2020 1 5678 efgh Maria Rita C Santos 01/01/2020 1 9101 ijkl Rita Costa Santos 01/01/2020 1 1111 mnop Maria Leonor Santos Silva 02/03/2001 2 2222 qrst Leonor Santos Silva 02/03/2001 2 4444 yzab Leonor Santos Silva 30/08/1999 10 6565 afgg Donald J Trump 01/01/1960 11 2423 sgty Donald J Trump 01/01/1960 11 9876 hgvb Pedro Costa Santos 05/09/1990 9 7865 jnbv Luís Miguel Silva 05/09/1990 12
Can you describe further what you mean by "similar"?
What does 'similar' mean here? That eg the spelling distance is not far from each other?
Also, is the data sorted by these 'likewise' names?
And you already have the ID for some of the obs? But not all, correct?
Ok. Is your data sorted like this? Or could M Rita Costa Santos have an observation at the bottom?
It is not sorted. I just showed this so it would be easier to understand. I have like 12 000 records of names.
Can a person have no ID with a pre-assigned value?
The posted data you have is this:
data have;
infile datalines missover;
input num_A num_B $ name $ 11-36 birth_date :ddmmyy10. id;
format birth_date ddmmyy10.;
datalines;
1234 abcd M Rita Costa Santos 01/01/2000 1
3333 uvwx M Rita Costa Santos 01/01/2000 1
5678 efgh Maria Rita C Santos 01/01/2000
9101 ijkl Rita Costa Santos 01/01/2000 1
1111 mnop Maria Leonor Santos Silva 02/03/2001
2222 qrst Leonor Santos Silva 02/03/2001 2
4444 yzab Leonor Santos Silva 30/08/1999
;
Can a situation arise, where a person in your data have no ID attached to any of its observations? See the two last obs in the data below (no ID)
data have;
infile datalines missover;
input num_A num_B $ name $ 11-36 birth_date :ddmmyy10. id;
format birth_date ddmmyy10.;
datalines;
1234 abcd M Rita Costa Santos 01/01/2000 1
3333 uvwx M Rita Costa Santos 01/01/2000 1
5678 efgh Maria Rita C Santos 01/01/2000
9101 ijkl Rita Costa Santos 01/01/2000 1
1111 mnop Maria Leonor Santos Silva 02/03/2001
2222 qrst Leonor Santos Silva 02/03/2001 2
4444 yzab Leonor Santos Silva 30/08/1999
xxxx xxxx Some Name Here 01/01/1960
xxxx xxxx Some Name Here 01/01/1960
;
I will try to make myself more clear. Consider the data below. This is the same data that you posted plus an additional person.
While the two people in your initial data both have a pre-assigned ID (the one we want to hit for the remaining obs for those people), "Donald J Trump" does not. He is obviously not equal to either id=1 or id=2.
Can this situation happen in your data? And then what?
data have;
infile datalines missover;
input num_A num_B $ name $ 11-36 birth_date :ddmmyy10. id;
format birth_date ddmmyy10.;
datalines;
1234 abcd M Rita Costa Santos 01/01/2000 1
3333 uvwx M Rita Costa Santos 01/01/2000 1
5678 efgh Maria Rita C Santos 01/01/2000
9101 ijkl Rita Costa Santos 01/01/2000 1
1111 mnop Maria Leonor Santos Silva 02/03/2001
2222 qrst Leonor Santos Silva 02/03/2001 2
4444 yzab Leonor Santos Silva 30/08/1999
xxxx xxxx Donald J Trump 01/01/1960
xxxx xxxx Donald J Trump 01/01/1960
;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.