BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
fcf
Fluorite | Level 6 fcf
Fluorite | Level 6
num_Anum_Bnamebirth_dateid
1234abcdM Rita Costa Santos01/01/20001
3333uvwxM Rita Costa Santos01/01/20001
5678efghMaria Rita C Santos01/01/2000 
9101ijklRita Costa Santos01/01/20001
1111mnopMaria Leonor Santos Silva02/03/2001 
2222qrstLeonor Santos Silva02/03/20012
4444yzabLeonor Santos Silva30/08/1999 

 

Imagine I have this table, but in a large scale. I want to find similar and equal values in the name column and if they are similar/equal, go to the column birth_date and see if they are equal. If yes, create the same id for this cases. So i want the final output to be:

 

num_Anum_Bnamebirth_dateid
1234abcdM Rita Costa Santos01/01/20001
3333uvwxM Rita Costa Santos01/01/20001
5678efghMaria Rita C Santos01/01/20001
9101ijklRita Costa Santos01/01/20001
1111mnopMaria Leonor Santos Silva02/03/20012
2222qrstLeonor Santos Silva02/03/20012
4444yzabLeonor Santos Silva30/08/19993

 

Thank you in advance.

1 ACCEPTED SOLUTION

Accepted Solutions
PeterClemmensen
Tourmaline | Level 20

@fcf A small correction to the code gives you whay you want.

 

data have;
infile datalines missover;
input num_A num_B $ name $ 11-36 birth_date :ddmmyy10. id;
format birth_date ddmmyy10.;
datalines;
5785 fbff João Simões Marques        12/05/2000 7
1234 abcd M Rita Costa Santos        01/01/2020 1
3333 uvwx M Rita Costa Santos        01/01/2020 1
5678 efgh Maria Rita C Santos        01/01/2020  
9101 ijkl Rita Costa Santos          01/01/2020 1
1111 mnop Maria Leonor Santos Silva  02/03/2001 2
2222 qrst Leonor Santos Silva        02/03/2001  
4444 yzab Leonor Santos Silva        30/08/1999  
6565 afgg Donald J Trump             01/01/1960  
2423 sgty Donald J Trump             01/01/1960  
9876 hgvb Pedro Costa Santos         05/09/1990 9
7865 jnbv Luís Miguel Silva          05/09/1990  
;


data want(keep = num_A num_B name birth_date id);
   format num_A num_B name birth_date id;
   if _N_ = 1 then do;
      dcl hash h1 ();
      h1.definekey("name", "birth_date");
      h1.definedata("i");
      h1.definedone();

      dcl hash h2 (multidata : "Y");
      h2.definekey("birth_date");
      h2.definedata("n", "i");
      h2.definedone();

      do until (z);
         set have(rename=(id=i name=n) where = (i)) end = z;
         h1.ref();
         h2.ref();
         maxid = max(maxid, i);
      end;
   end;

   set have;

   if id = . then do;
      if h1.find() ne 0 then do;
         do while (h2.do_over() = 0);
            if complev(name, n) < 10 then do;
               id = i;
               h1.ref(key : n, key : birth_date, data : id);
            end;
         end;
      end;
      else id = i;
   end;

   if id = . then do;
      maxid + 1;
      id = maxid;
      h1.ref(key : name, key : birth_date, data : id);
   end;

run;

 

Result:

 

num_A  num_B  name                       birth_date  id 
5785   fbff   João Simões Marques        12/05/2000  7 
1234   abcd   M Rita Costa Santos        01/01/2020  1 
3333   uvwx   M Rita Costa Santos        01/01/2020  1 
5678   efgh   Maria Rita C Santos        01/01/2020  1 
9101   ijkl   Rita Costa Santos          01/01/2020  1 
1111   mnop   Maria Leonor Santos Silva  02/03/2001  2 
2222   qrst   Leonor Santos Silva        02/03/2001  2 
4444   yzab   Leonor Santos Silva        30/08/1999  10 
6565   afgg   Donald J Trump             01/01/1960  11 
2423   sgty   Donald J Trump             01/01/1960  11 
9876   hgvb   Pedro Costa Santos         05/09/1990  9 
7865   jnbv   Luís Miguel Silva          05/09/1990  12 

 

 

View solution in original post

24 REPLIES 24
PaigeMiller
Diamond | Level 26

Can you describe further what you mean by "similar"?

--
Paige Miller
fcf
Fluorite | Level 6 fcf
Fluorite | Level 6
Please read what I answered to draycut, it was the same question. Thank you!
PeterClemmensen
Tourmaline | Level 20

What does 'similar' mean here? That eg the spelling distance is not far from each other?

 

Also, is the data sorted by these 'likewise' names?

fcf
Fluorite | Level 6 fcf
Fluorite | Level 6
Similar, like in the examples. For example, the birth_date is the same for these:
- Maria Rita Costa Santos

- M Rita Costa Santos, the M is an abbreviation for Maria

- Rita Costa Santos - because Maria is such a popular name, the person chooses to only say the second name, which is less common.

However, these are all the same person, but just wrote the name in different ways. However, in the last two examples, people can have the same name and be different people (check the birth date).
PeterClemmensen
Tourmaline | Level 20

And you already have the ID for some of the obs? But not all, correct?

fcf
Fluorite | Level 6 fcf
Fluorite | Level 6
yes, not all 🙂 the point is to have the ids according to the same people, so I can group the data correctly
PeterClemmensen
Tourmaline | Level 20

Ok. Is your data sorted like this? Or could M Rita Costa Santos have an observation at the bottom?

fcf
Fluorite | Level 6 fcf
Fluorite | Level 6

It is not sorted. I just showed this so it would be easier to understand. I have like 12 000 records of names.

PeterClemmensen
Tourmaline | Level 20

Can a person have no ID with a pre-assigned value?

PeterClemmensen
Tourmaline | Level 20

The posted data you have is this:

 

data have;
infile datalines missover;
input num_A num_B $ name $ 11-36 birth_date :ddmmyy10. id;
format birth_date ddmmyy10.;
datalines;
1234 abcd M Rita Costa Santos        01/01/2000 1 
3333 uvwx M Rita Costa Santos        01/01/2000 1 
5678 efgh Maria Rita C Santos        01/01/2000   
9101 ijkl Rita Costa Santos          01/01/2000 1 
1111 mnop Maria Leonor Santos Silva  02/03/2001   
2222 qrst Leonor Santos Silva        02/03/2001 2 
4444 yzab Leonor Santos Silva        30/08/1999   
;

Can a situation arise, where a person in your data have no ID attached to any of its observations? See the two last obs in the data below (no ID)

 

data have;
infile datalines missover;
input num_A num_B $ name $ 11-36 birth_date :ddmmyy10. id;
format birth_date ddmmyy10.;
datalines;
1234 abcd M Rita Costa Santos        01/01/2000 1 
3333 uvwx M Rita Costa Santos        01/01/2000 1 
5678 efgh Maria Rita C Santos        01/01/2000   
9101 ijkl Rita Costa Santos          01/01/2000 1 
1111 mnop Maria Leonor Santos Silva  02/03/2001   
2222 qrst Leonor Santos Silva        02/03/2001 2 
4444 yzab Leonor Santos Silva        30/08/1999   
xxxx xxxx Some Name Here             01/01/1960   
xxxx xxxx Some Name Here             01/01/1960   
;
fcf
Fluorite | Level 6 fcf
Fluorite | Level 6
The only blank field possible would be the ID
PeterClemmensen
Tourmaline | Level 20

I will try to make myself more clear. Consider the data below. This is the same data that you posted plus an additional person.

 

While the two people in your initial data both have a pre-assigned ID (the one we want to hit for the remaining obs for those people), "Donald J Trump" does not. He is obviously not equal to either id=1 or id=2. 

 

Can this situation happen in your data? And then what?

 

data have;
infile datalines missover;
input num_A num_B $ name $ 11-36 birth_date :ddmmyy10. id;
format birth_date ddmmyy10.;
datalines;
1234 abcd M Rita Costa Santos        01/01/2000 1 
3333 uvwx M Rita Costa Santos        01/01/2000 1 
5678 efgh Maria Rita C Santos        01/01/2000   
9101 ijkl Rita Costa Santos          01/01/2000 1 
1111 mnop Maria Leonor Santos Silva  02/03/2001   
2222 qrst Leonor Santos Silva        02/03/2001 2 
4444 yzab Leonor Santos Silva        30/08/1999   
xxxx xxxx Donald J Trump             01/01/1960   
xxxx xxxx Donald J Trump             01/01/1960   
;
fcf
Fluorite | Level 6 fcf
Fluorite | Level 6
Ah yes, it can happen. That's why I gave the last example of a Leonor born in another date, which will have an id 3. The Donald Trump should be the id 4 for example.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 24 replies
  • 7509 views
  • 0 likes
  • 3 in conversation