BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
fcf
Fluorite | Level 6 fcf
Fluorite | Level 6
Another way is to create another column with all the IDs, instead of putting the id in the same column
PeterClemmensen
Tourmaline | Level 20

Ok. This should do it

 

The overall logic is this:

 

  • Load the observations with id's into two lookup tables. One keyed by name and birth_date. The other keyed by just birth_date.
  • Read all data sequentially.
  • If ID is missing, look in the first table h1 for an exact match on name and birth_date. If not found, look in the second table h2 on all entries with the desired birth_date. If one of the names in h2 is appropriately close to the original name, set the ID to the lookup value of ID. 
  • Add the 'new' name to h2.
  • If Id is still missing and not filled by either lookup above, set ID to (maximum value of ID + 1).

 

By appropriately close, I used the Complev Function and simply chose a value that works for this example. Ie here, the distance should be less than ten to be considered 'similar'. You may have to adjust that part.

 

@fcf Let me know if it works for you 🙂

 

Code:

 

data have;
infile datalines missover;
input num_A num_B $ name $ 11-36 birth_date :ddmmyy10. id;
format birth_date ddmmyy10.;
datalines;
1234 abcd M Rita Costa Santos        01/01/2000 1 
3333 uvwx M Rita Costa Santos        01/01/2000 1 
5678 efgh Maria Rita C Santos        01/01/2000   
9101 ijkl Rita Costa Santos          01/01/2000 1 
1111 mnop Maria Leonor Santos Silva  02/03/2001   
2222 qrst Leonor Santos Silva        02/03/2001 2 
4444 yzab Leonor Santos Silva        30/08/1999   
;

data want(keep = num_A num_B name birth_date id);
   format num_A num_B name birth_date id;
   if _N_ = 1 then do;
      dcl hash h1 ();
      h1.definekey("name", "birth_date");
      h1.definedata("i");
      h1.definedone();

      dcl hash h2 (multidata : "Y");
      h2.definekey("birth_date");
      h2.definedata("n", "i");
      h2.definedone();

      do until (z);
         set have(rename=(id=i name=n) where = (i)) end = z;
         h1.ref();
         h2.ref();
         maxid = max(maxid, i);
      end;
   end;

   set have;

   if id = . then do;
      if h1.find() ne 0 then do;
         do while (h2.do_over() = 0);
            if complev(name, n) < 10 then do;
               id = i;
               h1.ref(key : n, key : birth_date, data : id);
            end;
         end;
      end;
   end;

   if id = . then do;
      maxid + 1;
      id = maxid;
      h1.ref(key : name, key : birth_date, data : id);
   end;

run;

 

Result

num_A num_B name                       birth_date  id 
1234  abcd  M Rita Costa Santos        01/01/2000  1 
3333  uvwx  M Rita Costa Santos        01/01/2000  1 
5678  efgh  Maria Rita C Santos        01/01/2000  1 
9101  ijkl  Rita Costa Santos          01/01/2000  1 
1111  mnop  Maria Leonor Santos Silva  02/03/2001  2 
2222  qrst  Leonor Santos Silva        02/03/2001  2 
4444  yzab  Leonor Santos Silva        30/08/1999  3 

 

fcf
Fluorite | Level 6 fcf
Fluorite | Level 6
Hi draycut! I'm sorry for taking so long to reply. It did not worked. Maybe because I may not have the same people in the following records(?) So the names might be scattered all over the table. I'm guessing that's why it is not working... I don't know...
PeterClemmensen
Tourmaline | Level 20

@fcf you have to be more specific than "It did not work". What did not work? How did the result differ from what you expect?

 

My code works regardless of the order on your data. That is not why.

 

Post a larger sample of your actual data and let me take a look.

fcf
Fluorite | Level 6 fcf
Fluorite | Level 6

I cannot post the actual data. It is private. It did not work - it always returned a different id.

PeterClemmensen
Tourmaline | Level 20

Did you try tweaking the threshold (10) ?

 

Also, you can simply post a larger example of what your data may look like. Not the actual data.

 

That way, we can find a solution that works.

fcf
Fluorite | Level 6 fcf
Fluorite | Level 6

Sorry the delay again and thank you for the consideration and help.

 

So, imagine I have this:

num_Anum_Bnamebirth_dateid
5785fbffJoão Simões Marques12/05/20007
1234abcdM Rita Costa Santos01/01/20201
3333uvwxM Rita Costa Santos01/01/20201
5678efghMaria Rita C Santos01/01/2020 
9101ijklRita Costa Santos01/01/20201
1111mnopMaria Leonor Santos Silva02/03/20012
2222qrstLeonor Santos Silva02/03/2001 
4444yzabLeonor Santos Silva30/08/1999 
6565afggDonald J Trump01/01/1960 
2423sgtyDonald J Trump01/01/1960 
9876hgvbPedro Costa Santos05/09/19909
7865jnbvLuís Miguel Silva05/09/1990 

 

And let's try another approach, I want this:

 

num_Anum_Bnamebirth_dateid
5785fbffJoão Simões Marques12/05/20007
1234abcdM Rita Costa Santos01/01/20201
3333uvwxM Rita Costa Santos01/01/20201
5678efghMaria Rita C Santos01/01/20201
9101ijklRita Costa Santos01/01/20201
1111mnopMaria Leonor Santos Silva02/03/20012
2222qrstLeonor Santos Silva02/03/20012
4444yzabLeonor Santos Silva30/08/199910
6565afggDonald J Trump01/01/196011
2423sgtyDonald J Trump01/01/196011
9876hgvbPedro Costa Santos05/09/19909
7865jnbvLuís Miguel Silva05/09/199012
PeterClemmensen
Tourmaline | Level 20

@fcf A small correction to the code gives you whay you want.

 

data have;
infile datalines missover;
input num_A num_B $ name $ 11-36 birth_date :ddmmyy10. id;
format birth_date ddmmyy10.;
datalines;
5785 fbff João Simões Marques        12/05/2000 7
1234 abcd M Rita Costa Santos        01/01/2020 1
3333 uvwx M Rita Costa Santos        01/01/2020 1
5678 efgh Maria Rita C Santos        01/01/2020  
9101 ijkl Rita Costa Santos          01/01/2020 1
1111 mnop Maria Leonor Santos Silva  02/03/2001 2
2222 qrst Leonor Santos Silva        02/03/2001  
4444 yzab Leonor Santos Silva        30/08/1999  
6565 afgg Donald J Trump             01/01/1960  
2423 sgty Donald J Trump             01/01/1960  
9876 hgvb Pedro Costa Santos         05/09/1990 9
7865 jnbv Luís Miguel Silva          05/09/1990  
;


data want(keep = num_A num_B name birth_date id);
   format num_A num_B name birth_date id;
   if _N_ = 1 then do;
      dcl hash h1 ();
      h1.definekey("name", "birth_date");
      h1.definedata("i");
      h1.definedone();

      dcl hash h2 (multidata : "Y");
      h2.definekey("birth_date");
      h2.definedata("n", "i");
      h2.definedone();

      do until (z);
         set have(rename=(id=i name=n) where = (i)) end = z;
         h1.ref();
         h2.ref();
         maxid = max(maxid, i);
      end;
   end;

   set have;

   if id = . then do;
      if h1.find() ne 0 then do;
         do while (h2.do_over() = 0);
            if complev(name, n) < 10 then do;
               id = i;
               h1.ref(key : n, key : birth_date, data : id);
            end;
         end;
      end;
      else id = i;
   end;

   if id = . then do;
      maxid + 1;
      id = maxid;
      h1.ref(key : name, key : birth_date, data : id);
   end;

run;

 

Result:

 

num_A  num_B  name                       birth_date  id 
5785   fbff   João Simões Marques        12/05/2000  7 
1234   abcd   M Rita Costa Santos        01/01/2020  1 
3333   uvwx   M Rita Costa Santos        01/01/2020  1 
5678   efgh   Maria Rita C Santos        01/01/2020  1 
9101   ijkl   Rita Costa Santos          01/01/2020  1 
1111   mnop   Maria Leonor Santos Silva  02/03/2001  2 
2222   qrst   Leonor Santos Silva        02/03/2001  2 
4444   yzab   Leonor Santos Silva        30/08/1999  10 
6565   afgg   Donald J Trump             01/01/1960  11 
2423   sgty   Donald J Trump             01/01/1960  11 
9876   hgvb   Pedro Costa Santos         05/09/1990  9 
7865   jnbv   Luís Miguel Silva          05/09/1990  12 

 

 

fcf
Fluorite | Level 6 fcf
Fluorite | Level 6

Thank you so much for the help.

 

Just one more question, if possible, what could I do to also consider, for example, a wrong written name? For example, the same id for 'M Rita Costa Santos' and 'Rita Costa Sants'?

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 24 replies
  • 4632 views
  • 0 likes
  • 3 in conversation