Solved: Re: How to find similar and equal values in one column and then another column

fcf · Posted 11-04-2020 07:15 AM

num_A	num_B	name	birth_date	id
1234	abcd	M Rita Costa Santos	01/01/2000	1
3333	uvwx	M Rita Costa Santos	01/01/2000	1
5678	efgh	Maria Rita C Santos	01/01/2000
9101	ijkl	Rita Costa Santos	01/01/2000	1
1111	mnop	Maria Leonor Santos Silva	02/03/2001
2222	qrst	Leonor Santos Silva	02/03/2001	2
4444	yzab	Leonor Santos Silva	30/08/1999

Imagine I have this table, but on a much larger scale. I want to find similar or equal values in the name column and, where they are similar or equal, check whether the birth_date values are also equal. If they are, those rows should get the same id. So I want the final output to be:

num_A	num_B	name	birth_date	id
1234	abcd	M Rita Costa Santos	01/01/2000	1
3333	uvwx	M Rita Costa Santos	01/01/2000	1
5678	efgh	Maria Rita C Santos	01/01/2000	1
9101	ijkl	Rita Costa Santos	01/01/2000	1
1111	mnop	Maria Leonor Santos Silva	02/03/2001	2
2222	qrst	Leonor Santos Silva	02/03/2001	2
4444	yzab	Leonor Santos Silva	30/08/1999	3
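
The rule described above (similar names, then equal birth dates) can be sketched as a self-join. This is only a rough illustration, not a definitive method: it assumes num_A is unique per row, uses COMPLEV (Levenshtein edit distance) as the similarity measure, and picks a cutoff of 10 arbitrarily. The table name candidate_pairs and the comma-delimited input are made up for the example.

```sas
/* Sketch: the question's table, read as comma-delimited data for brevity */
data have;
   infile datalines dsd dlm=',' truncover;
   length num_A 8 num_B $ 4 name $ 26;
   input num_A num_B name birth_date :ddmmyy10. id;
   format birth_date ddmmyy10.;
   datalines;
1234,abcd,M Rita Costa Santos,01/01/2000,1
3333,uvwx,M Rita Costa Santos,01/01/2000,1
5678,efgh,Maria Rita C Santos,01/01/2000,
9101,ijkl,Rita Costa Santos,01/01/2000,1
1111,mnop,Maria Leonor Santos Silva,02/03/2001,
2222,qrst,Leonor Santos Silva,02/03/2001,2
4444,yzab,Leonor Santos Silva,30/08/1999,
;

/* Pair up rows that share a birth date and have close names.
   The cutoff of 10 is an assumption; tune it on real data. */
proc sql;
   create table candidate_pairs as
   select a.num_A as num_A_1,
          b.num_A as num_A_2,
          a.name  as name_1,
          b.name  as name_2,
          a.birth_date,
          complev(a.name, b.name) as distance
   from have as a, have as b
   where a.birth_date = b.birth_date
     and a.num_A < b.num_A
     and calculated distance < 10;
quit;
```

Row 4444 cannot appear in any pair, because no other row shares its birth date. A self-join compares every row with every other row, so on 12 000 records it may be slow.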

Thank you in advance.

PeterClemmensen · Posted 12-08-2020 07:15 AM

@fcf A small correction to the code gives you what you want.

data have;
infile datalines missover;
input num_A num_B $ name $ 11-36 birth_date :ddmmyy10. id;
format birth_date ddmmyy10.;
datalines;
5785 fbff João Simões Marques        12/05/2000 7
1234 abcd M Rita Costa Santos        01/01/2020 1
3333 uvwx M Rita Costa Santos        01/01/2020 1
5678 efgh Maria Rita C Santos        01/01/2020  
9101 ijkl Rita Costa Santos          01/01/2020 1
1111 mnop Maria Leonor Santos Silva  02/03/2001 2
2222 qrst Leonor Santos Silva        02/03/2001  
4444 yzab Leonor Santos Silva        30/08/1999  
6565 afgg Donald J Trump             01/01/1960  
2423 sgty Donald J Trump             01/01/1960  
9876 hgvb Pedro Costa Santos         05/09/1990 9
7865 jnbv Luís Miguel Silva          05/09/1990  
;


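data want(keep = num_A num_B name birth_date id);
   format num_A num_B name birth_date id;
   if _N_ = 1 then do;
      /* h1: exact lookup of an ID by name and birth date */
      dcl hash h1 ();
      h1.definekey("name", "birth_date");
      h1.definedata("i");
      h1.definedone();

      /* h2: known names (n) and IDs (i), keyed by birth date */
      dcl hash h2 (multidata : "Y");
      h2.definekey("birth_date");
      h2.definedata("n", "i");
      h2.definedone();

      /* Read the observations that already have an ID and track the largest ID */
      do until (z);
         set have(rename=(id=i name=n) where = (i)) end = z;
         h1.ref();
         h2.ref();
         maxid = max(maxid, i);
      end;
   end;

   set have;

   /* Missing ID: try an exact match first, then a fuzzy match on names with the same birth date */
   if id = . then do;
      if h1.find() ne 0 then do;
         do while (h2.do_over() = 0);
            /* COMPLEV returns the Levenshtein edit distance between the two names */
            if complev(name, n) < 10 then do;
               id = i;
               h1.ref(key : n, key : birth_date, data : id);
            end;
         end;
      end;
      else id = i;
   end;

   /* Still missing: treat as a new person and assign the next unused ID */
   if id = . then do;
      maxid + 1;
      id = maxid;
      h1.ref(key : name, key : birth_date, data : id);
   end;

run;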

Result:

num_A  num_B  name                       birth_date  id 
5785   fbff   João Simões Marques        12/05/2000  7 
1234   abcd   M Rita Costa Santos        01/01/2020  1 
3333   uvwx   M Rita Costa Santos        01/01/2020  1 
5678   efgh   Maria Rita C Santos        01/01/2020  1 
9101   ijkl   Rita Costa Santos          01/01/2020  1 
1111   mnop   Maria Leonor Santos Silva  02/03/2001  2 
2222   qrst   Leonor Santos Silva        02/03/2001  2 
4444   yzab   Leonor Santos Silva        30/08/1999  10 
6565   afgg   Donald J Trump             01/01/1960  11 
2423   sgty   Donald J Trump             01/01/1960  11 
9876   hgvb   Pedro Costa Santos         05/09/1990  9 
7865   jnbv   Luís Miguel Silva          05/09/1990  12
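
One way to sanity-check a result like this is to confirm that no id ends up attached to more than one birth date. The query below is a minimal sketch under the assumption that the want data set from the step above exists; the column name n_birth_dates is made up for the example.

```sas
/* Sketch: list any id that is attached to more than one birth date */
proc sql;
   select id,
          count(distinct birth_date) as n_birth_dates
   from want
   group by id
   having count(distinct birth_date) > 1;
quit;
```

For the result shown above, this should return no rows, since every id maps to a single birth date.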

The DATA to DATA Step Macro
Blog: SASnrd

PaigeMiller · Posted 11-04-2020 07:27 AM

Can you describe further what you mean by "similar"?

--
Paige Miller

fcf · Posted 11-04-2020 07:45 AM

Please read what I answered to draycut, it was the same question. Thank you!

PeterClemmensen · Posted 11-04-2020 07:30 AM

What does 'similar' mean here? For example, that the spelling distance between the names is small?

Also, is the data sorted by these 'similar' names?

fcf · Posted 11-04-2020 07:42 AM

Similar, as in the examples. For example, the birth_date is the same for these:
- Maria Rita Costa Santos
- M Rita Costa Santos: the M is an abbreviation for Maria
- Rita Costa Santos: because Maria is such a popular name, the person chooses to give only the second name, which is less common

These are all the same person, who just wrote the name in different ways. However, in the last two examples, people can have the same name and still be different people (check the birth date).
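
To make "similar" measurable, the name variants listed above can be compared with SAS string-distance functions. The following is a small sketch rather than a recommendation: COMPLEV and COMPGED are two built-in options, the data set name name_variants and the variable names are made up, and any cutoff for "close enough" would still have to be chosen by looking at the real data.

```sas
/* Sketch: compare the name variants from this post with two SAS
   string-distance functions. Data set and variable names are made up. */
data name_variants;
   infile datalines dsd dlm=',' truncover;
   length name1 name2 $ 26;
   input name1 name2;
   lev = complev(name1, name2);   /* Levenshtein edit distance */
   ged = compged(name1, name2);   /* generalized edit distance */
   datalines;
Maria Rita Costa Santos,M Rita Costa Santos
Maria Rita Costa Santos,Rita Costa Santos
M Rita Costa Santos,Rita Costa Santos
;

proc print data=name_variants noobs;
run;
```

A small distance alone does not prove two rows are the same person, which is why the birth date check is still needed.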

PeterClemmensen · Posted 11-04-2020 07:47 AM

And you already have the ID for some of the obs? But not all, correct?

fcf · Posted 11-04-2020 07:50 AM

Yes, not all 🙂 The point is to give the same person the same id, so I can group the data correctly.

PeterClemmensen · Posted 11-04-2020 08:08 AM

Ok. Is your data sorted like this? Or could M Rita Costa Santos have an observation at the bottom?

fcf · Posted 11-04-2020 08:15 AM

It is not sorted. I just showed this so it would be easier to understand. I have like 12 000 records of names.

PeterClemmensen · Posted 11-04-2020 08:27 AM

Can a person have no ID with a pre-assigned value?

fcf · Posted 11-04-2020 08:32 AM

Can you elaborate please?

PeterClemmensen · Posted 11-04-2020 08:39 AM

The posted data you have is this:

data have;
infile datalines missover;
input num_A num_B $ name $ 11-36 birth_date :ddmmyy10. id;
format birth_date ddmmyy10.;
datalines;
1234 abcd M Rita Costa Santos        01/01/2000 1 
3333 uvwx M Rita Costa Santos        01/01/2000 1 
5678 efgh Maria Rita C Santos        01/01/2000   
9101 ijkl Rita Costa Santos          01/01/2000 1 
1111 mnop Maria Leonor Santos Silva  02/03/2001   
2222 qrst Leonor Santos Silva        02/03/2001 2 
4444 yzab Leonor Santos Silva        30/08/1999   
;

Can a situation arise where a person in your data has no ID attached to any of their observations? See the last two obs in the data below (no ID).

data have;
infile datalines missover;
input num_A num_B $ name $ 11-36 birth_date :ddmmyy10. id;
format birth_date ddmmyy10.;
datalines;
1234 abcd M Rita Costa Santos        01/01/2000 1 
3333 uvwx M Rita Costa Santos        01/01/2000 1 
5678 efgh Maria Rita C Santos        01/01/2000   
9101 ijkl Rita Costa Santos          01/01/2000 1 
1111 mnop Maria Leonor Santos Silva  02/03/2001   
2222 qrst Leonor Santos Silva        02/03/2001 2 
4444 yzab Leonor Santos Silva        30/08/1999   
xxxx xxxx Some Name Here             01/01/1960   
xxxx xxxx Some Name Here             01/01/1960   
;

fcf · Posted 11-04-2020 08:47 AM

The only blank field possible would be the ID

PeterClemmensen · Posted 11-04-2020 08:59 AM

I will try to make myself clearer. Consider the data below. This is the same data that you posted plus an additional person.

While the two people in your initial data both have a pre-assigned ID (the one we want to hit for the remaining obs for those people), "Donald J Trump" does not. He is obviously not equal to either id=1 or id=2.

Can this situation happen in your data? And then what?

data have;
infile datalines missover;
input num_A num_B $ name $ 11-36 birth_date :ddmmyy10. id;
format birth_date ddmmyy10.;
datalines;
1234 abcd M Rita Costa Santos        01/01/2000 1 
3333 uvwx M Rita Costa Santos        01/01/2000 1 
5678 efgh Maria Rita C Santos        01/01/2000   
9101 ijkl Rita Costa Santos          01/01/2000 1 
1111 mnop Maria Leonor Santos Silva  02/03/2001   
2222 qrst Leonor Santos Silva        02/03/2001 2 
4444 yzab Leonor Santos Silva        30/08/1999   
xxxx xxxx Donald J Trump             01/01/1960   
xxxx xxxx Donald J Trump             01/01/1960   
;

fcf · Posted 11-04-2020 09:01 AM

Ah yes, it can happen. That's why I gave the last example of a Leonor born on another date, who will get id 3. Donald Trump should get id 4, for example.