linking records based on shared information and created a common ID

GingerJJ · Posted 04-10-2022 03:36 PM

data test1;
infile datalines delimiter = "," truncover;
input 
LN $ 
FN $
ID
M_LN $
M_FN $
M_DOB date9.
Adress1 $
Adress2 $ 
city $
;
datalines;
Doe, Emily, 1234, Lan, Jane, 01JAN1988, 1234BlueLan, Apt67,  LA
Doe, John, 1235, Lan Jane, 01JAN1988, 1234BlueLan, Apt67,  LA
John, Mike, 2345, John, Karen, 19FEB1985, 234ORANGEPL,   , SF     
Garcia, Jose, 3456, Rosa, Maria, 23APR2001, 89APPLELN, APT34,   Boston
Garcia, Victor, 3457, Rosa, Maria, 23APR2001, 89APPLELN, APT34, Boston
;

I have a very large dataset (about 1 million). The above is just a simple dummy example.
Basically I have babies born to the same mother. The baby has their own distinct ID but they don't have a shared family ID. 
I want to find the babies with the same mom and create a family ID based on shared mom's information.
The shared data fields can be as many as 10 but not all of them are populated or all the time useful. 
What would be an effective way to do that? I'm also considering macro if possible but I want to write the repetitive code first so that I can fully understand the macro.
Thank you for any help!

ballardw · Posted 04-10-2022 10:17 PM

I am going to recommend a FREE program you can download from the CDC website named LinkPlus: Information is here:

https://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm

You can download the setup file there.

I am recommending this especially because of your statement: "The shared data fields can be as many as 10 but not all of them are populated or all the time useful."

Linkplus does a probability match, where you can set a threshold level for what you want to consider a match.

If your mothers data does not provide the "id" value you want to set as common then you need to create one.

Note that any data that involves matching on Names, which was going to be a big part is full of headaches: spelling differences, name changes, missing/extra information like middle name, middle initial, use of actual middle name for a first name, information like MS/MRS that changes, other title information like DR and that is just the start.

The LinkPlus program is one that is intended to help with such data. It does want TEXT files for input. It does have an advantage that the variable names/column headers need not match. So if you have something in one file calle "mothers date of birth" you can specify that is the matching field for "date of birth" in a file of the mothers.

Otherwise you end up basically writing such a program yourself and that is going to take a very long time.

@GingerJJ wrote:

data test1;
infile datalines delimiter = "," truncover;
input 
LN $ 
FN $
ID
M_LN $
M_FN $
M_DOB date9.
Adress1 $
Adress2 $ 
city $
;
datalines;
Doe, Emily, 1234, Lan, Jane, 01JAN1988, 1234BlueLan, Apt67,  LA
Doe, John, 1235, Lan Jane, 01JAN1988, 1234BlueLan, Apt67,  LA
John, Mike, 2345, John, Karen, 19FEB1985, 234ORANGEPL,   , SF     
Garcia, Jose, 3456, Rosa, Maria, 23APR2001, 89APPLELN, APT34,   Boston
Garcia, Victor, 3457, Rosa, Maria, 23APR2001, 89APPLELN, APT34, Boston
;

I have a very large dataset (about 1 million). The above is just a simple dummy example.
Basically I have babies born to the same mother. The baby has their own distinct ID but they don't have a shared family ID. 
I want to find the babies with the same mom and create a family ID based on shared mom's information.
The shared data fields can be as many as 10 but not all of them are populated or all the time useful. 
What would be an effective way to do that? I'm also considering macro if possible but I want to write the repetitive code first so that I can fully understand the macro.
Thank you for any help!

Ksharp · Posted 04-11-2022 08:02 AM

Here is an example :


data have;
infile cards ;
input from $  to $ ;
cards;
1     2
1     3
4     5
5     2
9     4
6     7
8     7
;
run;
data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

Ksharp · Posted 04-11-2022 08:03 AM

If you have SAS/OR, could try this one :

/* Same code as SAS/OR */
proc optnet data_links=have out_nodes=want GRAPH_DIRECTION=UNDIRECTED;
data_links_var from=from to=to;
concomp;
run;

linking records based on shared information and created a common ID

Re: linking records based on shared information and created a common ID

Re: linking records based on shared information and created a common ID

Re: linking records based on shared information and created a common ID

linking records based on shared information and created a common ID

Re: linking records based on shared information and created a common ID

Re: linking records based on shared information and created a common ID

Re: linking records based on shared information and created a common ID

SAS Innovate 2025: Call for Content

Click image to register for webinar

Classroom Training Available!