BookmarkSubscribeRSS Feed
GingerJJ
Obsidian | Level 7

 

 

data test1;
infile datalines delimiter = "," truncover;
input 
LN $ 
FN $
ID
M_LN $
M_FN $
M_DOB date9.
Adress1 $
Adress2 $ 
city $
;
datalines;
Doe, Emily, 1234, Lan, Jane, 01JAN1988, 1234BlueLan, Apt67,  LA
Doe, John, 1235, Lan Jane, 01JAN1988, 1234BlueLan, Apt67,  LA
John, Mike, 2345, John, Karen, 19FEB1985, 234ORANGEPL,   , SF     
Garcia, Jose, 3456, Rosa, Maria, 23APR2001, 89APPLELN, APT34,   Boston
Garcia, Victor, 3457, Rosa, Maria, 23APR2001, 89APPLELN, APT34, Boston
;

I have a very large dataset (about 1 million). The above is just a simple dummy example.
Basically I have babies born to the same mother. The baby has their own distinct ID but they don't have a shared family ID. 
I want to find the babies with the same mom and create a family ID based on shared mom's information.
The shared data fields can be as many as 10 but not all of them are populated or all the time useful. 
What would be an effective way to do that? I'm also considering macro if possible but I want to write the repetitive code first so that I can fully understand the macro.
Thank you for any help!

 

 

3 REPLIES 3
ballardw
Super User

I am going to recommend a FREE program you can download from the CDC website named LinkPlus:  Information is here:

https://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm

 

You can download the setup file there.

 

I am recommending this especially because of your statement: "The shared data fields can be as many as 10 but not all of them are populated or all the time useful."

Linkplus does a probability match, where you can set a threshold level for what you want to consider a match.

If your mothers data does not provide the "id" value you want to set as common then you need to create one.

 

Note that any data that involves matching on Names, which was going to be a big part is full of headaches: spelling differences, name changes, missing/extra information like middle name, middle initial, use of actual middle name for a first name, information like MS/MRS that changes, other title information like DR and that is just the start.

 

The LinkPlus program is one that is intended to help with such data. It does want TEXT files for input. It does have an advantage that the variable names/column headers need not match. So if you have something in one file calle "mothers date of birth" you can specify that is the matching field for "date of birth" in a file of the mothers.

 

Otherwise you end up basically writing such a program yourself and that is going to take a very long time.

 


@GingerJJ wrote:

 

 

data test1;
infile datalines delimiter = "," truncover;
input 
LN $ 
FN $
ID
M_LN $
M_FN $
M_DOB date9.
Adress1 $
Adress2 $ 
city $
;
datalines;
Doe, Emily, 1234, Lan, Jane, 01JAN1988, 1234BlueLan, Apt67,  LA
Doe, John, 1235, Lan Jane, 01JAN1988, 1234BlueLan, Apt67,  LA
John, Mike, 2345, John, Karen, 19FEB1985, 234ORANGEPL,   , SF     
Garcia, Jose, 3456, Rosa, Maria, 23APR2001, 89APPLELN, APT34,   Boston
Garcia, Victor, 3457, Rosa, Maria, 23APR2001, 89APPLELN, APT34, Boston
;

I have a very large dataset (about 1 million). The above is just a simple dummy example.
Basically I have babies born to the same mother. The baby has their own distinct ID but they don't have a shared family ID. 
I want to find the babies with the same mom and create a family ID based on shared mom's information.
The shared data fields can be as many as 10 but not all of them are populated or all the time useful. 
What would be an effective way to do that? I'm also considering macro if possible but I want to write the repetitive code first so that I can fully understand the macro.
Thank you for any help!

 

 


 

Ksharp
Super User

Here is an example :

 

 


data have;
infile cards ;
input from $  to $ ;
cards;
1     2
1     3
4     5
5     2
9     4
6     7
8     7
;
run;
data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
Ksharp
Super User

If you have SAS/OR, could try this one :

 

/* Same code as SAS/OR */
proc optnet data_links=have out_nodes=want GRAPH_DIRECTION=UNDIRECTED;
data_links_var from=from to=to;
concomp;
run;

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 582 views
  • 2 likes
  • 3 in conversation