Re: Assigning individuals to households - efficient method needed

Sean_OConnor · Posted 10-05-2018 08:57 AM

Folks,

If anyone could provide some insight into this issue I would be very appreciative.

I have two datasets.

Dataset one is a list of individuals (circa 5 million) One variable called ID which ranges from 1 to N.

Dataset two has a list of relationships (circa 11 million). This is how the dataset looks.

ID	ID_1	Relationsip of ID to ID_1
1	104428	Married 
2	36186	Parent
3	507094	Parent
4	147446	Parent
5	453559	Parent
6	395150	Sibling
7	495697	Married 
8	430549	Parent
9	430304	Married 
10	110819	Married 
11	218256	Married 
12	199523	Married 
13	72263	Parent
14	73282	Married 
15	87799	Parent
16	376709	Parent
17	9042	Married 
18	421753	Parent
19	463196	Sibling
20	97903	Married 
87799	15	Child
376709	16	Child
9042	17	Married 
421753	18	Child
463196	19	Sibling
97903	20	Married

So in theory dataset two provides information as to all the different links people have who are found in dataset one.

What I need to do is look through dataset two and if I find a relationship between two people I need to assign a household_id to individuals.

The issue is that that the same individuals can be found in both id and id1 and there can be many different relationships.

So I need some code that will look through column 1 and then look through column 2 and if finding someone assigns them to a household, along with the person in the relationship.

I don't want to end up with a case where id 15 and 8799 are assigned different households as they are found in the same column.

So a dataset such as this would be the final outcome.

ID	Household_id
1	x_100
2	x_101
3	x_102
4	x_103
5	x_104
6	x_105
7	x_106
8	x_107
9	x_108
10	x_109
11	x_110
12	x_111
13	x_112
14	x_113
15	x_114
16	x_115
17	x_116
18	x_117
19	x_118
20	x_119
104428	x_100
36186	x_101
507094	x_102
147446	x_103
453559	x_104
395150	x_105
495697	x_106
430549	x_107
430304	x_108
110819	x_109
218256	x_110
199523	x_111
72263	x_112
73282	x_113
87799	x_114
376709	x_115
9042	x_116
421753	x_117
463196	x_118
97903	x_119

Sean_OConnor · Posted 10-05-2018 08:55 AM

Folks,

If anyone could provide some insight into this issue I would be very appreciative.

I have two datasets.

Dataset one is a list of individuals (circa 5 million) One variable called ID which ranges from 1 to N.

Dataset two has a list of relationships (circa 11 million). This is how the dataset looks.

ID	ID_1	Relationsip of ID to ID_1
1	104428	Married 
2	36186	Parent
3	507094	Parent
4	147446	Parent
5	453559	Parent
6	395150	Sibling
7	495697	Married 
8	430549	Parent
9	430304	Married 
10	110819	Married 
11	218256	Married 
12	199523	Married 
13	72263	Parent
14	73282	Married 
15	87799	Parent
16	376709	Parent
17	9042	Married 
18	421753	Parent
19	463196	Sibling
20	97903	Married 
87799	15	Child
376709	16	Child
9042	17	Married 
421753	18	Child
463196	19	Sibling
97903	20	Married

So in theory dataset two provides information as to all the different links people have who are found in dataset one.

What I need to do is look through dataset two and if I find a relationship between two people I need to assign a household_id to individuals.

The issue is that that the same individuals can be found in both id and id1 and there can be many different relationships.

So I need some code that will look through column 1 and then look through column 2 and if finding someone assigns them to a household, along with the person in the relationship.

I don't want to end up with a case where id 15 and 8799 are assigned different households as they are found in the same column.

So a dataset such as this would be the final outcome.

ID	Household_id
1	x_100
2	x_101
3	x_102
4	x_103
5	x_104
6	x_105
7	x_106
8	x_107
9	x_108
10	x_109
11	x_110
12	x_111
13	x_112
14	x_113
15	x_114
16	x_115
17	x_116
18	x_117
19	x_118
20	x_119
104428	x_100
36186	x_101
507094	x_102
147446	x_103
453559	x_104
395150	x_105
495697	x_106
430549	x_107
430304	x_108
110819	x_109
218256	x_110
199523	x_111
72263	x_112
73282	x_113
87799	x_114
376709	x_115
9042	x_116
421753	x_117
463196	x_118
97903	x_119

TomKari · Posted 10-05-2018 04:15 PM

SAS has great network detection and optimization tools:

proc optnet data_links=Have out_nodes=Network;
	data_links_var from=ID to=ID_1;
	concomp;
run;

Tom

mkeintz · Posted 10-05-2018 06:42 PM

I have frequently used proc optnet for this sort of problem. It's perfect for this task. And the CONCOMP statement effectively provides the household id's in the output data set.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Sean_OConnor · Posted 10-08-2018 05:03 AM

Hi Tom,

I'm using SAS enterprise guide 7.1 and it doesn't appear that this procedure is available?

Ksharp · Posted 10-06-2018 07:14 AM


data have;
infile cards expandtabs;
input from $  to $ ;
cards;
1	104428	Married 
2	36186	Parent
3	507094	Parent
4	147446	Parent
5	453559	Parent
6	395150	Sibling
7	495697	Married 
8	430549	Parent
9	430304	Married 
10	110819	Married 
11	218256	Married 
12	199523	Married 
13	72263	Parent
14	73282	Married 
15	87799	Parent
16	376709	Parent
17	9042	Married 
18	421753	Parent
19	463196	Sibling
20	97903	Married 
87799	15	Child
376709	16	Child
9042	17	Married 
421753	18	Child
463196	19	Sibling
97903	20	Married
;
run;
data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

Sean_OConnor · Posted 10-08-2018 09:56 AM

I just have a follow up query here.

While this code works perfectly I'm running into memory issues. My relationship dataset has circa 12,000,000 observations.

When I run the code on my dataset the program eventually bombs out due to memory issues.

Is there anyway to resolve this or provide a solution?

Ksharp · Posted 10-09-2018 08:22 AM

WOW. Can't believe you have so big table.

Assign more memory to sas by option -memsize 0

Rick click the SAS icon and select attribution .

fill the following line into TARGET :

"C:\Program Files\SAS\SASFoundation\9.2(32-bit)\sas.exe" -CONFIG "C:\Program Files\SAS\SASFoundation\9.2(32-bit)\nls\en\SASV9.CFG" -memsize 0

Ksharp · Posted 10-09-2018 08:29 AM

Here is a blog about make your memory bigger.

https://blogs.sas.com/content/iml/2015/07/31/large-matrices.html

PS. also try add an option

options compress=yes;

at top of my code.

FreelanceReinh · Posted 10-09-2018 09:10 AM

Hi @Sean_OConnor,

I guess it's the last datastep that requires too much RAM because it uses four hash objects. In the past, similar problems have been addressed with PGStats's SubGraphs macro, which uses only one hash object. Here's how it works on your sample data (using dataset HAVE as KSharp created it):

%SubGraphs(have,out=households);

proc sql;
create table want as
select node as ID, cats('x_',clust+99) as Household_id length=8
from households order by length(ID)>2, clust, node;
quit;

The result matches exactly your "final outcome".

Chances are that you could even get away without using hash objects:

https://communities.sas.com/t5/SAS-Programming/Combining-rows-of-data/m-p/259977#M50371 (DS2 approach, haven't tested it)
https://communities.sas.com/t5/SAS-Programming/Combining-rows-of-data/m-p/259062#M50047 (same thread, iterative macro using PROC SORT, DATA steps and PROC SQL)
https://communities.sas.com/t5/SAS-Programming/Recursive-lookup-for-ID-s/m-p/259763#M50297 (adapted version of the previous solution, addressed at a problem more similar to yours)

Ksharp · Posted 10-09-2018 09:17 AM

I would doubt that PG's code could work for such big table.So is for SAS/OR .

Sean_OConnor · Posted 10-09-2018 10:18 AM

Running the above piece of code I'm getting an error saying the dataset households does not exist;

data have;
infile cards expandtabs;
input from $  to $ ;
cards;
1	104428	Married 
2	36186	Parent
3	507094	Parent
4	147446	Parent
5	453559	Parent
6	395150	Sibling
7	495697	Married 
8	430549	Parent
9	430304	Married 
10	110819	Married 
11	218256	Married 
12	199523	Married 
13	72263	Parent
14	73282	Married 
15	87799	Parent
16	376709	Parent
17	9042	Married 
18	421753	Parent
19	463196	Sibling
20	97903	Married 
87799	15	Child
376709	16	Child
9042	17	Married 
421753	18	Child
463196	19	Sibling
97903	20	Married
;
run;




%SubGraphs(have,out=households);

proc sql;
create table want as
select node as ID, cats('x_',clust+99) as Household_id length=8
from households order by length(ID)>2, clust, node;
quit;

FreelanceReinh · Posted 10-09-2018 10:29 AM

You compiled macro SubGraphs before calling it, didn't you? That is, you submitted the macro code from %macro SubGraphs (...); to %mend SubGraphs;?

Sean_OConnor · Posted 10-09-2018 10:49 AM

As in this?

%macro SubGraphs(have,out=households);

proc sql;
create table want as
select node as ID, cats('x_',clust+99) as Household_id length=8
from households order by length(ID)>2, clust, node;
quit;
%mend;
options mlogic mprint;
%SubGraphs;
options nomlogic nomprint;

FreelanceReinh · Posted 10-09-2018 10:53 AM

No, you need to download the macro code from the page https://communities.sas.com/t5/tkb/articleprintpage/tkb-id/library/article-id/1045

Assigning individuals to a common household

Assigning individuals to households - efficient method needed