Folks,
If anyone could provide some insight into this issue I would be very appreciative.
I have two datasets.
Dataset one is a list of individuals (circa 5 million) One variable called ID which ranges from 1 to N.
Dataset two has a list of relationships (circa 11 million). This is how the dataset looks.
ID ID_1 Relationsip of ID to ID_1
1 104428 Married
2 36186 Parent
3 507094 Parent
4 147446 Parent
5 453559 Parent
6 395150 Sibling
7 495697 Married
8 430549 Parent
9 430304 Married
10 110819 Married
11 218256 Married
12 199523 Married
13 72263 Parent
14 73282 Married
15 87799 Parent
16 376709 Parent
17 9042 Married
18 421753 Parent
19 463196 Sibling
20 97903 Married
87799 15 Child
376709 16 Child
9042 17 Married
421753 18 Child
463196 19 Sibling
97903 20 Married
So in theory dataset two provides information as to all the different links people have who are found in dataset one.
What I need to do is look through dataset two and if I find a relationship between two people I need to assign a household_id to individuals.
The issue is that that the same individuals can be found in both id and id1 and there can be many different relationships.
So I need some code that will look through column 1 and then look through column 2 and if finding someone assigns them to a household, along with the person in the relationship.
I don't want to end up with a case where id 15 and 8799 are assigned different households as they are found in the same column.
So a dataset such as this would be the final outcome.
ID Household_id
1 x_100
2 x_101
3 x_102
4 x_103
5 x_104
6 x_105
7 x_106
8 x_107
9 x_108
10 x_109
11 x_110
12 x_111
13 x_112
14 x_113
15 x_114
16 x_115
17 x_116
18 x_117
19 x_118
20 x_119
104428 x_100
36186 x_101
507094 x_102
147446 x_103
453559 x_104
395150 x_105
495697 x_106
430549 x_107
430304 x_108
110819 x_109
218256 x_110
199523 x_111
72263 x_112
73282 x_113
87799 x_114
376709 x_115
9042 x_116
421753 x_117
463196 x_118
97903 x_119
Folks,
If anyone could provide some insight into this issue I would be very appreciative.
I have two datasets.
Dataset one is a list of individuals (circa 5 million) One variable called ID which ranges from 1 to N.
Dataset two has a list of relationships (circa 11 million). This is how the dataset looks.
ID ID_1 Relationsip of ID to ID_1
1 104428 Married
2 36186 Parent
3 507094 Parent
4 147446 Parent
5 453559 Parent
6 395150 Sibling
7 495697 Married
8 430549 Parent
9 430304 Married
10 110819 Married
11 218256 Married
12 199523 Married
13 72263 Parent
14 73282 Married
15 87799 Parent
16 376709 Parent
17 9042 Married
18 421753 Parent
19 463196 Sibling
20 97903 Married
87799 15 Child
376709 16 Child
9042 17 Married
421753 18 Child
463196 19 Sibling
97903 20 Married
So in theory dataset two provides information as to all the different links people have who are found in dataset one.
What I need to do is look through dataset two and if I find a relationship between two people I need to assign a household_id to individuals.
The issue is that that the same individuals can be found in both id and id1 and there can be many different relationships.
So I need some code that will look through column 1 and then look through column 2 and if finding someone assigns them to a household, along with the person in the relationship.
I don't want to end up with a case where id 15 and 8799 are assigned different households as they are found in the same column.
So a dataset such as this would be the final outcome.
ID Household_id
1 x_100
2 x_101
3 x_102
4 x_103
5 x_104
6 x_105
7 x_106
8 x_107
9 x_108
10 x_109
11 x_110
12 x_111
13 x_112
14 x_113
15 x_114
16 x_115
17 x_116
18 x_117
19 x_118
20 x_119
104428 x_100
36186 x_101
507094 x_102
147446 x_103
453559 x_104
395150 x_105
495697 x_106
430549 x_107
430304 x_108
110819 x_109
218256 x_110
199523 x_111
72263 x_112
73282 x_113
87799 x_114
376709 x_115
9042 x_116
421753 x_117
463196 x_118
97903 x_119
SAS has great network detection and optimization tools:
proc optnet data_links=Have out_nodes=Network;
data_links_var from=ID to=ID_1;
concomp;
run;
Tom
I have frequently used proc optnet for this sort of problem. It's perfect for this task. And the CONCOMP statement effectively provides the household id's in the output data set.
Hi Tom,
I'm using SAS enterprise guide 7.1 and it doesn't appear that this procedure is available?
data have;
infile cards expandtabs;
input from $ to $ ;
cards;
1 104428 Married
2 36186 Parent
3 507094 Parent
4 147446 Parent
5 453559 Parent
6 395150 Sibling
7 495697 Married
8 430549 Parent
9 430304 Married
10 110819 Married
11 218256 Married
12 199523 Married
13 72263 Parent
14 73282 Married
15 87799 Parent
16 376709 Parent
17 9042 Married
18 421753 Parent
19 463196 Sibling
20 97903 Married
87799 15 Child
376709 16 Child
9042 17 Married
421753 18 Child
463196 19 Sibling
97903 20 Married
;
run;
data full;
set have end=last;
if _n_ eq 1 then do;
declare hash h();
h.definekey('node');
h.definedata('node');
h.definedone();
end;
output;
node=from; h.replace();
from=to; to=node;
output;
node=from; h.replace();
if last then h.output(dataset:'node');
drop node;
run;
data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();
if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
from_to.definekey('from');
from_to.definedata('to');
from_to.definedone();
if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
no.definekey('node');
no.definedata('node');
no.definedone();
do while(hi_no.next()=0);
household+1; output;
count=1;
key=node;_ha.add();
last=node;ha.add();
rc=hi.first();
do while(rc=0);
from=last;rx=from_to.find();
do while(rx=0);
key=to;ry=_ha.check();
if ry ne 0 then do;
node=to;output;rr=no.remove(key:node);
key=to;_ha.add();
count+1;
last=to;ha.add();
end;
rx=from_to.find_next();
end;
rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
I just have a follow up query here.
While this code works perfectly I'm running into memory issues. My relationship dataset has circa 12,000,000 observations.
When I run the code on my dataset the program eventually bombs out due to memory issues.
Is there anyway to resolve this or provide a solution?
WOW. Can't believe you have so big table.
Assign more memory to sas by option -memsize 0
Rick click the SAS icon and select attribution .
fill the following line into TARGET :
"C:\Program Files\SAS\SASFoundation\9.2(32-bit)\sas.exe" -CONFIG "C:\Program Files\SAS\SASFoundation\9.2(32-bit)\nls\en\SASV9.CFG" -memsize 0
Here is a blog about make your memory bigger.
https://blogs.sas.com/content/iml/2015/07/31/large-matrices.html
PS. also try add an option
options compress=yes;
at top of my code.
Hi @Sean_OConnor,
I guess it's the last datastep that requires too much RAM because it uses four hash objects. In the past, similar problems have been addressed with PGStats's SubGraphs macro, which uses only one hash object. Here's how it works on your sample data (using dataset HAVE as KSharp created it):
%SubGraphs(have,out=households);
proc sql;
create table want as
select node as ID, cats('x_',clust+99) as Household_id length=8
from households order by length(ID)>2, clust, node;
quit;
The result matches exactly your "final outcome".
Chances are that you could even get away without using hash objects:
I would doubt that PG's code could work for such big table.So is for SAS/OR .
Running the above piece of code I'm getting an error saying the dataset households does not exist;
data have;
infile cards expandtabs;
input from $ to $ ;
cards;
1 104428 Married
2 36186 Parent
3 507094 Parent
4 147446 Parent
5 453559 Parent
6 395150 Sibling
7 495697 Married
8 430549 Parent
9 430304 Married
10 110819 Married
11 218256 Married
12 199523 Married
13 72263 Parent
14 73282 Married
15 87799 Parent
16 376709 Parent
17 9042 Married
18 421753 Parent
19 463196 Sibling
20 97903 Married
87799 15 Child
376709 16 Child
9042 17 Married
421753 18 Child
463196 19 Sibling
97903 20 Married
;
run;
%SubGraphs(have,out=households);
proc sql;
create table want as
select node as ID, cats('x_',clust+99) as Household_id length=8
from households order by length(ID)>2, clust, node;
quit;
You compiled macro SubGraphs before calling it, didn't you? That is, you submitted the macro code from %macro SubGraphs (...); to %mend SubGraphs;?
As in this?
%macro SubGraphs(have,out=households);
proc sql;
create table want as
select node as ID, cats('x_',clust+99) as Household_id length=8
from households order by length(ID)>2, clust, node;
quit;
%mend;
options mlogic mprint;
%SubGraphs;
options nomlogic nomprint;
No, you need to download the macro code from the page https://communities.sas.com/t5/tkb/articleprintpage/tkb-id/library/article-id/1045
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.