BookmarkSubscribeRSS Feed
Sean_OConnor
Fluorite | Level 6

 

Folks,

 

If anyone could provide some insight into this issue I would be very appreciative. 

 

I have two datasets.

 

Dataset one is a list of individuals (circa 5 million) One variable called ID which ranges from 1 to N.

 

Dataset two has a list of relationships (circa 11 million). This is how the dataset looks.

 

ID	ID_1	Relationsip of ID to ID_1
1	104428	Married 
2	36186	Parent
3	507094	Parent
4	147446	Parent
5	453559	Parent
6	395150	Sibling
7	495697	Married 
8	430549	Parent
9	430304	Married 
10	110819	Married 
11	218256	Married 
12	199523	Married 
13	72263	Parent
14	73282	Married 
15	87799	Parent
16	376709	Parent
17	9042	Married 
18	421753	Parent
19	463196	Sibling
20	97903	Married 
87799	15	Child
376709	16	Child
9042	17	Married 
421753	18	Child
463196	19	Sibling
97903	20	Married 

 

So in theory dataset two provides information as to all the different links people have who are found in dataset one.

 

What I need to do is look through dataset two and if I find a relationship between two people I need to assign a household_id to individuals.

 

The issue is that that the same individuals can be found in both id and id1 and there can be many different relationships.

 

So I need some code that will look through column 1 and then look through column 2 and if finding someone assigns them to a household, along with the person in the relationship.

 

I don't want to end up with a case where id 15 and 8799 are assigned different households as they are found in the same column.

 

So a dataset such as this would be the final outcome.

 

 

ID	Household_id
1	x_100
2	x_101
3	x_102
4	x_103
5	x_104
6	x_105
7	x_106
8	x_107
9	x_108
10	x_109
11	x_110
12	x_111
13	x_112
14	x_113
15	x_114
16	x_115
17	x_116
18	x_117
19	x_118
20	x_119
104428	x_100
36186	x_101
507094	x_102
147446	x_103
453559	x_104
395150	x_105
495697	x_106
430549	x_107
430304	x_108
110819	x_109
218256	x_110
199523	x_111
72263	x_112
73282	x_113
87799	x_114
376709	x_115
9042	x_116
421753	x_117
463196	x_118
97903	x_119
15 REPLIES 15
Sean_OConnor
Fluorite | Level 6

Folks,

 

If anyone could provide some insight into this issue I would be very appreciative. 

 

I have two datasets.

 

Dataset one is a list of individuals (circa 5 million) One variable called ID which ranges from 1 to N.

 

Dataset two has a list of relationships (circa 11 million). This is how the dataset looks.

 

ID	ID_1	Relationsip of ID to ID_1
1	104428	Married 
2	36186	Parent
3	507094	Parent
4	147446	Parent
5	453559	Parent
6	395150	Sibling
7	495697	Married 
8	430549	Parent
9	430304	Married 
10	110819	Married 
11	218256	Married 
12	199523	Married 
13	72263	Parent
14	73282	Married 
15	87799	Parent
16	376709	Parent
17	9042	Married 
18	421753	Parent
19	463196	Sibling
20	97903	Married 
87799	15	Child
376709	16	Child
9042	17	Married 
421753	18	Child
463196	19	Sibling
97903	20	Married 

 

So in theory dataset two provides information as to all the different links people have who are found in dataset one.

 

What I need to do is look through dataset two and if I find a relationship between two people I need to assign a household_id to individuals.

 

The issue is that that the same individuals can be found in both id and id1 and there can be many different relationships.

 

So I need some code that will look through column 1 and then look through column 2 and if finding someone assigns them to a household, along with the person in the relationship.

 

I don't want to end up with a case where id 15 and 8799 are assigned different households as they are found in the same column.

 

So a dataset such as this would be the final outcome.

 

ID	Household_id
1	x_100
2	x_101
3	x_102
4	x_103
5	x_104
6	x_105
7	x_106
8	x_107
9	x_108
10	x_109
11	x_110
12	x_111
13	x_112
14	x_113
15	x_114
16	x_115
17	x_116
18	x_117
19	x_118
20	x_119
104428	x_100
36186	x_101
507094	x_102
147446	x_103
453559	x_104
395150	x_105
495697	x_106
430549	x_107
430304	x_108
110819	x_109
218256	x_110
199523	x_111
72263	x_112
73282	x_113
87799	x_114
376709	x_115
9042	x_116
421753	x_117
463196	x_118
97903	x_119

 

 

 

 

 

 

 

TomKari
Onyx | Level 15

SAS has great network detection and optimization tools:

 

proc optnet data_links=Have out_nodes=Network;
	data_links_var from=ID to=ID_1;
	concomp;
run;

 

Tom

 

mkeintz
PROC Star

I have frequently used proc optnet for this sort of problem. It's perfect for this task.  And the CONCOMP statement effectively provides the household id's in the output data set.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Sean_OConnor
Fluorite | Level 6

Hi Tom,

 

I'm using SAS enterprise guide 7.1 and it doesn't appear that this procedure is available?

Ksharp
Super User

data have;
infile cards expandtabs;
input from $  to $ ;
cards;
1	104428	Married 
2	36186	Parent
3	507094	Parent
4	147446	Parent
5	453559	Parent
6	395150	Sibling
7	495697	Married 
8	430549	Parent
9	430304	Married 
10	110819	Married 
11	218256	Married 
12	199523	Married 
13	72263	Parent
14	73282	Married 
15	87799	Parent
16	376709	Parent
17	9042	Married 
18	421753	Parent
19	463196	Sibling
20	97903	Married 
87799	15	Child
376709	16	Child
9042	17	Married 
421753	18	Child
463196	19	Sibling
97903	20	Married
;
run;
data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
Sean_OConnor
Fluorite | Level 6

I just have a follow up query here.

 

While this code works perfectly I'm running into memory issues. My relationship dataset has circa 12,000,000 observations.

 

When I run the code on my dataset the program eventually bombs out due to memory issues. 

 

Is there anyway to resolve this or provide a solution?

Ksharp
Super User

WOW. Can't believe you have so big table.

Assign more memory to  sas by option -memsize 0

 

Rick click the SAS icon and select attribution .

fill the following line into TARGET :

 

 

"C:\Program Files\SAS\SASFoundation\9.2(32-bit)\sas.exe" -CONFIG "C:\Program Files\SAS\SASFoundation\9.2(32-bit)\nls\en\SASV9.CFG"  -memsize 0 

Ksharp
Super User

Here is a blog about make your memory bigger.

 

https://blogs.sas.com/content/iml/2015/07/31/large-matrices.html

 

 

PS.  also try add an option 

options compress=yes;

 

at top of my code.

FreelanceReinh
Jade | Level 19

Hi @Sean_OConnor,

 

I guess it's the last datastep that requires too much RAM because it uses four hash objects. In the past, similar problems have been addressed with PGStats's SubGraphs macro, which uses only one hash object. Here's how it works on your sample data (using dataset HAVE as KSharp created it):

 

%SubGraphs(have,out=households);

proc sql;
create table want as
select node as ID, cats('x_',clust+99) as Household_id length=8
from households order by length(ID)>2, clust, node;
quit;

The result matches exactly your "final outcome".

 

 

Chances are that you could even get away without using hash objects:

Ksharp
Super User

I would doubt that PG's code could work for such big table.So is for SAS/OR .

Sean_OConnor
Fluorite | Level 6

Running the above piece of code I'm getting an error saying the dataset households does not exist;

 

 

data have;
infile cards expandtabs;
input from $  to $ ;
cards;
1	104428	Married 
2	36186	Parent
3	507094	Parent
4	147446	Parent
5	453559	Parent
6	395150	Sibling
7	495697	Married 
8	430549	Parent
9	430304	Married 
10	110819	Married 
11	218256	Married 
12	199523	Married 
13	72263	Parent
14	73282	Married 
15	87799	Parent
16	376709	Parent
17	9042	Married 
18	421753	Parent
19	463196	Sibling
20	97903	Married 
87799	15	Child
376709	16	Child
9042	17	Married 
421753	18	Child
463196	19	Sibling
97903	20	Married
;
run;




%SubGraphs(have,out=households);

proc sql;
create table want as
select node as ID, cats('x_',clust+99) as Household_id length=8
from households order by length(ID)>2, clust, node;
quit;
FreelanceReinh
Jade | Level 19

You compiled macro SubGraphs before calling it, didn't you? That is, you submitted the macro code from %macro SubGraphs (...); to %mend SubGraphs;?

Sean_OConnor
Fluorite | Level 6

As in this?

 

%macro SubGraphs(have,out=households);

proc sql;
create table want as
select node as ID, cats('x_',clust+99) as Household_id length=8
from households order by length(ID)>2, clust, node;
quit;
%mend;
options mlogic mprint;
%SubGraphs;
options nomlogic nomprint;

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 15 replies
  • 2017 views
  • 1 like
  • 5 in conversation