I propose a different SAS Hash-object solution. The first DATA STEP creates a simulated geneological (Parent,Child) table creating some lineages with more than 90 generations. The second DATA STEP is an alternative (proposed) solution using SAS Hash objects and arrays. Using the array, non-Hash-object solution which I have presented previously in this forum, 4674 lineages are presented. It took 7 minutes. Using the hash methods presented previously here, using the criterion: hashexp:20 produces the following SAS error message: ERROR: Hash object added 8912880 items when memory failure occurred. Using the hash methods presented previously here, using the criterion: hashexp:40 produces in the SAS LOG: NOTE: There were 4305 observations read from the data set WORK.TEMP. The hash methods took 10 seconds. Using the hash methods presented previously here, using the criterion: no hashexp used produces in the SAS LOG: NOTE: The data set WORK.TEMP has 4305 observations and 1 variables. The hash methods took 10 seconds and gave identical results as with the (hashexp:40) criterion. So the array method produced 369 (=4674-4305) more lineages than the hash methods presented in this forum. in_ary = 1 when the lineage is present in the array (non-hash) method. in_hsh = 1 when the lineage is present in the hash method(s). Doing a full join on the 2 set of lineages, the results are: in_ary in_hsh c_tot 0 1 51 1 0 420 1 1 4254 Let us look briefly some of the 51 lineages present in the hash results and not the array method. Let us look at the hash lineages where the end-of-line (or latest) child is 495, 806, 1032. According to the hash results: The lastest child 495 occurs at the 66th generation. The lastest child 806 occurs at the 89th generation. The lastest child 1032 occurs at the 83rd generation. Looking at the input table Forms for either Parent in (495,806,1032), one finds: FormID Parent Child 101 1032 417 101 1032 8796 101 1032 4866 101 1032 3620 101 495 474 101 495 6750 101 806 791 Thus the supposed terminal (end-of-line) children produce other children. Given the SAS system I was using (at a Fortune 500 company), the hash methods previously presented here did not complete the task. Hash methods are fast, so I wrote a hash-object solution. It produces 4674 lineages. It took 4 secs. Going from 7 minutes to 4 secs is progress. /*********************/ /**** DATA STEP 1 ****/ /*********************/ data t_a(keep=zParent zChild rnd); array elemA(10000) (1:10000); array genA(500); ** start index of a generation **; array genB(500); ** end index of a generation **; ** scramble the elemA matrix elements; do i = 1 to 7000; i1 = ceil(10000*ranuni(3)); i2 = ceil(10000*ranuni(5)); x = elemA(i1); elemA(i1) = elemA(i2); elemA(i2) = x; end; r=0; ** count number of generations; c=0; ** count nuber of children; do while (c<=10000); r+1; if (r=1) then do; c+1; genA(1)=1; genB(1)=1; end; else do; g= 10+ceil(200*ranuni(27)); do j=1 to g; c+1; if (c>10000) then leave; else do; if j=1 then genA(r)=c; genB(r)=c; end; end; end; end; ** for given child generation: connect child to parent; do i1 = r to 2 by -1; do i2 = genA(i1) to genB(i1); g_sum= genB(i1-1)-genA(i1-1)+1; x1= genA(i1-1) -1 +ceil(g_sum*ranuni(3)); zParent= elemA(x1); zChild = elemA(i2); rnd=ranuni(19); output; end; end; run; proc sort data=t_a; by rnd; run; *** just to make things interesting; /** transform table t_a to be compatible with previous Hash solutions **/ data Forms(keep=FormID Parent Child); length FormID Parent Child $8.; set t_a; FormID = '101'; Parent = strip(put(zParent,8.)); Child = strip(put(zChild,8.)); run; /*********************/ /**** DATA STEP 2 ****/ /*********************/ data t_want(keep=t_gen P1 G1-G9999); length c_end 8.; t_max=0; array GG(10000) P1 G1-G9999; if _N_=1 then do; if 0 then set t_a; declare hash ha(dataset:'t_a',multidata:'N' ); ha.definekey('zChild'); ha.definedata('zChild','zParent'); ha.definedone(); declare hash hb(dataset:'t_a',multidata:'N' ); hb.definekey('zParent'); hb.definedata('zChild','zParent'); hb.definedone(); declare hash hd(multidata:'N'); hd.definekey('c_end'); declare hiter dIter('hd'); hd.definedone(); end; do until (zDone); ** make list of end-of-line children; set t_a(rename=(zchild=zc zparent=zp)) nobs=n_last end=zDone; zParent=zC; if hb.find() then do; c_end=zParent; if hd.check() then do; hd.add(); end; end; end; rc = dIter.first(); do while (rc = 0); aDone=0; t_gen=0; do until (aDone); if (t_gen=0) then do; t_gen+1; GG(t_gen)= c_end; zChild=c_end; end; else if not(ha.find()) then do; t_gen+1; GG(t_gen)= zParent; zChild=zParent; end; else if ha.find() then do; aDone=1; end; end; do i=1 to int(t_gen/2); ** reverse lineage sequence; x1= GG(i); GG(i)= GG(t_gen+1-i); GG(t_gen+1-i)=x1; end; output; call missing(of GG(*)); t_max=max(t_max,t_gen); rc = dIter.next(); end; call execute('data t_want; set t_want(keep=t_gen P1 G1-'||compress('G'||put(t_max-1,6.))||');run;'); run;
... View more