BookmarkSubscribeRSS Feed
Ksharp
Super User

Here is my version.

Hi. Actually Patrick has already done it for me .Thanks Patrick. Happy Holiday!

data Forms;

input FormID $ Parent $ Child $ @@;

cards;

1  1 .

1  1 2   1  2 20 1 20 200

1  1 3   1  3 30 1 30 300   1 300 3000

2  9  .

2  9 4   2  4 5   2  4 6   2   6 5000

2  6 7   2  7 8   2  6 400 2   6 4000

2  6 10 2  6 11   2 6  12

;

run;

data check;

set forms ;

P=Child;C=Parent;

drop Parent Child;

run;

data temp(keep= path);

if _n_ eq 1 then do;

length path _path  $ 400 ;

if 0 then set check;

declare hash ch(hashexp:20,dataset:'check(where=(P is not missing))');

ch.definekey('FormID','P');

ch.definedone();

if 0 then set Forms;

declare hash ha(hashexp:20,dataset:'Forms(where=(child is not missing))',multidata:'Y');

ha.definekey('FormID','Parent');

ha.definedata('Child');

ha.definedone();

declare hash pa(ordered:'Y');

declare hiter hi_path('pa');

pa.definekey('count');

pa.definedata('path');

pa.definedone();

end;

set Forms(where=(child is not missing));

count=1;

path=catx(' ',FormID,Parent,Child);

pa.add();

do while(hi_path.next()=0);

_path=path;

FormID=scan(path,1,' ');

P=scan(path,2,' ');

Parent=scan(path,-1,' ');

rc=ha.find();

if (rc ne 0 or (rc eq 0 and find(path,strip(Child)))) and ch.check() ne 0 then output;

do while(rc=0);

  if not find(path,strip(Child)) then do;

                                      count+1;

                                      path=catx(' ',path,Child);

                                      pa.add();

                                      path=_path;

                                        end;

  rc=ha.find_next();

end;

end;

pa.clear();

run;

proc sql noprint;

select max(n)-3 into : n separated by ' '

  from (select countw(path) as n from temp);

quit;

data want;

set temp;

array x{*} $ 40 FID P C G1-G&n;

do i=1 to dim(x);

  x{i}=scan(path,i,' ');

end;

drop path i;

run;

Xia Keshan

drjohn61
Calcite | Level 5

Thanks All,

I REALLY appreciate the help. This will streamline my code and the time needed to run it immensely!

I would have liked to award TWO correct answers, since both Patrick and Xia's are correct....but it won't let me. Xia still gets 20 points though! Smiley Happy

Thanks again,

John L.

billfish
Quartz | Level 8

My suggested solution is not as sophisticated as the ones already presented. Looking at the input and output, I notice that
1- A given FormId correspond to the same Parent
2- The child at of the line for each line does not show up in the parent column.

The following solution assumes that there is no lineage with more than 9999 generations.

data want(keep=FormID P1 G1-G9999);

   if 0 then set forms; *just want FormID to be first variable in output;

   array PP(10000) ; * parent;

   array CA(10000) ; * child ;

   array CB(10000) ; * end_of_line children;

   array LA(10000) P1 G1-G9999; * lineage sequence;

   r=0;

   zP=.;

   call missing(of PP(*));

   call missing(of CA(*));

   call missing(of CB(*));

   * load data for given FormID in matrices PP CA;

   do until (last.FormID);

      set forms end=zEnd;

      by FormID;

      r+1;

      PP(r)=Parent;

      CA(r)=Child;

      if Child=. then zP=Parent;

   end;

   * find the end_of_line Children ;

   s=0;

   do i= 1 to r;

      if not(CA(i)=.) and not(CA(i) in PP) then do;

         s+1;

         CB(s)=CA(i);

      end;

   end;

   * for each end_of_line child, find the lineage;

   do i = 1 to s;

      t=1;

      LA(t)=CB(i);

      zDone=0;

      do until (zDone);

         do j = 1 to r;

            if LA(t)=CA(j) then do;

               t+1;

               LA(t)=PP(j);

               if LA(t)=zP then zDone=1;

               leave;

            end;

         end;

      end;

      t_max=max(t_max,t);

      do j = 1 to int(t/2); *reverse lineage sequence;

         a1 = LA(j);

         LA(j)=LA(t+1-j);;

         LA(t+1-j)= a1;

      end;

      output;

      call missing(of LA(*)); * clear for next lineage;

   end;

   * delete all G# which is null for all lineages;

   if zEnd then do;

      call execute('data want; set want(keep=FormID P1 G1-'||compress('G'||put(t_max-1,5.))||');run;');

   end;

run;

billfish
Quartz | Level 8

I propose a different SAS Hash-object solution.

The first DATA STEP creates a simulated geneological (Parent,Child) table creating some lineages with more than 90 generations.
The second DATA STEP is an alternative (proposed) solution using SAS Hash objects and arrays.

Using the array, non-Hash-object solution which I have presented previously in this forum, 4674 lineages are presented. It took 7 minutes.

Using the hash methods presented previously here, using the criterion:
hashexp:20
produces the following SAS error message:
ERROR: Hash object added 8912880 items when memory failure occurred.


Using the hash methods presented previously here, using the criterion:
hashexp:40
produces in the SAS LOG:
NOTE: There were 4305 observations read from the data set WORK.TEMP.
The hash methods took 10 seconds.


Using the hash methods presented previously here, using the criterion:
no hashexp used
produces in the SAS LOG:
NOTE: The data set WORK.TEMP has 4305 observations and 1 variables.
The hash methods took 10 seconds and gave identical results as with the (hashexp:40) criterion.


So the array method produced 369 (=4674-4305) more lineages than the hash methods presented in this forum.


in_ary = 1 when the lineage is present in the array (non-hash) method.
in_hsh = 1 when the lineage is present in the hash method(s).

Doing a full join on the 2 set of lineages, the results are:

in_ary in_hsh c_tot
  0       1       51
  1       0      420
  1       1     4254

Let us look briefly some of the 51 lineages present in the hash results and not the array method.

Let us look at the hash lineages where the end-of-line (or latest) child is 495, 806, 1032. According to the hash results:
The lastest child  495 occurs at the 66th generation.
The lastest child  806 occurs at the 89th generation.
The lastest child 1032 occurs at the 83rd generation.

Looking at the input table Forms for either Parent in (495,806,1032), one finds:

FormID  Parent   Child

101      1032     417
101      1032     8796
101      1032     4866
101      1032     3620
101      495      474
101      495      6750
101      806      791

Thus the supposed terminal (end-of-line) children produce other children.


Given the SAS system I was using (at a Fortune 500 company), the hash methods previously presented here did not complete the task.

Hash methods are fast, so I wrote a hash-object solution. It produces 4674 lineages. It took 4 secs. Going from 7 minutes to 4 secs is progress.

/*********************/
/**** DATA STEP 1 ****/
/*********************/
data t_a(keep=zParent zChild rnd);
   array elemA(10000) (1:10000);
   array genA(500); ** start index of a generation **;
   array genB(500); ** end   index of a generation **;
   ** scramble the elemA matrix elements;
   do i = 1 to 7000;
      i1 = ceil(10000*ranuni(3));
      i2 = ceil(10000*ranuni(5));
      x = elemA(i1);
      elemA(i1) = elemA(i2);
      elemA(i2) = x;
   end;
   r=0; ** count number of generations;
   c=0; ** count nuber of children;
   do while (c<=10000);
      r+1;
      if (r=1) then do;
         c+1;
         genA(1)=1;
         genB(1)=1;
      end;
      else do;
         g= 10+ceil(200*ranuni(27));
         do j=1 to g;
            c+1;
            if (c>10000) then leave;
            else do;
               if j=1 then genA(r)=c;
               genB(r)=c;
            end;
         end;
      end;
   end;
   ** for given child generation: connect child to parent;
   do i1 = r to 2 by -1;
      do i2 = genA(i1) to genB(i1);
         g_sum= genB(i1-1)-genA(i1-1)+1;
         x1=  genA(i1-1) -1 +ceil(g_sum*ranuni(3));
         zParent= elemA(x1);
         zChild = elemA(i2);
         rnd=ranuni(19);
         output;
      end;
   end;
run;

proc sort data=t_a; by rnd; run; *** just to make things interesting;

/** transform table t_a to be compatible with previous Hash solutions **/
data Forms(keep=FormID Parent Child);
   length FormID Parent Child $8.;
   set t_a;
   FormID = '101';
   Parent = strip(put(zParent,8.));
   Child  = strip(put(zChild,8.));
run;


/*********************/
/**** DATA STEP 2 ****/
/*********************/
data t_want(keep=t_gen P1 G1-G9999);
   length c_end 8.;
   t_max=0;
   array GG(10000) P1 G1-G9999;

   if _N_=1 then do;
      if 0 then set t_a;
      declare hash ha(dataset:'t_a',multidata:'N' );
      ha.definekey('zChild');
      ha.definedata('zChild','zParent');
      ha.definedone();

      declare hash hb(dataset:'t_a',multidata:'N' );
      hb.definekey('zParent');
      hb.definedata('zChild','zParent');
      hb.definedone();

      declare hash hd(multidata:'N');
      hd.definekey('c_end');
      declare hiter dIter('hd');
      hd.definedone();
   end;

   do until (zDone); ** make list of end-of-line children;
      set t_a(rename=(zchild=zc zparent=zp)) nobs=n_last end=zDone;
      zParent=zC;
      if hb.find() then do;
         c_end=zParent;
         if hd.check() then do; hd.add(); end;
      end;
   end;

   rc = dIter.first();
   do while (rc = 0);
      aDone=0;
      t_gen=0;
      do until (aDone);
              if (t_gen=0)      then do; t_gen+1; GG(t_gen)= c_end;   zChild=c_end; end;
         else if not(ha.find()) then do; t_gen+1; GG(t_gen)= zParent; zChild=zParent; end;
         else if     ha.find()  then do; aDone=1; end;
      end;
      do i=1 to int(t_gen/2); ** reverse lineage sequence;
         x1= GG(i);
         GG(i)= GG(t_gen+1-i);
         GG(t_gen+1-i)=x1;
      end;
      output;
      call missing(of GG(*));
      t_max=max(t_max,t_gen);
      rc = dIter.next();
   end;

   call execute('data t_want; set t_want(keep=t_gen P1 G1-'||compress('G'||put(t_max-1,6.))||');run;');
run;

Ksharp
Super User

I can optimize it as well , if you want.

And apparently your code didn't consider the LOOP situation .

if data like 1  1 2   1  2 20 1 20 200 1 200 201 1 201 20

You will get the wrong result .

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 19 replies
  • 3161 views
  • 11 likes
  • 8 in conversation