Collapsing Data (multiple lines to one) - Page 2

Ksharp · Posted 12-26-2014 01:06 AM

Here is my version.

Hi. Actually Patrick has already done it for me .Thanks Patrick. Happy Holiday!

data Forms;

input FormID $ Parent $ Child $ @@;

cards;

1 1 .

1 1 2 1 2 20 1 20 200

1 1 3 1 3 30 1 30 300 1 300 3000

2 9 .

2 9 4 2 4 5 2 4 6 2 6 5000

2 6 7 2 7 8 2 6 400 2 6 4000

2 6 10 2 6 11 2 6 12

;

run;

data check;

set forms ;

P=Child;C=Parent;

drop Parent Child;

run;

data temp(keep= path);

if _n_ eq 1 then do;

length path _path $ 400 ;

if 0 then set check;

declare hash ch(hashexp:20,dataset:'check(where=(P is not missing))');

ch.definekey('FormID','P');

ch.definedone();

if 0 then set Forms;

declare hash ha(hashexp:20,dataset:'Forms(where=(child is not missing))',multidata:'Y');

ha.definekey('FormID','Parent');

ha.definedata('Child');

ha.definedone();

declare hash pa(ordered:'Y');

declare hiter hi_path('pa');

pa.definekey('count');

pa.definedata('path');

pa.definedone();

end;

set Forms(where=(child is not missing));

count=1;

path=catx(' ',FormID,Parent,Child);

pa.add();

do while(hi_path.next()=0);

_path=path;

FormID=scan(path,1,' ');

P=scan(path,2,' ');

Parent=scan(path,-1,' ');

rc=ha.find();

if (rc ne 0 or (rc eq 0 and find(path,strip(Child)))) and ch.check() ne 0 then output;

do while(rc=0);

if not find(path,strip(Child)) then do;

count+1;

path=catx(' ',path,Child);

pa.add();

path=_path;

end;

rc=ha.find_next();

end;

pa.clear();

run;

proc sql noprint;

select max(n)-3 into : n separated by ' '

from (select countw(path) as n from temp);

quit;

data want;

set temp;

array x{*} $ 40 FID P C G1-G&n;

do i=1 to dim(x);

x{i}=scan(path,i,' ');

end;

drop path i;

run;

Xia Keshan

drjohn61 · Posted 12-26-2014 10:47 AM

Thanks All,

I REALLY appreciate the help. This will streamline my code and the time needed to run it immensely!

I would have liked to award TWO correct answers, since both Patrick and Xia's are correct....but it won't let me. Xia still gets 20 points though!

Thanks again,

John L.

billfish · Posted 12-29-2014 12:07 PM

My suggested solution is not as sophisticated as the ones already presented. Looking at the input and output, I notice that
1- A given FormId correspond to the same Parent
2- The child at of the line for each line does not show up in the parent column.

The following solution assumes that there is no lineage with more than 9999 generations.

data want(keep=FormID P1 G1-G9999);

if 0 then set forms; *just want FormID to be first variable in output;

array PP(10000) ; * parent;

array CA(10000) ; * child ;

array CB(10000) ; * end_of_line children;

array LA(10000) P1 G1-G9999; * lineage sequence;

r=0;

zP=.;

call missing(of PP(*));

call missing(of CA(*));

call missing(of CB(*));

* load data for given FormID in matrices PP CA;

do until (last.FormID);

set forms end=zEnd;

by FormID;

r+1;

PP(r)=Parent;

CA(r)=Child;

if Child=. then zP=Parent;

end;

* find the end_of_line Children ;

s=0;

do i= 1 to r;

if not(CA(i)=.) and not(CA(i) in PP) then do;

s+1;

CB(s)=CA(i);

end;

* for each end_of_line child, find the lineage;

do i = 1 to s;

t=1;

LA(t)=CB(i);

zDone=0;

do until (zDone);

do j = 1 to r;

if LA(t)=CA(j) then do;

t+1;

LA(t)=PP(j);

if LA(t)=zP then zDone=1;

leave;

end;

t_max=max(t_max,t);

do j = 1 to int(t/2); *reverse lineage sequence;

a1 = LA(j);

LA(j)=LA(t+1-j);;

LA(t+1-j)= a1;

end;

output;

call missing(of LA(*)); * clear for next lineage;

end;

* delete all G# which is null for all lineages;

if zEnd then do;

call execute('data want; set want(keep=FormID P1 G1-'||compress('G'||put(t_max-1,5.))||');run;');

end;

run;

billfish · Posted 01-07-2015 03:28 PM

I propose a different SAS Hash-object solution.

The first DATA STEP creates a simulated geneological (Parent,Child) table creating some lineages with more than 90 generations.
The second DATA STEP is an alternative (proposed) solution using SAS Hash objects and arrays.

Using the array, non-Hash-object solution which I have presented previously in this forum, 4674 lineages are presented. It took 7 minutes.

Using the hash methods presented previously here, using the criterion:
hashexp:20
produces the following SAS error message:
ERROR: Hash object added 8912880 items when memory failure occurred.

Using the hash methods presented previously here, using the criterion:
hashexp:40
produces in the SAS LOG:
NOTE: There were 4305 observations read from the data set WORK.TEMP.
The hash methods took 10 seconds.

Using the hash methods presented previously here, using the criterion:
no hashexp used
produces in the SAS LOG:
NOTE: The data set WORK.TEMP has 4305 observations and 1 variables.
The hash methods took 10 seconds and gave identical results as with the (hashexp:40) criterion.

So the array method produced 369 (=4674-4305) more lineages than the hash methods presented in this forum.

in_ary = 1 when the lineage is present in the array (non-hash) method.
in_hsh = 1 when the lineage is present in the hash method(s).

Doing a full join on the 2 set of lineages, the results are:

in_ary in_hsh c_tot
0       1       51
1       0      420
1       1     4254

Let us look briefly some of the 51 lineages present in the hash results and not the array method.

Let us look at the hash lineages where the end-of-line (or latest) child is 495, 806, 1032. According to the hash results:
The lastest child 495 occurs at the 66th generation.
The lastest child 806 occurs at the 89th generation.
The lastest child 1032 occurs at the 83rd generation.

Looking at the input table Forms for either Parent in (495,806,1032), one finds:

FormID Parent Child

101      1032     417
101      1032     8796
101      1032     4866
101      1032     3620
101      495      474
101      495      6750
101      806      791

Thus the supposed terminal (end-of-line) children produce other children.

Given the SAS system I was using (at a Fortune 500 company), the hash methods previously presented here did not complete the task.

Hash methods are fast, so I wrote a hash-object solution. It produces 4674 lineages. It took 4 secs. Going from 7 minutes to 4 secs is progress.

/*********************/
/**** DATA STEP 1 ****/
/*********************/
data t_a(keep=zParent zChild rnd);
   array elemA(10000) (1:10000);
   array genA(500); ** start index of a generation **;
   array genB(500); ** end   index of a generation **;
   ** scramble the elemA matrix elements;
   do i = 1 to 7000;
      i1 = ceil(10000*ranuni(3));
      i2 = ceil(10000*ranuni(5));
      x = elemA(i1);
      elemA(i1) = elemA(i2);
      elemA(i2) = x;
   end;
   r=0; ** count number of generations;
   c=0; ** count nuber of children;
   do while (c<=10000);
      r+1;
      if (r=1) then do;
         c+1;
         genA(1)=1;
         genB(1)=1;
      end;
      else do;
         g= 10+ceil(200*ranuni(27));
         do j=1 to g;
            c+1;
            if (c>10000) then leave;
            else do;
               if j=1 then genA(r)=c;
               genB(r)=c;
            end;
         end;
      end;
   end;
   ** for given child generation: connect child to parent;
   do i1 = r to 2 by -1;
      do i2 = genA(i1) to genB(i1);
         g_sum= genB(i1-1)-genA(i1-1)+1;
         x1= genA(i1-1) -1 +ceil(g_sum*ranuni(3));
         zParent= elemA(x1);
         zChild = elemA(i2);
         rnd=ranuni(19);
         output;
      end;
   end;
run;

proc sort data=t_a; by rnd; run; *** just to make things interesting;

/** transform table t_a to be compatible with previous Hash solutions **/
data Forms(keep=FormID Parent Child);
   length FormID Parent Child $8.;
   set t_a;
   FormID = '101';
   Parent = strip(put(zParent,8.));
   Child = strip(put(zChild,8.));
run;

/*********************/
/**** DATA STEP 2 ****/
/*********************/
data t_want(keep=t_gen P1 G1-G9999);
   length c_end 8.;
   t_max=0;
   array GG(10000) P1 G1-G9999;

   if _N_=1 then do;
      if 0 then set t_a;
      declare hash ha(dataset:'t_a',multidata:'N' );
      ha.definekey('zChild');
      ha.definedata('zChild','zParent');
      ha.definedone();

      declare hash hb(dataset:'t_a',multidata:'N' );
      hb.definekey('zParent');
      hb.definedata('zChild','zParent');
      hb.definedone();

      declare hash hd(multidata:'N');
      hd.definekey('c_end');
      declare hiter dIter('hd');
      hd.definedone();
   end;

   do until (zDone); ** make list of end-of-line children;
      set t_a(rename=(zchild=zc zparent=zp)) nobs=n_last end=zDone;
      zParent=zC;
      if hb.find() then do;
         c_end=zParent;
         if hd.check() then do; hd.add(); end;
      end;
   end;

   rc = dIter.first();
   do while (rc = 0);
      aDone=0;
      t_gen=0;
      do until (aDone);
              if (t_gen=0)      then do; t_gen+1; GG(t_gen)= c_end;   zChild=c_end; end;
         else if not(ha.find()) then do; t_gen+1; GG(t_gen)= zParent; zChild=zParent; end;
         else if     ha.find() then do; aDone=1; end;
      end;
      do i=1 to int(t_gen/2); ** reverse lineage sequence;
         x1= GG(i);
         GG(i)= GG(t_gen+1-i);
         GG(t_gen+1-i)=x1;
      end;
      output;
      call missing(of GG(*));
      t_max=max(t_max,t_gen);
      rc = dIter.next();
   end;

call execute('data t_want; set t_want(keep=t_gen P1 G1-'||compress('G'||put(t_max-1,6.))||');run;');
run;

Ksharp · Posted 01-07-2015 09:28 PM

I can optimize it as well , if you want.

And apparently your code didn't consider the LOOP situation .

if data like 1 1 2 1 2 20 1 20 200 1 200 201 1 201 20

You will get the wrong result .

Re: Collapsing Data (multiple lines to one)

Re: Collapsing Data (multiple lines to one)

Re: Collapsing Data (multiple lines to one)

Re: Collapsing Data (multiple lines to one)

Re: Collapsing Data (multiple lines to one)

Re: Collapsing Data (multiple lines to one)

Re: Collapsing Data (multiple lines to one)

Re: Collapsing Data (multiple lines to one)

Re: Collapsing Data (multiple lines to one)

Re: Collapsing Data (multiple lines to one)

SAS Innovate 2025: Call for Content

Click image to register for webinar

Classroom Training Available!