BookmarkSubscribeRSS Feed
drjohn61
Calcite | Level 5

I have a data set that has multiple lines per record that I want to collapse into one. The problem is the following:

Each record has a 'FormID' (FID) that is the main key.

Each line for the FormID has a 'Parent' (P) and a 'Child' (C) field.

The uppermost 'level' is the Parent with no Child (Child is Null).

Each Parent  can have multiple Children.

Each Child can have Grandchildren (G1C) ...which can have Great-Grandchildren (G2C)...and so on.

There is no set number of Children, Grandchildren, etc... per Parent.

The data look like this:

FormID     Parent#     Child#

1               1

1               1               2

1               2               20

1               20             200

1               1               3

1               3               30

1               30             300

1               300           3000

.

.

.

In the above case, the records would collapse as follows:

FID-1 P-1 C-2 G1C-20 G2C-200 G3C-.

FID-1 P-1 C-3 G1C-30 G2C-300 G3C-3000

.

.

.

In other words, where the Child# becomes the Parent# for the same FormID, that signifies the next level down. I am trying to figure out how to dynamically determine the number of levels and bring the lines up to the appropriate FormID/Parent/Child/Grandchild/etc. combination.

I have hardcoded it to a G3C level by left joining any combo of FID and P/C/GC/G1C/G2C/G3C combo, since, currently, there are only G2Cs in the set, but that could change at any time...and is really inefficient...

Any help would be appreciated.

Thanks,

John L.

19 REPLIES 19
Astounding
PROC Star

Just  bit of help, at least I hope it turns out that way.  When you bring the data into SAS, you have to select what the variable names will be and they have to remain the same from one observation to the next.

So it would help if you could describe the final outcome based on your sample data, supplying both the variable names and their values for this sample.  Any additional description couldn't hurt, but that would be the starting point.

Most likely, the first step would involve a program that calculates the maximum number of children, grandchildren, etc. needed for your particular incoming data.

Good luck.

Reeza
Super User

Do you have SAS/OR? Sounds like a BOM problem to some degree.

FriedEgg
SAS Employee

I agree with Reeza about BOM on this one.

data Forms;

input FormID Parent Child @@;

cards;

1  1 .  

1  1 2   1  2 20 1 20 200

1  1 3   1  3 30 1 30 300   1 300 3000

2  9  .

2  9 4   2  4 5   2  4 6   2   6 5000

2  6 7   2  7 8   2  6 400 2   6 4000

2  6 10 2  6 11   2 6  12

;

run;

data FormV1 / view=FormV1;

if 0 then set Forms;

declare hash C(dataset:'Forms(where=(Child is not null))', MultiData:'Y', Ordered:'Y');

   C.definekey('FormID', 'Parent');

   C.definedata('Parent', 'Child');

   C.definedone();

declare hash P(dataset:'Forms(where=(Child is null))', Ordered:'Y');

   P.definekey('FormID', 'Parent');

   P.definedone();

declare hiter Pi('P');

declare hash F(Ordered:'Y');

   F.definekey('FormID', 'Generation', 'Parent', 'Child');

   F.definedone();

declare hiter Fi('F');

r=Pi.first();

do while(r=0); *FormID Loop;

   Generation=1;

   link search;

   r1=Fi.first();

   do while(r1=0);

      Generation+1;

      Parent=Child;

      link search;

      r1=Fi.next();

   end;

   F.clear();

   r=Pi.next();

end;

drop r r1 rc;

stop;

search:

   rc=C.find();

   put _all_;

   do while(rc=0);

      output;

      F.add();

      rc=C.find_next();

   end;

   return;

  

run;

data FormV2 / view=FormV2;

do Cidx=1 by 1 until(last.Generation);

   set FormV1;

   by FormID Generation;

   if first.FormID then P=Parent; else call missing(P);

   output;

end;

run;

proc transpose

data=FormV2

out=FormT(drop=_NAME_)

prefix=G

delim=C;

by FormID;

copy P;

id Generation Cidx;

var Child;

run;

PGStats
Opal | Level 21

I agree with this can be viewed as a network (or tree) problem. How exactly would this FormID collapse:

FormID     Parent#     Child#

1               1

1               1               2

1               2               20

1               2               30

1               2               40

1               20             200

1               40             500


PG

PG
drjohn61
Calcite | Level 5

Thanks All,

FIrst--I have to say I am an absolute Newbie to SAS...so, please bear with me...

Second...not sure on SAS/OR... I will have to ask our techies...

To give more info on the issue:

This is pathogen sampling data and is in a database that I pull using SQL...

For each sample, there is an associated form with a unique ID. Each sample is assigned a 'ParentSample' Number. You always know which is the parent because the ChildSample# is null... (Actually in the DB, the ParentSample is Null and the ChildSample not--the ChildSample is then the ParentSample. I just swap this in the SQL used to pull the data).

One sample could be analyzed for multiple pathogens...e.g., Salmonella and Campylobacter...

Each Pathogen can have multiple isolates with its own serotype and other factors...

So you could have (This is how I would like the data):

Form# 1

ParentSample#1  -- Salmonella Positive / Campylobacter Positive

ChildSample#2 Salmonella Enteritidis => Grandchild#3 PFGE Pattern => GreatGranchild#4 Antimicrobial Resistance

ChildSample#5 Salmonella Heidelberg => Grandchild#6 PFGE Pattern => GreatGranchild#7 Antimicrobial Resistance

ChildSample#8 Campylobacter Jejuni => Grandchild#9 PFGE Pattern

I used consecutive numbers in the example--actually they are multidigit numbers--I just used the above to show that they should be unique within the FormID/ParentSample

The data is always split between the variables ParentSample and ChildSample with the added info (e.g. Pathogen, Serotype, Antimicrobial Resistance) in additional variables...

So, the above would be in the DB as such:

FormID#     ParentSample#     ChildSample#     Info

1                 1                            -                          Salmonella + / Campylobacter +

1                 1                            2                         Enteritidis

1                 2                            3                         PFGE

1                 3                            4                         Antimicrobial Resistance

1                 1                            5                         Heidelberg

1                 5                            6                         PFGE

1                 6                            7                         Antimicrobial Resistance

1                 1                            8                         Jejuni

1                 8                            9                         PFGE

Normally, it will be sorted by FormID, ParentSample, ChildSample:

FormID#     ParentSample#     ChildSample#     Info

1                 1                            -                          Salmonella + / Campylobacter +

1                 1                            2                         Enteritidis

1                 1                            5                         Heidelberg

1                 1                            8                         Jejuni

1                 2                            3                         PFGE

1                 3                            4                         Antimicrobial Resistance

1                 5                            6                         PFGE

1                 6                            7                         Antimicrobial Resistance

1                 8                            9                         PFGE

I should also mention that normally there is only one Grandchild, GreatGrandChild, etc per Child... So, there can be multiple Children per Parent--but usually everything from there is singular back to the Child.

So PGStats data above would collapse like this:

1 2 20 200

1 2 30

1 2 40 500

FriedEgg
SAS Employee

This new criteria makes my previous posting only directional and not accurate to the problem at hand.

drjohn61
Calcite | Level 5

I appreciate the help. I am sorry I was not explicit enough. Being new to this whole SAS thing and trying to solve this has been driving me crazy... I am pretty sure I could solve it in Access with VBA...but I don't want to have to run the data through two systems...

Reeza
Super User

You can run

proc setinit; run;

and check the log to see if you have SAS/OR. If you're in academia it's highly likely, but the solution to your problem goes beyond my skill set. 

drjohn61
Calcite | Level 5

It seems I have:

OR OPT

OR PRS

OR IVS

OR LVS

Of course, I don't really know what any of that means... Smiley Happy

Ksharp
Super User

You could refer to this paper I wrote . I indeed love your question .:smileylaugh:


data Forms;
input FormID $ Parent $ Child $ @@;
cards;
1  1 .  
1  1 2   1  2 20 1 20 200
1  1 3   1  3 30 1 30 300   1 300 3000
2  9  .
2  9 4   2  4 5   2  4 6   2   6 5000
2  6 7   2  7 8   2  6 400 2   6 4000
2  6 10 2  6 11   2 6  12
;
run;


data temp(keep= path);
if _n_ eq 1 then do;
length path _path  $ 400 ;

if 0 then set Forms;
declare hash ha(hashexp:20,dataset:'Forms(where=(child is not missing))',multidata:'Y');
ha.definekey('FormID','Parent');
ha.definedata('Child');
ha.definedone();

declare hash pa(ordered:'Y');
declare hiter hi_path('pa');
pa.definekey('count');
pa.definedata('path');
pa.definedone();
end;

set Forms(where=(child is not missing));
count=1;
path=catx(' ',FormID,Parent,Child); 
pa.add();
do while(hi_path.next()=0);
_path=path;  
FormID=scan(path,1,' ');
Parent=scan(path,-1,' '); 
rc=ha.find(); if rc ne 0 then output; /*the path is deepest, we output it.*/
 do while(rc=0);
  if not find(path,strip(Child)) then do;
                                      count+1;
                                      path=catx(' ',path,Child);
                                      pa.add(); 
                                      path=_path;
                                        end;
  rc=ha.find_next();
 end;
end;
pa.clear();
run;
data want(keep=want);
 set temp;
 length want $ 400;
 do i=1 to countw(path);
   if i=1 then want=catx(' ',want,cats('FID-',scan(path,i,' ')));
    else if i=2 then want=catx(' ',want,cats('P-',scan(path,i,' ')));
      else  if i=3 then want=catx(' ',want,cats('C-',scan(path,i,' ')));
       else  want=catx(' ',want,cats('G',i-3,'C-',scan(path,i,' ')));
 end;
run;



Xia Keshan

Ksharp
Super User

Make it Better ! Smiley Happy

data Forms;

input FormID $ Parent $ Child $ @@;

cards;

1  1 . 

1  1 2   1  2 20 1 20 200

1  1 3   1  3 30 1 30 300   1 300 3000

2  9  .

2  9 4   2  4 5   2  4 6   2   6 5000

2  6 7   2  7 8   2  6 400 2   6 4000

2  6 10 2  6 11   2 6  12

;

run;

data check;

set forms ;

P=Child;C=Parent;

drop Parent Child;

run;

data temp(keep= path);

if _n_ eq 1 then do;

length path _path  $ 400 ;

if 0 then set check;

declare hash ch(hashexp:20,dataset:'check(where=(P is not missing))');

ch.definekey('FormID','P');

ch.definedone();

if 0 then set Forms;

declare hash ha(hashexp:20,dataset:'Forms(where=(child is not missing))',multidata:'Y');

ha.definekey('FormID','Parent');

ha.definedata('Child');

ha.definedone();

declare hash pa(ordered:'Y');

declare hiter hi_path('pa');

pa.definekey('count');

pa.definedata('path');

pa.definedone();

end;

set Forms(where=(child is not missing));

count=1;

path=catx(' ',FormID,Parent,Child);

pa.add();

do while(hi_path.next()=0);

_path=path; 

FormID=scan(path,1,' ');

P=scan(path,2,' ');

Parent=scan(path,-1,' ');

rc=ha.find();

if (rc ne 0 or (rc eq 0 and find(path,strip(Child)))) and ch.check() ne 0 then output;

do while(rc=0);

  if not find(path,strip(Child)) then do;

                                      count+1;

                                      path=catx(' ',path,Child);

                                      pa.add();

                                      path=_path;

                                        end;

  rc=ha.find_next();

end;

end;

pa.clear();

run;

data want(keep=want);

set temp;

length want $ 400;

do i=1 to countw(path);

   if i=1 then want=catx(' ',want,cats('FID-',scan(path,i,' ')));

    else if i=2 then want=catx(' ',want,cats('P-',scan(path,i,' ')));

else  if i=3 then want=catx(' ',want,cats('C-',scan(path,i,' ')));

  else  want=catx(' ',want,cats('G',i-3,'C-',scan(path,i,' ')));

end;

run;

Xia Keshan

drjohn61
Calcite | Level 5


Thanks...this is a bit above my level by just looking at it. I am going to have to run it and see what I get. Playing is how I learn things... Smiley Happy

drjohn61
Calcite | Level 5

Xia,

This is AWESOME! The only thing that I need different is the last step... I probably confused things a bit above...

Instead of the output:

Want

FID-1 P-1 C-2 G1-3 G2-4 G3-5

I need the outout:

FID   P     C     G1     G2     G3

1       1     2      3        4        5

In other words, I need the prefixes you concatenated onto the numbers to be the variable names. Smiley Happy

Patrick
Opal | Level 21

Tampering with 's code below changes in bold should give you what you're after.

data temp(keep= path);

  if _n_ eq 1 then

    do;

      length path _path  $ 400;

      if 0 then set check;

      declare hash ch(hashexp:20,dataset:'check(where=(P is not missing))');

      ch.definekey('FormID','P');

      ch.definedone();

      if 0 then set Forms;

      declare hash ha(hashexp:20,dataset:'Forms(where=(child is not missing))',multidata:'Y');

      ha.definekey('FormID','Parent');

      ha.definedata('Child');

      ha.definedone();

      declare hash pa(ordered:'Y');

      declare hiter hi_path('pa');

      pa.definekey('count');

      pa.definedata('path');

      pa.definedone();

    end;

  set Forms(where=(child is not missing)) end=last;

  count=1;

  path=catx(' ',FormID,Parent,Child);

  pa.add();

  do while(hi_path.next()=0);

    _path=path;

    FormID=scan(path,1,' ');

    P=scan(path,2,' ');

    Parent=scan(path,-1,' ');

    rc=ha.find();

    if (rc ne 0 or (rc eq 0 and find(path,strip(Child)))) and ch.check() ne 0 then

      output;

    do while(rc=0);

      if not find(path,strip(Child)) then

        do;

          count+1;

          path=catx(' ',path,Child);

          pa.add();

          path=_path;

        end;

      rc=ha.find_next();

    end;

  end;

  pa.clear();

  retain max_levels;

  max_levels=max(countw(path),max_levels);

  if last then call symputx('max_levels',max(4,max_levels));

run;

data want (keep=FID P C G:);

  set temp;

array el {*} 8. FID P C G1 - G%eval(&max_levels-3);

  do i=1 to countw(path);

    el=scan(path,i,' ');

  end;

run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 19 replies
  • 3089 views
  • 11 likes
  • 8 in conversation