Re: Collapsing Data (multiple lines to one)

drjohn61 · Posted 12-24-2014 05:05 PM

I have a data set that has multiple lines per record that I want to collapse into one. The problem is the following:

Each record has a 'FormID' (FID) that is the main key.

Each line for the FormID has a 'Parent' (P) and a 'Child' (C) field.

The uppermost 'level' is the Parent with no Child (Child is Null).

Each Parent can have multiple Children.

Each Child can have Grandchildren (G1C) ...which can have Great-Grandchildren (G2C)...and so on.

There is no set number of Children, Grandchildren, etc... per Parent.

The data look like this:

FormID Parent# Child#

1 1

1 1 2

1 2 20

1 20 200

1 1 3

1 3 30

1 30 300

1 300 3000

.

In the above case, the records would collapse as follows:

FID-1 P-1 C-2 G1C-20 G2C-200 G3C-.

FID-1 P-1 C-3 G1C-30 G2C-300 G3C-3000

.

In other words, where the Child# becomes the Parent# for the same FormID, that signifies the next level down. I am trying to figure out how to dynamically determine the number of levels and bring the lines up to the appropriate FormID/Parent/Child/Grandchild/etc. combination.

I have hardcoded it to a G3C level by left joining any combo of FID and P/C/GC/G1C/G2C/G3C combo, since, currently, there are only G2Cs in the set, but that could change at any time...and is really inefficient...

Any help would be appreciated.

Thanks,

John L.

Astounding · Posted 12-24-2014 08:18 PM

Just bit of help, at least I hope it turns out that way. When you bring the data into SAS, you have to select what the variable names will be and they have to remain the same from one observation to the next.

So it would help if you could describe the final outcome based on your sample data, supplying both the variable names and their values for this sample. Any additional description couldn't hurt, but that would be the starting point.

Most likely, the first step would involve a program that calculates the maximum number of children, grandchildren, etc. needed for your particular incoming data.

Good luck.

Reeza · Posted 12-24-2014 09:00 PM

Do you have SAS/OR? Sounds like a BOM problem to some degree.

FriedEgg · Posted 12-24-2014 10:41 PM

I agree with Reeza about BOM on this one.

data Forms;

input FormID Parent Child @@;

cards;

1 1 .

1 1 2 1 2 20 1 20 200

1 1 3 1 3 30 1 30 300 1 300 3000

2 9 .

2 9 4 2 4 5 2 4 6 2 6 5000

2 6 7 2 7 8 2 6 400 2 6 4000

2 6 10 2 6 11 2 6 12

;

run;

data FormV1 / view=FormV1;

if 0 then set Forms;

declare hash C(dataset:'Forms(where=(Child is not null))', MultiData:'Y', Ordered:'Y');

C.definekey('FormID', 'Parent');

C.definedata('Parent', 'Child');

C.definedone();

declare hash P(dataset:'Forms(where=(Child is null))', Ordered:'Y');

P.definekey('FormID', 'Parent');

P.definedone();

declare hiter Pi('P');

declare hash F(Ordered:'Y');

F.definekey('FormID', 'Generation', 'Parent', 'Child');

F.definedone();

declare hiter Fi('F');

r=Pi.first();

do while(r=0); *FormID Loop;

Generation=1;

link search;

r1=Fi.first();

do while(r1=0);

Generation+1;

Parent=Child;

link search;

r1=Fi.next();

end;

F.clear();

r=Pi.next();

end;

drop r r1 rc;

stop;

search:

rc=C.find();

put _all_;

do while(rc=0);

output;

F.add();

rc=C.find_next();

end;

return;

run;

data FormV2 / view=FormV2;

do Cidx=1 by 1 until(last.Generation);

set FormV1;

by FormID Generation;

if first.FormID then P=Parent; else call missing(P);

output;

end;

run;

proc transpose

data=FormV2

out=FormT(drop=_NAME_)

prefix=G

delim=C;

by FormID;

copy P;

id Generation Cidx;

var Child;

run;

PGStats · Posted 12-24-2014 10:47 PM

I agree with this can be viewed as a network (or tree) problem. How exactly would this FormID collapse:

FormID Parent# Child#

1 1

1 1 2

1 2 20

1 2 30

1 2 40

1 20 200

1 40 500

PG

drjohn61 · Posted 12-24-2014 11:57 PM

Thanks All,

FIrst--I have to say I am an absolute Newbie to SAS...so, please bear with me...

Second...not sure on SAS/OR... I will have to ask our techies...

To give more info on the issue:

This is pathogen sampling data and is in a database that I pull using SQL...

For each sample, there is an associated form with a unique ID. Each sample is assigned a 'ParentSample' Number. You always know which is the parent because the ChildSample# is null... (Actually in the DB, the ParentSample is Null and the ChildSample not--the ChildSample is then the ParentSample. I just swap this in the SQL used to pull the data).

One sample could be analyzed for multiple pathogens...e.g., Salmonella and Campylobacter...

Each Pathogen can have multiple isolates with its own serotype and other factors...

So you could have (This is how I would like the data):

Form# 1

ParentSample#1 -- Salmonella Positive / Campylobacter Positive

ChildSample#2 Salmonella Enteritidis => Grandchild#3 PFGE Pattern => GreatGranchild#4 Antimicrobial Resistance

ChildSample#5 Salmonella Heidelberg => Grandchild#6 PFGE Pattern => GreatGranchild#7 Antimicrobial Resistance

ChildSample#8 Campylobacter Jejuni => Grandchild#9 PFGE Pattern

I used consecutive numbers in the example--actually they are multidigit numbers--I just used the above to show that they should be unique within the FormID/ParentSample

The data is always split between the variables ParentSample and ChildSample with the added info (e.g. Pathogen, Serotype, Antimicrobial Resistance) in additional variables...

So, the above would be in the DB as such:

FormID# ParentSample# ChildSample# Info

1 1 - Salmonella + / Campylobacter +

1 1 2 Enteritidis

1 2 3 PFGE

1 3 4 Antimicrobial Resistance

1 1 5 Heidelberg

1 5 6 PFGE

1 6 7 Antimicrobial Resistance

1 1 8 Jejuni

1 8 9 PFGE

Normally, it will be sorted by FormID, ParentSample, ChildSample:

FormID# ParentSample# ChildSample# Info

1 1 - Salmonella + / Campylobacter +

1 1 2 Enteritidis

1 1 5 Heidelberg

1 1 8 Jejuni

1 2 3 PFGE

1 3 4 Antimicrobial Resistance

1 5 6 PFGE

1 6 7 Antimicrobial Resistance

1 8 9 PFGE

I should also mention that normally there is only one Grandchild, GreatGrandChild, etc per Child... So, there can be multiple Children per Parent--but usually everything from there is singular back to the Child.

So PGStats data above would collapse like this:

1 2 20 200

1 2 30

1 2 40 500

FriedEgg · Posted 12-25-2014 12:27 AM

This new criteria makes my previous posting only directional and not accurate to the problem at hand.

drjohn61 · Posted 12-25-2014 02:00 PM

I appreciate the help. I am sorry I was not explicit enough. Being new to this whole SAS thing and trying to solve this has been driving me crazy... I am pretty sure I could solve it in Access with VBA...but I don't want to have to run the data through two systems...

Reeza · Posted 12-25-2014 01:44 AM

You can run

proc setinit; run;

and check the log to see if you have SAS/OR. If you're in academia it's highly likely, but the solution to your problem goes beyond my skill set.

drjohn61 · Posted 12-25-2014 02:06 PM

It seems I have:

OR OPT

OR PRS

OR IVS

OR LVS

Of course, I don't really know what any of that means...

Ksharp · Posted 12-25-2014 01:54 AM

You could refer to this paper I wrote . I indeed love your question .:smileylaugh:


data Forms;
input FormID $ Parent $ Child $ @@;
cards;
1  1 .  
1  1 2   1  2 20 1 20 200
1  1 3   1  3 30 1 30 300   1 300 3000
2  9  .
2  9 4   2  4 5   2  4 6   2   6 5000
2  6 7   2  7 8   2  6 400 2   6 4000
2  6 10 2  6 11   2 6  12
;
run;


data temp(keep= path);
if _n_ eq 1 then do;
length path _path  $ 400 ;

if 0 then set Forms;
declare hash ha(hashexp:20,dataset:'Forms(where=(child is not missing))',multidata:'Y');
ha.definekey('FormID','Parent');
ha.definedata('Child');
ha.definedone();

declare hash pa(ordered:'Y');
declare hiter hi_path('pa');
pa.definekey('count');
pa.definedata('path');
pa.definedone();
end;

set Forms(where=(child is not missing));
count=1;
path=catx(' ',FormID,Parent,Child); 
pa.add();
do while(hi_path.next()=0);
_path=path;  
FormID=scan(path,1,' ');
Parent=scan(path,-1,' '); 
rc=ha.find(); if rc ne 0 then output; /*the path is deepest, we output it.*/
 do while(rc=0);
  if not find(path,strip(Child)) then do;
                                      count+1;
                                      path=catx(' ',path,Child);
                                      pa.add(); 
                                      path=_path;
                                        end;
  rc=ha.find_next();
 end;
end;
pa.clear();
run;
data want(keep=want);
 set temp;
 length want $ 400;
 do i=1 to countw(path);
   if i=1 then want=catx(' ',want,cats('FID-',scan(path,i,' ')));
    else if i=2 then want=catx(' ',want,cats('P-',scan(path,i,' ')));
      else  if i=3 then want=catx(' ',want,cats('C-',scan(path,i,' ')));
       else  want=catx(' ',want,cats('G',i-3,'C-',scan(path,i,' ')));
 end;
run;

Xia Keshan

Ksharp · Posted 12-25-2014 03:44 AM

Make it Better !

data Forms;

input FormID $ Parent $ Child $ @@;

cards;

1 1 .

1 1 2 1 2 20 1 20 200

1 1 3 1 3 30 1 30 300 1 300 3000

2 9 .

2 9 4 2 4 5 2 4 6 2 6 5000

2 6 7 2 7 8 2 6 400 2 6 4000

2 6 10 2 6 11 2 6 12

;

run;

data check;

set forms ;

P=Child;C=Parent;

drop Parent Child;

run;

data temp(keep= path);

if _n_ eq 1 then do;

length path _path $ 400 ;

if 0 then set check;

declare hash ch(hashexp:20,dataset:'check(where=(P is not missing))');

ch.definekey('FormID','P');

ch.definedone();

if 0 then set Forms;

declare hash ha(hashexp:20,dataset:'Forms(where=(child is not missing))',multidata:'Y');

ha.definekey('FormID','Parent');

ha.definedata('Child');

ha.definedone();

declare hash pa(ordered:'Y');

declare hiter hi_path('pa');

pa.definekey('count');

pa.definedata('path');

pa.definedone();

end;

set Forms(where=(child is not missing));

count=1;

path=catx(' ',FormID,Parent,Child);

pa.add();

do while(hi_path.next()=0);

_path=path;

FormID=scan(path,1,' ');

P=scan(path,2,' ');

Parent=scan(path,-1,' ');

rc=ha.find();

if (rc ne 0 or (rc eq 0 and find(path,strip(Child)))) and ch.check() ne 0 then output;

do while(rc=0);

if not find(path,strip(Child)) then do;

count+1;

path=catx(' ',path,Child);

pa.add();

path=_path;

end;

rc=ha.find_next();

end;

pa.clear();

run;

data want(keep=want);

set temp;

length want $ 400;

do i=1 to countw(path);

if i=1 then want=catx(' ',want,cats('FID-',scan(path,i,' ')));

else if i=2 then want=catx(' ',want,cats('P-',scan(path,i,' ')));

else if i=3 then want=catx(' ',want,cats('C-',scan(path,i,' ')));

else want=catx(' ',want,cats('G',i-3,'C-',scan(path,i,' ')));

end;

run;

Xia Keshan

drjohn61 · Posted 12-25-2014 02:01 PM

Thanks...this is a bit above my level by just looking at it. I am going to have to run it and see what I get. Playing is how I learn things...

drjohn61 · Posted 12-25-2014 05:21 PM

Xia,

This is AWESOME! The only thing that I need different is the last step... I probably confused things a bit above...

Instead of the output:

Want

FID-1 P-1 C-2 G1-3 G2-4 G3-5

I need the outout:

FID P C G1 G2 G3

1 1 2 3 4 5

In other words, I need the prefixes you concatenated onto the numbers to be the variable names.

Patrick · Posted 12-25-2014 06:17 PM

Tampering with 's code below changes in bold should give you what you're after.

data temp(keep= path);

if _n_ eq 1 then

do;

length path _path $ 400;

if 0 then set check;

declare hash ch(hashexp:20,dataset:'check(where=(P is not missing))');

ch.definekey('FormID','P');

ch.definedone();

if 0 then set Forms;

declare hash ha(hashexp:20,dataset:'Forms(where=(child is not missing))',multidata:'Y');

ha.definekey('FormID','Parent');

ha.definedata('Child');

ha.definedone();

declare hash pa(ordered:'Y');

declare hiter hi_path('pa');

pa.definekey('count');

pa.definedata('path');

pa.definedone();

end;

set Forms(where=(child is not missing)) end=last;

count=1;

path=catx(' ',FormID,Parent,Child);

pa.add();

do while(hi_path.next()=0);

_path=path;

FormID=scan(path,1,' ');

P=scan(path,2,' ');

Parent=scan(path,-1,' ');

rc=ha.find();

if (rc ne 0 or (rc eq 0 and find(path,strip(Child)))) and ch.check() ne 0 then

output;

do while(rc=0);

if not find(path,strip(Child)) then

do;

count+1;

path=catx(' ',path,Child);

pa.add();

path=_path;

end;

rc=ha.find_next();

end;

pa.clear();

retain max_levels;

max_levels=max(countw(path),max_levels);

if last then call symputx('max_levels',max(4,max_levels));

run;

data want (keep=FID P C G:);

set temp;

array el {*} 8. FID P C G1 - G%eval(&max_levels-3);

do i=1 to countw(path);

el=scan(path,i,' ');

end;

run;

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away