I have a data set that has multiple lines per record that I want to collapse into one. The problem is the following:
Each record has a 'FormID' (FID) that is the main key.
Each line for the FormID has a 'Parent' (P) and a 'Child' (C) field.
The uppermost 'level' is the Parent with no Child (Child is Null).
Each Parent can have multiple Children.
Each Child can have Grandchildren (G1C) ...which can have Great-Grandchildren (G2C)...and so on.
There is no set number of Children, Grandchildren, etc... per Parent.
The data look like this:
FormID Parent# Child#
1 1
1 1 2
1 2 20
1 20 200
1 1 3
1 3 30
1 30 300
1 300 3000
.
.
.
In the above case, the records would collapse as follows:
FID-1 P-1 C-2 G1C-20 G2C-200 G3C-.
FID-1 P-1 C-3 G1C-30 G2C-300 G3C-3000
.
.
.
In other words, where the Child# becomes the Parent# for the same FormID, that signifies the next level down. I am trying to figure out how to dynamically determine the number of levels and bring the lines up to the appropriate FormID/Parent/Child/Grandchild/etc. combination.
I have hardcoded it to a G3C level by left joining any combo of FID and P/C/GC/G1C/G2C/G3C combo, since, currently, there are only G2Cs in the set, but that could change at any time...and is really inefficient...
Any help would be appreciated.
Thanks,
John L.
Just bit of help, at least I hope it turns out that way. When you bring the data into SAS, you have to select what the variable names will be and they have to remain the same from one observation to the next.
So it would help if you could describe the final outcome based on your sample data, supplying both the variable names and their values for this sample. Any additional description couldn't hurt, but that would be the starting point.
Most likely, the first step would involve a program that calculates the maximum number of children, grandchildren, etc. needed for your particular incoming data.
Good luck.
Do you have SAS/OR? Sounds like a BOM problem to some degree.
I agree with Reeza about BOM on this one.
data Forms;
input FormID Parent Child @@;
cards;
1 1 .
1 1 2 1 2 20 1 20 200
1 1 3 1 3 30 1 30 300 1 300 3000
2 9 .
2 9 4 2 4 5 2 4 6 2 6 5000
2 6 7 2 7 8 2 6 400 2 6 4000
2 6 10 2 6 11 2 6 12
;
run;
data FormV1 / view=FormV1;
if 0 then set Forms;
declare hash C(dataset:'Forms(where=(Child is not null))', MultiData:'Y', Ordered:'Y');
C.definekey('FormID', 'Parent');
C.definedata('Parent', 'Child');
C.definedone();
declare hash P(dataset:'Forms(where=(Child is null))', Ordered:'Y');
P.definekey('FormID', 'Parent');
P.definedone();
declare hiter Pi('P');
declare hash F(Ordered:'Y');
F.definekey('FormID', 'Generation', 'Parent', 'Child');
F.definedone();
declare hiter Fi('F');
r=Pi.first();
do while(r=0); *FormID Loop;
Generation=1;
link search;
r1=Fi.first();
do while(r1=0);
Generation+1;
Parent=Child;
link search;
r1=Fi.next();
end;
F.clear();
r=Pi.next();
end;
drop r r1 rc;
stop;
search:
rc=C.find();
put _all_;
do while(rc=0);
output;
F.add();
rc=C.find_next();
end;
return;
run;
data FormV2 / view=FormV2;
do Cidx=1 by 1 until(last.Generation);
set FormV1;
by FormID Generation;
if first.FormID then P=Parent; else call missing(P);
output;
end;
run;
proc transpose
data=FormV2
out=FormT(drop=_NAME_)
prefix=G
delim=C;
by FormID;
copy P;
id Generation Cidx;
var Child;
run;
Thanks All,
FIrst--I have to say I am an absolute Newbie to SAS...so, please bear with me...
Second...not sure on SAS/OR... I will have to ask our techies...
To give more info on the issue:
This is pathogen sampling data and is in a database that I pull using SQL...
For each sample, there is an associated form with a unique ID. Each sample is assigned a 'ParentSample' Number. You always know which is the parent because the ChildSample# is null... (Actually in the DB, the ParentSample is Null and the ChildSample not--the ChildSample is then the ParentSample. I just swap this in the SQL used to pull the data).
One sample could be analyzed for multiple pathogens...e.g., Salmonella and Campylobacter...
Each Pathogen can have multiple isolates with its own serotype and other factors...
So you could have (This is how I would like the data):
Form# 1
ParentSample#1 -- Salmonella Positive / Campylobacter Positive
ChildSample#2 Salmonella Enteritidis => Grandchild#3 PFGE Pattern => GreatGranchild#4 Antimicrobial Resistance
ChildSample#5 Salmonella Heidelberg => Grandchild#6 PFGE Pattern => GreatGranchild#7 Antimicrobial Resistance
ChildSample#8 Campylobacter Jejuni => Grandchild#9 PFGE Pattern
I used consecutive numbers in the example--actually they are multidigit numbers--I just used the above to show that they should be unique within the FormID/ParentSample
The data is always split between the variables ParentSample and ChildSample with the added info (e.g. Pathogen, Serotype, Antimicrobial Resistance) in additional variables...
So, the above would be in the DB as such:
FormID# ParentSample# ChildSample# Info
1 1 - Salmonella + / Campylobacter +
1 1 2 Enteritidis
1 2 3 PFGE
1 3 4 Antimicrobial Resistance
1 1 5 Heidelberg
1 5 6 PFGE
1 6 7 Antimicrobial Resistance
1 1 8 Jejuni
1 8 9 PFGE
Normally, it will be sorted by FormID, ParentSample, ChildSample:
FormID# ParentSample# ChildSample# Info
1 1 - Salmonella + / Campylobacter +
1 1 2 Enteritidis
1 1 5 Heidelberg
1 1 8 Jejuni
1 2 3 PFGE
1 3 4 Antimicrobial Resistance
1 5 6 PFGE
1 6 7 Antimicrobial Resistance
1 8 9 PFGE
I should also mention that normally there is only one Grandchild, GreatGrandChild, etc per Child... So, there can be multiple Children per Parent--but usually everything from there is singular back to the Child.
So PGStats data above would collapse like this:
1 2 20 200
1 2 30
1 2 40 500
This new criteria makes my previous posting only directional and not accurate to the problem at hand.
I appreciate the help. I am sorry I was not explicit enough. Being new to this whole SAS thing and trying to solve this has been driving me crazy... I am pretty sure I could solve it in Access with VBA...but I don't want to have to run the data through two systems...
You can run
proc setinit; run;
and check the log to see if you have SAS/OR. If you're in academia it's highly likely, but the solution to your problem goes beyond my skill set.
It seems I have:
OR OPT
OR PRS
OR IVS
OR LVS
Of course, I don't really know what any of that means...
You could refer to this paper I wrote . I indeed love your question .:smileylaugh:
data Forms; input FormID $ Parent $ Child $ @@; cards; 1 1 . 1 1 2 1 2 20 1 20 200 1 1 3 1 3 30 1 30 300 1 300 3000 2 9 . 2 9 4 2 4 5 2 4 6 2 6 5000 2 6 7 2 7 8 2 6 400 2 6 4000 2 6 10 2 6 11 2 6 12 ; run; data temp(keep= path); if _n_ eq 1 then do; length path _path $ 400 ; if 0 then set Forms; declare hash ha(hashexp:20,dataset:'Forms(where=(child is not missing))',multidata:'Y'); ha.definekey('FormID','Parent'); ha.definedata('Child'); ha.definedone(); declare hash pa(ordered:'Y'); declare hiter hi_path('pa'); pa.definekey('count'); pa.definedata('path'); pa.definedone(); end; set Forms(where=(child is not missing)); count=1; path=catx(' ',FormID,Parent,Child); pa.add(); do while(hi_path.next()=0); _path=path; FormID=scan(path,1,' '); Parent=scan(path,-1,' '); rc=ha.find(); if rc ne 0 then output; /*the path is deepest, we output it.*/ do while(rc=0); if not find(path,strip(Child)) then do; count+1; path=catx(' ',path,Child); pa.add(); path=_path; end; rc=ha.find_next(); end; end; pa.clear(); run; data want(keep=want); set temp; length want $ 400; do i=1 to countw(path); if i=1 then want=catx(' ',want,cats('FID-',scan(path,i,' '))); else if i=2 then want=catx(' ',want,cats('P-',scan(path,i,' '))); else if i=3 then want=catx(' ',want,cats('C-',scan(path,i,' '))); else want=catx(' ',want,cats('G',i-3,'C-',scan(path,i,' '))); end; run;
Xia Keshan
Make it Better !
data Forms;
input FormID $ Parent $ Child $ @@;
cards;
1 1 .
1 1 2 1 2 20 1 20 200
1 1 3 1 3 30 1 30 300 1 300 3000
2 9 .
2 9 4 2 4 5 2 4 6 2 6 5000
2 6 7 2 7 8 2 6 400 2 6 4000
2 6 10 2 6 11 2 6 12
;
run;
data check;
set forms ;
P=Child;C=Parent;
drop Parent Child;
run;
data temp(keep= path);
if _n_ eq 1 then do;
length path _path $ 400 ;
if 0 then set check;
declare hash ch(hashexp:20,dataset:'check(where=(P is not missing))');
ch.definekey('FormID','P');
ch.definedone();
if 0 then set Forms;
declare hash ha(hashexp:20,dataset:'Forms(where=(child is not missing))',multidata:'Y');
ha.definekey('FormID','Parent');
ha.definedata('Child');
ha.definedone();
declare hash pa(ordered:'Y');
declare hiter hi_path('pa');
pa.definekey('count');
pa.definedata('path');
pa.definedone();
end;
set Forms(where=(child is not missing));
count=1;
path=catx(' ',FormID,Parent,Child);
pa.add();
do while(hi_path.next()=0);
_path=path;
FormID=scan(path,1,' ');
P=scan(path,2,' ');
Parent=scan(path,-1,' ');
rc=ha.find();
if (rc ne 0 or (rc eq 0 and find(path,strip(Child)))) and ch.check() ne 0 then output;
do while(rc=0);
if not find(path,strip(Child)) then do;
count+1;
path=catx(' ',path,Child);
pa.add();
path=_path;
end;
rc=ha.find_next();
end;
end;
pa.clear();
run;
data want(keep=want);
set temp;
length want $ 400;
do i=1 to countw(path);
if i=1 then want=catx(' ',want,cats('FID-',scan(path,i,' ')));
else if i=2 then want=catx(' ',want,cats('P-',scan(path,i,' ')));
else if i=3 then want=catx(' ',want,cats('C-',scan(path,i,' ')));
else want=catx(' ',want,cats('G',i-3,'C-',scan(path,i,' ')));
end;
run;
Xia Keshan
Thanks...this is a bit above my level by just looking at it. I am going to have to run it and see what I get. Playing is how I learn things...
Xia,
This is AWESOME! The only thing that I need different is the last step... I probably confused things a bit above...
Instead of the output:
Want
FID-1 P-1 C-2 G1-3 G2-4 G3-5
I need the outout:
FID P C G1 G2 G3
1 1 2 3 4 5
In other words, I need the prefixes you concatenated onto the numbers to be the variable names.
Tampering with 's code below changes in bold should give you what you're after.
data temp(keep= path);
if _n_ eq 1 then
do;
length path _path $ 400;
if 0 then set check;
declare hash ch(hashexp:20,dataset:'check(where=(P is not missing))');
ch.definekey('FormID','P');
ch.definedone();
if 0 then set Forms;
declare hash ha(hashexp:20,dataset:'Forms(where=(child is not missing))',multidata:'Y');
ha.definekey('FormID','Parent');
ha.definedata('Child');
ha.definedone();
declare hash pa(ordered:'Y');
declare hiter hi_path('pa');
pa.definekey('count');
pa.definedata('path');
pa.definedone();
end;
set Forms(where=(child is not missing)) end=last;
count=1;
path=catx(' ',FormID,Parent,Child);
pa.add();
do while(hi_path.next()=0);
_path=path;
FormID=scan(path,1,' ');
P=scan(path,2,' ');
Parent=scan(path,-1,' ');
rc=ha.find();
if (rc ne 0 or (rc eq 0 and find(path,strip(Child)))) and ch.check() ne 0 then
output;
do while(rc=0);
if not find(path,strip(Child)) then
do;
count+1;
path=catx(' ',path,Child);
pa.add();
path=_path;
end;
rc=ha.find_next();
end;
end;
pa.clear();
retain max_levels;
max_levels=max(countw(path),max_levels);
if last then call symputx('max_levels',max(4,max_levels));
run;
data want (keep=FID P C G:);
set temp;
array el {*} 8. FID P C G1 - G%eval(&max_levels-3);
do i=1 to countw(path);
el=scan(path,i,' ');
end;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.