Greetings,
I have a data structure that looks like this:
DATA have ;
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 1000
1 2 2 2000
1 2 3 3000
1 2 4 4000
1 2 5 5000
1 3 1 .
1 3 2 .
1 3 3 .
1 3 4 .
1 3 5 .
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 .
2 2 2 .
2 2 3 .
2 2 4 .
2 2 5 .
2 3 1 41000
2 3 2 39000
2 3 3 24000
2 3 4 32000
2 3 5 53000
RUN ;
So, we have family id, individual id, implicate number and imputed income for each implicate.
What I need is to replicate the values of the first individual in each family (all five implicates) for the remaining individuals within that family, overwriting whatever values were previously in those cells, like this:
DATA want ;
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 40000
1 2 2 25000
1 2 3 34000
1 2 4 23555
1 2 5 49850
1 3 1 40000
1 3 2 25000
1 3 3 34000
1 3 4 23555
1 3 5 49850
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 40000
2 2 2 45000
2 2 3 50000
2 2 4 34000
2 2 5 23500
2 3 1 40000
2 3 2 45000
2 3 3 50000
2 3 4 34000
2 3 5 23500
RUN ;
In this example I'm trying to replicate only one variable, but in my project I will have to do this for dozens of variables.
So far, I have come up with this solution:
%let implist_1=imp_inc;

%macro copyv1(list);
%let nwords=%sysfunc(countw(&list));
/* start from a copy of have so each pass builds on the previous one */
data want;
    set have;
run;
%do i=1 %to &nwords;
    %let varl=%scan(&list, &i);

    /* pick up individual 1's value per famid/implicate */
    proc means data=have max noprint;
        var &varl;
        by famid implicate;
        where indid=1;
        output out=copy max=max_&varl;
    run;

    data want;
        set want;
        drop &varl;
    run;

    /* merge it back onto every individual in the family */
    data want (drop=_TYPE_ _FREQ_);
        merge want copy;
        by famid implicate;
        rename max_&varl=&varl;
    run;
%end;
%mend;

%copyv1(&implist_1);
This works well for one or two variables. However, it is tremendously slow once you run it for 400 variables on a dataset of about 1.5 GB.
I'm pretty sure there is a faster way to do this with some form of PROC SQL or first.variable processing, but I'm relatively new to SAS and so far I couldn't come up with a better solution.
Thank you very much for your support.
Best regards
I think this will do the same, but reduce the memory consumption to the bare minimum:
data want (drop=rc);
set have;
by famid;
if first.famid then do;
    declare hash h();
    h.definekey('implicate');
    h.definedata('imp_inc');
    h.definedone();
end;
if indid = 1 then rc = h.add();
else rc = h.find();
run;
Do the 400 variables of interest have some common naming convention?
Unfortunately, no. They may have very different and unrelated names.
OK. But the principle for all variables is the same: if they are missing, their value should be replaced by the corresponding one of the first five values (first missing replaced by the first value, etc.) within the given BY group, correct?
I think this does what you want (if my assessments are correct), is reasonably fast, and should not use too much memory.
I added three other variables with different values and no common naming convention so you can easily adapt the code to your needs 🙂
data have;
input famid indid implicate imp_inc;
somevar=imp_inc*2;
othervar=imp_inc*3;
thirdvar=imp_inc*4;
cards ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 1000
1 2 2 2000
1 2 3 3000
1 2 4 4000
1 2 5 5000
1 3 1 .
1 3 2 .
1 3 3 .
1 3 4 .
1 3 5 .
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 .
2 2 2 .
2 2 3 .
2 2 4 .
2 2 5 .
2 3 1 41000
2 3 2 39000
2 3 3 24000
2 3 4 32000
2 3 5 53000
run;
data want(drop=rc);
if _N_=1 then do;
    /* load individual 1's rows for all families up front */
    declare hash h(dataset:'have(where=(indid=1))');
    h.definekey('famid', 'implicate');
    h.definedata('imp_inc', 'somevar', 'othervar', 'thirdvar');
    h.definedone();
end;
set have;
by famid;
if indid ne 1 then rc=h.find();
run;
Let me know if this does not meet your needs or if you need some clarification.
@Kurt_Bremser, I'm not entirely sure here, since I'm not an expert on hash objects 🙂 I agree that famid does not have to be a key variable.
However, I think a new instance of the hash object h for each by group is created with this approach, which would not reduce memory consumption.
Perhaps something like this?
data want(drop=rc);
if _N_=1 then do;
    declare hash h();
    h.definekey('implicate');
    h.definedata('imp_inc');
    h.definedone();
end;
set have;
by famid;
if indid = 1 then rc = h.add();
else rc = h.find();
/* empty the hash at the end of each family */
if last.famid then h.clear();
run;
@PeterClemmensen I think you are right. After re-reading the documentation for the DECLARE statement, which "instantiates" the object, it may well be that the old object is "orphaned" and just keeps hanging around in memory limbo. Your solution prevents this and does the same re-initialization at the group change that I wanted to achieve.
My main goal was to prevent the loading of a major part of the dataset into memory when only the data of the current group is needed.
If one knew for certain that only 5 indid can exist for any given family, a temporary array could also be used to propagate the values. But the hash object is more flexible: additional variables only need to be added to the definedata() method, which can be neatly accomplished by reading them into a macro variable from dictionary.columns, as sketched below.
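For illustration, here is a minimal sketch of that dictionary.columns trick, assuming the work.have layout from above and that every variable except famid, indid, and implicate should be propagated (adjust the WHERE clause to your own selection logic):

proc sql noprint;
    /* collect all variables to copy as a quoted, comma-separated list */
    select cats("'", name, "'") into :datalist separated by ','
    from dictionary.columns
    where libname='WORK' and memname='HAVE'
      and upcase(name) not in ('FAMID','INDID','IMPLICATE');
quit;

data want(drop=rc);
if _N_=1 then do;
    declare hash h();
    h.definekey('implicate');
    h.definedata(&datalist.);  /* expands to 'imp_inc','somevar',... */
    h.definedone();
end;
set have;
by famid;
if indid = 1 then rc = h.add();
else rc = h.find();
if last.famid then h.clear();
run;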
I did some tests with options fullstimer. @PeterClemmensen's improvement on my solution is the one to use, as it has the lowest memory consumption and is quicker than his original method.
My initial suggestion is faulty, as it is the first one to crash when the dataset is expanded.
Works great, thank you!
A temp array might be faster:
DATA have ;
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 1000
1 2 2 2000
1 2 3 3000
1 2 4 4000
1 2 5 5000
1 3 1 .
1 3 2 .
1 3 3 .
1 3 4 .
1 3 5 .
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 .
2 2 2 .
2 2 3 .
2 2 4 .
2 2 5 .
2 3 1 41000
2 3 2 39000
2 3 3 24000
2 3 4 32000
2 3 5 53000
;
RUN ;
data want;
/* one slot per implicate; assumes implicate values never exceed 999 */
array t(999) _temporary_;
call missing(of t(*));
do until(last.famid);
    set have;
    by famid;
    if indid=1 then t(implicate)=imp_inc;  /* remember individual 1's value */
    else imp_inc=t(implicate);             /* overwrite everyone else's */
    output;
end;
run;
Hash is OK here, but it can really get unwieldy. For data like yours, assuming your sample is a good representation of the real thing, there is not much need to load a hash and look values up with the dot-notation methods; a straightforward single pass does the job without requiring an external data structure, in my opinion.
The temporary array proves to be the fastest and consumes the least memory, but I think expanding it to more variables will be less straightforward than the hash method, as you would need a two-dimensional temporary array plus a corresponding array of the existing variables, filled and scanned in a DO loop, as sketched below.
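For illustration, a minimal sketch of that two-dimensional variant, assuming at most 999 implicates and reusing the four variables (imp_inc, somevar, othervar, thirdvar) from the example above:

data want;
/* parallel structures: one array row per implicate, one column per variable */
array vars(*) imp_inc somevar othervar thirdvar;
array t(999, 4) _temporary_;   /* second dimension = number of variables */
call missing(of t(*));
do until(last.famid);
    set have;
    by famid;
    do j=1 to dim(vars);
        if indid=1 then t(implicate, j)=vars(j);
        else vars(j)=t(implicate, j);
    end;
    output;
end;
drop j;
run;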
Personally, I like the hash approach, as its core part
if indid = 1
then rc = h.add();
else rc = h.find();
if last.famid then rc = h.clear();
will stay the same no matter how many variables need to be dealt with, and no assumptions need to be made (size of array) for the possible number of individuals in a family.
Sure, and you're right. However, for the OP's requirement here, I think the temp array alone should suffice. I am of course assuming that the sample the OP presented is representative of the real data.
If you want to run the tests yourself:
data have;
do famid = 1 to 100000;
    do indid = 1 to 5;
        do implicate = 1 to 5;
            imp_inc = 10000; /* just some value */
            output;
        end;
    end;
end;
run;
options fullstimer;
data want (drop=rc);
set have;
by famid;
if _n_ = 1 then do;
    declare hash h();
    h.definekey('implicate');
    h.definedata('imp_inc');
    h.definedone();
end;
if indid = 1 then rc = h.add();
else rc = h.find();
if last.famid then rc = h.clear();
run;
data want(drop=rc);
if _N_=1 then do;
    declare hash h(dataset:'have(where=(indid=1))');
    h.definekey('famid', 'implicate');
    h.definedata('imp_inc');
    h.definedone();
end;
set have;
by famid;
if indid ne 1 then rc=h.find();
run;
data want;
array t(999) _temporary_;
call missing(of t(*));
do until(last.famid);
    set have;
    by famid;
    if indid=1 then t(implicate)=imp_inc;
    else imp_inc=t(implicate);
    output;
end;
run;
If it doesn't show you decisive results, just increase the loop counts in the dataset creation.