BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Luis_Martins
Fluorite | Level 6

Greetings,

 

I have a data structure that looks like this:

 

DATA have ; 
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 1000
1 2 2 2000
1 2 3 3000
1 2 4 4000
1 2 5 5000
1 3 1 .
1 3 2 .
1 3 3 .
1 3 4 .
1 3 5 .
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 .
2 2 2 .
2 2 3 .
2 2 4 .
2 2 5 .
2 3 1 41000
2 3 2 39000
2 3 3 24000
2 3 4 32000
2 3 5 53000
RUN ;

So, we have family id, individual id, implicate number and imputed income for each implicate. 

What i need is to replicate the results of  the first individual in each family (all of the five implicates) for the remaining individuals within each family, replacing whatever values we previously had on those cells, like this:

 

DATA want ; 
INPUT famid indid implicate imp_inc; 
CARDS ; 
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 40000
1 2 2 25000
1 2 3 34000
1 2 4 23555
1 2 5 49850
1 3 1 40000
1 3 2 25000
1 3 3 34000
1 3 4 23555
1 3 5 49850
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 40000
2 2 2 45000
2 2 3 50000
2 2 4 34000
2 2 5 23500
2 3 1 40000
2 3 2 45000
2 3 3 50000
2 3 4 34000
2 3 5 23500
RUN ;

In this example I'm trying to replicate only one variable but in my project I will have to do this for dozens of variables.

 

So far, I came up with this solution: 

%let implist_1=imp_inc;

%macro copyv1(list);
	%let nwords=%sysfunc(countw(&list));
	%do i=1 %to &nwords;
	%let varl=%scan(&list, &i);
		proc means data=have max noprint;
			var &varl;
			by famid implicate;
			where indid=1;
			OUTPUT OUT=copy max=max_&varl;	
		run;
		data want;
	 		set have;
			drop &varl;
		run;
		data want (drop=_TYPE_ _FREQ_);
			merge want copy;
			by famid implicate;
			rename max_&varl=&varl;
		run;
	%end;
%mend;
%copyv1(&imp_list1);

This works well for one or two variables. However it is tremendously slow once you do it for 400 variables in a data-set with the size of 1.5 GB.

 

I'm pretty sure there is a faster way to do this with some form of proc sql or first.var etc., but i'm relatively new to SAS and so far I couldn't come up with a better solution.

 

Thank you very much for your support.

 

Best regards 

 

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

I think this will do the same, but reduce the memory consumption to the bare minimum:

data want (drop=rc);
set have;
by famid;
if first.famid
then do;
  declare hash h();
  h.definekey('implicate');
  h.definedata('imp_inc');
  h.definedone();
end;
if indid = 1
then rc = h.add();
else rc = h.find();
run;

 

View solution in original post

15 REPLIES 15
PeterClemmensen
Tourmaline | Level 20

Do the 400 variables of interest have some common naming convention?

Luis_Martins
Fluorite | Level 6

Unfortunately, no. They may have very different and unrelated names.

PeterClemmensen
Tourmaline | Level 20

Ok. But the principle for all variables is the same: If they are missing, their value should be replaced by one of the first 5 values (first missing replaced by first value etc.) for the given by group, correct?

PeterClemmensen
Tourmaline | Level 20

I think this does what you want (if my assessments are correct), is reasonable fast and should not spend too much memory

 

I added three other variables with different values and no common naming conventions so you can easily edit the code according to your needs 🙂

 

data have; 
input famid indid implicate imp_inc; 
somevar=imp_inc*2;
othervar=imp_inc*3;
thirdvar=imp_inc*4;
cards ; 
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 1000
1 2 2 2000
1 2 3 3000
1 2 4 4000
1 2 5 5000
1 3 1 .
1 3 2 .
1 3 3 .
1 3 4 .
1 3 5 .
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 .
2 2 2 .
2 2 3 .
2 2 4 .
2 2 5 .
2 3 1 41000
2 3 2 39000
2 3 3 24000
2 3 4 32000
2 3 5 53000
run;

data want(drop=rc);
   if _N_=1 then do;
      declare hash h(dataset:'have(where=(indid=1))');
      h.definekey('famid', 'implicate');
      h.definedata('imp_inc', 'somevar', 'othervar', 'thirdvar');
      h.definedone();
   end;

   set have;
   by famid;
   
   if imp_inc ne 1 then rc=h.find();
run;

 

Let me know if this does not meet your needs or you need some clarification  

Kurt_Bremser
Super User

I think this will do the same, but reduce the memory consumption to the bare minimum:

data want (drop=rc);
set have;
by famid;
if first.famid
then do;
  declare hash h();
  h.definekey('implicate');
  h.definedata('imp_inc');
  h.definedone();
end;
if indid = 1
then rc = h.add();
else rc = h.find();
run;

 

PeterClemmensen
Tourmaline | Level 20

@Kurt_Bremser, I'm not entirely sure here, since I'm not an expert on hash objects 🙂 I agree that famid does not have to be a key variable.

 

However, I think a new instance of the hash object h for each by group is created with this approach, which would not reduce memory consumption.

 

Perhaps something like this?

 

 

data want(drop=rc);
   if _N_=1 then do;
      declare hash h();
      h.definekey('implicate');
      h.definedata('imp_inc');
      h.definedone();
   end;

   set have;
   by famid;

   if indid = 1 then rc = h.add();
   else rc = h.find();

   if last.famid then h.clear();
run;

 

Kurt_Bremser
Super User

@PeterClemmensen wrote:

@Kurt_Bremser, I'm not entirely sure here, since I'm not an expert on hash objects 🙂 I agree that famid does not have to be a key variable.

 

However, I think a new instance of the hash object h for each by group is created with this approach, which would not reduce memory consumption.

 

Perhaps something like this?

 

 

data want(drop=rc);
   if _N_=1 then do;
      declare hash h();
      h.definekey('implicate');
      h.definedata('imp_inc');
      h.definedone();
   end;

   set have;
   by famid;

   if indid = 1 then rc = h.add();
   else rc = h.find();

   if last.famid then h.clear();
run;

 


I think you are right. After re-reading the documentation for the declare statement, it "instantiates" the object; it can well be that the old object is "orphaned" and just keeps on hanging around in memory limbo. Your solution prevents this and does the same re-initialization at group change that I wanted to achieve.

 

My main goal was to prevent the loading of a major part of the dataset into memory when only the data of the current group is needed.

If one positively knew that only 5 indid can exist for any given family, a temporary array could also be used to propagate the values. But the hash object is more flexible, and additional variables need only be added to the definedata() method, which can be neatly accomplished by reading them into a macro variable from dictionary.columns.

Kurt_Bremser
Super User

I did some tests with options fullstimer. @PeterClemmensen's improvement on my solution is the one to use, as it has the lowest memory consumption and is quicker than his original method.

My initial suggestion is faulty, as it is the first one to crash when the dataset is expanded.

Luis_Martins
Fluorite | Level 6

Works great thank you!

novinosrin
Tourmaline | Level 20

A temp array might be faster

 

DATA have ; 
INPUT famid indid implicate imp_inc; 
CARDS ; 
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 1000
1 2 2 2000
1 2 3 3000
1 2 4 4000
1 2 5 5000
1 3 1 .
1 3 2 .
1 3 3 .
1 3 4 .
1 3 5 .
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 .
2 2 2 .
2 2 3 .
2 2 4 .
2 2 5 .
2 3 1 41000
2 3 2 39000
2 3 3 24000
2 3 4 32000
2 3 5 53000
;
RUN ;

data want;
array t(999) _temporary_;
call missing(of t(*));
do until(last.famid);
set have;
by famid;
if indid=1 then t(implicate)=imp_inc;
else imp_inc=t(implicate);
output;
end;
run;




novinosrin
Tourmaline | Level 20

Hash is OK here but can really get unwieldy. There is not that much need to load, look up using dot notation method at least for your data which assuming your sample is pretty good representative is a very straight forward simple one pass that doesn't really warrant external data structure requirement as such in my opinion. 

Kurt_Bremser
Super User

The temporaray array proves to be the fastest and consumes the least memory, but I think expanding it for more variables will be less straightforward than the hash method, as you would need to create a two-dimensional temporary array and a corresponding array for the existing variables that are filled and scanned in a do loop.

Personally, I like the hash approach as its core part

if indid = 1
then rc = h.add();
else rc = h.find();
if last.famid then rc = h.clear();

will stay the same no matter how many variables need to be dealt with, and no assumptions need to be made (size of array) for the possible number of individuals in a family.

novinosrin
Tourmaline | Level 20

Sure and right, however for OPs requirement exclusively here, I think temp array alone should suffice. This is something I am seriously assuming OP has presented the sample that is representative of what he/she has in real. 

Kurt_Bremser
Super User

If you want to run the tests yourself:

data have;
do famid = 1 to 100000;
  do indid = 1 to 5;
    do implicate = 1 to 5;
      imp_inc = 10000; /* just some value */
      output;
    end;
  end;
end;
run;

options fullstimer;

data want (drop=rc);
set have;
by famid;
if _n_ = 1
then do;
  declare hash h();
  h.definekey('implicate');
  h.definedata('imp_inc');
  h.definedone();
end;
if indid = 1
then rc = h.add();
else rc = h.find();
if last.famid then rc = h.clear();
run;

data want(drop=rc);
   if _N_=1 then do;
      declare hash h(dataset:'have(where=(indid=1))');
      h.definekey('famid', 'implicate');
      h.definedata('imp_inc');
      h.definedone();
   end;

   set have;
   by famid;
   
   if imp_inc ne 1 then rc=h.find();
run;

data want;
array t(999) _temporary_;
call missing(of t(*));
do until(last.famid);
  set have;
  by famid;
  if indid=1 then t(implicate)=imp_inc;
  else imp_inc=t(implicate);
  output;
end;
run;

If it doesn't show you decisive results, just increase the loop counts in the dataset creation.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 15 replies
  • 2351 views
  • 3 likes
  • 4 in conversation