🔒 This topic is solved and locked.
Luis_Martins
Fluorite | Level 6

Greetings,

 

I have a data structure that looks like this:

 

DATA have ; 
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 1000
1 2 2 2000
1 2 3 3000
1 2 4 4000
1 2 5 5000
1 3 1 .
1 3 2 .
1 3 3 .
1 3 4 .
1 3 5 .
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 .
2 2 2 .
2 2 3 .
2 2 4 .
2 2 5 .
2 3 1 41000
2 3 2 39000
2 3 3 24000
2 3 4 32000
2 3 5 53000
;
RUN ;

So, we have family id, individual id, implicate number and imputed income for each implicate. 

What I need is to replicate the results of the first individual in each family (all five implicates) for the remaining individuals within each family, replacing whatever values those cells previously held, like this:

 

DATA want ; 
INPUT famid indid implicate imp_inc; 
CARDS ; 
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 40000
1 2 2 25000
1 2 3 34000
1 2 4 23555
1 2 5 49850
1 3 1 40000
1 3 2 25000
1 3 3 34000
1 3 4 23555
1 3 5 49850
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 40000
2 2 2 45000
2 2 3 50000
2 2 4 34000
2 2 5 23500
2 3 1 40000
2 3 2 45000
2 3 3 50000
2 3 4 34000
2 3 5 23500
;
RUN ;

In this example I'm trying to replicate only one variable but in my project I will have to do this for dozens of variables.

 

So far, I came up with this solution: 

%let implist_1=imp_inc;

%macro copyv1(list);
	%let nwords=%sysfunc(countw(&list));
	%do i=1 %to &nwords;
	%let varl=%scan(&list, &i);
		proc means data=have max noprint;
			var &varl;
			by famid implicate;
			where indid=1;
			OUTPUT OUT=copy max=max_&varl;	
		run;
		data want;
	 		set have;
			drop &varl;
		run;
		data want (drop=_TYPE_ _FREQ_);
			merge want copy;
			by famid implicate;
			rename max_&varl=&varl;
		run;
	%end;
%mend;
%copyv1(&implist_1);

This works well for one or two variables. However, it is tremendously slow once you run it for 400 variables on a data set of about 1.5 GB.

 

I'm pretty sure there is a faster way to do this with some form of PROC SQL or first.var processing, but I'm relatively new to SAS and so far I couldn't come up with a better solution.

 

Thank you very much for your support.

 

Best regards 

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

I think this will do the same, but reduce the memory consumption to the bare minimum:

data want (drop=rc);
set have;
by famid;
if first.famid
then do;
  declare hash h();
  h.definekey('implicate');
  h.definedata('imp_inc');
  h.definedone();
end;
if indid = 1
then rc = h.add();
else rc = h.find();
run;

 


15 REPLIES
PeterClemmensen
Tourmaline | Level 20

Do the 400 variables of interest have some common naming convention?

Luis_Martins
Fluorite | Level 6

Unfortunately, no. They may have very different and unrelated names.

PeterClemmensen
Tourmaline | Level 20

Ok. But the principle for all variables is the same: if they are missing, their value should be replaced by one of the first five values (first missing replaced by first value, etc.) for the given BY group, correct?

PeterClemmensen
Tourmaline | Level 20

I think this does what you want (if my assessments are correct), is reasonably fast, and should not consume too much memory.

 

I added three other variables with different values and no common naming conventions so you can easily edit the code according to your needs 🙂

 

data have; 
input famid indid implicate imp_inc; 
somevar=imp_inc*2;
othervar=imp_inc*3;
thirdvar=imp_inc*4;
cards ; 
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 1000
1 2 2 2000
1 2 3 3000
1 2 4 4000
1 2 5 5000
1 3 1 .
1 3 2 .
1 3 3 .
1 3 4 .
1 3 5 .
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 .
2 2 2 .
2 2 3 .
2 2 4 .
2 2 5 .
2 3 1 41000
2 3 2 39000
2 3 3 24000
2 3 4 32000
2 3 5 53000
run;

data want(drop=rc);
   if _N_=1 then do;
      declare hash h(dataset:'have(where=(indid=1))');
      h.definekey('famid', 'implicate');
      h.definedata('imp_inc', 'somevar', 'othervar', 'thirdvar');
      h.definedone();
   end;

   set have;
   by famid;
   
   if indid ne 1 then rc=h.find();
run;

 

Let me know if this does not meet your needs or if you need some clarification.

Kurt_Bremser
Super User

I think this will do the same, but reduce the memory consumption to the bare minimum:

data want (drop=rc);
set have;
by famid;
if first.famid
then do;
  declare hash h();
  h.definekey('implicate');
  h.definedata('imp_inc');
  h.definedone();
end;
if indid = 1
then rc = h.add();
else rc = h.find();
run;

 

PeterClemmensen
Tourmaline | Level 20

@Kurt_Bremser, I'm not entirely sure here, since I'm not an expert on hash objects 🙂 I agree that famid does not have to be a key variable.

 

However, I think a new instance of the hash object h for each by group is created with this approach, which would not reduce memory consumption.

 

Perhaps something like this?

 

 

data want(drop=rc);
   if _N_=1 then do;
      declare hash h();
      h.definekey('implicate');
      h.definedata('imp_inc');
      h.definedone();
   end;

   set have;
   by famid;

   if indid = 1 then rc = h.add();
   else rc = h.find();

   if last.famid then h.clear();
run;

 

Kurt_Bremser
Super User

@PeterClemmensen wrote:

@Kurt_Bremser, I'm not entirely sure here, since I'm not an expert on hash objects 🙂 I agree that famid does not have to be a key variable.

 

However, I think a new instance of the hash object h for each by group is created with this approach, which would not reduce memory consumption.

 

Perhaps something like this?

 

 

data want(drop=rc);
   if _N_=1 then do;
      declare hash h();
      h.definekey('implicate');
      h.definedata('imp_inc');
      h.definedone();
   end;

   set have;
   by famid;

   if indid = 1 then rc = h.add();
   else rc = h.find();

   if last.famid then h.clear();
run;

 


I think you are right. After re-reading the documentation for the declare statement, it "instantiates" the object; it may well be that the old object is "orphaned" and just keeps hanging around in memory limbo. Your solution prevents this and does the same re-initialization at group change that I wanted to achieve.

 

My main goal was to prevent the loading of a major part of the dataset into memory when only the data of the current group is needed.

If one positively knew that only 5 indid can exist for any given family, a temporary array could also be used to propagate the values. But the hash object is more flexible, and additional variables need only be added to the definedata() method, which can be neatly accomplished by reading them into a macro variable from dictionary.columns.
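The dictionary.columns idea could be sketched like this. A hedged sketch, not a tested solution: the WORK/HAVE names and the excluded key variables are assumptions based on the example data, so adjust them to your own library, dataset, and keys:

```sas
/* Build the definedata() argument list from dictionary.columns,
   excluding the key/id variables (assumed names). */
proc sql noprint;
   select quote(strip(name))
      into :datavars separated by ','
      from dictionary.columns
      where libname='WORK' and memname='HAVE'
        and upcase(name) not in ('FAMID','INDID','IMPLICATE');
quit;

data want(drop=rc);
   if _N_=1 then do;
      declare hash h();
      h.definekey('implicate');
      /* definedata() accepts a comma-separated list of quoted names,
         which the macro variable expands into */
      h.definedata(&datavars.);
      h.definedone();
   end;
   set have;
   by famid;
   if indid = 1 then rc = h.add();
   else rc = h.find();
   if last.famid then rc = h.clear();
run;
```

This way the data step itself never has to change when the list of 400 variables changes; only the WHERE clause of the PROC SQL step does.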

Kurt_Bremser
Super User

I did some tests with options fullstimer. @PeterClemmensen's improvement on my solution is the one to use, as it has the lowest memory consumption and is quicker than his original method.

My initial suggestion is faulty, as it is the first one to crash when the dataset is expanded.

Luis_Martins
Fluorite | Level 6

Works great thank you!

novinosrin
Tourmaline | Level 20

A temp array might be faster

 

DATA have ; 
INPUT famid indid implicate imp_inc; 
CARDS ; 
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 1000
1 2 2 2000
1 2 3 3000
1 2 4 4000
1 2 5 5000
1 3 1 .
1 3 2 .
1 3 3 .
1 3 4 .
1 3 5 .
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 .
2 2 2 .
2 2 3 .
2 2 4 .
2 2 5 .
2 3 1 41000
2 3 2 39000
2 3 3 24000
2 3 4 32000
2 3 5 53000
;
RUN ;

data want;
array t(999) _temporary_;
call missing(of t(*));
do until(last.famid);
  set have;
  by famid;
  if indid=1 then t(implicate)=imp_inc;
  else imp_inc=t(implicate);
  output;
end;
run;

novinosrin
Tourmaline | Level 20

Hash is OK here, but it can really get unwieldy. At least for your data, there is not much need to load a hash and look values up with the dot-notation methods; assuming your sample is a good representative, this is a very straightforward single-pass problem that doesn't really warrant an external data structure, in my opinion.

Kurt_Bremser
Super User

The temporary array proves to be the fastest and consumes the least memory, but I think expanding it to more variables will be less straightforward than the hash method: you would need a two-dimensional temporary array and a corresponding array of the existing variables, filled and scanned in a do loop.
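The two-dimensional extension could look roughly like this. A hedged sketch only: it assumes at most five implicates per individual and uses the variable names somevar and othervar from the earlier expanded example; substitute your own variable list:

```sas
data want;
   /* variables to propagate (assumed names from the earlier example) */
   array vars{*} imp_inc somevar othervar;
   /* rows = implicates, columns = variables; assumes <= 5 implicates */
   array t{5, 3} _temporary_;
   call missing(of t(*));
   do until(last.famid);
      set have;
      by famid;
      do j = 1 to dim(vars);
         if indid = 1 then t{implicate, j} = vars{j};
         else vars{j} = t{implicate, j};
      end;
      output;
   end;
   drop j;
run;
```

The inner do loop over dim(vars) is the part that replaces the single `t(implicate)=imp_inc` assignment, so the variable list only has to be maintained in the two array statements.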

Personally, I like the hash approach as its core part

if indid = 1
then rc = h.add();
else rc = h.find();
if last.famid then rc = h.clear();

will stay the same no matter how many variables need to be dealt with, and no assumptions need to be made (size of array) for the possible number of individuals in a family.

novinosrin
Tourmaline | Level 20

Sure, and you're right. However, for the OP's requirement here, I think the temp array alone should suffice. I am seriously assuming that the sample the OP presented is representative of the real data.

Kurt_Bremser
Super User

If you want to run the tests yourself:

data have;
do famid = 1 to 100000;
  do indid = 1 to 5;
    do implicate = 1 to 5;
      imp_inc = 10000; /* just some value */
      output;
    end;
  end;
end;
run;

options fullstimer;

data want (drop=rc);
set have;
by famid;
if _n_ = 1
then do;
  declare hash h();
  h.definekey('implicate');
  h.definedata('imp_inc');
  h.definedone();
end;
if indid = 1
then rc = h.add();
else rc = h.find();
if last.famid then rc = h.clear();
run;

data want(drop=rc);
   if _N_=1 then do;
      declare hash h(dataset:'have(where=(indid=1))');
      h.definekey('famid', 'implicate');
      h.definedata('imp_inc');
      h.definedone();
   end;

   set have;
   by famid;
   
   if indid ne 1 then rc=h.find();
run;

data want;
array t(999) _temporary_;
call missing(of t(*));
do until(last.famid);
  set have;
  by famid;
  if indid=1 then t(implicate)=imp_inc;
  else imp_inc=t(implicate);
  output;
end;
run;

If it doesn't show you decisive results, just increase the loop counts in the dataset creation.


Discussion stats
  • 15 replies
  • 1999 views
  • 3 likes
  • 4 in conversation