Solved: Re: Fill down multiple rows

eabc0351 · Posted 09-17-2020 10:06 AM

Hi all. I have the following data:

data have;
input id day1 day2 day3 day4;
datalines;
1 -23 -22 -21 -20
1 -19 -18 -17 -16
1 -15 -14 -13 -12
1 -11 -10 -9 -8 
1 -7 -6 . . 
2 -20 -19 -18 -17 
2 -16 -15 -14 -13
2 -12 -11 -9 -8
2 -7 -6 -5 -4
2 -3 -2 -1 .  
;
run;

I would like the following data:

data want;
input id day1 day2 day3 day4;
datalines;
1 -23 -22 . .
1 -19 -18 -21 -20
1 -15 -14 -17 -16
1 -11 -10 -13 -12
1 -7 -6 -9 -8
2 -20 -19 -18 . 
2 -16 -15 -14 -17 
2 -12 -11 -9 -13
2 -7 -6 -5 -8
2 -3 -2 -1 -4
;
run;

Each ID has 5 observations only. I am looking to do this for all the variables in the table, not just the 4 that are listed here. Thank you for your assistance!

novinosrin · Posted 09-17-2020 10:35 AM


data have;
input id day1 day2 day3 day4;
datalines;
1 -23 -22 -21 -20
1 -19 -18 -17 -16
1 -15 -14 -13 -12
1 -11 -10 -9 -8 
1 -7 -6 . 5 
2 -20 -19 -18 -17 
2 -16 -15 -14 -13
2 -12 -11 -9 -8
2 -7 -6 -5 -4
2 -3 -2 -1 .  
;
run;


data want;
 do until(last.id);
  set have;
  by id;
  array d day:;
  array t(4) _temporary_;/*Expand the array subscript to the actual you have*/
 end;
 do over d;
  if d=. then t(_i_)=_i_;
 end;
 do until(last.id);
  set have;
  by id;
  do over d;
   d=ifn(_i_ in t ,lag(d),d);
   if _i_ in t and first.id then d=.;
  end;
  output;
 end;
 call missing(of t(*));
run;

EDITED: @eabc0351 To include "and not first.id"---> d=ifn(_i_ in t ,lag(d),d);
if _i_ in t and first.id then d=.; Please take notice!

View solution in original post

PaigeMiller · Posted 09-17-2020 10:15 AM

Please explain the logic that lets you go from data set HAVE to data set WANT.

--
Paige Miller

eabc0351 · Posted 09-17-2020 10:21 AM

If the values of any variables in the last.id are missing, then fill down (row 5 per id) from row above (row 4 per id). I have accomplished this with the following code:
data want;
update have (obs=0) have;
by id;
output;
run;

Now, I need to delete the values that filled down from row 4. Row 4 now contains the missing values. Then I would like to fill down (row 4) from row above (row 3). Then delete filled down values from row 3. Fill down from row 2. Delete row 2 filled down value. Then fill down from row 1. In the data want, row 1 contains the missing values.

ballardw · Posted 09-17-2020 10:20 AM

Why does ID 1 have two variables "moved" and ID 2 only 1 (apparently). Do different ID values have different rules? If so then you need to explain the rules for each ID.

For "all variables", what are the differing rules involving which variables for which ids?

eabc0351 · Posted 09-17-2020 10:33 AM

There are not different rules for each ID. I have 7 variables to indicate day (day1-day7). For the last.id, each person could have at least one missing and up to 7 missing values for these day variables. For example:
ID=1 might have missing values for day4-day7 in last.id. But ID=2 might have missing values for day2-day7.
Then there are corresponding varaibles to the days that describe peoples drug use on those days. For instance:
ID=1 would have a 1 or 0 for druguse1-druguse3 but missing values for druguse4-druguse7 in last.id, corresponding to the missing values for the days.
ID=2 the missing values in last.id would be druguse2-druguse7.
Because there are so many variables, I was hoping to fill down to row 5 from row above (row 4) if variables are missing using this code:
data want;
update have (obs=0) have;
by id;
output;
run;
Then I was thinking it may be possible to delete all the values that filled down from row 4. Then fill down from row 3 to rown 4, delete values from row that that went down... and so on. The missing values would end up in first.id instead of last.id.
If this logic will not work, then please let me know, and I can find another way.

novinosrin · Posted 09-17-2020 10:35 AM


data have;
input id day1 day2 day3 day4;
datalines;
1 -23 -22 -21 -20
1 -19 -18 -17 -16
1 -15 -14 -13 -12
1 -11 -10 -9 -8 
1 -7 -6 . 5 
2 -20 -19 -18 -17 
2 -16 -15 -14 -13
2 -12 -11 -9 -8
2 -7 -6 -5 -4
2 -3 -2 -1 .  
;
run;


data want;
 do until(last.id);
  set have;
  by id;
  array d day:;
  array t(4) _temporary_;/*Expand the array subscript to the actual you have*/
 end;
 do over d;
  if d=. then t(_i_)=_i_;
 end;
 do until(last.id);
  set have;
  by id;
  do over d;
   d=ifn(_i_ in t ,lag(d),d);
   if _i_ in t and first.id then d=.;
  end;
  output;
 end;
 call missing(of t(*));
run;

EDITED: @eabc0351 To include "and not first.id"---> d=ifn(_i_ in t ,lag(d),d);
if _i_ in t and first.id then d=.; Please take notice!

eabc0351 · Posted 09-17-2020 10:47 AM

@novinosrin, thank you. That worked for the day variable. I would like to expand this code for all the drug variables now. For example, I have the following groups in addition to day1-day7:
alc1-alc7
can1-can7
coc1-coc7
her1-her7... and so on. There are about 20 groups of these drug varaibles.
I would like to do the same thing we did for day for the remainder of the variables.
How would I expand this array to incorporate all of those groups?

novinosrin · Posted 09-17-2020 11:01 AM

Hi @eabc0351 So you have 20 groups with 7 variables in each group..Is that correct?

If yes, you have 140 variables in total.

Group the 140 variables carefully(for the array index to evaluate) left to right in one array like-;

array day1-day7 alc1-alc7; etc

and initialize the other temporary array with 140 elements like

 array t(140) _temporary_;

Please try the above suggestion. I am sure you can whilst we folks are lazy to type;

eabc0351 · Posted 09-17-2020 11:05 AM

It worked. Thanks so much!

mkeintz · Posted 09-17-2020 11:06 AM

I'll attribute my misinterpretation of the objective to ~~laziness~~ forgetting the nature of asynchronous communication - I only saw the initial topic introduction and made the wrong assumption. While I was multitasking and generating the response (struck out below), clarification happened and was neglected by me.

You already have a great solution by @novinosrin , but in the spirit of my earlier reply here is a corrected approach in the same style as my first. It's main difference from the accepted solution is that it uses a single SET statement with HAVE repeated as the target, vs having two SETs each targetting HAVE once. You can see the structural differences that implies:

data have;
input id day1 day2 day3 day4;
datalines;
1 -23 -22 -21 -20
1 -19 -18 -17 -16
1 -15 -14 -13 -12
1 -11 -10 -9 -8
1 -7 -6 . .
2 -20 -19 -18 -17
2 -16 -15 -14 -13
2 -12 -11 -9 -8
2 -7 -6 -5 -4
2 -3 -2 -1 .
run;

data want (drop=_:);
  set have (in=firstpass) 
      have (in=secondpass);
  by id;

  array d _numeric_;
  array  _endmiss    _endmiss1-_endmiss5;
  retain _endmiss:   _start_secondpass;

  _start_secondpass=dif(secondpass);
  do over d;
    if firstpass=1 then _endmiss=ifn(d=.,1,0);
	d=ifn(secondpass and _endmiss=1,lag(d),d);
	if _start_secondpass=1 and _endmiss=1 then d=.;
  end;
  if secondpass;
run;

The SET with two HAVEs with a BY statement reads each ID group twice. Use the first pass to establish the _ENDMISS flags.

The DIF function is X-lag(X), so _start_secondpass=1 only when the second pass of each id group starts. The allows lagged responses to be replace by missing at the start of each ID group output.

Make sure the ARRAY statement for _ENDMISS is long enough to accommodate all the variables.

Erroneous response struct out below:

~~You want to carry-forward by one observation a subset of variables, within ID groups. The tools for this are the array statement and the lag function:~~

data have;
input id day1 day2 day3 day4;
datalines;
1 -23 -22 -21 -20
1 -19 -18 -17 -16
1 -15 -14 -13 -12
1 -11 -10 -9 -8 
1 -7 -6 . . 
2 -20 -19 -18 -17 
2 -16 -15 -14 -13
2 -12 -11 -9 -8
2 -7 -6 -5 -4
2 -3 -2 -1 .  
run;

data want;
  set have;
  by id;
  array d day3-day4;
  do over d;
    d=ifn(first.id,.,lag(d));
  end;
run;

~~If all the carry-forward variables are numeric, just include them in the array statement.~~

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

eabc0351 · Posted 09-17-2020 03:58 PM

@mkeintz. Thank you for reviewing that earlier response and correcting it. I will give this solution a try as well. Thanks so much!

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!