Hi all. I have the following data:
data have;
input id day1 day2 day3 day4;
datalines;
1 -23 -22 -21 -20
1 -19 -18 -17 -16
1 -15 -14 -13 -12
1 -11 -10 -9 -8
1 -7 -6 . .
2 -20 -19 -18 -17
2 -16 -15 -14 -13
2 -12 -11 -9 -8
2 -7 -6 -5 -4
2 -3 -2 -1 .
;
run;
I would like the following data:
data want;
input id day1 day2 day3 day4;
datalines;
1 -23 -22 . .
1 -19 -18 -21 -20
1 -15 -14 -17 -16
1 -11 -10 -13 -12
1 -7 -6 -9 -8
2 -20 -19 -18 .
2 -16 -15 -14 -17
2 -12 -11 -9 -13
2 -7 -6 -5 -8
2 -3 -2 -1 -4
;
run;
Each ID has 5 observations only. I am looking to do this for all the variables in the table, not just the 4 that are listed here. Thank you for your assistance!
data have;
input id day1 day2 day3 day4;
datalines;
1 -23 -22 -21 -20
1 -19 -18 -17 -16
1 -15 -14 -13 -12
1 -11 -10 -9 -8
1 -7 -6 . 5
2 -20 -19 -18 -17
2 -16 -15 -14 -13
2 -12 -11 -9 -8
2 -7 -6 -5 -4
2 -3 -2 -1 .
;
run;
data want;
do until(last.id);
set have;
by id;
array d day:;
array t(4) _temporary_;/*Expand the array subscript to the actual you have*/
end;
do over d;
if d=. then t(_i_)=_i_;
end;
do until(last.id);
set have;
by id;
do over d;
d=ifn(_i_ in t ,lag(d),d);
if _i_ in t and first.id then d=.;
end;
output;
end;
call missing(of t(*));
run;
EDITED: @eabc0351 To include "and not first.id"---> d=ifn(_i_ in t ,lag(d),d);
if _i_ in t and first.id then d=.; Please take notice!
Please explain the logic that lets you go from data set HAVE to data set WANT.
Why does ID 1 have two variables "moved" and ID 2 only 1 (apparently). Do different ID values have different rules? If so then you need to explain the rules for each ID.
For "all variables", what are the differing rules involving which variables for which ids?
data have;
input id day1 day2 day3 day4;
datalines;
1 -23 -22 -21 -20
1 -19 -18 -17 -16
1 -15 -14 -13 -12
1 -11 -10 -9 -8
1 -7 -6 . 5
2 -20 -19 -18 -17
2 -16 -15 -14 -13
2 -12 -11 -9 -8
2 -7 -6 -5 -4
2 -3 -2 -1 .
;
run;
data want;
do until(last.id);
set have;
by id;
array d day:;
array t(4) _temporary_;/*Expand the array subscript to the actual you have*/
end;
do over d;
if d=. then t(_i_)=_i_;
end;
do until(last.id);
set have;
by id;
do over d;
d=ifn(_i_ in t ,lag(d),d);
if _i_ in t and first.id then d=.;
end;
output;
end;
call missing(of t(*));
run;
EDITED: @eabc0351 To include "and not first.id"---> d=ifn(_i_ in t ,lag(d),d);
if _i_ in t and first.id then d=.; Please take notice!
Hi @eabc0351 So you have 20 groups with 7 variables in each group..Is that correct?
If yes, you have 140 variables in total.
Group the 140 variables carefully(for the array index to evaluate) left to right in one array like-;
array day1-day7 alc1-alc7; etc
and initialize the other temporary array with 140 elements like
array t(140) _temporary_;
Please try the above suggestion. I am sure you can whilst we folks are lazy to type;
I'll attribute my misinterpretation of the objective to laziness forgetting the nature of asynchronous communication - I only saw the initial topic introduction and made the wrong assumption. While I was multitasking and generating the response (struck out below), clarification happened and was neglected by me.
You already have a great solution by @novinosrin , but in the spirit of my earlier reply here is a corrected approach in the same style as my first. It's main difference from the accepted solution is that it uses a single SET statement with HAVE repeated as the target, vs having two SETs each targetting HAVE once. You can see the structural differences that implies:
data have;
input id day1 day2 day3 day4;
datalines;
1 -23 -22 -21 -20
1 -19 -18 -17 -16
1 -15 -14 -13 -12
1 -11 -10 -9 -8
1 -7 -6 . .
2 -20 -19 -18 -17
2 -16 -15 -14 -13
2 -12 -11 -9 -8
2 -7 -6 -5 -4
2 -3 -2 -1 .
run;
data want (drop=_:);
set have (in=firstpass)
have (in=secondpass);
by id;
array d _numeric_;
array _endmiss _endmiss1-_endmiss5;
retain _endmiss: _start_secondpass;
_start_secondpass=dif(secondpass);
do over d;
if firstpass=1 then _endmiss=ifn(d=.,1,0);
d=ifn(secondpass and _endmiss=1,lag(d),d);
if _start_secondpass=1 and _endmiss=1 then d=.;
end;
if secondpass;
run;
The SET with two HAVEs with a BY statement reads each ID group twice. Use the first pass to establish the _ENDMISS flags.
The DIF function is X-lag(X), so _start_secondpass=1 only when the second pass of each id group starts. The allows lagged responses to be replace by missing at the start of each ID group output.
Make sure the ARRAY statement for _ENDMISS is long enough to accommodate all the variables.
Erroneous response struct out below:
You want to carry-forward by one observation a subset of variables, within ID groups. The tools for this are the array statement and the lag function:
data have; input id day1 day2 day3 day4; datalines; 1 -23 -22 -21 -20 1 -19 -18 -17 -16 1 -15 -14 -13 -12 1 -11 -10 -9 -8 1 -7 -6 . . 2 -20 -19 -18 -17 2 -16 -15 -14 -13 2 -12 -11 -9 -8 2 -7 -6 -5 -4 2 -3 -2 -1 . run; data want; set have; by id; array d day3-day4; do over d; d=ifn(first.id,.,lag(d)); end; run;
If all the carry-forward variables are numeric, just include them in the array statement.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.