BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
eabc0351
Quartz | Level 8

Hi all. I have the following data:

data have;
input id day1 day2 day3 day4;
datalines;
1 -23 -22 -21 -20
1 -19 -18 -17 -16
1 -15 -14 -13 -12
1 -11 -10 -9 -8 
1 -7 -6 . . 
2 -20 -19 -18 -17 
2 -16 -15 -14 -13
2 -12 -11 -9 -8
2 -7 -6 -5 -4
2 -3 -2 -1 .  
;
run;

I would like the following data:

data want;
input id day1 day2 day3 day4;
datalines;
1 -23 -22 . .
1 -19 -18 -21 -20
1 -15 -14 -17 -16
1 -11 -10 -13 -12
1 -7 -6 -9 -8
2 -20 -19 -18 . 
2 -16 -15 -14 -17 
2 -12 -11 -9 -13
2 -7 -6 -5 -8
2 -3 -2 -1 -4
;
run;

Each ID has 5 observations only. I am looking to do this for all the variables in the table, not just the 4 that are listed here. Thank you for your assistance!

1 ACCEPTED SOLUTION

Accepted Solutions
novinosrin
Tourmaline | Level 20

data have;
input id day1 day2 day3 day4;
datalines;
1 -23 -22 -21 -20
1 -19 -18 -17 -16
1 -15 -14 -13 -12
1 -11 -10 -9 -8 
1 -7 -6 . 5 
2 -20 -19 -18 -17 
2 -16 -15 -14 -13
2 -12 -11 -9 -8
2 -7 -6 -5 -4
2 -3 -2 -1 .  
;
run;


data want;
 do until(last.id);
  set have;
  by id;
  array d day:;
  array t(4) _temporary_;/*Expand the array subscript to the actual you have*/
 end;
 do over d;
  if d=. then t(_i_)=_i_;
 end;
 do until(last.id);
  set have;
  by id;
  do over d;
   d=ifn(_i_ in t ,lag(d),d);
   if _i_ in t and first.id then d=.;
  end;
  output;
 end;
 call missing(of t(*));
run;

 EDITED: @eabc0351  To include "and not first.id"---> d=ifn(_i_ in t ,lag(d),d);
if _i_ in t and first.id then d=.;
  Please take notice!

View solution in original post

10 REPLIES 10
PaigeMiller
Diamond | Level 26

Please explain the logic that lets you go from data set HAVE to data set WANT.

--
Paige Miller
eabc0351
Quartz | Level 8
If the values of any variables in the last.id are missing, then fill down (row 5 per id) from row above (row 4 per id). I have accomplished this with the following code:
data want;
update have (obs=0) have;
by id;
output;
run;

Now, I need to delete the values that filled down from row 4. Row 4 now contains the missing values. Then I would like to fill down (row 4) from row above (row 3). Then delete filled down values from row 3. Fill down from row 2. Delete row 2 filled down value. Then fill down from row 1. In the data want, row 1 contains the missing values.
ballardw
Super User

Why does ID 1 have two variables "moved" and ID 2 only 1 (apparently). Do different ID values have different rules? If so then you need to explain the rules for each ID.

 

For "all variables", what are the differing rules involving which variables for which ids?

eabc0351
Quartz | Level 8
There are not different rules for each ID. I have 7 variables to indicate day (day1-day7). For the last.id, each person could have at least one missing and up to 7 missing values for these day variables. For example:
ID=1 might have missing values for day4-day7 in last.id. But ID=2 might have missing values for day2-day7.
Then there are corresponding varaibles to the days that describe peoples drug use on those days. For instance:
ID=1 would have a 1 or 0 for druguse1-druguse3 but missing values for druguse4-druguse7 in last.id, corresponding to the missing values for the days.
ID=2 the missing values in last.id would be druguse2-druguse7.
Because there are so many variables, I was hoping to fill down to row 5 from row above (row 4) if variables are missing using this code:
data want;
update have (obs=0) have;
by id;
output;
run;
Then I was thinking it may be possible to delete all the values that filled down from row 4. Then fill down from row 3 to rown 4, delete values from row that that went down... and so on. The missing values would end up in first.id instead of last.id.
If this logic will not work, then please let me know, and I can find another way.
novinosrin
Tourmaline | Level 20

data have;
input id day1 day2 day3 day4;
datalines;
1 -23 -22 -21 -20
1 -19 -18 -17 -16
1 -15 -14 -13 -12
1 -11 -10 -9 -8 
1 -7 -6 . 5 
2 -20 -19 -18 -17 
2 -16 -15 -14 -13
2 -12 -11 -9 -8
2 -7 -6 -5 -4
2 -3 -2 -1 .  
;
run;


data want;
 do until(last.id);
  set have;
  by id;
  array d day:;
  array t(4) _temporary_;/*Expand the array subscript to the actual you have*/
 end;
 do over d;
  if d=. then t(_i_)=_i_;
 end;
 do until(last.id);
  set have;
  by id;
  do over d;
   d=ifn(_i_ in t ,lag(d),d);
   if _i_ in t and first.id then d=.;
  end;
  output;
 end;
 call missing(of t(*));
run;

 EDITED: @eabc0351  To include "and not first.id"---> d=ifn(_i_ in t ,lag(d),d);
if _i_ in t and first.id then d=.;
  Please take notice!

eabc0351
Quartz | Level 8
@novinosrin, thank you. That worked for the day variable. I would like to expand this code for all the drug variables now. For example, I have the following groups in addition to day1-day7:
alc1-alc7
can1-can7
coc1-coc7
her1-her7... and so on. There are about 20 groups of these drug varaibles.
I would like to do the same thing we did for day for the remainder of the variables.
How would I expand this array to incorporate all of those groups?
novinosrin
Tourmaline | Level 20

Hi @eabc0351  So you have 20 groups with 7 variables in each group..Is that correct?

 

If yes, you have 140 variables in total.

Group the 140 variables carefully(for the array index to evaluate) left to right in one array like-;

 

array day1-day7 alc1-alc7; etc 

 

and initialize  the other temporary array with 140 elements like

 

 array t(140) _temporary_;

 

Please try the above suggestion. I am sure you can whilst we folks are lazy to type;

eabc0351
Quartz | Level 8
It worked. Thanks so much!
mkeintz
PROC Star

I'll attribute my misinterpretation of the objective to laziness forgetting the nature of asynchronous communication -  I only saw the initial topic introduction and made the wrong assumption. While I was multitasking and generating the response (struck out below), clarification happened and was neglected by me.

 

You already have a great solution by @novinosrin , but in the spirit of my earlier reply here is a corrected approach in the same style as my first.  It's main difference from the accepted solution is that it uses a single SET statement with HAVE repeated as the target, vs having two SETs each targetting HAVE once.  You can see the structural differences that implies:

 

data have;
input id day1 day2 day3 day4;
datalines;
1 -23 -22 -21 -20
1 -19 -18 -17 -16
1 -15 -14 -13 -12
1 -11 -10 -9 -8
1 -7 -6 . .
2 -20 -19 -18 -17
2 -16 -15 -14 -13
2 -12 -11 -9 -8
2 -7 -6 -5 -4
2 -3 -2 -1 .
run;

data want (drop=_:);
  set have (in=firstpass) 
      have (in=secondpass);
  by id;

  array d _numeric_;
  array  _endmiss    _endmiss1-_endmiss5;
  retain _endmiss:   _start_secondpass;

  _start_secondpass=dif(secondpass);
  do over d;
    if firstpass=1 then _endmiss=ifn(d=.,1,0);
	d=ifn(secondpass and _endmiss=1,lag(d),d);
	if _start_secondpass=1 and _endmiss=1 then d=.;
  end;
  if secondpass;
run;

The SET with two HAVEs with a BY statement reads each ID group twice.  Use the first pass to establish the _ENDMISS flags.

 

The DIF function is X-lag(X), so _start_secondpass=1 only when the second pass of each id group starts.  The allows lagged responses to be replace by missing at the start of each ID group output.

 

Make sure the ARRAY statement for _ENDMISS is long enough to accommodate all the variables.

 

 

Erroneous response struct out below:

 

You want to carry-forward by one observation a subset of variables, within ID groups.  The tools for this are the array statement and the lag function:

 

data have;
input id day1 day2 day3 day4;
datalines;
1 -23 -22 -21 -20
1 -19 -18 -17 -16
1 -15 -14 -13 -12
1 -11 -10 -9 -8 
1 -7 -6 . . 
2 -20 -19 -18 -17 
2 -16 -15 -14 -13
2 -12 -11 -9 -8
2 -7 -6 -5 -4
2 -3 -2 -1 .  
run;

data want;
  set have;
  by id;
  array d day3-day4;
  do over d;
    d=ifn(first.id,.,lag(d));
  end;
run;

If all the carry-forward variables are numeric, just include them in the array statement.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
eabc0351
Quartz | Level 8
@mkeintz. Thank you for reviewing that earlier response and correcting it. I will give this solution a try as well. Thanks so much!

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 1477 views
  • 3 likes
  • 5 in conversation