BookmarkSubscribeRSS Feed
smital
Calcite | Level 5

I need some programming advice with a repeated measures dataset in long format. There are some exposure variables (e.g., "arrested") that were only measured at baseline (i.e., survey=1) and appear as missing values at other timepoints (i.e., survey 2-5). I would like to use the baseline measurement across the other timepoints. What is the best way to do that? 

Here's a sample of the current dataset:

caseidsurveyarrested
10111
1012.
1013.
1014.
1015.
20110
2012.
2013.
30111
3012.
3013.
3014.
40110
4012.
50110

 

And here is what I would like it to look like:

caseidsurveyarrested
10111
10121
10131
10141
10151
20110
20120
20130
30111
30121
30131
30141
40110
40120
50110

 

Other things to know about the dataset:

  • cases (caseid) have different numbers of observations, ranging from 1-5
  • this is one example with "arrested" as the variable that was only measured at baseline, but there are many other variables like this
6 REPLIES 6
ballardw
Super User

Please see if this gets you started. Please note that example data is best provided as a working data step. That way we do not need to make guesses about variable names or properties. It is a good idea to post code in a text or code box opened with either the </> or "running man" icons that appear above the message window. The message window will reformat pasted text and that often results in code that will not run.

 

data have;
input caseid survey arrested;
datalines;
101	1	1
101	2	.
101	3	.
101	4	.
101	5	.
201	1	0
201	2	.
201	3	.
301	1	1
301	2	.
301	3	.
301	4	.
401	1	0
401	2	.
501	1	0
;

data want;
  set have;
  by caseid;
  retain larrest;
  if first.caseid then call missing(larrest);
  if not missing(arrested) then larrest=arrested;
  arrested=coalesce(arrested,larrest);
  drop larrest;
run;

Retained variables should have properties defined, especially if character before the Retain statement. Use a length statement to set retained character variables to the length of the target values it will hold.

Retain means that the value of the variable is kept across the data step iteration boundary.

The BY statement creates automatic variables First. and Last. that are 1/0 values that can be used to determine if the current observation is the first or last of the by group. The code above assumes all of your records are at least grouped by Caseid. If that is not the case then data likely needs to be sorted by the Caseid and likely the survey variable.

The example only resets one retained variable at the first of a by group. Call missing will work with multiple variables using a comma delimited list and is one of the few functions that allows mixing numeric and character variables.

The if not missing(arrested) then larrest=arrested; will set the Retained value to the most recent value in the data. You didn't show any where that occurs but may be the case in your other variables. You would want one statement similar for each of the other variables.

The Coalesce function, and the character equivalent Coalescec, is a way to assign values using the first value in the parameter list from left to right. So it allows skipping a check to assign Arrested the value of Larrest only when the current value is missing.

After you are sure the logic is working then drop the retained variables. Optionally , only assign the value to the retained variable at the First.caseid and then you have both the Arrested, with some missing values, and Larrest, with ONLY the value of the first observation for Caseid.

With your example either approach would work but the one shown may be more flexible if the value of the variable changes across time and you want to keep that going forward.

smital
Calcite | Level 5

Many, many thanks @ballardw! For the solution and the tips on posting. I think the answer is yes, but will this code also replace values where the value is not missing--see jail variable below? And can you review the below code for running the data step with multiple variable? Thank you!

 

data have;
input caseid survey arrested jail;
datalines;
101	1	1  1 
101	2	.  1
101	3	.  .
101	4	.  .
101	5	.  0
201	1	0  0
201	2	.  0 
201	3	.  .
301	1	1  1
301	2	.  0
301	3	.  .
301	4	.  0
401	1	0  0
401	2	.  .
501	1	0  0
;

data want;
  set have;
  by caseid;
  retain larrest;
  if first.caseid then call missing(larrest, ljail);
  if not missing(arrested) then larrest=arrested;
if not missing (jail) then ljail=jail;
arrested=coalesce(arrested,larrest);
jail = coalesce(jail,ljail); drop larrest ljail; run;

 

ballardw
Super User

You want to create additional variable such as Ljail on the retain. (my naming has an internal 'last arrest value' or 'last jail value' )

Retain statement will allow creating multiple variables with one statement. So

Retain larrest ljail;

Which is defaulting to numeric variables.

 

The code I used will NOT replace an existing value. If you want to do so that is the "alternate" approach where the first value is kept. But may require some additional description of the logic depending on the conditions, if any involved.

Maybe a slightly more complex example data set will help clarify what my code is doing. I am using some values below that are likely not valid for your actual data just so it is easier to see and added some records.

data have;
input caseid survey arrested jail;
datalines;
101	1	1  1 
101	2	.  .
101	3	.  4
101	4	.  .
101	5	.  .
101	6	.  .
101	7	.  0
101	8	.  .
201	1	0  1
201	2	.  2 
201	3	.  .
301	1	1  1
301	2	.  0
301	3	.  .
301	4	.  0
401	1	0  0
401	2	.  .
501	1	0  0
;

data want;
  set have;
  by caseid;
  retain larrest ljail;
  if first.caseid then call missing(larrest, ljail);
  if not missing(arrested) then larrest=arrested;
  if not missing (jail) then ljail=jail;
  arrested=coalesce(arrested,larrest);
  jail = coalesce(jail,ljail);
  drop larrest ljail;
run;
Ksharp
Super User
data have;
input caseid survey arrested;
datalines;
101	1	1
101	2	.
101	3	.
101	4	.
101	5	.
201	1	0
201	2	.
201	3	.
301	1	1
301	2	.
301	3	.
301	4	.
401	1	0
401	2	.
501	1	0
;
data want;
 update have(obs=0) have;
 by caseid;
 output;
run;
smital
Calcite | Level 5

Thanks, @Ksharp! Can you explain how this code would work with multiple variables and if I wanted to replace a variable with the value from the first observation? See dataset below.

data have;
input caseid survey arrested jail;
datalines;
101 1 1 1 
101 2 . .
101 3 . 4
101 4 . .
101 5 . .
201 1 0 1
201 2 . 2 
201 3 . .
301 1 1 1
301 2 . 0
301 3 . .
301 4 . 0
401 1 0 0
401 2 . .
501 1 0 0
;
run;
Ksharp
Super User
" if I wanted to replace a variable with the value from the first observation?"
My code would NOT replace it from the first obs, while would replace it from most near non-missing value .

If you want replace it from the first obs ,try this one :

data have;
input caseid survey arrested jail;
datalines;
101 1 1 1
101 2 . .
101 3 . 4
101 4 . .
101 5 . .
201 1 0 1
201 2 . 2
201 3 . .
301 1 1 1
301 2 . 0
301 3 . .
301 4 . 0
401 1 0 0
401 2 . .
501 1 0 0
;
run;

data want;
set have;
by caseid;
retain _arrested _jail;
if first.caseid then do;
_arrested=arrested; _jail=jail;
end;
else do;
arrested=coalesce(arrested,_arrested);
jail=coalesce(jail,_jail);
end;
drop _:;
run;

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 1987 views
  • 1 like
  • 3 in conversation