- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I need some programming advice with a repeated measures dataset in long format. There are some exposure variables (e.g., "arrested") that were only measured at baseline (i.e., survey=1) and appear as missing values at other timepoints (i.e., survey 2-5). I would like to use the baseline measurement across the other timepoints. What is the best way to do that?
Here's a sample of the current dataset:
caseid | survey | arrested |
101 | 1 | 1 |
101 | 2 | . |
101 | 3 | . |
101 | 4 | . |
101 | 5 | . |
201 | 1 | 0 |
201 | 2 | . |
201 | 3 | . |
301 | 1 | 1 |
301 | 2 | . |
301 | 3 | . |
301 | 4 | . |
401 | 1 | 0 |
401 | 2 | . |
501 | 1 | 0 |
And here is what I would like it to look like:
caseid | survey | arrested |
101 | 1 | 1 |
101 | 2 | 1 |
101 | 3 | 1 |
101 | 4 | 1 |
101 | 5 | 1 |
201 | 1 | 0 |
201 | 2 | 0 |
201 | 3 | 0 |
301 | 1 | 1 |
301 | 2 | 1 |
301 | 3 | 1 |
301 | 4 | 1 |
401 | 1 | 0 |
401 | 2 | 0 |
501 | 1 | 0 |
Other things to know about the dataset:
- cases (caseid) have different numbers of observations, ranging from 1-5
- this is one example with "arrested" as the variable that was only measured at baseline, but there are many other variables like this
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Please see if this gets you started. Please note that example data is best provided as a working data step. That way we do not need to make guesses about variable names or properties. It is a good idea to post code in a text or code box opened with either the </> or "running man" icons that appear above the message window. The message window will reformat pasted text and that often results in code that will not run.
data have; input caseid survey arrested; datalines; 101 1 1 101 2 . 101 3 . 101 4 . 101 5 . 201 1 0 201 2 . 201 3 . 301 1 1 301 2 . 301 3 . 301 4 . 401 1 0 401 2 . 501 1 0 ; data want; set have; by caseid; retain larrest; if first.caseid then call missing(larrest); if not missing(arrested) then larrest=arrested; arrested=coalesce(arrested,larrest); drop larrest; run;
Retained variables should have properties defined, especially if character before the Retain statement. Use a length statement to set retained character variables to the length of the target values it will hold.
Retain means that the value of the variable is kept across the data step iteration boundary.
The BY statement creates automatic variables First. and Last. that are 1/0 values that can be used to determine if the current observation is the first or last of the by group. The code above assumes all of your records are at least grouped by Caseid. If that is not the case then data likely needs to be sorted by the Caseid and likely the survey variable.
The example only resets one retained variable at the first of a by group. Call missing will work with multiple variables using a comma delimited list and is one of the few functions that allows mixing numeric and character variables.
The if not missing(arrested) then larrest=arrested; will set the Retained value to the most recent value in the data. You didn't show any where that occurs but may be the case in your other variables. You would want one statement similar for each of the other variables.
The Coalesce function, and the character equivalent Coalescec, is a way to assign values using the first value in the parameter list from left to right. So it allows skipping a check to assign Arrested the value of Larrest only when the current value is missing.
After you are sure the logic is working then drop the retained variables. Optionally , only assign the value to the retained variable at the First.caseid and then you have both the Arrested, with some missing values, and Larrest, with ONLY the value of the first observation for Caseid.
With your example either approach would work but the one shown may be more flexible if the value of the variable changes across time and you want to keep that going forward.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Many, many thanks @ballardw! For the solution and the tips on posting. I think the answer is yes, but will this code also replace values where the value is not missing--see jail variable below? And can you review the below code for running the data step with multiple variable? Thank you!
data have; input caseid survey arrested jail; datalines; 101 1 1 1 101 2 . 1 101 3 . . 101 4 . . 101 5 . 0 201 1 0 0 201 2 . 0 201 3 . . 301 1 1 1 301 2 . 0 301 3 . . 301 4 . 0 401 1 0 0 401 2 . . 501 1 0 0 ; data want; set have; by caseid; retain larrest; if first.caseid then call missing(larrest, ljail); if not missing(arrested) then larrest=arrested;
if not missing (jail) then ljail=jail;
arrested=coalesce(arrested,larrest);
jail = coalesce(jail,ljail); drop larrest ljail; run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You want to create additional variable such as Ljail on the retain. (my naming has an internal 'last arrest value' or 'last jail value' )
Retain statement will allow creating multiple variables with one statement. So
Retain larrest ljail;
Which is defaulting to numeric variables.
The code I used will NOT replace an existing value. If you want to do so that is the "alternate" approach where the first value is kept. But may require some additional description of the logic depending on the conditions, if any involved.
Maybe a slightly more complex example data set will help clarify what my code is doing. I am using some values below that are likely not valid for your actual data just so it is easier to see and added some records.
data have; input caseid survey arrested jail; datalines; 101 1 1 1 101 2 . . 101 3 . 4 101 4 . . 101 5 . . 101 6 . . 101 7 . 0 101 8 . . 201 1 0 1 201 2 . 2 201 3 . . 301 1 1 1 301 2 . 0 301 3 . . 301 4 . 0 401 1 0 0 401 2 . . 501 1 0 0 ; data want; set have; by caseid; retain larrest ljail; if first.caseid then call missing(larrest, ljail); if not missing(arrested) then larrest=arrested; if not missing (jail) then ljail=jail; arrested=coalesce(arrested,larrest); jail = coalesce(jail,ljail); drop larrest ljail; run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
data have;
input caseid survey arrested;
datalines;
101 1 1
101 2 .
101 3 .
101 4 .
101 5 .
201 1 0
201 2 .
201 3 .
301 1 1
301 2 .
301 3 .
301 4 .
401 1 0
401 2 .
501 1 0
;
data want;
update have(obs=0) have;
by caseid;
output;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks, @Ksharp! Can you explain how this code would work with multiple variables and if I wanted to replace a variable with the value from the first observation? See dataset below.
data have; input caseid survey arrested jail; datalines; 101 1 1 1 101 2 . . 101 3 . 4 101 4 . . 101 5 . . 201 1 0 1 201 2 . 2 201 3 . . 301 1 1 1 301 2 . 0 301 3 . . 301 4 . 0 401 1 0 0 401 2 . . 501 1 0 0 ; run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
My code would NOT replace it from the first obs, while would replace it from most near non-missing value .
If you want replace it from the first obs ,try this one :
data have;
input caseid survey arrested jail;
datalines;
101 1 1 1
101 2 . .
101 3 . 4
101 4 . .
101 5 . .
201 1 0 1
201 2 . 2
201 3 . .
301 1 1 1
301 2 . 0
301 3 . .
301 4 . 0
401 1 0 0
401 2 . .
501 1 0 0
;
run;
data want;
set have;
by caseid;
retain _arrested _jail;
if first.caseid then do;
_arrested=arrested; _jail=jail;
end;
else do;
arrested=coalesce(arrested,_arrested);
jail=coalesce(jail,_jail);
end;
drop _:;
run;