Hello,
I am working with a long form dataset, meaning that there are multiple observations per participant (because of multiple visits). I was trying to think of code that will allow me to include the first non missing value for each variable. For example, I want to get the first non missing value for for marital satisfaction and number of children seen per month. If the data was missing for the first visit, I would want the data from the second visit, if not then third visit and so on. Any suggestions for the most efficient way to go about doing this? Thank you in advance for all your help!
Sure. For example, say this was my dataset:
Study ID Visit # Marriage Satisfaction Age Have Close Friends?
104 1 Missing 8 Missing
104 2 Yes 9 Yes
104 3 No Missing No
105 1 Missing Missing Missing
105 2 Missing Missing Missing
105 3 Yes 10 Yes
106 1 Missing Missing Missing
106 2 Yes 9 No
106 3 No 10 Yes
So I would want to create a code the would give me the first non-missing value of these variables for each ID. For example, for ID 104 that would be visit 2 for marriage, visit 1 for age, visit 2 for having close friends.
You will get better and quicker answers if you show your sample data as a data step, e.g. like this:
options missing=' ';
data have;
infile cards dsd delimiter=',' missover;
length
Study_ID 8
Visit_no 8
Marriage_satisfaction $3
Age 8
Close_friends $3
;
input Study_ID--Close_friends;
cards;
104,1,,8,,
104,2,Yes,9,Yes,
104,3,No,,No,
105,1,,,,
105,2,,,,
105,3,Yes,10,Yes,
106,1,,,,
106,2,Yes,9,No,
106,3,No,10,Yes,
;run;
The problem you have can be solved with an update in place solution. But first you may want to copy your input dataset to the output dataset:
data want;
set have;
run;
Then you can update the WANT dataset like this:
data want;
length
_Marriage_satisfaction $3
_Age 8
_Close_friends $3
;
do _N_=nobs to 1 by -1;
modify want point=_N_ nobs=nobs;
first_ID=Study_ID ne lag(Study_ID);
if not first_ID then do;
if missing(Marriage_satisfaction) then
Marriage_satisfaction=_Marriage_satisfaction;
if missing(age) then
age=_age;
if missing(Close_friends) then
Close_friends=_Close_friends;
replace;
end;
_Marriage_satisfaction=Marriage_satisfaction;
_age=age;
_Close_friends=Close_friends;
end;
stop;
run;
So the basic idea is to run through the dataset backwards, and store the values to be updated in temporary variables (after getting the previous value if missing).
Thanks at @s_lassen for providing data in usable form 😉
retain + coalesce(c) could be used:
data want;
set have;
by Study_ID;
length
_Marriage_satisfaction $ 3
_Age 8
_Close_friends $ 3
;
retain _:;
if first.Study_ID then do;
call missing(of _:);
end;
_Marriage_satisfaction = coalescec(_Marriage_satisfaction, Marriage_satisfaction);
_Age = coalesce(_Age, Age);
_Close_friends = coalescec(_Close_friends, Close_friends);
if last.Study_ID then do;
Marriage_satisfaction = _Marriage_satisfaction;
Age = _Age;
Close_friends = _Close_friends;
output;
end;
drop _:;
run;
data have; input StudyID Visit Marriage $ Age ; cards; 104 1 . 8 . 104 2 Yes 9 Yes 104 3 No . No 105 1 . . . 105 2 . . . 105 3 Yes 10 Yes 106 1 . . . 106 2 Yes 9 No 106 3 No 10 ; data first_Marriage; set have(where=(Marriage is not missing)); by StudyID; if first.StudyID; keep StudyID Visit Marriage; run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.