Hi All,
I have this longitudinal dataset with multiple observations per studyid:
data have;
input studyID FSA $3. Exposure Age ;
cards;
1 B2Y 384 17
1 B2Y 384 17
1 B2Y 384 18
2 BgT 1000 15
3 M6D 400 14
3 M6D 400 15
3 M6D 400 16
run;
What I want to do is create a new variable called "Age_start" which is the lowest value for age for each indivdual in the study, so the "want" dataset would look like this:
data want;
input studyID FSA $3. Exposure Age age_start ;
cards;
1 B2Y 384 17 17
1 B2Y 384 17 17
1 B2Y 384 18 17
2 BgT 1000 15 15
3 M6D 400 14 14
3 M6D 400 15 14
3 M6D 400 16 14
run;
As per usual, any thoughts would be hugely appreciated. I've been playing with it for a couple of hours using proc sort and then running variations of a data step that looks like
data want;
set have;
age_start = first.age;
run;
but all I get are zeros.
Any thoughts are much appreciated.
Thank you.
Rightcoast
data want;
set have;
by studyid;
retain age_start;
if first.studyID then
age_start = age;
run;
data want;
set have;
by studyid;
retain age_start;
if first.studyID then
age_start = age;
run;
data have;
input studyID FSA $3. Exposure Age ;
cards;
1 B2Y 384 17
1 B2Y 384 17
1 B2Y 384 18
2 BgT 1000 15
3 M6D 400 14
3 M6D 400 15
3 M6D 400 16
run;
proc sql;
create table want as
select *, min(age) as age_start
from have
group by studyid
order by studyid;
quit;
FIRST. and LAST. variables are binary valued numbers that are generated to indicate if the current observation is the first or last in the current group defined by the values of the BY values upto and including the one you listed.
Looks like you have found a feature (A.K.A. bug) in SAS. The reference to FIRST.AGE should have generated an error since not only is AGE not in your list of BY variables but you don't even have any BY statement. Instead it looks like SAS just treated it as always FALSE. Hence the constant value of zero.
Easy to do in SQL code since SAS will automatically remerge summary statistics for you.
proc sql;
create table want as select *,min(age) as first_age from have group by studyid;
quit;
Or if the data is sorted (either by AGE or by DATE and your AGE variables are correct) within STUDYID then you can use real BY variable processing in a data step.
data want;
set have;
by studyid ;
if first.studyid then first_age=age;
retain first_age;
run;
Now if you have some missing values of AGE then there is a possibility that FIRST_AGE will be missing. You could add this statement so that the first non-missing value is taken.
first_age=coalesce(first_age,age);
But then the earlier values will still have missing values.
You could also add what are called DOW loops. Basically placing the SET/MERGE statement inside of DO loop. In this case two DO loops. One to find the MIN and another to actually write the values.
data want;
do until (last.studyid);
set have;
by studyid;
first_age=min(first_age,age);
end;
do until (last.studyid);
set have;
by studyid;
output;
end;
run;
Hi Tom,
Thanks so much for your thoughtful response. Thankfully I don't have any missing data, but I'll file it away because I'm sure it will come up in the future. The clarification on the .first/.last variables was especially helpful because I thought they were a product of proc sort generally, not of the BY statement. I'm sure that will help my programming going forward.
As always I'm super humbled and impressed by all the people willing to help out on here.
Thanks so much, hopefully one day I can get skilled enough to pay it forward and help out someone else.
Rightcoast
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.