Hi all,
I have some repeated measures data where each ID could potentially have 2 rows of data. A sample is shown below.
ID |
Time |
Data |
1234 |
T1 |
1 |
1234 |
T2 |
1 |
5555 |
T1 |
0 |
6777 |
T2 |
. |
9876 |
T1 |
0 |
9876 |
T2 |
0 |
1000 |
T1 |
1 |
2000 |
T2 |
1 |
8888 |
T1 |
1 |
8888 |
T2 |
. |
9000 |
T1 |
0 |
9000 |
T2 |
. |
1010 |
T1 |
. |
1010 |
T2 |
. |
I'm trying to reduce each ID down to 1 observation each. The criteria is that I'm prioritizing their data on Time 2 (Time=T2). So if they have non-missing data on T2, I pick that row as that ID's observation.
I used the code below to re-sort the data in descending "Time", so that the row where Time=T2 appears first. After that, it's easy for me to just take the first row (which should be the T2 row if they have 2 rows of data) as the observation for each ID.
proc sort data=test;
by ID descending time;
run;
data want;
set test;
by ID;
if first.ID;
run;
My issue now are the data where they have missing data on T2, but not on T1. Looking at the sample data, if I proceeded with my code above ^, the ID "9000" (as well as ID 8888), would have their T2 observation picked where "Data" is missing. However, when you look at their T1 observation, ID 9000 has existing Data in T1 that's "0".
Non-missing data would mean like any value that's not ".". Therefore, Data=0 would count as non-missing data.
Basically, if the ID has 2 rows, prioritize picking their T2 row - if there's data. If there isn't, look at their T1 row - if there is data, pick that row instead. If all rows for Data are missing for that ID, then stick with picking their T2 row.
I'd like to hopefully produce a table like the one below:
ID |
Time |
Data |
1234 |
T2 |
1 |
5555 |
T1 |
0 |
6777 |
T2 |
. |
9876 |
T2 |
0 |
1000 |
T1 |
1 |
2000 |
T2 |
1 |
8888 |
T1 |
1 |
9000 |
T1 |
0 |
1010 |
T2 |
. |
Any help would be very much appreciated, thanks.
Thank you @asgee Please post the best representative sample so that we don't have to go back and forth. Of course, it may take some time to prepare a mock data, however it's worth spending some time to think through carefully before posting the data, so that we are done in one shot 🙂
Btw, no worries it's no biggie-
proc sql;
create table want(drop=_n_) as
select *, max(Time='T2' and data>.) as _n_
from have
group by id
having _n_ and Time='T2' or not _n_ and time='T1'
or count(distinct time)=1;
quit;
I didn't test this, but you can. I am 99.998% sure it will work 😉
proc sort data=test(where=(not missing(data))) out=test1;
by ID descending time;
run;
data want;
set test1;
by ID;
if first.ID;
run;
Hi @asgee Fun SQL stuff-
data have;
input ID Time $ Data;
cards;
1234 T1 .
1234 T2 31
5555 T1 10
6777 T2 .
9876 T1 64
9876 T2 12
1000 T1 43
2000 T2 9
8888 T1 40
8888 T2 .
9000 T1 90
9000 T2 .
;
proc sql;
create table want(drop=_n_) as
select *, max(Time='T2' and data) as _n_
from have
group by id
having _n_ and Time='T2' or not _n_ and time='T1'
or count(distinct time)=1;
quit;
Hi @novinosrin, thanks for the response! The code appears to work through all of the criteria, however I think it flips the last criteria where if the ID has two observations with existing data (T1 and T2 are non-missing).
My fault for not mentioning that the original "Data" variable has values in the format of (0, 1 or .). So for example:
- If you were to change the data on ID 9876 to have two rows of non-missing data on both T1 and T2, but those data are "0" (meaning T1=0, T2=0). They'd be considered "non-missing" as they're not "."
-When I ran your code, I think for ID's like these, it prioritized picking the T1 row instead of the T2 row. I guess since both rows were 0, it instead picked the first in line.
Otherwise, the code works perfectly when selecting the issue of T2 being missing, but T1 isn't (i.e. the ID 9000 issue). It also picks up the ID's with only single rows.
I'll keep testing out your code and see if I can adjust it. Maybe its just a resorting error on my end... Thanks again.
Thank you @asgee Please post the best representative sample so that we don't have to go back and forth. Of course, it may take some time to prepare a mock data, however it's worth spending some time to think through carefully before posting the data, so that we are done in one shot 🙂
Btw, no worries it's no biggie-
proc sql;
create table want(drop=_n_) as
select *, max(Time='T2' and data>.) as _n_
from have
group by id
having _n_ and Time='T2' or not _n_ and time='T1'
or count(distinct time)=1;
quit;
And just to account for -"If all rows for Data are missing for that ID, then stick with picking their T2 row. " May be this is more prudent-
proc sql;
create table want(drop=_n_) as
select *, max(Time='T2' and data>.) as _n_
from have
group by id
having _n_ and Time='T2' or not _n_ and time='T1' and n(data)
or not _n_ and time='T2' and n(data)=0
or count(distinct time)=1;
quit;
data have; input ID Time $ Data; cards; 1234 T1 1 1234 T2 1 5555 T1 0 6777 T2 . 9876 T1 0 9876 T2 0 1000 T1 1 2000 T2 1 8888 T1 1 8888 T2 . 9000 T1 0 9000 T2 . 1010 T1 . 1010 T2 . ; data temp; set have; missing=missing(data); run; proc sort data=temp;by id descending missing time;run; data want; set temp; by id; if last.id; run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.