Solved: Re: How do I detect and subset non-missing observations in repeated da...

asgee · Posted 01-02-2021 05:28 PM

Hi all,

I have some repeated measures data where each ID could potentially have 2 rows of data. A sample is shown below.

ID	Time	Data
1234	T1	1
1234	T2	1
5555	T1	0
6777	T2	.
9876	T1	0
9876	T2	0
1000	T1	1
2000	T2	1
8888	T1	1
8888	T2	.
9000	T1	0
9000	T2	.
1010	T1	.
1010	T2	.

I'm trying to reduce each ID down to 1 observation each. The criteria is that I'm prioritizing their data on Time 2 (Time=T2). So if they have non-missing data on T2, I pick that row as that ID's observation.

I used the code below to re-sort the data in descending "Time", so that the row where Time=T2 appears first. After that, it's easy for me to just take the first row (which should be the T2 row if they have 2 rows of data) as the observation for each ID.

proc sort data=test;
by ID descending time;
run;
	
data want;
set test;
by ID;
if first.ID;
run;

My issue now are the data where they have missing data on T2, but not on T1. Looking at the sample data, if I proceeded with my code above ^, the ID "9000" (as well as ID 8888), would have their T2 observation picked where "Data" is missing. However, when you look at their T1 observation, ID 9000 has existing Data in T1 that's "0".

Non-missing data would mean like any value that's not ".". Therefore, Data=0 would count as non-missing data.

Basically, if the ID has 2 rows, prioritize picking their T2 row - if there's data. If there isn't, look at their T1 row - if there is data, pick that row instead. If all rows for Data are missing for that ID, then stick with picking their T2 row.

I'd like to hopefully produce a table like the one below:

ID	Time	Data
1234	T2	1
5555	T1	0
6777	T2	.
9876	T2	0
1000	T1	1
2000	T2	1
8888	T1	1
9000	T1	0
1010	T2	.

Any help would be very much appreciated, thanks.

novinosrin · Posted 01-02-2021 06:54 PM

Thank you @asgee Please post the best representative sample so that we don't have to go back and forth. Of course, it may take some time to prepare a mock data, however it's worth spending some time to think through carefully before posting the data, so that we are done in one shot 🙂

Btw, no worries it's no biggie-


proc sql;
 create table want(drop=_n_) as
 select *,  max(Time='T2' and data>.) as _n_
 from have
 group by id
 having _n_  and Time='T2' or  not _n_ and time='T1' 
 or count(distinct time)=1;
quit;

View solution in original post

PaigeMiller · Posted 01-02-2021 05:36 PM

I didn't test this, but you can. I am 99.998% sure it will work 😉

proc sort data=test(where=(not missing(data))) out=test1;
by ID descending time;
run;
	
data want;
set test1;
by ID;
if first.ID;
run;

--
Paige Miller

asgee · Posted 01-02-2021 06:28 PM

Hi @PaigeMiller, thanks for the response! I tested your code but I think the proc sort at the start removed all the missing data 😞 I think I'd still want to keep the missing data if the ID has all their rows missing. However, it did fix the issue of the T2 being missing but T1 as not. Will try to test out and see if I can adjust this if anything. Thanks again.

novinosrin · Posted 01-02-2021 05:51 PM

Hi @asgee Fun SQL stuff-



data have;
input ID	Time $	Data;
cards;
1234	T1	.
1234	T2	31
5555	T1	10
6777	T2	.
9876	T1	64
9876	T2	12
1000	T1	43
2000	T2	9
8888	T1	40
8888	T2	.
9000	T1	90
9000	T2	.
;

proc sql;
 create table want(drop=_n_) as
 select *,  max(Time='T2' and data) as _n_
 from have
 group by id
 having _n_  and Time='T2' or not _n_ and time='T1' 
 or  count(distinct time)=1;
quit;

asgee · Posted 01-02-2021 06:35 PM

Hi @novinosrin, thanks for the response! The code appears to work through all of the criteria, however I think it flips the last criteria where if the ID has two observations with existing data (T1 and T2 are non-missing).

My fault for not mentioning that the original "Data" variable has values in the format of (0, 1 or .). So for example:
- If you were to change the data on ID 9876 to have two rows of non-missing data on both T1 and T2, but those data are "0" (meaning T1=0, T2=0). They'd be considered "non-missing" as they're not "."
-When I ran your code, I think for ID's like these, it prioritized picking the T1 row instead of the T2 row. I guess since both rows were 0, it instead picked the first in line.

Otherwise, the code works perfectly when selecting the issue of T2 being missing, but T1 isn't (i.e. the ID 9000 issue). It also picks up the ID's with only single rows.

I'll keep testing out your code and see if I can adjust it. Maybe its just a resorting error on my end... Thanks again.

asgee · Posted 01-02-2021 06:40 PM

@novinosrin I've edited my question above to reflect the "Data" variable. Perhaps the values being (0,1 or .) make a big difference...

novinosrin · Posted 01-02-2021 06:54 PM

Thank you @asgee Please post the best representative sample so that we don't have to go back and forth. Of course, it may take some time to prepare a mock data, however it's worth spending some time to think through carefully before posting the data, so that we are done in one shot 🙂

Btw, no worries it's no biggie-


proc sql;
 create table want(drop=_n_) as
 select *,  max(Time='T2' and data>.) as _n_
 from have
 group by id
 having _n_  and Time='T2' or  not _n_ and time='T1' 
 or count(distinct time)=1;
quit;

asgee · Posted 01-02-2021 07:02 PM

Thanks so much @novinosrin, yes I realized my mistake afterwards - my apologies for that 😞 . The code above works perfectly, really appreciate your help! Will be more wary the next time I post.

Thanks again!

novinosrin · Posted 01-02-2021 07:11 PM

And just to account for -"If all rows for Data are missing for that ID, then stick with picking their T2 row. " May be this is more prudent-


proc sql;
 create table want(drop=_n_) as
 select *,  max(Time='T2' and data>.) as _n_
 from have
 group by id
 having _n_  and Time='T2' or  not _n_ and time='T1' and n(data)
 or  not _n_ and time='T2' and n(data)=0
 or count(distinct time)=1;
quit;

asgee · Posted 01-02-2021 07:20 PM

Thanks for this! The code works great and accounts for that as well, thanks again @novinosrin! Really appreciate the help.

Ksharp · Posted 01-02-2021 10:58 PM

data have;
input ID	Time $	Data;
cards;
1234
T1
1
1234
T2
1
5555
T1
0
6777
T2
.
9876
T1
0
9876
T2
0
1000
T1
1
2000
T2
1
8888
T1
1
8888
T2
.
9000
T1
0
9000
T2
.
1010
T1
.
1010
T2
.
;

data temp;
 set have;
 missing=missing(data);
run;
proc sort data=temp;by id descending missing time;run;
data want;
 set temp;
 by id;
 if last.id;
run;

How do I detect and subset non-missing observations in repeated data (missing -> non-missing)?

Re: How do I detect and subset non-missing observations in repeated data (missing -> non-missing)

Re: How do I detect and subset non-missing observations in repeated data (missing -> non-missing)

Re: How do I detect and subset non-missing observations in repeated data (missing -> non-missing)

Re: How do I detect and subset non-missing observations in repeated data (missing -> non-missing)

Re: How do I detect and subset non-missing observations in repeated data (missing -> non-missing)

Re: How do I detect and subset non-missing observations in repeated data (missing -> non-missing)

Re: How do I detect and subset non-missing observations in repeated data (missing -> non-missing)

Re: How do I detect and subset non-missing observations in repeated data (missing -> non-missing)

Re: How do I detect and subset non-missing observations in repeated data (missing -> non-missing)

Re: How do I detect and subset non-missing observations in repeated data (missing -> non-missing)

Re: How do I detect and subset non-missing observations in repeated data (missing -> non-missing)

Register Today!

SAS Training: Just a Click Away