Solved: How to subset repeated observations based on non missing values of a v...

asgee · Posted 10-19-2020 04:42 PM

Hi all,

I'm working on this dataset that has repeated measures over time. I want to detect the IDs where they have at least 2 non missing observations in the variable "Data". A sample of the dataset is shown below:

ID	Visit	Data
AAA	1	131
AAA	2	5
AAA	3	.
BBB	1	12
BBB	2	61
CCC	1	17
DDD	1	87
DDD	2	.
DDD	3	13
EEE	1	.
EEE	2	.
FFF	1	18
GGG	1	31
HHH	1	.
HHH	2	.
HHH	3	25

At the moment, I've modified a code that @novinosrin helped me out with from a previous question (thanks again!). However, when I ran this modified code, it seems to only detect (keep) the IDs where there is non-missing data right after each other (back-to-back). I believe it has something to do with the _flag step and have tried modifying that as well but I'm struggling to figure it out:

data want;
     do _n_=1 by 1 until(last.id);
	   set have;
       by id;
       if (data ne .) then _count=sum(_count,1);
	   else _count=0;
	   if _count>=2 then _flag=1;
	   end;
	 do _n_=1 to _n_;
	   set have;
	   if _flag then output;
	   end;
	drop _:;
run;

In this case, as long as there is at least 2 non-missing data in that ID (regardless of whether the non-missing data are back-to-back). Based on the sample table above, I'd want to produce something like the table below:

ID	Visit	Data
AAA	1	131
AAA	2	5
AAA	3	.
BBB	1	12
BBB	2	61
DDD	1	87
DDD	2	.
DDD	3	13

Any help would be much appreciated!

Thanks

Reeza · Posted 10-19-2020 04:49 PM

I think you can just use the HAVING clause in PROC SQL. Does the following work for you? If not, please post the log.

proc sql;

create table want as 

select *

from have

group by ID

having count(data)>2;

quit;

@asgee wrote:

Hi all,

I'm working on this dataset that has repeated measures over time. I want to detect the IDs where they have at least 2 non missing observations in the variable "Data". A sample of the dataset is shown below:

ID

Visit

Data

AAA

1

131

AAA

2

5

AAA

3

.

BBB

1

12

BBB

2

61

CCC

1

17

DDD

1

87

DDD

2

.

DDD

3

13

EEE

1

.

EEE

2

.

FFF

1

18

GGG

1

31

HHH

1

.

HHH

2

.

HHH

3

XY

At the moment, I've modified a code that @novinosrin helped me out with from a previous question (thanks again!). However, when I ran this modified code, it seems to only detect (keep) the IDs where there is non-missing data right after each other (back-to-back). I believe it has something to do with the _flag step and have tried modifying that as well but I'm struggling to figure it out:
data want;
     do _n_=1 by 1 until(last.id);
	   set have;
       by id;
       if (data ne .) then _count=sum(_count,1);
	   else _count=0;
	   if _count>=2 then _flag=1;
	   end;
	 do _n_=1 to _n_;
	   set have;
	   if _flag then output;
	   end;
	drop _:;
run;
In this case, as long as there is at least 2 non-missing data in that ID (regardless of whether the non-missing data are back-to-back). Based on the sample table above, I'd want to produce something like the table below:

ID

Visit

Data

AAA

1

131

AAA

2

5

AAA

3

.

BBB

1

12

BBB

2

61

DDD

1

87

DDD

2

.

DDD

3

13

Any help would be much appreciated!

Thanks

View solution in original post

Reeza · Posted 10-19-2020 04:49 PM

I think you can just use the HAVING clause in PROC SQL. Does the following work for you? If not, please post the log.

proc sql;

create table want as 

select *

from have

group by ID

having count(data)>2;

quit;

@asgee wrote:

Hi all,

I'm working on this dataset that has repeated measures over time. I want to detect the IDs where they have at least 2 non missing observations in the variable "Data". A sample of the dataset is shown below:

ID

Visit

Data

AAA

1

131

AAA

2

5

AAA

3

.

BBB

1

12

BBB

2

61

CCC

1

17

DDD

1

87

DDD

2

.

DDD

3

13

EEE

1

.

EEE

2

.

FFF

1

18

GGG

1

31

HHH

1

.

HHH

2

.

HHH

3

XY

At the moment, I've modified a code that @novinosrin helped me out with from a previous question (thanks again!). However, when I ran this modified code, it seems to only detect (keep) the IDs where there is non-missing data right after each other (back-to-back). I believe it has something to do with the _flag step and have tried modifying that as well but I'm struggling to figure it out:
data want;
     do _n_=1 by 1 until(last.id);
	   set have;
       by id;
       if (data ne .) then _count=sum(_count,1);
	   else _count=0;
	   if _count>=2 then _flag=1;
	   end;
	 do _n_=1 to _n_;
	   set have;
	   if _flag then output;
	   end;
	drop _:;
run;
In this case, as long as there is at least 2 non-missing data in that ID (regardless of whether the non-missing data are back-to-back). Based on the sample table above, I'd want to produce something like the table below:

ID

Visit

Data

AAA

1

131

AAA

2

5

AAA

3

.

BBB

1

12

BBB

2

61

DDD

1

87

DDD

2

.

DDD

3

13

Any help would be much appreciated!

Thanks

asgee · Posted 10-19-2020 05:03 PM

Hi @Reeza ! Thanks for the reply. Yes it seems like your code works perfectly! I never came across the "HAVING" clause so that's really good to know. Could there also be a "NOT HAVING" clause if I wanted to subset the opposite?

Reeza · Posted 10-19-2020 05:10 PM

@asgee wrote:
Hi @Reeza ! Thanks for the reply. Yes it seems like your code works perfectly! I never came across the "HAVING" clause so that's really good to know. Could there also be a "NOT HAVING" clause if I wanted to subset the opposite?

Try it and let us know 🙂

novinosrin · Posted 10-19-2020 04:53 PM


data have;
input ID $	Visit	Data $;
cards;
AAA	1	131
AAA	2	5
AAA	3	.
BBB	1	12
BBB	2	61
CCC	1	17
DDD	1	87
DDD	2	.
DDD	3	13
EEE	1	.
EEE	2	.
FFF	1	18
GGG	1	31
HHH	1	.
HHH	2	.
HHH	3	XY
;

proc sql;
 create table want as
 select *
 from have
 group by id
 having n(data)>=2
 order by id,visit;
quit;

/*or*/
data want;
 do _n_=1 by 1 until(last.id);
  set have;
  by id;
  _n=sum(^missing(data),_n);
 end;
 do _n_=1 to _n_;
  set have;
  if _n>=2 then output;
 end;
 drop _:;
run;

asgee · Posted 10-19-2020 05:04 PM

Hi @novinosrin ! Thanks for your reply. I tried your code and it works as well! Will keep your solution in mind for other issues I have with repeated measures. Thanks again!

How to subset repeated observations based on non missing values of a variable?

Re: How to subset repeated observations based on non missing values of a variable?

Re: How to subset repeated observations based on non missing values of a variable?

Re: How to subset repeated observations based on non missing values of a variable?

Re: How to subset repeated observations based on non missing values of a variable?

Re: How to subset repeated observations based on non missing values of a variable?

Re: How to subset repeated observations based on non missing values of a variable?

Classroom Training Available!