Hello.
I am currently working with a panel data. I have found that some of the IDs are missing some values for some of the years. I would like to delete the IDs if they have any missing variables.
For example,
ID Gender Income
1 1 5000
1 1 7000
1 1 .
2 2 3000
2 2 5000
2 2 1000
3 1 .
3 1 900000
3 1 12345
Since ID number 1 and 3 have missing data, I would like to delete all observations under #1 and 3.
I've merged and transposed data from multiple datasets to create a single dataset.
Not tested but code as below should work.
/* option 1 */
data want;
if _n_=1 then
do;
dcl hash h1(dataset:'have(where=(missing(income)))', duplicate:'r');
h1.defineKey('id');
h1.defineDone();
end;
set have;
if h1.check() ne 0 then delete;
run;
/* option 2 */
proc sql;
delete from have o
where exists
(select * from have i where i.income is missing and i.id=o.id)
;
quit;
data have;
input ID Gender Income ;
cards;
1 1 5000
1 1 7000
1 1 .
2 2 3000
2 2 5000
2 2 1000
3 1 .
3 1 900000
3 1 12345
;
data want;
merge have(in=a where=(missing(income))) have(in=b) ;
by id;
if b and not a;
run;
proc sql;
create table want as
select *
from have
where id not in (select id from have where missing(income));
quit;
Just one more way:
proc sql;
create table want as
select *
from have
group by ID
having count(INCOME) = count(*);
quit;
data want;
do until(last.id);
set have;
by id;
if missing(income) then miss=sum(miss,1);
end;
do until(last.id);
set have;
by id;
if miss< 0 then output;
end;
drop miss;
run;
Just to be different, but hash (if not sorted) and merge (if sorted) are most likely the best options.
data WANT;
REC_START_NO = sum(1,REC_END_NO);
REC_KEEP = 1;
do until(last.ID);
set HAVE;
by ID;
REC_KEEP + -missing(INCOME);
REC_END_NO + 1;
end;
do REC_NO = REC_START_NO to REC_END_NO while(REC_KEEP);
set HAVE point = REC_NO;
output;
end;
drop REC_: ;
run;
[Edited: Change flag from drop to keep]
@ChrisNZ :
"hash (if not sorted) and merge (if sorted) are most likely the best options"
Agreed! But still a like for thinking out of these and other boxes.
The efficiency of POINT= depends on data. Your approach includes the same-key groups not having a missing value. If the input file were huge and the key-groups having a missing value were small and rare, it would be more efficient to find their obs numbers and mark them as deleted via MODIFY with POINT=. Then the rest of the processing would just read the original file, auto-excluding the marked records. The SQL's DELETE basically does the same, but I don't know without testing how it stacks up against MODIFY performance-wise.
Kind regards
Paul D.
@hashman Good points .
In any case, since we are reading one BY group at a time, I expect the data set pages to be still be loaded in memory when the POINT= read operations take place (if they do), so the speed should be tolerable.
@ChrisNZ :
With your approach, I'd expect POINT= perform very well at any rate since your step never uses it to read the file out of order. It would be particularly true with enough memory to SASFILE it.
What I meant by "exclude rather than include" is something in this vein:
data have ;
input id gender income ;
cards;
1 1 5000
1 1 .
1 1 7000
1 1 .
2 2 3000
2 2 5000
2 2 1000
3 1 .
3 1 900000
3 1 12345
;
run ;
data rid (keep = rid_:) ;
do until (last.id) ;
set have (keep = id income) curobs = q ;
by id ;
if first.id then rid_from = q ;
if missing (income) then _missflag = 1 ;
end ;
if _missflag ;
rid_to = q ;
run ;
data have ;
set rid ;
do rid = rid_from to rid_to ;
modify have point = rid ;
remove ;
end ;
run ;
I'd expect it to perform faster than pretty much anything else against HAVE with a small number of short "missing" ID groups relative to the overall number of ID groups.
Kind regards
Paul D.
Thank you guys for the help.
Is there a way to do the same for multiple variables?
For example,
PID Gender Income Net_worth
1 1 4999 .
1 1 . 111
1 1 131 1441
2 2 1000 555
2 2 2000 24443
2 2 3000 59530
3 1 155 .
3 1 122 12312
3 1 1244 .
Since PIDs 1 and 3 having missing values (1 in income and 3 in networth) I would like to delete all observations for both PIDs. PID is patient ID.
Thank you again.
One way
data have;
input PID Gender Income Net_worth;
datalines;
1 1 4999 .
1 1 . 111
1 1 131 1441
2 2 1000 555
2 2 2000 24443
2 2 3000 59530
3 1 155 .
3 1 122 12312
3 1 1244 .
;
data want;
if _N_=1 then do;
declare hash h(dataset:'have(where=(nmiss(Income, Net_worth) ge 1)');
h.defineKey('PID');
h.defineDone();
end;
set have;
if h.check() ne 0;
run;
data have;
input ID Gender Income ;
cards;
1 1 5000
1 1 7000
1 1 .
2 2 3000
2 2 5000
2 2 1000
3 1 .
3 1 900000
3 1 12345
;
proc sql;
create table want as
select *
from have
group by id
having nmiss(income)=0;
quit;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.