Re: Deleting all observations with same ID if missing data

hwkim286 · Posted 08-13-2019 09:52 PM

Hello.

I am currently working with a panel data. I have found that some of the IDs are missing some values for some of the years. I would like to delete the IDs if they have any missing variables.

For example,

ID Gender Income

1 1 5000

1 1 7000

1 1 .

2 2 3000

2 2 5000

2 2 1000

3 1 .

3 1 900000

3 1 12345

Since ID number 1 and 3 have missing data, I would like to delete all observations under #1 and 3.

I've merged and transposed data from multiple datasets to create a single dataset.

Patrick · Posted 08-13-2019 10:13 PM

Not tested but code as below should work.

/* option 1 */
data want;
  if _n_=1 then
    do;
      dcl hash h1(dataset:'have(where=(missing(income)))', duplicate:'r');
      h1.defineKey('id');
      h1.defineDone();
    end;
  set have;
  if h1.check() ne 0 then delete;
run;

/* option 2 */
proc sql;
  delete from have o
  where exists
  (select * from have i where i.income is missing and i.id=o.id)
  ;
quit;

novinosrin · Posted 08-13-2019 10:20 PM

data have;
input ID    Gender     Income ;
cards;
1         1             5000
1         1             7000 
1         1                .
2         2             3000
2         2             5000
2         2             1000
3         1              .
3         1             900000
3         1             12345
;
data want;
merge have(in=a where=(missing(income))) have(in=b) ;
by id;
if b and not a;
run;

proc sql;
create table want as
select *
from have
where id not in (select id from have where missing(income));
quit;

PGStats · Posted 08-13-2019 11:00 PM

Just one more way:

proc sql;
create table want as 
select * 
from have
group by ID
having count(INCOME) = count(*);
quit;

PG

r_behata · Posted 08-13-2019 11:06 PM

data want;
do until(last.id);
	set have;
	by id;
	if missing(income) then miss=sum(miss,1);
end;

do until(last.id);
	set have;
	by id;
	if miss< 0 then output;
end;

drop miss;
run;

ChrisNZ · Posted 08-14-2019 12:55 AM

Just to be different, but hash (if not sorted) and merge (if sorted) are most likely the best options.

data WANT;
  REC_START_NO = sum(1,REC_END_NO);
  REC_KEEP     = 1;

  do until(last.ID);
    set HAVE;
    by ID;
    REC_KEEP   + -missing(INCOME);
    REC_END_NO + 1;
  end;

  do REC_NO = REC_START_NO to REC_END_NO while(REC_KEEP);
    set HAVE point = REC_NO;
    output;
  end;

  drop REC_: ;
run;

[Edited: Change flag from drop to keep]

High-Performance SAS Coding - Third Edition

hashman · Posted 08-14-2019 02:46 AM

@ChrisNZ :

"hash (if not sorted) and merge (if sorted) are most likely the best options"

Agreed! But still a like for thinking out of these and other boxes.

The efficiency of POINT= depends on data. Your approach includes the same-key groups not having a missing value. If the input file were huge and the key-groups having a missing value were small and rare, it would be more efficient to find their obs numbers and mark them as deleted via MODIFY with POINT=. Then the rest of the processing would just read the original file, auto-excluding the marked records. The SQL's DELETE basically does the same, but I don't know without testing how it stacks up against MODIFY performance-wise.

Kind regards

Paul D.

ChrisNZ · Posted 08-14-2019 07:17 PM

@hashman Good points .

In any case, since we are reading one BY group at a time, I expect the data set pages to be still be loaded in memory when the POINT= read operations take place (if they do), so the speed should be tolerable.

High-Performance SAS Coding - Third Edition

hashman · Posted 08-14-2019 09:48 PM

@ChrisNZ :

With your approach, I'd expect POINT= perform very well at any rate since your step never uses it to read the file out of order. It would be particularly true with enough memory to SASFILE it.

What I meant by "exclude rather than include" is something in this vein:

data have ;                                 
  input id gender income ;                  
cards;                                      
1  1  5000                                  
1  1     .                                  
1  1  7000                                  
1  1     .                                  
2  2  3000                                  
2  2  5000                                  
2  2  1000                                  
3  1   .                                    
3  1  900000                                
3  1  12345                                 
;                                           
run ;                                       
                                            
data rid (keep = rid_:) ;                   
  do until (last.id) ;                      
    set have (keep = id income) curobs = q ;
    by id ;                                 
    if first.id then rid_from = q ;         
    if missing (income) then _missflag = 1 ;
  end ;                                     
  if _missflag ;                            
  rid_to = q ;                              
run ;                                       
                                            
data have ;                                 
  set rid ;                                 
  do rid = rid_from to rid_to ;             
    modify have point = rid ;               
    remove ;                                
  end ;                                     
run ;

I'd expect it to perform faster than pretty much anything else against HAVE with a small number of short "missing" ID groups relative to the overall number of ID groups.

Kind regards

Paul D.

hwkim286 · Posted 08-14-2019 02:38 AM

Thank you guys for the help.

Is there a way to do the same for multiple variables?

For example,

PID Gender Income Net_worth

1 1 4999 .

1 1 . 111

1 1 131 1441

2 2 1000 555

2 2 2000 24443

2 2 3000 59530

3 1 155 .

3 1 122 12312

3 1 1244 .

Since PIDs 1 and 3 having missing values (1 in income and 3 in networth) I would like to delete all observations for both PIDs. PID is patient ID.

Thank you again.

PeterClemmensen · Posted 08-14-2019 02:45 AM

One way

data have;
input PID Gender Income Net_worth;
datalines;
1 1 4999 .
1 1 . 111
1 1 131 1441
2 2 1000 555
2 2 2000 24443
2 2 3000 59530
3 1 155 . 
3 1 122 12312
3 1 1244 .
;

data want;
   if _N_=1 then do;
      declare hash h(dataset:'have(where=(nmiss(Income, Net_worth) ge 1)');
      h.defineKey('PID');
      h.defineDone();
   end;
 
   set have;
   if h.check() ne 0;
run;

The DATA to DATA Step Macro
Blog: SASnrd

Ksharp · Posted 08-14-2019 07:31 AM

data have;
input ID    Gender     Income ;
cards;
1         1             5000
1         1             7000 
1         1                .
2         2             3000
2         2             5000
2         2             1000
3         1              .
3         1             900000
3         1             12345
;
proc sql;
create table want as
select *
 from have
  group by id
   having nmiss(income)=0;
quit;

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away