BookmarkSubscribeRSS Feed
hwkim286
Calcite | Level 5

Hello.

 

I am currently working with a panel data. I have found that some of the IDs are missing some values for some of the years. I would like to delete the IDs if they have any missing variables.

 

For example,

ID    Gender     Income

1         1             5000

1         1             7000 

1         1                .

2         2             3000

2         2             5000

2         2             1000

3         1              .

3         1             900000

3         1             12345

 

Since ID number 1 and 3 have missing data, I would like to delete all observations under #1 and 3.

 

I've merged and transposed data from multiple datasets to create a single dataset.

11 REPLIES 11
Patrick
Opal | Level 21

Not tested but code as below should work.

 

/* option 1 */
data want;
  if _n_=1 then
    do;
      dcl hash h1(dataset:'have(where=(missing(income)))', duplicate:'r');
      h1.defineKey('id');
      h1.defineDone();
    end;
  set have;
  if h1.check() ne 0 then delete;
run;

/* option 2 */
proc sql;
  delete from have o
  where exists
  (select * from have i where i.income is missing and i.id=o.id)
  ;
quit;
 
novinosrin
Tourmaline | Level 20
data have;
input ID    Gender     Income ;
cards;
1         1             5000
1         1             7000 
1         1                .
2         2             3000
2         2             5000
2         2             1000
3         1              .
3         1             900000
3         1             12345
;
data want;
merge have(in=a where=(missing(income))) have(in=b) ;
by id;
if b and not a;
run;

 

 

proc sql;
create table want as
select *
from have
where id not in (select id from have where missing(income));
quit;
 
PGStats
Opal | Level 21

Just one more way:

 

proc sql;
create table want as 
select * 
from have
group by ID
having count(INCOME) = count(*);
quit;
PG
r_behata
Barite | Level 11
data want;
do until(last.id);
	set have;
	by id;
	if missing(income) then miss=sum(miss,1);
end;

do until(last.id);
	set have;
	by id;
	if miss< 0 then output;
end;

drop miss;
run;
ChrisNZ
Tourmaline | Level 20

Just to be different, but hash (if not sorted) and merge (if sorted) are most likely the best options.

data WANT;
  REC_START_NO = sum(1,REC_END_NO);
  REC_KEEP     = 1;

  do until(last.ID);
    set HAVE;
    by ID;
    REC_KEEP   + -missing(INCOME);
    REC_END_NO + 1;
  end;

  do REC_NO = REC_START_NO to REC_END_NO while(REC_KEEP);
    set HAVE point = REC_NO;
    output;
  end;

  drop REC_: ;
run;

 

[Edited: Change flag from drop to keep]

hashman
Ammonite | Level 13

@ChrisNZ : 

"hash (if not sorted) and merge (if sorted) are most likely the best options"

 

Agreed! But still a like for thinking out of these and other boxes.

 

The efficiency of POINT= depends on data. Your approach includes the same-key groups not having a missing value. If the input file were huge and the key-groups having a missing value were small and rare, it would be more efficient to find their obs numbers and mark them as deleted via MODIFY with POINT=. Then the rest of the processing would just read the original file, auto-excluding the marked records. The SQL's DELETE basically does the same, but I don't know without testing how it stacks up against MODIFY performance-wise. 

 

Kind regards

Paul D.  

 

ChrisNZ
Tourmaline | Level 20

@hashman  Good points .

In any case, since we are reading one BY group at a time, I expect the data set pages to be still be loaded in memory when the POINT= read operations take place (if they do), so the speed should be tolerable.

hashman
Ammonite | Level 13

@ChrisNZ :

With your approach, I'd expect POINT= perform very well at any rate since your step never uses it to read the file out of order. It would be particularly true with enough memory to SASFILE it. 

 

What I meant by "exclude rather than include" is something in this vein:

data have ;                                 
  input id gender income ;                  
cards;                                      
1  1  5000                                  
1  1     .                                  
1  1  7000                                  
1  1     .                                  
2  2  3000                                  
2  2  5000                                  
2  2  1000                                  
3  1   .                                    
3  1  900000                                
3  1  12345                                 
;                                           
run ;                                       
                                            
data rid (keep = rid_:) ;                   
  do until (last.id) ;                      
    set have (keep = id income) curobs = q ;
    by id ;                                 
    if first.id then rid_from = q ;         
    if missing (income) then _missflag = 1 ;
  end ;                                     
  if _missflag ;                            
  rid_to = q ;                              
run ;                                       
                                            
data have ;                                 
  set rid ;                                 
  do rid = rid_from to rid_to ;             
    modify have point = rid ;               
    remove ;                                
  end ;                                     
run ;                                       

I'd expect it to perform faster than pretty much anything else against HAVE with a small number of short "missing" ID groups relative to the overall number of ID groups.

 

Kind regards

Paul D. 

hwkim286
Calcite | Level 5

Thank you guys for the help.

Is there a way to do the same for multiple variables?

 

For example,

 

PID     Gender     Income       Net_worth

1            1               4999             .

1            1                   .                111

1            1                131              1441

2            2                1000            555

2            2                2000            24443

2            2                3000            59530

3            1                155                . 

3            1                122              12312

3            1                1244              .

 

Since PIDs 1 and 3 having missing values (1 in income and 3 in networth) I would like to delete all observations for both PIDs. PID is patient ID.

 

Thank you again.

PeterClemmensen
Tourmaline | Level 20

One way

 

data have;
input PID Gender Income Net_worth;
datalines;
1 1 4999 .
1 1 . 111
1 1 131 1441
2 2 1000 555
2 2 2000 24443
2 2 3000 59530
3 1 155 . 
3 1 122 12312
3 1 1244 .
;

data want;
   if _N_=1 then do;
      declare hash h(dataset:'have(where=(nmiss(Income, Net_worth) ge 1)');
      h.defineKey('PID');
      h.defineDone();
   end;
 
   set have;
   if h.check() ne 0;
run;
Ksharp
Super User
data have;
input ID    Gender     Income ;
cards;
1         1             5000
1         1             7000 
1         1                .
2         2             3000
2         2             5000
2         2             1000
3         1              .
3         1             900000
3         1             12345
;
proc sql;
create table want as
select *
 from have
  group by id
   having nmiss(income)=0;
quit;

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 4266 views
  • 8 likes
  • 9 in conversation