Solved: Re: How can I select a group of observations by facility for each ID?

Amanda3 · Posted 10-15-2019 06:07 PM

Hi, my dataset includes multiple diagnostic codes from multiple facilities for each patient. I want to only keep observations (ID) for the FACILITY with the greatest number of diagnostic CODEs. Is there an easy way to do this?

For example, this:

ID FACILITY CODE

1 A 416000.1

1 A 751101.2

1 B 416000.1

2 A 416000.1

2 A 751101.2

2 B 416000.1

2 B 102000.1

2 B 102146.2

would become this:

ID FACILITY CODE

1 A 416000.1

1 A 751101.2

2 B 416000.1

2 B 102000.1

2 B 102146.2

Thanks,

Amanda

novinosrin · Posted 10-15-2019 06:13 PM

Hi @Amanda3 Assuming I understand your requirement


data have;
input ID FACILITY $ CODE;
cards;
1 A 416000.1
1 A 751101.2
1 B 416000.1
2 A 416000.1
2 A 751101.2
2 B 416000.1
2 B 102000.1
2 B 102146.2
;
 
proc sql;
create table want(drop=c) as
select *
from (select *,count(*) as c from have group id,facility)
group by id
having max(c)=c;
quit;

View solution in original post

novinosrin · Posted 10-15-2019 06:13 PM

Hi @Amanda3 Assuming I understand your requirement


data have;
input ID FACILITY $ CODE;
cards;
1 A 416000.1
1 A 751101.2
1 B 416000.1
2 A 416000.1
2 A 751101.2
2 B 416000.1
2 B 102000.1
2 B 102146.2
;
 
proc sql;
create table want(drop=c) as
select *
from (select *,count(*) as c from have group id,facility)
group by id
having max(c)=c;
quit;

Amanda3 · Posted 10-16-2019 09:29 AM

@novinosrin, Thank you! It worked. I should have asked sooner. I cannot believe how much time I wasted.

Amanda3 · Posted 10-16-2019 10:59 AM

One problem, how do I select only one facility for each ID when facilities have the same max?

For example:

ID FACILITY CODE

3 A 416000.1

3 A 751101.2

3 B 416000.1

novinosrin · Posted 10-16-2019 11:03 AM

Good question @Amanda3 , So it;s a business question as to how to deal with ties?

You got to tell us, which ones do you want to keep where there is a tie? How would you want to prioritize?

Amanda3 · Posted 10-16-2019 06:02 PM

Thanks for being so nice. There is a fourth variable SEVERITY (ranging from 1-6, where 1 is least severe and 6 is most severe). I would want to keep patients with the most severe injuries, but I suspect there will be many ties (again). In the case of another tie, it wouldn't matter which facility was kept.

novinosrin · Posted 10-16-2019 06:10 PM

Okay @Amanda3 Take your time, examine the variables and your business requirement thoroughly and once you are clear, please post your modified input sample and the expected output for the new sample. Briefly add the logic notes why as in like your "Severity" variable. Once we have this squeaky clean, coding is 5 mins work.

Amanda3 · Posted 10-17-2019 08:59 AM

@novinosrin, thank you and sorry for the confusion. My dataset includes multiple diagnostic codes from multiple facilities for each patient. I want to only keep observations (ID) for the FACILITY with the greatest number of diagnostic CODEs.

data have;

input ID FACILITY $ CODE SEVERITY;

cards;

1 A 416000.1 1

1 A 751101.2 2

1 B 416000.1 1

2 A 416000.1 1

2 A 751101.2 2

2 B 416000.1 1

2 B 102000.1 1

2 B 102146.2 2

3 A 416000.1 1

3 A 751101.2 2

3 B 416000.1 1

4 A 416000.1 1

4 A 751101.2 1

4 B 416000.1 1

4 B 416000.1 1;

You have already helped me determine how to use SAS to select FACILITY 1A and 2B (since they have the most CODEs), but FACILITY 3 A/B and 4 A/B both have two CODEs. I would like to select FACILITY 3A because it has a greater total SEVERITY (1+2). The SEVERITY scores for facility 4 A/B are tied, so it doesn’t matter which I include. In the WANT dataset I just included first FACILITY (4A).

WANT:

1 A 416000.1 1

1 A 751101.2 2

2 B 416000.1 1

2 B 102000.1 1

2 B 102146.2 2

3 A 416000.1 1

3 A 751101.2 2

4 A 416000.1 1

4 A 751101.1 1

Thank you again for sparing your valuable time to help!

novinosrin · Posted 10-17-2019 10:43 AM

Hi @Amanda3 Keeping it simple

/*Modified the same SQL with a sum of severity and 
descending severity into a temp table for a rank order*/
proc sql;
create table temp(drop=c) as
select *
from (select *,count(*) as c,sum(severity) as s from have group id,facility)
group by id
having max(c)=c
order by id,facility,s desc; 
quit;
/*Using the rank order subsetting the ID facility that falls into the 
top order i.e n=1 and finally dropping all temp logic variables*/
data want;
set temp;
by id facility;
if first.id then n=1;
else if first.facility then  n+1;
if n=1;
drop n s ;
run;

PaigeMiller · Posted 10-15-2019 06:15 PM

@Amanda3 wrote:

Hi, my dataset includes multiple diagnostic codes from multiple facilities for each patient. I want to only keep observations (ID) for the FACILITY with the greatest number of diagnostic CODEs. Is there an easy way to do this?

I don't understand the logic that gets your from input to output. Why isn't the third line of the input (1 B 416000.1) included in the output? Why aren't the rows that begin with 2 A in the output?

--
Paige Miller

Amanda3 · Posted 10-16-2019 09:28 AM

@PaigeMiller Sorry for the confusion. The logic that got me from the input to output was that ID#1 went to facility A (2 codes) and facility B (1 code), while ID#2 went to facility A (2 codes) and B (3 codes). I want to keep the facility with the most codes.

hashman · Posted 10-16-2019 12:00 AM

@Amanda3:

It can't get more concise than @novinosrin's SQL where all procedural details are hidden behind the brain of the SQL optimizer. But the task offers a good proving ground for a nice exercise of making use of BY processing coupled with a hash table. Your sample data are sorted by ID FACILITY but not by CODE. Hence, we can take advantage of the sorted order and use a hash table to find the maximal number of distinct CODE values per FACILITY:

data have ;                                
  input id facility $ code $ ;             
  cards ;                                  
1 A 416000.1                               
1 A 751101.2                               
1 B 416000.1                               
2 A 416000.1                               
2 A 751101.2                               
2 B 416000.1                               
2 B 102000.1                               
2 B 102146.2                               
;                                          
run ;                                      
                                           
data want (drop = _:) ;                    
  if _n_ = 1 then do ;                     
    dcl hash q () ;                        
    q.definekey ("code") ;                 
    q.definedone () ;                      
  end ;                                    
  do _n_ = q.clear() by 0 until (last.id) ;
    do until (last.facility) ;             
      set have ;                           
      by id facility ;                     
      q.ref() ;                            
    end ;                                  
    if q.num_items <= _qmax then continue ;
    _qmax = q.num_items ;                  
    _fmax = facility ;                     
  end ;                                    
  do until (last.id) ;                     
    set have ;                             
    by id ;                                
    if facility = _fmax then output ;      
  end ;                                    
run ;

Now if the input file were sorted by ID FACILITY CODE, BY processing alone would more than suffice:

proc sort data = have out = shave ;                                                                                                     
  by id facility code ;                                                                                                                 
run ;                                                                                                                                   
                                                                                                                                        
data want (drop = _:) ;                                                                                                                 
  do until (last.id) ;                                                                                                                  
    do _q = 0 by 0 until (last.facility) ;                                                                                              
      set shave ;                                                                                                                       
      by id facility code ;                                                                                                             
      _q + first.code ;                                                                                                                 
    end ;                                                                                                                               
    if _q <= _qmax then continue ;                                                                                                      
    _qmax = _q ;                                                                                                                        
    _fmax = facility ;                                                                                                                  
  end ;                                                                                                                                 
  do until (last.id) ;                                                                                                                  
    set shave ;                                                                                                                         
    by id ;                                                                                                                             
    if facility = _fmax then output ;                                                                                                   
  end ;                                                                                                                                 
run ;

Kind regards

Paul D.

novinosrin · Posted 10-16-2019 08:32 AM

Good morning Guru @hashman , Indeed classic and slick. However, it seems it's OP's very first post and tagged as "NewUser", so without taking away the potential of the prospect of great ability of new users, hmm nevertheless your level of Classics might be too overwhelming? I recall reading your magnificent "DO" paper 3 years ago at a time when my experience was 2.5 years and that was mind boggling.

Having said that, I know Paige miller and I are certainly benefited. So Thank you and Kudos 🙂

hashman · Posted 10-16-2019 03:49 PM

@novinosrin: Thanks 1E6 for all the kind words.

To your "new user" point, methinks it merely indicates a new user of this discussion forum and not necessarily a new SAS user.

Secondly, IMO, one can learn a lot from struggling to understand code (written by a solid programmer) that at first glance seems incomprehensible. Finally, what we post here is intended not solely for the OPs but for anyone who cares to read them.

Kind regards

Paul D.

Amanda3 · Posted 10-16-2019 09:29 AM

@hashman Thanks for the response. I would have to agree with @novinosrin that you are at a level above. I will have to revisit your response when my skills improve.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away