Hi, my dataset includes multiple diagnostic codes from multiple facilities for each patient. I want to only keep observations (ID) for the FACILITY with the greatest number of diagnostic CODEs. Is there an easy way to do this?
For example, this:
ID FACILITY CODE
1 A 416000.1
1 A 751101.2
1 B 416000.1
2 A 416000.1
2 A 751101.2
2 B 416000.1
2 B 102000.1
2 B 102146.2
would become this:
ID FACILITY CODE
1 A 416000.1
1 A 751101.2
2 B 416000.1
2 B 102000.1
2 B 102146.2
Thanks,
Amanda
Hi @Amanda3 Assuming I understand your requirement
data have;
input ID FACILITY $ CODE;
cards;
1 A 416000.1
1 A 751101.2
1 B 416000.1
2 A 416000.1
2 A 751101.2
2 B 416000.1
2 B 102000.1
2 B 102146.2
;
proc sql;
create table want(drop=c) as
select *
from (select *,count(*) as c from have group id,facility)
group by id
having max(c)=c;
quit;
Hi @Amanda3 Assuming I understand your requirement
data have;
input ID FACILITY $ CODE;
cards;
1 A 416000.1
1 A 751101.2
1 B 416000.1
2 A 416000.1
2 A 751101.2
2 B 416000.1
2 B 102000.1
2 B 102146.2
;
proc sql;
create table want(drop=c) as
select *
from (select *,count(*) as c from have group id,facility)
group by id
having max(c)=c;
quit;
One problem, how do I select only one facility for each ID when facilities have the same max?
For example:
ID FACILITY CODE
3 A 416000.1
3 A 751101.2
3 B 416000.1
3 B 416000.1
Good question @Amanda3 , So it;s a business question as to how to deal with ties?
You got to tell us, which ones do you want to keep where there is a tie? How would you want to prioritize?
Okay @Amanda3 Take your time, examine the variables and your business requirement thoroughly and once you are clear, please post your modified input sample and the expected output for the new sample. Briefly add the logic notes why as in like your "Severity" variable. Once we have this squeaky clean, coding is 5 mins work.
@novinosrin, thank you and sorry for the confusion. My dataset includes multiple diagnostic codes from multiple facilities for each patient. I want to only keep observations (ID) for the FACILITY with the greatest number of diagnostic CODEs.
data have;
input ID FACILITY $ CODE SEVERITY;
cards;
1 A 416000.1 1
1 A 751101.2 2
1 B 416000.1 1
2 A 416000.1 1
2 A 751101.2 2
2 B 416000.1 1
2 B 102000.1 1
2 B 102146.2 2
3 A 416000.1 1
3 A 751101.2 2
3 B 416000.1 1
3 B 416000.1 1
4 A 416000.1 1
4 A 751101.2 1
4 B 416000.1 1
4 B 416000.1 1;
You have already helped me determine how to use SAS to select FACILITY 1A and 2B (since they have the most CODEs), but FACILITY 3 A/B and 4 A/B both have two CODEs. I would like to select FACILITY 3A because it has a greater total SEVERITY (1+2). The SEVERITY scores for facility 4 A/B are tied, so it doesn’t matter which I include. In the WANT dataset I just included first FACILITY (4A).
WANT:
1 A 416000.1 1
1 A 751101.2 2
2 B 416000.1 1
2 B 102000.1 1
2 B 102146.2 2
3 A 416000.1 1
3 A 751101.2 2
4 A 416000.1 1
4 A 751101.1 1
Thank you again for sparing your valuable time to help!
Hi @Amanda3 Keeping it simple
/*Modified the same SQL with a sum of severity and
descending severity into a temp table for a rank order*/
proc sql;
create table temp(drop=c) as
select *
from (select *,count(*) as c,sum(severity) as s from have group id,facility)
group by id
having max(c)=c
order by id,facility,s desc;
quit;
/*Using the rank order subsetting the ID facility that falls into the
top order i.e n=1 and finally dropping all temp logic variables*/
data want;
set temp;
by id facility;
if first.id then n=1;
else if first.facility then n+1;
if n=1;
drop n s ;
run;
@Amanda3 wrote:
Hi, my dataset includes multiple diagnostic codes from multiple facilities for each patient. I want to only keep observations (ID) for the FACILITY with the greatest number of diagnostic CODEs. Is there an easy way to do this?
I don't understand the logic that gets your from input to output. Why isn't the third line of the input (1 B 416000.1) included in the output? Why aren't the rows that begin with 2 A in the output?
@PaigeMiller Sorry for the confusion. The logic that got me from the input to output was that ID#1 went to facility A (2 codes) and facility B (1 code), while ID#2 went to facility A (2 codes) and B (3 codes). I want to keep the facility with the most codes.
It can't get more concise than @novinosrin's SQL where all procedural details are hidden behind the brain of the SQL optimizer. But the task offers a good proving ground for a nice exercise of making use of BY processing coupled with a hash table. Your sample data are sorted by ID FACILITY but not by CODE. Hence, we can take advantage of the sorted order and use a hash table to find the maximal number of distinct CODE values per FACILITY:
data have ;
input id facility $ code $ ;
cards ;
1 A 416000.1
1 A 751101.2
1 B 416000.1
2 A 416000.1
2 A 751101.2
2 B 416000.1
2 B 102000.1
2 B 102146.2
;
run ;
data want (drop = _:) ;
if _n_ = 1 then do ;
dcl hash q () ;
q.definekey ("code") ;
q.definedone () ;
end ;
do _n_ = q.clear() by 0 until (last.id) ;
do until (last.facility) ;
set have ;
by id facility ;
q.ref() ;
end ;
if q.num_items <= _qmax then continue ;
_qmax = q.num_items ;
_fmax = facility ;
end ;
do until (last.id) ;
set have ;
by id ;
if facility = _fmax then output ;
end ;
run ;
Now if the input file were sorted by ID FACILITY CODE, BY processing alone would more than suffice:
proc sort data = have out = shave ;
by id facility code ;
run ;
data want (drop = _:) ;
do until (last.id) ;
do _q = 0 by 0 until (last.facility) ;
set shave ;
by id facility code ;
_q + first.code ;
end ;
if _q <= _qmax then continue ;
_qmax = _q ;
_fmax = facility ;
end ;
do until (last.id) ;
set shave ;
by id ;
if facility = _fmax then output ;
end ;
run ;
Kind regards
Paul D.
Good morning Guru @hashman , Indeed classic and slick. However, it seems it's OP's very first post and tagged as "NewUser", so without taking away the potential of the prospect of great ability of new users, hmm nevertheless your level of Classics might be too overwhelming? I recall reading your magnificent "DO" paper 3 years ago at a time when my experience was 2.5 years and that was mind boggling.
Having said that, I know Paige miller and I are certainly benefited. So Thank you and Kudos 🙂
@novinosrin: Thanks 1E6 for all the kind words.
To your "new user" point, methinks it merely indicates a new user of this discussion forum and not necessarily a new SAS user.
Secondly, IMO, one can learn a lot from struggling to understand code (written by a solid programmer) that at first glance seems incomprehensible. Finally, what we post here is intended not solely for the OPs but for anyone who cares to read them.
Kind regards
Paul D.
Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.
Register today!Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Select SAS Training centers are offering in-person courses. View upcoming courses for: