- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello All,
I have claims dataset and want to see most frequent Health Plan of the year for each patient who have health plans assigned for each month. In case if some one have two or more health plans with same frequency e.g. patient patient ID 3 in my example I would like to assign the latest health plan.
data have;
input id 1 hp_01 $ 3-5 hp_02 $ 7-9 hp_03 $ 11-13 hp_04 $ 15-17 hp_05 $ 19-21 hp_06 $ 23-25 hp_07 $ 27-29 hp_08 $ 31-33 hp_09 $ 35-37 hp_10 $ 39-41 hp_11 $ 43-45 hp_12 $ 47-49;
datalines;
1 MCO MCO MCO PPO PPO PPO PPO PPO PPO PPO PPO PPO
2 PDP PDP PDP PDP PDP PDP PDP PDP PDP PDP MCO MCO
3 EDP EDP EDP EDP EDP EDP PPO PPO PPO PPO PPO PPO
4 MCO MCO MCO MCO MCO MCO MCO MCO MCO EDP EDP EDP
5 EDP EDP EDP EDP MCO MCO PPO PPO PPO PPO EDP EDP
6 MCO NET NET NET EDP EDP NET NET MCO MCO MCO MCO
7 PPO PPO PPO NET NET NET NET NET PPO PPO MCO MCO
8 EDP MCO MCO MCO PPO PPO PPO PPO PPO NET NET NET
9 NET NET NET NET PPO PPO PPO PPO PPO PPO PPO PPO
;
Thank you for help in advance.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Below one way that should work.
data have;
input id 1 hp_01 $ 3-5 hp_02 $ 7-9 hp_03 $ 11-13 hp_04 $ 15-17 hp_05 $ 19-21
hp_06 $ 23-25 hp_07 $ 27-29 hp_08 $ 31-33 hp_09 $ 35-37 hp_10 $ 39-41
hp_11 $ 43-45 hp_12 $ 47-49;
datalines;
1 MCO MCO MCO PPO PPO PPO PPO PPO PPO PPO PPO PPO
2 PDP PDP PDP PDP PDP PDP PDP PDP PDP PDP MCO MCO
3 EDP EDP EDP EDP EDP EDP PPO PPO PPO PPO PPO PPO
4 MCO MCO MCO MCO MCO MCO MCO MCO MCO EDP EDP EDP
5 EDP EDP EDP EDP MCO MCO PPO PPO PPO PPO EDP EDP
6 MCO NET NET NET EDP EDP NET NET MCO MCO MCO MCO
7 PPO PPO PPO NET NET NET NET NET PPO PPO MCO MCO
8 EDP MCO MCO MCO PPO PPO PPO PPO PPO NET NET NET
9 NET NET NET NET PPO PPO PPO PPO PPO PPO PPO PPO
;
data long;
set have;
array hp {*} hp_01 - hp_12;
do _i=1 to dim(hp);
month_num=input(scan(vname(hp[_i]),-1,'_'),best32.);
health_plan=hp[_i];
output;
end;
keep id month_num health_plan;
run;
proc sql;
create table counts as
select
id
,health_plan
,max(month_num) as max_month_num
,count(*) as n_occurences
from long
group by id, health_plan
order by id, n_occurences, max_month_num
;
quit;
data want;
set counts;
by id;
if last.id;
run;
proc print data=want;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Below one way that should work.
data have;
input id 1 hp_01 $ 3-5 hp_02 $ 7-9 hp_03 $ 11-13 hp_04 $ 15-17 hp_05 $ 19-21
hp_06 $ 23-25 hp_07 $ 27-29 hp_08 $ 31-33 hp_09 $ 35-37 hp_10 $ 39-41
hp_11 $ 43-45 hp_12 $ 47-49;
datalines;
1 MCO MCO MCO PPO PPO PPO PPO PPO PPO PPO PPO PPO
2 PDP PDP PDP PDP PDP PDP PDP PDP PDP PDP MCO MCO
3 EDP EDP EDP EDP EDP EDP PPO PPO PPO PPO PPO PPO
4 MCO MCO MCO MCO MCO MCO MCO MCO MCO EDP EDP EDP
5 EDP EDP EDP EDP MCO MCO PPO PPO PPO PPO EDP EDP
6 MCO NET NET NET EDP EDP NET NET MCO MCO MCO MCO
7 PPO PPO PPO NET NET NET NET NET PPO PPO MCO MCO
8 EDP MCO MCO MCO PPO PPO PPO PPO PPO NET NET NET
9 NET NET NET NET PPO PPO PPO PPO PPO PPO PPO PPO
;
data long;
set have;
array hp {*} hp_01 - hp_12;
do _i=1 to dim(hp);
month_num=input(scan(vname(hp[_i]),-1,'_'),best32.);
health_plan=hp[_i];
output;
end;
keep id month_num health_plan;
run;
proc sql;
create table counts as
select
id
,health_plan
,max(month_num) as max_month_num
,count(*) as n_occurences
from long
group by id, health_plan
order by id, n_occurences, max_month_num
;
quit;
data want;
set counts;
by id;
if last.id;
run;
proc print data=want;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you @Patrick for the solution it worked. I was trying to transpose the data and doing the same but since the dataset was too big it was not working for me. The array method worked.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I never understand where this pattern of using BEST as if it was the name of a INFORMAT originated from. But I see it more and more often (probably just because it triggers a reaction). The BEST format gets its name from the idea that it will try to find the best way to display a number within the limited number of characters requested. But there is no corresponding "best" way to store a number. There is only one way to store a number in SAS. If you use BEST as the name of an informat then SAS just treats it as an alias for the normal numeric informat. So input(xx,best32.) just means input(xx,32.).
If you feel a need to use a "name" for the informat instead of just using 32. then why not use one of its aliases like F that does not imply that it is doing something that is impossible.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You can use arrays to manually calculate the mode:
1. Calculate the frequency of each variable value;
2. Compare the frequency of each variable value. When the frequency becomes larger, the variable value who have the larger frequency is the mode;
data want;
set have;
array _par_[12]hp_01-hp_12;
array _cnt_[12]_temporary_;
call missing(of _cnt_[*]);
do i=1 to dim(_par_);
do j=1 to dim(_par_);
_cnt_[i]+(_par_[i]=_par_[j]);
end;
if frequency<=_cnt_[i] then do;
frequency=_cnt_[i];
most_frequent_value=_par_[i];
end;
end;
run;
Since you ask for the latest mode, I use "frequency<=_cnt_[i]
" instead of "frequency<_cnt_[i]
".
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you @whymath for the reply. I tried your code and it worked as well.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
data have;
input id 1 hp_01 $ 3-5 hp_02 $ 7-9 hp_03 $ 11-13 hp_04 $ 15-17 hp_05 $ 19-21
hp_06 $ 23-25 hp_07 $ 27-29 hp_08 $ 31-33 hp_09 $ 35-37 hp_10 $ 39-41
hp_11 $ 43-45 hp_12 $ 47-49;
datalines;
1 MCO MCO MCO PPO PPO PPO PPO PPO PPO PPO PPO PPO
2 PDP PDP PDP PDP PDP PDP PDP PDP PDP PDP MCO MCO
3 EDP EDP EDP EDP EDP EDP PPO PPO PPO PPO PPO PPO
4 MCO MCO MCO MCO MCO MCO MCO MCO MCO EDP EDP EDP
5 EDP EDP EDP EDP MCO MCO PPO PPO PPO PPO EDP EDP
6 MCO NET NET NET EDP EDP NET NET MCO MCO MCO MCO
7 PPO PPO PPO NET NET NET NET NET PPO PPO MCO MCO
8 EDP MCO MCO MCO PPO PPO PPO PPO PPO NET NET NET
9 NET NET NET NET PPO PPO PPO PPO PPO PPO PPO PPO
;
proc transpose data=have out=temp;
by id;
var hp_:;
run;
proc freq data=temp noprint;
table id*col1/out=temp2;
run;
proc sort data=temp2;
by id descending count;
run;
data temp3;
set temp2;
by id count notsorted;
if first.id then n=0;
n+first.count;
if n=1;
run;
proc transpose data=temp3 out=temp4(drop=_name_ _label_) prefix=_;
by id;
var count;
id col1;
run;
data want;
merge have temp4;
by id;
array x{*} hp_:;
do i=dim(x) to 1 by -1;
if not missing(input(vvaluex(cats('_',x{i})),best.)) then do;want=x{i}; leave;end;
end;
drop _: i;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you @Ksharp . Proc transpose didn't work for me as the dataset was too big and for some reason SAS was giving error that it ran out of resources but when I used array which was kind of doing the same worked for me.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
OK. If you have a BIG table, try the following Hash Table:
data have;
input id 1 hp_01 $ 3-5 hp_02 $ 7-9 hp_03 $ 11-13 hp_04 $ 15-17 hp_05 $ 19-21
hp_06 $ 23-25 hp_07 $ 27-29 hp_08 $ 31-33 hp_09 $ 35-37 hp_10 $ 39-41
hp_11 $ 43-45 hp_12 $ 47-49;
datalines;
1 MCO MCO MCO PPO PPO PPO PPO PPO PPO PPO PPO PPO
2 PDP PDP PDP PDP PDP PDP PDP PDP PDP PDP MCO MCO
3 EDP EDP EDP EDP EDP EDP PPO PPO PPO PPO PPO PPO
4 MCO MCO MCO MCO MCO MCO MCO MCO MCO EDP EDP EDP
5 EDP EDP EDP EDP MCO MCO PPO PPO PPO PPO EDP EDP
6 MCO NET NET NET EDP EDP NET NET MCO MCO MCO MCO
7 PPO PPO PPO NET NET NET NET NET PPO PPO MCO MCO
8 EDP MCO MCO MCO PPO PPO PPO PPO PPO NET NET NET
9 NET NET NET NET PPO PPO PPO PPO PPO PPO PPO PPO
;
data want;
if _n_=1 then do;
length key $ 80;
declare hash h();
declare hiter hi('h');
h.definekey('key');
h.definedata('key','count','idx');
h.definedone();
end;
set have;
h.clear();
array x{*} hp_01-hp_12;
do i=1 to dim(x);
if not missing(x{i}) then do;
key=x{i};
if h.find()=0 then count=count+1;
else count=1;
idx=i;
h.replace();
end;
end;
do while(hi.next()=0);
if want_count<count then do;want_key=key;want_count=count;want_idx=idx;end;
else if want_count=count and want_idx<idx then do;want_key=key;want_idx=idx;end;
end;
drop i key count idx;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You effectively want the most frequent word in a set of words, all of which are in the same observation. This can also be done using a single character variable containing that list of words:
data want (drop=_: i);
set have;
length _word_list $100 _word $3;
_word_list=catx(' ',of hp_:);
do while (_word_list^=' ');
_word=scan(_word_list,-1);
_freq=0;
do i=1 to countw(_word_list);
if scan(_word_list,i)=_word then _freq=_freq+1;
end;
if _freq > mode_freq then do;
mode_freq=_freq;
mode=_word;
end;
_word_list=tranwrd(_word_list,trim(_word),' ');
end;
run;
Note that the assignment of a value to _WORD using
_word=scan(_word_list,-1);
continually takes the rightmost available value (i.e. the latest) instead of the leftmost. Then, if there is a tie in the frequency of two words, the subsequent statement
if _freq > mode_freq then do;
guarantees that the most recent (rightmost) word with the tied frequency is assigned, per the request.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------