hello. i want to calculate one patient's 'maximum visit' of clinic, which changes over time.
here is the raw data. it's medical billing data. 'id 1' visited 3 different clinic and 'id 2' visit 2 different clinic. in '20030124'(third row) 'id 1''s most visiting clinic is '12135' and maximum value is '2'. however in '20030607'(6th row), it is '75423' and maximum value is '3'. and in '20030815'(7th row) most visitng clinic is '12135' and '75423'. maximum value is stlll '3'.
'id 2''s most visiting clinic is '25875', and maximum value changes over time. it is 1, 2, 2, 3, 4, 5 in order.
id date clinic
1 20020408 12135
1 20020715 12135
1 20030124 15109
1 20030308 75423
1 20030505 75423
1 20030607 75423
1 20030815 12135
2 20020615 25875
2 20020717 25875
2 20020824 55415
2 20020907 25875
2 20021215 25875
2 20030108 25875
and this is what i want.
id date clinic count max
1 20020408 12135 1 1
1 20020715 12135 2 2
1 20030124 15109 1 2
1 20030308 75423 1 2
1 20030505 75423 2 2
1 20030607 75423 3 3
1 20030815 12135 3 3
2 20020615 25875 1 1
2 20020717 25875 2 2
2 20020824 55415 1 2
2 20020907 25875 3 3
2 20021215 25875 4 4
2 20030108 25875 5 5
(bold number is maximum value)
*in addition, because it is national billing data, type of clinic is more than 10 thousands. so it is impossible to make virtual array by clinic. SAS can't comprehend that much column.
i really appreciate your help.
Editted note: I realize I overlooked you request for variable COUNT. Just change all instances of "_FREQ" to "COUNT", in both programs.
You want a running cumulative single-clinic maximum count. Here's my first thought:
data have;
input id date yymmdd8. clinic;
format date date9.;
datalines;
1 20020408 12135
1 20020715 12135
1 20030124 15109
1 20030308 75423
1 20030505 75423
1 20030607 75423
1 20030815 12135
2 20020615 25875
2 20020717 25875
2 20020824 55415
2 20020907 25875
2 20021215 25875
2 20030108 25875
run;
data want (drop=_:);
set have;
by id;
retain max;
if _n_=1 then do;
declare hash h ();
h.definekey('clinic');
h.definedata('_freq');
h.definedone();
end;
if first.id then do;
max=0;
_rc=h.clear();
end;
if h.find() ^=0 then _freq=1;
else _freq=_freq+1;
max=max(max,_freq);
_rc=h.replace();
run;
The above code clears out the hash object at the start of each id, so that a visit to a given clinic by one ID is not contaminated by clinic counts built up by a previous ID.
But the h.clear() method may create more overhead than necessary. In that case change the definekey method for object H to be keyed on two variables (ID and CLINIC), rather than just on CLINIC. Then there's no possibility of confounding counts between different patients:
data want (drop=_:);
set have;
by id;
retain max;
if _n_=1 then do;
declare hash h ();
h.definekey('id','clinic');
h.definedata('_freq');
h.definedone();
end;
if first.id then max=0;
if h.find() ^=0 then _freq=1;
else _freq=_freq+1;
max=max(max,_freq);
_rc=h.replace();
run;
Editted note: I realize I overlooked you request for variable COUNT. Just change all instances of "_FREQ" to "COUNT", in both programs.
You want a running cumulative single-clinic maximum count. Here's my first thought:
data have;
input id date yymmdd8. clinic;
format date date9.;
datalines;
1 20020408 12135
1 20020715 12135
1 20030124 15109
1 20030308 75423
1 20030505 75423
1 20030607 75423
1 20030815 12135
2 20020615 25875
2 20020717 25875
2 20020824 55415
2 20020907 25875
2 20021215 25875
2 20030108 25875
run;
data want (drop=_:);
set have;
by id;
retain max;
if _n_=1 then do;
declare hash h ();
h.definekey('clinic');
h.definedata('_freq');
h.definedone();
end;
if first.id then do;
max=0;
_rc=h.clear();
end;
if h.find() ^=0 then _freq=1;
else _freq=_freq+1;
max=max(max,_freq);
_rc=h.replace();
run;
The above code clears out the hash object at the start of each id, so that a visit to a given clinic by one ID is not contaminated by clinic counts built up by a previous ID.
But the h.clear() method may create more overhead than necessary. In that case change the definekey method for object H to be keyed on two variables (ID and CLINIC), rather than just on CLINIC. Then there's no possibility of confounding counts between different patients:
data want (drop=_:);
set have;
by id;
retain max;
if _n_=1 then do;
declare hash h ();
h.definekey('id','clinic');
h.definedata('_freq');
h.definedone();
end;
if first.id then max=0;
if h.find() ^=0 then _freq=1;
else _freq=_freq+1;
max=max(max,_freq);
_rc=h.replace();
run;
Why is there no 2 in the last count series?
I think this gives you the count column
https://stats.idre.ucla.edu/sas/faq/how-can-i-create-an-enumeration-variable-by-groups/
And it should provide enough of the logic to implement your rolling max as well, the term for the last variable is a rolling or moving maximum.
A basic data step is all that’s required here, no arrays or macros needed.
@km0927 wrote:
hello. i want to calculate one patient's 'maximum visit' of clinic, which changes over time.
here is the raw data. it's medical billing data. 'id 1' visited 3 different clinic and 'id 2' visit 2 different clinic. in '20030124'(third row) 'id 1''s most visiting clinic is '12135' and maximum value is '2'. however in '20030607'(6th row), it is '75423' and maximum value is '3'. and in '20030815'(7th row) most visitng clinic is '12135' and '75423'. maximum value is stlll '3'.
'id 2''s most visiting clinic is '25875', and maximum value changes over time. it is 1, 2, 2, 3, 4, 5 in order.
id date clinic
1 20020408 12135
1 20020715 12135
1 20030124 15109
1 20030308 75423
1 20030505 75423
1 20030607 75423
1 20030815 12135
2 20020615 25875
2 20020717 25875
2 20020824 55415
2 20020907 25875
2 20021215 25875
2 20030108 25875
and this is what i want.
id date clinic count max
1 20020408 12135 1 1
1 20020715 12135 2 2
1 20030124 15109 1 2
1 20030308 75423 1 2
1 20030505 75423 2 2
1 20030607 75423 3 3
1 20030815 12135 3 3
2 20020615 25875 1 1
2 20020717 25875 2 2
2 20020824 55415 1 2
2 20020907 25875 3 3
2 20021215 25875 4 4
2 20030108 25875 5 5
(bold number is maximum value)
*in addition, because it is national billing data, type of clinic is more than 10 thousands. so it is impossible to make virtual array by clinic. SAS can't comprehend that much column.
i really appreciate your help.
For the concern- "*in addition, because it is national billing data, type of clinic is more than 10 thousands. so it is impossible to make virtual array by clinic. SAS can't comprehend that much column."
Please refer to the below post
Nevertheless, Since your data is not sorted by clinic , Hash is of course the easiest and simplest to park the key(clinic) count somewhere. I agree that array is little tedious.
Just out of curiosity, Is your dataset massive? would sort be expensive?
In addition, if the data were not sorted by ID (say the data is sorted by date), then one could just maintain a second hash object, keyed on ID, recording the max for each ID:
data want (drop=_:);
set have;
if _n_=1 then do; /*count for each id/clinic*/
declare hash h ();
h.definekey('id','clinic');
h.definedata('count');
h.definedone();
declare hash m(); /*track max for each id*/
m.definekey('id');
m.definedata('max');
m.definedone();
end;
if h.find()^=0 then count=1;
else count=count+1;
_rc=h.replace();
_rc=m.find();
if count>max then do;
max=count;
_rc=m.replace();
end;
run;
Note that MAX is missing for every iteration of the data step, until the M.FIND() method is executed. It stays missing if the current id is not yet in hash object m. So the "if count>max" test will always be true for a new id.
thanks. i have done 'array' with the help of others.
but "It seems that there is too much to handle." a similar sentence came out on log screen. i don't know why it appeared.
dataset comprised of 1 millon row, about 50 thousands id and the number of clinics seems to be that much.
Are you saying you have an array program developed, but that the program is generating an error message? If so, please show us the actual log, not a paraphrase of the log note. Also show the part of the log that includes not only the error message, but the program statements that generated the message.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.