BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
km0927
Obsidian | Level 7

hello. i want to calculate one patient's 'maximum visit' of clinic, which changes over time.

 

here is the raw data. it's medical billing data. 'id 1' visited 3 different clinic and 'id 2' visit 2 different clinic. in '20030124'(third row) 'id 1''s most visiting clinic is '12135' and maximum value is '2'. however in '20030607'(6th row), it is '75423' and maximum value is '3'. and in '20030815'(7th row) most visitng clinic is '12135' and '75423'. maximum value is stlll '3'.

 

'id 2''s most visiting clinic is '25875', and maximum value changes over time. it is 1, 2, 2, 3, 4, 5 in order.

 

id date clinic
1 20020408 12135
1 20020715 12135
1 20030124 15109
1 20030308 75423
1 20030505 75423
1 20030607 75423
1 20030815 12135
2 20020615 25875
2 20020717 25875
2 20020824 55415
2 20020907 25875
2 20021215 25875
2 20030108 25875

 

and this is what i want.

 

id date clinic count max
1 20020408 12135 1 1
1 20020715 12135 2 2
1 20030124 15109 1 2
1 20030308 75423 1 2
1 20030505 75423 2 2
1 20030607 75423 3 3
1 20030815 12135 3 3
2 20020615 25875 1 1
2 20020717 25875 2 2
2 20020824 55415 1 2
2 20020907 25875 3 3
2 20021215 25875 4 4
2 20030108 25875 5 5

 

(bold number is maximum value)

 

*in addition, because it is national billing data, type of clinic is more than 10 thousands. so it is impossible to make virtual array by clinic. SAS can't comprehend that much column.

 

i really appreciate your help.

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

Editted note:  I realize I overlooked you request for variable COUNT.   Just change all instances of "_FREQ" to "COUNT", in both programs.

 

 

You want a running cumulative single-clinic maximum count.   Here's my first thought:

 

data have;
  input id date  yymmdd8. clinic;
  format date date9.;
datalines;
1 20020408 12135
1 20020715 12135
1 20030124 15109
1 20030308 75423
1 20030505 75423
1 20030607 75423
1 20030815 12135
2 20020615 25875
2 20020717 25875
2 20020824 55415
2 20020907 25875
2 20021215 25875
2 20030108 25875
run;

data want (drop=_:);
  set have;
  by id;
  retain max;
  if _n_=1 then do;
    declare hash h ();
      h.definekey('clinic');
      h.definedata('_freq');
      h.definedone();
  end;

  if first.id then do;
    max=0;
    _rc=h.clear();
  end;
  if h.find() ^=0 then _freq=1;
  else _freq=_freq+1;
  max=max(max,_freq);
  _rc=h.replace();
run;

 

The above code clears out the hash object at the start of each id, so that a visit to a given clinic by one ID is not contaminated by clinic counts built up by a previous ID. 

 

But the h.clear() method may create more overhead than necessary.  In that case change the definekey method for object H to be keyed on two variables (ID and CLINIC), rather than just on CLINIC.    Then there's no possibility of confounding counts between different patients:

 

data want (drop=_:);
  set have;
  by id;
  retain max;
  if _n_=1 then do;
    declare hash h ();
      h.definekey('id','clinic');
      h.definedata('_freq');
      h.definedone();
  end;

  if first.id then max=0;
  if h.find() ^=0 then _freq=1;
  else _freq=_freq+1;
  max=max(max,_freq);
  _rc=h.replace();
run;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

7 REPLIES 7
mkeintz
PROC Star

Editted note:  I realize I overlooked you request for variable COUNT.   Just change all instances of "_FREQ" to "COUNT", in both programs.

 

 

You want a running cumulative single-clinic maximum count.   Here's my first thought:

 

data have;
  input id date  yymmdd8. clinic;
  format date date9.;
datalines;
1 20020408 12135
1 20020715 12135
1 20030124 15109
1 20030308 75423
1 20030505 75423
1 20030607 75423
1 20030815 12135
2 20020615 25875
2 20020717 25875
2 20020824 55415
2 20020907 25875
2 20021215 25875
2 20030108 25875
run;

data want (drop=_:);
  set have;
  by id;
  retain max;
  if _n_=1 then do;
    declare hash h ();
      h.definekey('clinic');
      h.definedata('_freq');
      h.definedone();
  end;

  if first.id then do;
    max=0;
    _rc=h.clear();
  end;
  if h.find() ^=0 then _freq=1;
  else _freq=_freq+1;
  max=max(max,_freq);
  _rc=h.replace();
run;

 

The above code clears out the hash object at the start of each id, so that a visit to a given clinic by one ID is not contaminated by clinic counts built up by a previous ID. 

 

But the h.clear() method may create more overhead than necessary.  In that case change the definekey method for object H to be keyed on two variables (ID and CLINIC), rather than just on CLINIC.    Then there's no possibility of confounding counts between different patients:

 

data want (drop=_:);
  set have;
  by id;
  retain max;
  if _n_=1 then do;
    declare hash h ();
      h.definekey('id','clinic');
      h.definedata('_freq');
      h.definedone();
  end;

  if first.id then max=0;
  if h.find() ^=0 then _freq=1;
  else _freq=_freq+1;
  max=max(max,_freq);
  _rc=h.replace();
run;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Reeza
Super User

Why is there no 2 in the last count series?

I think this gives you the count column

https://stats.idre.ucla.edu/sas/faq/how-can-i-create-an-enumeration-variable-by-groups/

 

And it should provide enough of the logic to implement your rolling max as well, the term for the last variable is a rolling or moving maximum. 

 

A basic data step is all that’s required here, no arrays or macros needed. 

 


@km0927 wrote:

hello. i want to calculate one patient's 'maximum visit' of clinic, which changes over time.

 

here is the raw data. it's medical billing data. 'id 1' visited 3 different clinic and 'id 2' visit 2 different clinic. in '20030124'(third row) 'id 1''s most visiting clinic is '12135' and maximum value is '2'. however in '20030607'(6th row), it is '75423' and maximum value is '3'. and in '20030815'(7th row) most visitng clinic is '12135' and '75423'. maximum value is stlll '3'.

 

'id 2''s most visiting clinic is '25875', and maximum value changes over time. it is 1, 2, 2, 3, 4, 5 in order.

 

id date clinic
1 20020408 12135
1 20020715 12135
1 20030124 15109
1 20030308 75423
1 20030505 75423
1 20030607 75423
1 20030815 12135
2 20020615 25875
2 20020717 25875
2 20020824 55415
2 20020907 25875
2 20021215 25875
2 20030108 25875

 

and this is what i want.

 

id date clinic count max
1 20020408 12135 1 1
1 20020715 12135 2 2
1 20030124 15109 1 2
1 20030308 75423 1 2
1 20030505 75423 2 2
1 20030607 75423 3 3
1 20030815 12135 3 3
2 20020615 25875 1 1
2 20020717 25875 2 2
2 20020824 55415 1 2
2 20020907 25875 3 3
2 20021215 25875 4 4
2 20030108 25875 5 5

 

(bold number is maximum value)

 

*in addition, because it is national billing data, type of clinic is more than 10 thousands. so it is impossible to make virtual array by clinic. SAS can't comprehend that much column.

 

i really appreciate your help.


 

novinosrin
Tourmaline | Level 20

For the concern- "*in addition, because it is national billing data, type of clinic is more than 10 thousands. so it is impossible to make virtual array by clinic. SAS can't comprehend that much column."

 

Please refer to the below post

 

https://communities.sas.com/t5/SAS-Programming/what-s-the-limit-to-how-many-elements-variables-a-SAS...

 

 

Nevertheless, Since your data is not sorted by clinic , Hash is of course the easiest and simplest to park the key(clinic) count somewhere. I agree that array is little tedious. 

 

Just out of curiosity, Is your dataset massive? would sort be expensive?

 

 

mkeintz
PROC Star

@novinosrin

 

In addition, if the data were not sorted by ID (say the data is sorted by date), then one could just maintain a second hash object, keyed on ID, recording the max for each ID:

 

data want (drop=_:);
  set have;
  if _n_=1 then do; /*count for each id/clinic*/
    declare hash h ();
      h.definekey('id','clinic');
      h.definedata('count');
      h.definedone();

    declare hash m();  /*track max for each id*/
      m.definekey('id');
      m.definedata('max');
      m.definedone();
  end;

  if h.find()^=0 then count=1;
  else count=count+1;
  _rc=h.replace();

  _rc=m.find();
  if count>max then do;
    max=count;
    _rc=m.replace();
  end;
run;

 

Note that MAX is missing for every iteration of the data step, until the M.FIND() method is executed.  It stays missing if the current id is not yet in hash object m.  So the "if count>max" test will always be true for a new id.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
novinosrin
Tourmaline | Level 20

Thank you  @mkeintz Sir, You have got it squeaky clean and neat. I really like the max hash, serves a greater purpose of audit. 

 

Hi @km0927, Good evening. I hope you understood the distinction between Mark's responses as I did. 

km0927
Obsidian | Level 7

thanks. i have done 'array' with the help of others.

 

but "It seems that there is too much to handle." a similar sentence came out on log screen. i don't know why it appeared.

 

dataset comprised of 1 millon row, about 50 thousands id and the number of clinics seems to be that much.

mkeintz
PROC Star

Are you saying you have an array program developed, but that the program is generating an error message?  If so, please show us the actual log, not a paraphrase of the log note.  Also show the part of the log that includes not only the error message, but the program statements that generated the message.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 946 views
  • 3 likes
  • 4 in conversation