Solved: Re: calculating max by each row of group, which changes over time.

km0927 · Posted 10-14-2018 12:29 PM

hello. i want to calculate one patient's 'maximum visit' of clinic, which changes over time.

here is the raw data. it's medical billing data. 'id 1' visited 3 different clinic and 'id 2' visit 2 different clinic. in '20030124'(third row) 'id 1''s most visiting clinic is '12135' and maximum value is '2'. however in '20030607'(6th row), it is '75423' and maximum value is '3'. and in '20030815'(7th row) most visitng clinic is '12135' and '75423'. maximum value is stlll '3'.

'id 2''s most visiting clinic is '25875', and maximum value changes over time. it is 1, 2, 2, 3, 4, 5 in order.

id date clinic
1 20020408 12135
1 20020715 12135
1 20030124 15109
1 20030308 75423
1 20030505 75423
1 20030607 75423
1 20030815 12135
2 20020615 25875
2 20020717 25875
2 20020824 55415
2 20020907 25875
2 20021215 25875
2 20030108 25875

and this is what i want.

id date clinic count max
1 20020408 12135 1 1
1 20020715 12135 2 2
1 20030124 15109 1 2
1 20030308 75423 1 2
1 20030505 75423 2 2
1 20030607 75423 3 3
1 20030815 12135 3 3
2 20020615 25875 1 1
2 20020717 25875 2 2
2 20020824 55415 1 2
2 20020907 25875 3 3
2 20021215 25875 4 4
2 20030108 25875 5 5

(bold number is maximum value)

*in addition, because it is national billing data, type of clinic is more than 10 thousands. so it is impossible to make virtual array by clinic. SAS can't comprehend that much column.

i really appreciate your help.

mkeintz · Posted 10-14-2018 01:31 PM

Editted note: I realize I overlooked you request for variable COUNT. Just change all instances of "_FREQ" to "COUNT", in both programs.

You want a running cumulative single-clinic maximum count. Here's my first thought:

data have;
  input id date  yymmdd8. clinic;
  format date date9.;
datalines;
1 20020408 12135
1 20020715 12135
1 20030124 15109
1 20030308 75423
1 20030505 75423
1 20030607 75423
1 20030815 12135
2 20020615 25875
2 20020717 25875
2 20020824 55415
2 20020907 25875
2 20021215 25875
2 20030108 25875
run;

data want (drop=_:);
  set have;
  by id;
  retain max;
  if _n_=1 then do;
    declare hash h ();
      h.definekey('clinic');
      h.definedata('_freq');
      h.definedone();
  end;

  if first.id then do;
    max=0;
    _rc=h.clear();
  end;
  if h.find() ^=0 then _freq=1;
  else _freq=_freq+1;
  max=max(max,_freq);
  _rc=h.replace();
run;

The above code clears out the hash object at the start of each id, so that a visit to a given clinic by one ID is not contaminated by clinic counts built up by a previous ID.

But the h.clear() method may create more overhead than necessary. In that case change the definekey method for object H to be keyed on two variables (ID and CLINIC), rather than just on CLINIC. Then there's no possibility of confounding counts between different patients:

data want (drop=_:);
  set have;
  by id;
  retain max;
  if _n_=1 then do;
    declare hash h ();
      h.definekey('id','clinic');
      h.definedata('_freq');
      h.definedone();
  end;

  if first.id then max=0;
  if h.find() ^=0 then _freq=1;
  else _freq=_freq+1;
  max=max(max,_freq);
  _rc=h.replace();
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 10-14-2018 01:31 PM

Editted note: I realize I overlooked you request for variable COUNT. Just change all instances of "_FREQ" to "COUNT", in both programs.

You want a running cumulative single-clinic maximum count. Here's my first thought:

data have;
  input id date  yymmdd8. clinic;
  format date date9.;
datalines;
1 20020408 12135
1 20020715 12135
1 20030124 15109
1 20030308 75423
1 20030505 75423
1 20030607 75423
1 20030815 12135
2 20020615 25875
2 20020717 25875
2 20020824 55415
2 20020907 25875
2 20021215 25875
2 20030108 25875
run;

data want (drop=_:);
  set have;
  by id;
  retain max;
  if _n_=1 then do;
    declare hash h ();
      h.definekey('clinic');
      h.definedata('_freq');
      h.definedone();
  end;

  if first.id then do;
    max=0;
    _rc=h.clear();
  end;
  if h.find() ^=0 then _freq=1;
  else _freq=_freq+1;
  max=max(max,_freq);
  _rc=h.replace();
run;

The above code clears out the hash object at the start of each id, so that a visit to a given clinic by one ID is not contaminated by clinic counts built up by a previous ID.

But the h.clear() method may create more overhead than necessary. In that case change the definekey method for object H to be keyed on two variables (ID and CLINIC), rather than just on CLINIC. Then there's no possibility of confounding counts between different patients:

data want (drop=_:);
  set have;
  by id;
  retain max;
  if _n_=1 then do;
    declare hash h ();
      h.definekey('id','clinic');
      h.definedata('_freq');
      h.definedone();
  end;

  if first.id then max=0;
  if h.find() ^=0 then _freq=1;
  else _freq=_freq+1;
  max=max(max,_freq);
  _rc=h.replace();
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Reeza · Posted 10-14-2018 01:31 PM

Why is there no 2 in the last count series?

I think this gives you the count column

https://stats.idre.ucla.edu/sas/faq/how-can-i-create-an-enumeration-variable-by-groups/

And it should provide enough of the logic to implement your rolling max as well, the term for the last variable is a rolling or moving maximum.

A basic data step is all that’s required here, no arrays or macros needed.

@km0927 wrote:

hello. i want to calculate one patient's 'maximum visit' of clinic, which changes over time.

here is the raw data. it's medical billing data. 'id 1' visited 3 different clinic and 'id 2' visit 2 different clinic. in '20030124'(third row) 'id 1''s most visiting clinic is '12135' and maximum value is '2'. however in '20030607'(6th row), it is '75423' and maximum value is '3'. and in '20030815'(7th row) most visitng clinic is '12135' and '75423'. maximum value is stlll '3'.

'id 2''s most visiting clinic is '25875', and maximum value changes over time. it is 1, 2, 2, 3, 4, 5 in order.

id date clinic
1 20020408 12135
1 20020715 12135
1 20030124 15109
1 20030308 75423
1 20030505 75423
1 20030607 75423
1 20030815 12135
2 20020615 25875
2 20020717 25875
2 20020824 55415
2 20020907 25875
2 20021215 25875
2 20030108 25875

and this is what i want.

id date clinic count max
1 20020408 12135 1 1
1 20020715 12135 2 2
1 20030124 15109 1 2
1 20030308 75423 1 2
1 20030505 75423 2 2
1 20030607 75423 3 3
1 20030815 12135 3 3
2 20020615 25875 1 1
2 20020717 25875 2 2
2 20020824 55415 1 2
2 20020907 25875 3 3
2 20021215 25875 4 4
2 20030108 25875 5 5

(bold number is maximum value)

*in addition, because it is national billing data, type of clinic is more than 10 thousands. so it is impossible to make virtual array by clinic. SAS can't comprehend that much column.

i really appreciate your help.

novinosrin · Posted 10-14-2018 06:29 PM

For the concern- "*in addition, because it is national billing data, type of clinic is more than 10 thousands. so it is impossible to make virtual array by clinic. SAS can't comprehend that much column."

Please refer to the below post

https://communities.sas.com/t5/SAS-Programming/what-s-the-limit-to-how-many-elements-variables-a-SAS...

Nevertheless, Since your data is not sorted by clinic , Hash is of course the easiest and simplest to park the key(clinic) count somewhere. I agree that array is little tedious.

Just out of curiosity, Is your dataset massive? would sort be expensive?

mkeintz · Posted 10-14-2018 09:50 PM

@novinosrin

In addition, if the data were not sorted by ID (say the data is sorted by date), then one could just maintain a second hash object, keyed on ID, recording the max for each ID:

data want (drop=_:);
  set have;
  if _n_=1 then do; /*count for each id/clinic*/
    declare hash h ();
      h.definekey('id','clinic');
      h.definedata('count');
      h.definedone();

    declare hash m();  /*track max for each id*/
      m.definekey('id');
      m.definedata('max');
      m.definedone();
  end;

  if h.find()^=0 then count=1;
  else count=count+1;
  _rc=h.replace();

  _rc=m.find();
  if count>max then do;
    max=count;
    _rc=m.replace();
  end;
run;

Note that MAX is missing for every iteration of the data step, until the M.FIND() method is executed. It stays missing if the current id is not yet in hash object m. So the "if count>max" test will always be true for a new id.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

novinosrin · Posted 10-14-2018 10:31 PM

Thank you @mkeintz Sir, You have got it squeaky clean and neat. I really like the max hash, serves a greater purpose of audit.

Hi @km0927, Good evening. I hope you understood the distinction between Mark's responses as I did.

km0927 · Posted 10-20-2018 01:48 PM

thanks. i have done 'array' with the help of others.

but "It seems that there is too much to handle." a similar sentence came out on log screen. i don't know why it appeared.

dataset comprised of 1 millon row, about 50 thousands id and the number of clinics seems to be that much.

mkeintz · Posted 10-20-2018 02:21 PM

Are you saying you have an array program developed, but that the program is generating an error message? If so, please show us the actual log, not a paraphrase of the log note. Also show the part of the log that includes not only the error message, but the program statements that generated the message.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Classroom Training Available!