BookmarkSubscribeRSS Feed
Henrik_P
Calcite | Level 5

Hi all,

This is my first post (I think?) asking for help in the forum. 

 

I have a situation where I need to calculate a monthly year-to-date (YTD) distinct count of the number of customers benefiting from a particular insurance. As an example, for January 2023, the number should simply be the number of distinct customers that benefited during that month. For February 2023, it should be the number of distinct customers that benefited from the isurance at least once durin the period 2023-01-01 to 2023-02-28 and so on, all the way to December. This needs to be done for three years and is a repeated process.

Currently, the organization I am working for uses a PROC SQL Macro loop to achieve this and to calculate rolling 12-month values of the same statistic. However, this process is very slow, sometimes taking upwards of 2 hours with the data we have. When not working with values that need to be distinct, I have gradually transitioned to using a multilevel format and PROC SUMMARY to achieve this instead. (Up until recently, PROC EXPAND has not been an option.) However, when distinct counts are required, this method does not work.

In order to try and solve this problem, I decided to explore PROC CAS and the aggregation.aggregate action, as it is capable of doing distinct calculations. Luckily, I think I have figured out how to do the rolling twelve-month calculations that way, but I do not yet grasp the procedure well enough to do YTD. Does anyone have any ideas on how PROC CAS could be leveraged to achieve this?



I will provde the code i used to calculate R12 as soon as I can access it.

 

SAMPLE DATA (abridged):
Variable names and datat structure are acureate, Content is nonsense.
A lagers saple dataset is attaced as a csv.

DATA WORK.TESTDATA_CAS;
    LENGTH
        ID               $ 3
        DELFORMAN        $ 12
        AVDELNING        $ 6
        KON                8
        DATE               8 ;
    LABEL
        AVDELNING        = "AVDELNING"
        DATE             = "DATUM" ;
    FORMAT
        ID               $CHAR3.
        DELFORMAN        $CHAR12.
        AVDELNING        $CHAR6.
        KON              BEST12.
        DATE             DATE9. ;
    INFORMAT
        ID               $CHAR3.
        DELFORMAN        $CHAR12.
        AVDELNING        $CHAR6.
        KON              BEST12.
        DATE             DATE9. ;
    INFILE DATALINES4
        DLM='7F'x
        MISSOVER
        DSD ;
    INPUT
        ID               : $CHAR3.
        DELFORMAN        : $CHAR12.
        AVDELNING        : $CHAR6.
        KON              : BEST32.
        DATE             : BEST32. ;
DATALINES4;
F68MintSweet222763
F51Cookie DoughSour222840
F97StrawberryBitter221974
F23Cookie DoughBitter123335
F73Cookie DoughSour222171
F73ChocolateSweet122491
F0StrawberrySalty122153
F47StrawberrySweet223739
F38StrawberrySweet121952
F83Cookie DoughSalty221939
F31MintSour122324
F90StrawberryBitter222375
F38PistachioSour223509
F30Cookie DoughSour122902
F72ChocolateSweet223355
F80MintBitter222271
F65PistachioBitter222297
F61PistachioSour223096
F83ChocolateSalty122483
F60MintSour123371
F66Cookie DoughBitter122770
F90PistachioBitter222793
F82Cookie DoughSour123460
F19PistachioBitter222837
F20ChocolateSalty121926
F27ChocolateBitter223201
F3VanillaSalty223595
F57Cookie DoughSalty222363
F7StrawberrySour223130
F87Cookie DoughSweet222676
F78StrawberrySweet123052
F79MintSour121940
F45Cookie DoughBitter223721
F10MintSour123437
F48ChocolateSalty122422
F98ChocolateSweet123483
F72VanillaSour121976
F49Cookie DoughBitter122143
F99MintSour122521
F26StrawberrySweet122973
F83ChocolateSour223643
F44ChocolateSour123394
F96StrawberrySweet123526
F96MintSalty122805
F56ChocolateSour123204
F35ChocolateBitter222243
F36StrawberrySalty223101
F64VanillaBitter122352
F7MintSweet223281
F53VanillaSweet123305
F97VanillaSour223403
F72VanillaBitter123688
F30ChocolateSalty123137
F8Cookie DoughSour123438
F56ChocolateBitter121930
F49ChocolateSour223234
F53PistachioSalty122782
F67PistachioSour222425
F53VanillaSour123726
F26ChocolateSour222898
F45VanillaBitter123177
F45PistachioSalty222495
F38Cookie DoughSweet223239
F30MintBitter121920
F33VanillaSweet123158
F23Cookie DoughSweet122933
F7MintBitter223704
F8StrawberryBitter223391
F46VanillaSalty223371
F0VanillaSalty222721
F62StrawberrySour123324
F64StrawberrySweet122533
F6VanillaSalty123233
F8StrawberrySour123056
F75ChocolateSalty222319
F41MintBitter223618
F16MintSalty122911
F68MintSweet223701
F37ChocolateSalty122508
F56StrawberrySweet222480
F59Cookie DoughBitter222666
F91StrawberrySweet221926
F33VanillaSalty223147
F78PistachioSour222079
F25MintBitter222792
F53StrawberrySalty123232
F97Cookie DoughBitter223457
F13VanillaBitter121992
F15ChocolateSalty123372
F27ChocolateSalty123515
F5Cookie DoughSour122528
F87VanillaBitter122783
F94MintBitter122726
F63StrawberrySalty223737
F82VanillaSweet223621
F34ChocolateSalty223015
F32Cookie DoughBitter122986
F87Cookie DoughSalty122681
F62StrawberrySour121996
F3Cookie DoughSweet223659
F89Cookie DoughSalty222123
F93VanillaSour223383
F25MintBitter123534
F25StrawberrySalty122436
F76VanillaBitter122967
F25MintSour223242
F0VanillaBitter122908
F38PistachioSour122379
F37VanillaSour123440
;;;;
RUN;


Thank you in advance.


  
 

     

3 REPLIES 3
ballardw
Super User

You problem description includes "benefiting from a particular insurance."  Which variable in the example data holds the values of insurance? Which particular value are we looking for in the example data? Or is this actually a "for each type of insurance"?

 

Are you keeping the results from a previous month and adding to that or redoing the whole process from scratch?

mkeintz
PROC Star

At one point you state the objective is to "calculate rolling 12-month values of the same statistic.", but you started as stating you want YTD cumulative unique ID counts for each calendar year, i.e. not a rolling 12-month value.

 

But either way, why bother with proc sql in a loop?  In the YTD case, you can use PROC SQL a single time to create a view with the earliest month in each calendar year for each ID, for every calendar year.  Then a PROC FREQ followed by BY YEAR will generate monthly cumulative counts of ID's for each year.

 

proc sql;
  create view vneed as
  select year(date) as year, id, min(month(date)) as earliest_month
  from testdata_cas 
  group by calculated year,id;
quit;
proc freq data=vneed;
  by year;
  table earliest_month;
run;

Of course you might need a where clause in the proc sql to select a "particular insurance".  

 

Edit note:  If the cumulative freq column of proc freq doesn't satisfy, skip the proc freq and run a DATA step to produce cumulative counts for each year.

 

data counts (keep=year month newid_count cumulative_count);
  set vneed;
  by year;
  array counts {12} _temporary_;
  if first.year then call missing(of counts{*});
  counts{earliest_month}+1;
  if last.year;
  length month newid_count 8;
  call missing(cumulative_count);
  do month=1 to 12;
    newid_count=coalesce(counts{month},0);
    cumulative_count+newid_count;
    output;
  end;
run;

 

@PaigeMiller's note on using the MLF feature in the proc summary class statement suggests this improvement to my suggestion:

 

proc format;
  value fmnth  (multilabel) 
    1   ='Jan'       1-2 ='Jan-Feb'   1-3 ='Jan-Mar'   
    1-4 ='Jan-Apr'   1-5 ='Jan-May'   1-6 ='Jan-Jun'
    1-7 ='Jan-Jul'   1-8 ='Jan-Aug'   1-9 ='Jan-Sep'
    1-10='Jan-Oct'   1-11='Jan-Nov'   1-12='Jan-Dec'
    ;
run;

proc summary data=vneed nway;
  class year;
  class earliest_month/mlf;
  output out=want (drop=_type_ rename=(_freq_=cum_freq));
  format earliest_month fmnth.;
run;
proc sort;
  by year cum_freq;
run;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
PaigeMiller
Diamond | Level 26

If YTD is what is being asked for, no looping needed, no macro needed.

 

See https://communities.sas.com/t5/SAS-Enterprise-Guide/Last-month-and-YTD-data/m-p/829569#M41275

 

In that thread, the input data was three letter month names (Jan, Feb, etc.) so that's what the code uses. Its even simpler if you have actual SAS date or date/time values.

--
Paige Miller

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 3 replies
  • 522 views
  • 0 likes
  • 4 in conversation