dandsouza
Explorer | Level 4
Hello All

We have been provided a batch of 10k files along with their execution dates; the files were executed at varying frequencies. We are now tasked with breaking this batch down by execution frequency, such as Daily, Weekly, Monthly, Quarterly. Just checking if anyone has done something similar, where they analyzed the date column and produced this kind of breakdown.

Regards
Danny
7 REPLIES
Patrick
Opal | Level 21

You will need to be much more descriptive of what you have and what you need. 

 

"We are been provided a batch of 10k files and its execution dates that were executed at varied frequency..."
What are these 10K files? Log files with a datetime portion in the name? Or .sas files, plus a table with the names of the .sas files and the dates when they were executed? Or something else?

 

"we are now task to break down this batch into various execution frequency"

With 10K files this doesn't sound like a single flow but more like scheduling information for a whole environment with many flows.
Do you have data that links programs (or logs) to flows, that describes the dependencies between flows, and that tells you which .sas programs belong to which flow?

Commercial schedulers normally maintain the execution history for audit and reporting purposes. Which scheduler has been used?

 

There are two steps for what you're after:
1. Prepare the data for analysis

2. Analysis/report 

 

Please provide some representative sample data of what you have.

Please show us/describe in more detail the analysis/report you need, including the purpose of this analysis. What questions do you want to answer?

dandsouza
Explorer | Level 4

Hi Patrick,
Thank you for the reply. The data in the file is as below:
Name of the File      Execution Date
IncomeCalc            01Dec2021
IncomeCalc            02Dec2021
IncomeCalc            03Dec2021
IncomeCalc            04Dec2021
IncomeCalc            05Dec2021
IncomeCalc            06Dec2021
IncomeCalc            07Dec2021
DonateCalc            07Dec2021
ExpenseCalc           15Dec2021
IncomeCalc            08Dec2021
IncomeCalc            09Dec2021
IncomeCalc            19Dec2021
IncomeCalc            11Dec2021
IncomeCalc            12Dec2021
IncomeCalc            13Dec2021
IncomeCalc            14Dec2021
DonateCalc            07Dec2021
...
ExpenseCalc           30Dec2021

 

Analysis - bucket the files by frequency, such as Daily, Weekly, Monthly.

 

With regard to your queries:

We don't have the source of the above input, nor is there any linkage provided. And yes, we are working on multiple approaches; one of them is to use the date interval functions to set a Frequency flag (Daily, Weekly, Monthly).
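A minimal sketch of that frequency-flag idea, assuming the runs are already in a SAS dataset (called have here, with filename and date variables) and using made-up day-gap cut-offs:

proc sort data=have;
  by filename date;
run;

data freq_flag;
  set have;
  by filename;
  length frequency $12;
  /* days since the previous run of the same file */
  gap = intck('day', lag(date), date);
  if first.filename then gap = .;   /* the lagged date belongs to another file */
  if      gap = 1         then frequency = 'Daily';
  else if 6  <= gap <= 8  then frequency = 'Weekly';
  else if 26 <= gap <= 33 then frequency = 'Monthly';
  else if 85 <= gap <= 95 then frequency = 'Quarterly';
  else                         frequency = 'Undetermined';
run;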

 

 

Kurt_Bremser
Super User

So you have a dataset like this.

data have;
input filename :$20. date :date9.;
format date yymmdd10.;
datalines;
IncomeCalc            01Dec2021
IncomeCalc            02Dec2021
IncomeCalc            03Dec2021
IncomeCalc            04Dec2021
IncomeCalc            05Dec2021
IncomeCalc            06Dec2021
IncomeCalc            07Dec2021
DonateCalc            07Dec2021
ExpenseCalc         15Dec2021
IncomeCalc            08Dec2021
IncomeCalc            09Dec2021
IncomeCalc            19Dec2021
IncomeCalc            11Dec2021
IncomeCalc            12Dec2021
IncomeCalc            13Dec2021
IncomeCalc            14Dec2021
DonateCalc            07Dec2021
;

?

ballardw
Super User

If you have a data set like that, with 3 or more records for each filename:

data have;
input filename :$20. date :date9.;
format date yymmdd10.;
datalines;
IncomeCalc            01Dec2021
IncomeCalc            02Dec2021
IncomeCalc            03Dec2021
IncomeCalc            04Dec2021
IncomeCalc            05Dec2021
IncomeCalc            06Dec2021
IncomeCalc            07Dec2021
DonateCalc            07Dec2021
ExpenseCalc         15Dec2021
IncomeCalc            08Dec2021
IncomeCalc            09Dec2021
IncomeCalc            19Dec2021
IncomeCalc            11Dec2021
IncomeCalc            12Dec2021
IncomeCalc            13Dec2021
IncomeCalc            14Dec2021
DonateCalc            07Dec2021
ExpenseCalc         15Jan2022
DonateCalc            07Jan2022
ExpenseCalc         15Feb2022
DonateCalc            07Feb2022
ExpenseCalc         15Mar2022
;

Proc sort data=have;
  by filename date;
run;

data want;
   set have;
   by filename;
   length interval $ 15;
   retain fcount;
   /* INTGET inspects three consecutive dates, supplied here via LAG2/LAG1 and the current row */
   interval = intget(lag2(date), lag1(date), date);
   /* restart the counter per filename and blank the interval for its first two rows,
      where the lagged dates still belong to the previous filename */
   if first.filename then fcount = 0;
   fcount + 1;
   if fcount < 3 then interval = ' ';
   drop fcount;
run;

INTGET may generate a lot of "Invalid argument" notes in the log at the change in filename if the intervals differ between the different filename values. I made up some dates for Donate and Expense just to have at least 3 of each to demonstrate. INTGET returns a missing interval if the values don't make sense as a set of 3.

I force the interval to missing for the first 2 observations of each filename, just in case there happens to be an acceptable interval that crosses filenames when using the LAG function.

 

If your actual data has repeated dates for the same filename value, you may want to consider the NODUPKEY option in PROC SORT, as duplicate dates rather mess up the INTGET function's behavior.
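A de-duplicating sort could look like this (a minimal sketch; the have_dedup output name is my own):

/* keep only one row per filename/date combination before computing intervals */
proc sort data=have out=have_dedup nodupkey;
  by filename date;
run;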

ballardw
Super User

There is a DATA step function, INTGET, that, given three date or datetime values (that is three exactly; count not to 2 or 4, and 5 is right out), returns the interval between the three. So if, and this is a big if, your data has sequential observations with a date or datetime variable, one observation per date, the function may be able to tell you something about the interval represented by the dates in the file.

 

A small example:

data have;
   do month=1 to 12;
      date= mdy(month,1,2023);
      output;
   end;
   format date date9.;
run;

data example;
    set have (obs=3);
    interval= intget(lag2(date),lag1(date),date);
    if _n_=3 then put "The interval is:" interval;
 run;

The data may have to be sorted in an appropriate manner so that the first observations in the data set are representative.

 

The interval can indicate consistent shifts. I recommend reading the documentation.
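A small illustration of my own (not from the documentation): feed INTGET three dates that all fall on a Wednesday and check the log. Because the dates are offset from the default start of the week, a shifted weekly interval can be expected rather than plain WEEK; the d1-d3 names are just for this sketch.

data shifted;
   d1 = '05Jan2022'd;   /* a Wednesday */
   d2 = '12Jan2022'd;   /* the next Wednesday */
   d3 = '19Jan2022'd;   /* the Wednesday after that */
   interval = intget(d1, d2, d3);
   put 'INTGET returned: ' interval=;
run;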

 

Obviously the above example places the interval information in the data set it creates, so it can be used in multiple ways.

 

Somewhat irregular intervals may not be detected, such as the first day of the month that the stock market is open.

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @dandsouza 

 

I will suggest a slightly more complex approach:

  1. For each recurring filename, calculate the number of days since the last run plus the number of times the given filename recurs.
  2. For each combination of filename / day_count, calculate the number of occurrences.
  3. For each combination of filename / day_count, calculate the percentage of this day_count.
  4. For each filename, take the most frequently occurring day_count and translate it to a scheduling period.

The translation in step 4 may need some refinement based on the result from step 3; it is difficult to judge from the provided data example.

data have;
input filename :$20. date :date9.;
format date yymmdd10.;
datalines;
IncomeCalc            01Dec2021
IncomeCalc            02Dec2021
IncomeCalc            03Dec2021
IncomeCalc            04Dec2021
IncomeCalc            05Dec2021
IncomeCalc            06Dec2021
IncomeCalc            07Dec2021
DonateCalc            07Dec2021
ExpenseCalc         15Dec2021
IncomeCalc            08Dec2021
IncomeCalc            09Dec2021
IncomeCalc            19Dec2021
IncomeCalc            11Dec2021
IncomeCalc            12Dec2021
IncomeCalc            13Dec2021
IncomeCalc            14Dec2021
DonateCalc            07Dec2021
ExpenseCalc         30Dec2021
run;

/* Step 1 - calculate the count of days since last run + the number of times a given filename reoccurs. */
proc sort data=have;
  by filename date;
run;

data interval (drop=date_last intervals_total) files (drop=date date_last interval);
  set have;
  by filename date;

  if first.filename then intervals_total = 0;
  intervals_total + 1;

  date_last = lag(date);
  if not first.filename then do;
    interval = date-date_last;
    output interval;
  end;

  if last.filename then do;
    intervals_total = intervals_total - 1;
    output files;
  end;
run;

/* Step 2 - For each combination of filename / day_count, calculate the number of occurrences. */
proc sql;
  create table interval_count as
    select 
      filename, 
      interval, 
      count(*) as intervals_found
    from interval
    group by  
      filename,
      interval
    order by 
      filename,
      intervals_found
  ;
quit;

/* Step 3 - For each combination of filename / day_count, calculate the percentage of this day_count.*/
data want_intermediate; 
  merge interval_count files;
  by filename;
  format pct_occurred 5.0;
  pct_occurred = (intervals_found * 100) / intervals_total;
run;

/* Step 4 - For each filename, take the most frequently occurring day_count and translate the day_count to a scheduling period. */
data want_final; 
  set want_intermediate;
  by filename;
  length schedule $20;
  if last.filename then do;
    if 85 < interval < 95 then schedule = 'Quarterly';   /* roughly 90 days */
    else if 26 < interval < 33 then schedule = 'Monthly';
    else if 12 < interval < 16 then schedule = 'SecondWeek';
    else if interval = 7 then schedule = 'Weekly';
    else if interval = 1 then schedule = 'Daily';
    else schedule = 'Undetermined';
    output;
  end;
run;


ErikLund_Jensen
Rhodochrosite | Level 12

Hi @dandsouza 

 

Sorry, I forgot to copy a ; before the first run; - this hopefully looks better:

 

data have;
input filename :$20. date :date9.;
format date yymmdd10.;
datalines;
IncomeCalc            01Dec2021
IncomeCalc            02Dec2021
IncomeCalc            03Dec2021
IncomeCalc            04Dec2021
IncomeCalc            05Dec2021
IncomeCalc            06Dec2021
IncomeCalc            07Dec2021
DonateCalc            07Dec2021
ExpenseCalc         15Dec2021
IncomeCalc            08Dec2021
IncomeCalc            09Dec2021
IncomeCalc            19Dec2021
IncomeCalc            11Dec2021
IncomeCalc            12Dec2021
IncomeCalc            13Dec2021
IncomeCalc            14Dec2021
DonateCalc            07Dec2021
ExpenseCalc         30Dec2021
;
run;

/* Step 1 - calculate the count of days since last run + the number of times a given filename reoccurs. */
proc sort data=have;
  by filename date;
run;

data interval (drop=date_last intervals_total) files (drop=date date_last interval);
  set have;
  by filename date;

  if first.filename then intervals_total = 0;
  intervals_total + 1;

  date_last = lag(date);
  if not first.filename then do;
    interval = date-date_last;
    output interval;
  end;

  if last.filename then do;
    intervals_total = intervals_total - 1;
    output files;
  end;
run;

/* Step 2 - For each combination of filename / day_count, calculate the number of occurrences. */
proc sql;
  create table interval_count as
    select 
      filename, 
      interval, 
      count(*) as intervals_found
    from interval
    group by  
      filename,
      interval
    order by 
      filename,
      intervals_found
  ;
quit;

/* Step 3 - For each combination of filename / day_count, calculate the percentage of this day_count.*/
data want_intermediate; 
  merge interval_count files;
  by filename;
  format pct_occurred 5.0;
  pct_occurred = (intervals_found * 100) / intervals_total;
run;

/* Step 4 - For each filename, take the most frequently occurring day_count and translate the day_count to a scheduling period. */
data want_final; 
  set want_intermediate;
  by filename;
  length schedule $20;
  if last.filename then do;
    if 85 < interval < 95 then schedule = 'Quarterly';   /* roughly 90 days */
    else if 26 < interval < 33 then schedule = 'Monthly';
    else if 12 < interval < 16 then schedule = 'SecondWeek';
    else if interval = 7 then schedule = 'Weekly';
    else if interval = 1 then schedule = 'Daily';
    else schedule = 'Undetermined';
    output;
  end;
run;
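To inspect the result, a quick listing along these lines could follow the final step (a minimal sketch; all variable names come from the code above):

/* one row per filename: its dominant interval and the derived schedule */
proc print data=want_final noobs;
  var filename interval intervals_found pct_occurred schedule;
run;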
