dandsouza
Explorer | Level 4
Hello All

We have been provided a batch of 10k files along with the dates on which they were executed, at varying frequencies. We are now tasked with breaking this batch down by execution frequency, such as Daily, Weekly, Monthly, or Quarterly. Just checking if anyone has done something similar, where they analyze the date column and produce this kind of classification.

Regards
Danny
7 REPLIES
Patrick
Opal | Level 21

You will need to be much more descriptive of what you have and what you need. 

 

"We are been provided a batch of 10k files and its execution dates that were executed at varied frequency..."
What are these 10K files? Log files with a datetime portion in the name? Or .sas files and then some data with the name of the .sas files and the dates when these have been executed? Or something else?

 

"we are now task to break down this batch into various execution frequency"

10K files doesn't sound like a single flow, but more like scheduling information for a whole environment with many flows.
Do you have data that links programs (or logs) to flows, that describes the dependencies between flows, and that tells you which .sas programs belong to a flow?

Commercial schedulers normally maintain the execution history for audit and reporting purposes. Which scheduler has been used?

 

There are two steps for what you're after:
1. Prepare the data for analysis

2. Analysis/report 

 

Please provide some representative sample data of what you have.

Please show us/describe in more detail the analysis/report you need, including the purpose of this analysis. What questions do you want to answer?

dandsouza
Explorer | Level 4

Hi Patrick, 
Thank you for the reply. The data in the file is as below:
Name of the File    Execution Date
IncomeCalc          01Dec2021
IncomeCalc          02Dec2021
IncomeCalc          03Dec2021
IncomeCalc          04Dec2021
IncomeCalc          05Dec2021
IncomeCalc          06Dec2021
IncomeCalc          07Dec2021
DonateCalc          07Dec2021
ExpenseCalc         15Dec2021
IncomeCalc          08Dec2021
IncomeCalc          09Dec2021
IncomeCalc          19Dec2021
IncomeCalc          11Dec2021
IncomeCalc          12Dec2021
IncomeCalc          13Dec2021
IncomeCalc          14Dec2021
DonateCalc          07Dec2021
...
ExpenseCalc         30Dec2021

 

Analysis - bucket the files by frequency, such as Daily, Weekly, Monthly.

 

With regards to your query:

We don't have the source of the above input, nor is there any linkage provided. And yes, we are working on multiple approaches; one of them is to understand the date interval functions and use them to set a frequency flag (Daily, Weekly, Monthly), as sketched below.
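
A minimal sketch of that idea (all names here are illustrative; it assumes a data set HAVE with FILENAME and DATE, as in the listing above, and classifies each run by the number of days since the previous run of the same file):

proc sort data=have;
  by filename date;
run;

data freq_flag;
  set have;
  by filename;
  gap = dif(date);                 /* days since the previous record    */
  if first.filename then gap = .;  /* no previous run for this filename */
  length frequency $12;
  if gap = .                  then frequency = ' ';
  else if gap = 1             then frequency = 'Daily';
  else if 6  <= gap <= 8      then frequency = 'Weekly';
  else if 27 <= gap <= 32     then frequency = 'Monthly';
  else if 85 <= gap <= 95     then frequency = 'Quarterly';
  else                             frequency = 'Undetermined';
run;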

 

 

Kurt_Bremser
Super User

So you have a dataset like this.

data have;
input filename :$20. date :date9.;
format date yymmdd10.;
datalines;
IncomeCalc            01Dec2021
IncomeCalc            02Dec2021
IncomeCalc            03Dec2021
IncomeCalc            04Dec2021
IncomeCalc            05Dec2021
IncomeCalc            06Dec2021
IncomeCalc            07Dec2021
DonateCalc            07Dec2021
ExpenseCalc         15Dec2021
IncomeCalc            08Dec2021
IncomeCalc            09Dec2021
IncomeCalc            19Dec2021
IncomeCalc            11Dec2021
IncomeCalc            12Dec2021
IncomeCalc            13Dec2021
IncomeCalc            14Dec2021
DonateCalc            07Dec2021
;

?

ballardw
Super User

If you have a data set like that, with 3 or more records for each filename:

data have;
input filename :$20. date :date9.;
format date yymmdd10.;
datalines;
IncomeCalc            01Dec2021
IncomeCalc            02Dec2021
IncomeCalc            03Dec2021
IncomeCalc            04Dec2021
IncomeCalc            05Dec2021
IncomeCalc            06Dec2021
IncomeCalc            07Dec2021
DonateCalc            07Dec2021
ExpenseCalc         15Dec2021
IncomeCalc            08Dec2021
IncomeCalc            09Dec2021
IncomeCalc            19Dec2021
IncomeCalc            11Dec2021
IncomeCalc            12Dec2021
IncomeCalc            13Dec2021
IncomeCalc            14Dec2021
DonateCalc            07Dec2021
ExpenseCalc         15Jan2022
DonateCalc            07Jan2022
ExpenseCalc         15Feb2022
DonateCalc            07Feb2022
ExpenseCalc         15Mar2022
;

Proc sort data=have;
  by filename date;
run;

data want;
   set have;
   by filename;
   length interval $ 15;
   interval = intget(lag2(date),lag1(date),date);
   /* blank the interval for the first two observations of each filename,
      where the lagged dates still belong to the previous filename */
   if first.filename then fcount = 0;
   fcount + 1;
   if fcount < 3 then interval = ' ';
   drop fcount;
run;

INTGET may generate a lot of "Invalid argument" notes in the log at the change in filename if the intervals differ between the filename values. I made up some extra dates for Donate and Expense just to have at least 3 per filename to demonstrate. The INTGET function returns a missing interval if the values don't make sense as a set of 3.

I force a missing interval for the first 2 observations of each filename, just in case there happens to be an acceptable interval that crosses filenames when using the LAG function.

 

If your actual data has repeated dates for the same filename value, you may want to consider using the NODUPKEY option in PROC SORT, as duplicate dates tend to mess up the INTGET function's behavior.
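
For example, a de-duplicating sort could look like this (out=have_nodup is just an illustrative name; it assumes the duplicates are exact filename/date repeats, like the second DonateCalc 07Dec2021 row):

proc sort data=have out=have_nodup nodupkey;
  by filename date;
run;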

ballardw
Super User

There is a DATA step function, INTGET, that, given three (that is exactly three; count not to 2 or 4, and 5 is right out) date or datetime values, will return the interval between the three. So if, and this is a big if, your data has sequential observations with a date or datetime variable, one observation per date, the function may be able to tell you something about the interval represented by the dates in the file.

 

A small example:

data have;
   do month=1 to 12;
      date= mdy(month,1,2023);
      output;
   end;
   format date date9.;
run;

data example;
   set have (obs=3);
   interval = intget(lag2(date),lag1(date),date);
   if _n_=3 then put "The interval is: " interval;
run;

The data may have to be sorted in an appropriate manner so that the first observations in the data set are representative.

 

The interval can indicate consistent shifts. I recommend reading the documentation.

 

Obviously the above example places the interval information in the data set it creates, so it can be used in multiple ways; one possibility is sketched below.
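
For instance, a minimal sketch of one such use (assuming the EXAMPLE data set above, and that INTERVAL holds INTGET results such as DAY, WEEK, MONTH or QTR, possibly with a multiplier or shift like WEEK2 or MONTH.2; BUCKETS and FREQUENCY are illustrative names):

data buckets;
  set example;
  length frequency $12;
  /* keep only the alphabetic part of the interval name before mapping it */
  select (compress(upcase(interval), , 'ka'));
    when ('DAY')   frequency = 'Daily';
    when ('WEEK')  frequency = 'Weekly';
    when ('MONTH') frequency = 'Monthly';
    when ('QTR')   frequency = 'Quarterly';
    otherwise      frequency = 'Undetermined';
  end;
run;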

 

Somewhat irregular intervals, such as the first day of the month that the stock market is open, may not be detected.

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @dandsouza 

 

I will suggest a slightly more complex approach:

  1. For each recurring filename, calculate the number of days since the last run, plus the number of times a given filename recurs.
  2. For each combination of filename / day_count, calculate the number of occurrences.
  3. For each combination of filename / day_count, calculate the percentage of this day_count.
  4. For each filename, take the most frequently occurring day_count and translate the day_count to a scheduling period.

The translation in step 4 may need some refinement based on the results from step 3; it is difficult to judge from the provided data example.

data have;
input filename :$20. date :date9.;
format date yymmdd10.;
datalines;
IncomeCalc            01Dec2021
IncomeCalc            02Dec2021
IncomeCalc            03Dec2021
IncomeCalc            04Dec2021
IncomeCalc            05Dec2021
IncomeCalc            06Dec2021
IncomeCalc            07Dec2021
DonateCalc            07Dec2021
ExpenseCalc         15Dec2021
IncomeCalc            08Dec2021
IncomeCalc            09Dec2021
IncomeCalc            19Dec2021
IncomeCalc            11Dec2021
IncomeCalc            12Dec2021
IncomeCalc            13Dec2021
IncomeCalc            14Dec2021
DonateCalc            07Dec2021
ExpenseCalc         30Dec2021
run;

/* Step 1 - calculate the count of days since last run + the number of times a given filename reoccurs. */
proc sort data=have;
  by filename date;
run;

data interval (drop=date_last intervals_total) files (drop=date date_last interval);
  set have;
  by filename date;

  if first.filename then intervals_total = 0;
  intervals_total + 1;

  date_last = lag(date);
  if not first.filename then do;
    interval = date-date_last;
    output interval;
  end;

  if last.filename then do;
    intervals_total = intervals_total - 1;
    output files;
  end;
run;

/* Step 2 - For each combination of filename / day_count, calculate the number of occurrences. */
proc sql;
  create table interval_count as
    select 
      filename, 
      interval, 
      count(*) as intervals_found
    from interval
    group by  
      filename,
      interval
    order by 
      filename,
      intervals_found
  ;
quit;

/* Step 3 - For each combination of filename / day_count, calculate the percentage of this day_count.*/
data want_intermediate; 
  merge interval_count files;
  by filename;
  format pct_occurred 5.0;
  pct_occurred = (intervals_found * 100) / intervals_total;
run;

/* Step 4 - For each filename, take the most frequently occurring day_count and translate the day_count to a scheduling period. */
data want_final; 
  set want_intermediate;
  by filename;
  length schedule $20;
  if last.filename then do;
    if 86 < interval < 95 then schedule = 'Quarterly';
    else if 26 < interval < 33 then schedule = 'Monthly';
    else if 12 < interval < 16 then schedule = 'SecondWeek';
    else if interval = 7 then schedule = 'Weekly';
    else if interval = 1 then schedule = 'Daily';
    else schedule = 'Undetermined';
    output;
  end;
run;


ErikLund_Jensen
Rhodochrosite | Level 12

Hi @dandsouza 

 

Sorry, I forgot to copy a ; before the first run; - this hopefully looks better:

 

data have;
input filename :$20. date :date9.;
format date yymmdd10.;
datalines;
IncomeCalc            01Dec2021
IncomeCalc            02Dec2021
IncomeCalc            03Dec2021
IncomeCalc            04Dec2021
IncomeCalc            05Dec2021
IncomeCalc            06Dec2021
IncomeCalc            07Dec2021
DonateCalc            07Dec2021
ExpenseCalc         15Dec2021
IncomeCalc            08Dec2021
IncomeCalc            09Dec2021
IncomeCalc            19Dec2021
IncomeCalc            11Dec2021
IncomeCalc            12Dec2021
IncomeCalc            13Dec2021
IncomeCalc            14Dec2021
DonateCalc            07Dec2021
ExpenseCalc         30Dec2021
;
run;

/* Step 1 - calculate the count of days since last run + the number of times a given filename reoccurs. */
proc sort data=have;
  by filename date;
run;

data interval (drop=date_last intervals_total) files (drop=date date_last interval);
  set have;
  by filename date;

  if first.filename then intervals_total = 0;
  intervals_total + 1;

  date_last = lag(date);
  if not first.filename then do;
    interval = date-date_last;
    output interval;
  end;

  if last.filename then do;
    intervals_total = intervals_total - 1;
    output files;
  end;
run;

/* Step 2 - For each combination of filename / day_count, calculate the number of occurrences. */
proc sql;
  create table interval_count as
    select 
      filename, 
      interval, 
      count(*) as intervals_found
    from interval
    group by  
      filename,
      interval
    order by 
      filename,
      intervals_found
  ;
quit;

/* Step 3 - For each combination of filename / day_count, calculate the percentage of this day_count.*/
data want_intermediate; 
  merge interval_count files;
  by filename;
  format pct_occurred 5.0;
  pct_occurred = (intervals_found * 100) / intervals_total;
run;

/* Step 4 - For each filename, take the most frequently occurring day_count and translate the day_count to a scheduling period. */
data want_final; 
  set want_intermediate;
  by filename;
  length schedule $20;
  if last.filename then do;
    if 86 < interval < 95 then schedule = 'Quarterly';
    else if 26 < interval < 33 then schedule = 'Monthly';
    else if 12 < interval < 16 then schedule = 'SecondWeek';
    else if interval = 7 then schedule = 'Weekly';
    else if interval = 1 then schedule = 'Daily';
    else schedule = 'Undetermined';
    output;
  end;
run;
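
As a quick sanity check on the result (want_final from the step above; this PROC FREQ is just an illustrative follow-up, not part of the original reply), one could tabulate how many filenames fall into each bucket:

proc freq data=want_final;
  tables schedule / nocum nopercent;
run;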
