You will need to be much more descriptive of what you have and what you need.
"We are been provided a batch of 10k files and its execution dates that were executed at varied frequency..."
What are these 10K files? Log files with a datetime portion in the name? Or .sas files and then some data with the name of the .sas files and the dates when these have been executed? Or something else?
"we are now task to break down this batch into various execution frequency"
With 10K files, this doesn't sound like a single flow but more like scheduling information for a whole environment with many flows.
Do you have data that links programs (or logs) to flows, that describes the dependencies between flows, and that tells you which .sas programs belong to which flow?
Commercial schedulers normally maintain the execution history for audit and reporting purposes. Which scheduler has been used?
There are two steps for what you're after:
1. Prepare the data for analysis
2. Analysis/report
Please provide some representative sample data of what you have.
Please show us, or describe in more detail, the analysis/report you need, including the purpose of this analysis. What questions do you want to answer?
Hi Patrick,
Thank you for the reply. The data in the file is as below:
Name of the File , Execution Date
IncomeCalc 01Dec2021
IncomeCalc 02Dec2021
IncomeCalc 03Dec2021
IncomeCalc 04Dec2021
IncomeCalc 05Dec2021
IncomeCalc 06Dec2021
IncomeCalc 07Dec2021
DonateCalc 07Dec2021
ExpenseCalc 15Dec2021
IncomeCalc 08Dec2021
IncomeCalc 09Dec2021
IncomeCalc 19Dec2021
IncomeCalc 11Dec2021
IncomeCalc 12Dec2021
IncomeCalc 13Dec2021
IncomeCalc 14Dec2021
DonateCalc 07Dec2021
.
.
.
.
.
.
ExpenseCalc 30Dec2021
Analysis - Bucket the files by execution frequency (Daily, Weekly, Monthly).
With regard to your query: we don't have the source of the above input, nor is there any linkage provided. And yes, we are working on multiple approaches; one of them is to use a date interval function to set a frequency flag (Daily, Weekly, Monthly).
So you have a dataset like this.
data have;
input filename :$20. date :date9.;
format date yymmdd10.;
datalines;
IncomeCalc 01Dec2021
IncomeCalc 02Dec2021
IncomeCalc 03Dec2021
IncomeCalc 04Dec2021
IncomeCalc 05Dec2021
IncomeCalc 06Dec2021
IncomeCalc 07Dec2021
DonateCalc 07Dec2021
ExpenseCalc 15Dec2021
IncomeCalc 08Dec2021
IncomeCalc 09Dec2021
IncomeCalc 19Dec2021
IncomeCalc 11Dec2021
IncomeCalc 12Dec2021
IncomeCalc 13Dec2021
IncomeCalc 14Dec2021
DonateCalc 07Dec2021
;
?
If you have a data set like that, with 3 or more records for each filename:
data have;
input filename :$20. date :date9.;
format date yymmdd10.;
datalines;
IncomeCalc 01Dec2021
IncomeCalc 02Dec2021
IncomeCalc 03Dec2021
IncomeCalc 04Dec2021
IncomeCalc 05Dec2021
IncomeCalc 06Dec2021
IncomeCalc 07Dec2021
DonateCalc 07Dec2021
ExpenseCalc 15Dec2021
IncomeCalc 08Dec2021
IncomeCalc 09Dec2021
IncomeCalc 19Dec2021
IncomeCalc 11Dec2021
IncomeCalc 12Dec2021
IncomeCalc 13Dec2021
IncomeCalc 14Dec2021
DonateCalc 07Dec2021
ExpenseCalc 15Jan2022
DonateCalc 07Jan2022
ExpenseCalc 15Feb2022
DonateCalc 07Feb2022
ExpenseCalc 15Mar2022
;

proc sort data=have;
by filename date;
run;

data want;
set have;
by filename;
length interval $ 15;
/* Reset the counter for each filename so the first two observations
   of every group (whose lagged dates come from the previous group)
   get a blank interval. */
if first.filename then fcount = 0;
interval = intget(lag2(date), lag1(date), date);
if first.filename or fcount < 2 then interval = ' ';
fcount + 1;
drop fcount;
run;
INTGET may generate a lot of "Invalid argument" notes in the log at the change in filename if the intervals differ between the filename values. I made up some extra dates for DonateCalc and ExpenseCalc just to have at least 3 of each to demonstrate. INTGET returns a missing interval if the three values don't fit a common interval.
I force the interval to missing for the first 2 observations of each filename, just in case an acceptable interval happens to cross filenames when using the LAG function.
If your actual data has repeated dates for the same filename value, you may want to consider the NODUPKEY option in PROC SORT, as duplicates tend to mess up the INTGET function's behavior.
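That de-duplication step could look like this (a minimal sketch, assuming the HAVE dataset and BY variables from above; HAVE_DEDUP is just a name I picked):

```sas
/* Keep only one record per filename/date pair before interval detection */
proc sort data=have out=have_dedup nodupkey;
by filename date;
run;
```

Using OUT= keeps the original HAVE dataset intact in case you need the duplicates later.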
There is a DATA step function, INTGET, that, given exactly three date or datetime values (three exactly, count not to 2 or 4, 5 is right out), returns the interval between them. So if, and this is a big if, your data has sequential observations with a date or datetime variable, one observation per date, the function may be able to tell you the interval represented by the dates in the file.
A small example:
data have;
do month = 1 to 12;
date = mdy(month, 1, 2023);
output;
end;
format date date9.;
run;

data example;
set have (obs=3);
interval = intget(lag2(date), lag1(date), date);
if _n_ = 3 then put "The interval is: " interval;
run;
The data may have to be sorted in an appropriate order so that the first observations in the data set are representative.
INTGET can also indicate consistently shifted intervals. I recommend reading the documentation.
Obviously, the example above places the interval information in the data set it creates, so it can be used in multiple ways.
Somewhat irregular intervals may not be detected, such as the first day of the month that the stock market is open.
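To illustrate the shifted intervals mentioned above, here is a small sketch with made-up dates (three consecutive Mondays; I believe INTGET reports such a pattern as a shifted weekly interval like WEEK.2, but check the documentation for the exact shift notation):

```sas
/* Three consecutive Mondays - INTGET should report a shifted weekly interval */
data mondays;
input date :date9.;
format date date9.;
datalines;
04Jan2021
11Jan2021
18Jan2021
;

data _null_;
set mondays end=last;
interval = intget(lag2(date), lag1(date), date);
if last then put "Detected interval: " interval;
run;
```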
Hi @dandsouza
I will suggest a slightly more complex approach:
The translation in step 4 may need some refinement based on the result from step 3; it is difficult to judge from the provided data example.
data have;
input filename :$20. date :date9.;
format date yymmdd10.;
datalines;
IncomeCalc 01Dec2021
IncomeCalc 02Dec2021
IncomeCalc 03Dec2021
IncomeCalc 04Dec2021
IncomeCalc 05Dec2021
IncomeCalc 06Dec2021
IncomeCalc 07Dec2021
DonateCalc 07Dec2021
ExpenseCalc 15Dec2021
IncomeCalc 08Dec2021
IncomeCalc 09Dec2021
IncomeCalc 19Dec2021
IncomeCalc 11Dec2021
IncomeCalc 12Dec2021
IncomeCalc 13Dec2021
IncomeCalc 14Dec2021
DonateCalc 07Dec2021
ExpenseCalc 30Dec2021
;
run;
/* Step 1 - calculate the count of days since last run + the number of times a given filename reoccurs. */
proc sort data=have;
by filename date;
run;
data interval (drop=date_last intervals_total) files (drop=date date_last interval);
set have;
by filename date;
if first.filename then intervals_total = 0;
intervals_total + 1;
date_last = lag(date);
if not first.filename then do;
interval = date-date_last;
output interval;
end;
if last.filename then do;
intervals_total = intervals_total - 1;
output files;
end;
run;
/* Step 2 - For each combination of filename / day_count, calculate the number of occurrences. */
proc sql;
create table interval_count as
select
filename,
interval,
count(*) as intervals_found
from interval
group by
filename,
interval
order by
filename,
intervals_found
;
quit;
/* Step 3 - For each combination of filename / day_count, calculate the percentage of this day_count.*/
data want_intermediate;
merge interval_count files;
by filename;
format pct_occurred 5.0;
pct_occurred = (intervals_found * 100) / intervals_total;
run;
/* Step 4 - For each filename, take the most frequently occurring day_count and translate the day_count to a scheduling period. */
data want_final;
set want_intermediate;
by filename;
length schedule $20;
if last.filename then do;
if 86 < interval < 96 then schedule = 'Quarterly';
else if 26 < interval < 33 then schedule = 'Monthly';
else if 12 < interval < 16 then schedule = 'SecondWeek';
else if interval = 7 then schedule = 'Weekly';
else if interval = 1 then schedule = 'Daily';
else schedule = 'Undetermined';
output;
end;
run;
Hi @dandsouza
Sorry, I forgot to copy a ; before the first run; - this hopefully looks better:
data have;
input filename :$20. date :date9.;
format date yymmdd10.;
datalines;
IncomeCalc 01Dec2021
IncomeCalc 02Dec2021
IncomeCalc 03Dec2021
IncomeCalc 04Dec2021
IncomeCalc 05Dec2021
IncomeCalc 06Dec2021
IncomeCalc 07Dec2021
DonateCalc 07Dec2021
ExpenseCalc 15Dec2021
IncomeCalc 08Dec2021
IncomeCalc 09Dec2021
IncomeCalc 19Dec2021
IncomeCalc 11Dec2021
IncomeCalc 12Dec2021
IncomeCalc 13Dec2021
IncomeCalc 14Dec2021
DonateCalc 07Dec2021
ExpenseCalc 30Dec2021
;
run;
/* Step 1 - calculate the count of days since last run + the number of times a given filename reoccurs. */
proc sort data=have;
by filename date;
run;
data interval (drop=date_last intervals_total) files (drop=date date_last interval);
set have;
by filename date;
if first.filename then intervals_total = 0;
intervals_total + 1;
date_last = lag(date);
if not first.filename then do;
interval = date-date_last;
output interval;
end;
if last.filename then do;
intervals_total = intervals_total - 1;
output files;
end;
run;
/* Step 2 - For each combination of filename / day_count, calculate the number of occurrences. */
proc sql;
create table interval_count as
select
filename,
interval,
count(*) as intervals_found
from interval
group by
filename,
interval
order by
filename,
intervals_found
;
quit;
/* Step 3 - For each combination of filename / day_count, calculate the percentage of this day_count.*/
data want_intermediate;
merge interval_count files;
by filename;
format pct_occurred 5.0;
pct_occurred = (intervals_found * 100) / intervals_total;
run;
/* Step 4 - For each filename, take the most frequently occurring day_count and translate the day_count to a scheduling period. */
data want_final;
set want_intermediate;
by filename;
length schedule $20;
if last.filename then do;
if 86 < interval < 96 then schedule = 'Quarterly';
else if 26 < interval < 33 then schedule = 'Monthly';
else if 12 < interval < 16 then schedule = 'SecondWeek';
else if interval = 7 then schedule = 'Weekly';
else if interval = 1 then schedule = 'Daily';
else schedule = 'Undetermined';
output;
end;
run;