Good evening to the community! I have the following issue. I have a bunch of the following CSV files that contain customer details for a bank:
Customer_details_202101 , customer_details_202102 ,customer_details_202101, customer_details_202111
Where the first 4 digits in the 6-digit number at the end of the names is the year (2021 here) and the last 2 are the month, i.e. I have data from January to November 2021.
I want to create a macro program that creates a dataset named CUSTOMER_DETAILS_HIST that includes all the customer details from 01/01/1950 to 31/12/2100.
To do so, I have created the macro as follows:
%macro creation;
%do i=195001 %to 210012;
filename raw "C:\Customer_details_&i";
data customer_details_history;
infile raw dsd firstobs=2;
input variables;
run;
%end;
%mend creation;
%creation;
However, the code above will create a dataset for 2021/01, but then it will replace the dataset with the data from 2021/02. What I want to do is append, not replace.
I was thinking of creating another %do block inside the loop, telling SAS: "Create the dataset for the first month that is available. Then, exit the loop." (so that I can perform the appending on a separate stage). How can I tell SAS to exit the loop, but not the session? (so, not %ABORT). I know that in a DATA step DO loop there is the LEAVE statement. Is there anything I can use in a macro?
Thank you in advance!
My answer based on the following assumptions:
1. You need a job for an initial load of multiple months.
2. You need a job for ongoing load of new months.
3. You get for each month all customers and just need to append the source data to target.
4. The macro variable that contains the year/month value is called &yearmon
It's been years since I've used DIS in anger but I believe that after the initial creation of the external file the only thing you need to change is the path to something like below.
C:\source\&yearmon\customer_hist_&yearmon..csv
...and of course tick the box to have this path in double quotes.
Now for your ongoing job (first flow picture) the only thing you need to ensure is that macro variable &yearmon exists and is populated. Because the date or batch number for which and ETL process executes is required for many jobs such information is often kept and retrieved from a control table that gets populated at the start of the ETL. Not sure what you're building but I'd recommend to consider such an approach. Using DIS you then could create a custom transformation used as first step in your flows that retrieves this info and populates a macro variable like &yearmon.
Below sample code for such a custom transformation (or user written code). It doesn't use a control table but it demonstrates the approach.
/*** header: DIS generated macros ***/
/* &input1 will contain table registered as input to custom code transformation */
%let input1=work.customer_hist;
/*** body: custom code ***/
/* ensure macro variable &yearmon exists */
%macro cust_set_mvar(datevar=date);
%if %symexist(yearmon)=0 %then %global yearmon;
/* processing if &yearmon is empty */
%if %nrbquote(&yearmon)=%nrbquote() %then
%do;
/* if target table customer_hist exists */
%if %sysfunc(exist(&input1)) %then
%do;
/* use latest date of already loaded data to populate &yearmon */
/* - &yearmon = latest date plus 1 month */
proc sql noprint;
select put(intnx('month',max_date,1,'b'),yymmn6. -l) into :yearmon trimmed
from (
select max(&datevar) as max_date
from &input1
)
;
quit;
%end;
%end;
/* set yearmon to 202101 if still empty because: */
/* - target table doesn't exist */
/* - or has zero rows */
/* - or all date variables are missing (shouldn't be possible) */
%if %nrbquote(&yearmon)=%nrbquote() %then %let yearmon=202101;
%mend;
You could add code similar to above to your flow to ensure macro variable &yearmon gets created and is populated if it doesn't exist already.
For the one-off job for the initial load (your 2nd flow picture): Looks good to me. The loop will call flow 1. Make sure to define the loop in a way so control table variable ccyymm populates macro variable yearmon with the required string. And of course run the loop in sequence and not in parallel.
Because the sample code I shared above will only populate &yearmon if it doesn't exist already or is empty when running the inner job via loop &yearmon will get used as populated via the loop transformation.
And just some thoughts:
I'm always careful when it comes to designs which result in quickly growing tables. Consider how this data will get used and design your data model accordingly.
Customer data doesn't change that often and many times only the most recent customer information is required. It might be worth to consider loading your table using SCD2. Or eventually maintain a customer table that only stores the latest known information of active and inactive customers and store the monthly history tables separately. And if it was me I'd keep the history tables monthly the way you get it from source (i.e. customer_&yearmon) stored under a single folder. You then can always create a view for easy access to all the history months (or the last 12 months, or...).
1. You can read all the CSV files in one data step. No need for macro or for a loop.
See the FILENAME= option of the INFILE statement.
For example:
data _null_;
file "%sysfunc(pathname(WORK))/ttt201001.txt" ;
put 'a';
file "%sysfunc(pathname(WORK))/ttt201002.txt" ;
put 'b';
run;
data WANT;
length FN FILENAME $256;
infile "%sysfunc(pathname(WORK))/ttt*.txt" filename=FN;
input V $;
FILENAME=FN;
run;
2. Consider creating a data set by month as it's easier to manage (no endless growth, retire older data), unless you know you're mostly going to use all of them together. You can then use them together at will via a view
data CUSTOMER_DETAILS_HISTORY/view=CUSTOMER_DETAILS_HISTORY;
set CUSTOMER_DETAILS_19:
CUSTOMER_DETAILS_20: ;
run;
Just add an APPEND step to your process:
%macro creation;
%do i=195001 %to 210012;
filename raw "C:\Customer_details_&i";
data customer_details_history;
infile raw dsd firstobs=2;
input variables;
run;
proc datasets library = MyLib;
append base = CUSTOMER_DETAILS_HIST
data = WORK.customer_details_history
;
run;
quit;
%end;
%mend creation;
%creation;
Actually it does not solve the problem, unfortunately. What I want to do, is to create a macro variable that includes all the files and folders in the directory. I will then use this macro variable, to see if the &i variable is included in the dircetory. Let me do some more research, thanks for the help though (Thanks all of you who responded actually).
@mvalsamis wrote:
Actually it does not solve the problem, unfortunately. What I want to do, is to create a macro variable that includes all the files and folders in the directory. I will then use this macro variable, to see if the &i variable is included in the dircetory. Let me do some more research, thanks for the help though (Thanks all of you who responded actually).
How many files do you expect to have in the directory? A macro variable can only contain 64K bytes. And if you use a data step to generate it you will be limited to the 32K byte limit of a character variable.
What do you expect to DO with this macro variable that contains so many names?
@mvalsamis wrote:
Actually it does not solve the problem, unfortunately. What I want to do, is to create a macro variable that includes all the files and folders in the directory. I will then use this macro variable, to see if the &i variable is included in the dircetory. Let me do some more research, thanks for the help though (Thanks all of you who responded actually).
Pretty poor approach to create hundreds of values, since 195013 to 195099 wouldn't make sense as your file name structure implies. Not to mention looking 70+ plus years into the future (210012).
Better would be to search your file system for existing ones using the DIR command and pipe the result to a text file to use as a control.
at a command prompt something like
DIR C:\<top folder to search\customer_details_*.csv /S /B
use a > to send the results to a text file. Or use that as a PIPE in a filename statement
That would be better if you could post some CSV sample file.
Assuming all these CSV file are under path "c:\temp\z\" :
data fname;
infile cards truncover dsd;
input fname $200.;
cards;
C:\temp\z\Customer_details_202101.csv
C:\temp\z\Customer_details_202102.csv
;
data CUSTOMER_DETAILS_HIST ;
set fname;
infile dummy filevar=fname end=last dsd truncover firstobs=2; /*suppose the first line is variable name*/
length filename $ 100;
filename=scan(fname,-2,'.\');
do while(not last);
input name :$20. sex :$20. age height weight; /*list your variable here*/
output;
end;
run;
1950 to 2100 is 151*12 months.
Sounds like you want to do something like:
%let start = "01JAN1950"d;
%let end = "01JAN2025"d;
%let path = c:\foldername\ ;
data want;
length start end month 8 fname $200 ;
start = intnx('month',&start,0);
end = intnx('month',&end,0);
format start end month date9.;
do offset = 0 to intck('month',start,end);
month = intnx('month',start,offset);
fname = cats("&path",'Customer_details_',put(month,yymmn6.),'.csv');
if not fileexist(fname) then put 'NOTE: Skipping ' month= ' file not found.';
else do;
infile csv dsd filevar=fname truncover firstobs=2 end=eof;
do while(not eof);
input .... list your variables here ....;
output;
end;
end;
end;
run;
You could easily convert that into a macro by just making the three macro variables the parameters of the macro.
After a careful research and help from colleagues, I realised that the way I have structured the question is wrong. What I wanted to do is create a dimensional model, where one of the dimensions are the customer details from 2021 to 2100. Whenever new data is added, the model must be automatically updated. I am doing this project on DI Studio.
To load the raw data from 01/2021 to 11/2021 (and beyond), I used a control table with all the months from 01/2021 to 12/2100 as follows:
CCYYMM |
202101 |
202102 |
202103 |
202104 |
202105 |
202106 |
202107 |
202108 |
202109 |
202110 |
202111 |
202112 |
202201 |
... |
210012 |
Then, I used this control table to create a parameterised (macro) job with the following parameter (macro variable):
yearmon
with default value:
202101
This is both a parameter in the job, and in the source file, which is named:
CUSTOMER_HIST
And located somewhere like that:
C:\source\&yearmon\customer_hist_&yearmon..csv
Where &yearmon, as defined, takes values 202101, 202102,....,202111.
The diagram of the job is as in the image.
Please make sure:
1. In the Properties of the raw file, beneath File Location tab, select "Double quotes around file name".
2. Define the parameter in the Parameters tab of the Properties of the raw file.
3. In the Table Loader Properties, Load Technique Tab, select: Append to Existing.
After I ran the macro job, I created a loop job, where I ran the whole loop of the values contained in the control table.
The diagram is as follows:
1. Notice the macro job included, noted with the "&" symbol.
2. Open the loop tab, and match the created parameter/ macro variable with the variable of the control table.
The final output table CUSTOMER_HIST is as follows (a sample):
Cutomer_ID | DATE | FName | LName | Gender | Date_of_birth | Postcode | Address | Alive_flag |
38962 | 01JAN2021 | John | Smith | M | 23MAR1964 | BS2 1EG | 18 Clifton Road, Bristol, England, UK | 1 |
35688 | 01MAR2021 | Sarah | Williams | F | 03APR2008 | LA1 7TY | 1 Manchester Road, Lancaster, England, UK | 1 |
... | ||||||||
56789 | 01NOV2021 | Georgia | O'Sullivan | F | 03AUG1996 | BT5 8UI | 22 Church Lane, Belfast, Northern Ireland, UK |
Note: the examples above are imaginary and just came out of my head. Any similarity with the reality is coincident. I don't even know if the addresses above exist in the respective towns,
My answer based on the following assumptions:
1. You need a job for an initial load of multiple months.
2. You need a job for ongoing load of new months.
3. You get for each month all customers and just need to append the source data to target.
4. The macro variable that contains the year/month value is called &yearmon
It's been years since I've used DIS in anger but I believe that after the initial creation of the external file the only thing you need to change is the path to something like below.
C:\source\&yearmon\customer_hist_&yearmon..csv
...and of course tick the box to have this path in double quotes.
Now for your ongoing job (first flow picture) the only thing you need to ensure is that macro variable &yearmon exists and is populated. Because the date or batch number for which and ETL process executes is required for many jobs such information is often kept and retrieved from a control table that gets populated at the start of the ETL. Not sure what you're building but I'd recommend to consider such an approach. Using DIS you then could create a custom transformation used as first step in your flows that retrieves this info and populates a macro variable like &yearmon.
Below sample code for such a custom transformation (or user written code). It doesn't use a control table but it demonstrates the approach.
/*** header: DIS generated macros ***/
/* &input1 will contain table registered as input to custom code transformation */
%let input1=work.customer_hist;
/*** body: custom code ***/
/* ensure macro variable &yearmon exists */
%macro cust_set_mvar(datevar=date);
%if %symexist(yearmon)=0 %then %global yearmon;
/* processing if &yearmon is empty */
%if %nrbquote(&yearmon)=%nrbquote() %then
%do;
/* if target table customer_hist exists */
%if %sysfunc(exist(&input1)) %then
%do;
/* use latest date of already loaded data to populate &yearmon */
/* - &yearmon = latest date plus 1 month */
proc sql noprint;
select put(intnx('month',max_date,1,'b'),yymmn6. -l) into :yearmon trimmed
from (
select max(&datevar) as max_date
from &input1
)
;
quit;
%end;
%end;
/* set yearmon to 202101 if still empty because: */
/* - target table doesn't exist */
/* - or has zero rows */
/* - or all date variables are missing (shouldn't be possible) */
%if %nrbquote(&yearmon)=%nrbquote() %then %let yearmon=202101;
%mend;
You could add code similar to above to your flow to ensure macro variable &yearmon gets created and is populated if it doesn't exist already.
For the one-off job for the initial load (your 2nd flow picture): Looks good to me. The loop will call flow 1. Make sure to define the loop in a way so control table variable ccyymm populates macro variable yearmon with the required string. And of course run the loop in sequence and not in parallel.
Because the sample code I shared above will only populate &yearmon if it doesn't exist already or is empty when running the inner job via loop &yearmon will get used as populated via the loop transformation.
And just some thoughts:
I'm always careful when it comes to designs which result in quickly growing tables. Consider how this data will get used and design your data model accordingly.
Customer data doesn't change that often and many times only the most recent customer information is required. It might be worth to consider loading your table using SCD2. Or eventually maintain a customer table that only stores the latest known information of active and inactive customers and store the monthly history tables separately. And if it was me I'd keep the history tables monthly the way you get it from source (i.e. customer_&yearmon) stored under a single folder. You then can always create a view for easy access to all the history months (or the last 12 months, or...).
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.