Solved: Create a Customer Details dataset and load it every month

mvalsamis · Posted 01-16-2025 01:46 PM

Good evening to the community! I have the following issue. I have a bunch of the following CSV files that contain customer details for a bank:

Customer_details_202101 , customer_details_202102 ,customer_details_202101, customer_details_202111

Where the first 4 digits in the 6-digit number at the end of the names is the year (2021 here) and the last 2 are the month, i.e. I have data from January to November 2021.

I want to create a macro program that creates a dataset named CUSTOMER_DETAILS_HIST that includes all the customer details from 01/01/1950 to 31/12/2100.

To do so, I have created the macro as follows:

%macro creation;

%do i=195001 %to 210012;

filename raw "C:\Customer_details_&i";

data customer_details_history;
     infile raw dsd firstobs=2;
     input variables;
run;



%end;
%mend creation;
%creation;

However, the code above will create a dataset for 2021/01, but then it will replace the dataset with the data from 2021/02. What I want to do is append, not replace.

I was thinking of creating another %do block inside the loop, telling SAS: "Create the dataset for the first month that is available. Then, exit the loop." (so that I can perform the appending on a separate stage). How can I tell SAS to exit the loop, but not the session? (so, not %ABORT). I know that in a DATA step DO loop there is the LEAVE statement. Is there anything I can use in a macro?

Thank you in advance!

Patrick · Posted 01-24-2025 10:53 PM

My answer based on the following assumptions:

1. You need a job for an initial load of multiple months.

2. You need a job for ongoing load of new months.

3. You get for each month all customers and just need to append the source data to target.

4. The macro variable that contains the year/month value is called &yearmon

It's been years since I've used DIS in anger but I believe that after the initial creation of the external file the only thing you need to change is the path to something like below.

C:\source\&yearmon\customer_hist_&yearmon..csv

...and of course tick the box to have this path in double quotes.

Now for your ongoing job (first flow picture) the only thing you need to ensure is that macro variable &yearmon exists and is populated. Because the date or batch number for which and ETL process executes is required for many jobs such information is often kept and retrieved from a control table that gets populated at the start of the ETL. Not sure what you're building but I'd recommend to consider such an approach. Using DIS you then could create a custom transformation used as first step in your flows that retrieves this info and populates a macro variable like &yearmon.

Below sample code for such a custom transformation (or user written code). It doesn't use a control table but it demonstrates the approach.

Spoiler

/*** header: DIS generated macros ***/
/* &input1 will contain table registered as input to custom code transformation */
%let input1=work.customer_hist;

/*** body: custom code ***/
/* ensure macro variable &yearmon exists */
%macro cust_set_mvar(datevar=date);
	%if %symexist(yearmon)=0 %then %global yearmon;
	
	/* processing if &yearmon is empty */
	%if %nrbquote(&yearmon)=%nrbquote() %then
		%do;
			/* if target table customer_hist exists */
			%if %sysfunc(exist(&input1)) %then 
				%do;
					/* use latest date of already loaded data to populate &yearmon */
					/* - &yearmon = latest date plus 1 month                       */
					proc sql noprint;
						select put(intnx('month',max_date,1,'b'),yymmn6. -l) into :yearmon trimmed
						from (
							select max(&datevar) as max_date
							from &input1
						)
						;
					quit;
				%end;
		%end;
	/* set yearmon to 202101 if still empty because:  							*/
	/* - target table doesn't exist 																*/
	/* - or has zero rows             															*/
	/* - or all date variables are missing (shouldn't be possible) 	*/
	%if %nrbquote(&yearmon)=%nrbquote() %then %let yearmon=202101;
%mend;

/*** header: DIS generated macros ***/ /* &input1 will contain table registered as input to custom code transformation */ %let input1=work.customer_hist; /*** body: custom code ***/ /* ensure macro variable &yearmon exists */ %macro cust_set_mvar(datevar=date); %if %symexist(yearmon)=0 %then %global yearmon; /* processing if &yearmon is empty */ %if %nrbquote(&yearmon)=%nrbquote() %then %do; /* if target table customer_hist exists */ %if %sysfunc(exist(&input1)) %then %do; /* use latest date of already loaded data to populate &yearmon */ /* - &yearmon = latest date plus 1 month */ proc sql noprint; select put(intnx('month',max_date,1,'b'),yymmn6. -l) into :yearmon trimmed from ( select max(&datevar) as max_date from &input1 ) ; quit; %end; %end; /* set yearmon to 202101 if still empty because: */ /* - target table doesn't exist */ /* - or has zero rows */ /* - or all date variables are missing (shouldn't be possible) */ %if %nrbquote(&yearmon)=%nrbquote() %then %let yearmon=202101; %mend;

You could add code similar to above to your flow to ensure macro variable &yearmon gets created and is populated if it doesn't exist already.

For the one-off job for the initial load (your 2nd flow picture): Looks good to me. The loop will call flow 1. Make sure to define the loop in a way so control table variable ccyymm populates macro variable yearmon with the required string. And of course run the loop in sequence and not in parallel.
Because the sample code I shared above will only populate &yearmon if it doesn't exist already or is empty when running the inner job via loop &yearmon will get used as populated via the loop transformation.

And just some thoughts:
I'm always careful when it comes to designs which result in quickly growing tables. Consider how this data will get used and design your data model accordingly.
Customer data doesn't change that often and many times only the most recent customer information is required. It might be worth to consider loading your table using SCD2. Or eventually maintain a customer table that only stores the latest known information of active and inactive customers and store the monthly history tables separately. And if it was me I'd keep the history tables monthly the way you get it from source (i.e. customer_&yearmon) stored under a single folder. You then can always create a view for easy access to all the history months (or the last 12 months, or...).

View solution in original post

ChrisNZ · Posted 01-16-2025 05:53 PM

1. You can read all the CSV files in one data step. No need for macro or for a loop.

See the FILENAME= option of the INFILE statement.

For example:

data _null_;
  file "%sysfunc(pathname(WORK))/ttt201001.txt" ;
  put  'a'; 
  file "%sysfunc(pathname(WORK))/ttt201002.txt" ;
  put  'b';
run;

data WANT;
  length FN FILENAME $256;
  infile  "%sysfunc(pathname(WORK))/ttt*.txt" filename=FN;
  input V $; 
  FILENAME=FN;
run;

2. Consider creating a data set by month as it's easier to manage (no endless growth, retire older data), unless you know you're mostly going to use all of them together. You can then use them together at will via a view

data CUSTOMER_DETAILS_HISTORY/view=CUSTOMER_DETAILS_HISTORY;
  set CUSTOMER_DETAILS_19:
      CUSTOMER_DETAILS_20: ;
run;

High-Performance SAS Coding - Third Edition

SASKiwi · Posted 01-16-2025 06:02 PM

Just add an APPEND step to your process:

%macro creation;

%do i=195001 %to 210012;

filename raw "C:\Customer_details_&i";

data customer_details_history;
     infile raw dsd firstobs=2;
     input variables;
run;

proc datasets library = MyLib;
  append base = CUSTOMER_DETAILS_HIST
         data = WORK.customer_details_history
		 ;
run;
quit;

%end;
%mend creation;
%creation;

mvalsamis · Posted 01-17-2025 05:45 AM

Your code works, although it needed a few amendments.

mvalsamis · Posted 01-17-2025 11:49 AM

Actually it does not solve the problem, unfortunately. What I want to do, is to create a macro variable that includes all the files and folders in the directory. I will then use this macro variable, to see if the &i variable is included in the dircetory. Let me do some more research, thanks for the help though (Thanks all of you who responded actually).

Tom · Posted 01-17-2025 02:06 PM

@mvalsamis wrote:

Actually it does not solve the problem, unfortunately. What I want to do, is to create a macro variable that includes all the files and folders in the directory. I will then use this macro variable, to see if the &i variable is included in the dircetory. Let me do some more research, thanks for the help though (Thanks all of you who responded actually).

How many files do you expect to have in the directory? A macro variable can only contain 64K bytes. And if you use a data step to generate it you will be limited to the 32K byte limit of a character variable.

What do you expect to DO with this macro variable that contains so many names?

ballardw · Posted 01-17-2025 02:35 PM

@mvalsamis wrote:

Actually it does not solve the problem, unfortunately. What I want to do, is to create a macro variable that includes all the files and folders in the directory. I will then use this macro variable, to see if the &i variable is included in the dircetory. Let me do some more research, thanks for the help though (Thanks all of you who responded actually).

Pretty poor approach to create hundreds of values, since 195013 to 195099 wouldn't make sense as your file name structure implies. Not to mention looking 70+ plus years into the future (210012).

Better would be to search your file system for existing ones using the DIR command and pipe the result to a text file to use as a control.

at a command prompt something like

DIR C:\<top folder to search\customer_details_*.csv /S /B

use a > to send the results to a text file. Or use that as a PIPE in a filename statement

Ksharp · Posted 01-16-2025 08:15 PM

That would be better if you could post some CSV sample file.

Assuming all these CSV file are under path "c:\temp\z\" :

data fname;
infile cards truncover dsd;
input fname $200.;
cards;
C:\temp\z\Customer_details_202101.csv
C:\temp\z\Customer_details_202102.csv
;

data CUSTOMER_DETAILS_HIST ;
set fname;
infile dummy filevar=fname end=last dsd truncover firstobs=2;  /*suppose the first line is variable name*/
length filename $ 100;
filename=scan(fname,-2,'.\');
do while(not last);
input name :$20. sex :$20. age height weight;  /*list your variable here*/
output;
end;
run;

Tom · Posted 01-17-2025 02:20 PM

1950 to 2100 is 151*12 months.

Sounds like you want to do something like:

%let start = "01JAN1950"d;
%let end = "01JAN2025"d;
%let path = c:\foldername\ ;

data want;
  length start end month 8 fname $200 ;
  start = intnx('month',&start,0);
  end = intnx('month',&end,0);
  format start end month date9.;
  do offset = 0 to intck('month',start,end);
    month = intnx('month',start,offset);
    fname = cats("&path",'Customer_details_',put(month,yymmn6.),'.csv');
    if not fileexist(fname) then put 'NOTE: Skipping ' month= ' file not found.';
    else do;
      infile csv dsd filevar=fname truncover firstobs=2 end=eof;
      do while(not eof);
        input .... list your variables here ....;
        output;
      end;
    end;
  end;
run;

You could easily convert that into a macro by just making the three macro variables the parameters of the macro.

mvalsamis · Posted 01-24-2025 12:16 PM

After a careful research and help from colleagues, I realised that the way I have structured the question is wrong. What I wanted to do is create a dimensional model, where one of the dimensions are the customer details from 2021 to 2100. Whenever new data is added, the model must be automatically updated. I am doing this project on DI Studio.

To load the raw data from 01/2021 to 11/2021 (and beyond), I used a control table with all the months from 01/2021 to 12/2100 as follows:

CCYYMM
202101
202102
202103
202104
202105
202106
202107
202108
202109
202110
202111
202112
202201
...
210012

Then, I used this control table to create a parameterised (macro) job with the following parameter (macro variable):

yearmon

with default value:

This is both a parameter in the job, and in the source file, which is named:

CUSTOMER_HIST

And located somewhere like that:

C:\source\&yearmon\customer_hist_&yearmon..csv

Where &yearmon, as defined, takes values 202101, 202102,....,202111.

The diagram of the job is as in the image.

Please make sure:

1. In the Properties of the raw file, beneath File Location tab, select "Double quotes around file name".

2. Define the parameter in the Parameters tab of the Properties of the raw file.

3. In the Table Loader Properties, Load Technique Tab, select: Append to Existing.

After I ran the macro job, I created a loop job, where I ran the whole loop of the values contained in the control table.

The diagram is as follows:

1. Notice the macro job included, noted with the "&" symbol.

2. Open the loop tab, and match the created parameter/ macro variable with the variable of the control table.

The final output table CUSTOMER_HIST is as follows (a sample):

Cutomer_ID	DATE	FName	LName	Gender	Date_of_birth	Postcode	Address	Alive_flag
38962	01JAN2021	John	Smith	M	23MAR1964	BS2 1EG	18 Clifton Road, Bristol, England, UK	1
35688	01MAR2021	Sarah	Williams	F	03APR2008	LA1 7TY	1 Manchester Road, Lancaster, England, UK	1
...
56789	01NOV2021	Georgia	O'Sullivan	F	03AUG1996	BT5 8UI	22 Church Lane, Belfast, Northern Ireland, UK

Note: the examples above are imaginary and just came out of my head. Any similarity with the reality is coincident. I don't even know if the addresses above exist in the respective towns,

Patrick · Posted 01-24-2025 10:53 PM

My answer based on the following assumptions:

1. You need a job for an initial load of multiple months.

2. You need a job for ongoing load of new months.

3. You get for each month all customers and just need to append the source data to target.

4. The macro variable that contains the year/month value is called &yearmon

It's been years since I've used DIS in anger but I believe that after the initial creation of the external file the only thing you need to change is the path to something like below.

C:\source\&yearmon\customer_hist_&yearmon..csv

...and of course tick the box to have this path in double quotes.

Now for your ongoing job (first flow picture) the only thing you need to ensure is that macro variable &yearmon exists and is populated. Because the date or batch number for which and ETL process executes is required for many jobs such information is often kept and retrieved from a control table that gets populated at the start of the ETL. Not sure what you're building but I'd recommend to consider such an approach. Using DIS you then could create a custom transformation used as first step in your flows that retrieves this info and populates a macro variable like &yearmon.

Below sample code for such a custom transformation (or user written code). It doesn't use a control table but it demonstrates the approach.

Spoiler

/*** header: DIS generated macros ***/
/* &input1 will contain table registered as input to custom code transformation */
%let input1=work.customer_hist;

/*** body: custom code ***/
/* ensure macro variable &yearmon exists */
%macro cust_set_mvar(datevar=date);
	%if %symexist(yearmon)=0 %then %global yearmon;
	
	/* processing if &yearmon is empty */
	%if %nrbquote(&yearmon)=%nrbquote() %then
		%do;
			/* if target table customer_hist exists */
			%if %sysfunc(exist(&input1)) %then 
				%do;
					/* use latest date of already loaded data to populate &yearmon */
					/* - &yearmon = latest date plus 1 month                       */
					proc sql noprint;
						select put(intnx('month',max_date,1,'b'),yymmn6. -l) into :yearmon trimmed
						from (
							select max(&datevar) as max_date
							from &input1
						)
						;
					quit;
				%end;
		%end;
	/* set yearmon to 202101 if still empty because:  							*/
	/* - target table doesn't exist 																*/
	/* - or has zero rows             															*/
	/* - or all date variables are missing (shouldn't be possible) 	*/
	%if %nrbquote(&yearmon)=%nrbquote() %then %let yearmon=202101;
%mend;

/*** header: DIS generated macros ***/ /* &input1 will contain table registered as input to custom code transformation */ %let input1=work.customer_hist; /*** body: custom code ***/ /* ensure macro variable &yearmon exists */ %macro cust_set_mvar(datevar=date); %if %symexist(yearmon)=0 %then %global yearmon; /* processing if &yearmon is empty */ %if %nrbquote(&yearmon)=%nrbquote() %then %do; /* if target table customer_hist exists */ %if %sysfunc(exist(&input1)) %then %do; /* use latest date of already loaded data to populate &yearmon */ /* - &yearmon = latest date plus 1 month */ proc sql noprint; select put(intnx('month',max_date,1,'b'),yymmn6. -l) into :yearmon trimmed from ( select max(&datevar) as max_date from &input1 ) ; quit; %end; %end; /* set yearmon to 202101 if still empty because: */ /* - target table doesn't exist */ /* - or has zero rows */ /* - or all date variables are missing (shouldn't be possible) */ %if %nrbquote(&yearmon)=%nrbquote() %then %let yearmon=202101; %mend;

You could add code similar to above to your flow to ensure macro variable &yearmon gets created and is populated if it doesn't exist already.

For the one-off job for the initial load (your 2nd flow picture): Looks good to me. The loop will call flow 1. Make sure to define the loop in a way so control table variable ccyymm populates macro variable yearmon with the required string. And of course run the loop in sequence and not in parallel.
Because the sample code I shared above will only populate &yearmon if it doesn't exist already or is empty when running the inner job via loop &yearmon will get used as populated via the loop transformation.

And just some thoughts:
I'm always careful when it comes to designs which result in quickly growing tables. Consider how this data will get used and design your data model accordingly.
Customer data doesn't change that often and many times only the most recent customer information is required. It might be worth to consider loading your table using SCD2. Or eventually maintain a customer table that only stores the latest known information of active and inactive customers and store the monthly history tables separately. And if it was me I'd keep the history tables monthly the way you get it from source (i.e. customer_&yearmon) stored under a single folder. You then can always create a view for easy access to all the history months (or the last 12 months, or...).

Create a Customer Details dataset and load it every month

Re: Create a Customer Details dataset and load it every month

Re: Create a Customer Details dataset and load it every month

Re: Create a Customer Details dataset and load it every month

Re: Create a Customer Details dataset and load it every month

Re: Create a Customer Details dataset and load it every month

Re: Create a Customer Details dataset and load it every month

Re: Create a Customer Details dataset and load it every month

Re: Create a Customer Details dataset and load it every month

Re: Create a Customer Details dataset and load it every month

Re: Create a Customer Details dataset and load it every month

Re: Create a Customer Details dataset and load it every month

Registration is open

SAS Training: Just a Click Away