Hi,
Any idea how I can make the code below more efficient with the ability to work for multiple months?
At the moment this works for 3 additional months, but if I want 36 additional this will get messy quickly!
The first step runs a base dataset and then merges on the next 3 subsequent months.
%macro months(m0,m1,m2,m3);
data month_&m0.;
set X.full_pop_&m0.
where obs_ref ne 'BRE';
run;
proc sql;
create table month_&m0. as
select
a.*,
b.obs_sia as obs_sia_m1,
b.obs_final_predef_flag as obs_final_predef_flag_m1,
c.obs_sia as obs_sia_m2,
c.obs_final_predef_flag as obs_final_predef_flag_m2,
d.obs_sia as obs_sia_m3,
d.obs_final_predef_flag as obs_final_predef_flag_m3
from
month_&m0. a
left join
X.full_pop_&m1. b
on a.ID=b.ID
left join
X.full_pop_&m2. c
on a.ID=c.ID
left join
X.full_pop_&m3. d
on a.ID=d.ID
;
run;
%mend months;
%months(201903,201904,201905,201906);
%months(201902,201903,201904,201905);
Thanks in advance,
Pete
I don't know if it is possible to join 36 different data sets using PROC SQL (or even if there is a limit), but ignoring that, you need a loop inside the macro.
UNTESTED CODE
%macro months(months=);
%let m1=%scan(&months,1,%str( ));
data month_&m1.;
set X.full_pop_&m1.
where obs_ref ne 'BRE';
run;
proc sql;
create table month_&m1 as select a.*,
%do i=2 %to %sysfunc(countw(&months));
%let thismonth=full_pop_%scan(&months,&i,%str( ));
&thismonth..obs_sia as obs_sia_m&i,
&thismonth..obs_final_predef_flag as obs_final_predef_flag_m&i
%if &i<%sysfunc(countw(&months)) %then %str(,);
%end;
from month_&m1 a
%do i=2 %to %sysfunc(countw(&months));
%let thismonth=full_pop_%scan(&months,&i,%str( ));
left join X.&thismonth on a.ID=&thismonth..ID
%end;
quit;
%mend months;
%months(months=201902 201903 201904 201905)
Of course, it would be much easier (if possible) not to have the month data split up into data sets by months in the first place. I realize that if you have no control over the format of the input, then the above macro ought to work; but really don't split the data up by month if you have that control. Or, depending on what you intend to do with all of this data, you would be much better off creating a long format data set (which would be much easier to do) rather than this wide data set that you are creating (in fact if you create the long data set and absolutely must have the wide data set, you could create create the wide data set using PROC TRANSPOSE). So, really, ignore the macro above and take advantage of the advice in this paragraph.
This bit of your code
a.*, b.obs_sia as obs_sia_m1, b.obs_final_predef_flag as obs_final_predef_flag_m1, c.obs_sia as obs_sia_m2, c.obs_final_predef_flag as obs_final_predef_flag_m2, d.obs_sia as obs_sia_m3, d.obs_final_predef_flag as obs_final_predef_flag_m3
Indicates to me that you are making a wide data set. There are very few times when that is actually an optimal form of the data for use. For most purposes you will be better off just appending data together and having variables to indicate the actual month (or what ever) is needed. Then use BY group processing to do work grouped by months.
Can you explain why you are making a wide data set? As in how will it be used.
Sure.
The dataset will not be too wide. If I am merging up to 36 times, it will only be 72 additional variables (36*2).
Unfortunately I have no control over how the monthly datasets are created.
For each month I plan to use the flags to say:
if obs_sia_m1=1 or obs_sia_m2=1 or obs_sia_m3=1 or....obs_sia_m12=1 then obs_sia_12=1; else obs_sia_12=0;
@PetePatel wrote:
Sure.
The dataset will not be too wide. If I am merging up to 36 times, it will only be 72 additional variables (36*2).
Unfortunately I have no control over how the monthly datasets are created.
For each month I plan to use the flags to say:
if obs_sia_m1=1 or obs_sia_m2=1 or obs_sia_m3=1 or....obs_sia_m12=1 then obs_sia_12=1; else obs_sia_12=0;
This seems to be a case where you have decided that wide is the proper format, but we are questioning if that is really necessary, especially since a long data set is easier to create and easier to manipulate. What do you plan to do after you create this wide data set — what analysis? what plots? what reports?
@PetePatel wrote:
Sure.
The dataset will not be too wide. If I am merging up to 36 times, it will only be 72 additional variables (36*2).
Unfortunately I have no control over how the monthly datasets are created.
For each month I plan to use the flags to say:
if obs_sia_m1=1 or obs_sia_m2=1 or obs_sia_m3=1 or....obs_sia_m12=1 then obs_sia_12=1; else obs_sia_12=0;
Gack!
One of the reasons not to go this way. That line of code would need to change for every different number of files "merged".
It's not a matter of "too" wide. Wide is always less efficient than long, as it necessitates dynamic code. Long can be handled with static code where the data drives processing automatically.
1. Stack instead of side by side merge
2. Transpose - doesn't matter how many data sets or numbers with this approach.
Untested but this should work. You may need to adjust for other factors in your process.
It's cleaner than a macro IMO.
data long;
set full_pop_201904 - full_pop_201906 indsname = source;
dsn = source;
*you could grab the month/year from the data set name;
month = scan(source, -1, "_");
run;
proc sort data=long;
by ID;
run;
proc transpose data=long out=wide prefix = obs_sia_m;
by ID;
var final_predef_flag;
id month;
run;
Thanks for this Reeza.
This feels a lot more efficient. Let me try to explain in more detail so you can visualise the problem.
Volumes:
X.full_pop_201903 contains 250k rows and 1000 variables
X.full_pop_201904 contains 220k rows and 1000 variables
X.full_pop_201905 contains 260k rows and 1000 variables
X.full_pop_201906 contains 230k rows and 1000 variables
X.full_pop_201903 contains 250k rows and 1000 variables, but I only need 5 of these (id var hlr jkl clo month). This is my base dataset.
I then need to left join the next 3 subsequent datasets X.full_pop_201904, X.full_pop_201905 and X.full_pop_201906 by ID keeping only the 5 variables I require appending variable names by 1 each time.
So for example X.full_pop_201904 will have (var_1 hlr_1 jkl_1 clo_1 month_1) and X.full_pop_201906 will have (var_3 hlr_3 jkl_3 clo_3 month_3).
So my final dataset should contain 250k rows and include ID plus an additional 20 variables (5*4) as below:
ID
var
hlr
jkl
clo
month
var_1
hlr_1
jkl_1
clo_1
month_1
var_2
hlr_2
jkl_2
clo_2
month_2
var_3
hlr_3
jkl_3
clo_3
month_3
I then plan to do some calculations between the 5 variables kept. For example if hlr=1 and (hlr_1=0 or hlr_2=0 or hlr_3=0) then flag=1;
I need to then repeat this in some sort of macro for many more base datasets, each time merging with the subsequent 3 months.
In the end I can put into a proc summary and get results of my calculations for each month.
How can this be possible using your transpose technique below?
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.