Re: SQL Merge monthly datasets

PetePatel · Posted 12-27-2019 08:56 AM

Hi,

Any idea how I can make the code below more efficient with the ability to work for multiple months?

At the moment this works for 3 additional months, but if I want 36 additional this will get messy quickly!

The first step runs a base dataset and then merges on the next 3 subsequent months.

%macro months(m0,m1,m2,m3);

data month_&m0.;
set X.full_pop_&m0.

where obs_ref ne 'BRE';
run;

proc sql;
create table month_&m0. as

select

a.*,
b.obs_sia as obs_sia_m1,
b.obs_final_predef_flag as obs_final_predef_flag_m1,

c.obs_sia as obs_sia_m2,
c.obs_final_predef_flag as obs_final_predef_flag_m2,

d.obs_sia as obs_sia_m3,
d.obs_final_predef_flag as obs_final_predef_flag_m3

from
month_&m0. a

left join
X.full_pop_&m1. b
on a.ID=b.ID

left join
X.full_pop_&m2. c
on a.ID=c.ID

left join
X.full_pop_&m3. d
on a.ID=d.ID

;
run;

%mend months;

%months(201903,201904,201905,201906);

%months(201902,201903,201904,201905);

Thanks in advance,

Pete

PaigeMiller · Posted 12-27-2019 09:18 AM

I don't know if it is possible to join 36 different data sets using PROC SQL (or even if there is a limit), but ignoring that, you need a loop inside the macro.

UNTESTED CODE

%macro months(months=);
    %let m1=%scan(&months,1,%str( ));
    data month_&m1.;
        set X.full_pop_&m1.
        where obs_ref ne 'BRE';
    run;

    proc sql;
        create table month_&m1 as select a.*,
        %do i=2 %to %sysfunc(countw(&months));
            %let thismonth=full_pop_%scan(&months,&i,%str( ));
            &thismonth..obs_sia as obs_sia_m&i,
            &thismonth..obs_final_predef_flag as obs_final_predef_flag_m&i
            %if &i<%sysfunc(countw(&months)) %then %str(,);
        %end;
        from month_&m1 a
        %do i=2 %to %sysfunc(countw(&months));
            %let thismonth=full_pop_%scan(&months,&i,%str( ));
            left join X.&thismonth on a.ID=&thismonth..ID
        %end;
    quit;

%mend months;

%months(months=201902 201903 201904 201905)

Of course, it would be much easier (if possible) not to have the month data split up into data sets by months in the first place. I realize that if you have no control over the format of the input, then the above macro ought to work; but really don't split the data up by month if you have that control. Or, depending on what you intend to do with all of this data, you would be much better off creating a long format data set (which would be much easier to do) rather than this wide data set that you are creating (in fact if you create the long data set and absolutely must have the wide data set, you could create create the wide data set using PROC TRANSPOSE). So, really, ignore the macro above and take advantage of the advice in this paragraph.

--
Paige Miller

ballardw · Posted 12-27-2019 10:31 AM

This bit of your code

a.*,
b.obs_sia as obs_sia_m1,
b.obs_final_predef_flag as obs_final_predef_flag_m1,

c.obs_sia as obs_sia_m2,
c.obs_final_predef_flag as obs_final_predef_flag_m2,

d.obs_sia as obs_sia_m3,
d.obs_final_predef_flag as obs_final_predef_flag_m3

Indicates to me that you are making a wide data set. There are very few times when that is actually an optimal form of the data for use. For most purposes you will be better off just appending data together and having variables to indicate the actual month (or what ever) is needed. Then use BY group processing to do work grouped by months.

Can you explain why you are making a wide data set? As in how will it be used.

PetePatel · Posted 12-27-2019 10:45 AM

Sure.

The dataset will not be too wide. If I am merging up to 36 times, it will only be 72 additional variables (36*2).

Unfortunately I have no control over how the monthly datasets are created.

For each month I plan to use the flags to say:

if obs_sia_m1=1 or obs_sia_m2=1 or obs_sia_m3=1 or....obs_sia_m12=1 then obs_sia_12=1; else obs_sia_12=0;

PaigeMiller · Posted 12-27-2019 11:14 AM

@PetePatel wrote:

Sure.

The dataset will not be too wide. If I am merging up to 36 times, it will only be 72 additional variables (36*2).

Unfortunately I have no control over how the monthly datasets are created.

For each month I plan to use the flags to say:

if obs_sia_m1=1 or obs_sia_m2=1 or obs_sia_m3=1 or....obs_sia_m12=1 then obs_sia_12=1; else obs_sia_12=0;

This seems to be a case where you have decided that wide is the proper format, but we are questioning if that is really necessary, especially since a long data set is easier to create and easier to manipulate. What do you plan to do after you create this wide data set — what analysis? what plots? what reports?

--
Paige Miller

ballardw · Posted 12-27-2019 11:16 AM

@PetePatel wrote:

Sure.

The dataset will not be too wide. If I am merging up to 36 times, it will only be 72 additional variables (36*2).

Unfortunately I have no control over how the monthly datasets are created.

For each month I plan to use the flags to say:

if obs_sia_m1=1 or obs_sia_m2=1 or obs_sia_m3=1 or....obs_sia_m12=1 then obs_sia_12=1; else obs_sia_12=0;

Gack!

One of the reasons not to go this way. That line of code would need to change for every different number of files "merged".

Kurt_Bremser · Posted 12-28-2019 12:54 PM

It's not a matter of "too" wide. Wide is always less efficient than long, as it necessitates dynamic code. Long can be handled with static code where the data drives processing automatically.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Reeza · Posted 12-27-2019 10:49 AM

1. Stack instead of side by side merge

2. Transpose - doesn't matter how many data sets or numbers with this approach.

Untested but this should work. You may need to adjust for other factors in your process.

It's cleaner than a macro IMO.


data long;
set full_pop_201904 - full_pop_201906 indsname = source;
dsn = source;
*you could grab the month/year from the data set name;
month = scan(source, -1, "_");
run;

proc sort data=long;
by ID;
run;

proc transpose data=long out=wide prefix = obs_sia_m;
by ID;
var final_predef_flag;
id month;
run;

PetePatel · Posted 12-30-2019 10:11 AM

Thanks for this Reeza.

This feels a lot more efficient. Let me try to explain in more detail so you can visualise the problem.

Volumes:

X.full_pop_201903 contains 250k rows and 1000 variables

X.full_pop_201904 contains 220k rows and 1000 variables

X.full_pop_201905 contains 260k rows and 1000 variables

X.full_pop_201906 contains 230k rows and 1000 variables

X.full_pop_201903 contains 250k rows and 1000 variables, but I only need 5 of these (id var hlr jkl clo month). This is my base dataset.

I then need to left join the next 3 subsequent datasets X.full_pop_201904, X.full_pop_201905 and X.full_pop_201906 by ID keeping only the 5 variables I require appending variable names by 1 each time.

So for example X.full_pop_201904 will have (var_1 hlr_1 jkl_1 clo_1 month_1) and X.full_pop_201906 will have (var_3 hlr_3 jkl_3 clo_3 month_3).

So my final dataset should contain 250k rows and include ID plus an additional 20 variables (5*4) as below:

ID

var

hlr

jkl

clo

month

var_1

hlr_1

jkl_1

clo_1

month_1

var_2

hlr_2

jkl_2

clo_2

month_2

var_3

hlr_3

jkl_3

clo_3

month_3

I then plan to do some calculations between the 5 variables kept. For example if hlr=1 and (hlr_1=0 or hlr_2=0 or hlr_3=0) then flag=1;

I need to then repeat this in some sort of macro for many more base datasets, each time merging with the subsequent 3 months.

In the end I can put into a proc summary and get results of my calculations for each month.

How can this be possible using your transpose technique below?

Reeza · Posted 01-06-2020 01:06 PM

You append the data sets. Use your BY in the summaries to get the results together. You may need to change logic to work across rows rather than columns or you transpose after stacking. It doesn't have to replace the full process but should easily simplify the merge to a trivial task.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away