BookmarkSubscribeRSS Feed
PetePatel
Quartz | Level 8

Hi,

 

Any idea how I can make the code below more efficient with the ability to work for multiple months?

 

At the moment this works for 3 additional months, but if I want 36 additional this will get messy quickly!

 

The first step runs a base dataset and then merges on the next 3 subsequent months.

 

%macro months(m0,m1,m2,m3);

data month_&m0.;
set X.full_pop_&m0.

where obs_ref ne 'BRE';
run;

proc sql;
create table month_&m0. as

select

a.*,
b.obs_sia as obs_sia_m1,
b.obs_final_predef_flag as obs_final_predef_flag_m1,

c.obs_sia as obs_sia_m2,
c.obs_final_predef_flag as obs_final_predef_flag_m2,

d.obs_sia as obs_sia_m3,
d.obs_final_predef_flag as obs_final_predef_flag_m3

from
month_&m0. a

left join
X.full_pop_&m1. b
on a.ID=b.ID

left join
X.full_pop_&m2. c
on a.ID=c.ID

left join
X.full_pop_&m3. d
on a.ID=d.ID

;
run;

%mend months;

%months(201903,201904,201905,201906);

%months(201902,201903,201904,201905);

 

Thanks in advance,

Pete

9 REPLIES 9
PaigeMiller
Diamond | Level 26

I don't know if it is possible to join 36 different data sets using PROC SQL (or even if there is a limit), but ignoring that, you need a loop inside the macro.

 

UNTESTED CODE

 

%macro months(months=);
    %let m1=%scan(&months,1,%str( ));
    data month_&m1.;
        set X.full_pop_&m1.
        where obs_ref ne 'BRE';
    run;

    proc sql;
        create table month_&m1 as select a.*,
        %do i=2 %to %sysfunc(countw(&months));
            %let thismonth=full_pop_%scan(&months,&i,%str( ));
            &thismonth..obs_sia as obs_sia_m&i,
            &thismonth..obs_final_predef_flag as obs_final_predef_flag_m&i
            %if &i<%sysfunc(countw(&months)) %then %str(,);
        %end;
        from month_&m1 a
        %do i=2 %to %sysfunc(countw(&months));
            %let thismonth=full_pop_%scan(&months,&i,%str( ));
            left join X.&thismonth on a.ID=&thismonth..ID
        %end;
    quit;

%mend months;

%months(months=201902 201903 201904 201905)

Of course, it would be much easier (if possible) not to have the month data split up into data sets by months in the first place. I realize that if you have no control over the format of the input, then the above macro ought to work; but really don't split the data up by month if you have that control. Or, depending on what you intend to do with all of this data, you would be much better off creating a long format data set (which would be much easier to do) rather than this wide data set that you are creating (in fact if you create the long data set and absolutely must have the wide data set, you could create create the wide data set using PROC TRANSPOSE). So, really, ignore the macro above and take advantage of the advice in this paragraph.

--
Paige Miller
ballardw
Super User

This bit of your code

a.*,
b.obs_sia as obs_sia_m1,
b.obs_final_predef_flag as obs_final_predef_flag_m1,

c.obs_sia as obs_sia_m2,
c.obs_final_predef_flag as obs_final_predef_flag_m2,

d.obs_sia as obs_sia_m3,
d.obs_final_predef_flag as obs_final_predef_flag_m3

Indicates to me that you are making a wide data set. There are very few times when that is actually an optimal form of the data for use. For most purposes you will be better off just appending data together and having variables to indicate the actual month (or what ever) is needed. Then use BY group processing to do work grouped by months.

 

Can you  explain why you are making a wide data set? As in how will it be used.

PetePatel
Quartz | Level 8

Sure.

 

The dataset will not be too wide. If I am merging up to 36 times, it will only be 72 additional variables (36*2).

 

Unfortunately I have no control over how the monthly datasets are created.

 

For each month I plan to use the flags to say:

if obs_sia_m1=1 or obs_sia_m2=1 or obs_sia_m3=1 or....obs_sia_m12=1 then obs_sia_12=1; else obs_sia_12=0;

 

PaigeMiller
Diamond | Level 26

@PetePatel wrote:

Sure.

 

The dataset will not be too wide. If I am merging up to 36 times, it will only be 72 additional variables (36*2).

 

Unfortunately I have no control over how the monthly datasets are created.

 

For each month I plan to use the flags to say:

if obs_sia_m1=1 or obs_sia_m2=1 or obs_sia_m3=1 or....obs_sia_m12=1 then obs_sia_12=1; else obs_sia_12=0;

 


This seems to be a case where you have decided that wide is the proper format, but we are questioning if that is really necessary, especially since a long data set is easier to create and easier to manipulate. What do you plan to do after you create this wide data set — what analysis? what plots? what reports?

--
Paige Miller
ballardw
Super User

@PetePatel wrote:

Sure.

 

The dataset will not be too wide. If I am merging up to 36 times, it will only be 72 additional variables (36*2).

 

Unfortunately I have no control over how the monthly datasets are created.

 

For each month I plan to use the flags to say:

if obs_sia_m1=1 or obs_sia_m2=1 or obs_sia_m3=1 or....obs_sia_m12=1 then obs_sia_12=1; else obs_sia_12=0;

 


Gack!

 

One of the reasons not to go this way. That line of code would need to change for every different number of files "merged".

Kurt_Bremser
Super User

It's not a matter of "too" wide. Wide is always less efficient than long, as it necessitates dynamic code. Long can be handled with static code where the data drives processing automatically.

Reeza
Super User

1. Stack instead of side by side merge

2. Transpose - doesn't matter how many data sets or numbers with this approach. 

 

Untested but this should work. You may need to adjust for other factors in your process. 

It's cleaner than a macro IMO.

 


data long;
set full_pop_201904 - full_pop_201906 indsname = source;
dsn = source;
*you could grab the month/year from the data set name;
month = scan(source, -1, "_");
run;

proc sort data=long;
by ID;
run;

proc transpose data=long out=wide prefix = obs_sia_m;
by ID;
var final_predef_flag;
id month;
run;

PetePatel
Quartz | Level 8

Thanks for this Reeza.

 

This feels a lot more efficient. Let me try to explain in more detail so you can visualise the problem.

 

Volumes:

X.full_pop_201903 contains 250k rows and 1000 variables

X.full_pop_201904 contains 220k rows and 1000 variables

X.full_pop_201905 contains 260k rows and 1000 variables

X.full_pop_201906 contains 230k rows and 1000 variables

 

X.full_pop_201903 contains 250k rows and 1000 variables, but I only need 5 of these (id var hlr jkl clo month). This is my base dataset.

 

I then need to left join the next 3 subsequent datasets X.full_pop_201904, X.full_pop_201905 and X.full_pop_201906 by ID keeping only the 5 variables I require appending variable names by 1 each time.

 

So for example X.full_pop_201904 will have (var_1 hlr_1 jkl_1 clo_1 month_1) and X.full_pop_201906 will have (var_3 hlr_3 jkl_3 clo_3 month_3).

 

So my final dataset should contain 250k rows and include ID plus an additional 20 variables (5*4) as below:

 

ID

var

hlr

jkl

clo

month

var_1

hlr_1

jkl_1

clo_1

month_1

var_2

hlr_2

jkl_2

clo_2

month_2

var_3

hlr_3

jkl_3

clo_3

month_3

 

I then plan to do some calculations between the 5 variables kept. For example if hlr=1 and (hlr_1=0 or hlr_2=0 or hlr_3=0) then flag=1;

 

I need to then repeat this in some sort of macro for many more base datasets, each time merging with the subsequent 3 months.

 

In the end I can put into a proc summary and get results of my calculations for each month.

 

How can this be possible using your transpose technique below?

 

 

 

 

Reeza
Super User
You append the data sets. Use your BY in the summaries to get the results together. You may need to change logic to work across rows rather than columns or you transpose after stacking. It doesn't have to replace the full process but should easily simplify the merge to a trivial task.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1292 views
  • 4 likes
  • 5 in conversation