Re: Left join on mutltiiple tables using loop

Dozo92i · Posted 03-02-2018 06:44 AM

Hello everyone,

I did a proc sql left join on many tables using a loop but it took too many time to perform since I got pretty big tables, so I wondered if there is a better way to do it. I got 12 tables with same variable ID and other numbered variables :

Table 1 has ID, A_1,B_1,C_1 as variables

Table 2 has ID, A_2, B_2, C_2 as variables ...

Table i has ID, A_i ,B_i, C_i as variables for i from 3 to 12.

I would like to have a single table which is the result of a left join of all those tables : ID,A_1,B_1,C_1, A_2,B_2,C_2....A_12,B_12_C12 . Down below is an example of a macro i use for it.

%macro join();

%do j= 2 %to 12 ;

proc sql;
create table table1 as select
a.*,
b.*
from table1 as a left join table&j as b
on a.id=b.id ;
quit;

%end;

%mend;

Thiss gives me a result but it took too long if I have huge tables( around 20 millions of rows).

Thanks in advance.

Kurt_Bremser · Posted 03-02-2018 07:15 AM

What are the relationships between your tables with regard to id?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Dozo92i · Posted 03-02-2018 10:27 AM

Let's suppose that the ID identifies a unique customer and :

- A_i represents the amount the customer should pay back for a car loan,

- B_ i represents the amount the customer should pay back for a house loan

- C_ i represents the amount the customer should pay back for leisure loan

" i " is the number of the month (i=1 corresponds to January data, i=2 corresponds to February data and so on ). It means " table_i " are monthly historical tables that I want to gather into a unique dataset for a modeling purpose.

Thanks.

RW9 · Posted 03-02-2018 10:47 AM

Then the data should be modeled in a way that allow simple variable naming, yet powerful storage:

ID LOAN_TYPE TIME_POINT VALUE

1 Car dd-mmm-yyyy xxx.xx

...

With the above structure you can have one dataset, with simple metadata - makes programming far simpler - minimum storage requirements (transposed data often has large quantities of blank data cells), and if need be at any point can be transposed up into the way you have it now. There are no downsides to normalised data, only pluses. Note also I put date in rather than just month thus avoiding nasty year issues, and allowing use of all the date functions.

Dozo92i · Posted 03-02-2018 11:14 AM

Actually I have no power to modify the way those datas are modeled. Your modeling is smart but I think it would lead to very huge tables if for example we stored 10 or even 30 years of mothly historical datas. I wish I have the datas modeled the way you mentioned for my current task...

RW9 · Posted 03-02-2018 07:55 AM

Posting test data and required output sure helps get a point across. From what you post I don't see why you do not use Base SAS merge, as this will make it far simpler for you:

data want;
  merge table:;
  by id;
run;

Note, I have not tested this, so not sure if the : actually works in this context, you may have to put table1 table2 etc. Still be simpler coding than using sql.

Dozo92i · Posted 03-02-2018 10:54 AM

Thanks for taking the time to answer.

Actually the problem is a bit difficult to explain that's why I use a simplified example.

I'll give a try to your suggestion to see if it improves the processing time.

Let's consider 7 tables (It could be more, like 12 as i mentioned in my post)

DATA table1;
INPUT id A1 B1 C1;
DATALINES;
1 91 11 212
2 92 12 222
3 93 13 233
4 94 14 242
RUN;

DATA table2;
INPUT id A2 B2 C2;
DATALINES;
1 81 11 112
2 82 12 122
3 83 13 133
4 84 14 142
RUN;

DATA table3;
INPUT id A3 B3 C3;
DATALINES;
1 71 10 111
2 72 11 121
3 73 12 131
4 74 13 141
RUN;

DATA table4;
INPUT id A4 B4 C4;
DATALINES;
1 61 9 110
2 62 8 120
3 63 11 130
4 64 12 140
RUN;

DATA table5;
INPUT id A5 B5 C5;
DATALINES;
1 60 8 109
2 61 7 119
3 62 10 129
4 63 11 139
RUN;

DATA table6;
INPUT id A6 B6 C6;
DATALINES;
1 59 7 108
2 60 6 118
3 61 9 128
4 62 10 138
RUN;

DATA table7;
INPUT id A7 B7 C7;
DATALINES;
1 50 6 102
2 55 7 110
3 56 8 111
4 53 9 120
RUN;

%macro join();

%do j= 2 %to 7 ;

proc sql;
create table table1 as select
a.*,
b.*
from table1 as a left join table&j as b
on a.id=b.id ;
quit;

%end;

%mend;

%join()

The result should look like

id A1 B1 C1 A2 B2 C2 A3 B3 C3 A4 B4 C4 A5 B5 C5 A6 B6 C6 A7 B7 C7
1 91 11 212 81 11 112 71 10 111 61 9 110 60 8 109 59 7 108 50 6 102
2 92 12 222 82 12 122 72 11 121 62 8 120 61 7 119 60 6 118 55 7 110
3 93 13 233 83 13 133 73 12 131 63 11 130 62 10 129 61 9 128 56 8 111
4 94 14 242 84 14 142 74 13 141 64 12 140 63 11 139 62 10 138 53 9 120

Kurt_Bremser · Posted 03-02-2018 01:25 PM

Given that your example data show a 1:1 relationship all the way through, a simple data step merge will do it:

data want;
merge
  have1
  have2
  have3
  have4
  have5
  have6
  have7
;
by id;
run;

Once that works, you can build a list of dataset names in a data step, use call symput() to store it in a macro variable, and use that in the merge statement.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ballardw · Posted 03-02-2018 01:33 PM

And using a list such as @RW9 will work with merge.

data want;

merge table: ;

by id;

run;

so if your sets are nicely named, or even groups of "nice" names lists will work.

The "time" would come from large sets. It does take a certain amount of time to push lots of electrons around but I believe the overhead for a sing data step and merge is going to be less than multiple calls to Proc Sql.

Dozo92i · Posted 03-20-2018 10:28 AM

Hi everyone, thanks for all your answers,

@RW9 @Kurt_Bremser @ballardw I used "merge" to join all the tables but "merge" requires sorting first all those tables and that takes too much time since the tables are pretty big. A trick I used to Left join all my tables is to proceed as followed :

- I first do for all the tables a left join on the id of the first table ( Actually, I wanted to left join all the tables with respect to the table1's id ). I overwrite " table_j " for j from 2 to 12 :

%macro test();

%do j= 2 %to 12 ;

proc sql;

create table table&j as

select * from table1(keep=id) as a left join table&j as b on a.id=b.id;

quit;

%end;

%mend;

- secondly I just use a data step with many sets :

data final_table;
set table1;
set table2;
set table3;
set table4;
set table5;
set table6;
set table7;
set table8;
set table9;
set table10;
set table11;
set table12;
run;

This works fine and reduce processing time but the problem is that the data step doesn't ensure to match the rigth id from one table number "j" to another.

Kurt_Bremser · Posted 03-20-2018 10:43 AM

To think that a sort is not needed with SQL is a mistake. SQL just does the sort on its own, and if table1 is not sorted already, SQL will sort it 12 times, and tables 2 to 12 each once, so you get 23 sorts.

Sorting all tables and doing a data step merge will need only 12 sorts.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Dozo92i · Posted 03-20-2018 01:06 PM

Glad to know that.

Thanks !

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away