Hi Experts!
I have to extract information from a very huge table, and I do not know what is the best way to approach it.
Master_table
Date | C_ID | D_ID | Amount |
01/01/2000 | 1 | 3 | 567 |
02/02/2000 | 1 | 3 | 543 |
03/03/2000 | 1 | 3 | 789 |
01/01/2000 | 2 | 3 | 777 |
02/02/2000 | 2 | 3 | 987 |
03/03/2000 | 2 | 3 | 980 |
Master table have information about millions of Ids daily data.
My_List (900,000 unique ids)
C_ID | D_ID |
1 | 3 |
9 | 3 |
3 | 8 |
4 | 3 |
8 | 2 |
6 | 3 |
I want to extract all the information available from Master table for the ids in the my_list table.
I have the following code:
Data test1;
if 0 then set My_List;
IF _N_ = 1 THEN DO;
declare Hash H (dataset:'MY_LIST');
H.definekey('C_ID','D_ID');
H.DEFINEDATA('C_ID');
H.DEFINEDONE();
END;
SET Master_table (KEEP= C_ID D_ID AMOUNT);
IF H.FIND() = 0 THEN OUTPUT;
RUN;
It takes a very long time to run. Is there any way I can improve the code to run faster?
Thanks in advance.
Regards,
Myu
Here are three points that will likely speed up the process:
EDIT: Implement only point 1 and 3 🙂
data master_table;
input date : ddmmyy10. C_ID D_ID amount;
format date ddmmyy10.;
datalines;
01/01/2000 1 3 567
02/02/2000 1 3 543
03/03/2000 1 3 789
01/01/2000 2 3 777
02/02/2000 2 3 987
03/03/2000 2 3 980
;
data my_list;
input C_ID D_ID;
datalines;
1 3
9 3
3 8
4 3
8 2
6 3
;
data want;
if 0 then set my_list;
if _N_ = 1 then do;
declare hash h(dataset : "my_list", hashexp : 20); /* 1 */
h.definekey (all : "Y"); /* 2 */
h.definedone();
end;
set master_table(keep=C_ID D_ID amount);
if h.check() = 0; /* 3 */
run;
Here are three points that will likely speed up the process:
EDIT: Implement only point 1 and 3 🙂
data master_table;
input date : ddmmyy10. C_ID D_ID amount;
format date ddmmyy10.;
datalines;
01/01/2000 1 3 567
02/02/2000 1 3 543
03/03/2000 1 3 789
01/01/2000 2 3 777
02/02/2000 2 3 987
03/03/2000 2 3 980
;
data my_list;
input C_ID D_ID;
datalines;
1 3
9 3
3 8
4 3
8 2
6 3
;
data want;
if 0 then set my_list;
if _N_ = 1 then do;
declare hash h(dataset : "my_list", hashexp : 20); /* 1 */
h.definekey (all : "Y"); /* 2 */
h.definedone();
end;
set master_table(keep=C_ID D_ID amount);
if h.check() = 0; /* 3 */
run;
@PeterClemmensen wrote:
Here are three points that will likely speed up the process:
- Set hashexp : 20. This creates 2^20=1048576 binary search trees instead of 2^8=256 (default). This probably speeds up the search.
- Define the key portion only. Not the data portion of the hash object. You do not want to retrieve any data anyway.
- All you want to do is verify the existence of a given key in the hash object. Not retrieve any data from it. Therefore, use the Check() Method instead of the Find() Method.
@PeterClemmensen: Good idea to experiment with hashexp. Item 3 should definitely be implemented by @Myurathan. I'm skeptical about item 2, though: I think that would create even more (unneeded) data items, i.e., both C_ID and D_ID, by default. (Or has this default been changed recently?)
@FreelanceReinh, you're correct. I forgot that when specifying only the Key portion, all variables are read into the data portion as well.
Good catch! 🙂
@PeterClemmensen wrote:
@FreelanceReinh, you're correct. I forgot that when specifying only the Key portion, all variables are read into the data portion as well.
Good catch! 🙂
It should be possible to avoid that by using keep-Option in the declare-statement:
declare hash h(dataset : "my_list(keep=C_ID D_ID)", hashexp : 20);
What is the difference? My_List contains only C_ID and D_ID in the first place.
Alternatively, here is a temporary array approach. If the largest value of C_ID is less than the first dimension of the array and the largest value of D_ID is less than the second dimension of the array, this should be reasonably fast.
data want;
array id {9999, 9999} _temporary_;
do until (lr1);
set my_list end=lr1;
id[C_ID, D_ID] = 1;
end;
do until (lr2);
set master_table(keep=C_ID D_ID amount) end=lr2;
if id[C_ID, D_ID] then output;
end;
run;
Hi @Myurathan,
Are you aware of the library article Study on the best method to join two tables? Depending on your data (and other factors) it's possible that using a hash object is not the most efficient approach for your task. Note that if there are many observations (dates) per ID in the master table, your current code will look up the same key many times. The number of look-ups could be reduced substantially if the master table was sorted or indexed by C_ID D_ID (which would also be a prerequisite for some of the alternative approaches).
With your real data is Master_Table a SAS table or a database table?
What is "a very long time" and what is "huge"? - "Huge" in number of rows, variables and size of file in GB.
If it's a SAS table: Is it compressed or not?
If you just run below code does that execute significantly faster than your code with the hash key lookup (once changed using the check() method)?
data _null_; set master_table(keep=c_id d_id amount); run;
Eventually set options fullstimer; and post the generated log messages.
"It is a SQL Database table"
The SAS data step you wrote will need to pull all the data from the database into SAS. That's where you spend the time.
If the small table is in SAS then you will need to upload it into SQL Server (i.e. into a temporary table) and then write a SQL join which can fully execute on the database side to subset the data - and you then only pull the result set back into SAS for further processing.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.