BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Myurathan
Quartz | Level 8

Hi Experts!

 

I have to extract information from a very huge table, and I do not know what is the best way to approach it. 

Master_table 

DateC_IDD_IDAmount
01/01/200013567
02/02/200013543
03/03/200013789
01/01/200023777
02/02/200023987
03/03/200023980

 

Master table have information about millions of Ids daily data. 

 

My_List (900,000 unique ids)

C_IDD_ID
13
93
38
43
82
63

I want to extract all the information available from Master table for the ids in the my_list table. 

I have the following code:

 

Data test1;
if 0 then set My_List;
IF _N_ = 1 THEN DO;
	declare Hash H (dataset:'MY_LIST');
	H.definekey('C_ID','D_ID');
	H.DEFINEDATA('C_ID');
	H.DEFINEDONE();
	END;
SET Master_table (KEEP= C_ID D_ID AMOUNT);
IF H.FIND() = 0 THEN OUTPUT;
RUN;

It takes a very long time to run. Is there any way I can improve the code to run faster?

 

Thanks in advance. 

 

Regards,

Myu

 

1 ACCEPTED SOLUTION

Accepted Solutions
PeterClemmensen
Tourmaline | Level 20

Here are three points that will likely speed up the process:

 

  1. Set hashexp : 20. This creates 2^20=1048576 binary search trees instead of 2^8=256 (default). This probably speeds up the search.
  2. Define the key portion only. Not the data portion of the hash object. You do not want to retrieve any data anyway.
  3. All you want to do is verify the existence of a given key in the hash object. Not retrieve any data from it. Therefore, use the Check() Method instead of the Find() Method.

 

 EDIT: Implement only point 1 and 3 🙂

 

 

data master_table;
input date : ddmmyy10. C_ID D_ID amount;
format date ddmmyy10.;
datalines;
01/01/2000 1 3 567
02/02/2000 1 3 543
03/03/2000 1 3 789
01/01/2000 2 3 777
02/02/2000 2 3 987
03/03/2000 2 3 980
;

data my_list;
input C_ID D_ID;
datalines;
1 3
9 3
3 8
4 3
8 2
6 3
;

data want;
   if 0 then set my_list;
   if _N_ = 1 then do;
      declare hash h(dataset : "my_list", hashexp : 20);  /* 1 */
      h.definekey (all : "Y");                            /* 2 */
      h.definedone();
   end;

   set master_table(keep=C_ID D_ID amount);
 
   if h.check() = 0;                                      /* 3 */
run;

 

View solution in original post

10 REPLIES 10
PeterClemmensen
Tourmaline | Level 20

Here are three points that will likely speed up the process:

 

  1. Set hashexp : 20. This creates 2^20=1048576 binary search trees instead of 2^8=256 (default). This probably speeds up the search.
  2. Define the key portion only. Not the data portion of the hash object. You do not want to retrieve any data anyway.
  3. All you want to do is verify the existence of a given key in the hash object. Not retrieve any data from it. Therefore, use the Check() Method instead of the Find() Method.

 

 EDIT: Implement only point 1 and 3 🙂

 

 

data master_table;
input date : ddmmyy10. C_ID D_ID amount;
format date ddmmyy10.;
datalines;
01/01/2000 1 3 567
02/02/2000 1 3 543
03/03/2000 1 3 789
01/01/2000 2 3 777
02/02/2000 2 3 987
03/03/2000 2 3 980
;

data my_list;
input C_ID D_ID;
datalines;
1 3
9 3
3 8
4 3
8 2
6 3
;

data want;
   if 0 then set my_list;
   if _N_ = 1 then do;
      declare hash h(dataset : "my_list", hashexp : 20);  /* 1 */
      h.definekey (all : "Y");                            /* 2 */
      h.definedone();
   end;

   set master_table(keep=C_ID D_ID amount);
 
   if h.check() = 0;                                      /* 3 */
run;

 

FreelanceReinh
Jade | Level 19

@PeterClemmensen wrote:

Here are three points that will likely speed up the process:

 

  1. Set hashexp : 20. This creates 2^20=1048576 binary search trees instead of 2^8=256 (default). This probably speeds up the search.
  2. Define the key portion only. Not the data portion of the hash object. You do not want to retrieve any data anyway.
  3. All you want to do is verify the existence of a given key in the hash object. Not retrieve any data from it. Therefore, use the Check() Method instead of the Find() Method.

@PeterClemmensen: Good idea to experiment with hashexp. Item 3 should definitely be implemented by @Myurathan. I'm skeptical about item 2, though: I think that would create even more (unneeded) data items, i.e., both C_ID and D_ID, by default. (Or has this default been changed recently?)

PeterClemmensen
Tourmaline | Level 20

@FreelanceReinh, you're correct. I forgot that when specifying only the Key portion, all variables are read into the data portion as well. 

 

Good catch! 🙂

andreas_lds
Jade | Level 19

@PeterClemmensen wrote:

@FreelanceReinh, you're correct. I forgot that when specifying only the Key portion, all variables are read into the data portion as well. 

 

Good catch! 🙂


It should be possible to avoid that by using keep-Option in the declare-statement:

declare hash h(dataset : "my_list(keep=C_ID D_ID)", hashexp : 20);
PeterClemmensen
Tourmaline | Level 20

What is the difference? My_List contains only C_ID and D_ID in the first place.

PeterClemmensen
Tourmaline | Level 20

Alternatively, here is a temporary array approach. If the largest value of C_ID is less than the first dimension of the array and the largest value of D_ID is less than the second dimension of the array, this should be reasonably fast.

 

data want;
   array id {9999, 9999} _temporary_;
   
   do until (lr1);
      set my_list end=lr1;
      id[C_ID, D_ID] = 1;
   end;

   do until (lr2);
      set master_table(keep=C_ID D_ID amount) end=lr2;
      if id[C_ID, D_ID] then output;
   end;
run;
FreelanceReinh
Jade | Level 19

Hi @Myurathan,

 

Are you aware of the library article Study on the best method to join two tables? Depending on your data (and other factors) it's possible that using a hash object is not the most efficient approach for your task. Note that if there are many observations (dates) per ID in the master table, your current code will look up the same key many times. The number of look-ups could be reduced substantially if the master table was sorted or indexed by C_ID D_ID (which would also be a prerequisite for some of the alternative approaches).

Patrick
Opal | Level 21

With your real data is Master_Table a SAS table or a database table?

 

What is "a very long time" and what is "huge"? - "Huge" in number of rows, variables and size of file in GB.

If it's a SAS table: Is it compressed or not?

 

If you just run below code does that execute significantly faster than your code with the hash key lookup (once changed using the check() method)?

data _null_;
  set master_table(keep=c_id d_id amount);
run;

Eventually set options fullstimer; and post the generated log messages.

Myurathan
Quartz | Level 8
Hi @Patrick,
Let me answer questions:
1. With your real data is Master_Table a SAS table or a database table?
It is a SQL Database table.

2. Long time means it is running for around 10 hrs without any success.
3. Size of the Master data is 1180GB (Considering indexes)

Thank you for your hep.
Patrick
Opal | Level 21

"It is a SQL Database table"

The SAS data step you wrote will need to pull all the data from the database into SAS. That's where you spend the time.

If the small table is in SAS then you will need to upload it into SQL Server (i.e. into a temporary table) and then write a SQL join which can fully execute on the database side to subset the data - and you then only pull the result set back into SAS for further processing.

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 2545 views
  • 7 likes
  • 5 in conversation