Solved: Re: extract information from a very large table

Myurathan · Posted 07-20-2020 06:46 AM

Hi Experts!

I have to extract information from a very huge table, and I do not know what is the best way to approach it.

Master_table

Date	C_ID	D_ID	Amount
01/01/2000	1	3	567
02/02/2000	1	3	543
03/03/2000	1	3	789
01/01/2000	2	3	777
02/02/2000	2	3	987
03/03/2000	2	3	980

Master table have information about millions of Ids daily data.

My_List (900,000 unique ids)

C_ID	D_ID
1	3
9	3
3	8
4	3
8	2
6	3

I want to extract all the information available from Master table for the ids in the my_list table.

I have the following code:

Data test1;
if 0 then set My_List;
IF _N_ = 1 THEN DO;
	declare Hash H (dataset:'MY_LIST');
	H.definekey('C_ID','D_ID');
	H.DEFINEDATA('C_ID');
	H.DEFINEDONE();
	END;
SET Master_table (KEEP= C_ID D_ID AMOUNT);
IF H.FIND() = 0 THEN OUTPUT;
RUN;

It takes a very long time to run. Is there any way I can improve the code to run faster?

Thanks in advance.

Regards,

Myu

PeterClemmensen · Posted 07-20-2020 07:56 AM

Here are three points that will likely speed up the process:

Set hashexp : 20. This creates 2^20=1048576 binary search trees instead of 2^8=256 (default). This probably speeds up the search.
~~Define the key portion only. Not the data portion of the hash object. You do not want to retrieve any data anyway.~~
All you want to do is verify the existence of a given key in the hash object. Not retrieve any data from it. Therefore, use the Check() Method instead of the Find() Method.

EDIT: Implement only point 1 and 3 🙂

data master_table;
input date : ddmmyy10. C_ID D_ID amount;
format date ddmmyy10.;
datalines;
01/01/2000 1 3 567
02/02/2000 1 3 543
03/03/2000 1 3 789
01/01/2000 2 3 777
02/02/2000 2 3 987
03/03/2000 2 3 980
;

data my_list;
input C_ID D_ID;
datalines;
1 3
9 3
3 8
4 3
8 2
6 3
;

data want;
   if 0 then set my_list;
   if _N_ = 1 then do;
      declare hash h(dataset : "my_list", hashexp : 20);  /* 1 */
      h.definekey (all : "Y");                            /* 2 */
      h.definedone();
   end;

   set master_table(keep=C_ID D_ID amount);
 
   if h.check() = 0;                                      /* 3 */
run;

The DATA to DATA Step Macro
Blog: SASnrd

View solution in original post

PeterClemmensen · Posted 07-20-2020 07:56 AM

Here are three points that will likely speed up the process:

Set hashexp : 20. This creates 2^20=1048576 binary search trees instead of 2^8=256 (default). This probably speeds up the search.
~~Define the key portion only. Not the data portion of the hash object. You do not want to retrieve any data anyway.~~
All you want to do is verify the existence of a given key in the hash object. Not retrieve any data from it. Therefore, use the Check() Method instead of the Find() Method.

EDIT: Implement only point 1 and 3 🙂

data master_table;
input date : ddmmyy10. C_ID D_ID amount;
format date ddmmyy10.;
datalines;
01/01/2000 1 3 567
02/02/2000 1 3 543
03/03/2000 1 3 789
01/01/2000 2 3 777
02/02/2000 2 3 987
03/03/2000 2 3 980
;

data my_list;
input C_ID D_ID;
datalines;
1 3
9 3
3 8
4 3
8 2
6 3
;

data want;
   if 0 then set my_list;
   if _N_ = 1 then do;
      declare hash h(dataset : "my_list", hashexp : 20);  /* 1 */
      h.definekey (all : "Y");                            /* 2 */
      h.definedone();
   end;

   set master_table(keep=C_ID D_ID amount);
 
   if h.check() = 0;                                      /* 3 */
run;

The DATA to DATA Step Macro
Blog: SASnrd

FreelanceReinh · Posted 07-20-2020 08:59 AM

@PeterClemmensen wrote:

Here are three points that will likely speed up the process:

Set hashexp : 20. This creates 2^20=1048576 binary search trees instead of 2^8=256 (default). This probably speeds up the search.

Define the key portion only. Not the data portion of the hash object. You do not want to retrieve any data anyway.

All you want to do is verify the existence of a given key in the hash object. Not retrieve any data from it. Therefore, use the Check() Method instead of the Find() Method.

@PeterClemmensen: Good idea to experiment with hashexp. Item 3 should definitely be implemented by @Myurathan. I'm skeptical about item 2, though: I think that would create even more (unneeded) data items, i.e., both C_ID and D_ID, by default. (Or has this default been changed recently?)

PeterClemmensen · Posted 07-20-2020 09:17 AM

@FreelanceReinh, you're correct. I forgot that when specifying only the Key portion, all variables are read into the data portion as well.

Good catch! 🙂

The DATA to DATA Step Macro
Blog: SASnrd

andreas_lds · Posted 07-20-2020 09:34 AM

@PeterClemmensen wrote:

@FreelanceReinh, you're correct. I forgot that when specifying only the Key portion, all variables are read into the data portion as well.

Good catch! 🙂

It should be possible to avoid that by using keep-Option in the declare-statement:

declare hash h(dataset : "my_list(keep=C_ID D_ID)", hashexp : 20);

PeterClemmensen · Posted 07-20-2020 09:37 AM

What is the difference? My_List contains only C_ID and D_ID in the first place.

The DATA to DATA Step Macro
Blog: SASnrd

PeterClemmensen · Posted 07-20-2020 08:05 AM

Alternatively, here is a temporary array approach. If the largest value of C_ID is less than the first dimension of the array and the largest value of D_ID is less than the second dimension of the array, this should be reasonably fast.

data want;
   array id {9999, 9999} _temporary_;
   
   do until (lr1);
      set my_list end=lr1;
      id[C_ID, D_ID] = 1;
   end;

   do until (lr2);
      set master_table(keep=C_ID D_ID amount) end=lr2;
      if id[C_ID, D_ID] then output;
   end;
run;

The DATA to DATA Step Macro
Blog: SASnrd

FreelanceReinh · Posted 07-20-2020 08:54 AM

Hi @Myurathan,

Are you aware of the library article Study on the best method to join two tables? Depending on your data (and other factors) it's possible that using a hash object is not the most efficient approach for your task. Note that if there are many observations (dates) per ID in the master table, your current code will look up the same key many times. The number of look-ups could be reduced substantially if the master table was sorted or indexed by C_ID D_ID (which would also be a prerequisite for some of the alternative approaches).

Patrick · Posted 07-20-2020 10:02 AM

With your real data is Master_Table a SAS table or a database table?

What is "a very long time" and what is "huge"? - "Huge" in number of rows, variables and size of file in GB.

If it's a SAS table: Is it compressed or not?

If you just run below code does that execute significantly faster than your code with the hash key lookup (once changed using the check() method)?

data _null_;
  set master_table(keep=c_id d_id amount);
run;

Eventually set options fullstimer; and post the generated log messages.

Myurathan · Posted 07-20-2020 11:17 AM

Hi @Patrick,
Let me answer questions:
1. With your real data is Master_Table a SAS table or a database table?
It is a SQL Database table.

2. Long time means it is running for around 10 hrs without any success.
3. Size of the Master data is 1180GB (Considering indexes)

Thank you for your hep.

Patrick · Posted 07-20-2020 06:54 PM

"It is a SQL Database table"

The SAS data step you wrote will need to pull all the data from the database into SAS. That's where you spend the time.

If the small table is in SAS then you will need to upload it into SQL Server (i.e. into a temporary table) and then write a SQL join which can fully execute on the database side to subset the data - and you then only pull the result set back into SAS for further processing.

Classroom Training Available!