proc sql;
  create table t1 as
  select *
  from t2
  where name like 'PNB%'
    and rn in (select distinct rn from t4)
    and datepart(DATE) <= '7Dec2002:00:00:00'd
  ;
quit;
I have a very big dataset, and this query takes forever to run; the connection is often disrupted and I have to wait a long time again.
Can you please suggest a better way to do this?
Table T2 is very big: the query takes almost an hour to complete if everything goes well, and I often need to restart it many times.
Thank you in advance
kajal
How big is big here? Are we talking millions/billions of rows or thousands?
Is the following slightly better? Do you have any indexes on the datasets or the ability to create any indexes?
proc sql;
  create table t1 as
  select *
  from t2
  where substr(name, 1, 3) = 'PNB'
    and rn in (select distinct rn from t4)
    and datepart(DATE) <= '7Dec2002'd
  ;
quit;
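If you can create indexes, an index on the lookup key can let SAS avoid a full scan when resolving the IN subquery. A minimal sketch, assuming the datasets live in WORK (as the posted code suggests) and that RN is a reasonably selective key:

```sas
/* Sketch only: library and key names taken from the question */
proc datasets library=work nolist;
  modify t2;
    index create rn;   /* simple index on the subquery key */
  modify t4;
    index create rn;
quit;
```

Whether the SQL optimizer actually uses the index depends on the selectivity of the filters, so check the log with the _method option after creating it.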
Thank you @Reeza, it takes almost the same time. Is it possible to convert this step into a hash join when we are selecting all columns?
Regards
Kajal
As stated by others, you need to provide more information about your actual data.
Please provide PROC CONTENTS output for both datasets.
If they are in an external RDBMS, provide the DDL together with the stats.
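To gather that information, something like the following would do (library assumed to be WORK, per the posted code):

```sas
/* Sketch: shows observation counts, variable types/lengths and dataset attributes */
proc contents data=work.t2 varnum; run;
proc contents data=work.t4 varnum; run;
```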
Add the following to your code:
options fullstimer msglevel=i;
proc sql _method;
/* In case of external data source */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;
As a rule of thumb:
- not using functions in the WHERE clause will speed up the query
- a join is faster than an IN (SELECT ...) clause (depending on the existence of an index and the cardinality of the subset)
data TEST;
  length A $16;
  do I=1 to 1e7;
    A=put(ranuni(1)*1e16, 16.0);
    output;
  end;
run;

proc sql;
  create table T as select * from TEST where A eqt '22';
quit;

proc sql;  * real time 1.46 seconds;
  create table S as
  select TEST.*
  from TEST
  where A in (select A from T);

proc sql;  * real time 0.81 seconds;
  create table S as
  select TEST.*
  from TEST, T
  where TEST.A=T.A;
quit;
Also, put the most discriminant filters first to reduce volume as soon as possible.
71   proc sql;
72   create table T as select * from TEST where A like '2%' and A like '222222%';
NOTE: Table WORK.T created, with 10 rows and 2 columns.
NOTE: PROCEDURE SQL used (Total process time):
      real time           1.42 seconds
73   proc sql;
74   create table T as select * from TEST where A like '222222%' and A like '2%';
NOTE: Table WORK.T created, with 10 rows and 2 columns.
75   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           1.26 seconds
Here are a couple of things that might improve performance a little bit.
Use truncated comparison instead of LIKE operator:
/* name like 'PNB%' */
name EQT 'PNB'
Eliminate the DATEPART() function call by comparing to a datetime constant instead:
/* datepart(DATE) <= '7Dec2002:00:00:00'd */
DATE < '08DEC2002:00:00:00'dt
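Putting both changes together, the rewritten query would look like this (a sketch; table and column names are taken from the original post):

```sas
proc sql;
  create table t1 as
  select *
  from t2
  where name eqt 'PNB'                     /* truncated comparison instead of LIKE */
    and rn in (select rn from t4)
    and DATE < '08DEC2002:00:00:00'dt;     /* no DATEPART() call per observation */
quit;
```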
But you say that T2 is a LARGE dataset. So unless you have indexes on the dataset that allow SAS to skip reading all of the observations, just performing those tests on every observation will take a long time.
In general, getting SAS to use an index is very hard when the tests are not equality tests. So SAS might not end up using the index even if it did exist.
Why would it "time out"? What is timing out? Are you using SAS Studio to run the code, and your connection to SAS Studio is timing out? Can you submit the step to run as a background ("batch") job instead?
Also, where is the data? Perhaps it is not the process of reading the data or performing the tests that is taking a long time; instead, you may simply be moving a lot of data across the network from one computer to another.
Is the dataset actually a table in some foreign database, like Oracle or Microsoft SQL Server? Can you explicitly filter the data on the foreign-database side and only move the resulting observations to the SAS server?
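If T2 does live in an external database, explicit pass-through keeps the filtering on the database side, so only the matching rows travel over the network. A hedged sketch, assuming a pre-assigned Oracle libref named ORALIB and an Oracle DATE column named DATE_COL (both names are placeholders, not from the original post):

```sas
proc sql;
  connect using oralib as db;            /* reuse the libref's existing connection */
  create table t1 as
  select * from connection to db
    ( select *                           /* this inner query runs inside Oracle */
      from t2
      where name like 'PNB%'
        and date_col < date '2002-12-08' );
  disconnect from db;
quit;
```

The inner query must be written in the database's own SQL dialect, which is why the date filter uses an Oracle DATE literal here.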
You need to tell us more about your tables. When people call a table large, some mean 100K rows and others mean 100M rows.
A SAS data step hash lookup would likely perform better. Points 1-3 below are for deciding whether your table t4 could fit into memory; point 4 is for deciding whether processing needs to get pushed to the database side (which a data step hash lookup wouldn't be).
Also, don't use select *; list the variables you need explicitly, like: select t2.name, t2.rn, t4.xxxx, ....
More information, please:
1. How big is "big"? How many observations, how many variables, and what is the observation size?
2. How large is t4? What is the type and size of rn? How many distinct values do you have?
3. Where are these datasets really stored? WORK (as your code suggests) or a permanent SAS library?
4. Or an external database?
Hash method:
data t1;
  set t2;
  where
    name like 'PNB%'
    and date < "08dec2002:00:00:00"dt  /* keeps the whole day of 7Dec2002, matching datepart(DATE) <= '7Dec2002'd */
  ;
  if _n_ = 1
  then do;
    declare hash t4 (dataset:"t4 (keep=rn)");
    t4.definekey("rn");
    t4.definedata("rn");
    t4.definedone();
  end;
  if t4.check() = 0;  /* CHECK returns 0 when rn is found, so this keeps only matches */
run;