BookmarkSubscribeRSS Feed
kajal_30
Quartz | Level 8

 

proc sql;
	create table t1
	as
	select * 
	from 
	t2
	where 
	name like 'PNB%'
	and rn in 
	(select distinct rn from 
	t4)
	and 
	datepart(DATE) <= '7Dec2002:00:00:00'd
;quit;

 

 

I have a very big dataset and its taking forever to run and many times connections disrupts and I need to wait for a long again.

can you please suggest a better way to do it. 

Here table T2 is like very big taking almost an hour to complete the query if everything goes well or need to restart many times .

 

Thank you in advance 

kajal

9 REPLIES 9
Reeza
Super User

How big is big here? Are we talking millions/billions of rows or thousands?

 

Is the following slightly better? Do you have any indexes on the datasets or the ability to create any indexes?

 

 

 

proc sql;
	create table t1
	as
	select * 
	from 
	t2
	where 
	substr(name, 1, 3) = 'PNB'
	and rn in 
	(select distinct rn from 
	t4)
	and 
	datepart(DATE) <= '7Dec2002'd
;quit;

@kajal_30 wrote:

 

proc sql;
	create table t1
	as
	select * 
	from 
	t2
	where 
	name like 'PNB%'
	and rn in 
	(select distinct rn from 
	t4)
	and 
	datepart(DATE) <= '7Dec2002:00:00:00'd
;quit;

 

 

I have a very big dataset and its taking forever to run and many times connections disrupts and I need to wait for a long again.

can you please suggest a better way to do it. 

Here table T2 is like very big taking almost an hour to complete the query if everything goes well or need to restart many times .

 

Thank you in advance 

kajal


 

kajal_30
Quartz | Level 8

Thank you @Reeza  it is almost taking the same time. Is it possible to convert this step into hash join where we are selecting all columns ? 

 

Regards

Kajal

Reeza
Super User
Sure, as long as you have enough memory you can probably convert this to a hash join.
LinusH
Tourmaline | Level 20

As stated by others, you need to provide more informaitno of your actual data.

Please provide a PROC CONTENTS of both datasets.

If in an external RDBMS, the DDL together with stats.

Add the following to your code:

options fullstimer msglevel=i;
proc sql _method;
/* In case of external data source */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;
Data never sleeps
ChrisNZ
Tourmaline | Level 20

As a rule of thumb:

- not using functions will speed up the query

- a join is faster than a  in (select ..) clause (depending in the existence of an index and the cardinality of the subset)

data TEST;
  length A $16;
  do I=1 to 1e7;
    A=put(ranuni(1)*1e16, 16.0);
    output;
  end;
run;
 
 proc sql;
   create table T as select * from TEST where A eqt '22'; 
 quit;  
 
 proc sql;   * real time 1.46 seconds;
   create table S as 
   select TEST.*
   from TEST
   where A in (select A from T);

proc sql;   * real time 0.81 seconds;
   create table S as 
   select TEST.*
   from TEST, T
   where TEST.A=T.A;
quit;   
 

Also, put the most discriminant filters first to reduce volume as soon as possible.

 71          proc sql;
 72            create table T as select * from TEST where A like '2%' and A like '222222%';
 NOTE: Table WORK.T created, with 10 rows and 2 columns.
 
 NOTE: PROCEDURE SQL used (Total process time):
       real time           1.42 seconds
       
 
 73          proc sql;
 74            create table T as select * from TEST where A like '222222%' and A like '2%';
 NOTE: Table WORK.T created, with 10 rows and 2 columns.
 
 75          quit;
 NOTE: PROCEDURE SQL used (Total process time):
       real time           1.26 seconds
Tom
Super User Tom
Super User

Here are a couple of things that might improve performance a little bit.

Use truncated comparison instead of LIKE operator:

/* name like 'PNB%' */
name EQT 'PNB'

Eliminate the DATEPART() function call by comparing to a datetime constant instead:

/* datepart(DATE) <= '7Dec2002:00:00:00'd */
DATE < '08DEC2002:00:00:00'dt

But you say that T2 is a LARGE dataset.  So unless you have indexes on the dataset that will allow SAS to skip reading all of the observations the process of just performing those tests on every observation will be long running..  

 

But in general using indexes when the tests are not equality tests i very hard.  So SAs might not end up using the index even if they did exist.

 

Why would it "time-out"?  What is timing out?  Are you using SAS/Studio to run the code and your connection to SAS/Studio is timing out?  Can you submit the step to run as a background ("batch") job instead?

 

 

Also where is the data?  Perhaps it is not the process of reading the data or preforming the tests that it taking a long time.  But instead it is just that you are moving a lot of data across the network from one computer to another.

 

Is the dataset actually a table in some foreign database, like ORACLE or Microsoft SQL Server?  Can you explicitly filter the data on the foreign database side and only move the resulting observations to the SAS server?

Patrick
Opal | Level 21

You need to tell us more about your tables. When people call a table large then some mean 100K rows and others 100M rows. 

  1. What are the volumes (number of rows) of t2 and t4?
  2. How many distinct values for RN do you have in t4?
  3. What is the datatype of RN and if character what's the variable length.
  4. Where is the data stored (SAS table on disk or in some database like Oracle or both)?

A SAS data step hash lookup would likely perform better. Points 1-3 above are for making a decision if your table t4 could fit into memory, point 4 is for making a decision if processing needs to get pushed to a database side for processing (which a data step hash lookup wouldn't).

Also don't use select * but list the variables you need explicitly, like: select t2.name, t2.rn, t4.xxxx, ....

 

 

 

Astounding
PROC Star
Which dates do you want to select? As it stands, you are comparing dates to datetime, so that comparison eliminates zero observations.
Kurt_Bremser
Super User

More information.

More information.

More information.

 

How big is "big"? How many observations, how many variables, what is the observation size?

How large is t4? What is the type and size of rn? How many distinct values do you have?

Where are these datasets really stored? WORK (as your code suggests), permanent SAS library, external database?

 

Hash method:

data t1;
set t2;
where 
  name like 'PNB%'
  and 
  date <= "7dec2002:00:00:00"dt
;
if _n_ = 1
then do;
  declare hash t4 (dataset:"t4 (keep=rn)");
  t4.definekey("rn");
  t4.definedata("rn");
  t4.definedone();
end;
if t4.check();
run;

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 615 views
  • 9 likes
  • 8 in conversation