proc sql;
  create table t1 as
  select *
  from t2
  where name like 'PNB%'
    and rn in (select distinct rn from t4)
    and datepart(DATE) <= '7Dec2002:00:00:00'd
  ;
quit;
I have a very big dataset, and this query takes forever to run; the connection is often disrupted and I have to wait a long time again.
Can you please suggest a better way to do this?
Table T2 is very big: the query takes almost an hour to complete if everything goes well, and I often need to restart it many times.
Thank you in advance
kajal
How big is big here? Are we talking millions/billions of rows or thousands?
Is the following slightly better? Do you have any indexes on the datasets or the ability to create any indexes?
proc sql;
  create table t1 as
  select *
  from t2
  where substr(name, 1, 3) = 'PNB'
    and rn in (select distinct rn from t4)
    and datepart(DATE) <= '7Dec2002'd
  ;
quit;
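If you can create indexes, an index on the lookup key can let SAS avoid a full scan when resolving the IN subquery. A minimal sketch, assuming the datasets live in WORK (as the posted code suggests) and that RN is a reasonably selective key:

```sas
/* Sketch only: library and key names taken from the question */
proc datasets library=work nolist;
  modify t2;
    index create rn;   /* simple index on the subquery key */
  modify t4;
    index create rn;
quit;
```

Whether the SQL optimizer actually uses the index depends on the selectivity of the filters, so check the log with the _method option after creating it.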
Thank you @Reeza, it takes almost the same time. Is it possible to convert this step into a hash join when we are selecting all columns?
Regards
Kajal
As stated by others, you need to provide more information about your actual data.
Please provide PROC CONTENTS output for both datasets.
If they are in an external RDBMS, provide the DDL together with the stats.
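To gather that information, something like the following would do (library assumed to be WORK, per the posted code):

```sas
/* Sketch: shows observation counts, variable types/lengths and dataset attributes */
proc contents data=work.t2 varnum; run;
proc contents data=work.t4 varnum; run;
```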
Add the following to your code:
options fullstimer msglevel=i;
proc sql _method;
/* In case of external data source */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;
As a rule of thumb:
- not using functions in the WHERE clause will speed up the query
- a join is faster than an IN (SELECT ...) clause (depending on the existence of an index and the cardinality of the subset)
data TEST;
  length A $16;
  do I=1 to 1e7;
    A=put(ranuni(1)*1e16, 16.0);
    output;
  end;
run;

proc sql;
  create table T as select * from TEST where A eqt '22';
quit;

proc sql;  * real time 1.46 seconds;
  create table S as
  select TEST.*
  from TEST
  where A in (select A from T);

proc sql;  * real time 0.81 seconds;
  create table S as
  select TEST.*
  from TEST, T
  where TEST.A=T.A;
quit;
Also, put the most discriminant filters first to reduce volume as soon as possible.
71   proc sql;
72   create table T as select * from TEST where A like '2%' and A like '222222%';
NOTE: Table WORK.T created, with 10 rows and 2 columns.
NOTE: PROCEDURE SQL used (Total process time):
      real time           1.42 seconds
73   proc sql;
74   create table T as select * from TEST where A like '222222%' and A like '2%';
NOTE: Table WORK.T created, with 10 rows and 2 columns.
75   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           1.26 seconds
Here are a couple of things that might improve performance a little bit.
Use truncated comparison instead of LIKE operator:
/* name like 'PNB%' */
name EQT 'PNB'
Eliminate the DATEPART() function call by comparing to a datetime constant instead:
/* datepart(DATE) <= '7Dec2002:00:00:00'd */
DATE < '08DEC2002:00:00:00'dt
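Putting both changes together, the rewritten query would look like this (a sketch; table and column names are taken from the original post):

```sas
proc sql;
  create table t1 as
  select *
  from t2
  where name eqt 'PNB'                     /* truncated comparison instead of LIKE */
    and rn in (select rn from t4)
    and DATE < '08DEC2002:00:00:00'dt;     /* no DATEPART() call per observation */
quit;
```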
But you say that T2 is a LARGE dataset. So unless you have indexes on the dataset that allow SAS to skip reading all of the observations, just performing those tests on every observation will take a long time.
In general, getting SAS to use an index is very hard when the tests are not equality tests. So SAS might not end up using the index even if it did exist.
Why would it "time out"? What is timing out? Are you using SAS Studio to run the code, and your connection to SAS Studio is timing out? Can you submit the step to run as a background ("batch") job instead?
Also, where is the data? Perhaps it is not the process of reading the data or performing the tests that is taking a long time; instead, you may simply be moving a lot of data across the network from one computer to another.
Is the dataset actually a table in some foreign database, like Oracle or Microsoft SQL Server? Can you explicitly filter the data on the foreign-database side and only move the resulting observations to the SAS server?
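If T2 does live in an external database, explicit pass-through keeps the filtering on the database side, so only the matching rows travel over the network. A hedged sketch, assuming a pre-assigned Oracle libref named ORALIB and an Oracle DATE column named DATE_COL (both names are placeholders, not from the original post):

```sas
proc sql;
  connect using oralib as db;            /* reuse the libref's existing connection */
  create table t1 as
  select * from connection to db
    ( select *                           /* this inner query runs inside Oracle */
      from t2
      where name like 'PNB%'
        and date_col < date '2002-12-08' );
  disconnect from db;
quit;
```

The inner query must be written in the database's own SQL dialect, which is why the date filter uses an Oracle DATE literal here.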
You need to tell us more about your tables. When people call a table large, some mean 100K rows and others mean 100M rows.
A SAS data step hash lookup would likely perform better. Points 1-3 below are for deciding whether your table t4 could fit into memory; point 4 is for deciding whether processing needs to get pushed to the database side (which a data step hash lookup wouldn't be).
Also, don't use select *; list the variables you need explicitly, like: select t2.name, t2.rn, t4.xxxx, ....
More information, please:
1. How big is "big"? How many observations, how many variables, and what is the observation size?
2. How large is t4? What is the type and size of rn? How many distinct values do you have?
3. Where are these datasets really stored? WORK (as your code suggests) or a permanent SAS library?
4. Or an external database?
Hash method:
data t1;
  set t2;
  where
    name like 'PNB%'
    and date < "08dec2002:00:00:00"dt  /* keeps the whole day of 7Dec2002, matching datepart(DATE) <= '7Dec2002'd */
  ;
  if _n_ = 1
  then do;
    declare hash t4 (dataset:"t4 (keep=rn)");
    t4.definekey("rn");
    t4.definedata("rn");
    t4.definedone();
  end;
  if t4.check() = 0;  /* CHECK returns 0 when rn is found, so this keeps only matches */
run;