Hi all
I have to do a join with datasets that have over 50 millions of rows each one and it´s a nightmare. It takes 5 hours or more in perform the task.
I dont know if there is a better solution, testing with a merge statement takes too much time too because the datasets need to be sorted and
these tasks takes time and I have read that proc sort don´t use an index to perfom the task.
Plus the disk space is reduced drastically.
Could anybody help me please?
Is there a better solution to achieve the result without have to wait for 5 hours?
thanks in advanced
Where is the data? What kind of join are you doing? What is the join on, what indexes are available? Is the result of the join much smaller than the joined tables? Are you getting a message from SAS about a join that can't be optimized? We need answers to these questions to optimize the operation.
PG
Morning all
First, thanks for the help it´s much appreciated for me.
Well, I´ll try to answer your questions.
The data are in an Oracle Server and I make a copy of the datasets connecting to the server through a libname statement and after
data a;
set b (where = (var1<= date and var2>date and var3>date); where "b" is the dataset on the Oracle server
I do a left join, the target in most of the cases is to obtain the surrogate key of the second table and one or two variables more.Something like this
proc sql;
create table x as
select a.*, b.var1, b.var2, b.var3
from dataset1 as a left join dataset2 as b
on (a.var1 = b.var1);
quit;
There is no index at all in any of the datasets and no message appears in the log about an issue in the performance
Regards
Try this:
proc sort
data=b /* this is your original oracle data set */
(where = (var1<= date and var2>date and var3>date)
out=dataset1
;
by var1;
run;
proc sort
data=dataset2 (keep=var1 var2 var3)
out=data2x
;
by var1;
run;
data x;
merge
dataset1 (in=a)
data2x
;
by var1;
if a;
run;
Compare this method and the SQL method using options fullstimer;
Also watch the disks while the jobs are running; you may be surprised by the disk usage(s).
I remember when I first came across a piece of code done by a SAS consultant that had > 100 lines. I quickly saw that I could do the same in one create table with ~ 10 lines in PROC SQL, so why bother with all that code? Then I had to wait 5 hours for my SQL to finish, while his code took about 20 minutes to produce the same result. With less than half the disk space.
As a previous post had mentioned why not perform your tasks on the SQL server where your data resides, store the results into a temporary table and then extract the data into SAS? SQL databases should be fully optimized and geared up to performing code on very large datasets, although once you reach a certain point then it becomes less a database and more of a data warehouse with different storage requirements and processes.
You are not telling where your "dataset2" is stored. If that one is i Oracle two, consider to move all processing to Oracle (SQL implicit/explicit pass-thru).
If you lookup surrogate keys from dataset2, it sounds like that is a permanent table, which mean you could consider applying an index to to it.
When doing similar work in DI Studio, there Look-up transformation is using data step hash tables. This technique has the benefit that it does not require that the "master" table to be sorted, it just performs a clean table scan.
Perhaps your whole process could be re-configured, but there's too little information at this point to give any suggestion in that direction.
Sorry guys, my PC have been half to death all day and I couldn´t test your adviced. I'm trying to sent a the query to Oracle server but I recive an error like this
"ERROR: PROC SQL requires any created table to have at least 1 column."
96 options fullstimer;
97 options sastrace=',,,s ,,d, ,,t,' sastraceloc=saslog nostsuffix;
98
99 proc sql;
100 connect to odbc as oracle (USER=aaaaaa PW=XXXXXXXX DSN='pppp');
101 create table result as select * from connection to oracle
102 (select a.var1, a.var2, a.var3, a.var4, a.var5,
102 a.var6, a.var7, a.var8, a.var9, a.var10, b.var1, b.var2
104 from epi as a left join blas b on (a.var1 = b.var1)
105 where a.var1 <= date and a.var2 >= date and 105! a.var3>date);
ERROR: PROC SQL requires any created table to have at least 1 column
LinusH, both datasets are in the work library, I copy it from a DWH in Oracle server.
Thanks for the link Jaap, It's very helpful for me.
One more thing, I haven`t permissions for create, modify or alter tables in the Server so all the tables that I create should be in the local drive
Regards
Dear Sergio,
When using pass thru in the past, I see two things that could be the cause of your error message. The "as a" might be an issue. Try " epi a". The other issue may be that there are no rows selected by your where statement. Try testing with just one of the were conditions. Are the values in "var1" and "date" formatted exactly the same?
104 from epi as a left join blas b on (a.var1 = b.var1)
105 where a.var1 <= date and a.var2 >= date and 105! a.var3>date);
ERROR: PROC SQL requires any created table to have at least 1 column
This is code that worked for me in the past. "Doris" is the name of the Oracle database that lived in a UNIX environment.
proc sql feedback inobs=max outobs=max;
drop table dlib.hstclmshdr;
connect to oracle (user="&userd." password="&passd." path='mypath'
schema=doris preserve_comments buffsize=8000);
create table work.claims as
select G.*
from connection to oracle
(select d.*
from doris.inst_claim_header d
where (d.line_of_business = 'HST')
and (d.claim_thru_date between TO_DATE('01/01/2012','MM/DD/YYYY') and
TO_DATE('12/31/2012','MM/DD/YYYY'))
) G
Morning all
jwillis your code doesn't work for me, it`s neither an alias issue or were conditions.:smileycry::smileycry:
I'll see if I am able to find a solution.
Thanks
what date are you specifying where you use a.var3>date? If date is current day, do you expect to have future-dated records? do you need to use the date functions/formats to get this to work?
Do both datasets have multiple records for the same value of var1?
If at least one of them VAR1 uniquely identifies rows then just use a normal data step merge .
So instead of SQL code like your example:
create table x as
select a.*, b.var2, b.var3
from dataset1 as a
left join dataset2 as b
on (a.var1 = b.var1)
;
You would use normal SAS code and write this as:
data x ;
merge dataset1 (in=in1) dataset2 (in=in2 keep=var1 var2 var3);
by var1 ;
if in1 ;
run;
Sergio,
You can do a hash sort on both data sets, this will reduce processing time, then try the data step merge.
But the hash sort must be done after you pull the data into sas, you can't hash sort in a pass through query.
I know it Jaap but it's my working computer and there is nothing I can do for change this machine or add more memory RAM
which probably resolved the memory's problem.
thanks bentleyj1 for the links I'll take a look.
Tom, KurtBremser told me something like you are telling me now, I test it when I have a little time .
rachel, I don't kwow what you mean with you talk about hash sort. Could you explain to me please?.
Regards.
Sergio,
I was saying you should using a hash method, not proc sort...that will shorten the run time.
SergioSanchez wrote:
I know it Jaap but it's my working computer and there is nothing I can do for change this machine or add more memory RAM
which probably resolved the memory's problem.
thanks bentleyj1 for the links I'll take a look.
Tom, KurtBremser told me something like you are telling me now, I test it when I have a little time .
rachel, I don't kwow what you mean with you talk about hash sort. Could you explain to me please?.
Regards.
I meant to say sort your two datasets using hash sort, because the hash sort is much faster then the proc sort. instead of proc sort, then try merging with a data step.
data _null_ if _N_ =1 then do;
declare hash h(dataset:"your_data", ordered:'d'); /* 'd ' means descending or you can put 'a' for ascending*/
h.defineKey('your_key');
h.defineData('your_key');
call missing(your_key); /*no quotes'*/
end; /* Create output dataset of sorted data */
rc=h.output(dataset:'out_your_data');
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.