1)Large data set (20 mil records) and a second data set (10,000 records). Say we do an inner join with product ID. But this is too slow.
Any recommendation to improve performance?
2)I get the output data set from a current dataset A, product ID by the last occurrence, check if its excluded in a separate dataset B (say, discontinued products).
How do you do it?
Hello Sahaji,
I will give it my best shot 😉
1)
In my opinion your questions depend on the infrastructure/hardware specs that you have.
There are several options for you to try:
- you could use the smaller data set as a hash file: http://support.sas.com/rnd/base/datastep/dot/better-hashing-sas92.pdf
- you could create a format from the smaller data set and use it on the 20mio data set. Then when your format variable is not filled you would drop that row.
- if you use SPDS for your data sets you can safe a lot of time by using PROC SQL instead of DATA-Steps and make use of parallel processing (SAS(R) Scalable Performance Data Server(R) 4.53: User's Guide). We used PROC SQL instead of DATA-Steps to save a lot of time in our programs.
- you could also try to create indexes on your join variables
2)
Perhaps you could specify some details. Why would you want to check if your product ID is not within a table B?
Do you want to make sure that it is not in table B? Then I would suggest a select where the product Id is not in select product ID from table A.
But with more details we might find a better fitting solution.
Cheers,
Michael
Hello Sahaji,
I will give it my best shot 😉
1)
In my opinion your questions depend on the infrastructure/hardware specs that you have.
There are several options for you to try:
- you could use the smaller data set as a hash file: http://support.sas.com/rnd/base/datastep/dot/better-hashing-sas92.pdf
- you could create a format from the smaller data set and use it on the 20mio data set. Then when your format variable is not filled you would drop that row.
- if you use SPDS for your data sets you can safe a lot of time by using PROC SQL instead of DATA-Steps and make use of parallel processing (SAS(R) Scalable Performance Data Server(R) 4.53: User's Guide). We used PROC SQL instead of DATA-Steps to save a lot of time in our programs.
- you could also try to create indexes on your join variables
2)
Perhaps you could specify some details. Why would you want to check if your product ID is not within a table B?
Do you want to make sure that it is not in table B? Then I would suggest a select where the product Id is not in select product ID from table A.
But with more details we might find a better fitting solution.
Cheers,
Michael
Thank You so much Michael. Thanks for spending time to answer my question. Greatly appreciated.
Is the large dataset indexed or partitioned? If it is, then you may be able to limit the data very early in the process.
Add a where clause that uses a variable that is indexed or partitioned. You can add the where clause right on the
join, as in the example below.
Good luck.
Sheri
PROC SQL;
CREATE TABLE Lab_Results as
select LabGroup, Prov_ID, A.Enc_ID, PATIENT_ID, A.Order_ID, Order_Date, Order_End_Date,
L.Lab_Status_C, ORDER_Value,
L.Result_Date, L.Result_Time, Result_In_Range_YN, L.Result_Status_C
FROM ProdFil.EntityLabs A
JOIN ProdFil.Order_Results (where=(Result_Date >= '01JAN2013'd) L
on A.Enc_ID = L.Enc_ID
where A.Order_Proc_ID = '132856'
quit;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.