Solved: Re: Need answers if you can

sahaji · Posted 11-14-2013 08:11 AM

1)Large data set (20 mil records) and a second data set (10,000 records). Say we do an inner join with product ID. But this is too slow.

Any recommendation to improve performance?

2)I get the output data set from a current dataset A, product ID by the last occurrence, check if its excluded in a separate dataset B (say, discontinued products).

How do you do it?

mfab · Posted 11-14-2013 09:41 AM

Hello Sahaji,

I will give it my best shot 😉

1)

In my opinion your questions depend on the infrastructure/hardware specs that you have.

There are several options for you to try:

- you could use the smaller data set as a hash file: http://support.sas.com/rnd/base/datastep/dot/better-hashing-sas92.pdf

- you could create a format from the smaller data set and use it on the 20mio data set. Then when your format variable is not filled you would drop that row.

- if you use SPDS for your data sets you can safe a lot of time by using PROC SQL instead of DATA-Steps and make use of parallel processing (SAS(R) Scalable Performance Data Server(R) 4.53: User's Guide). We used PROC SQL instead of DATA-Steps to save a lot of time in our programs.

- you could also try to create indexes on your join variables

2)

Perhaps you could specify some details. Why would you want to check if your product ID is not within a table B?

Do you want to make sure that it is not in table B? Then I would suggest a select where the product Id is not in select product ID from table A.

But with more details we might find a better fitting solution.

Cheers,

Michael

View solution in original post

mfab · Posted 11-14-2013 09:41 AM

Hello Sahaji,

I will give it my best shot 😉

1)

In my opinion your questions depend on the infrastructure/hardware specs that you have.

There are several options for you to try:

- you could use the smaller data set as a hash file: http://support.sas.com/rnd/base/datastep/dot/better-hashing-sas92.pdf

- you could create a format from the smaller data set and use it on the 20mio data set. Then when your format variable is not filled you would drop that row.

- if you use SPDS for your data sets you can safe a lot of time by using PROC SQL instead of DATA-Steps and make use of parallel processing (SAS(R) Scalable Performance Data Server(R) 4.53: User's Guide). We used PROC SQL instead of DATA-Steps to save a lot of time in our programs.

- you could also try to create indexes on your join variables

2)

Perhaps you could specify some details. Why would you want to check if your product ID is not within a table B?

Do you want to make sure that it is not in table B? Then I would suggest a select where the product Id is not in select product ID from table A.

But with more details we might find a better fitting solution.

Cheers,

Michael

sahaji · Posted 11-14-2013 10:14 AM

Thank You so much Michael. Thanks for spending time to answer my question. Greatly appreciated.

Sheri · Posted 11-14-2013 10:16 AM

Is the large dataset indexed or partitioned? If it is, then you may be able to limit the data very early in the process.
Add a where clause that uses a variable that is indexed or partitioned. You can add the where clause right on the

join, as in the example below.

Good luck.

Sheri

PROC SQL;

CREATE TABLE Lab_Results as

select LabGroup, Prov_ID, A.Enc_ID, PATIENT_ID, A.Order_ID, Order_Date, Order_End_Date,

L.Lab_Status_C, ORDER_Value,

L.Result_Date, L.Result_Time, Result_In_Range_YN, L.Result_Status_C

FROM ProdFil.EntityLabs A

JOIN ProdFil.Order_Results (where=(Result_Date >= '01JAN2013'd) L

on A.Enc_ID = L.Enc_ID

where A.Order_Proc_ID = '132856'

quit;

Need answers if you can

Re: Need answers if you can

Re: Need answers if you can

Re: Need answers if you can

Re: Need answers if you can

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away