BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
sahaji
Calcite | Level 5

1)Large data set (20 mil records) and a second data set (10,000 records). Say we do an inner join with product ID. But this is too slow.

      Any recommendation to improve performance?

2)I get the output data set from a current dataset A, product ID by the last occurrence, check if its excluded in a separate dataset B (say, discontinued products).

     How do you do it?

1 ACCEPTED SOLUTION

Accepted Solutions
mfab
Quartz | Level 8

Hello Sahaji,

I will give it my best shot 😉

1)

In my opinion your questions depend on the infrastructure/hardware specs that you have.

There are several options for you to try:

- you could use the smaller data set as a hash file: http://support.sas.com/rnd/base/datastep/dot/better-hashing-sas92.pdf

- you could create a format from the smaller data set and use it on the 20mio data set. Then when your format variable is not filled you would drop that row.

- if you use SPDS for your data sets you can safe a lot of time by using PROC SQL instead of DATA-Steps and make use of parallel processing (SAS(R) Scalable Performance Data Server(R) 4.53: User's Guide). We used PROC SQL instead of DATA-Steps to save a lot of time in our programs.

- you could also try to create indexes on your join variables

2)

Perhaps you could specify some details. Why would you want to check if your product ID is not within a table B?

Do you want to make sure that it is not in table B? Then I would suggest a select where the product Id is not in select product ID from table A.

But with more details we might find a better fitting solution.

Cheers,

Michael

View solution in original post

3 REPLIES 3
mfab
Quartz | Level 8

Hello Sahaji,

I will give it my best shot 😉

1)

In my opinion your questions depend on the infrastructure/hardware specs that you have.

There are several options for you to try:

- you could use the smaller data set as a hash file: http://support.sas.com/rnd/base/datastep/dot/better-hashing-sas92.pdf

- you could create a format from the smaller data set and use it on the 20mio data set. Then when your format variable is not filled you would drop that row.

- if you use SPDS for your data sets you can safe a lot of time by using PROC SQL instead of DATA-Steps and make use of parallel processing (SAS(R) Scalable Performance Data Server(R) 4.53: User's Guide). We used PROC SQL instead of DATA-Steps to save a lot of time in our programs.

- you could also try to create indexes on your join variables

2)

Perhaps you could specify some details. Why would you want to check if your product ID is not within a table B?

Do you want to make sure that it is not in table B? Then I would suggest a select where the product Id is not in select product ID from table A.

But with more details we might find a better fitting solution.

Cheers,

Michael

sahaji
Calcite | Level 5

Thank You so much Michael. Thanks for spending time to answer my question. Greatly appreciated.

Sheri
Calcite | Level 5

Is the large dataset indexed or partitioned?    If it is, then you may be able to limit the data very early in the process.
Add a where clause that uses a variable that is indexed or partitioned.     You can add the where clause right on the

join, as in the example below.

Good luck.

Sheri

 

PROC SQL;

CREATE TABLE  Lab_Results as

select LabGroup, Prov_ID,  A.Enc_ID, PATIENT_ID, A.Order_ID, Order_Date, Order_End_Date,

           L.Lab_Status_C,   ORDER_Value,

           L.Result_Date, L.Result_Time, Result_In_Range_YN, L.Result_Status_C

FROM ProdFil.EntityLabs A

JOIN  ProdFil.Order_Results (where=(Result_Date >= '01JAN2013'd)  L

on A.Enc_ID = L.Enc_ID

where A.Order_Proc_ID = '132856'

quit;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 818 views
  • 0 likes
  • 3 in conversation