proc sql;
  create table new_sample as
  select a.*
  from total_sample as a, selected_3000
  where a.ID in (select ID from selected_3000);
quit;
total_sample has 79 million rows covering 79,000 unique IDs. The file size is 90 GB.
selected_3000 has only 3000 rows, with 3000 unique IDs.
I want to select the rows of total_sample whose IDs appear in selected_3000, using the proc sql code above.
However, it generated a huge file (>200 GB) and I had to terminate the procedure. When I checked the output file, I
found that the same row was repeated many times.
What could be the problem in this proc sql code?
Try this:
proc sql;
  create table new_sample as
  select *
  from total_sample
  where ID in (select ID from selected_3000);
quit;
You don't need selected_3000 in both the main query and the subquery. Because it was also listed in the FROM clause of the main query with no join condition, PROC SQL built a Cartesian product: every matching row of total_sample was paired with each of the 3000 rows of selected_3000, so the query was returning 3000 times as many rows as you needed.
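To see the row multiplication on a small scale, here is a minimal sketch. The tables total_toy and selected_toy are hypothetical stand-ins for total_sample and selected_3000, not part of the original data. With 3 matching rows and 2 selected IDs, the original pattern should produce 3 x 2 = 6 rows, while the corrected pattern produces 3.

```sas
/* Hypothetical toy data: 4 rows, 3 unique IDs */
data total_toy;
  input ID value;
  datalines;
1 10
1 11
2 20
3 30
;
run;

/* Hypothetical selection list: 2 unique IDs */
data selected_toy;
  input ID;
  datalines;
1
2
;
run;

proc sql;
  /* Original pattern: selected_toy appears in FROM with no join
     condition, so each matching row is repeated once per row of
     selected_toy (3 matching rows x 2 = 6 rows). */
  create table with_cross as
  select a.*
  from total_toy as a, selected_toy
  where a.ID in (select ID from selected_toy);

  /* Corrected pattern: selected_toy is used only in the subquery,
     so each matching row appears once (3 rows). */
  create table without_cross as
  select *
  from total_toy
  where ID in (select ID from selected_toy);
quit;
```

Comparing the row counts that the log reports for with_cross and without_cross shows the same effect as in the 79 million row case, where the multiplier is 3000 instead of 2.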