proc sql;
create table new_sample as
select a.*
from total_sample as a,
selected_3000
where a.ID in (select ID from selected_3000 );
quit;
total_sample has 79 million rows covering 79,000 unique IDs; the file is about 90 GB.
selected_3000 has only 3000 rows, one per unique ID.
I want to select the rows of total_sample whose IDs appear in selected_3000, using the proc sql code above.
However, it generated a huge output file (more than 200 GB) and I had to terminate the procedure. When I checked the output,
I found that the same rows were repeated many times.
What is wrong with this proc sql code?
Try this:
proc sql;
create table new_sample as
select *
from total_sample
where ID in (select ID from selected_3000);
quit;
You don't need selected_3000 in both the main query and the subquery. Because it was also listed in the FROM clause of the main query with no join condition, the query created a Cartesian product: every row of total_sample that passed the WHERE filter was paired with each of the 3000 rows of selected_3000, so it returned 3000 times as many rows as you needed.
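To see the row multiplication on a small scale, here is a minimal sketch using two toy tables. The names total_small and selected_small and their contents are made up for illustration; they stand in for total_sample and selected_3000. The third query shows an inner join as one possible alternative, which gives the same rows as the IN subquery only because the IDs in the lookup table are unique.

```sas
/* Hypothetical toy data: 4 rows, 3 of which have a selected ID */
data total_small;
   input ID value;
   datalines;
1 10
1 11
2 20
3 30
;
run;

/* Hypothetical lookup table: 2 unique IDs */
data selected_small;
   input ID;
   datalines;
1
3
;
run;

proc sql;
   /* Original pattern: the extra table in FROM has no join condition,
      so each of the 3 matching rows is paired with both lookup rows.
      Count = 3 * 2 = 6 */
   select count(*) as rows_with_extra_table
   from total_small as a, selected_small
   where a.ID in (select ID from selected_small);

   /* Corrected pattern: filter only. Count = 3 */
   select count(*) as rows_filtered
   from total_small
   where ID in (select ID from selected_small);

   /* Alternative: inner join on ID. Count = 3, because each ID
      appears only once in selected_small */
   select count(*) as rows_joined
   from total_small as a
        inner join selected_small as b
        on a.ID = b.ID;
quit;
```

With 3000 rows in the lookup table instead of 2, the first pattern multiplies the matching rows by 3000, which is consistent with the oversized output described in the question.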