I have 60 sas7bdat datasets, each with around 100 million rows and 60-70 columns. They contain largely the same columns, and I would like to process them all together to filter them down to a smaller dataset (keeping all of the columns) that contains only the rows with specific information, using the following code:
PROC SQL;
  CREATE TABLE bigtable AS
  SELECT * FROM DATA1
  OUTER UNION CORR
  SELECT * FROM DATA2
  .......
  OUTER UNION CORR
  SELECT * FROM DATA60;

  CREATE TABLE filtered AS
  SELECT * FROM bigtable
  WHERE <filtering based on 6 columns in the 'bigtable' with AND & OR>;
QUIT;
<and then save 'filtered' to CSV>
Two questions relating to this process:
Thank you in advance!
Where did the SAS datasets get created? You will get that message if the datasets were created by a SAS installation that is not identical to yours, for example one running on a different operating system.
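If you want to check, PROC CONTENTS reports the data representation and encoding of a dataset; a minimal sketch, where the libref mylib is a placeholder for wherever your files are assigned:

/* replace mylib with your own libref pointing at the sas7bdat files */
proc contents data=mylib.data1;
run;

Look at the "Data Representation" and "Encoding" fields in the output; values that do not match your current session indicate the file was created under a different host and is being read through cross-environment data access.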
I'd suggest that you try using the DATA step SET statement to stack your tables, as that will likely be faster than SQL. You should also filter the DATA1 to DATA60 tables while reading them, if possible, rather than reading all the data and then filtering afterwards:
data want;
set data1 - data60;
run;
Thank you for the response! Wouldn't filtering the datasets one by one, for all 60 of them, take at least as much time as filtering them all together? Filtering all the datasets together in a single step means the filter only has to traverse the data once rather than 60 times (even though the same number of rows gets read either way), whereas filtering them one by one means re-typing the filtering code 60 times, doesn't it?
Can I use that DATA step in conjunction with PROC SQL? That is, after your code, use PROC SQL to filter the rows?
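For reference, a minimal sketch of that combination (stack with a DATA step, then filter with PROC SQL); the variable names in the WHERE clause are placeholders for illustration, not the actual filter from the original post:

data bigtable;
  set data1 - data60;
run;

proc sql;
  create table filtered as
  select *
  from bigtable
  /* placeholder condition; substitute your own 6-column AND/OR logic */
  where x1 = 'A' and (x2 > 100 or x3 = 'B');
quit;

This works, but it still reads and writes every row once before filtering, which is the extra pass the next reply points out can be avoided.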
By not filtering as you read your input datasets you are reading more data than necessary, so your program will be slower. If you add a WHERE statement to my example code, it applies equally to all of the datasets:
data want;
set data1 - data60;
where < your selection logic >;
run;
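Once the filtered dataset exists, it can be written to CSV with PROC EXPORT. A minimal sketch, where the output path is a placeholder and not from the original post:

proc export data=want
  outfile="/path/to/filtered.csv"  /* hypothetical path; point this at your own location */
  dbms=csv
  replace;
run;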
There is nothing in your posted code that requires using SQL.
Although I have no idea what you mean by the phrase "filtering based on 6 columns in the 'bigtable'".
Are you trying to imply that some of the 6 variables do not exist in all of the original datasets?
One method that can be faster is to use PROC APPEND.
proc append data=data1 base=BIGTABLE ;
WHERE <filtering based on 6 VARIABLES in the SMALL DATASET with AND & OR>;
run;
proc append data=data2 base=BIGTABLE ;
WHERE <filtering based on 6 VARIABLES in the SMALL DATASET with AND & OR>;
run;
...
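To avoid typing the PROC APPEND step 60 times, a simple macro loop can generate it. A minimal sketch, assuming the datasets really are named DATA1 through DATA60 and using a placeholder filter (the variable names are illustrations, not the actual 6-variable condition):

%macro stack_filtered;
  %do i = 1 %to 60;
    proc append base=bigtable data=data&i force;
      /* placeholder filter; replace with your own 6-variable AND/OR logic */
      where x1 = 'A' and (x2 > 100 or x3 = 'B');
    run;
  %end;
%mend stack_filtered;

%stack_filtered

The FORCE option lets the append proceed when a dataset's variable list does not exactly match the base table; variables not present in the base are dropped, so only use it if that is acceptable.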