
Data Processing with PROC SQL for Multiple Billion Rows

sas_novice2 — Mon, 25 Sep 2023 18:54:50 GMT

I have 60 sas7bdat datasets, each with around 100 million rows and 60-70 columns. The columns are largely the same across the datasets. I would like to process them all together, filtering them down to a smaller dataset (keeping all the columns) that contains only the rows with specific information, using the following code:

PROC SQL;
CREATE TABLE bigtable AS
SELECT * FROM DATA1
OUTER UNION CORR
SELECT * FROM DATA2
.......
OUTER UNION CORR
SELECT * FROM DATA60;

CREATE TABLE filtered AS
SELECT * FROM bigtable
WHERE <filtering based on 6 columns in the 'bigtable' with AND & OR>;

<and then save 'filtered' in CSV>;

Two questions relating to this process:

1. DATA1, ..., DATA60 are sas7bdat datasets, but I get the following message: "Data file is in a format that is native to another host". This may make processing much slower. The datasets are sas7bdat files encoded in latin1 (Western, ISO). Are there any solutions, and why is the data not in native format even though the files are sas7bdat datasets?
2. Is there a much faster way to vertically combine and filter billions of rows with PROC SQL?

Thank you in advance!

Re: Data Processing with PROC SQL for Multiple Billion Rows

SASKiwi — Mon, 25 Sep 2023 19:05:40 GMT

Where were the SAS datasets created? You will get that message if the datasets were created in a SAS installation that differs from yours, for example one that runs on a different OS.
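To see where the mismatch comes from, it may help to compare the session encoding with what is stored in one of the files. The following is only a sketch: the libref `native` and its path are hypothetical placeholders, and `data1` is assumed to be reachable as in the original post. As far as I know, the same message also appears when the file encoding does not match the session encoding, which may be relevant since the files are latin1.

```sas
/* Print the encoding of the current SAS session to the log. */
%put NOTE: Session encoding is %sysfunc(getoption(encoding));

/* Show how one input file was written: check the
   "Data Representation" and "Encoding" rows in the output. */
proc contents data=data1;
run;

/* Optional one-time conversion. A plain DATA step copy is written in the
   session's native format. "native" is a hypothetical libref and the
   path is a placeholder. */
libname native '/path/to/native/copies';

data native.data1;
  set data1;
run;
```

The conversion itself has to read every row once through the cross-environment layer, so it probably only pays off if the same dataset will be read more than once.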

I'd suggest that you try using the DATA step SET statement to stack your tables, as that will likely be faster than SQL. You should also filter the DATA1 to DATA60 tables as they are read, if possible, rather than reading all the data and then filtering afterwards:

data want;
  set data1 - data60;
run;

Re: Data Processing with PROC SQL for Multiple Billion Rows

sas_novice2 — Mon, 25 Sep 2023 19:08:50 GMT

Thank you for the response! Wouldn't filtering the 60 datasets one by one take at least as long as filtering them all together? Filtering all the datasets in a single step would require only one pass instead of 60. The same number of rows is examined either way, but wouldn't I have to re-type the filtering code 60 times?

Can I use that DATA step in conjunction with PROC SQL? That is, run your code first and then use PROC SQL to filter the rows?

Re: Data Processing with PROC SQL for Multiple Billion Rows

SASKiwi — Mon, 25 Sep 2023 20:13:37 GMT

If you do not filter as you read your input datasets, you are reading more data than necessary, so your program will be slower. If you add a WHERE statement to my example code, it applies equally to all datasets:

data want;
  set data1 - data60;
  where < your selection logic >;
run;
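As a concrete illustration of the single-pass approach above, here is a minimal sketch that filters while stacking and then writes the result to CSV. The variables `region`, `year` and `amount` and the output path are hypothetical stand-ins for the six real filter columns and the real destination. It also assumes every variable named in the WHERE statement exists in all 60 datasets, otherwise the step will fail.

```sas
/* Stack data1..data60 and keep only matching rows in one pass.
   region, year and amount are hypothetical filter variables. */
data want;
  set data1 - data60;
  where (region = 'EU' and year >= 2020) or amount > 1000;
run;

/* Write the filtered rows to CSV; the output path is a placeholder. */
proc export data=want
  outfile='/path/to/filtered.csv'
  dbms=csv
  replace;
run;
```

The filter is written once and applied to each input as it is read, so nothing has to be re-typed per dataset.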

Re: Data Processing with PROC SQL for Multiple Billion Rows

Tom — Mon, 25 Sep 2023 20:37:26 GMT

There is nothing in your posted code that requires using SQL.

Although I have no idea what you mean by the phrase "filtering based on 6 columns in the 'bigtable'".

Are you trying to imply that some of the 6 variables do not exist in all of the original datasets?

One method that can be faster is to use PROC APPEND.

proc append data=data1 base=BIGTABLE ;
      WHERE <filtering based on 6 VARIABLES in the SMALL DATASET with AND & OR>;
run;

proc append data=data2 base=BIGTABLE ;
      WHERE <filtering based on 6 VARIABLES in the SMALL DATASET with AND & OR>;
run;
 ...
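The repeated PROC APPEND steps above could be generated with a small macro loop rather than typed 60 times. This is a sketch under assumptions: the macro name `append_filtered` and the WHERE expression are hypothetical, and BIGTABLE is assumed not to exist before the first iteration (otherwise the rows are added to whatever is already there). As far as I know, PROC APPEND takes the structure of BASE from the first dataset appended, and stops with an error if a later dataset has variables that BASE lacks unless FORCE is specified (in which case those variables are dropped), so it is not an exact replacement for OUTER UNION CORR when the columns differ.

```sas
/* Sketch only: loops PROC APPEND over data1..dataN with one shared filter.
   The macro name and the WHERE expression are hypothetical. */
%macro append_filtered(n=60);
  %local i;
  %do i = 1 %to &n;
    proc append base=bigtable data=data&i;
      /* Placeholder filter; replace with the real six-variable logic. */
      where (region = 'EU' and year >= 2020) or amount > 1000;
    run;
  %end;
%mend append_filtered;

%append_filtered(n=60)
```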