<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Optimized way to read variable length csv file in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Optimized-way-to-read-variable-length-csv-file/m-p/745454#M233659</link>
    <description>&lt;P&gt;Thanks &lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/159"&gt;@Tom&lt;/a&gt;&amp;nbsp;, that worked perfectly. It took around 2 minutes to read the data, and the dataset was compressed to only 1 GB.&lt;/P&gt;&lt;P&gt;I just need to verify whether reading this compressed dataset in other data steps increases processing time.&lt;/P&gt;</description>
    <pubDate>Thu, 03 Jun 2021 13:02:54 GMT</pubDate>
    <dc:creator>hhh123</dc:creator>
    <dc:date>2021-06-03T13:02:54Z</dc:date>
    <item>
      <title>Optimized way to read variable length csv file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimized-way-to-read-variable-length-csv-file/m-p/745435#M233652</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have a comma-delimited csv file with around 240 columns. These columns are of variable length, so I have used PROC IMPORT (as below) to read the file.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;PROC IMPORT OUT= WORK.&amp;amp;table_name. DATAFILE= holdit DBMS=DLM REPLACE;
				DELIMITER="&amp;amp;delimiter.";
				GETNAMES=YES;
				DATAROW=2;
				GUESSINGROWS=MAX;
			RUN;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;This works fine for smaller amounts of data, but reading a csv file with 2 million records takes around 1 hour, and this is because of &lt;STRONG&gt;GUESSINGROWS=MAX.&amp;nbsp;&lt;/STRONG&gt;Since the data is of variable length I am forced to use &lt;STRONG&gt;MAX.&lt;/STRONG&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So I took another approach: an infile/input statement that defines the maximum width of each column.&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data WORK.&amp;amp;wrk_table. ;
infile &amp;amp;var_fref delimiter = "&amp;amp;delimiter" MISSOVER DSD lrecl=32767 firstobs=2 ;
attrib MEMBER_ID informat=$32. format=$32.;
/*prop_list - Contains list of other column names*/
%do p=1 %to %sysfunc(countw(&amp;amp;prop_list));
	attrib %scan(&amp;amp;prop_list,&amp;amp;p)  informat=$255. format=$255.;
%end;
input
/*inp_str - Contains list of all input variable names*/
%do k=1 %to %sysfunc(countw(&amp;amp;inp_str));
	%scan(&amp;amp;inp_str,&amp;amp;k) $
%end;
;
if missing(MEMBER_ID) then delete;
run;
			&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;This code takes around 7-8 minutes to complete, but the dataset created is 140 GB (with PROC IMPORT it was 9 GB), and this is because of the default length of 255 bytes.&lt;/P&gt;&lt;P&gt;So is there an optimal way to handle this situation that reduces the read time and does not consume a lot of disk space?&lt;/P&gt;</description>
      <pubDate>Thu, 03 Jun 2021 11:51:23 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimized-way-to-read-variable-length-csv-file/m-p/745435#M233652</guid>
      <dc:creator>hhh123</dc:creator>
      <dc:date>2021-06-03T11:51:23Z</dc:date>
    </item>
    <item>
      <title>Re: Optimized way to read variable length csv file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimized-way-to-read-variable-length-csv-file/m-p/745450#M233656</link>
      <description>&lt;P&gt;Use the COMPRESS system option (or the COMPRESS= dataset option) to generate a compressed dataset.&amp;nbsp; That will reduce the disk space used by the dataset.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Your program could be a lot simpler.&amp;nbsp; Note there is no need to attach the $ format or $ informat to character variables.&amp;nbsp; SAS already knows how to write and read character variables.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data _null_;
  infile &amp;amp;var_fref obs=1 ;
  input;
  call symputx('varlist',translate(_infile_,' ',"&amp;amp;delimiter"));
run;

data &amp;amp;wrk_table. (compress=yes) ;
  length MEMBER_ID $32 &amp;amp;varlist $255 ;
  infile &amp;amp;var_fref dsd dlm = "&amp;amp;delimiter" truncover firstobs=2 ;
  input &amp;amp;varlist;
  if missing(MEMBER_ID) then delete;
run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Thu, 03 Jun 2021 12:46:44 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimized-way-to-read-variable-length-csv-file/m-p/745450#M233656</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2021-06-03T12:46:44Z</dc:date>
    </item>
    <item>
      <title>Re: Optimized way to read variable length csv file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimized-way-to-read-variable-length-csv-file/m-p/745454#M233659</link>
      <description>&lt;P&gt;Thanks &lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/159"&gt;@Tom&lt;/a&gt;&amp;nbsp;, that worked perfectly. It took around 2 minutes to read the data, and the dataset was compressed to only 1 GB.&lt;/P&gt;&lt;P&gt;I just need to verify whether reading this compressed dataset in other data steps increases processing time.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Jun 2021 13:02:54 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimized-way-to-read-variable-length-csv-file/m-p/745454#M233659</guid>
      <dc:creator>hhh123</dc:creator>
      <dc:date>2021-06-03T13:02:54Z</dc:date>
    </item>
    <item>
      <title>Re: Optimized way to read variable length csv file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimized-way-to-read-variable-length-csv-file/m-p/745458#M233662</link>
      <description>&lt;P&gt;Is this file static?&amp;nbsp; Do you get multiple copies of it over time? (like a weekly or monthly delivery)&lt;/P&gt;
&lt;P&gt;If the column names and order do not change, then just take the time to define the structure you need once and read all instances into the same structure.&amp;nbsp; Just define each variable long enough to hold the maximum allowed value for that variable, independent of the maximum value in this one set of 2 million observations.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Jun 2021 13:28:47 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimized-way-to-read-variable-length-csv-file/m-p/745458#M233662</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2021-06-03T13:28:47Z</dc:date>
    </item>
    <item>
      <title>Re: Optimized way to read variable length csv file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimized-way-to-read-variable-length-csv-file/m-p/745459#M233663</link>
      <description>&lt;P&gt;This is not a static file; the column names and widths can change in each run. Due to this use case I have previously used PROC IMPORT to read the file.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Jun 2021 13:31:56 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimized-way-to-read-variable-length-csv-file/m-p/745459#M233663</guid>
      <dc:creator>hhh123</dc:creator>
      <dc:date>2021-06-03T13:31:56Z</dc:date>
    </item>
    <item>
      <title>Re: Optimized way to read variable length csv file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimized-way-to-read-variable-length-csv-file/m-p/745489#M233682</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/102722"&gt;@hhh123&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;This is not a static file; the column names and widths can change in each run. Due to this use case I have previously used PROC IMPORT to read the file.&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;When I see a statement that "column name and width" can change, I suggest approaching the &lt;STRONG&gt;source &lt;/STRONG&gt;of the file for documentation of what is supposed to be created. If they can't provide that, how do they know what they are creating?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We used to do some contract data-related work for a large company, and they asked us "why do you charge us a programming surcharge so often?" When we explained that 1) the order of the columns in the data they provided changed, 2) the column headings changed, and 3) the structure of the content in some of those columns changed, multiple times per week, it wasn't very long before that quit happening, as the charges were running thousands per month.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Jun 2021 14:58:01 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimized-way-to-read-variable-length-csv-file/m-p/745489#M233682</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2021-06-03T14:58:01Z</dc:date>
    </item>
    <item>
      <title>Re: Optimized way to read variable length csv file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimized-way-to-read-variable-length-csv-file/m-p/745685#M233796</link>
      <description>&lt;P&gt;Actually, this is a generic application I am working on. It is to be used by multiple consumers, each with their own csv file structure, so I need to handle the varying column names and lengths in the code itself.&lt;/P&gt;</description>
      <pubDate>Fri, 04 Jun 2021 04:46:00 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimized-way-to-read-variable-length-csv-file/m-p/745685#M233796</guid>
      <dc:creator>hhh123</dc:creator>
      <dc:date>2021-06-04T04:46:00Z</dc:date>
    </item>
    <item>
      <title>Re: Optimized way to read variable length csv file</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimized-way-to-read-variable-length-csv-file/m-p/745689#M233798</link>
      <description>&lt;P&gt;Instead of putting only column names into a macro variable, have a dataset that describes all the necessary variable attributes (name, type, length, informat, format) and use that in a DATA _NULL_ step which creates the import code with CALL EXECUTE.&lt;/P&gt;</description>
      <pubDate>Fri, 04 Jun 2021 05:16:26 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimized-way-to-read-variable-length-csv-file/m-p/745689#M233798</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2021-06-04T05:16:26Z</dc:date>
    </item>
  </channel>
</rss>