hhh123
Fluorite | Level 6

Hi,

I have a comma-delimited CSV file with around 240 columns. The columns are of variable length, so I have used PROC IMPORT (as below) to read the file.

 

PROC IMPORT OUT=WORK.&table_name. DATAFILE=holdit DBMS=DLM REPLACE;
  DELIMITER="&delimiter.";
  GETNAMES=YES;
  DATAROW=2;
  GUESSINGROWS=MAX;
RUN;

This works fine for smaller amounts of data, but reading a CSV file with 2 million records takes around 1 hour, because of GUESSINGROWS=MAX. Since the data is of variable length, I have no choice but to use MAX.

So I took another approach: an INFILE/INPUT statement that defines a maximum width for each column.

/* runs inside a macro definition, since it uses %DO loops */
data WORK.&wrk_table. ;
  infile &var_fref delimiter="&delimiter" MISSOVER DSD lrecl=32767 firstobs=2 ;
  attrib MEMBER_ID informat=$32. format=$32.;
  /* prop_list contains the list of the other column names */
  %do p=1 %to %sysfunc(countw(&prop_list));
    attrib %scan(&prop_list,&p) informat=$255. format=$255.;
  %end;
  input
  /* inp_str contains the list of all input variable names */
  %do k=1 %to %sysfunc(countw(&inp_str));
    %scan(&inp_str,&k) $
  %end;
  ;
  if missing(MEMBER_ID) then delete;
run;

This code takes around 7-8 minutes to complete, but the dataset it creates is 140 GB (with PROC IMPORT it was 9 GB). That is because every column is stored at the full 255-byte length: roughly 240 columns × 255 bytes × 2 million rows ≈ 120 GB of fixed-width storage, before overhead.

So is there an optimal way to handle this situation, one that reduces the read time without consuming a lot of disk space?

7 REPLIES
Tom
Super User

Use the COMPRESS option (or the COMPRESS= dataset option) to generate a compressed dataset. That will reduce the disk space used by the dataset.
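For reference, a minimal sketch of both forms (the dataset names here are hypothetical):

/* dataset option: compress only this output dataset */
data work.big_compressed (compress=yes);
  set work.big;
run;

/* system option: compress every dataset created from here on */
options compress=yes;

/* PROC CONTENTS reports whether compression took effect */
proc contents data=work.big_compressed;
run;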

 

Your program could be a lot simpler.  Note there is no need to attach the $ format or $ informat to character variables.  SAS already knows how to write and read character variables.

 

/* read the header row and turn it into a space-delimited variable list */
data _null_;
  infile &var_fref obs=1 ;
  input;
  call symputx('varlist',translate(_infile_,' ',"&delimiter"));
run;

/* MEMBER_ID keeps its $32 length; every other variable defaults to $255 */
data &wrk_table. (compress=yes) ;
  length MEMBER_ID $32 &varlist $255 ;
  infile &var_fref dsd dlm = "&delimiter" truncover firstobs=2 ;
  input &varlist;
  if missing(MEMBER_ID) then delete;
run;
hhh123
Fluorite | Level 6

Thanks @Tom, that worked perfectly. It took around 2 minutes to read the data, and the dataset was compressed to only 1 GB.

I just need to verify whether reading this compressed dataset in other DATA steps increases processing time.
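One way to check is to turn on FULLSTIMER and time a simple pass over each version (the dataset names here are hypothetical):

options fullstimer;

/* the log reports real and CPU time for reading the compressed copy */
data _null_;
  set work.big_compressed;
run;

/* compare against an uncompressed copy of the same data */
data _null_;
  set work.big_uncompressed;
run;

COMPRESS=YES trades a little extra CPU for decompression against much less I/O, so reading a well-compressed dataset is often no slower and can even be faster.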

Tom
Super User

Is this file static?  Do you get multiple copies of it over time? (like a weekly or monthly delivery)

If the column names and order do not change, then just take the time to define the structure you need once and read all instances into the same structure. Define each variable long enough to hold the maximum allowed value for that variable, independent of the maximum value in this one set of 2 million observations.
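For example, a minimal sketch of that idea, with hypothetical column names and maximum widths taken from the file's documentation rather than from any one delivery:

filename csvin "/path/to/monthly_delivery.csv"; /* hypothetical path */

data work.monthly;
  /* lengths come from the documented maximums, not from scanning this file */
  length MEMBER_ID $32 FIRST_NAME $40 LAST_NAME $60 STATUS $10;
  infile csvin dsd dlm=',' truncover firstobs=2;
  input MEMBER_ID FIRST_NAME LAST_NAME STATUS;
run;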

hhh123
Fluorite | Level 6
This is not a static file; the column names and widths can change in each run. Due to this use case I previously used PROC IMPORT to read the file.
ballardw
Super User

@hhh123 wrote:
This is not a static file; the column names and widths can change in each run. Due to this use case I previously used PROC IMPORT to read the file.

When I see a statement that "column name and width" can change, I suggest approaching the source of the file for documentation of what is supposed to be created. If they can't provide that, then how do they know what they are creating?

 

I used to do some contract data-related work for a large company, and they asked us "why do you charge us a programming surcharge so often?" When we explained that 1) the order of columns in the data they provided changed, 2) the column headings changed, and 3) the structure of the content in some of those columns changed, multiple times per week, it wasn't very long before that quit happening, as the charges were running thousands per month.

hhh123
Fluorite | Level 6

Actually, this is a kind of generic application that I am working on. It is to be used by multiple consumers, each with their own CSV file structure. So basically I need to handle the varying lengths and columns in the code itself.

Kurt_Bremser
Super User

Instead of putting only column names into a macro variable, have a dataset that describes all the necessary variable attributes (name, type, length, informat, format) and use that in a DATA _NULL_ step which creates the import code with CALL EXECUTE.
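A minimal sketch of that approach, assuming a hypothetical attributes dataset with just name and length (type, informat, and format would extend it the same way; csvin is a hypothetical fileref):

/* hypothetical metadata: one row per column of the incoming file */
data work.colspec;
  length name $32;
  infile datalines dsd;
  input name $ len;
datalines;
MEMBER_ID,32
PROP_A,100
PROP_B,16
;

/* generate the import step from the metadata and run it with CALL EXECUTE */
data _null_;
  set work.colspec end=last;
  length stmt $64 stmts inputs $32767;
  retain stmts inputs;
  stmt = catx(' ', 'length', name, cats('$', len)); /* e.g. length MEMBER_ID $32 */
  stmts = catx(' ', stmts, cats(stmt, ';'));
  inputs = catx(' ', inputs, name);
  if last then call execute(
    'data work.imported; '
    || 'infile csvin dsd dlm="," truncover firstobs=2; '
    || strip(stmts)
    || ' input ' || strip(inputs) || '; '
    || 'run;');
run;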

