hhh123
Fluorite | Level 6

Hi,

I have a comma-delimited CSV file with around 240 columns. The columns are of variable length, so I have used PROC IMPORT (as below) to read the file.

 

PROC IMPORT OUT=WORK.&table_name. DATAFILE=holdit DBMS=DLM REPLACE;
  DELIMITER="&delimiter.";
  GETNAMES=YES;
  DATAROW=2;
  GUESSINGROWS=MAX;
RUN;

This works fine for smaller amounts of data, but reading a CSV file with 2 million records takes around 1 hour, because of GUESSINGROWS=MAX. Since the data is of variable length, I have no choice but to use MAX.

So I took another approach: an INFILE/INPUT statement that defines a maximum width for each column.

/* runs inside a macro definition, since it uses %DO loops */
data WORK.&wrk_table. ;
  infile &var_fref delimiter="&delimiter" MISSOVER DSD lrecl=32767 firstobs=2 ;
  attrib MEMBER_ID informat=$32. format=$32.;
  /* prop_list contains the list of the other column names */
  %do p=1 %to %sysfunc(countw(&prop_list));
    attrib %scan(&prop_list,&p) informat=$255. format=$255.;
  %end;
  input
  /* inp_str contains the list of all input variable names */
  %do k=1 %to %sysfunc(countw(&inp_str));
    %scan(&inp_str,&k) $
  %end;
  ;
  if missing(MEMBER_ID) then delete;
run;

This code takes around 7-8 minutes to complete, but the dataset it creates is 140 GB (with PROC IMPORT it was 9 GB). That is because every column is stored at the full 255-byte length: roughly 240 columns × 255 bytes × 2 million rows ≈ 120 GB of fixed-width storage, before overhead.

So is there an optimal way to handle this situation, one that reduces the read time without consuming a lot of disk space?

7 REPLIES
Tom
Super User

Use the COMPRESS option (or the COMPRESS= dataset option) to generate a compressed dataset. That will reduce the disk space used by the dataset.
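For reference, a minimal sketch of both forms (the dataset names here are hypothetical):

/* dataset option: compress only this output dataset */
data work.big_compressed (compress=yes);
  set work.big;
run;

/* system option: compress every dataset created from here on */
options compress=yes;

/* PROC CONTENTS reports whether compression took effect */
proc contents data=work.big_compressed;
run;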

 

Your program could be a lot simpler.  Note there is no need to attach the $ format or $ informat to character variables.  SAS already knows how to write and read character variables.

 

/* read the header row and turn it into a space-delimited variable list */
data _null_;
  infile &var_fref obs=1 ;
  input;
  call symputx('varlist',translate(_infile_,' ',"&delimiter"));
run;

/* MEMBER_ID keeps its $32 length; every other variable defaults to $255 */
data &wrk_table. (compress=yes) ;
  length MEMBER_ID $32 &varlist $255 ;
  infile &var_fref dsd dlm = "&delimiter" truncover firstobs=2 ;
  input &varlist;
  if missing(MEMBER_ID) then delete;
run;
hhh123
Fluorite | Level 6

Thanks @Tom, that worked perfectly. It took around 2 minutes to read the data, and the dataset was compressed to only 1 GB.

I just need to verify whether reading this compressed dataset in other DATA steps increases processing time.
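One way to check is to turn on FULLSTIMER and time a simple pass over each version (the dataset names here are hypothetical):

options fullstimer;

/* the log reports real and CPU time for reading the compressed copy */
data _null_;
  set work.big_compressed;
run;

/* compare against an uncompressed copy of the same data */
data _null_;
  set work.big_uncompressed;
run;

COMPRESS=YES trades a little extra CPU for decompression against much less I/O, so reading a well-compressed dataset is often no slower and can even be faster.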

Tom
Super User

Is this file static?  Do you get multiple copies of it over time? (like a weekly or monthly delivery)

If the column names and order do not change, then just take the time to define the structure you need once and read all instances into the same structure. Define each variable long enough to hold the maximum allowed value for that variable, independent of the maximum value in this one set of 2 million observations.
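For example, a minimal sketch of that idea, with hypothetical column names and maximum widths taken from the file's documentation rather than from any one delivery:

filename csvin "/path/to/monthly_delivery.csv"; /* hypothetical path */

data work.monthly;
  /* lengths come from the documented maximums, not from scanning this file */
  length MEMBER_ID $32 FIRST_NAME $40 LAST_NAME $60 STATUS $10;
  infile csvin dsd dlm=',' truncover firstobs=2;
  input MEMBER_ID FIRST_NAME LAST_NAME STATUS;
run;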

hhh123
Fluorite | Level 6
This is not a static file; the column names and widths can change in each run. Due to this use case I previously used PROC IMPORT to read the file.
ballardw
Super User

@hhh123 wrote:
This is not a static file; the column names and widths can change in each run. Due to this use case I previously used PROC IMPORT to read the file.

When I see a statement that "column name and width" can change, I suggest approaching the source of the file for documentation of what is supposed to be created. If they can't provide that, then how do they know what they are creating?

 

I used to do some contract data-related work for a large company, and they asked us "why do you charge us a programming surcharge so often?" When we explained that 1) the order of columns in the data they provided changed, 2) the column headings changed, and 3) the structure of the content in some of those columns changed, multiple times per week, it wasn't very long before that quit happening, as the charges were running thousands per month.

hhh123
Fluorite | Level 6

Actually, this is a kind of generic application that I am working on. It is to be used by multiple consumers, each with their own CSV file structure. So basically I need to handle the varying lengths and columns in the code itself.

Kurt_Bremser
Super User

Instead of putting only column names into a macro variable, have a dataset that describes all the necessary variable attributes (name, type, length, informat, format) and use that in a DATA _NULL_ step which creates the import code with CALL EXECUTE.
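A minimal sketch of that approach, assuming a hypothetical attributes dataset with just name and length (type, informat, and format would extend it the same way; csvin is a hypothetical fileref):

/* hypothetical metadata: one row per column of the incoming file */
data work.colspec;
  length name $32;
  infile datalines dsd;
  input name $ len;
datalines;
MEMBER_ID,32
PROP_A,100
PROP_B,16
;

/* generate the import step from the metadata and run it with CALL EXECUTE */
data _null_;
  set work.colspec end=last;
  length stmt $64 stmts inputs $32767;
  retain stmts inputs;
  stmt = catx(' ', 'length', name, cats('$', len)); /* e.g. length MEMBER_ID $32 */
  stmts = catx(' ', stmts, cats(stmt, ';'));
  inputs = catx(' ', inputs, name);
  if last then call execute(
    'data work.imported; '
    || 'infile csvin dsd dlm="," truncover firstobs=2; '
    || strip(stmts)
    || ' input ' || strip(inputs) || '; '
    || 'run;');
run;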

