Tom
Super User

@Season wrote:

..

Yes, I did try PROC IMPORT. Does importing the file with the DATA step necessitate specifying the name and informat of each and every variable in the CSV? That would be a formidable job, as I have possibly thousands of columns in all.

PROC IMPORT will not handle a delimited text file with thousands of columns. It can only read the first 32K bytes of the header row.

 

Use the %CSV2DS() macro instead if you need help guessing how to read a delimited file:

https://github.com/sasutils/macros/blob/master/csv2ds.sas

 

ballardw
Super User

@Season wrote:


Does importing the file with the DATA step necessitate specifying the name and informat of each and every variable in the CSV? That would be a formidable job, as I have possibly thousands of columns in all.


Proc Import will write a basic data step program to read a text file. The code appears in the log and can be copied from the log to the editor, cleaned up, and rerun (or issue a RECALL command immediately after the Proc Import to bring the code into the editor).

 

The types of informats that often need to be addressed are those where the value contains all digits but you want to maintain leading zeros, such as account numbers. Change the informat to a character informat long enough to hold the value ($20. or similar). In most cases, when modifying a Proc Import-generated data step, you can drop the FORMAT statements for all variables except date, time, and datetime variables, unless you want to assign custom formats. Also look out for columns with mixed use of negative signs and () for negative values, or currency and percent signs that aren't on every value. These may require additional coding, such as reading the value as character and parsing it. If you have multiple currency symbols, such as dollar, Yen, Pound, Franc and so on, this may be a very important consideration if you want to manipulate the currency values in any consistent manner.
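A minimal sketch of those two edits, with made-up file and variable names (HAVE.CSV, ACCT_NO, AMOUNT_C are assumptions):

data want;
  infile 'have.csv' dsd truncover firstobs=2;
  length acct_no $20 amount_c $32;   * read as character to keep leading zeros ;
  input acct_no amount_c;
  * the COMMAw. informat handles values like "$1,234" and "(1,234)" (negative) ;
  amount = input(amount_c, comma32.);
run;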

 

Check the assigned informat for your problem variables. If they were read as character but should be dates, that is an indication that you may need to create new variables by parsing the values. Check your national language settings (NLS) to see what order dates are read in. Or, if you see lots of invalid-data messages involving those variables, that is one indicator that the order in the file may differ from your NLS; override the informat to read the value as character and parse it.
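For example, a minimal sketch assuming the raw values are month-day-year; the DATESTYLE option tells informats like ANYDTDTE which order to assume for ambiguous values:

options datestyle=mdy;   * order to assume for ambiguous dates ;
data check;
  d1 = input('12/16/15', anydtdte32.);
  d2 = input('12 16 15', anydtdte32.);
  format d1 d2 date9.;
  put d1= d2=;
run;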

 

If you have variables that would best be considered Boolean, i.e. Yes/No, True/False, and such, it may be worth creating and using a custom informat so that the results are numeric 1/0. In most cases that will be much easier to work with going forward than a hodgepodge of Y/N and T/F character values.
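A sketch of such an informat (the name YNBOOL is made up):

proc format;
  invalue ynbool (upcase)
    'Y','YES','T','TRUE','1' = 1
    'N','NO','F','FALSE','0' = 0
    other                    = .;
run;

data flags;
  input answer :ynbool.;   * Yes, n, TRUE, f all become 1 or 0 ;
  datalines;
Yes
n
TRUE
;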

 

Another consideration not mentioned yet: if these files are supposed to share the same layout, you should be able to use the same data step to read all of them, changing only the name of the input file and the output data set. But it is very likely that the lengths of character variables will differ between files, so modify any of the $w. informats to allow for this. I generally start about 15% longer than what the generated data step used. Then check after reading that the values look right; if not, make the informat wider and re-read.
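One hedged way to set that up is to wrap the cleaned-up step in a simple macro (the macro name, parameters, and variables here are all made up):

%macro read_one(csv=, out=);
data &out;
  infile "&csv" dsd truncover firstobs=2;
  length var1 $24 var2 8;   * $w. lengths widened beyond what was generated ;
  input var1 var2;
run;
%mend read_one;

%read_one(csv=/data/file1.csv, out=file1)
%read_one(csv=/data/file2.csv, out=file2)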

 

A last issue relates to variable names generated from column headings that are either very long (they get truncated at 32 characters) or identical in the source file. If column headings are identical for the first 32 characters of a longer heading, the first gets that text as the variable name; the others get VARxxxx, where xxxx may be the column number in the file. Identical shorter headings may get numeric suffixes added. For example, in a file with multiple headings of "Total", the first gets the variable name Total and the next Total2 (or possibly Total1), with the number incremented for each one that follows.

 

Recommend setting the option VALIDVARNAME=V7 before Proc Import. Dealing with variable names that contain spaces and non-standard characters gets old real quick, since you have to use a name literal such as 'Stupid variable name'n everywhere. The V7 option replaces all the special characters with underscores, which are easier to type (or rename as desired).
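For example (the file path and dataset name are assumptions):

options validvarname=v7;   * map spaces/special characters in headers to underscores ;
proc import datafile='/data/wide_file.csv' out=work.wide dbms=csv replace;
  guessingrows=max;        * scan all rows when guessing variable types ;
run;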

 

One tool that helps with some of this, if you don't have good documentation, is to copy the header row of the CSV (assuming it has column headers) and paste it TRANSPOSED into a spreadsheet. That gives you one "row" per variable, so you can examine how long the variable names might be, or check whether different files have different headers (paste each into a different column of the spreadsheet and run a comparison of the values). If the source has narrative column headings, a little work in the spreadsheet will create LABEL assignment statements for the variables: paste the variable names from the Proc Import-generated data step into another column (using either the INPUT statement from the code or Proc Contents output) and use spreadsheet functions to build text like varname = "original column heading text goes here".
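A similar inspection can be done in SAS itself, by reading just the header row with one heading per observation; a sketch, with an assumed path:

data headers;
  infile '/data/wide_file.csv' dsd obs=1 lrecl=1000000;
  length name $200;
  input name @@;   * one column heading per observation ;
  position + 1;    * column number within the header row ;
run;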

 

Any data source that may have "thousands of columns" and doesn't provide documentation of the content of the file, such as expected lengths of character variables and layouts of date, time, or datetime values, needs to be treated with great suspicion. Without documentation, how do you know what anything represents?

Season
Barite | Level 11

Thank you so much for your very detailed summary of your experiences in dealing with real-world datasets! I think your response deserves much more than a mere like, but unfortunately, that is all I can offer. Still, now that you have invested plenty of your time composing these words, I strongly suggest that you go a step further and turn them into a paper that may eventually appear in SAS user group proceedings or an academic journal like the Journal of Statistical Software. Your experience is invaluable for both SAS users and beyond.


@ballardw wrote:

@Season wrote:


Does importing the file with the DATA step necessitate specifying the name and informat of each and every variable in the CSV? That would be a formidable job, as I have possibly thousands of columns in all.


Check the assigned informat for your problem variables. If they were read as character but should be dates, that is an indication that you may need to create new variables by parsing the values. Check your national language settings (NLS) to see what order dates are read in. Or, if you see lots of invalid-data messages involving those variables, that is one indicator that the order in the file may differ from your NLS; override the informat to read the value as character and parse it.


Still, I would like to consult you further on NLS. I searched the web and saw that Microsoft has a page on NLS, but with a slightly different meaning: National Language Support. I am not sure whether the two NLS's are the same, but in any case, could you please briefly introduce what national language settings are and what impact they have on importing datasets into statistical software like SAS? I used to think that a mere difference in the language of SAS's interface and log had no real impact on its core capabilities, such as loading and editing datasets.


@ballardw wrote:

Any data source that may have "thousands of columns" and doesn't provide documentation of the content of the file, such as expected lengths of character variables and layouts of date, time, or datetime values, needs to be treated with great suspicion. Without documentation, how do you know what anything represents?


Finally, I would like to make a clarification. Your reminder is of great value and I thank you for it. However, the datasets I use do have documentation. In fact, it is huge, as it has a description of every variable therein. Still, only the descriptions of variables, not their intricacies, are documented. In other words, for a given column I only know that it stands for a date with a particular meaning, but not that it can take multiple formats like "12/16/15" and "12 16 15". Only when I imported it into SAS did I realize this issue.

Season
Barite | Level 11

I have an additional question on this issue. Since my question is the same for everyone in the thread, I will only post it in a response to @Tom.

My question is: is it possible to import the CSV files in a parallel rather than a sequential manner? My experience with importing such large files is that even a single file takes a lot of time. Therefore "parallel importation", a term I coined from "parallel computing", would be preferred. I wonder whether SAS can do this.

Kurt_Bremser
Super User
It's possible, and there are multiple ways to do it. But you need to be careful not to saturate your resources (storage, network). Reading external data into datasets is always I/O bound.
Season
Barite | Level 11
Could you provide more details on how to do this? I would deeply appreciate it if you could offer some sample code. Thank you!
Kurt_Bremser
Super User

The simplest method: create separate programs for the files and run them in batch mode in parallel, e.g. in UNIX:

sas program1.sas &
sas program2.sas &
sas program3.sas &

The & sends the command into the background.

If you have a scheduler, it can handle this also.

Tom
Super User

@Season wrote:

I have an additional question on this issue. Since my question is the same for everyone in the thread, I will only post it in a response to @Tom.

My question is: is it possible to import the CSV files in a parallel rather than a sequential manner? My experience with importing such large files is that even a single file takes a lot of time. Therefore "parallel importation", a term I coined from "parallel computing", would be preferred. I wonder whether SAS can do this.


It is difficult to read ONE file in parallel, since you have to read all of it to know where the lines are.

 

If you have multiple files, you can read each of them in separate sessions that run in parallel. How to do that depends on how you are running SAS. You could just open multiple SAS sessions and run them. You could create multiple separate programs and run them as "batch" jobs. If you have SAS/Connect licensed, you could use one SAS session to spawn the other sessions.
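With SAS/Connect, the pattern looks roughly like this (a sketch only; the file names are assumptions, and in practice the children would write to a shared library rather than their own WORK):

options sascmd='!sascmd';    * spawn child sessions on the same machine ;
signon task1;
signon task2;
rsubmit task1 wait=no;
  proc import datafile='/data/file1.csv' out=work.f1 dbms=csv replace; run;
endrsubmit;
rsubmit task2 wait=no;
  proc import datafile='/data/file2.csv' out=work.f2 dbms=csv replace; run;
endrsubmit;
waitfor _all_ task1 task2;   * block until both children finish ;
signoff _all_;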

 

Are the CSV files completely different? Or do they all contain the same variables and just represent different cuts of the same data?   Do you want to combine the resulting datasets back into a single large dataset?  

Season
Barite | Level 11

@Tom wrote:

@Season wrote:

I have an additional question on this issue. Since my question is the same for everyone in the thread, I will only post it in a response to @Tom.

My question is: is it possible to import the CSV files in a parallel rather than a sequential manner? My experience with importing such large files is that even a single file takes a lot of time. Therefore "parallel importation", a term I coined from "parallel computing", would be preferred. I wonder whether SAS can do this.


It is difficult to read ONE file in parallel, since you have to read all of it to know where the lines are.

 

If you have multiple files, you can read each of them in separate sessions that run in parallel. How to do that depends on how you are running SAS. You could just open multiple SAS sessions and run them.


This idea is intuitive and demands little in the way of programming skill. However, the number of CSV's is usually so large that opening multiple SAS sessions is itself a formidable task.


@Tom wrote:


If you have SAS/Connect licensed, you could use one SAS session to spawn the other sessions.


Could you provide some information on how to check whether it is licensed?


@Tom wrote:

Are the CSV files completely different? Or do they all contain the same variables and just represent different cuts of the same data?   Do you want to combine the resulting datasets back into a single large dataset?  


A good guess. They are different cuts of the same data, and I do want to combine them back.

Quentin
Super User

Troy Hughes has a few papers (and a book or two) that discuss ways to roll your own parallelization, using SYSTASK to spawn child SAS processes, e.g.:

https://www.lexjansen.com/wuss/2017/117_Final_Paper_PDF.pdf
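The core SYSTASK pattern from that approach looks roughly like this (the program names are placeholders):

* launch independent batch SAS sessions without waiting for each one ;
systask command "sas /jobs/import1.sas" taskname=t1;
systask command "sas /jobs/import2.sas" taskname=t2;
waitfor _all_ t1 t2;   * block until both child sessions finish ;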

 

Season
Barite | Level 11

Thank you for your recommendation! I searched on Amazon and found it difficult to tell from the titles whether his books contain the material I need, so I will take a look at the paper you cited first.

Tom
Super User

You can run PROC SETINIT to see what products you have licensed.
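That is:

proc setinit; run;   * licensed products are listed in the log ;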

 

If the goal is to make one large SAS dataset, you might be better off running one job. There are trade-offs involved in trying to run in parallel, especially if the separate jobs will be competing with each other for disk access. Large data files will swamp any caching the OS and disk controllers might be trying to implement. Just reading all of the files in a single thread might get better performance, because the OS might be able to pre-fetch the next block of the file into cache and reduce your I/O wait times.

 

If the filenames all follow a simple pattern you could get SAS to find and read all of the files for you in one data step, along the lines of this code:

data want;
  length fname $300;
  infile '/topleveldir/basename*.csv' dsd truncover filename=fname;
* skip the header line at the start of each new file ;
  input @;
  if fname ne lag(fname) then delete;
  length var1 $10 var2 8 var3 8 .... varlast $20 ;
  informat var3 date.;
  format var3 date9.;
  input var1--varlast;
run;

And if the names lend themselves to natural subsets, perhaps they have a YEAR number in them, then you might be able to break that down into one data step per year and run those in parallel.
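For instance, one step per year via the wildcard (the naming pattern here is an assumption), with each step run as its own batch job:

data want2021;
  length fname $300;
  infile '/topleveldir/basename2021*.csv' dsd truncover filename=fname;
  input @;
  if fname ne lag(fname) then delete;   * skip each file's header line ;
  length var1 $10 varlast $20;
  input var1 varlast;
run;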

 

But if the names are not so nice, then it might be better to have a list of the file names in a dataset to drive the data step instead. For an example of how to build a list of file names, look at this macro:

https://github.com/sasutils/macros/blob/master/dirtree.sas

 

With a list of file names you would structure the data step like this instead. Given a dataset named NAMES with a variable named FNAME that holds the fully qualified filename, use the FILEVAR= option of the INFILE statement to tell SAS which file to read.

data want;
  set names ;                          * one observation per file to read ;
  infile csv dsd truncover firstobs=2 filevar=fname end=eof;
  length var1 $10 var2 8 var3 8 .... varlast $20 ;
  informat var3 date.;
  format var3 date9.;
  do while (not eof);                  * read every line of the current file ;
    input var1--varlast;
    output;
  end;
run;

Again, you could break the file list into sub-lists and run multiple data steps in parallel.
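A sketch of splitting the NAMES list into two sub-lists for two parallel jobs:

data names1 names2;
  set names;
  if mod(_n_, 2) = 1 then output names1;   * odd-numbered files ;
  else output names2;                      * even-numbered files ;
run;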

Season
Barite | Level 11

Thank you very much for your suggestion and very detailed explanation! Could you please describe what

if fname ne lag(fname) then delete;

means? Is it used to avoid importing the same file twice? That sounds impossible in Windows, because Windows automatically appends suffixes like (1) to files that share the same name and type in a folder.

 
