Re: Importing multiple large CSV files with varying data formats in on... - Page 3

Tom · Posted 07-24-2025 08:42 AM

@Season wrote:

Thank you very much for your suggestion and very detailed explanation! Could you please describe
* Skip header lines ;
  input @;
  if fname=lag(fname) then delete;
means? Is it used for avoidance of duplicate importation of a same document? That sounds impossible in Windows, because it will automatically append suffixes like (1) for documents sharing the same name and type in a folder.

It is testing if the filename has changed so you can skip the HEADER lines of the CSV files. When you read one file you can use the FIRSTOBS=2 option on the INFILE statement to skip the header line. But when you are reading multiple files that would only skip the header line of the first file. The INPUT @; statement is important as it forces moving to the next line so that the FNAME variable is updated.

There is also an EOV= option where you an specify a variable name with will be set when the end of file (I think the V is for "volume", like a book subsection). But you need to worry about resetting it, so I find using the LAG() function to check if the filename has changed is easier.

If you have problems with similar filenames then make the part of the code that pulls the key information from the filename smarter. Or keep the full filename by assigning the value of FNAME to another variable that will actually be output into the dataset.

Season · Posted 07-24-2025 11:32 PM

I referenced Example 3: Holding a Record in the Input Buffer in Statements: INPUT Statement - 9.2, which is an example on the @ symbol, but I am still a bit confused about its meaning. First of all, no variable name was sandwitched between "input" and "@" in your code. I am not sure that in this case, what impact this argument will have. Second, I learnt from the example that the function of @ is that is prevents SAS from reading data from the next line. So it somehow contradicts to your explanation in that you stated that it was for forcing SAS "moving to the next line so that the FNAME variable is updated".

Tom · Posted 07-25-2025 08:37 AM

You do not have to have to actually read into any variables on an INPUT statement.

INPUT @ ;

Will read in nothing hold the column pointer at its current place. In this case pointing at column one. Since the INFILE statement has the FILENAME= option stated that named variable will be updated with the name of the file that supplied the current file.

if lag(fname) ne fname then delete;

If this is the first line of a new file then the value of FNAME is different than it was the last time this statement ran. This is true even on the first observation since the the first time you execute a LAG() statement it returns an empty value. The DELETE statement ends this iteration of the data step. So the implied OUTPUT at the end of the data step does not execute for this iteration. So the first line of each file is not written out.

rest of data step

Since the INPUT @ left the column pointer at column one of the line the rest of the data step (including any INPUT statement) will perform as normal on the remaining lines of the file.

Season · Posted 07-26-2025 06:15 AM

Oh, I see! The key function of the sentence in question is to delete the first row of every CSV file, given that we have specified the variable names via the following INPUT statement. Thank you so much for your delicate code as well as your patient explanation! I do think that our discussion can be summarized into a paper that you can embark on writing, if you wish. This is because that when it comes to extremely large datasets, CSV has two advantages over Excel in the sense that the number of rows is not confined to 2^20=1048576 and that CSV files and the compressed files generated from them are smaller than their Excel counterparts. Therefore, there is still a need for importation of CSV files and as a consequence, the need for a paper demonstrating how to do that in SAS in the era of big data.

Kurt_Bremser · Posted 07-26-2025 08:31 AM

Do a Google search (Maxim 6) for „sas how to read csv file“ and you‘ll see that no further paper is needed.
Reading data from text (flat) files has been around since SAS started, so it‘s among the best documented tasks.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Tom · Posted 07-26-2025 12:32 PM

The main problem with CSV files as data store is there is no place to store the metadata about what the text in the CSV file means.

One work around is to make an example SAS dataset with the variables in the same order and with necessary informats attached. Then you can use that dataset to drive the reading of the csv file.

data want;
  if 0 then set template;
  infile csv dsd truncover firstobs=2;
  input (_all_) (+0);
run;

The DATA statement names the dataset to create.

The IF statement condition is obviously false, so the dataset TEMPLATE is not loaded. But the data step will see the metadata of TEMPLATE and define all of the variables that it has.

The INFILE statement points to the CSV file and set its for reading a delimited file but skipping the headerline. You can change the delimiter from the default comma to something else by adding the DLM= option.

The INPUT statement reads in all of the variables by using the special variable list keyword _ALL_. To allow the use of a variable list in an INPUT statement you need to use the formatted lists style.

Updated the INPUT statement syntax.

Season · Posted 07-28-2025 11:30 PM

Thank you so much for your introduction made, especially your statement-by-statement introduction of your code! Your help is of great importance to an intermediate SAS user who has not very much expertise in SAS programming. However, I would still like to ask if the structure of the dataset storing the details of variables in the CSV, which, in your code, is named "template", needs to take a particular form. From your introduction, "template" needs to have information on the variable names and informats. Does structuring the dataset by two columns wherein one is used for housing the variable names and the second one is for storing the informats of the corresponding variables suffice?

Tom · Posted 07-29-2025 08:20 AM

I think perhaps you misunderstood. In that situation the CONTENT of the TEMPLATE dataset is not important. Just its definition.

Let's make an example. Say we wanted to share data for SASHELP.CARS. We could make a CSV file easily enough. Using PROC EXPORT for example.

Make,Model,Type,Origin,DriveTrain,MSRP,Invoice,EngineSize,Cylinders,Horsepower,MPG_City,MPG_Highway,Weight,Wheelbase,Length
Acura,MDX,SUV,Asia,All,"$36,945","$33,337",3.5,6,265,17,23,4451,106,189
Acura,RSX Type S 2dr,Sedan,Asia,Front,"$23,820","$21,761",2,4,200,24,31,2778,101,172
Acura,TSX 4dr,Sedan,Asia,Front,"$26,990","$24,647",2.4,4,200,22,29,3230,105,183
Acura,TL 4dr,Sedan,Asia,Front,"$33,195","$30,299",3.2,6,270,20,28,3575,108,186
Acura,3.5 RL 4dr,Sedan,Asia,Front,"$43,755","$39,014",3.5,6,225,18,24,3880,115,197
Acura,3.5 RL w/Navigation 4dr,Sedan,Asia,Front,"$46,100","$41,100",3.5,6,225,18,24,3893,115,197
Acura,NSX coupe 2dr manual S,Sports,Asia,Rear,"$89,765","$79,978",3.2,6,290,17,24,3153,100,174
Audi,A4 1.8T 4dr,Sedan,Europe,Front,"$25,940","$23,508",1.8,4,170,22,31,3252,104,179
Audi,A41.8T convertible 2dr,Sedan,Europe,Front,"$35,940","$32,506",1.8,4,170,23,30,3638,105,180

Then to read it back in we could have an empty dataset that defines the variables in the order they appear in the CSV file. That attached any required informats. And any desired formats or labels.

So for that data you might have this empty dataset.

data cars_template;
  length Make $13 Model $40 Type $8 Origin $6 DriveTrain $5
    MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway
    Weight Wheelbase Length 8
  ;
  informat MSRP Invoice comma. ;
  format MSRP Invoice dollar8. ;
  label 
    EngineSize='Engine Size (L)' 
    MPG_City='MPG (City)'
    MPG_Highway='MPG (Highway)' 
    Weight='Weight (LBS)'
    Wheelbase='Wheelbase (IN)' 
    Length='Length (IN)'
  ;
  stop;
  call missing(of _all_);
run;

Note the attachment of the COMMA informat to the MSRP and INVOICE variables. This will allow SAS to properly interpret the strings in the CSV file that contain commas and dollar signs.

Which you could then use that dataset in your data step that reads in your CSV file.

data cars;
  if 0 then set cars_template;
  infile csv dsd truncover firstobs=2;
  input (_all_) (+0);
run;

If you try it and then compare the results to SASHELP.CARS you will only three differences. We did not add a member label to our new CARS dataset in the WORK library. We had to attach the COMMA informat to the two variables that had the DOLLAR format attached to them so that the INPUT statement would know how to convert the strings with commas in them back into numbers.

And finally you will see that whoever made the original SASHELP.CARS dataset somehow managed to add a leading space into the value of MODEL variable. It is not really possible to preserve such leading spaces in character variables when using the DSD option to read a delimited text file (ie a CSV file).

If you want to make an actual dataset that described the data in the CSV you would probably want to have the following fields:

VARNUM - The order the variables appear in the CSV file.
NAME - The name for the variable. Limit 32 bytes.
TYPE - Is this a numeric or character variable.
LENGTH - Storage length of the variable. Numeric variables should generally be stored in the full 8 bytes it takes to store the 64-bit Binary floating point numbers that SAS uses.
INFORMAT - Informat specification to attach to the variable.
FORMAT - Format specification to attach to the variable.
LABEL - Descriptive label for the variable.

If you have a lot different CSV file then you probably need to add MEMNAME variable to provide a name for the dataset. In which case you might also want to include a MEMLABEL variable.

You don't need to provide that data in a SAS dataset. Instead you could provide it as another CSV file.

It it is then pretty easy to use that data to generate the SAS code you would need to either read in the CSV file or to just create an empty dataset with that structure.

For easier SAS code generation I sometimes do not include a separate TYPE variable in the metadata dataset. Instead the LENGTH variable is character and has the text you would use in the SAS LENGTH statement. So character variables have a $ prefixed to the length.

Season · Posted 07-30-2025 03:36 AM

Thank you so much for your nice, patient and thorough explanation! This is of great help for me to understand the intricacies of reading delimited files like CSV.

@Tom wrote:

I think perhaps you misunderstood. In that situation the CONTENT of the TEMPLATE dataset is not important. Just its definition.

@Tom wrote:
If you want to make an actual dataset that described the data in the CSV you would probably want to have the following fields:

VARNUM - The order the variables appear in the CSV file.

NAME - The name for the variable. Limit 32 bytes.

TYPE - Is this a numeric or character variable.

LENGTH - Storage length of the variable. Numeric variables should generally be stored in the full 8 bytes it takes to store the 64-bit Binary floating point numbers that SAS uses.

INFORMAT - Informat specification to attach to the variable.

FORMAT - Format specification to attach to the variable.

LABEL - Descriptive label for the variable.

My original question was on the structure of the dataset, namely the variables in the dataset "template" that should be present for a successful importation of CSV. So it turns out that such datasets should be more complex than I envisioned. Seven (possibly six) instead of two variables need to be specified for importing CSV files with a particular combination of variables. Thank you for the checklist of variables you provide.

@Tom wrote:

So for that data you might have this empty dataset.

data cars_template;
  length Make $13 Model $40 Type $8 Origin $6 DriveTrain $5
    MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway
    Weight Wheelbase Length 8
  ;
  informat MSRP Invoice comma. ;
  format MSRP Invoice dollar8. ;
  label 
    EngineSize='Engine Size (L)' 
    MPG_City='MPG (City)'
    MPG_Highway='MPG (Highway)' 
    Weight='Weight (LBS)'
    Wheelbase='Wheelbase (IN)' 
    Length='Length (IN)'
  ;
  stop;
  call missing(of _all_);
run;

I have a small question for the last sentence. I searched for the utility of the CALL MISSING routine and found that the parentheses that immedidately follow "call missing" is for storing variable names. Does it suffice to use "_all_" instead of "of _all_" here?

@Tom wrote:

Which you could then use that dataset in your data step that reads in your CSV file.
data cars;
  if 0 then set cars_template;
  infile csv dsd truncover firstobs=2;
  input (_all_) (+0);
run;

I have another question on the last sentence in the DATA step quoted here. I searched for the meaning of "+numeric variable" in the INPUT statement and found that it was used for moving the pointer numeric variable number of columns. I found that "+0" was purposely appended in your edition of the script, so why is it necessary to have this specified here? In addition, does "_all_" have to be flanked by parentheses?

@Tom wrote:
If you try it and then compare the results to SASHELP.CARS you will only three differences. We did not add a member label to our new CARS dataset in the WORK library. We had to attach the COMMA informat to the two variables that had the DOLLAR format attached to them so that the INPUT statement would know how to convert the strings with commas in them back into numbers.

I ran the above code and found one more tiny difference after running two PROC CONTENTS codes- the encoding of work.cars which we create by this code is wlatin1 Western (Windows) while that of sashelp.cars is us-ascii ASCII (ANSI). This difference may be inconsequential, though, as this dataset does not contain any special character that may only be shown correctly in certain encodings.

I would also like to confirm the meaning of "member label" you mentioned in this paragraph. Does it refer to the "Label" that is displayed in the Attributes section of the output result by PROC CONTENTS? There is a difference here- while the "Label" for sashelp.cars is "2004 Car Data", that of work.cars is missing.

Tom · Posted 07-30-2025 08:37 AM

The OF keyword is required to use a variable list with a function call. Otherwise it will expect commas between the values passed to it. It really only makes sense to use a variable list in a function call with a function that accepts an opened ended list of parameters.

The INPUT statement is very powerful (and the PUT statement also). There are many different types of options you can use. This is example of using variable lists and format lists. It very familiar to those of you who remember the FORTRAN statement FORMAT.

For it to work you place the list of variables inside one set of parentheses and then a list of informat (formats for the PUT statement) specifications inside the second set. In this case I did not want to use any actual informats, instead I wanted the INPUT statement to use its defaults or whatever INFORMAT was attached the the variable. But the the format list cannot be empty. Luckily you can also include other modifiers in the list. I chose to use the +0 modifier since it has no effect. I could also have used the ~ modifier, but it is less obvious that ~ has no effect than +0.

The + modifier moves the cursor by the number that follows it. It can just be a nonnegative number, as in this case. It could be a variable, in which case the number of spaces to move could vary depending on the current value of the variable. But it can also be an expression, in which case you will need parentheses around the expression.

You can attach labels to variables and to datasets. SAS calls datasets members of the library they are in. Hence the use of member label to disambiguate which label you are referring to. If you look at the dataset generated by the OUT= option of the PROC CONTENTS statement you will see both a LABEL and a MEMLABEL variable.

You can get away with fewer variables in your metadata that describes the fields in your delimited text file as long as you can derive the things you need from them. I just find it clearer to use variable names that reflect how SAS refers to those concepts. The variables I listed as similar to those the PROC CONTENTS output datasets use. Or the variables used in the DICTIONARY.COLUMNS metadata view. Note that those two have different ways of coding how they store the TYPE and the FORMAT/INFORMAT information.

Season · Posted 07-31-2025 11:36 AM

Thank you so much for your detailed explanation!

Season · Posted 07-31-2025 11:40 AM

@Tom wrote:
If you want to make an actual dataset that described the data in the CSV you would probably want to have the following fields:

VARNUM - The order the variables appear in the CSV file.

NAME - The name for the variable. Limit 32 bytes.

TYPE - Is this a numeric or character variable.

LENGTH - Storage length of the variable. Numeric variables should generally be stored in the full 8 bytes it takes to store the 64-bit Binary floating point numbers that SAS uses.

INFORMAT - Informat specification to attach to the variable.

FORMAT - Format specification to attach to the variable.

LABEL - Descriptive label for the variable.

If you have a lot different CSV file then you probably need to add MEMNAME variable to provide a name for the dataset. In which case you might also want to include a MEMLABEL variable.

You don't need to provide that data in a SAS dataset. Instead you could provide it as another CSV file.

It it is then pretty easy to use that data to generate the SAS code you would need to either read in the CSV file or to just create an empty dataset with that structure.

Another question I have on your response concerns storing metadata on external files like CSV's (Excel would also be OK,right?) as I find it handy.

My question is: how do I let SAS know what the variables in the external file stand for? For instance, how do I let SAS know that one particular column is used for housing variable names while another is for storing informats? Do the variable names in the external file have to have the very same names as you specified in your checklist?

Tom · Posted 07-31-2025 11:45 AM

Basically you make a plan and follow the plan. So tell whoever you are sending the data and metadata to what the metadata is and how to use it. (Even if the person you are sending it to is yourself.)

There have been many attempts to define ways to standardize how to document the fields in a CSV file.

Here is one link I found just now using the query: https://www.google.com/search?q=annotated+csv+files

https://medium.com/swlh/in-this-post-we-share-how-to-interpret-an-annotated-csv-the-flux-query-resul...

Season · Posted 07-31-2025 10:31 PM

Thank you for your response, but I afraid that you had not understood my question. Let me still take the importation of the dataset sashelp.cars as an example. Your response seemed to center around how SAS can understand what the variables in the dataset sashelp.cars or work.cars mean.

However, if we chose to import the CSV via another Excel spreadsheet or CSV file instead of typing them manually in the DATA step code, our work can be partitioned as follows: (1) create an Excel spreadsheet or CSV file that contains the information needed in the "template" dataset; (2) import the spreadsheet or CSV into SAS and name the imported dataset "template"; (3) import the CSV file containing the sashelp.cars data with the help of the "template" dataset.

My question is on the first step of this trio. How should we structure the external spreadsheet or CSV file? We can well arrange it in this format:

VARNUM NAME TYPE LENGTH INFORMAT FORMAT LABEL
1     Make String  13                   Brand of car

So I want to use this to tell SAS that the variable Make comes first in the CSV and is a character variable with length 13 and no associated informat or format (two missing cells in the columns under the heading "informat" and "format"). I wonder if such formulation can really let SAS understand what I wish to convey.

Kurt_Bremser · Posted 08-01-2025 09:44 AM

A text file is always read with a DATA step, so you must write the code for it. Writing code can be automated through different means, e.g. CALL EXECUTE or writing code to a temporary file you later call with %INCLUDE.

To automate code writing, you need to have the necessary information for creating the dynamic parts of the code in data.

Take an example for a text-reading DATA step, and inspect it for the parts that need to be made dynamic. Then you wiill know what to put into a metadata dataset.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Re: Importing multiple large CSV files with varying data formats in one column

Re: Importing multiple large CSV files with varying data formats in one column

Re: Importing multiple large CSV files with varying data formats in one column

Re: Importing multiple large CSV files with varying data formats in one column

Re: Importing multiple large CSV files with varying data formats in one column

Re: Importing multiple large CSV files with varying data formats in one column

Re: Importing multiple large CSV files with varying data formats in one column

Re: Importing multiple large CSV files with varying data formats in one column

Re: Importing multiple large CSV files with varying data formats in one column

Re: Importing multiple large CSV files with varying data formats in one column

Re: Importing multiple large CSV files with varying data formats in one column

Re: Importing multiple large CSV files with varying data formats in one column

Re: Importing multiple large CSV files with varying data formats in one column

Re: Importing multiple large CSV files with varying data formats in one column

Re: Importing multiple large CSV files with varying data formats in one column

Registration is open

Call for Content EXTENDED

Registration is open

Call for Content EXTENDED

SAS Training: Just a Click Away