BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
balu_g
Fluorite | Level 6

 which way is better?

Importing csv file into SAS either using proc import processor or data step process? 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Quentin
Super User

"Better" has multiple meanings.

 

Using a DATA step gives you much more control over how the data are read.

 

I think of PROC IMPORT as useful for a quick import of a simple CSV, where I don't need anything special, and it's not part of a production job.  I think when you read a CSV with PROC IMPORT, you can see the DATA step code it generates in the log.  I think some folks will run PROC IMPORT, then use the DATA step code it generated as a starting point for their program.

 

The Boston Area SAS Users Group (BASUG) is hosting our in person SAS Blowout on Oct 18!
This full-day event in Cambridge, Mass features four presenters from SAS, presenting on a range of SAS 9 programming topics. Pre-registration by Oct 15 is required.
Full details and registration info at https://www.basug.org/events.

View solution in original post

13 REPLIES 13
Quentin
Super User

"Better" has multiple meanings.

 

Using a DATA step gives you much more control over how the data are read.

 

I think of PROC IMPORT as useful for a quick import of a simple CSV, where I don't need anything special, and it's not part of a production job.  I think when you read a CSV with PROC IMPORT, you can see the DATA step code it generates in the log.  I think some folks will run PROC IMPORT, then use the DATA step code it generated as a starting point for their program.

 

The Boston Area SAS Users Group (BASUG) is hosting our in person SAS Blowout on Oct 18!
This full-day event in Cambridge, Mass features four presenters from SAS, presenting on a range of SAS 9 programming topics. Pre-registration by Oct 15 is required.
Full details and registration info at https://www.basug.org/events.
balu_g
Fluorite | Level 6

"Using a DATA step gives you much more control over how the data are read."

 

what does mean "much more control over how the data are read" In this statement? can you explain briefly?

 

Quentin
Super User
Pls see Tom's answer, it's better. : )

PROC IMPORT hast to guess variable types, variable lengths, etc. Sometimes it guesses 'wrong.' For example, you might now a variable is defined as text by the person generating the CSV, but if you read the variable happens to have all numeric values, PROC IMPORT will read it into a numeric field.

With a DATA step, you explicitly declare the variable type, variable length, delimiter, end-of-file marker, and much much more.
The Boston Area SAS Users Group (BASUG) is hosting our in person SAS Blowout on Oct 18!
This full-day event in Cambridge, Mass features four presenters from SAS, presenting on a range of SAS 9 programming topics. Pre-registration by Oct 15 is required.
Full details and registration info at https://www.basug.org/events.
ballardw
Super User

@Quentin wrote:

"Better" has multiple meanings.

 

 

I think of PROC IMPORT as useful for a quick import of a simple CSV, where I don't need anything special, and it's not part of a production job.  I think when you read a CSV with PROC IMPORT, you can see the DATA step code it generates in the log.  I think some folks will run PROC IMPORT, then use the DATA step code it generated as a starting point for their program.

 


 

I would add to this "when using Proc Import with a large or MAX value for the Guessingrows option".

 

I use Proc Import to 1) generate basic code and 2) get a feel for the contents of the file.

What Proc Import does not deal with are:

1) variables that should be character but contain mostly digit characters. Examples are account or similar identification numbers. Some organizations have account "numbers" with leading zeroes that are important. Meaning and ID value of

"000123" is completely different than ID "0123". But Proc Import will bring both of these in as the same Id value. Once the leading 0 is stripped you have no idea if the account should be "123" "0123" "00123" "000123" .... So you use the documentation you should have from the data supplier to modify the data step to read the value as 1) character and 2) specify enough characters to hold the longest expected values.

2) Text values from different files might be of different lengths. Your first data step generated by Proc Import might have a person's last name at 12 characters. The next file you get might be much longer. So again the source of your data file should tell how long the longest values may be. [ Look for the book "The Man Whose Name Wouldn't Fit: Or, The Case of Cartwright-Chickering" by Theodore Tyler for an early and extreme result from can happen when a name field is not long enough]

3) Sparse fields. Some variables in some data sets do not get populated because it is an uncommon occurrence (relative to the size of the data files). These will get set as $1. fields by Proc Import. So they typically need to be expanded to a better width as well as the correct type. If a date, time or datetime the Informat needs to be specified.

4) Another serious consideration is variable name, name length and variable labels. I have worked with "data" files that had as many as 15 columns headed as "Total". Proc Import handles that case as "Total" and "TotalX" where x is number representing repetition. However extremely long names with the exact same characters for the first 32 will cause subsequent variables to be name VARXX, where X is the column number in the file. These are candidates for serious renaming (really, how many 32 character variable names do you want to type) and should have the full text as a label. Especially when the original column heading contained many characters not valid in SAS variable names. I have had imported variables get names like _____Some_sentence_text .

Which does bring up: make sure that Import will build standard SAS variable names unless your really like typing out the name literals "This*char?/stuffName"n .Make sure options Validvarname=V7; is set before importing anything. YMMV.

 

Tom
Super User Tom
Super User

Check out this macro that tries to do a better job of variable name generation, type detection, and writes a much cleaner data step.  It also includes a method to override its decisions by passing in metadata.

 

https://github.com/sasutils/macros/blob/master/csv2ds.sas

Differences from PROC IMPORT
- Supports header lines with more than 32,767 characters
- Supports ZIP and GZIP source files
- Generates unique variable names by adding numeric suffix
- Does not overestimate maxlength when longest value is quoted
- Does NOT force character type if all values are quoted
- Generates label when generated variable name is different than header
- Supports NAMEROW option
- Supports numeric fields with special missing values (MISSING statement)
- Does not attach unneeded informats or formats
- Allows overriding calculated metadata
- Allow using random sample of rows to guess metadata
- Generates more compact SAS code
- Generates analysis summary dataset and raw data view
- Saves generated SAS code to a file
- Forces DATE and DATETIME formats to show century
- Difference in generated V7 compatible variable names
  - Replaces adjacent non-valid characters with single underscore
Tom
Super User Tom
Super User

A CSV file has no information about how to define the variables.  The only information is has is the optional header line which is frequently used to name the variables.

 

PROC IMPORT will guess how to define the variables.  What NAME to use.  What TYPE to use (character or numeric).  If character how many bytes to use for storage.  It will also try to guess what informat to use to read the text from the CSV into values. And what format to attach to the variables to control how they are displayed as text.  And what label if any to attach to the variable.

 

If you already know how the variables should be define (or can make a better guess than PROC IMPORT, which is a very low bar to get over) then your own data step will be more accurate.

 

One of the biggest issues with PROC IMPORT is when you need to read many examples of the same type of CSV file.  With PROC IMPORT it can only use the examples it sees in the current file to makes its guesses about how to define the variables.  So when you need to combine the datasets generated by importing two or more of such files you will frequently have issues because the lengths of the character variables are different.  Or worse that a variable was defined as character one time and numeric the next time.

 

The one advantage of PROC IMPORT is that it does an ok job for a quick look at the data without you having to do much in the way of typing.  But to read a CSV with your own SAS code is very easy also.  You do not have to write a data step that looks like the ugly stuff that PROC IMPORT generates.

DavidSmith2
Fluorite | Level 6

I don't understand why there isn't a simple 'Get Data' tool that allows you to preview, modify, templatize (snippet) and load the data. MS has had this in excel for years. We wind up using excel to 'pre-treat' external data that needs to come into our environment which helps to minimize SAS guessing wrong. 
We format the column and that helps - but there is the downside of length which excel doesn't seem to support but SAS should. 

We have a lot of external data that comes into our environment in different formats. A tool like this would increase productivity.
Acceptable in 2000, not in 2022.

ballardw
Super User

@DavidSmith2 wrote:

I don't understand why there isn't a simple 'Get Data' tool that allows you to preview, modify, templatize (snippet) and load the data. MS has had this in excel for years.


And is extremely crappy for many types of files. And useless for those that mix data inside columns based on row rules. Plus your "template" built from the first file may not have all the characteristics of the next file that is supposed to have the same layout because of things like no data for a column in the first file, or short lengths of text, did not set the template properly.

 

A data step can be written to read/parse many complex file forms, such as when a record is extended across multiple rows. Your Excel get data does not do that at all.

 

Plus the point and grunt and select is not a single bit better than writing the data step from a document describing the file layout. Some very well constructed file documents, extending to literally 100's of variables, that I have worked with allowed me to read the description and use that to write the SAS data step code to read the data file and provide formats and variable labels.

 

I deal with people that provide "identical" data files that move columns constantly, or change the cell properties partway through a column so that what should be a date or customer identification number is currency, or currency as date. The data steps I write to deal with their data after their Excel is saved to a CSV format can throw errors because of such invalid data. When they move a column I only have to modify the INPUT step by moving a couple of variable names, not have to go through the point and click to set properties all over again.

 

Spreadsheets are not designed for data interchange, they are designed for accountants to play with stuff and as such often result in very suboptimal for interchange purposes.

 

I'm not sure what you mean by "length which excel doesn't seem to support but SAS should". SAS currently supports text variables up to about 32K. If you need more than that many programs will choke.

DavidSmith2
Fluorite | Level 6

Thanks for the comments.

 

The objective of my post is to say there should be an option between SAS decides and coding a step. It's a productivity hit to us to code. Often, all that is needed is a small edit rather than writing a step for (example) 30 columns. The delta is in minutes. But X people X minutes adds up. 
We're trying to get limit coding as much as possible to focus energies on understanding patterns.

 

DS

Tom
Super User Tom
Super User

@DavidSmith2 wrote:

Thanks for the comments.

 

The objective of my post is to say there should be an option between SAS decides and coding a step. It's a productivity hit to us to code. Often, all that is needed is a small edit rather than writing a step for (example) 30 columns. The delta is in minutes. But X people X minutes adds up. 
We're trying to get limit coding as much as possible to focus energies on understanding patterns.

 

DS


The main issue is that type of thing becomes essentially a task for something other than the base SAS software.  The current trend seems to be to use SAS/Studio as the user interface, so perhaps SAS could build some tool into that interface.  Or perhaps someone else could build a tool that could convert the users inputs into actual SAS code that could later be submitted to read the data.  Perhaps you could build something using Python and the saspy package.

DavidSmith2
Fluorite | Level 6

Hi Tom - Thanks for taking an interest.

 

Yes - SAS can't be the all-singing-all-dancing bear! 

 

We will be looking at other ways to speed thigs up.

 

ds

SASKiwi
PROC Star

@DavidSmith2  - In Enterprise Guide the Import Data menu tool gives you a non-coding solution, if this is available to you.

Kurt_Bremser
Super User

Writing code is faster than point-and-click BY FAR. 20+ years of SAS, and 40+ years of other coding experience talking here.

When I had to write or change data import programs, the coding was the smallest part of it, anyway. Check out/check in, testing and documenting is 90% of the work.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 2669 views
  • 6 likes
  • 7 in conversation