Re: Reading raw files

BURHAN_CIGDEM · Posted 03-12-2018 11:13 AM

Hello all,

Is there any difference between these two pieces of code?

data work.sfosch;
   infile '/folders/myfolders/sasuser.v94/prog1data/sfosch.txt';
   input FlightID $ 1-7 RouteID $ 8-14 Destination $ 18-20;
run;

and

proc import out=work.sfoch
   datafile='/folders/myfolders/sasuser.v94/prog1data/sfosch.txt'
   dbms=tab;
run;

As I understand it, the first code works when the data has no specific delimiter, while the second can only be applied if the delimiter of the raw data is a tab. Am I correct?

Thanks.

RW9 · Posted 03-12-2018 11:25 AM

Nope. In your first step you explicitly state which variables to read, where each one starts, and how many characters to read. No delimiter is involved: it will read characters 1-7 and put them into the first variable.

The second, proc import, is a guessing procedure. You're leaving the whole thing up to the software to guess your data. Sure, the delimiter it looks for is a tab, and if characters 1-7 have data and are followed by a tab, then you should get the same result, but proc import is still guessing and may get it wrong. If, however, there are 8 characters and then a tab, your variable will have 8 characters, whereas the first code would read only 7 of them.
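
To make the truncation point concrete, here is a minimal sketch using inline data. The two flight IDs are made-up values for illustration, not taken from sfosch.txt.

```sas
/* Sketch only: the values below are hypothetical. */
data work.column_demo;
   /* Column input reads exactly columns 1-7, whatever follows them. */
   input FlightID $ 1-7;
   datalines;
IA11200
IA112005
;
run;

proc print data=work.column_demo;
run;
```

Both observations should end up with the same 7-character FlightID, because the eighth character of the second record lies outside the column range and is never read.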

BURHAN_CIGDEM · Posted 03-12-2018 12:57 PM

I guess that if there is a delimiter in the raw data (e.g. a tab), the second code is easier to write. But the first code depends on each variable's length, position, etc.

Kurt_Bremser · Posted 03-12-2018 11:28 AM

Yes, there is a fundamental difference, as in the first code you decide how data is read, while in the second you rely on the guesses that proc import has to make.

Second, in the data step you use specific positions for your columns, while in the proc import SAS will write a data step that uses dlm='09'x in the infile statement, and scans dynamically for delimiters in each input line.
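
As a rough illustration of that kind of step (a hand-written sketch, not the exact code proc import generates), a delimited read could look like the following. It assumes a hypothetical tab-delimited copy of the file that holds only these three fields and has no header row; the file name and the lengths are assumptions.

```sas
/* Sketch only: sfosch_tab.txt is a hypothetical tab-delimited file
   with exactly three fields per line and no header row. */
data work.sfosch_tab;
   infile '/folders/myfolders/sasuser.v94/prog1data/sfosch_tab.txt'
          dlm='09'x dsd truncover;
   /* Lengths are set by the programmer here, not guessed from the data. */
   length FlightID $7 RouteID $7 Destination $3;
   /* List input: each field ends at the next tab, not at a fixed column. */
   input FlightID $ RouteID $ Destination $;
run;
```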


BURHAN_CIGDEM · Posted 03-12-2018 01:06 PM

Well, as long as there is a delimiter in the raw data, isn't proc import enough for us? Can I trust the proc import step, or should I write my own data step?

ballardw · Posted 03-12-2018 07:28 PM

@BURHAN_CIGDEM wrote:
Well, as long as there is a delimiter in the raw data, isn't proc import enough for us? Can I trust the proc import step, or should I write my own data step?

If you read multiple files that should use the same layout you will find, unless the data is exceptionally cleanly formatted, that the lengths of character variables are likely to change from file to file. This means that when the data sets are combined you will get warnings about potentially truncated variables, and sometimes actually truncated values.

If your data files have fields that mix all-digit values with values containing digits plus other characters, such as zip codes (12345 and 12345-4432 with zip plus 4), account numbers (123456789 or MB988847), or product codes (12345 or 9-12345), you may find that a variable changes type between data sets. Attempts to combine those data sets will then fail, because a variable can only be of one type.
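
One way to sidestep both problems is to declare the types and lengths yourself. The following is only a sketch: the file path, the field order, the header row, and the lengths are all hypothetical and would have to match your actual file specification.

```sas
/* Sketch only: path, field order and lengths are hypothetical. */
data work.accounts;
   /* firstobs=2 assumes the first line holds column names. */
   infile '/path/to/accounts.txt' dlm='09'x dsd truncover firstobs=2;
   /* Identifiers are declared as character with fixed lengths, so every
      file read with this step yields the same type and length. */
   length Zip $10 Account $9 ProductCode $7;
   input Zip $ Account $ ProductCode $;
run;
```

Because nothing is guessed from the data, a file that happens to contain only all-digit zip codes still produces a character variable of length 10.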

You may even have variables change names depending on how clean your data supplier is. I worked with one client that asked why we had monthly billings for changing code. They would change the order and names of columns constantly (2 or 3 times a month) in the files they sent us. Proc Import in that case would have created multiple data sets with different variables requiring additional steps to get all of the "productname" variables into one variable of a useable length.

Proc Import guesses. If you are going to use Proc Import for delimited files, I recommend habitually using the GUESSINGROWS option to give the best chance of good data. Otherwise SAS examines only the first 20 rows of data to determine variable types and lengths. If a value is only populated sometimes, like "fifth child name", and it doesn't appear in the first 20 lines of data, you are likely to end up with a variable only one character long, which truncates the names that do occur.
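
For example, a minimal sketch assuming the file really is tab-delimited and its first row holds the column names (the output data set name is made up):

```sas
/* Sketch only: assumes a tab-delimited file with a header row. */
proc import datafile='/folders/myfolders/sasuser.v94/prog1data/sfosch.txt'
            out=work.sfosch_import   /* hypothetical output name */
            dbms=tab
            replace;
   getnames=yes;        /* take variable names from the first row */
   guessingrows=max;    /* scan all rows before deciding types and lengths */
run;
```

Scanning every row makes the import slower on large files, and if your release does not accept MAX, a large number can be given instead. Either way the procedure is still guessing, just from more evidence.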

Kurt_Bremser · Posted 03-13-2018 02:18 AM

Since proc import creates a data step for text files, you can use it to quickly get code which you can then adapt to the file specification.

With experience, you will find that writing such steps directly is quicker for you.

