BookmarkSubscribeRSS Feed
Asim1
Fluorite | Level 6

I am learning base SAS programming. I have seen lots of examples in books and other websites of importing txt,csv files using different styles. My question is what is the effective way to import a file (say a text file) which does not have a uniform space delimiter. If the dataset is small then i can specify the column numbers of each variable, but what if the data set is large, what then? I have attached a sample txt file .

SAS version : university edition

OS: Windows 8.1 64 bits

13 REPLIES 13
ballardw
Super User

If your data is fixed colums then hopefully you have a document that tells you what is where within the file. It is sometimes realitively easy to turn a specification document into SAS variable Informat and input statements and often labels. Without a document how do you know what anything is or where a field stops?

 

You example data appears to have a location and year combined in a single field OR it is not fixed width.

 

Suppose you receive a line such as this:

YELLOWSTONE ID/MT/WY 1872 4065493344556677

how would you know that there is a break in variables 4065493 344556677 without an external descriptor?

 

Asim1
Fluorite | Level 6
@ballardw I dont have anything like 4065493 344556677 in my example data. the first line of the data is
YELLOWSTONE ID/MT/WY 1872 4,065,493
LinusH
Tourmaline | Level 20

Looks like a fixed width file (depending on how you want to import the state/no field.

If you want to separate them, it would probably work better to that in a separate operation.

What kind of SAS installation do you have access to? (Base SAS, Enterprise Guide/SAS Studio...)?

Data never sleeps
Asim1
Fluorite | Level 6
@LinusH I have SAS studio.
DMO467
SAS Employee

Proc import also work for space-delimited files. When dbms = dlm is indicated, SAS assumes the file is space-delimited unless otherwise specified (as we did above for commas and tabs). 

We first show sample a program for reading in a space-delimited file that does not include variable names. SAS creates default variable names as VAR1-VARn in when we do not provide variable names.

proc import datafile="practice_4.txt" out=mydata dbms=dlm replace;
getnames=no;
run;


Next, if your space-delimited file contains variable names, you change the
getnames option

proc import datafile="practice_4.txt" out=mydata dbms=dlm replace;
getnames=yes;
run;



If you wish to create a permanent SAS data file using proc import in the directory "c:\practice4 you can use the code below. 

libname dis "c:\practice4";
proc import datafile="practice_4.txt" out=dis.mydata dbms=dlm replace;
getnames=yes;
run;


Proc Import will have to either use the first line as variable names or to create dummy variables for the file. Another problem is that the Proc Import may not properly assign attributes such as the variable length and format.

The Data Step with infile and input statement may provide better control in this aspect. You will,however need to specify the start and end for each variable on the input statement.

When you run proc import, code for a data step is generated behind the scenes and that is what is actually run to read in your data. This data step can be found in the SAS log after running proc import and can be copied, amended, and rerun.

The best solution is run PROC import first as outlined above and then view the SAS log. Then modify this code using an infile and input statement using TRUNCOVER

data WORK.MYDATA ;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile 'C:\practice_4.txt" delimiter = ' ' TRUNCOVER DSD lrecl=32767 ;
informat VAR1 $X
informat VAR2 $X
informat VAR3 $X
informat VAR4 $X
*****************************

DATA temp;
infile 'C:\practice_4.txt" delimiter = ' ' TRUNCOVER DSD lrecl=32767 ;
INPUT var1 var2
RUN;

PROC PRINT DATA=temp;
RUN;



However, hard coding hundreds or even thousands of variables, if not totally impossible, is very tedious and time consuming, increasing the possibility of human errors. Also the contents of the dataset may change from time to time, making it even more difficult for SAS programmers to update and to maintain the codes.

There is a good SUGI paper that describes how to efficiently import text files with a large number of variables.

See: http://www.lexjansen.com/scsug/2007/data/Data-Hu.pdf




Asim1
Fluorite | Level 6
@DMO467 i read the pdf, its very helpful. Can you plz explain what does missover and lrecl does ?
Kurt_Bremser
Super User

LRECL= sets the maximum line size that can be read. The default is 256.

MISSOVER specifies that, if not enough data is found on a line to satisfy the whole INPUT statement, the remaining variables are set to missing.

This prevents erroneous skipping of lines.

But I would use TRUNCOVER instead. If not enough data is present for a given variable, MISSOVER will set it to missing, while TRUNCOVER fills the variable with the present partial data.

 

If you don't know what a SAS statement or option does, doing a search for "SAS xxxxx" (xxxxx being the SAS word you are looking for) will almost always yield an answer.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

I would just like to clarify on this point:

"However, hard coding hundreds or even thousands of variables, if not totally impossible, is very tedious and time consuming, increasing the possibility of human errors. Also the contents of the dataset may change from time to time, making it even more difficult for SAS programmers to update and to maintain the codes."

 

Two points.

Firstly, when transfering data, to my mind, the most important part is the documentation.  Create an import agreement, agree with the vendor what they should send, what variables, what formats etc.  Have it written down in writing and signed off.  This will help you in three ways:

- It is a documented and validatable system

- Your coding will be simply copying from the documentation, and it can be versioned inline with the documentaion

- The file will not change, or the agreement is broken and the process needs to start again - good for budgets/time spent

 

Secondly, the file format itself.  Generally speaking, if a file has a hundred variables, then I would recommend a normalised structure - if your familiar with CDISC, or databases then you will know what that means.  So long data, rather than wide data.  Two many variables becomes unweildy and harder to code.  Of course you can always transpose up if there is a need (say for a report).

 

Simply put, keep it a simple as possible, and document, document, document, program, test, test, test, release.  Actual programming is 0.01% of setting up a data transfer system.

DMO467
SAS Employee
Yes, I agree in the meantime

MISSOVER

When the MISSOVER option is used on the INFILE statement, the INPUT statement does not jump to the next line when reading a short line. Instead, MISSOVER sets variables to
missing.

LRECL

The option is lrecl= (logical record length) is used to specify the number of characters for each line. The default LRECL in SAS is 256. The maximum is 32,767.

Example

filename myFile "c:\some directory\some file.txt" LRECL= 400;

This means each line in the text file has a logical record length of 400 characters. If the line is greater than 400 then the line will wrap.

With MISSOVER you will get a missing value if the line does not have at least 400 characters (counting trailing spaces).

With TRUNCOVER you get the first 400 characters of the line, even for short lines.

This paper explains all the different options quite well:

http://www2.sas.com/proceedings/sugi26/p009-26.pdf
Kurt_Bremser
Super User

If LRECL is too short, lines are not wrapped, they are truncated, as indicated by the message

"One or more lines were truncated."

MISSOVER does not mean that variables are set to missing when the line is shorter that LRECL, it just means that variables for which not enough (or none at all) data is present are set to missing.

So, if you read a delimited file with LRECL=400, want to input 5 strings, 4 delimiters are present with at least one character after the fourth delimiter, you will get 5 non-missing variables, even if the line was just 20 characters long.

If you read 5 strings with $70., and the actual record size of a given record is 350, no variable will be missing, although the line is shorter than 400.

TRUNCOVER and MISSOVER are affected by the amount of data present with regards to what is needed in the input statement. If the LRECL is met or not, is secondary.

Asim1
Fluorite | Level 6
Can you plz elaborate on the example that u gave?
Asim1
Fluorite | Level 6

That is a very good explanation, but i am still confused with How does SAS reads the data from the input statement ? I am attaching a word file and a raw data txt file. The word file contains screenshots of the code I have used to import the txt file into SAS - the part A and B both are same except in B I have use truncover. Part B is the correct import. Can you help me understand what is happening in part A ?

RW9
Diamond | Level 26 RW9
Diamond | Level 26

I think you will be best off reading through the documentation:

https://support.sas.com/documentation/cdl/en/acpcref/63184/HTML/default/viewer.htm#titlepage.htm

 

And maybe this white paper:

http://www2.sas.com/proceedings/sugi26/p009-26.pdf

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 13 replies
  • 4235 views
  • 8 likes
  • 6 in conversation