I am learning base SAS programming. I have seen lots of examples in books and other websites of importing txt,csv files using different styles. My question is what is the effective way to import a file (say a text file) which does not have a uniform space delimiter. If the dataset is small then i can specify the column numbers of each variable, but what if the data set is large, what then? I have attached a sample txt file .
SAS version : university edition
OS: Windows 8.1 64 bits
If your data is fixed colums then hopefully you have a document that tells you what is where within the file. It is sometimes realitively easy to turn a specification document into SAS variable Informat and input statements and often labels. Without a document how do you know what anything is or where a field stops?
You example data appears to have a location and year combined in a single field OR it is not fixed width.
Suppose you receive a line such as this:
YELLOWSTONE ID/MT/WY 1872 4065493344556677
how would you know that there is a break in variables 4065493 344556677 without an external descriptor?
Looks like a fixed width file (depending on how you want to import the state/no field.
If you want to separate them, it would probably work better to that in a separate operation.
What kind of SAS installation do you have access to? (Base SAS, Enterprise Guide/SAS Studio...)?
Proc import also work for space-delimited files. When dbms = dlm is indicated, SAS assumes the file is space-delimited unless otherwise specified (as we did above for commas and tabs).
We first show sample a program for reading in a space-delimited file that does not include variable names. SAS creates default variable names as VAR1-VARn in when we do not provide variable names.
proc import datafile="practice_4.txt" out=mydata dbms=dlm replace;
getnames=no;
run;
Next, if your space-delimited file contains variable names, you change the
getnames option
proc import datafile="practice_4.txt" out=mydata dbms=dlm replace;
getnames=yes;
run;
If you wish to create a permanent SAS data file using proc import in the directory "c:\practice4 you can use the code below.
libname dis "c:\practice4";
proc import datafile="practice_4.txt" out=dis.mydata dbms=dlm replace;
getnames=yes;
run;
Proc Import will have to either use the first line as variable names or to create dummy variables for the file. Another problem is that the Proc Import may not properly assign attributes such as the variable length and format.
The Data Step with infile and input statement may provide better control in this aspect. You will,however need to specify the start and end for each variable on the input statement.
When you run proc import, code for a data step is generated behind the scenes and that is what is actually run to read in your data. This data step can be found in the SAS log after running proc import and can be copied, amended, and rerun.
The best solution is run PROC import first as outlined above and then view the SAS log. Then modify this code using an infile and input statement using TRUNCOVER
data WORK.MYDATA ;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile 'C:\practice_4.txt" delimiter = ' ' TRUNCOVER DSD lrecl=32767 ;
informat VAR1 $X
informat VAR2 $X
informat VAR3 $X
informat VAR4 $X
*****************************
DATA temp;
infile 'C:\practice_4.txt" delimiter = ' ' TRUNCOVER DSD lrecl=32767 ;
INPUT var1 var2
RUN;
PROC PRINT DATA=temp;
RUN;
However, hard coding hundreds or even thousands of variables, if not totally impossible, is very tedious and time consuming, increasing the possibility of human errors. Also the contents of the dataset may change from time to time, making it even more difficult for SAS programmers to update and to maintain the codes.
There is a good SUGI paper that describes how to efficiently import text files with a large number of variables.
See: http://www.lexjansen.com/scsug/2007/data/Data-Hu.pdf
LRECL= sets the maximum line size that can be read. The default is 256.
MISSOVER specifies that, if not enough data is found on a line to satisfy the whole INPUT statement, the remaining variables are set to missing.
This prevents erroneous skipping of lines.
But I would use TRUNCOVER instead. If not enough data is present for a given variable, MISSOVER will set it to missing, while TRUNCOVER fills the variable with the present partial data.
If you don't know what a SAS statement or option does, doing a search for "SAS xxxxx" (xxxxx being the SAS word you are looking for) will almost always yield an answer.
I would just like to clarify on this point:
"However, hard coding hundreds or even thousands of variables, if not totally impossible, is very tedious and time consuming, increasing the possibility of human errors. Also the contents of the dataset may change from time to time, making it even more difficult for SAS programmers to update and to maintain the codes."
Two points.
Firstly, when transfering data, to my mind, the most important part is the documentation. Create an import agreement, agree with the vendor what they should send, what variables, what formats etc. Have it written down in writing and signed off. This will help you in three ways:
- It is a documented and validatable system
- Your coding will be simply copying from the documentation, and it can be versioned inline with the documentaion
- The file will not change, or the agreement is broken and the process needs to start again - good for budgets/time spent
Secondly, the file format itself. Generally speaking, if a file has a hundred variables, then I would recommend a normalised structure - if your familiar with CDISC, or databases then you will know what that means. So long data, rather than wide data. Two many variables becomes unweildy and harder to code. Of course you can always transpose up if there is a need (say for a report).
Simply put, keep it a simple as possible, and document, document, document, program, test, test, test, release. Actual programming is 0.01% of setting up a data transfer system.
If LRECL is too short, lines are not wrapped, they are truncated, as indicated by the message
"One or more lines were truncated."
MISSOVER does not mean that variables are set to missing when the line is shorter that LRECL, it just means that variables for which not enough (or none at all) data is present are set to missing.
So, if you read a delimited file with LRECL=400, want to input 5 strings, 4 delimiters are present with at least one character after the fourth delimiter, you will get 5 non-missing variables, even if the line was just 20 characters long.
If you read 5 strings with $70., and the actual record size of a given record is 350, no variable will be missing, although the line is shorter than 400.
TRUNCOVER and MISSOVER are affected by the amount of data present with regards to what is needed in the input statement. If the LRECL is met or not, is secondary.
That is a very good explanation, but i am still confused with How does SAS reads the data from the input statement ? I am attaching a word file and a raw data txt file. The word file contains screenshots of the code I have used to import the txt file into SAS - the part A and B both are same except in B I have use truncover. Part B is the correct import. Can you help me understand what is happening in part A ?
I think you will be best off reading through the documentation:
https://support.sas.com/documentation/cdl/en/acpcref/63184/HTML/default/viewer.htm#titlepage.htm
And maybe this white paper:
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.