BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
CathyVI
Pyrite | Level 9

Hello,

I want to import a text file data into SAS. The file contain some issues that you will see in my example. 

Some of the data is missing  and the data is not well organized as you can see. Student_id and d_pin are character variable. I would like to learn how to use either infile or proc import or both to import the data CORRECTLY.

Here is sample code. I have over 10,000 records.

Student_ID|D_PIN|SERVICE_DATE
|1624413098|11/08/2023
|1628543409|11/08/2023
001020084|1615413808|10/12/2023
001021665|1616220906|03/31/2023

002781314|1462647405|07/19/2023
|1331118504|11/02/2023
|1611325610|12/26/2023

080504901||11/08/2023

 

Here is my code:

data STUDENT; *;
infile 'c:\dog\dummydata.txt' truncover delimiter = '|' firstobs=2;
format SERVICE_DATE mmddyy10.;
input Student_ID $ 1-9 D_PIN $ 11-20 SERVICE_DATE :mmddyy10.;
run;

I am getting many of the notes below. When I checked it seems it's because the original text file is not well arranged.

 

NOTE: Invalid data for SERVICE_DATE in line 1756 21-22.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1756 |1331118504|11/02/2023 22
SERVICE_DATE=. Student_ID=|13311185 D_PIN=4|11/02/20 _ERROR_=1 _N_=1755
NOTE: Invalid data for SERVICE_DATE in line 1757 21-22.
1757 |1611325610|12/26/2023 22

Thanks

 

 

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

I presume the first two data lines in your sample actually have no student_id values, yes?

 

You have to make a couple of changes.

 

  1. Include the "DSD" option on the INFILE statement.  DSD (for delimiter sensitive data) will allow the INPUT statement to assume a missing value for a text field with a length of 0 (e.g. for student_id in your first two obs).
  2. But specifying columns in your INPUT statement counter-productively overrides the DSD option.  One option is to use an INFORMAT statement as I have done in the code below.

data STUDENT; 
 * infile 'c:\dog\dummydata.txt' truncover delimiter = '|' firstobs=2;
  infile datalines delimiter = '|' dsd firstobs=2;
  informat student_id $9.  D_PIN $10.  service_date mmddyy10. ;
  format SERVICE_DATE mmddyy10.;
 * input Student_ID $ 1-9 D_PIN $ 11-20 SERVICE_DATE :mmddyy10.;
  input Student_ID D_PIN SERVICE_DATE ;
datalines;
Student_ID|D_PIN|SERVICE_DATE
|1624413098|11/08/2023
|1628543409|11/08/2023
001020084|1615413808|10/12/2023
001021665|1616220906|03/31/2023
002781314|1462647405|07/19/2023
|1331118504|11/02/2023
|1611325610|12/26/2023
080504901||11/08/2023
run;

I've preceded each of my additions with your analogous statement, commented out.

 

Notice the INPUT statement merely lists variable names, not locations, which means INPUT will utilize expected formats in the INFORMAT statement.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

4 REPLIES 4
mkeintz
PROC Star

I presume the first two data lines in your sample actually have no student_id values, yes?

 

You have to make a couple of changes.

 

  1. Include the "DSD" option on the INFILE statement.  DSD (for delimiter sensitive data) will allow the INPUT statement to assume a missing value for a text field with a length of 0 (e.g. for student_id in your first two obs).
  2. But specifying columns in your INPUT statement counter-productively overrides the DSD option.  One option is to use an INFORMAT statement as I have done in the code below.

data STUDENT; 
 * infile 'c:\dog\dummydata.txt' truncover delimiter = '|' firstobs=2;
  infile datalines delimiter = '|' dsd firstobs=2;
  informat student_id $9.  D_PIN $10.  service_date mmddyy10. ;
  format SERVICE_DATE mmddyy10.;
 * input Student_ID $ 1-9 D_PIN $ 11-20 SERVICE_DATE :mmddyy10.;
  input Student_ID D_PIN SERVICE_DATE ;
datalines;
Student_ID|D_PIN|SERVICE_DATE
|1624413098|11/08/2023
|1628543409|11/08/2023
001020084|1615413808|10/12/2023
001021665|1616220906|03/31/2023
002781314|1462647405|07/19/2023
|1331118504|11/02/2023
|1611325610|12/26/2023
080504901||11/08/2023
run;

I've preceded each of my additions with your analogous statement, commented out.

 

Notice the INPUT statement merely lists variable names, not locations, which means INPUT will utilize expected formats in the INFORMAT statement.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Tom
Super User Tom
Super User

File looks well arranged to me.  You have a DELIMITED file.  So tell SAS that in the INFILE statement by adding DSD option.

 

First let's convert your example lines into an actual file so we have something we can test agains.

options parmcards=text;
filename text temp;

parmcards;
Student_ID|D_PIN|SERVICE_DATE
|1624413098|11/08/2023
|1628543409|11/08/2023
001020084|1615413808|10/12/2023
001021665|1616220906|03/31/2023
002781314|1462647405|07/19/2023
|1331118504|11/02/2023
|1611325610|12/26/2023
080504901||11/08/2023
;

Now we can start writing a data step to read the file.  First is the DATA statement to name the dataset that will be created.  Then the INFILE statement to tell where to find the data to be READ .  Then define the variables.  Then attach INFORMAT to any variables that NEED them.  And attach FORMAT to any variables that NEED them. Then INPUT the variables, if you have defined the variables in the order they appear in the source file then the INPUT statement can just use a simple positional variable list.  Then end the step with a RUN (or if you are using in-line data a CARDS or DATALINES block).

data want;
  infile text dsd dlm='|' truncover firstobs=2;
  length Student_ID $10 D_PIN $10 SERVICE_DATE 8;
  informat SERVICE_DATE mmddyy.;
  format SERVICE_DATE yymmdd10.;
  input Student_ID -- SERVICE_DATE ;
run;

Result

Tom_0-1708025225481.png

 

HInts. 

  • Putting the FORMAT statement before you have defined the variables (or let the INPUT statement define them by implying a type based on the usage) will cause the variables to be in a different order in the dataset.  That is just going to make looking at your dataset more cumbersome.
  • Displaying dates in either MDY or DMY order will confuse half of your audience. Use DATE or YYMMDD format to display dates to avoid confusion.
CathyVI
Pyrite | Level 9

@mkeintz  @Tom  Thank you. I found something interesting this morning. I noticed there was a particular row in the txt file I imported with only the service date. e.g., 

 

Student_ID|D_PIN|SERVICE_DATE

                                   |12/04/2023

The student_ID and D_PIN are both blank in the text file. I just noticed that this line did not output correctly with the codes provided by @mkeintz . 

This is what the output looks like. The date is displayed on the D_PIN column

Student_ID       D_PIN                 SERVICE_DATE
                         12/04/2023                   .
 
How can I resolved this, Thanks in advance.
Tom
Super User Tom
Super User

The problem with that line is it only has TWO values instead of THREE because there is only one DELIMITER.

 

Look carefully at the whole file with a text editor that displays text using a FIXED width font.  Do the values of SERVICE_DATE always start in the same column?  If so you can read the file as FIXED columns instead of as DELIMITED.

 

If instead there are not extra spaces to pad the short or missing values so that D_PIN and SERVICE_DATE always start in the same place you will have to treat it as delimited.

 

If just a few have this issue of missing delimiters and they all follow this same pattern then you could possible adapt the code to test how many delimiters there are and read the line differently depending on the test.

Something like this:

data want;
  infile 'myfile.txt' dsd dlm='|' truncover;
  length Student_ID D_PIN $10 SERVICE_DATE 8;
  input @;
  if countc(_infile_,'|') = 2 then input Student_ID :$10. D_PIN :$10. SERVICE_DATE :mmddyy.;
  else input Student_ID SERVICE_DATE :mmddyy.;
  format service_date yymmdd10.;
run;

 

 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 911 views
  • 5 likes
  • 3 in conversation