BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
jdadams
Fluorite | Level 6

Hi, I'm trying to import a semicolon-delimited text file into SAS.  The file is large, consisting of ~35 million observations.   Data transfer stops at observation 10,892,049.  The file consist of technical documents that have titles.  It has six variables (not enclosed in parentheses), and the fourth variable is the title of the document, which is VARCHAR 1500--you can see why the title could be very long, for example in the case of a chemistry paper that reports on a large biological molecule, while in other cases, it could be very short.  Titles contain heterogeneous text data.  When I examine the final observation, 10,892,049, execution ceases in the middle of the title field, with character '/'.  And yes, I know that this is a special character in SAS.

 

So, I have three questions.  First, I do not really need the title variable.  Using INFILE and INFORMATS, is there a way to skip over the title field?  Second,  and alternatively, can I tell SAS to accept special characters as part of the title field?  Finally, I am not thinking that memory problems are causing execution to stop, but am I correct in this?

 

I am running SAS 9.4 on Windows 10 installed on an HP server.  The code that I have been using so far is PROC IMPORT, and that code is as follows:

 

options nocenter replace ls=76 ps=54;


libname tlaivs 'C:\Adams_Data\Adams_NSF1_Commerc_Compsci\Database_IVs\sasdsns';

 

proc import datafile="C:\Adams_Data\Adams_NSF1_Commerc_Compsci\Database_IVs\item.txt"
    dbms=dlm
    out=tlaivs.item
    replace;

 

    delimiter=';';
    getnames=yes;

    guessingrows=MAX;
run;

 

proc contents data=tlaivs.item;
run;

 

--------

The log file looks like this:

 

NOTE: Copyright (c) 2016 by SAS Institute Inc., Cary, NC, USA.
NOTE: SAS (r) Proprietary Software 9.4 (TS1M6)
Licensed to JAMES ADAMS, Site 70250096.
NOTE: This session is executing on the X64_10PRO platform.

 

NOTE: Analytical products:

SAS/STAT 15.1

NOTE: Additional host information:

X64_10PRO WIN 10.0.18362 Workstation

NOTE: SAS initialization used:
real time 1.43 seconds
cpu time 1.28 seconds

1 options nocenter replace ls=76 ps=54;
2 libname tlaivs
2 ! 'C:\Adams_Data\Adams_NSF1_Commerc_Compsci\Database_IVs\sasdsns';
NOTE: Libref TLAIVS was successfully assigned as follows:
Engine: V9
Physical Name:
C:\Adams_Data\Adams_NSF1_Commerc_Compsci\Database_IVs\sasdsns
3
4 proc import
4 ! datafile="C:\Adams_Data\Adams_NSF1_Commerc_Compsci\Database_IVs\item.tx
4 ! t"
5 dbms=dlm
6 out=tlaivs.item
7 replace;
8
9 delimiter=';';
10 getnames=yes;
11 guessingrows=MAX;
12 run;

13 /*********************************************************************
13 ! *
14 * PRODUCT: SAS
15 * VERSION: 9.4
16 * CREATOR: External File Interface
17 * DATE: 11OCT19
18 * DESC: Generated SAS Datastep Code
19 * TEMPLATE SOURCE: (None Specified.)
20 **********************************************************************
20 ! */
21 data TLAIVS.ITEM ;
22 %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
23 infile
23 ! 'C:\Adams_Data\Adams_NSF1_Commerc_Compsci\Database_IVs\item.txt'
23 ! delimiter = ';' MISSOVER DSD lrecl=32767 firstobs=2 ;
24 informat item_id $15. ;
25 informat issue_id $10. ;
26 informat item_number best32. ;
27 informat title $819. ;
28 informat doc_type $32. ;
29 informat ref_count $4. ;
30 format item_id $15. ;
31 format issue_id $10. ;
32 format item_number best12. ;
33 format title $819. ;
34 format doc_type $32. ;
35 format ref_count $4. ;
36 input
37 item_id $
38 issue_id $
39 item_number
40 title $
41 doc_type $
42 ref_count $
43 ;
44 if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection
44 ! macro variable */
45 run;

NOTE: The infile
'C:\Adams_Data\Adams_NSF1_Commerc_Compsci\Database_IVs\item.txt' is:

Filename=C:\Adams_Data\Adams_NSF1_Commerc_Compsci\Database_IVs\item.tx
t,
RECFM=V,LRECL=32767,
File Size (bytes)=4792868783,
Last Modified=16Jul2015:16:45:04,
Create Time=07Oct2019:08:55:42

NOTE: 10892049 records were read from the infile
'C:\Adams_Data\Adams_NSF1_Commerc_Compsci\Database_IVs\item.txt'.
The minimum record length was 40.
The maximum record length was 865.
NOTE: The data set TLAIVS.ITEM has 10892049 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 50.38 seconds
cpu time 11.95 seconds


10892049 rows created in TLAIVS.ITEM from
C:\Adams_Data\Adams_NSF1_Commerc_Compsci\Database_IVs\item.txt.

 

NOTE: TLAIVS.ITEM data set was successfully created.
NOTE: The data set TLAIVS.ITEM has 10892049 observations and 6 variables.
NOTE: PROCEDURE IMPORT used (Total process time):
real time 31:17.49
cpu time 30:35.93


46
47 proc contents data=tlaivs.item;
NOTE: Writing HTML Body file: sashtml.htm
48 run;

NOTE: PROCEDURE CONTENTS used (Total process time):
real time 2.98 seconds
cpu time 0.57 seconds

 

------------------------------------

 

Thank you for your help! I regret having to issue this post, but none of the white papers that I have read have directly addressed the questions that I have posed above.

 

Sincerely,

James D. Adams

SAS User

1 ACCEPTED SOLUTION

Accepted Solutions
jdadams
Fluorite | Level 6

I would like to thank everyone for their generous replies to my original post, and I want to report on the eventual solution that I hit upon.

 

First, thank you Patrick, reading just the first character of the up to 1500 character title field worked very well.  And thanks to all for the commentary.

 

It turns out that the problem reading in the file had to do with Windows 10 text files.  Sometimes a DOS End of File or EOF character gets inserted in the midst of these files, ending execution as soon as it is encountered.  This character is hexadecimal '1A'x.

 

To solve this, you insert a IGNOREDOSEOF  into the options section of the INFILE command.  My code looked like this, in the meantime adopting Patrick's suggestion of reading just the first character of the very long title variable, which I did not need:

 

options nocenter replace ls=80 ps=54;
libname tlaivs 'C:\Adams_Data\Adams_NSF1_Commerc_Compsci\Database_IVs\sasdsns';

data tlaivs.item2;
infile 'C:\Adams_Data\Adams_NSF1_Commerc_Compsci\Database_IVs\item.txt' delimiter = ';' DSD MISSOVER lrecl = 1600 firstobs=2 ignoredoseof;

input
item_id :$15.
issue_id :$10.
item_number :best32.
_dummy :$1.
doc_type :$32.
ref_count :$4.
;

 

drop _dummy;
run;

 

proc contents data=tlaivs.item2;
run;

 

This suggestion of inserting IGNOREDOSEOF is taken from a blog post by SAS guru Charley Mullin, written on April 17, 2015, and entitled "Turning text files into SAS data sets--six common problems and their solution".

 

This did the trick.  Instead of having 10+ million records, my file now includes all 35+ million records.  I hope that this rediscovery helps others who are having trouble reading in text files into SAS under Windows 10.  At least it is one more weapon in the arsenal.

View solution in original post

5 REPLIES 5
SASKiwi
PROC Star

In the SAS log you posted you will note there is a DATA step starting - data TLAIVS.ITEM; - and ending with - run; -.

 

The first step to getting this program working is to copy the DATA step lines into your program editor so you can then customise how it runs. The trick to doing a "block" copy from the SAS log to avoid the line numbers is to press the ALT key while highlighting the code block with your mouse.

 

Once you have the program in the program editor, remove the line - title $ - to stop reading the title you don't want. Now re-run your program then tell us how you got on.

Tom
Super User Tom
Super User

You shouldn't need to resort to using a guessing procedure to read a file with only 6 variables. It will be faster to type the variable names yourself than it would be to type the PROC IMPORT code.

 

From your description of what is in the file the biggest problem might be that the file is not properly formatted to be read.  Make sure that none of the titles or other variables contain end of line characters.  Also make sure that if any of the values contain the delimiter character (semi-colon) that those values are enclosed in quotes.

 

Try just reading it with a simple data step.

data want (compress=yes);
  infile "C:\Adams_Data\Adams_NSF1_Commerc_Compsci\Database_IVs\item.txt"
  dsd dlm=';' truncover firstobs=2;
  length var1-var6 $2000 ;
  input var1-var6;
run;

Then check it and perhaps make adjustments.  Check things like: that the values look like they ended up in the right columns.  That the number of observations matches how many should be in the file. etc.

 

Patrick
Opal | Level 21

@jdadams 

Memory won't be the issue as SAS only reads one line at a time into memory.

If reading a file using a delimiter then you can't really skip reading a column in the middle - but you can read only a single character from this column and then drop it so it doesn't get added to the target table.

 

Below code based on the log you've posted should work as long as there isn't a semicolon in your free-text column "title". If there can be semicolons in your text then things become a bit more complicated and one would have to count semicolons to work out where the actual data for column doc_type starts.

data TLAIVS.ITEM;
  infile
     'C:\Adams_Data\Adams_NSF1_Commerc_Compsci\Database_IVs\item.txt'
     delimiter = ';' truncover DSD lrecl=1600 firstobs=2;
  input
    item_id :$15.
    issue_id :$10.
    item_number :best32.
    _dummy :$1.
    doc_type :$32.
    ref_count :$4.
    ;
  drop _dummy;
run;

 

I believe a forward slash in column doc_type shouldn't cause any issues. Only an end-of-line character would - but it appears your file is on Windows so end-of-line will be CRLF and even an LF only shouldn't create issues.

 

You could use below code to check how many semicolons you've got in a row of source data.

data test;
  infile
     'C:\Adams_Data\Adams_NSF1_Commerc_Compsci\Database_IVs\item.txt'
     delimiter = ';' MISSOVER DSD lrecl=1600 firstobs=2;
  input;
  data_row=_n_;
  delim_count=countc(_infile_,';');
run;

proc freq data=test;
  table delim_count /nocum nopercent;
run;

 

And just in case here how you could read your data if there are semicolons in your free text column.

libname TLAIVS "%sysfunc(pathname(work))";
filename have temp;
data _null_;
  file have;
  put 'some header';
  put 'Case 1;2;3;4;5;6';
  put 'Case 2;2;3;4;4;5;6';
  put 'Case 3;2;3;44;4;4;5;6';
  put 'Case 4;2;3;4;5;6';
  put 'Case 5;2;3;4;5';
  put 'Case 6;2;3;4;4;5';
  stop;
run;

data TLAIVS.ITEM(drop=_:);
  infile
/*     'C:\Adams_Data\Adams_NSF1_Commerc_Compsci\Database_IVs\item.txt'*/
     "%sysfunc(pathname(have))"
     delimiter = ';' truncover DSD lrecl=1600 firstobs=2;

  input
    item_id :$15.
    issue_id :$10.
    item_number :best32.
    @;

  _loop_stop=max(1,countc(_infile_,';')-4);
  do _i=1 to _loop_stop;
    input _dummy :$1. @;
  end;

  input
    doc_type :$32.
    ref_count :$4.
    ;
run;

proc print;
run;

Above still doesn't cover for Case 6 where you've got a semicolon in your free text column but a missing in your last column without the last semicolon added to the data. This case is a bit harder to code for so let us know if this is even possible/a problem in your actual data (a count of semicolons <5 would indicate that there is such a possibility).

Capture.JPG

jdadams
Fluorite | Level 6

I would like to thank everyone for their generous replies to my original post, and I want to report on the eventual solution that I hit upon.

 

First, thank you Patrick, reading just the first character of the up to 1500 character title field worked very well.  And thanks to all for the commentary.

 

It turns out that the problem reading in the file had to do with Windows 10 text files.  Sometimes a DOS End of File or EOF character gets inserted in the midst of these files, ending execution as soon as it is encountered.  This character is hexadecimal '1A'x.

 

To solve this, you insert a IGNOREDOSEOF  into the options section of the INFILE command.  My code looked like this, in the meantime adopting Patrick's suggestion of reading just the first character of the very long title variable, which I did not need:

 

options nocenter replace ls=80 ps=54;
libname tlaivs 'C:\Adams_Data\Adams_NSF1_Commerc_Compsci\Database_IVs\sasdsns';

data tlaivs.item2;
infile 'C:\Adams_Data\Adams_NSF1_Commerc_Compsci\Database_IVs\item.txt' delimiter = ';' DSD MISSOVER lrecl = 1600 firstobs=2 ignoredoseof;

input
item_id :$15.
issue_id :$10.
item_number :best32.
_dummy :$1.
doc_type :$32.
ref_count :$4.
;

 

drop _dummy;
run;

 

proc contents data=tlaivs.item2;
run;

 

This suggestion of inserting IGNOREDOSEOF is taken from a blog post by SAS guru Charley Mullin, written on April 17, 2015, and entitled "Turning text files into SAS data sets--six common problems and their solution".

 

This did the trick.  Instead of having 10+ million records, my file now includes all 35+ million records.  I hope that this rediscovery helps others who are having trouble reading in text files into SAS under Windows 10.  At least it is one more weapon in the arsenal.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 5 replies
  • 2043 views
  • 3 likes
  • 5 in conversation