Hi, I am hoping you can help me read in a large txt file (about 2 to 3 GB).
ERROR: Import unsuccessful. See SAS Log for details.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE IMPORT used (Total process time):
real time 34.94 seconds
cpu time 13.95 seconds
these 2 steps (data null step and the data infile step) work GREAT!!
the only thing is the date field is all missing when it was supposed to be "1/30/2017" format
I guess I just have to define the date correctly
many thanks for the great help!
Are there any Error messages in the SAS log prior to the one you've shared? If so then please share.
For example if I execute your code then my SAS log looks like:
Do you have a locally installed SAS or are you connecting to a remote SAS Server via EG or Studio?
yes, sorry for not including in the last message: here is the error message prior. I removed some of the error message that contain PHI
14709 proc import
14710 datafile="C:\Users\rk64\Desktop\cardreport_diag_procedure\RK64_20240207_103523_Prc.txt"
14711 dbms=dlm
14712 out=work.procedure
14713 replace;
14714 getnames=YES;
14715
14716 run;
Number of names found is less than number of variables found.
Name
EMPI|EPIC_PMRN|MRN_Type|MRN|Date|Procedure_Name|Code_Type|Code|Procedure_Flag|Quantity|Provider|Clinic
|Hospital|Inpatient_Outpatient|Encounter_number truncated to
EMPI_EPIC_PMRN_MRN_Type_MRN_Date.
Problems were detected with provided names. See LOG.
14717 /**********************************************************************
14718 * PRODUCT: SAS
14719 * VERSION: 9.4
14720 * CREATOR: External File Interface
14721 * DATE: 11OCT24
14722 * DESC: Generated SAS Datastep Code
14723 * TEMPLATE SOURCE: (None Specified.)
14724 ***********************************************************************/
14725 data WORK.PROCEDURE ;
14726 %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
14727 infile 'C:\Users\rk64\Desktop\cardreport_diag_procedure\RK64_20240207_103523_Prc.txt'
14727! delimiter = ' ' MISSOVER DSD lrecl=32767 firstobs=2 ;
14728 informat EMPI_EPIC_PMRN_MRN_Type_MRN_Date $61. ;
14729 informat VAR2 $20. ;
14730 informat VAR3 $12. ;
14731 informat VAR4 $18. ;
14732 informat VAR5 $10. ;
14733 informat VAR6 $33. ;
14734 informat VAR7 $31. ;
14735 informat VAR8 $31. ;
14736 informat VAR9 $15. ;
14737 informat VAR10 $11. ;
14738 informat VAR11 $34. ;
14739 informat VAR12 $16. ;
14740 informat VAR13 $51. ;
14741 informat VAR14 $51. ;
14742 informat VAR15 $35. ;
14743 informat VAR16 $51. ;
14744 informat VAR17 $8. ;
14745 informat VAR18 $9. ;
14746 informat VAR19 $51. ;
14747 informat VAR20 $51. ;
14748 informat VAR21 $38. ;
14749 informat VAR22 $51. ;
14750 informat VAR23 $9. ;
14751 informat VAR24 $51. ;
14752 informat VAR25 $51. ;
14753 informat VAR26 $7. ;
14754 informat VAR27 $51. ;
14755 informat VAR28 $11. ;
14756 informat VAR29 $32. ;
14757 informat VAR30 $17. ;
14758 informat VAR31 $38. ;
14759 informat VAR32 $33. ;
14760 informat VAR33 $42. ;
14761 informat VAR34 $43. ;
14762 informat VAR35 $38. ;
14763 informat VAR36 $8. ;
14764 informat VAR37 $2. ;
14765 informat VAR38 best32. ;
14766 informat VAR39 $39. ;
14767 format EMPI_EPIC_PMRN_MRN_Type_MRN_Date $61. ;
14768 format VAR2 $20. ;
14769 format VAR3 $12. ;
14770 format VAR4 $18. ;
14771 format VAR5 $10. ;
14772 format VAR6 $33. ;
14773 format VAR7 $31. ;
14774 format VAR8 $31. ;
14775 format VAR9 $15. ;
14776 format VAR10 $11. ;
14777 format VAR11 $34. ;
14778 format VAR12 $16. ;
14779 format VAR13 $51. ;
14780 format VAR14 $51. ;
14781 format VAR15 $35. ;
14782 format VAR16 $51. ;
14783 format VAR17 $8. ;
14784 format VAR18 $9. ;
14785 format VAR19 $51. ;
14786 format VAR20 $51. ;
14787 format VAR21 $38. ;
14788 format VAR22 $51. ;
14789 format VAR23 $9. ;
14790 format VAR24 $51. ;
14791 format VAR25 $51. ;
14792 format VAR26 $7. ;
14793 format VAR27 $51. ;
14794 format VAR28 $11. ;
14795 format VAR29 $32. ;
14796 format VAR30 $17. ;
14797 format VAR31 $38. ;
14798 format VAR32 $33. ;
14799 format VAR33 $42. ;
14800 format VAR34 $43. ;
14801 format VAR35 $38. ;
14802 format VAR36 $8. ;
14803 format VAR37 $2. ;
14804 format VAR38 best12. ;
14805 format VAR39 $39. ;
14806 input
14807 EMPI_EPIC_PMRN_MRN_Type_MRN_Date $
14808 VAR2 $
14809 VAR3 $
14810 VAR4 $
14811 VAR5 $
14812 VAR6 $
14813 VAR7 $
14814 VAR8 $
14815 VAR9 $
14816 VAR10 $
14817 VAR11 $
14818 VAR12 $
14819 VAR13 $
14820 VAR14 $
14821 VAR15 $
14822 VAR16 $
14823 VAR17 $
14824 VAR18 $
14825 VAR19 $
14826 VAR20 $
14827 VAR21 $
14828 VAR22 $
14829 VAR23 $
14830 VAR24 $
14831 VAR25 $
14832 VAR26 $
14833 VAR27 $
14834 VAR28 $
14835 VAR29 $
14836 VAR30 $
14837 VAR31 $
14838 VAR32 $
14839 VAR33 $
14840 VAR34 $
14841 VAR35 $
14842 VAR36 $
14843 VAR37 $
14844 VAR38
14845 VAR39 $
14846 ;
14847 if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
14848 run;
NOTE: The infile 'C:\Users\rk64\Desktop\cardreport_diag_procedure\RK64_20240207_103523_Prc.txt' is:
Filename=C:\Users\rk64\Desktop\cardreport_diag_procedure\RK64_20240207_103523_Prc.txt,
RECFM=V,LRECL=32767,
File Size (bytes)=2464662277,
Last Modified=07Feb2024:20:10:47,
Create Time=12Oct2024:20:35:49
NOTE: Invalid data for VAR38 in line 33 338-340.
.......
NOTE: 8293654 records were read from the infile
'C:\Users\rk64\Desktop\cardreport_diag_procedure\RK64_20240207_103523_Prc.txt'.
The minimum record length was 109.
The maximum record length was 1311.
NOTE: The data set WORK.PROCEDURE has 8293654 observations and 39 variables.
NOTE: DATA statement used (Total process time):
real time 37.36 seconds
cpu time 8.15 seconds
Errors detected in submitted DATA step. Examine log.
8293654 rows created in WORK.PROCEDURE from
C:\Users\rk64\Desktop\cardreport_diag_procedure\RK64_20240207_103523_Prc.txt.
There were many "invalid data for var 38....
The comments about using Proc Import have already been made so no more going there.
With Proc Import and text files as source: Use also GUESSINGROWS=MAX; to ensure SAS scans as many source rows as possible to determine type and length of the required SAS variables. And of course define the appropriate delimiter.
Do not use PROC IMPORT for delimited files. Write the DATA step yourself, according to the description of the file. PROC IMPORT makes guesses, and fixing the mistakes from those takes more time than writing the code in the first place.
If your SAS session runs on a remote server (accessed through Enterprise Guide or SAS Studio), the file must be accessible there, either through a network share or by first uploading the data to the server. Please tell us the specifics of your SAS setup.
thanks
The txt file I am trying to read into SAS is stored in a dropbox folder in a local C drive. The SAS is also a desktop version not an online network version. I am not using SAS enterprise. I hope I answered your question correctly
OK, so let's see the description/documentation of the file.
If there is none:
Try using data step. Change the dlm = respective delimiter like pipe to "|", comma separated to "," etc. Define length of all variables reading from input file and define formats and informats. Please double check formats variables of dates.
data my_table;
infile 'C:\mydocuments\bala\sas_sample\document.csv' dlm=";" firstobs=2;
length Employe_name $16
job_location $25
employee_code $15
Address1 $50
address2 $50
city $30
state $25
Zip_code $10
Country $10
date_of_joining 8
date_of_birth 8;
informat Employe_name $16
job_location $25
employee_code $15
Address1 $50
address2 $50
city $30
state $25
Zip_code $10
Country $10
date_of_joining date9.
date_of_birth date9.;
format Employe_name $16
job_location $25
employee_code $15
Address1 $50
address2 $50
city $30
state $25
Zip_code $10
Country $10
date_of_joining date9.
date_of_birth date9.;
input Employe_name $
job_location $
employee_code $
Address1 $
address2 $
city $
state $
Zip_code $
Country $
date_of_joining
date_of_birth ;
run;
That is good advice.
But note that once the variable's type has been set, in your case by the LENGTH statement, there is no need to include those $ characters in the INPUT statement.
The $ modifier is only needed in an INPUT statement when the INPUT statement is the first place where SAS will be forced to set the type of the variable. If there is no other information SAS could use to set the type, such as an in-line informat specification in the INPUT statement, then SAS will make the new variable as numeric by default. Once the type is already set the $ in the INPUT statement does nothing. It won't even cause SAS to issue a note or warning if you include it after a variable that was already defined as numeric.
Try this test:
data test;
x=1;
input x $ ;
cards;
2
3
;
proc contents data=test;
run;
Have you tried using a DATA step instead of PROC IMPORT? It can be more worthy for large text files. Reading in smaller chunks might help too.
thanks
do you mind sharing an example how to use Data step to read in a large txt file? The txt file is extremely large with thousands of columns of variables so I do not know how to use infile statement to define the large number of variables.
thanks
@rykwong wrote:
thanks
do you mind sharing an example how to use Data step to read in a large txt file? The txt file is extremely large with thousands of columns of variables so I do not know how to use infile statement to define the large number of variables.
thanks
Thousands of columns? That means the resulting dataset will have thousands of variables. PROC IMPORT cannot guess the names of so many variables because it will only read the first 32K bytes from the header line. If you must GUESS how to name the variables use a tool like %csv2ds() that can read whole header line.
Note that having thousands of observations does not matter when deciding how to read the file. Just whether you will have enough space to store the resulting dataset.
From your SAS log it does not look like the file is delimited with spaces as your told PROC IMPORT to use.
Instead I suspect the delimiter is | character:
EMPI|EPIC_PMRN|MRN_Type|MRN|Date|Procedure_Name|Code_Type|Code|Procedure_Flag|Quantity|Provider|Clinic |Hospital|Inpatient_Outpatient|Encounter_number truncated to EMPI_EPIC_PMRN_MRN_Type_MRN_Date.
And I doubt that you have thousands of columns since the longest line is only 1,311 bytes long.
The minimum record length was 109. The maximum record length was 1311.
Try adding this statement to your PROC IMPORT step.
delimiter='|';
Note that you can write your own data step to define the variables and read them in. Just define the variables (set the type and for character variables the maximum storage they will need) then read them in. If any of them need special instructions for reading attach an informat. If any of them need special instructions for displaying attach a format. Most variables will need neither attached. The main exception are DATE, TIME and DATETIME values.
So your program might look something like:
data procedure;
infile "C:\Users\rk64\Desktop\cardreport_diag_procedure\RK64_20240207_103523_Prc.txt"
dsd dlm='|' truncover firstobs=2
;
length EMPI $20 EPIC_PMRN $20 MRN_Type $10 MRN $15 Date 8
Procedure_Name $40 Code_Type $10 Code $10
Procedure_Flag $8 Quantity 8
Provider $20 Clinic $20 Hospital $20
Inpatient_Outpatient $10 Encounter_number 8
;
input EMPI -- Encounter_number;
informat date date.;
format date date9.;
run;
The DATA statement says what dataset to create.
The INFILE statement says which file to read, and how to read it. In this case lines delimited by | starting with the second line.
The LENGTH statement defines the variable types and storage length (use 8 from numeric variables since SAS stores them as 64 bit floating point numbers).
The INPUT statement says what variables to read from the line. This one uses a positional variable list. So make sure to define the variables in the order they appear in the file.
The FORMAT and INFORMAT statement attach those to the DATE variable so SAS will know how to read the text in the file into a date value and how to display that date value in a way that humans will understand.
But you probably need to look at the file more carefully to decide if it has 12, 39 or some other number of variables. An easy way to look at the file is to use a data _null_ step with the LIST statement to dump the first few lines to the SAS log.
data _null_;
infile "C:\Users\rk64\Desktop\cardreport_diag_procedure\RK64_20240207_103523_Prc.txt"
obs=5 lrecl=1000000
;
input;
list;
run;
Then you can check if there are the same number of names on the first line as there are values on lines 2 to 5. And begin to get a sense for what types of values they have and what might be a reasonable storage length to use for each one.
these 2 steps (data null step and the data infile step) work GREAT!!
the only thing is the date field is all missing when it was supposed to be "1/30/2017" format
I guess I just have to define the date correctly
many thanks for the great help!
one more question. This is in the log and it seems the only field that was not read correctly is the date. All values in "date" were missing as a result. Thanks
99 data procedure;
100 infile "C:\Users\rk64\Desktop\cardreport_diag_procedure\RK64_20240207_103523_Prc.txt"
101 dsd dlm='|' truncover firstobs=2
102 ;
103 length EMPI $20 EPIC_PMRN $20 MRN_Type $10 MRN $15 Date 8
104 Procedure_Name $40 Code_Type $10 Code $10
105 Procedure_Flag $8 Quantity 8
106 Provider $20 Clinic $20 Hospital $20
107 Inpatient_Outpatient $10 Encounter_number 8
108 ;
109 input EMPI -- Encounter_number;
110 informat date date.;
111 format date date9.;
112 run;
NOTE: The infile "C:\Users\rk64\Desktop\cardreport_diag_procedure\RK64_20240207_103523_Prc.txt" is:
Filename=C:\Users\rk64\Desktop\cardreport_diag_procedure\RK64_20240207_103523_Prc.txt,
RECFM=V,LRECL=32767,
File Size (bytes)=2464662277,
Last Modified=07Feb2024:20:10:47,
Create Time=12Oct2024:20:35:49
NOTE: Invalid data for Date in line 2 36-45.
NOTE: Invalid data for Encounter_number in line 2 184-207.
A date stored as 1/30/2017 needs the MMDDYY10. informat.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.