Hello Everyone,
I was trying to import a .csv file which had almost 37000 records.
When i tried to read them using proc import it is giving me the below note.
NOTE: Invalid data for provider_id in line 5641 36-43.
4061 proc import out = diagnosis
4062 datafile = "/Projects/data/diagnosis.csv"
4063 dbms = csv replace;
4064 delimiter=",";
4065
4066 run;
4067 /*******************************************************************
4067 ! ***
4068 * PRODUCT: SAS
4069 * VERSION: 9.4
4070 * CREATOR: External File Interface
4071 * DATE: 05AUG21
4072 * DESC: Generated SAS Datastep Code
4073 * TEMPLATE SOURCE: (None Specified.)
4074 ********************************************************************
4074 ! ***/
4075 data DIAGNOSIS ;
4076 %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
4077 infile Projects/
4077 ! data/diagnosis.csv' delimiter = ',' MISSOVER DSD
4077 ! lrecl=32767 firstobs=2 ;
4078 informat patient_mrn best32. ;
4079 informat encounter_id best32. ;
4080 informat enc_type $2. ;
4081 informat dx_date $21. ;
4082 informat provider_id best32. ;
4083 informat provider_name $30. ;
4084 informat provider_title $4. ;
4085 informat dx_name $101. ;
4086 informat dx_code $8. ;
4087 informat dx_type best32. ;
4088 informat dx_source $2. ;
4089 informat dx_origin $2. ;
4090 informat pdx $1. ;
4091 informat raw_pdx $24. ;
4092 informat sourcesystem_cd $11. ;
4093 format patient_mrn best12. ;
4094 format encounter_id best12. ;
4095 format enc_type $2. ;
4096 format dx_date $21. ;
9 The SAS System
4097 format provider_id best12. ;
4098 format provider_name $30. ;
4099 format provider_title $4. ;
4100 format dx_name $101. ;
4101 format dx_code $8. ;
4102 format dx_type best12. ;
4103 format dx_source $2. ;
4104 format dx_origin $2. ;
4105 format pdx $1. ;
4106 format raw_pdx $24. ;
4107 format sourcesystem_cd $11. ;
4108 input
4109 patient_mrn
4110 encounter_id
4111 enc_type $
4112 dx_date $
4113 provider_id
4114 provider_name $
4115 provider_title $
4116 dx_name $
4117 dx_code $
4118 dx_type
4119 dx_source $
4120 dx_origin $
4121 pdx $
4122 raw_pdx $
4123 sourcesystem_cd $
4124 ;
4125 if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR
4125 ! detection macro variable */
4126 run;
NOTE: The infile '/Projects/
data/diagnosis.csv' is:
Filename=/Projects/
data/diagnosis.csv,
Owner Name=s_ravi,Group Name=BIOSTATS-USERS,
Access Permission=-rwxr--r--,
Last Modified=30Jul2021:07:57:16,
File Size (bytes)=605197879
NOTE: Invalid data for provider_id in line 5641 36-43.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7
5641 00365298,,NI,"2010-03-04 00:00:00",TC110397,"STEPIEN, KIMBERLY E",MD,"
71 360.21 Progressive high (degenerative) myopia",360.21,09,FI,BI,P,"IDX
141 DX_1",DX|BI|IDX 155
patient_mrn=365298 encounter_id= enc_type=NI dx_date=2010-03-04 00:00:00
provider_id= provider_name=STEPIEN, KIMBERLY E provider_title=MD
dx_name=360.21 Progressive high (degenerative) myopia dx_code=360.21 dx_type=9
dx_source=FI dx_origin=BI pdx=P raw_pdx=IDX DX_1 sourcesystem_cd=DX|BI|IDX
_ERROR_=1 _N_=5640
NOTE: Invalid data for provider_id in line 6377 36-43.
6377 00011177,,NI,"2010-04-27 00:00:00",TC100135,"SHERMAN, TRACY L",MD,"290
71 .40 Vascular dementia, uncomplicated",290.40,09,FI,BI,P,"IDX DX_1",DX|
141 BI|IDX 146
patient_mrn=11177 encounter_id= enc_type=NI dx_date=2010-04-27 00:00:00
provider_id= provider_name=SHERMAN, TRACY L provider_title=MD
dx_name=290.40 Vascular dementia, uncomplicated dx_code=290.40 dx_type=9
dx_source=FI dx_origin=BI pdx=P raw_pdx=IDX DX_1 sourcesystem_cd=DX|BI|IDX
_ERROR_=1 _N_=6376
and the log goes on with those above notes for so many subjects.
The provider_id has values like below:
Patient_mrn Provider_id
00365298 TC110397,
00011177 TC100135,
00365298 TC110397
while other provider_id for other subjects as
provider_id
41357
15089
7682748
247974928
so the provider_id has both char and numeric values. How can I read that variable into my data?
Thanks
Why are you using PROC IMPORT to read a delimited text file? One with only 14 variables.
Just write your own data step to read the file.
Then you won't define the variables, like PATIENT_MRN and PROVIDER_ID, using the wrong type. MRN is an identifier, so make it a character variable. The average medical record number has no meaning. Plus the values you are showing have leading zeros that probably want to preserve.
Something like this.
data DIAGNOSIS ;
infile "/Projects/data/diagnosis.csv" dsd truncover firstobs=2;
length
patient_mrn $10
encounter_id $20
enc_type $2.
dx_date 8
provider_id $20
provider_name $30
provider_title $10
dx_name $200
dx_code $8
dx_type $2
dx_source $2
dx_origin $2
pdx $2
raw_pdx $30
sourcesystem_cd $20
;
informat dx_date anydtdte.;
format dx_date yymmdd10.;
input patient_mrn -- sourcesystem_cd ;
run;
You should define the lengths of the character variables based on the documentation of the maximum length they need. But if you don't have such documentation then you could analysis the whole file yourself to figure out the longest string in each field.
data _null_ ;
infile "/Projects/data/diagnosis.csv" dsd truncover firstobs=2 end=eof;
array lengths[14];
do col=1 to 14;
input string :$32767. @;
langths[col]=max(lengths[col],lengthn(string));
end;
if eof then do col=1 to 14 ;
put col= lengths[col] ;
end;
run;
PROVIDER_ID is a character field but you're trying to read it in as character.
5641 00365298,,NI,"2010-03-04 00:00:00",TC110397,"STEPIEN, KIMBERLY E",MD," 71 360.21 Progressive high (degenerative) myopia",360.21,09,FI,BI,P,"IDX 141 DX_1",DX|BI|IDX 155
4082 informat provider_id best32. ;
So you should modify the code to change PROVIDER_ID to be character.
You can add GUESSINGROWS=MAX; to your PROC IMPORT code and it will scan the whole file before it guesses at types or you can copy the data step and modify it manually.
@RAVI2000 wrote:
Hello Everyone,
I was trying to import a .csv file which had almost 37000 records.
When i tried to read them using proc import it is giving me the below note.
NOTE: Invalid data for provider_id in line 5641 36-43.
4061 proc import out = diagnosis 4062 datafile = "/Projects/data/diagnosis.csv" 4063 dbms = csv replace; 4064 delimiter=","; 4065 4066 run; 4067 /******************************************************************* 4067 ! *** 4068 * PRODUCT: SAS 4069 * VERSION: 9.4 4070 * CREATOR: External File Interface 4071 * DATE: 05AUG21 4072 * DESC: Generated SAS Datastep Code 4073 * TEMPLATE SOURCE: (None Specified.) 4074 ******************************************************************** 4074 ! ***/ 4075 data DIAGNOSIS ; 4076 %let _EFIERR_ = 0; /* set the ERROR detection macro variable */ 4077 infile Projects/ 4077 ! data/diagnosis.csv' delimiter = ',' MISSOVER DSD 4077 ! lrecl=32767 firstobs=2 ; 4078 informat patient_mrn best32. ; 4079 informat encounter_id best32. ; 4080 informat enc_type $2. ; 4081 informat dx_date $21. ; 4082 informat provider_id best32. ; 4083 informat provider_name $30. ; 4084 informat provider_title $4. ; 4085 informat dx_name $101. ; 4086 informat dx_code $8. ; 4087 informat dx_type best32. ; 4088 informat dx_source $2. ; 4089 informat dx_origin $2. ; 4090 informat pdx $1. ; 4091 informat raw_pdx $24. ; 4092 informat sourcesystem_cd $11. ; 4093 format patient_mrn best12. ; 4094 format encounter_id best12. ; 4095 format enc_type $2. ; 4096 format dx_date $21. ; 9 The SAS System 4097 format provider_id best12. ; 4098 format provider_name $30. ; 4099 format provider_title $4. ; 4100 format dx_name $101. ; 4101 format dx_code $8. ; 4102 format dx_type best12. ; 4103 format dx_source $2. ; 4104 format dx_origin $2. ; 4105 format pdx $1. ; 4106 format raw_pdx $24. ; 4107 format sourcesystem_cd $11. ; 4108 input 4109 patient_mrn 4110 encounter_id 4111 enc_type $ 4112 dx_date $ 4113 provider_id 4114 provider_name $ 4115 provider_title $ 4116 dx_name $ 4117 dx_code $ 4118 dx_type 4119 dx_source $ 4120 dx_origin $ 4121 pdx $ 4122 raw_pdx $ 4123 sourcesystem_cd $ 4124 ; 4125 if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR 4125 ! detection macro variable */ 4126 run; NOTE: The infile '/Projects/ data/diagnosis.csv' is: Filename=/Projects/ data/diagnosis.csv, Owner Name=s_ravi,Group Name=BIOSTATS-USERS, Access Permission=-rwxr--r--, Last Modified=30Jul2021:07:57:16, File Size (bytes)=605197879 NOTE: Invalid data for provider_id in line 5641 36-43. RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7 5641 00365298,,NI,"2010-03-04 00:00:00",TC110397,"STEPIEN, KIMBERLY E",MD," 71 360.21 Progressive high (degenerative) myopia",360.21,09,FI,BI,P,"IDX 141 DX_1",DX|BI|IDX 155 patient_mrn=365298 encounter_id= enc_type=NI dx_date=2010-03-04 00:00:00 provider_id= provider_name=STEPIEN, KIMBERLY E provider_title=MD dx_name=360.21 Progressive high (degenerative) myopia dx_code=360.21 dx_type=9 dx_source=FI dx_origin=BI pdx=P raw_pdx=IDX DX_1 sourcesystem_cd=DX|BI|IDX _ERROR_=1 _N_=5640 NOTE: Invalid data for provider_id in line 6377 36-43. 6377 00011177,,NI,"2010-04-27 00:00:00",TC100135,"SHERMAN, TRACY L",MD,"290 71 .40 Vascular dementia, uncomplicated",290.40,09,FI,BI,P,"IDX DX_1",DX| 141 BI|IDX 146 patient_mrn=11177 encounter_id= enc_type=NI dx_date=2010-04-27 00:00:00 provider_id= provider_name=SHERMAN, TRACY L provider_title=MD dx_name=290.40 Vascular dementia, uncomplicated dx_code=290.40 dx_type=9 dx_source=FI dx_origin=BI pdx=P raw_pdx=IDX DX_1 sourcesystem_cd=DX|BI|IDX _ERROR_=1 _N_=6376
and the log goes on with those above notes for so many subjects.
The provider_id has values like below:
Patient_mrn Provider_id
00365298 TC110397,
00011177 TC100135,
00365298 TC110397
while other provider_id for other subjects as
provider_id
41357
15089
7682748
247974928
so the provider_id has both char and numeric values. How can I read that variable into my data?
Thanks
@RAVI2000 wrote:
And once that error has occurred I have to close that program and create a new program.
?????
Did you save the program before running it?
I'm also a tad confused about this:
4061 proc import out = sj.diagnosis 4062 datafile = "/Projects/ 4062 ! data/diagnosis.csv" 4063 dbms = csv replace; 4064 delimiter=","; 4065 GUESSINGROWS = MAX; 4066 run;
That ! with two line 4062 indicates a continuation because the submitted line is longer than the log likes and normally means lines in excess of 80 or so characters.
This sounds like you are runninng a batch job and the previous program is still running.
Are you in Windows or UNIX or ?
Often you can find the Process ID in the log file. If you know the Process ID, then you can "kill" an existing process.
Jim
Why are you using PROC IMPORT to read a delimited text file? One with only 14 variables.
Just write your own data step to read the file.
Then you won't define the variables, like PATIENT_MRN and PROVIDER_ID, using the wrong type. MRN is an identifier, so make it a character variable. The average medical record number has no meaning. Plus the values you are showing have leading zeros that probably want to preserve.
Something like this.
data DIAGNOSIS ;
infile "/Projects/data/diagnosis.csv" dsd truncover firstobs=2;
length
patient_mrn $10
encounter_id $20
enc_type $2.
dx_date 8
provider_id $20
provider_name $30
provider_title $10
dx_name $200
dx_code $8
dx_type $2
dx_source $2
dx_origin $2
pdx $2
raw_pdx $30
sourcesystem_cd $20
;
informat dx_date anydtdte.;
format dx_date yymmdd10.;
input patient_mrn -- sourcesystem_cd ;
run;
You should define the lengths of the character variables based on the documentation of the maximum length they need. But if you don't have such documentation then you could analysis the whole file yourself to figure out the longest string in each field.
data _null_ ;
infile "/Projects/data/diagnosis.csv" dsd truncover firstobs=2 end=eof;
array lengths[14];
do col=1 to 14;
input string :$32767. @;
langths[col]=max(lengths[col],lengthn(string));
end;
if eof then do col=1 to 14 ;
put col= lengths[col] ;
end;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.