BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
RAVI2000
Lapis Lazuli | Level 10

Hello Everyone,

 

I was trying to import a .csv file which had almost 37000 records. 

When i tried to read them using proc import it is giving me the below note. 

NOTE: Invalid data for provider_id in line 5641 36-43.

 

4061       proc import out = diagnosis
4062       	datafile = "/Projects/data/diagnosis.csv"
4063       	dbms = csv replace;
4064           delimiter=",";
4065       
4066       run;

4067        /*******************************************************************
4067     ! ***
4068        *   PRODUCT:   SAS
4069        *   VERSION:   9.4
4070        *   CREATOR:   External File Interface
4071        *   DATE:      05AUG21
4072        *   DESC:      Generated SAS Datastep Code
4073        *   TEMPLATE SOURCE:  (None Specified.)
4074        ********************************************************************
4074     ! ***/
4075           data DIAGNOSIS    ;
4076           %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
4077           infile Projects/
4077     ! data/diagnosis.csv' delimiter = ',' MISSOVER DSD
4077     ! lrecl=32767 firstobs=2 ;
4078              informat patient_mrn best32. ;
4079              informat encounter_id best32. ;
4080              informat enc_type $2. ;
4081              informat dx_date $21. ;
4082              informat provider_id best32. ;
4083              informat provider_name $30. ;
4084              informat provider_title $4. ;
4085              informat dx_name $101. ;
4086              informat dx_code $8. ;
4087              informat dx_type best32. ;
4088              informat dx_source $2. ;
4089              informat dx_origin $2. ;
4090              informat pdx $1. ;
4091              informat raw_pdx $24. ;
4092              informat sourcesystem_cd $11. ;
4093              format patient_mrn best12. ;
4094              format encounter_id best12. ;
4095              format enc_type $2. ;
4096              format dx_date $21. ;
9 The SAS System

4097              format provider_id best12. ;
4098              format provider_name $30. ;
4099              format provider_title $4. ;
4100              format dx_name $101. ;
4101              format dx_code $8. ;
4102              format dx_type best12. ;
4103              format dx_source $2. ;
4104              format dx_origin $2. ;
4105              format pdx $1. ;
4106              format raw_pdx $24. ;
4107              format sourcesystem_cd $11. ;
4108           input
4109                       patient_mrn
4110                       encounter_id
4111                       enc_type  $
4112                       dx_date  $
4113                       provider_id
4114                       provider_name  $
4115                       provider_title  $
4116                       dx_name  $
4117                       dx_code  $
4118                       dx_type
4119                       dx_source  $
4120                       dx_origin  $
4121                       pdx  $
4122                       raw_pdx  $
4123                       sourcesystem_cd  $
4124           ;
4125           if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR
4125     ! detection macro variable */
4126           run;

NOTE: The infile '/Projects/
      data/diagnosis.csv' is:
      
      Filename=/Projects/ 
     data/diagnosis.csv,
      Owner Name=s_ravi,Group Name=BIOSTATS-USERS,
      Access Permission=-rwxr--r--,
      Last Modified=30Jul2021:07:57:16,
      File Size (bytes)=605197879

NOTE: Invalid data for provider_id in line 5641 36-43.
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7
5641      00365298,,NI,"2010-03-04 00:00:00",TC110397,"STEPIEN, KIMBERLY E",MD,"
      71  360.21 Progressive high (degenerative) myopia",360.21,09,FI,BI,P,"IDX 
     141  DX_1",DX|BI|IDX 155
patient_mrn=365298 encounter_id= enc_type=NI dx_date=2010-03-04 00:00:00
provider_id= provider_name=STEPIEN, KIMBERLY E provider_title=MD
dx_name=360.21 Progressive high (degenerative) myopia dx_code=360.21 dx_type=9
dx_source=FI dx_origin=BI pdx=P raw_pdx=IDX DX_1 sourcesystem_cd=DX|BI|IDX
_ERROR_=1 _N_=5640
NOTE: Invalid data for provider_id in line 6377 36-43.
6377      00011177,,NI,"2010-04-27 00:00:00",TC100135,"SHERMAN, TRACY L",MD,"290
      71  .40 Vascular dementia, uncomplicated",290.40,09,FI,BI,P,"IDX DX_1",DX|
     141  BI|IDX 146
patient_mrn=11177 encounter_id= enc_type=NI dx_date=2010-04-27 00:00:00
provider_id= provider_name=SHERMAN, TRACY L provider_title=MD
dx_name=290.40 Vascular dementia, uncomplicated dx_code=290.40 dx_type=9
dx_source=FI dx_origin=BI pdx=P raw_pdx=IDX DX_1 sourcesystem_cd=DX|BI|IDX
_ERROR_=1 _N_=6376

and the log goes on with those above notes for so many subjects.

 

The provider_id has values like below:

Patient_mrn                     Provider_id

00365298                      TC110397,

00011177                        TC100135,

00365298                        TC110397

 

while other provider_id for other subjects as 

 provider_id

41357

15089

7682748

247974928

 

so the provider_id has both char and numeric values. How can I read that variable into my data?

 

Thanks

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Why are you using PROC IMPORT to read a delimited text file?  One with only 14 variables.

Just write your own data step to read the file.

Then you won't define the variables, like PATIENT_MRN and PROVIDER_ID, using the wrong type.  MRN is an identifier, so make it a character variable.  The average medical record number has no meaning.  Plus the values you are showing have leading zeros that probably want to preserve.

Something like this.  

data DIAGNOSIS ;
  infile "/Projects/data/diagnosis.csv" dsd truncover firstobs=2;
  length          
         patient_mrn $10 
         encounter_id $20
         enc_type $2. 
         dx_date 8
         provider_id $20 
         provider_name $30
         provider_title $10
         dx_name $200
         dx_code $8
         dx_type $2
         dx_source $2
         dx_origin $2
         pdx $2
         raw_pdx $30
         sourcesystem_cd $20
   ;
   informat dx_date anydtdte.;
   format dx_date yymmdd10.;
   input patient_mrn -- sourcesystem_cd ;
 run;

You should define the lengths of the character variables based on the documentation of the maximum length they need.  But if you don't have such documentation then you could analysis the whole file yourself to figure out the longest string in each field.

data _null_ ;
  infile "/Projects/data/diagnosis.csv" dsd truncover firstobs=2 end=eof;
  array lengths[14];
  do col=1 to 14;
    input string :$32767. @;
    langths[col]=max(lengths[col],lengthn(string));
  end;
  if eof then do col=1 to 14 ;
    put col= lengths[col] ;
  end;
run;

View solution in original post

7 REPLIES 7
Reeza
Super User

PROVIDER_ID is a character field but you're trying to read it in as character.

 

5641      00365298,,NI,"2010-03-04 00:00:00",TC110397,"STEPIEN, KIMBERLY E",MD,"
      71  360.21 Progressive high (degenerative) myopia",360.21,09,FI,BI,P,"IDX 
     141  DX_1",DX|BI|IDX 155
4082              informat provider_id best32. ;

So you should modify the code to change PROVIDER_ID to be character. 

You can add GUESSINGROWS=MAX; to your PROC IMPORT code and it will scan the whole file before it guesses at types or you can copy the data step and modify it manually. 

 

Spoiler

@RAVI2000 wrote:

Hello Everyone,

 

I was trying to import a .csv file which had almost 37000 records. 

When i tried to read them using proc import it is giving me the below note. 

NOTE: Invalid data for provider_id in line 5641 36-43.

 

4061       proc import out = diagnosis
4062       	datafile = "/Projects/data/diagnosis.csv"
4063       	dbms = csv replace;
4064           delimiter=",";
4065       
4066       run;

4067        /*******************************************************************
4067     ! ***
4068        *   PRODUCT:   SAS
4069        *   VERSION:   9.4
4070        *   CREATOR:   External File Interface
4071        *   DATE:      05AUG21
4072        *   DESC:      Generated SAS Datastep Code
4073        *   TEMPLATE SOURCE:  (None Specified.)
4074        ********************************************************************
4074     ! ***/
4075           data DIAGNOSIS    ;
4076           %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
4077           infile Projects/
4077     ! data/diagnosis.csv' delimiter = ',' MISSOVER DSD
4077     ! lrecl=32767 firstobs=2 ;
4078              informat patient_mrn best32. ;
4079              informat encounter_id best32. ;
4080              informat enc_type $2. ;
4081              informat dx_date $21. ;
4082              informat provider_id best32. ;
4083              informat provider_name $30. ;
4084              informat provider_title $4. ;
4085              informat dx_name $101. ;
4086              informat dx_code $8. ;
4087              informat dx_type best32. ;
4088              informat dx_source $2. ;
4089              informat dx_origin $2. ;
4090              informat pdx $1. ;
4091              informat raw_pdx $24. ;
4092              informat sourcesystem_cd $11. ;
4093              format patient_mrn best12. ;
4094              format encounter_id best12. ;
4095              format enc_type $2. ;
4096              format dx_date $21. ;
9 The SAS System

4097              format provider_id best12. ;
4098              format provider_name $30. ;
4099              format provider_title $4. ;
4100              format dx_name $101. ;
4101              format dx_code $8. ;
4102              format dx_type best12. ;
4103              format dx_source $2. ;
4104              format dx_origin $2. ;
4105              format pdx $1. ;
4106              format raw_pdx $24. ;
4107              format sourcesystem_cd $11. ;
4108           input
4109                       patient_mrn
4110                       encounter_id
4111                       enc_type  $
4112                       dx_date  $
4113                       provider_id
4114                       provider_name  $
4115                       provider_title  $
4116                       dx_name  $
4117                       dx_code  $
4118                       dx_type
4119                       dx_source  $
4120                       dx_origin  $
4121                       pdx  $
4122                       raw_pdx  $
4123                       sourcesystem_cd  $
4124           ;
4125           if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR
4125     ! detection macro variable */
4126           run;

NOTE: The infile '/Projects/
      data/diagnosis.csv' is:
      
      Filename=/Projects/ 
     data/diagnosis.csv,
      Owner Name=s_ravi,Group Name=BIOSTATS-USERS,
      Access Permission=-rwxr--r--,
      Last Modified=30Jul2021:07:57:16,
      File Size (bytes)=605197879

NOTE: Invalid data for provider_id in line 5641 36-43.
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7
5641      00365298,,NI,"2010-03-04 00:00:00",TC110397,"STEPIEN, KIMBERLY E",MD,"
      71  360.21 Progressive high (degenerative) myopia",360.21,09,FI,BI,P,"IDX 
     141  DX_1",DX|BI|IDX 155
patient_mrn=365298 encounter_id= enc_type=NI dx_date=2010-03-04 00:00:00
provider_id= provider_name=STEPIEN, KIMBERLY E provider_title=MD
dx_name=360.21 Progressive high (degenerative) myopia dx_code=360.21 dx_type=9
dx_source=FI dx_origin=BI pdx=P raw_pdx=IDX DX_1 sourcesystem_cd=DX|BI|IDX
_ERROR_=1 _N_=5640
NOTE: Invalid data for provider_id in line 6377 36-43.
6377      00011177,,NI,"2010-04-27 00:00:00",TC100135,"SHERMAN, TRACY L",MD,"290
      71  .40 Vascular dementia, uncomplicated",290.40,09,FI,BI,P,"IDX DX_1",DX|
     141  BI|IDX 146
patient_mrn=11177 encounter_id= enc_type=NI dx_date=2010-04-27 00:00:00
provider_id= provider_name=SHERMAN, TRACY L provider_title=MD
dx_name=290.40 Vascular dementia, uncomplicated dx_code=290.40 dx_type=9
dx_source=FI dx_origin=BI pdx=P raw_pdx=IDX DX_1 sourcesystem_cd=DX|BI|IDX
_ERROR_=1 _N_=6376

and the log goes on with those above notes for so many subjects.

 

The provider_id has values like below:

Patient_mrn                     Provider_id

00365298                      TC110397,

00011177                        TC100135,

00365298                        TC110397

 

while other provider_id for other subjects as 

 provider_id

41357

15089

7682748

247974928

 

so the provider_id has both char and numeric values. How can I read that variable into my data?

 

Thanks


RAVI2000
Lapis Lazuli | Level 10
When I am using GUESSINGROWS = MAX;
It's giving me the below error.

4061 proc import out = sj.diagnosis
4062 datafile = "/Projects/
4062 ! data/diagnosis.csv"
4063 dbms = csv replace;
4064 delimiter=",";
4065 GUESSINGROWS = MAX;
4066 run;
ERROR: according to datain.log, 'sas datain' already in progress
job abend Thu Aug 5 15:25:46 CDT 2021
if this is incorrect, then 'rm datain.log' and re-submit
RAVI2000
Lapis Lazuli | Level 10
And once that error has occurred I have to close that program and create a new program.
ballardw
Super User

@RAVI2000 wrote:
And once that error has occurred I have to close that program and create a new program.

?????

Did you save the program before running it?

 

I'm also a tad confused about this:

4061 proc import out = sj.diagnosis
4062 datafile = "/Projects/
4062 ! data/diagnosis.csv"
4063 dbms = csv replace;
4064 delimiter=",";
4065 GUESSINGROWS = MAX;
4066 run;

That ! with two line 4062 indicates a continuation because the submitted line is longer than the log likes and normally means lines in excess of 80 or so characters.

This sounds like you are runninng a batch job and the previous program is still running.

RAVI2000
Lapis Lazuli | Level 10
Yes, I did save the program. How can i stop the previous program from running?
jimbarbour
Meteorite | Level 14

Are you in Windows or UNIX or ?

 

Often you can find the Process ID in the log file.  If you know the Process ID, then you can "kill" an existing process.

 

Jim

Tom
Super User Tom
Super User

Why are you using PROC IMPORT to read a delimited text file?  One with only 14 variables.

Just write your own data step to read the file.

Then you won't define the variables, like PATIENT_MRN and PROVIDER_ID, using the wrong type.  MRN is an identifier, so make it a character variable.  The average medical record number has no meaning.  Plus the values you are showing have leading zeros that probably want to preserve.

Something like this.  

data DIAGNOSIS ;
  infile "/Projects/data/diagnosis.csv" dsd truncover firstobs=2;
  length          
         patient_mrn $10 
         encounter_id $20
         enc_type $2. 
         dx_date 8
         provider_id $20 
         provider_name $30
         provider_title $10
         dx_name $200
         dx_code $8
         dx_type $2
         dx_source $2
         dx_origin $2
         pdx $2
         raw_pdx $30
         sourcesystem_cd $20
   ;
   informat dx_date anydtdte.;
   format dx_date yymmdd10.;
   input patient_mrn -- sourcesystem_cd ;
 run;

You should define the lengths of the character variables based on the documentation of the maximum length they need.  But if you don't have such documentation then you could analysis the whole file yourself to figure out the longest string in each field.

data _null_ ;
  infile "/Projects/data/diagnosis.csv" dsd truncover firstobs=2 end=eof;
  array lengths[14];
  do col=1 to 14;
    input string :$32767. @;
    langths[col]=max(lengths[col],lengthn(string));
  end;
  if eof then do col=1 to 14 ;
    put col= lengths[col] ;
  end;
run;

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1578 views
  • 5 likes
  • 5 in conversation