BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
wlierman
Lapis Lazuli | Level 10

I am importing two Excel csv files into a SAS program.  The first import went fine with all the records importing properly.  The files are of the  same number of records - they are two views of tracer records from the MS Arias software program.  The second import has errors.  This is the proc import code.

Proc import datafile = "I:\Health Analytics\AMB_INFORMATICS\Arias_All_Contacts_Excel_Upload\System_All Contacts 9-7-2020 9-05-22 AM.csv"
            out = ARIAS.System_Contacts_per_tracer
            dbms = csv	
			replace;
			getnames=yes;
			run;

And the log

 

NOTE: 74467 records were read from the infile 'I:\Health
Analytics\AMB_INFORMATICS\Arias_All_Contacts_Excel_Upload\System_All Contacts 9-7-2020 9-05-22
AM.csv'.
The minimum record length was 1.
The maximum record length was 375.
NOTE: The data set ARIAS.SYSTEM_CONTACTS_PER_TRACER has 74467 observations and 20 variables.
NOTE: Compressing data set ARIAS.SYSTEM_CONTACTS_PER_TRACER decreased size by 48.25 percent.
Compressed is 222 pages; un-compressed would require 429 pages.
NOTE: DATA statement used (Total process time):
real time 0.29 seconds
cpu time 0.29 seconds


Errors detected in submitted DATA step. Examine log.

The problem is there are only 41K obs in the file.  This file is the same size of the first file the same which I imported. So I don't know where the 74,467 came from. When I examine the csv file after exporting to Excel from Arias, this file has the same records as the first (successful) csv that I imported (a little over 41K).

 

Further examining the log I see this error

NOTE: The infile 'I:\Health Analytics\AMB_INFORMATICS\Arias_All_Contacts_Excel_Upload\System_All
      Contacts 9-7-2020 9-05-22 AM.csv' is:

      Filename=I:\Health Analytics\AMB_INFORMATICS\Arias_All_Contacts_Excel_Upload\System_All
      Contacts 9-7-2020 9-05-22 AM.csv,
      RECFM=V,LRECL=32767,File Size (bytes)=13618067,
      Last Modified=07Sep2020:09:07:35,
      Create Time=07Sep2020:09:07:35

NOTE: Invalid data for DOB in line 32 49-75.
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9--
32        Beaverton, OR 97008",Beaverton,Washington,Phone,sarkodiefrancisca@gmail.com,971-706-1617,8/1
      93  9/2020,9/2/2020,Washington CHD,Jose Ramirez,8/21/2020 9:30, 151
_Do_Not_Modify__Contact=Beaverton _Do_Not_Modify__Row_Checksum=OR 97008" _Do_Not_Modify__Modified_On=.
Last_Name=Washington First_Name=Phone DOB=. Monitoring_Status=971-706-1617 Status=8/19/202
Address_1=9/2/2020 City=Washingto County=Jose Ramir Preferred_Method_of_Contact=8/21/ Email=
Mobile_Phone=  Date_of_Last_Exposure=. End_of_Quarantine_Period=. Jurisdiction=  Owner=  Created_On=.
Language=  _ERROR_=1 _N_=31
NOTE: Invalid data for DOB in line 458 51-70.

I just entered the first line.  It reads 'problem with dob in line 32...'

Checking the csv file doesn't show any miscoded values for dob at line 32.  I have attached a small sample of the csv file. The dob field is sparsely populated. The shaded record is line 32 mentioned in the SAS log and the dob field is empty.

Finally, I tried to read the data using the infile statement

Data ARIAS.System_contacts_per_tracer;
   Infile 'I:\Health Analytics\AMB_INFORMATICS\ARIAS_All_Contacts_Excel_Upload\System_All Contacts 9-7-2020 9-05-22 AM.csv' dsd truncover firstobs=2; 
   Input
      _Do_Not_Modify__Contact: $36.
	  _Do_Not_Modify__Row_Checksum: $88.
	  _Do_Not_Modify__Modified_On: anydtdtm40.
      Last_Name:$12.
	  First_Name:$24.
	  DOB: mmddyy10.
	  Monitoring_Status:$30.
	  Status: $8.
	  Address_1:$17.
	  City:$9.
	  County:$10.
	  Preferred_Method_of_Contact:$5.
	  Email: $24.
	  Mobile_Phone: $14.
	  Date_of_Last_Exposure: mmddyy10.
	  End_of_Quarantine_Period: mmddyy10.
	  Jurisdiction:$26.
	  Owner:$16.
      Created_On: anydtdtm40.
	  Language:$11.
	  ;
Run;

But I received an error

ERROR: File is in use, I:\Health Analytics\AMB_INFORMATICS\ARIAS_All_Contacts_Excel_Upload\System_All
       Contacts 9-7-2020 9-05-22 AM.csv.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set ARIAS.SYSTEM_CONTACTS_PER_TRACER may be incomplete.  When this step was stopped
         there were 0 observations and 20 variables.
WARNING: Data set ARIAS.SYSTEM_CONTACTS_PER_TRACER was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
      real time           0.05 seconds
      cpu time            0.01 seconds

So that is the error that I am struggling with.  This data is imported every morning from Opera software so I thought maybe there was a glitch at that point.  But the first view imported without errors into SAS.  So I am not sure.

 

Any help, suggestions, or directions to next steps would be appreciated.

 

wlierman

 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

Unless you use the options to create ugly name literals the import procedure replaces all non-letter, non_digit and non_underscore characters with _ in the names, which in your case means mostly the spaces and parentheses. You can force use of variable names like "Name with spaces"n but that gets old pretty quickly.

 

Personally I would look at the data step code generated by proc import and modify it to something nicer. Your variables could be just plain Contact, Checksum and Modified_on. Use a label if needed to contain the "do not modify" information or make prettier column labels for reporting.

 

What is lost from transcoding depends on the content and what results. You can manually check the given lines to see what was in the CSV and what the SAS data set has. Since the data is apparently address related I would suspect national language characters like ë or É or similar that are not in your current national language setting.

View solution in original post

7 REPLIES 7
wlierman
Lapis Lazuli | Level 10

Thank you for the analysis.

I ran the proc import again keeping the file as an xlsx. The row count was a little higher than my "good" import. But the log shows a lot of adjustments that SAS made

Proc import datafile = "I:\Health Analytics\AMB_INFORMATICS\Arias_All_Contacts_Excel_Upload\Copy
13 ! of All Sys Contacts 9-7-2020 11-41-02 AM.xlsx"
14               out = ARIAS.System_Contacts_per_tracer
15               dbms = xlsx
16               replace;
17               getnames=yes;
18               run;

NOTE:    Variable Name Change.  (Do Not Modify) Contact -> _Do_Not_Modify__Contact
NOTE:    Variable Name Change.  (Do Not Modify) Row Checksum -> _Do_Not_Modify__Row_Checksum
NOTE:    Variable Name Change.  (Do Not Modify) Modified On -> _Do_Not_Modify__Modified_On
NOTE:    Variable Name Change.  Last Name -> Last_Name
NOTE:    Variable Name Change.  First Name -> First_Name
NOTE:    Variable Name Change.  Monitoring Status -> Monitoring_Status
NOTE:    Variable Name Change.  Address 1 -> Address_1
NOTE:    Variable Name Change.  Preferred Method of Contact -> Preferred_Method_of_Contact
NOTE:    Variable Name Change.  Mobile Phone -> Mobile_Phone
NOTE:    Variable Name Change.  Date of Last Exposure -> Date_of_Last_Exposure
NOTE:    Variable Name Change.  End of Quarantine Period -> End_of_Quarantine_Period
NOTE:    Variable Name Change.  Created On -> Created_On
NOTE: One or more variables were converted because the data type is not supported by the V9 engine.
      For more details, run with options MSGLEVEL=I.
WARNING: Some character data was lost during transcoding in column: Address 1 at obs 22763.
WARNING: Some character data was lost during transcoding in column: Address 1 at obs 22774.
WARNING: Some character data was lost during transcoding in column: Address 1 at obs 22799.
WARNING: Some character data was lost during transcoding in column: Address 1 at obs 22899.
NOTE: The import data set has 41280 observations and 20 variables.
NOTE: Compressing data set ARIAS.SYSTEM_CONTACTS_PER_TRACER decreased size by 51.03 percent.
      Compressed is 213 pages; un-compressed would require 435 pages.
NOTE: ARIAS.SYSTEM_CONTACTS_PER_TRACER data set was successfully created.
NOTE: PROCEDURE IMPORT used (Total process time):
      real time           10.48 seconds
      cpu time            9.82 seconds

 SAS changed the variable names.  What is lost during transcoding?

 

I think that there could be some extraneous characters in this data when it was exported from OPERA into Arias.

Thank you for your help. Oh yes, how does a linefeed get embedded into data?

 

Thanks.

 

wlierman

Shmuel
Garnet | Level 18

Running proc import generates a code.

Copy that code from the log to run it directly, without invoking proc import for same format files.

1) is the generated code of first run is the same as of the second file?

2) In the note where the "Invalid data for DOB ..." exist:

NOTE: Invalid data for DOB in line 32 49-75.
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9--
32        Beaverton, OR 97008",Beaverton,Washington,Phone,sarkodiefrancisca@gmail.com,971-706-1617,8/1
      93  9/2020,9/2/2020,Washington CHD,Jose Ramirez,8/21/2020 9:30, 151

there are quotes (position 20 - OR 97008") and later (

Address_1=9/2/2020 

which is obviously not correct - I think this quote (") generated a shifting of reading that specific line or maybe it started one or more lines before that line.

Alternatively, the file is a CSV - maybe one of the variables contains a comma which caused that shifting.

 

I would suggest, if possible, to generate the files with different delimiter like TAB ("09"x) or '|' and adapt the proc import options to:

      dbms=delimiter ...;

      dlm = '09'x;   /* the actually used delimiter */

 

 

wlierman
Lapis Lazuli | Level 10
Thank you very much. I will follow your steps. I will let you know the outcome.

Thank you again.

wlierman
wlierman
Lapis Lazuli | Level 10

Thank you for the analysis.

I ran the proc import again keeping the file as an xlsx. The row count was a little higher than my "good" import. But the log shows a lot of adjustments that SAS made

Proc import datafile = "I:\Health Analytics\AMB_INFORMATICS\Arias_All_Contacts_Excel_Upload\Copy
13 ! of All Sys Contacts 9-7-2020 11-41-02 AM.xlsx"
14               out = ARIAS.System_Contacts_per_tracer
15               dbms = xlsx
16               replace;
17               getnames=yes;
18               run;

NOTE:    Variable Name Change.  (Do Not Modify) Contact -> _Do_Not_Modify__Contact
NOTE:    Variable Name Change.  (Do Not Modify) Row Checksum -> _Do_Not_Modify__Row_Checksum
NOTE:    Variable Name Change.  (Do Not Modify) Modified On -> _Do_Not_Modify__Modified_On
NOTE:    Variable Name Change.  Last Name -> Last_Name
NOTE:    Variable Name Change.  First Name -> First_Name
NOTE:    Variable Name Change.  Monitoring Status -> Monitoring_Status
NOTE:    Variable Name Change.  Address 1 -> Address_1
NOTE:    Variable Name Change.  Preferred Method of Contact -> Preferred_Method_of_Contact
NOTE:    Variable Name Change.  Mobile Phone -> Mobile_Phone
NOTE:    Variable Name Change.  Date of Last Exposure -> Date_of_Last_Exposure
NOTE:    Variable Name Change.  End of Quarantine Period -> End_of_Quarantine_Period
NOTE:    Variable Name Change.  Created On -> Created_On
NOTE: One or more variables were converted because the data type is not supported by the V9 engine.
      For more details, run with options MSGLEVEL=I.
WARNING: Some character data was lost during transcoding in column: Address 1 at obs 22763.
WARNING: Some character data was lost during transcoding in column: Address 1 at obs 22774.
WARNING: Some character data was lost during transcoding in column: Address 1 at obs 22799.
WARNING: Some character data was lost during transcoding in column: Address 1 at obs 22899.
NOTE: The import data set has 41280 observations and 20 variables.
NOTE: Compressing data set ARIAS.SYSTEM_CONTACTS_PER_TRACER decreased size by 51.03 percent.
      Compressed is 213 pages; un-compressed would require 435 pages.
NOTE: ARIAS.SYSTEM_CONTACTS_PER_TRACER data set was successfully created.
NOTE: PROCEDURE IMPORT used (Total process time):
      real time           10.48 seconds
      cpu time            9.82 seconds

 SAS changed the variable names.  What is lost during transcoding?

 

I think that there could be some extraneous characters in this data when it was exported from OPERA into Arias.

Thank you for your help.

 

Thanks.

 

wlierman

ballardw
Super User

Unless you use the options to create ugly name literals the import procedure replaces all non-letter, non_digit and non_underscore characters with _ in the names, which in your case means mostly the spaces and parentheses. You can force use of variable names like "Name with spaces"n but that gets old pretty quickly.

 

Personally I would look at the data step code generated by proc import and modify it to something nicer. Your variables could be just plain Contact, Checksum and Modified_on. Use a label if needed to contain the "do not modify" information or make prettier column labels for reporting.

 

What is lost from transcoding depends on the content and what results. You can manually check the given lines to see what was in the CSV and what the SAS data set has. Since the data is apparently address related I would suspect national language characters like ë or É or similar that are not in your current national language setting.

wlierman
Lapis Lazuli | Level 10

Thank you for the follow-up information.  I will be sure to follow-up, since the data will continue to

come from contact tracers in the field then through an epidemiology-based data file.  Could be plenty of places for extraneous characters to be introduced.

 

Thank you agaoin.

 

wlierman

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 2964 views
  • 2 likes
  • 4 in conversation