Since I am not allowed to access the database, I receive unloads in the form of a CSV file. These CSV files contain 15 to 20 columns and 700.000 rows. At least 2 (sometimes even more) of these columns contain dates. In the CSV file the used format for these dates are dd/mm/yyyy. Since proc import tries to discover and to recognize data on its own, I format these dates to the SAS default format dMMMyyyy to make this process easier.
I try to import these files using different methods. Proc import with a CSV file, proc import with an Excel file, data infile with a CSV file, data infile with an Excel file, or simply using File > Import Data and configure it using the SAS data import wizard. (I even use informat and format to define every single variable.)
When using a CSV file, I use the ; delimiter. No matter which above method I use, the delimiter is always ignored (even though I state "delimiter = ';' ;" in my script. To give you an example, the first variable will be "employee number;firstname" and the second variable will be "lastname; division;", while it actually had to be employee number; firstname lastname; division; and so on.
On top of that if I import either a CSV- or Excel file, the log file will show errors regarding invalid data. The error is "Invalid data for var in line 249 126-133". This causes more than 100.000 rows to lose all of their dates and this makes my analysis useless. I don't get why SAS thinks this is invalid data, because I even formatted it into dMMMyyyy.
When I don't format the dates in the original file, SAS will automatically turn the dates into numbers. I have used several methods to change the format of these columns, but they are simply not changing to dates no matter what I do.
I want to know what the best way is to handle these imports, knowing I will only receive CSV- and Excel files. I don't mind converting it to another file type or sorting the data before importing it, but I need SAS to stick to the data inside these files. My entire analysis is based on this data, so if the data goes "corrupt" at the very start, I will have a very bad time. Any ideas or suggestions?
I imported this:
r_object_id;object_name;document_direction;company;type_of_dossier_flow;kind_of_document;document_handler;dossier_number;detail_kind_of_document;date_treated;r_creation_date;document_state 090241ed8473996c;object_title_1;OUT;division;type_of_dossier;TRIGGER;Name 1;dossier_number_1;details_1;18/mrt/16;28/jul/15;CLASS 090241ed8473ae94;object_title_2;IN;division;type_of_dossier;DOK;Name 2;dossier_number_2;details_2;29/jul/15;28/jul/15;TREATED 090241ed865072af;object_title_3;IN;division;type_of_dossier;DON;Name 3;dossier_number_3;details_3;08/dec/15;27/nov/15;TREATED 090241ed840d8e61;object_title_4;IN;division;type_of_dossier;DOD;Name 4;dossier_number_4;details_4;30/jun/15;26/jun/15;TREATED 090241ed840d9642;object_title_5;OUT;division;type_of_dossier;FAK;Name 5;dossier_number_5;details_5;24/mrt/16;26/jun/15;CLASS
using this code:
%macro convert_date(invar,outvar);
format &outvar yymmdd10.;
&invar = tranwrd(&invar,'mrt','mar');
&invar = tranwrd(&invar,'mei','may');
&invar = tranwrd(&invar,'okt','oct');
&outvar = input(&invar,date9.);
drop &invar;
%mend;
data boonen;
infile '$HOME/sascommunity/boonen.txt' dlm=';' truncover firstobs=2 lrecl=500;
length
r_object_id $16
object_name $100
document_direction $3
company $10
type_of_dossier_flow $20
kind_of_document $10
document_handler $20
dossier_number $20
detail_kind_of_document $100
date_treated_c $9
date_treated 5
r_creation_date_c $9
r_creation_date 5
document_state $10
;
input
r_object_id
object_name
document_direction
company
type_of_dossier_flow
kind_of_document
document_handler
dossier_number
detail_kind_of_document
date_treated_c
r_creation_date_c
document_state
;
%convert_date(date_treated_c,date_treated);
%convert_date(r_creation_date_c,r_creation_date);
run;
proc print;
run;
with no errors, thsi is the result:
document_ type_of_ kind_of_ Obs r_object_id object_name direction company dossier_flow document 1 090241ed8473996c object_title_1 OUT division type_of_dossier TRIGGER 2 090241ed8473ae94 object_title_2 IN division type_of_dossier DOK 3 090241ed865072af object_title_3 IN division type_of_dossier DON 4 090241ed840d8e61 object_title_4 IN division type_of_dossier DOD 5 090241ed840d9642 object_title_5 OUT division type_of_dossier FAK detail_ document_ kind_of_ date_ r_creation_ document_ Obs handler dossier_number document treated date state 1 Name 1 dossier_number_1 details_1 2016-03-18 2015-07-28 CLASS 2 Name 2 dossier_number_2 details_2 2015-07-29 2015-07-28 TREATED 3 Name 3 dossier_number_3 details_3 2015-12-08 2015-11-27 TREATED 4 Name 4 dossier_number_4 details_4 2015-06-30 2015-06-26 TREATED 5 Name 5 dossier_number_5 details_5 2016-03-24 2015-06-26 CLASS
Hello.
If I have to choose I will say data infile.
The behaviour you mention is generated by something wrong in the code.
If you can give a snapshot of your data and the piece of code you use I am sure the experts in this forum will offer you a great solution.
The data looks like this:
r_object_id | object_name | document_direction | company | type_of_dossier_flow | kind_of_document | document_handler | dossier_number | detail_kind_of_document | date_treated | r_creation_date | document_state |
090241ed8473996c | object_title_1 | OUT | division | type_of_dossier | TRIGGER | Name 1 | dossier_number_1 | details_1 | 18/mrt/16 | 28/jul/15 | CLASS |
090241ed8473ae94 | object_title_2 | IN | division | type_of_dossier | DOK | Name 2 | dossier_number_2 | details_2 | 29/jul/15 | 28/jul/15 | TREATED |
090241ed865072af | object_title_3 | IN | division | type_of_dossier | DON | Name 3 | dossier_number_3 | details_3 | 08/dec/15 | 27/nov/15 | TREATED |
090241ed840d8e61 | object_title_4 | IN | division | type_of_dossier | DOD | Name 4 | dossier_number_4 | details_4 | 30/jun/15 | 26/jun/15 | TREATED |
090241ed840d9642 | object_title_5 | OUT | division | type_of_dossier | FAK | Name 5 | dossier_number_5 | details_5 | 24/mrt/16 | 26/jun/15 | CLASS |
As you can see, company and type_of_dossier_flow are always the same value.
The variables that really are required for my analysis are kind_of_document, document_handler, date_treated, r_creation_date and document_state.
This is the proc import code:
proc import
out=test
datafile="c:\temp\test.csv"
dbms=csv replace;
delimiter=';';
getnames=yes;
datarow=2;
run;
/* FOR EXCEL I CHANGE DATAFILE EXTENTION TO XLSX, DBMS TO XLSX AND REMOVE DELIMITER. */
This is the infile code:
data test;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile 'C:\temp\test.csv' firstobs=2;
informat kind_of_document $25.;
informat document_handler $25.;
informat date_treated DATE9.;
informat r_creation_date DATE9.;
informat document_state $25.;
format kind_of_document $25.;
format document_handler $25.;
format date_treated DATE9.;
format r_creation_date DATE9.;
format document_state $25.;
input
kind_of_document $
document_handler $
date_treated
r_creation_date
document_state $
;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
Hello,
You shall import the data file as it is and limit the kept variables within the sas file as to your needs.
So you shall adapt your code, something like this:
data test;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile 'C:\temp\test.csv' firstobs=2;
informat r_object_id $25.;
informat object_name $25.;
etc...
format r_object_ud $25.;
etc..
input
r_object_id.;
;
keep
YOUR INTEREST VARIABLES
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
Also, is it worth it to make an Excel into a library? (I just briefly touched this subject, so the code below will most likely be incomplete.)
Like:
libname want "C:\temp\sas_data\excelfile.xlsx"
STAY AWAY from Excel. Period. Causes you endless grief.
And also, don't use PROC IMPORT. It makes the computer make guesses about the data structure, and COMPUTERS ARE DUMB.
Create a properly written data step from the definition, and you'll be fine forever after. If the structure changes unexpectedly, you get ERROR messages and can react accordingly.
If you need help with creating a data step, provide some test data in original file format and a short description.
The data is provided above. The result a table with kind_of_document, document_handler, date_treated, r_creation_date and document_state.
As soon as I can get that table into SAS, I can start working on the data.
Hi
You are using locale specific month names. SAS can deal with this kind of date when you set the LOCALE= system option accordingly and use one of the NL... date informats. It looks liek that for dutch SAS expects the month abreviation as mrt. instead of just mrt as in your example.
Below is some code to illustrate this. So if you can get the dates in a locale independent format, everything would be a lot easier.
options locale=nl_NL ;
data want;
infile cards dlm=";";
input
detail_kind_of_document : $32.
date_treated : $32.
;
someDate = "01mar2016"d;
date_treated_n = input(date_treated, nldate20.);
format date_treated_n ddmmyyp10.;
format someDate NLDATEL12.;
cards4;
details_1;18/mrt./16
details_2;29/jul/15
details_3;08/dec/15
details_4;30/jun/15
details_5;24/mrt/16
;;;;
Bruno
What does that cards command doe actually?
I have seen it a couple times now, but I am clueless as to why you would use that.
Yves
The CARDS or DATALINES statement allows to simulate the data from an external file as part of your code. this is very convenient for testing purposes or provide sample data as part of a program.
Bruno
This is my progress now:
data tables.test;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile 'C:\temp\SAS_Data\Tables\document.csv' firstobs=2;
informat r_object_id $16.;informat object_name $25.;informat document_direction $5.;informat company $3.;informat type_of_dossier_flow $15.;informat kind_of_document $8.;informat document_handler $25.;informat dossier_number $6.;informat details_kind_of_document $10.;informat date_treated date9.;informat r_creation_date date9.;informat document_state $10.;
format r_object_id $16.;format object_name $25.;format document_direction $5.;format company $3.;format type_of_dossier_flow $15.;format kind_of_document $8.;format document_handler $25.;format dossier_number $6.;format details_kind_of_document $10.;format date_treated date9.;format r_creation_date date9.;format document_state $10.;
input
r_object_id.; /* <--- Why is this here? Do I need to add anything to it? */
;keep kind_ofdocument document_handler date_treated r_creation_date document_state;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
403 data tables.test;
404 %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
405 infile 'C:\temp\SAS_Data\Tables\document.csv' firstobs=2;
406 informat r_object_id $16.;informat object_name $25.;informat document_direction $5.;informat
406! company $3.;informat type_of_dossier_flow $15.;informat kind_of_document $8.;informat
406! document_handler $25.;informat dossier_number $6.;informat details_kind_of_document $10.;
406! informat date_treated date9.;informat r_creation_date date9.;informat document_state $10.;
407 format r_object_id $16.;format object_name $25.;format document_direction $5.;format company
407! $3.;format type_of_dossier_flow $15.;format kind_of_document $8.;format document_handler $25.
407! ;format dossier_number $6.;format details_kind_of_document $10.;format date_treated date9.;
407! format r_creation_date date9.;format document_state $10.;
408 input
409
410 ;keep kind_ofdocument document_handler date_treated r_creation_date document_state;
411 if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
412 run;
NOTE: Variable r_object_id is uninitialized.
NOTE: Variable object_name is uninitialized.
NOTE: Variable document_direction is uninitialized.
NOTE: Variable company is uninitialized.
NOTE: Variable type_of_dossier_flow is uninitialized.
NOTE: Variable kind_of_document is uninitialized.
NOTE: Variable document_handler is uninitialized.
NOTE: Variable dossier_number is uninitialized.
NOTE: Variable details_kind_of_document is uninitialized.
NOTE: Variable date_treated is uninitialized.
NOTE: Variable r_creation_date is uninitialized.
NOTE: Variable document_state is uninitialized.
WARNING: The variable kind_ofdocument in the DROP, KEEP, or RENAME list has never been referenced.
NOTE: The infile 'C:\temp\SAS_Data\Tables\document.csv' is:
Filename=C:\temp\SAS_Data\Tables\document.csv,
RECFM=V,LRECL=32767,
File Size (bytes)=123899780,
Last Modified=08 april 2016 11:24:34 uur,
Create Time=08 april 2016 08:43:27 uur
NOTE: 696743 records were read from the infile 'C:\temp\SAS_Data\Tables\document.csv'.
The minimum record length was 81.
The maximum record length was 348.
NOTE: The data set TABLES.TEST has 696743 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 0.29 seconds
cpu time 0.29 seconds
Result:
An empty table with my keep variables.
Hello,
You need the inout statement to bring the values to sas database.
So just add :
input
r_object_id $
format object_name $
document_direction $
format company $
format type_of_dossier_flow $
kind_of_document $
document_handler $
dossier_number $
details_kind_of_document $
date_treated
r_creation_date
document_state $
;
Now the issue I was having, has come up again.
453 data tables.test;
454 %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
455 infile 'C:\temp\SAS_Data\Tables\document.csv' firstobs=2;
456 informat r_object_id $16.;informat object_name $25.;informat document_direction $5.;informat
456! company $3.;informat type_of_dossier_flow $15.;informat kind_of_document $32.;informat
456! document_handler $32.;informat dossier_number $6.;informat details_kind_of_document $10.;
456! informat date_treated date11.;informat r_creation_date date11.;informat document_state $32.;
457 format r_object_id $16.;format object_name $25.;format document_direction $5.;format company
457! $3.;format type_of_dossier_flow $15.;format kind_of_document $32.;format document_handler
457! $32.;format dossier_number $6.;format details_kind_of_document $10.;format date_treated
457! date11.;format r_creation_date date11.;format document_state $32.;
458 input
459 r_object_id $
460 format object_name $
461 document_direction $
462 format company $
463 format type_of_dossier_flow $
464 kind_of_document $
465 document_handler $
466 dossier_number $
467 details_kind_of_document $
468 date_treated
469 r_creation_date
470 document_state $
471 ;
472 keep kind_of_document document_handler date_treated r_creation_date document_state;
473 if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
474 run;
NOTE: The infile 'C:\temp\SAS_Data\Tables\document.csv' is:
Filename=C:\temp\SAS_Data\Tables\document.csv,
RECFM=V,LRECL=32767,
File Size (bytes)=123899780,
Last Modified=08 april 2016 11:24:34 uur,
Create Time=08 april 2016 08:43:27 uur
NOTE: Invalid data for format in line 2 31-35.
NOTE: Invalid data for format in line 2 108-112.
NOTE: Invalid data for format in line 3 1-24.
NOTE: Invalid data for date_treated in line 4 1-24.
NOTE: Invalid data for r_creation_date in line 4 26-70.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+
data tables.test;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile 'C:\temp\SAS_Data\Tables\document.csv' firstobs=2;
informat r_object_id $16.;informat object_name $25.;informat document_direction $5.;informat company $3.;informat type_of_dossier_flow $15.;informat kind_of_document $32.;informat document_handler $32.;informat dossier_number $6.;informat details_kind_of_document $10.;informat date_treated date11.;informat r_creation_date date11.;informat document_state $32.;
format r_object_id $16.;format object_name $25.;format document_direction $5.;format company $3.;format type_of_dossier_flow $15.;format kind_of_document $32.;format document_handler $32.;format dossier_number $6.;format details_kind_of_document $10.;format date_treated date11.;format r_creation_date date11.;format document_state $32.;
input
r_object_id $
format object_name $
document_direction $
format company $
format type_of_dossier_flow $
kind_of_document $
document_handler $
dossier_number $
details_kind_of_document $
date_treated
r_creation_date
document_state $
;
keep kind_of_document document_handler date_treated r_creation_date document_state;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
And the result is a crazy table. See the attachment.
Hello,
in your code you have written format where it should have been input 😉
I imported this:
r_object_id;object_name;document_direction;company;type_of_dossier_flow;kind_of_document;document_handler;dossier_number;detail_kind_of_document;date_treated;r_creation_date;document_state 090241ed8473996c;object_title_1;OUT;division;type_of_dossier;TRIGGER;Name 1;dossier_number_1;details_1;18/mrt/16;28/jul/15;CLASS 090241ed8473ae94;object_title_2;IN;division;type_of_dossier;DOK;Name 2;dossier_number_2;details_2;29/jul/15;28/jul/15;TREATED 090241ed865072af;object_title_3;IN;division;type_of_dossier;DON;Name 3;dossier_number_3;details_3;08/dec/15;27/nov/15;TREATED 090241ed840d8e61;object_title_4;IN;division;type_of_dossier;DOD;Name 4;dossier_number_4;details_4;30/jun/15;26/jun/15;TREATED 090241ed840d9642;object_title_5;OUT;division;type_of_dossier;FAK;Name 5;dossier_number_5;details_5;24/mrt/16;26/jun/15;CLASS
using this code:
%macro convert_date(invar,outvar);
format &outvar yymmdd10.;
&invar = tranwrd(&invar,'mrt','mar');
&invar = tranwrd(&invar,'mei','may');
&invar = tranwrd(&invar,'okt','oct');
&outvar = input(&invar,date9.);
drop &invar;
%mend;
data boonen;
infile '$HOME/sascommunity/boonen.txt' dlm=';' truncover firstobs=2 lrecl=500;
length
r_object_id $16
object_name $100
document_direction $3
company $10
type_of_dossier_flow $20
kind_of_document $10
document_handler $20
dossier_number $20
detail_kind_of_document $100
date_treated_c $9
date_treated 5
r_creation_date_c $9
r_creation_date 5
document_state $10
;
input
r_object_id
object_name
document_direction
company
type_of_dossier_flow
kind_of_document
document_handler
dossier_number
detail_kind_of_document
date_treated_c
r_creation_date_c
document_state
;
%convert_date(date_treated_c,date_treated);
%convert_date(r_creation_date_c,r_creation_date);
run;
proc print;
run;
with no errors, thsi is the result:
document_ type_of_ kind_of_ Obs r_object_id object_name direction company dossier_flow document 1 090241ed8473996c object_title_1 OUT division type_of_dossier TRIGGER 2 090241ed8473ae94 object_title_2 IN division type_of_dossier DOK 3 090241ed865072af object_title_3 IN division type_of_dossier DON 4 090241ed840d8e61 object_title_4 IN division type_of_dossier DOD 5 090241ed840d9642 object_title_5 OUT division type_of_dossier FAK detail_ document_ kind_of_ date_ r_creation_ document_ Obs handler dossier_number document treated date state 1 Name 1 dossier_number_1 details_1 2016-03-18 2015-07-28 CLASS 2 Name 2 dossier_number_2 details_2 2015-07-29 2015-07-28 TREATED 3 Name 3 dossier_number_3 details_3 2015-12-08 2015-11-27 TREATED 4 Name 4 dossier_number_4 details_4 2015-06-30 2015-06-26 TREATED 5 Name 5 dossier_number_5 details_5 2016-03-24 2015-06-26 CLASS
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.