BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Yves_Boonen
Quartz | Level 8

Since I am not allowed to access the database, I receive unloads in the form of a CSV file. These CSV files contain 15 to 20 columns and 700.000 rows. At least 2 (sometimes even more) of these columns contain dates. In the CSV file the used format for these dates are dd/mm/yyyy. Since proc import tries to discover and to recognize data on its own, I format these dates to the SAS default format dMMMyyyy to make this process easier.

 

I try to import these files using different methods. Proc import with a CSV file, proc import with an Excel file, data infile with a CSV file, data infile with an Excel file, or simply using File > Import Data and configure it using the SAS data import wizard. (I even use informat and format to define every single variable.)

 

When using a CSV file, I use the ; delimiter. No matter which above method I use, the delimiter is always ignored (even though I state "delimiter = ';' ;" in my script. To give you an example, the first variable will be "employee number;firstname" and the second variable will be "lastname; division;", while it actually had to be employee number; firstname lastname; division; and so on.

 

On top of that if I import either a CSV- or Excel file, the log file will show errors regarding invalid data. The error is "Invalid data for var in line 249 126-133". This causes more than 100.000 rows to lose all of their dates and this makes my analysis useless. I don't get why SAS thinks this is invalid data, because I even formatted it into dMMMyyyy.

 

When I don't format the dates in the original file, SAS will automatically turn the dates into numbers. I have used several methods to change the format of these columns, but they are simply not changing to dates no matter what I do.

 

I want to know what the best way is to handle these imports, knowing I will only receive CSV- and Excel files. I don't mind converting it to another file type or sorting the data before importing it, but I need SAS to stick to the data inside these files. My entire analysis is based on this data, so if the data goes "corrupt" at the very start, I will have a very bad time. Any ideas or suggestions?

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

I imported this:

r_object_id;object_name;document_direction;company;type_of_dossier_flow;kind_of_document;document_handler;dossier_number;detail_kind_of_document;date_treated;r_creation_date;document_state
090241ed8473996c;object_title_1;OUT;division;type_of_dossier;TRIGGER;Name 1;dossier_number_1;details_1;18/mrt/16;28/jul/15;CLASS
090241ed8473ae94;object_title_2;IN;division;type_of_dossier;DOK;Name 2;dossier_number_2;details_2;29/jul/15;28/jul/15;TREATED
090241ed865072af;object_title_3;IN;division;type_of_dossier;DON;Name 3;dossier_number_3;details_3;08/dec/15;27/nov/15;TREATED
090241ed840d8e61;object_title_4;IN;division;type_of_dossier;DOD;Name 4;dossier_number_4;details_4;30/jun/15;26/jun/15;TREATED
090241ed840d9642;object_title_5;OUT;division;type_of_dossier;FAK;Name 5;dossier_number_5;details_5;24/mrt/16;26/jun/15;CLASS

using this code:

%macro convert_date(invar,outvar);
format &outvar yymmdd10.;
&invar = tranwrd(&invar,'mrt','mar');
&invar = tranwrd(&invar,'mei','may');
&invar = tranwrd(&invar,'okt','oct');
&outvar = input(&invar,date9.);
drop &invar;
%mend;

data boonen;
infile '$HOME/sascommunity/boonen.txt' dlm=';' truncover firstobs=2 lrecl=500;
length
  r_object_id $16
  object_name $100
  document_direction $3
  company $10
  type_of_dossier_flow $20
  kind_of_document $10
  document_handler $20
  dossier_number $20
  detail_kind_of_document $100
  date_treated_c $9
  date_treated 5
  r_creation_date_c $9
  r_creation_date 5
  document_state $10
;
input
  r_object_id
  object_name
  document_direction
  company
  type_of_dossier_flow
  kind_of_document
  document_handler
  dossier_number
  detail_kind_of_document
  date_treated_c
  r_creation_date_c
  document_state
;
%convert_date(date_treated_c,date_treated);
%convert_date(r_creation_date_c,r_creation_date);
run;

proc print;
run;

with no errors, thsi is the result:

                                                              document_                   type_of_        kind_of_
                 Obs      r_object_id        object_name      direction    company      dossier_flow      document

                  1     090241ed8473996c    object_title_1       OUT       division    type_of_dossier    TRIGGER 
                  2     090241ed8473ae94    object_title_2       IN        division    type_of_dossier    DOK     
                  3     090241ed865072af    object_title_3       IN        division    type_of_dossier    DON     
                  4     090241ed840d8e61    object_title_4       IN        division    type_of_dossier    DOD     
                  5     090241ed840d9642    object_title_5       OUT       division    type_of_dossier    FAK     

                                                          detail_
                        document_                        kind_of_          date_    r_creation_    document_
                 Obs     handler      dossier_number     document        treated       date          state

                  1      Name 1      dossier_number_1    details_1    2016-03-18    2015-07-28      CLASS   
                  2      Name 2      dossier_number_2    details_2    2015-07-29    2015-07-28      TREATED 
                  3      Name 3      dossier_number_3    details_3    2015-12-08    2015-11-27      TREATED 
                  4      Name 4      dossier_number_4    details_4    2015-06-30    2015-06-26      TREATED 
                  5      Name 5      dossier_number_5    details_5    2016-03-24    2015-06-26      CLASS   

View solution in original post

23 REPLIES 23
Loko
Barite | Level 11

Hello.

 

If I have to choose I will say data infile.

The behaviour you mention is generated by something wrong in the code.

 

If you can give a snapshot of your data and the piece of code you use I am sure the experts in this forum will offer you a great solution.

Yves_Boonen
Quartz | Level 8

The data looks like this:

 

r_object_idobject_namedocument_directioncompanytype_of_dossier_flowkind_of_documentdocument_handlerdossier_numberdetail_kind_of_documentdate_treatedr_creation_datedocument_state
090241ed8473996cobject_title_1OUTdivisiontype_of_dossierTRIGGERName 1dossier_number_1details_118/mrt/1628/jul/15CLASS
090241ed8473ae94object_title_2INdivisiontype_of_dossierDOKName 2dossier_number_2details_229/jul/1528/jul/15TREATED
090241ed865072afobject_title_3INdivisiontype_of_dossierDONName 3dossier_number_3details_308/dec/1527/nov/15TREATED
090241ed840d8e61object_title_4INdivisiontype_of_dossierDODName 4dossier_number_4details_430/jun/1526/jun/15TREATED
090241ed840d9642object_title_5OUTdivisiontype_of_dossierFAKName 5dossier_number_5details_524/mrt/1626/jun/15CLASS

 

As you can see, company and type_of_dossier_flow are always the same value.

 

The variables that really are required for my analysis are kind_of_document, document_handler, date_treated, r_creation_date and document_state.

 

 

This is the proc import code:

 

proc import 
out=test
datafile="c:\temp\test.csv"
dbms=csv replace;
delimiter=';';
getnames=yes;
datarow=2;
run;

/* FOR EXCEL I CHANGE DATAFILE EXTENTION TO XLSX, DBMS TO XLSX AND REMOVE DELIMITER. */

 

 

This is the infile code:

 

data test;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile 'C:\temp\test.csv' firstobs=2;
informat kind_of_document $25.;
informat document_handler $25.;
informat date_treated DATE9.;
informat r_creation_date DATE9.;
informat document_state $25.;
format kind_of_document $25.;
format document_handler $25.;
format date_treated DATE9.;
format r_creation_date DATE9.;
format document_state $25.;
input
kind_of_document $
document_handler $
date_treated
r_creation_date
document_state $
;
if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */
run;
Loko
Barite | Level 11

Hello,

 

You shall import the data file as it is and limit the kept variables within the sas file as to your needs.

So you shall adapt your code, something like this:

 

data test;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile 'C:\temp\test.csv' firstobs=2;
informat r_object_id $25.;
informat object_name $25.;
etc...
format r_object_ud $25.;
etc.. input r_object_id.; ;

keep
YOUR INTEREST VARIABLES
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */ run;

 

 

Yves_Boonen
Quartz | Level 8

Also, is it worth it to make an Excel into a library? (I just briefly touched this subject, so the code below will most likely be incomplete.)

 

Like:

 

libname want "C:\temp\sas_data\excelfile.xlsx"

Kurt_Bremser
Super User

STAY AWAY from Excel. Period. Causes you endless grief.

 

And also, don't use PROC IMPORT. It makes the computer make guesses about the data structure, and COMPUTERS ARE DUMB.

 

Create a properly written data step from the definition, and you'll be fine forever after. If the structure changes unexpectedly, you get ERROR messages and can react accordingly.

 

If you need help with creating a data step, provide some test data in original file format and a short description.

Yves_Boonen
Quartz | Level 8

The data is provided above. The result a table with kind_of_document, document_handler, date_treated, r_creation_date and document_state.

 

As soon as I can get that table into SAS, I can start working on the data.

BrunoMueller
SAS Super FREQ

Hi

 

You are using locale specific month names. SAS can deal with this kind of date when you set the LOCALE= system option accordingly and use one of the NL... date informats. It looks liek that for dutch SAS expects the month abreviation as mrt. instead of just mrt as in your example.

 

Below is some code to illustrate this. So if you can get the dates in a locale independent format, everything would be a lot easier.

 

options locale=nl_NL ;
data want;
  infile cards dlm=";";
  input
    detail_kind_of_document : $32.
    date_treated	: $32.
  ;
  someDate = "01mar2016"d;
  date_treated_n = input(date_treated, nldate20.);
  format date_treated_n ddmmyyp10.;
  format someDate NLDATEL12.;
cards4;
details_1;18/mrt./16
details_2;29/jul/15
details_3;08/dec/15
details_4;30/jun/15
details_5;24/mrt/16
;;;;

Bruno

Yves_Boonen
Quartz | Level 8

What does that cards command doe actually?

 

I have seen it a couple times now, but I am clueless as to why you would use that.

BrunoMueller
SAS Super FREQ

Yves

 

The CARDS or DATALINES statement allows to simulate the data from an external file as part of your code. this is very convenient for testing purposes or provide sample data as part of a program.

 

Bruno

 

 

Yves_Boonen
Quartz | Level 8

This is my progress now:

 

data tables.test;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile 'C:\temp\SAS_Data\Tables\document.csv' firstobs=2;
informat r_object_id $16.;informat object_name $25.;informat document_direction $5.;informat company $3.;informat type_of_dossier_flow $15.;informat kind_of_document $8.;informat document_handler $25.;informat dossier_number $6.;informat details_kind_of_document $10.;informat date_treated date9.;informat r_creation_date date9.;informat document_state $10.;
format r_object_id $16.;format object_name $25.;format document_direction $5.;format company $3.;format type_of_dossier_flow $15.;format kind_of_document $8.;format document_handler $25.;format dossier_number $6.;format details_kind_of_document $10.;format date_treated date9.;format r_creation_date date9.;format document_state $10.;
input
r_object_id.; /* <--- Why is this here? Do I need to add anything to it? */ ;keep kind_ofdocument document_handler date_treated r_creation_date document_state; if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */ run;

 

403  data tables.test;
404  %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
405  infile 'C:\temp\SAS_Data\Tables\document.csv' firstobs=2;
406  informat r_object_id $16.;informat object_name $25.;informat document_direction $5.;informat
406! company $3.;informat type_of_dossier_flow $15.;informat kind_of_document $8.;informat
406! document_handler $25.;informat dossier_number $6.;informat details_kind_of_document $10.;
406! informat date_treated date9.;informat r_creation_date date9.;informat document_state $10.;
407  format r_object_id $16.;format object_name $25.;format document_direction $5.;format company
407! $3.;format type_of_dossier_flow $15.;format kind_of_document $8.;format document_handler $25.
407! ;format dossier_number $6.;format details_kind_of_document $10.;format date_treated date9.;
407! format r_creation_date date9.;format document_state $10.;
408  input
409
410  ;keep kind_ofdocument document_handler date_treated r_creation_date document_state;
411  if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */
412  run;

NOTE: Variable r_object_id is uninitialized.
NOTE: Variable object_name is uninitialized.
NOTE: Variable document_direction is uninitialized.
NOTE: Variable company is uninitialized.
NOTE: Variable type_of_dossier_flow is uninitialized.
NOTE: Variable kind_of_document is uninitialized.
NOTE: Variable document_handler is uninitialized.
NOTE: Variable dossier_number is uninitialized.
NOTE: Variable details_kind_of_document is uninitialized.
NOTE: Variable date_treated is uninitialized.
NOTE: Variable r_creation_date is uninitialized.
NOTE: Variable document_state is uninitialized.
WARNING: The variable kind_ofdocument in the DROP, KEEP, or RENAME list has never been referenced.
NOTE: The infile 'C:\temp\SAS_Data\Tables\document.csv' is:
      Filename=C:\temp\SAS_Data\Tables\document.csv,
      RECFM=V,LRECL=32767,
      File Size (bytes)=123899780,
      Last Modified=08 april 2016 11:24:34 uur,
      Create Time=08 april 2016 08:43:27 uur

NOTE: 696743 records were read from the infile 'C:\temp\SAS_Data\Tables\document.csv'.
      The minimum record length was 81.
      The maximum record length was 348.
NOTE: The data set TABLES.TEST has 696743 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.29 seconds
      cpu time            0.29 seconds


 

Result:

An empty table with my keep variables.

Loko
Barite | Level 11

Hello,

 

You need the inout statement to bring the values to sas database.

 

So just add :

input
 r_object_id $
 format object_name $
 document_direction $
 format company $
 format type_of_dossier_flow $
 kind_of_document $
 document_handler $
 dossier_number $
 details_kind_of_document $
 date_treated
 r_creation_date
 document_state $
 ;

Yves_Boonen
Quartz | Level 8

Now the issue I was having, has come up again.

 

453  data tables.test;
454  %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
455  infile 'C:\temp\SAS_Data\Tables\document.csv' firstobs=2;
456  informat r_object_id $16.;informat object_name $25.;informat document_direction $5.;informat
456! company $3.;informat type_of_dossier_flow $15.;informat kind_of_document $32.;informat
456! document_handler $32.;informat dossier_number $6.;informat details_kind_of_document $10.;
456! informat date_treated date11.;informat r_creation_date date11.;informat document_state $32.;
457  format r_object_id $16.;format object_name $25.;format document_direction $5.;format company
457! $3.;format type_of_dossier_flow $15.;format kind_of_document $32.;format document_handler
457! $32.;format dossier_number $6.;format details_kind_of_document $10.;format date_treated
457! date11.;format r_creation_date date11.;format document_state $32.;
458  input
459  r_object_id $
460  format object_name $
461  document_direction $
462  format company $
463  format type_of_dossier_flow $
464  kind_of_document $
465  document_handler $
466  dossier_number $
467  details_kind_of_document $
468  date_treated
469  r_creation_date
470  document_state $
471  ;
472  keep kind_of_document document_handler date_treated r_creation_date document_state;
473  if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */
474  run;

NOTE: The infile 'C:\temp\SAS_Data\Tables\document.csv' is:
      Filename=C:\temp\SAS_Data\Tables\document.csv,
      RECFM=V,LRECL=32767,
      File Size (bytes)=123899780,
      Last Modified=08 april 2016 11:24:34 uur,
      Create Time=08 april 2016 08:43:27 uur

NOTE: Invalid data for format in line 2 31-35.
NOTE: Invalid data for format in line 2 108-112.
NOTE: Invalid data for format in line 3 1-24.
NOTE: Invalid data for date_treated in line 4 1-24.
NOTE: Invalid data for r_creation_date in line 4 26-70.
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+


data tables.test;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile 'C:\temp\SAS_Data\Tables\document.csv' firstobs=2;
informat r_object_id $16.;informat object_name $25.;informat document_direction $5.;informat company $3.;informat type_of_dossier_flow $15.;informat kind_of_document $32.;informat document_handler $32.;informat dossier_number $6.;informat details_kind_of_document $10.;informat date_treated date11.;informat r_creation_date date11.;informat document_state $32.;
format r_object_id $16.;format object_name $25.;format document_direction $5.;format company $3.;format type_of_dossier_flow $15.;format kind_of_document $32.;format document_handler $32.;format dossier_number $6.;format details_kind_of_document $10.;format date_treated date11.;format r_creation_date date11.;format document_state $32.;
input
r_object_id $
format object_name $
document_direction $
format company $
format type_of_dossier_flow $
kind_of_document $
document_handler $
dossier_number $
details_kind_of_document $
date_treated
r_creation_date
document_state $
;
keep kind_of_document document_handler date_treated r_creation_date document_state;
if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */
run;

And the result is a crazy table. See the attachment.

 


test.JPG
Loko
Barite | Level 11

Hello,

 

in your code you have written format where it should have been input 😉

Kurt_Bremser
Super User

I imported this:

r_object_id;object_name;document_direction;company;type_of_dossier_flow;kind_of_document;document_handler;dossier_number;detail_kind_of_document;date_treated;r_creation_date;document_state
090241ed8473996c;object_title_1;OUT;division;type_of_dossier;TRIGGER;Name 1;dossier_number_1;details_1;18/mrt/16;28/jul/15;CLASS
090241ed8473ae94;object_title_2;IN;division;type_of_dossier;DOK;Name 2;dossier_number_2;details_2;29/jul/15;28/jul/15;TREATED
090241ed865072af;object_title_3;IN;division;type_of_dossier;DON;Name 3;dossier_number_3;details_3;08/dec/15;27/nov/15;TREATED
090241ed840d8e61;object_title_4;IN;division;type_of_dossier;DOD;Name 4;dossier_number_4;details_4;30/jun/15;26/jun/15;TREATED
090241ed840d9642;object_title_5;OUT;division;type_of_dossier;FAK;Name 5;dossier_number_5;details_5;24/mrt/16;26/jun/15;CLASS

using this code:

%macro convert_date(invar,outvar);
format &outvar yymmdd10.;
&invar = tranwrd(&invar,'mrt','mar');
&invar = tranwrd(&invar,'mei','may');
&invar = tranwrd(&invar,'okt','oct');
&outvar = input(&invar,date9.);
drop &invar;
%mend;

data boonen;
infile '$HOME/sascommunity/boonen.txt' dlm=';' truncover firstobs=2 lrecl=500;
length
  r_object_id $16
  object_name $100
  document_direction $3
  company $10
  type_of_dossier_flow $20
  kind_of_document $10
  document_handler $20
  dossier_number $20
  detail_kind_of_document $100
  date_treated_c $9
  date_treated 5
  r_creation_date_c $9
  r_creation_date 5
  document_state $10
;
input
  r_object_id
  object_name
  document_direction
  company
  type_of_dossier_flow
  kind_of_document
  document_handler
  dossier_number
  detail_kind_of_document
  date_treated_c
  r_creation_date_c
  document_state
;
%convert_date(date_treated_c,date_treated);
%convert_date(r_creation_date_c,r_creation_date);
run;

proc print;
run;

with no errors, thsi is the result:

                                                              document_                   type_of_        kind_of_
                 Obs      r_object_id        object_name      direction    company      dossier_flow      document

                  1     090241ed8473996c    object_title_1       OUT       division    type_of_dossier    TRIGGER 
                  2     090241ed8473ae94    object_title_2       IN        division    type_of_dossier    DOK     
                  3     090241ed865072af    object_title_3       IN        division    type_of_dossier    DON     
                  4     090241ed840d8e61    object_title_4       IN        division    type_of_dossier    DOD     
                  5     090241ed840d9642    object_title_5       OUT       division    type_of_dossier    FAK     

                                                          detail_
                        document_                        kind_of_          date_    r_creation_    document_
                 Obs     handler      dossier_number     document        treated       date          state

                  1      Name 1      dossier_number_1    details_1    2016-03-18    2015-07-28      CLASS   
                  2      Name 2      dossier_number_2    details_2    2015-07-29    2015-07-28      TREATED 
                  3      Name 3      dossier_number_3    details_3    2015-12-08    2015-11-27      TREATED 
                  4      Name 4      dossier_number_4    details_4    2015-06-30    2015-06-26      TREATED 
                  5      Name 5      dossier_number_5    details_5    2016-03-24    2015-06-26      CLASS   

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 23 replies
  • 4319 views
  • 6 likes
  • 5 in conversation