This is a good example file to test the %CSV2DS() macro with.
The first thing you will notice is that you cannot use the ZIP file directly with PROC IMPORT.
filename csvzip zip 'c:\downloads\results.csv.zip' member='results.csv';
proc import datafile=csvzip dbms=csv out=import1 replace;
run;
That causes errors:
343 proc import datafile=csvzip dbms=csv out=import1 replace;
344 run;
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
Unable to sample external file, no data in first 5 records.
ERROR: Import unsuccessful. See SAS Log for details.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE IMPORT used (Total process time):
real time 0.06 seconds
cpu time 0.07 seconds
So you will have to unzip the file to use PROC IMPORT. So let's try again with the unzipped file.
filename csv 'c:\downloads\results.csv';
proc import datafile=csv dbms=csv out=import2 replace;
run;
Now at least it makes the dataset, but there are errors in the SAS log.
NOTE: Invalid data for home_score in line 43083 26-27.
NOTE: Invalid data for away_score in line 43083 29-30.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
43083 2021-12-15,Tunisia,Egypt,NA,NA,Arab Cup,Doha,Qatar,TRUE 55
date=2021-12-15 home_team=Tunisia away_team=Egypt home_score=. away_score=. tournament=Arab Cup city=Doha country=Qatar neutral=TRUE
_ERROR_=1 _N_=43082
So let's try again using the GUESSINGROWS=MAX option.
proc import datafile=csv dbms=csv out=import3 replace;
guessingrows=max;
run;
Finally we got something.
NOTE: WORK.IMPORT3 data set was successfully created.
NOTE: The data set WORK.IMPORT3 has 43086 observations and 9 variables.
NOTE: PROCEDURE IMPORT used (Total process time):
real time 11.09 seconds
cpu time 10.28 seconds
Now let's try %CSV2DS() on the zip file.
%csv2ds(csvzip,out=csv2ds1,replace=1);
Results:
NOTE: 43086 records were read from the infile CSVZIP.
The minimum record length was 45.
The maximum record length was 156.
NOTE: The data set WORK.CSV2DS1 has 43086 observations and 9 variables.
NOTE: DATA statement used (Total process time):
real time 0.06 seconds
cpu time 0.06 seconds
NOTE: %INCLUDE (level 1) ending.
NOTE: CSV2DS used (Total process time): 1.09200000762939 seconds
So one tenth of the time compared to PROC IMPORT and we didn't have to waste time (and disk space) unzipping the file.
Let's compare and see how different they are:
proc compare data=import3 compare=csv2ds1;
run;
Values match:
NOTE: No unequal values were found. All values compared are exactly equal.
But the dataset generated by PROC IMPORT has unneeded (and dangerous) formats/informats attached to all of the character variables.
Listing of Common Variables with Differing Attributes
Variable Dataset Type Length Format Informat
date WORK.IMPORT3 Num 8 YYMMDD10. YYMMDD10.
WORK.CSV2DS1 Num 8 YYMMDD10. YYMMDD.
home_team WORK.IMPORT3 Char 32 $32. $32.
WORK.CSV2DS1 Char 32
away_team WORK.IMPORT3 Char 32 $32. $32.
WORK.CSV2DS1 Char 32
home_score WORK.IMPORT3 Char 2 $2. $2.
WORK.CSV2DS1 Char 2
away_score WORK.IMPORT3 Char 2 $2. $2.
WORK.CSV2DS1 Char 2
tournament WORK.IMPORT3 Char 45 $45. $45.
WORK.CSV2DS1 Char 45
city WORK.IMPORT3 Char 28 $28. $28.
WORK.CSV2DS1 Char 28
country WORK.IMPORT3 Char 32 $32. $32.
WORK.CSV2DS1 Char 32
neutral WORK.IMPORT3 Char 5 $5. $5.
WORK.CSV2DS1 Char 5
Now let's compare the SAS code they generated.
Here is the data step generated by %CSV2DS(). 7 statements.
441 +data csv2ds1;
442 + infile CSVZIP dlm=',' dsd truncover firstobs=2 ;
443 + length date 8 home_team $32 away_team $32 home_score $2 away_score $2
444 + tournament $45 city $28 country $32 neutral $5
445 + ;
446 + informat date yymmdd. ;
447 + format date yymmdd10. ;
448 + input date -- neutral ;
449 +run;
Here is the code generated by PROC IMPORT.
404 data WORK.IMPORT3 ;
405 %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
406 infile CSV delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
407 informat date yymmdd10. ;
408 informat home_team $32. ;
409 informat away_team $32. ;
410 informat home_score $2. ;
411 informat away_score $2. ;
412 informat tournament $45. ;
413 informat city $28. ;
414 informat country $32. ;
415 informat neutral $5. ;
416 format date yymmdd10. ;
417 format home_team $32. ;
418 format away_team $32. ;
419 format home_score $2. ;
420 format away_score $2. ;
421 format tournament $45. ;
422 format city $28. ;
423 format country $32. ;
424 format neutral $5. ;
425 input
426 date
427 home_team $
428 away_team $
429 home_score $
430 away_score $
431 tournament $
432 city $
433 country $
434 neutral $
435 ;
436 if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
437 run;
Now let's look at why that second PROC IMPORT (without GUESSINGROWS=MAX) generated errors. The reason is that the file has NA in two score fields that are otherwise numeric, but there aren't any in the first few rows. Now it would be nice if we could read those fields as numbers and set the NA values to missing. One way to do that is to define a special informat that knows what to do with the NA strings. So something like this:
proc format;
invalue na (upcase default=32)
'NA'=._
other=[32.]
;
run;
Now how can we tell SAS to use that informat to read those two score columns as numeric? We could copy the code from the LOG and modify it. But with %CSV2DS() we can also give it a metadata file to override its guessing decisions about how to read the file. So let's make an overrides dataset. We can start by using the _TYPES_ dataset the macro produces to find those variables that have a lot of numeric values and also NA strings.
data overrides;
set _types_;
where max='NA' and .80 < divide(numeric,nonmiss);
keep varnum length informat ;
length='8';
informat='na.';
run;
Now we can re-run the macro passing in those overrides.
%csv2ds(csvzip
,out=csv2ds2
,replace=1
,overrides=overrides
);
Which generates and runs this data step instead.
459 +data csv2ds2;
460 + infile CSVZIP dlm=',' dsd truncover firstobs=2 ;
461 + length date 8 home_team $32 away_team $32 home_score 8 away_score 8
462 + tournament $45 city $28 country $32 neutral $5
463 + ;
464 + informat date yymmdd. home_score na. away_score na. ;
465 + format date yymmdd10. ;
466 + input date -- neutral ;
467 +run;
And now we have HOME_SCORE and AWAY_SCORE as numeric.
The MEANS Procedure
Variable N Mean Std Dev Minimum Maximum
-----------------------------------------------------------------------------------
date 43086 11638.06 8954.60 -31807.00 22632.00
home_score 43081 1.7432279 1.7539820 0 31.0000000
away_score 43081 1.1836076 1.4006205 0 21.0000000
-----------------------------------------------------------------------------------
... View more