Hello everyone,
I ve got a dataset with 40k rows. Under the variable Date i have numeric dates with the format DD-MM-YY but also i have dates prior to 1900 and they are char and the format is YYYY-MM-DD. I cant see how am i gonna work around that.
DATA firstload;
INFILE '/mypath/results.csv' DLM=',' FIRSTOBS=2 DSD TRUNCOVER;
INPUT Date:$10. Home_Team:$32. Away_Team:$32. Home_Score:2. Away_Score:2. Tournament:$45. City:$28. Country:$32. Neutral:$5.;
RUN;
DATA task2.results;
RETAIN Date;
SET firstload (RENAME=(Date=Old));
Date = put(Old,YYMMDD10.);
DROP Old;
FORMAT Date YYMMDD10.;
RUN;
Here are my blocks.
and here is a screenshot of the results i am getting.
and here is the contents
WHEN i try the below
DATA firstload;
INFILE 'mypath/results.csv' DLM=',' FIRSTOBS=2 DSD TRUNCOVER;
INPUT Date:$10. Home_Team:$32. Away_Team:$32. Home_Score:2. Away_Score:2. Tournament:$45.
City:$28. Country:$32. Neutral:$5. Date2:YYMMDD10.;
RUN;
DATA task2.results;
RETAIN Date2;
SET firstload;
Date2 = input(Date, YYMMDD10.);
FORMAT Date2 YYMMDD10.;
DROP Date;
RENAME Date2=Date;
RUN;
i get
but the problem persists when im trying to PROC SORT By date as u can see the dates above...
ANY IDEAS?
It looks like you have dates that are both 4 digit and 2 digit years in the data set.
And it looks like the 4 digit years are fine, the two digit years are not and everything is being read as character.
For the dates with the two digit year, what should be the first two digits, 1911 or 2011?
How do you know which ones are 1911 versus 2011?
@Primohunter wrote:
Hello everyone,
I ve got a dataset with 40k rows. Under the variable Date i have numeric dates with the format DD-MM-YY but also i have dates prior to 1900 and they are char and the format is YYYY-MM-DD. I cant see how am i gonna work around that.DATA firstload;
INFILE '/mypath/results.csv' DLM=',' FIRSTOBS=2 DSD TRUNCOVER;
INPUT Date:$10. Home_Team:$32. Away_Team:$32. Home_Score:2. Away_Score:2. Tournament:$45. City:$28. Country:$32. Neutral:$5.;
RUN;
DATA task2.results;
RETAIN Date;
SET firstload (RENAME=(Date=Old));
Date = put(Old,YYMMDD10.);
DROP Old;
FORMAT Date YYMMDD10.;
RUN;
Here are my blocks.
and here is a screenshot of the results i am getting.
and here is the contents
WHEN i try the below
DATA firstload;
INFILE 'mypath/results.csv' DLM=',' FIRSTOBS=2 DSD TRUNCOVER;
INPUT Date:$10. Home_Team:$32. Away_Team:$32. Home_Score:2. Away_Score:2. Tournament:$45.
City:$28. Country:$32. Neutral:$5. Date2:YYMMDD10.;
RUN;
DATA task2.results;
RETAIN Date2;
SET firstload;
Date2 = input(Date, YYMMDD10.);
FORMAT Date2 YYMMDD10.;
DROP Date;
RENAME Date2=Date;RUN;
i get
but the problem persists when im trying to PROC SORT By date as u can see the dates above...
ANY IDEAS?
Well i hope this answers your question.
When i open my CSV file in excel, i could see that the dates prior to 1900 are all in GENERAL Form and they are all written in the form of 1899-10-23
and the dates after 1900 are all in DATES form (DD-MM-YY). i really dont know hot to fix this via SaS.. Any way to remove first all the numeric dates and change the character ones to numerics and then add the numeric ones? is there another way?
Do you have an EXCEL file or a CSV file. Note that a CSV is just a plain old text file and has nothing to do with EXCEL.
You will confuse yourself if you let EXCEL open a CSV on its own as it will change some of the values without telling you it is doing that. For example strings that it thinks represent numbers become numbers. Strings that it thinks are dates become dates.
To look at a CSV file use a TEXT editor. Like the SAS program editor. Or a SAS data step.
To see the first few lines of you text file use a data step like this:
DATA _null_;
INFILE 'mypath/results.csv' obs=10;
input;
list;
run;
The file is .CSV just as i have downloaded it from kaggle. I used excel just to check the format and help myself understand why i cant solve the problem.
The sas text editor was showing me the same thing
Probably the column Date has character values written like YYYY-DD-MM that. and Also dates with numeric values with the format DD-MM-YY...
@Primohunter wrote:
The file is .CSV just as i have downloaded it from kaggle. I used excel just to check the format and help myself understand why i cant solve the problem.
Excel will pick its own display and possibly CHANGE the values in CSV files depending on the contents. So using Excel to check anything other than column alignment is hazardous. If you SAVE the file from Excel you may well want to go back and find an original as it may have been corrupted. Dates are one of the things Excel can do very strange things with.
Plus Excel doesn't do very well with dates prior to 1900. SAS at least doesn't have many issues until get prior to about 1585 or so.
How 2-digit years are treated depends on your current setting of the YEARCUTOFF option in SAS. You can run this to check your setting:
proc options option=yearcutoff; run;
Which will show something like this in the log:
YEARCUTOFF=1926 Specifies the first year of a 100-year span that is used by date informats and functions to read a two-digit year.
Which means any two-digit year smaller than 26 is treated as 20xx and 27 would be treated as 1927. If you deal with any two-digit years that are supposed to be in a century other than 1900 or 2000 you need to provide some logic to deal with how to read the data. A brief example.
data example; input x :yymmdd.; format x date9.; datalines; 21-01-01 22-01-01 23-01-01 24-01-01 25-01-01 26-01-01 27-01-01 28-01-01 29-01-01 ;
BTW, this one of those examples of why not to use two-digit years for anything.
It's fixable but you have to be able to define rules that are consistent.
If some are in YYYY-MM-DD and some are in DD-MM-YY but they're always in those two forms that's easy enough to separate.
Since you're reading all in as character initially, look at the length of the variable. If it's 8 then it's the DDMMYY form, if it's 10 then it's the YYMMDD form.
However, you cannot open a CSV with Excel. Excel then has already 'imported' the data and done it's own conversion so you need to know what is in the raw CSV text file.
CSV are text files, it means comma separated values.
So once you confirm the rules that are consistent we can help you convert it appropriately.
@Primohunter wrote:
Well i hope this answers your question.
When i open my CSV file in excel, i could see that the dates prior to 1900 are all in GENERAL Form and they are all written in the form of 1899-10-23
and the dates after 1900 are all in DATES form (DD-MM-YY). i really dont know hot to fix this via SaS.. Any way to remove first all the numeric dates and change the character ones to numerics and then add the numeric ones? is there another way?
It's a public source then, just share the Kaggle link or the full file.
data WORK.RESULTS;
%let _EFIERR_ = 0;
/* set the ERROR detection macro variable */
infile '/home/fkhurshed/Demo1/results.csv' delimiter=',' TRUNCOVER DSD
lrecl=32767 firstobs=2;
informat date yymmdd10.;
informat home_team $50.;
informat away_team $50.;
informat home_score $8.;
informat away_score $8.;
informat tournament $50.;
informat city $50.;
informat country $50.;
informat neutral $5.;
format date yymmdd10.;
format home_team $8.;
format away_team $8.;
format home_score $8.;
format away_score $8.;
format tournament $50.;
format city $50.;
format country $50.;
format neutral $5.;
input date home_team $
away_team $
home_score away_score
tournament $
city $
country $
neutral $;
if _ERROR_ then
call symputx('_EFIERR_', 1);
/* set ERROR detection macro variable */
run;
data WORK.SHOOTOUT;
%let _EFIERR_ = 0;
/* set the ERROR detection macro variable */
infile '/home/fkhurshed/Demo1/shootouts.csv' delimiter=',' TRUNCOVER DSD
lrecl=32767 firstobs=2;
informat date yymmdd10.;
informat home_team $50.;
informat away_team $50.;
informat winner $50.;
format date yymmdd10.;
format home_team $50.;
format away_team $50.;
format winner $50.;
input date home_team $
away_team $
winner $;
if _ERROR_ then
call symputx('_EFIERR_', 1);
/* set ERROR detection macro variable */
run;
This worked to read the files correctly and have your dates as SAS dates.
This is a good example file to test the %CSV2DS() macro with.
The first thing you will notice is that you cannot use the ZIP file directly with PROC IMPORT.
filename csvzip zip 'c:\downloads\results.csv.zip' member='results.csv';
proc import datafile=csvzip dbms=csv out=import1 replace;
run;
That causes errors:
343 proc import datafile=csvzip dbms=csv out=import1 replace; 344 run; ERROR: Requested function is not supported. ERROR: Requested function is not supported. ERROR: Requested function is not supported. ERROR: Requested function is not supported. ERROR: Requested function is not supported. ERROR: Requested function is not supported. ERROR: Requested function is not supported. ERROR: Requested function is not supported. ERROR: Requested function is not supported. ERROR: Requested function is not supported. ERROR: Requested function is not supported. ERROR: Requested function is not supported. ERROR: Requested function is not supported. ERROR: Requested function is not supported. ERROR: Requested function is not supported. ERROR: Requested function is not supported. ERROR: Requested function is not supported. ERROR: Requested function is not supported. ERROR: Requested function is not supported. ERROR: Requested function is not supported. ERROR: Requested function is not supported. ERROR: Requested function is not supported. ERROR: Requested function is not supported. ERROR: Requested function is not supported. ERROR: Requested function is not supported. Unable to sample external file, no data in first 5 records. ERROR: Import unsuccessful. See SAS Log for details. NOTE: The SAS System stopped processing this step because of errors. NOTE: PROCEDURE IMPORT used (Total process time): real time 0.06 seconds cpu time 0.07 seconds
So you will have to unzip the file to use PROC IMPORT. So let's try again with the unzipped file.
filename csv 'c:\downloads\results.csv';
proc import datafile=csv dbms=csv out=import2 replace;
run;
Now at least it makes the dataset, but there are errors in the SAS log.
NOTE: Invalid data for home_score in line 43083 26-27. NOTE: Invalid data for away_score in line 43083 29-30. RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0 43083 2021-12-15,Tunisia,Egypt,NA,NA,Arab Cup,Doha,Qatar,TRUE 55 date=2021-12-15 home_team=Tunisia away_team=Egypt home_score=. away_score=. tournament=Arab Cup city=Doha country=Qatar neutral=TRUE _ERROR_=1 _N_=43082
So let's try again using the GUESSINGROWS=MAX option.
proc import datafile=csv dbms=csv out=import3 replace;
guessingrows=max;
run;
Finally we got something.
NOTE: WORK.IMPORT3 data set was successfully created. NOTE: The data set WORK.IMPORT3 has 43086 observations and 9 variables. NOTE: PROCEDURE IMPORT used (Total process time): real time 11.09 seconds cpu time 10.28 seconds
Now let's try %CSV2DS() on the zip file.
%csv2ds(csvzip,out=csv2ds1,replace=1);
Results:
NOTE: 43086 records were read from the infile CSVZIP. The minimum record length was 45. The maximum record length was 156. NOTE: The data set WORK.CSV2DS1 has 43086 observations and 9 variables. NOTE: DATA statement used (Total process time): real time 0.06 seconds cpu time 0.06 seconds NOTE: %INCLUDE (level 1) ending. NOTE: CSV2DS used (Total process time): 1.09200000762939 seconds
So one tenth of the time compared to PROC IMPORT and we didn't have to waste time (and disk space) unzipping the file.
Let's compare and see how different they are:
proc compare data=import3 compare=csv2ds1;
run;
Values match:
NOTE: No unequal values were found. All values compared are exactly equal.
But the dataset generated by PROC IMPORT has unneeded (and dangerous) formats/informats attached to all of the character variables.
Listing of Common Variables with Differing Attributes Variable Dataset Type Length Format Informat date WORK.IMPORT3 Num 8 YYMMDD10. YYMMDD10. WORK.CSV2DS1 Num 8 YYMMDD10. YYMMDD. home_team WORK.IMPORT3 Char 32 $32. $32. WORK.CSV2DS1 Char 32 away_team WORK.IMPORT3 Char 32 $32. $32. WORK.CSV2DS1 Char 32 home_score WORK.IMPORT3 Char 2 $2. $2. WORK.CSV2DS1 Char 2 away_score WORK.IMPORT3 Char 2 $2. $2. WORK.CSV2DS1 Char 2 tournament WORK.IMPORT3 Char 45 $45. $45. WORK.CSV2DS1 Char 45 city WORK.IMPORT3 Char 28 $28. $28. WORK.CSV2DS1 Char 28 country WORK.IMPORT3 Char 32 $32. $32. WORK.CSV2DS1 Char 32 neutral WORK.IMPORT3 Char 5 $5. $5. WORK.CSV2DS1 Char 5
Now let's compare the SAS code they generated.
Here is the data step generated by %CSV2DS(). 7 statements.
441 +data csv2ds1; 442 + infile CSVZIP dlm=',' dsd truncover firstobs=2 ; 443 + length date 8 home_team $32 away_team $32 home_score $2 away_score $2 444 + tournament $45 city $28 country $32 neutral $5 445 + ; 446 + informat date yymmdd. ; 447 + format date yymmdd10. ; 448 + input date -- neutral ; 449 +run;
Here is the code generated by PROC IMPORT.
404 data WORK.IMPORT3 ; 405 %let _EFIERR_ = 0; /* set the ERROR detection macro variable */ 406 infile CSV delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ; 407 informat date yymmdd10. ; 408 informat home_team $32. ; 409 informat away_team $32. ; 410 informat home_score $2. ; 411 informat away_score $2. ; 412 informat tournament $45. ; 413 informat city $28. ; 414 informat country $32. ; 415 informat neutral $5. ; 416 format date yymmdd10. ; 417 format home_team $32. ; 418 format away_team $32. ; 419 format home_score $2. ; 420 format away_score $2. ; 421 format tournament $45. ; 422 format city $28. ; 423 format country $32. ; 424 format neutral $5. ; 425 input 426 date 427 home_team $ 428 away_team $ 429 home_score $ 430 away_score $ 431 tournament $ 432 city $ 433 country $ 434 neutral $ 435 ; 436 if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */ 437 run;
Now let's look at why that second PROC IMPORT (without GUESSINGROWS=MAX) generated errors. The reason is that the file has NA in two score fields that are otherwise numeric, but there aren't any in the first few rows. Now it would be nice if we could read those fields as numbers and set the NA values to missing. One way to do that is to define a special informat that knows what to do with the NA strings. So something like this:
proc format;
invalue na (upcase default=32)
'NA'=._
other=[32.]
;
run;
Now how can we tell SAS to use that informat to read those two score columns as numeric? We could copy the code from the LOG and modify it. But with %CSV2DS() we can also give it a metadata file to override its guessing decisions about how to read the file. So let's make an overrides dataset. We can start by using the _TYPES_ dataset the macro produces to find those variables that have a lot of numeric values and also NA strings.
data overrides;
set _types_;
where max='NA' and .80 < divide(numeric,nonmiss);
keep varnum length informat ;
length='8';
informat='na.';
run;
Now we can re-run the macro passing in those overrides.
%csv2ds(csvzip
,out=csv2ds2
,replace=1
,overrides=overrides
);
Which generates and runs this data step instead.
459 +data csv2ds2; 460 + infile CSVZIP dlm=',' dsd truncover firstobs=2 ; 461 + length date 8 home_team $32 away_team $32 home_score 8 away_score 8 462 + tournament $45 city $28 country $32 neutral $5 463 + ; 464 + informat date yymmdd. home_score na. away_score na. ; 465 + format date yymmdd10. ; 466 + input date -- neutral ; 467 +run;
And now we have HOME_SCORE and AWAY_SCORE as numeric.
The MEANS Procedure Variable N Mean Std Dev Minimum Maximum ----------------------------------------------------------------------------------- date 43086 11638.06 8954.60 -31807.00 22632.00 home_score 43081 1.7432279 1.7539820 0 31.0000000 away_score 43081 1.1836076 1.4006205 0 21.0000000 -----------------------------------------------------------------------------------
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.