BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Primohunter
Obsidian | Level 7

Hello everyone,


I ve got a dataset with 40k rows. Under the variable Date i have numeric dates with the format DD-MM-YY but also i have dates prior to 1900 and they are char and the format is YYYY-MM-DD. I cant see how am i gonna work around that.


DATA firstload;
          INFILE '/mypath/results.csv' DLM=',' FIRSTOBS=2 DSD TRUNCOVER;
          INPUT Date:$10. Home_Team:$32. Away_Team:$32. Home_Score:2. Away_Score:2.                        Tournament:$45. City:$28. Country:$32. Neutral:$5.;
RUN;


DATA task2.results;
          RETAIN Date;
          SET firstload (RENAME=(Date=Old));
          Date = put(Old,YYMMDD10.);
          DROP Old;
          FORMAT Date YYMMDD10.;
RUN;


Here are my blocks.

Primohunter_0-1643749471115.png

 

and here is a screenshot of the results i am getting.

and here is the contents

Primohunter_1-1643749471444.png

 



WHEN i try the below

 

DATA firstload;
      INFILE 'mypath/results.csv' DLM=',' FIRSTOBS=2 DSD TRUNCOVER;
      INPUT Date:$10. Home_Team:$32. Away_Team:$32. Home_Score:2. Away_Score:2. Tournament:$45.
               City:$28. Country:$32. Neutral:$5. Date2:YYMMDD10.;
RUN;


DATA task2.results;
       RETAIN Date2;
       SET firstload;
       Date2 = input(Date, YYMMDD10.);
       FORMAT Date2 YYMMDD10.;
       DROP Date;
       RENAME Date2=Date;

RUN;

 

i get 

Primohunter_2-1643750043981.png


but the problem persists when im trying to PROC SORT By date as u can see the dates above...


ANY IDEAS?

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User
Just downloaded both the results and shootouts files. None have the dates as shown in the images. Redownload the files and use SAS directly to import the CSVs. Do not use Excel to manipulate the files or change anything. If you disagree, let me know what line numbers have the dates messed up.

If you read the discussion forums as well, other users have had the same issue and Excel was the cause as well.

View solution in original post

13 REPLIES 13
Reeza
Super User

It looks like you have dates that are both 4 digit and 2 digit years in the data set. 

And it looks like the 4 digit years are fine, the two digit years are not and everything is being read as character.

 

For the dates with the two digit year, what should be the first two digits, 1911 or 2011? 

How do you know which ones are 1911 versus 2011?

 


@Primohunter wrote:

Hello everyone,


I ve got a dataset with 40k rows. Under the variable Date i have numeric dates with the format DD-MM-YY but also i have dates prior to 1900 and they are char and the format is YYYY-MM-DD. I cant see how am i gonna work around that.


DATA firstload;
          INFILE '/mypath/results.csv' DLM=',' FIRSTOBS=2 DSD TRUNCOVER;
          INPUT Date:$10. Home_Team:$32. Away_Team:$32. Home_Score:2. Away_Score:2.                        Tournament:$45. City:$28. Country:$32. Neutral:$5.;
RUN;


DATA task2.results;
          RETAIN Date;
          SET firstload (RENAME=(Date=Old));
          Date = put(Old,YYMMDD10.);
          DROP Old;
          FORMAT Date YYMMDD10.;
RUN;


Here are my blocks.

Primohunter_0-1643749471115.png

 

and here is a screenshot of the results i am getting.

and here is the contents

Primohunter_1-1643749471444.png

 



WHEN i try the below

 

DATA firstload;
      INFILE 'mypath/results.csv' DLM=',' FIRSTOBS=2 DSD TRUNCOVER;
      INPUT Date:$10. Home_Team:$32. Away_Team:$32. Home_Score:2. Away_Score:2. Tournament:$45.
               City:$28. Country:$32. Neutral:$5. Date2:YYMMDD10.;
RUN;


DATA task2.results;
       RETAIN Date2;
       SET firstload;
       Date2 = input(Date, YYMMDD10.);
       FORMAT Date2 YYMMDD10.;
       DROP Date;
       RENAME Date2=Date;

RUN;

 

i get 

Primohunter_2-1643750043981.png


but the problem persists when im trying to PROC SORT By date as u can see the dates above...


ANY IDEAS?


 

Primohunter
Obsidian | Level 7

 

Well i hope this answers your question.

Primohunter_0-1643750931514.png

 

 

When i open my CSV file in excel, i could see that the dates prior to 1900 are all in GENERAL Form and they are all written in the form of 1899-10-23 
and the dates after 1900 are all in  DATES form (DD-MM-YY). i really dont know hot to fix this via SaS.. Any way to remove first all the numeric dates and change the character ones to numerics and then add the numeric ones? is there another way?

Tom
Super User Tom
Super User

Do you have an EXCEL file or a CSV file.  Note that a CSV is just a plain old text file and has nothing to do with EXCEL.

You will confuse yourself if you let EXCEL open a CSV on its own as it will change some of the values without telling you it is doing that. For example strings that it thinks represent numbers become numbers.  Strings that it thinks are dates become dates.

 

To look at a CSV file use a TEXT editor.  Like the SAS program editor. Or a SAS data step.

To see the first few lines of you text file use a data step like this:

DATA _null_;
  INFILE 'mypath/results.csv' obs=10;
  input;
  list;
run;

 

Primohunter
Obsidian | Level 7

The file is .CSV just as i have downloaded it from kaggle. I used excel just to check the format and help myself understand why i cant solve the problem. 

The sas text editor was showing me the same thing

 

 

Primohunter_0-1643751816823.png

Primohunter_1-1643751830002.png


Probably the column Date has character values written like YYYY-DD-MM that.   and Also dates with numeric values with the format DD-MM-YY...

ballardw
Super User

@Primohunter wrote:

The file is .CSV just as i have downloaded it from kaggle. I used excel just to check the format and help myself understand why i cant solve the problem. 

Excel will pick its own display and possibly CHANGE the values in CSV files depending on the contents. So using Excel to check anything other than column alignment is hazardous. If you SAVE the file from Excel you may well want to go back and find an original as it may have been corrupted. Dates are one of the things Excel can do very strange things with.

 

Plus Excel doesn't do very well with dates prior to 1900. SAS at least doesn't have many issues until get prior to about 1585 or so.

 

How 2-digit years are treated depends on your current setting of the YEARCUTOFF option in SAS. You can run this to check your setting:

proc options option=yearcutoff;
run;

Which will show something like this in the log:

 YEARCUTOFF=1926   Specifies the first year of a 100-year span that is used by date informats and
                   functions to read a two-digit year.

Which means any two-digit year smaller than 26 is treated as 20xx and 27 would be treated as 1927. If you deal with any two-digit years that are supposed to be in a century other than 1900 or 2000 you need to provide some logic to deal with how to read the data. A brief example.

data example;
   input x :yymmdd.;
   format x date9.;
datalines;
21-01-01
22-01-01
23-01-01
24-01-01
25-01-01
26-01-01
27-01-01
28-01-01
29-01-01
;

BTW, this one of those examples of why not to use two-digit years for anything.

 

Primohunter
Obsidian | Level 7
Thanks for the whole explanation. U are too kind!
Thanks again everyone!
Reeza
Super User

It's fixable but you have to be able to define rules that are consistent. 

 

If some are in YYYY-MM-DD and some are in DD-MM-YY but they're always in those two forms that's easy enough to separate. 

Since you're reading all in as character initially, look at the length of the variable. If it's 8 then it's the DDMMYY form, if it's 10 then it's the YYMMDD form. 

 

However, you cannot open a CSV with Excel. Excel then has already 'imported' the data and done it's own conversion so you need to know what is in the raw CSV text file. 

CSV are text files, it means comma separated values. 

 

So once you confirm the rules that are consistent we can help you convert it appropriately. 

 


@Primohunter wrote:

 

Well i hope this answers your question.

Primohunter_0-1643750931514.png

 

 

When i open my CSV file in excel, i could see that the dates prior to 1900 are all in GENERAL Form and they are all written in the form of 1899-10-23 
and the dates after 1900 are all in  DATES form (DD-MM-YY). i really dont know hot to fix this via SaS.. Any way to remove first all the numeric dates and change the character ones to numerics and then add the numeric ones? is there another way?


 

Reeza
Super User

It's a public source then, just share the Kaggle link or the full file. 

 

Reeza
Super User
Just downloaded both the results and shootouts files. None have the dates as shown in the images. Redownload the files and use SAS directly to import the CSVs. Do not use Excel to manipulate the files or change anything. If you disagree, let me know what line numbers have the dates messed up.

If you read the discussion forums as well, other users have had the same issue and Excel was the cause as well.
Primohunter
Obsidian | Level 7
thanks you. that seemed to be the problem. redownloaded the file changed the default program to open csv files to notepad and reuploaded it in SAS. it seems to be working properly now.
Thanks again for your time
Reeza
Super User
data WORK.RESULTS;
	%let _EFIERR_ = 0;

	/* set the ERROR detection macro variable */
	infile '/home/fkhurshed/Demo1/results.csv' delimiter=',' TRUNCOVER DSD 
		lrecl=32767 firstobs=2;
	informat date yymmdd10.;
	informat home_team $50.;
	informat away_team $50.;
	informat home_score $8.;
	informat away_score $8.;
	informat tournament $50.;
	informat city $50.;
	informat country $50.;
	informat neutral $5.;
	format date yymmdd10.;
	format home_team $8.;
	format away_team $8.;
	format home_score $8.;
	format away_score $8.;
	format tournament $50.;
	format city $50.;
	format country $50.;
	format neutral $5.;
	input date home_team  $
                         away_team  $
                        home_score away_score 
		tournament  $
                         city  $
                         country  $
                         neutral  $;

	if _ERROR_ then
		call symputx('_EFIERR_', 1);

	/* set ERROR detection macro variable */
run;

data WORK.SHOOTOUT;
	%let _EFIERR_ = 0;

	/* set the ERROR detection macro variable */
	infile '/home/fkhurshed/Demo1/shootouts.csv' delimiter=',' TRUNCOVER DSD 
		lrecl=32767 firstobs=2;
	informat date yymmdd10.;
	informat home_team $50.;
	informat away_team $50.;
	informat winner $50.;
	format date yymmdd10.;
	format home_team $50.;
	format away_team $50.;
	format winner $50.;
	input date home_team  $
                       away_team  $
                        winner  $;

	if _ERROR_ then
		call symputx('_EFIERR_', 1);

	/* set ERROR detection macro variable */
run;

This worked to read the files correctly and have your dates as SAS dates.

 

 

 

 

Tom
Super User Tom
Super User

This is a good example file to test the %CSV2DS() macro with.

The first thing you will notice is that you cannot use the ZIP file directly with PROC IMPORT.

filename csvzip zip 'c:\downloads\results.csv.zip' member='results.csv';
proc import datafile=csvzip dbms=csv out=import1 replace;
run;

That causes errors:

343    proc import datafile=csvzip dbms=csv out=import1 replace;
344    run;

ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
ERROR: Requested function is not supported.
Unable to sample external file, no data in first 5 records.
ERROR: Import unsuccessful.  See SAS Log for details.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE IMPORT used (Total process time):
      real time           0.06 seconds
      cpu time            0.07 seconds

So you will have to unzip the file to use PROC IMPORT.  So let's try again with the unzipped file.

filename csv 'c:\downloads\results.csv';
proc import datafile=csv dbms=csv out=import2 replace;
run;

Now at least it makes the dataset, but there are errors in the SAS log.

NOTE: Invalid data for home_score in line 43083 26-27.
NOTE: Invalid data for away_score in line 43083 29-30.
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
43083     2021-12-15,Tunisia,Egypt,NA,NA,Arab Cup,Doha,Qatar,TRUE 55
date=2021-12-15 home_team=Tunisia away_team=Egypt home_score=. away_score=. tournament=Arab Cup city=Doha country=Qatar neutral=TRUE
_ERROR_=1 _N_=43082

So let's try again using the GUESSINGROWS=MAX option.

proc import datafile=csv dbms=csv out=import3 replace;
  guessingrows=max;
run;

Finally we got something.

NOTE: WORK.IMPORT3 data set was successfully created.
NOTE: The data set WORK.IMPORT3 has 43086 observations and 9 variables.
NOTE: PROCEDURE IMPORT used (Total process time):
      real time           11.09 seconds
      cpu time            10.28 seconds

Now let's try %CSV2DS() on the zip file. 

%csv2ds(csvzip,out=csv2ds1,replace=1);

Results:

NOTE: 43086 records were read from the infile CSVZIP.
      The minimum record length was 45.
      The maximum record length was 156.
NOTE: The data set WORK.CSV2DS1 has 43086 observations and 9 variables.
NOTE: DATA statement used (Total process time):
      real time           0.06 seconds
      cpu time            0.06 seconds


NOTE: %INCLUDE (level 1) ending.
NOTE: CSV2DS used (Total process time): 1.09200000762939 seconds

So one tenth of the time compared to PROC IMPORT and we didn't have to waste time (and disk space) unzipping the file.

 

Let's compare and see how different they are:

proc compare data=import3 compare=csv2ds1;
run;

Values match:

NOTE: No unequal values were found. All values compared are exactly equal.

But the dataset generated by PROC IMPORT has unneeded (and dangerous) formats/informats attached to all of the character variables.

Listing of Common Variables with Differing Attributes

Variable    Dataset       Type  Length  Format     Informat

date        WORK.IMPORT3  Num        8  YYMMDD10.  YYMMDD10.
            WORK.CSV2DS1  Num        8  YYMMDD10.  YYMMDD.
home_team   WORK.IMPORT3  Char      32  $32.       $32.
            WORK.CSV2DS1  Char      32
away_team   WORK.IMPORT3  Char      32  $32.       $32.
            WORK.CSV2DS1  Char      32
home_score  WORK.IMPORT3  Char       2  $2.        $2.
            WORK.CSV2DS1  Char       2
away_score  WORK.IMPORT3  Char       2  $2.        $2.
            WORK.CSV2DS1  Char       2
tournament  WORK.IMPORT3  Char      45  $45.       $45.
            WORK.CSV2DS1  Char      45
city        WORK.IMPORT3  Char      28  $28.       $28.
            WORK.CSV2DS1  Char      28
country     WORK.IMPORT3  Char      32  $32.       $32.
            WORK.CSV2DS1  Char      32
neutral     WORK.IMPORT3  Char       5  $5.        $5.
            WORK.CSV2DS1  Char       5

Now let's compare the SAS code they generated.

Here is the data step generated by %CSV2DS(). 7 statements.

441   +data csv2ds1;
442   +  infile CSVZIP dlm=',' dsd truncover firstobs=2 ;
443   +  length date 8 home_team $32 away_team $32 home_score $2 away_score $2
444   +    tournament $45 city $28 country $32 neutral $5
445   +  ;
446   +  informat date yymmdd. ;
447   +  format date yymmdd10. ;
448   +  input date -- neutral ;
449   +run;

Here is the code generated by PROC IMPORT.

404        data WORK.IMPORT3    ;
405        %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
406        infile CSV delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
407           informat date yymmdd10. ;
408           informat home_team $32. ;
409           informat away_team $32. ;
410           informat home_score $2. ;
411           informat away_score $2. ;
412           informat tournament $45. ;
413           informat city $28. ;
414           informat country $32. ;
415           informat neutral $5. ;
416           format date yymmdd10. ;
417           format home_team $32. ;
418           format away_team $32. ;
419           format home_score $2. ;
420           format away_score $2. ;
421           format tournament $45. ;
422           format city $28. ;
423           format country $32. ;
424           format neutral $5. ;
425        input
426                    date
427                    home_team  $
428                    away_team  $
429                    home_score  $
430                    away_score  $
431                    tournament  $
432                    city  $
433                    country  $
434                    neutral  $
435        ;
436        if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */
437        run;

Now let's look at why that second PROC IMPORT (without GUESSINGROWS=MAX) generated errors.  The reason is that the file has NA in two score fields that are otherwise numeric, but there aren't any in the first few rows.   Now it would be nice if we could read those fields as numbers and set the NA values to missing.  One way to do that is to define a special informat that knows what to do with the NA strings.  So something like this:

proc format;
  invalue na (upcase default=32) 
    'NA'=._
    other=[32.]
  ;
run;

Now how can we tell SAS to use that informat to read those two score columns as numeric?  We could copy the code from the LOG and modify it.  But with %CSV2DS() we can also give it a metadata file to override its guessing decisions about how to read the file.  So let's make an overrides dataset.  We can start by using the _TYPES_ dataset the macro produces to find those variables that have a lot of numeric values and also NA strings.

data overrides;
  set _types_;
  where max='NA' and .80 < divide(numeric,nonmiss);
  keep varnum length informat ;
  length='8';
  informat='na.';
run;

Now we can re-run the macro passing in those overrides.

%csv2ds(csvzip
,out=csv2ds2
,replace=1
,overrides=overrides
);

Which generates and runs this data step instead.

459   +data csv2ds2;
460   +  infile CSVZIP dlm=',' dsd truncover firstobs=2 ;
461   +  length date 8 home_team $32 away_team $32 home_score 8 away_score 8
462   +    tournament $45 city $28 country $32 neutral $5
463   +  ;
464   +  informat date yymmdd. home_score na. away_score na. ;
465   +  format date yymmdd10. ;
466   +  input date -- neutral ;
467   +run;

And now we have HOME_SCORE and AWAY_SCORE as numeric.

The MEANS Procedure

Variable          N            Mean         Std Dev         Minimum         Maximum
-----------------------------------------------------------------------------------
date          43086        11638.06         8954.60       -31807.00        22632.00
home_score    43081       1.7432279       1.7539820               0      31.0000000
away_score    43081       1.1836076       1.4006205               0      21.0000000
-----------------------------------------------------------------------------------

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 2183 views
  • 7 likes
  • 4 in conversation