BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Primohunter
Obsidian | Level 7

Hello,

 

i have a dataset with 60 variables. i wanna use in my code only 10-11 of them. But i cant input only those because of some missing values on a variable i m not gonna use. So i dropped the DATA want;
                                              INFILE 'mypath.csv';          way of importing the data and then i moved to PROC IMPORT where i get a result without an error ONLY if use the guessingrows=max but then the variable Player Name gets to 385 length 

Primohunter_0-1643804382444.png

 and im afraid something else is also screwed up in the dataset.



My Code so far is

Primohunter_1-1643804483178.png



Any help doin this firstly with the INFILE method and then the import one?

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

This works:

proc format;
invalue nonecheck
  'None' = .
  other = [best32.]
;
run;

data WORK.CBPLAYERS    ;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile '~/CollegeBasketballPlayers2022.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
   informat player_name $19. ;
   informat team $22. ;
   informat conf $4. ;
   informat GP best32. ;
   informat Min_per best32. ;
   informat Ortg best32. ;
   informat usg best32. ;
   informat eFG best32. ;
   informat TS_per best32. ;
   informat ORB_per best32. ;
   informat DRB_per best32. ;
   informat AST_per best32. ;
   informat TO_per best32. ;
   informat FTM best32. ;
   informat FTA best32. ;
   informat FT_per best32. ;
   informat twoPM best32. ;
   informat twoPA best32. ;
   informat twoP_per best32. ;
   informat TPM best32. ;
   informat TPA best32. ;
   informat TP_per best32. ;
   informat blk_per best32. ;
   informat stl_per best32. ;
   informat ftr best32. ;
   informat yr $2. ;
   informat ht $6. ;
   informat num nonecheck. ;
   informat porpag best32. ;
   informat adjoe best32. ;
   informat pfr best32. ;
   informat year best32. ;
   informat pid best32. ;
   informat type $3. ;
   informat Rec_Rank best32. ;
   informat ast_tov best32. ;
   informat rimmade best32. ;
   informat rimmade_rimmiss best32. ;
   informat midmade best32. ;
   informat midmade_midmiss best32. ;
   informat rimmade__rimmade_rimmiss_ best32. ;
   informat midmade__midmade_midmiss_ best32. ;
   informat dunksmade best32. ;
   informat dunksmiss_dunksmade best32. ;
   informat dunksmade__dunksmade_dunksmiss_ best32. ;
   informat pick $1. ;
   informat drtg best32. ;
   informat adrtg best32. ;
   informat dporpag best32. ;
   informat stops best32. ;
   informat bpm best32. ;
   informat obpm best32. ;
   informat dbpm best32. ;
   informat gbpm best32. ;
   informat mp best32. ;
   informat ogbpm best32. ;
   informat dgbpm best32. ;
   informat oreb best32. ;
   informat dreb best32. ;
   informat treb best32. ;
   informat ast best32. ;
   informat stl best32. ;
   informat blk best32. ;
   informat pts best32. ;
   informat VAR65 $10. ;
   informat VAR66 best32. ;
   format player_name $19. ;
   format team $22. ;
   format conf $4. ;
   format GP best12. ;
   format Min_per best12. ;
   format Ortg best12. ;
   format usg best12. ;
   format eFG best12. ;
   format TS_per best12. ;
   format ORB_per best12. ;
   format DRB_per best12. ;
   format AST_per best12. ;
   format TO_per best12. ;
   format FTM best12. ;
   format FTA best12. ;
   format FT_per best12. ;
   format twoPM best12. ;
   format twoPA best12. ;
   format twoP_per best12. ;
   format TPM best12. ;
   format TPA best12. ;
   format TP_per best12. ;
   format blk_per best12. ;
   format stl_per best12. ;
   format ftr best12. ;
   format yr $2. ;
   format ht $6. ;
   format num best12. ;
   format porpag best12. ;
   format adjoe best12. ;
   format pfr best12. ;
   format year best12. ;
   format pid best12. ;
   format type $3. ;
   format Rec_Rank best12. ;
   format ast_tov best12. ;
   format rimmade best12. ;
   format rimmade_rimmiss best12. ;
   format midmade best12. ;
   format midmade_midmiss best12. ;
   format rimmade__rimmade_rimmiss_ best12. ;
   format midmade__midmade_midmiss_ best12. ;
   format dunksmade best12. ;
   format dunksmiss_dunksmade best12. ;
   format dunksmade__dunksmade_dunksmiss_ best12. ;
   format pick $1. ;
   format drtg best12. ;
   format adrtg best12. ;
   format dporpag best12. ;
   format stops best12. ;
   format bpm best12. ;
   format obpm best12. ;
   format dbpm best12. ;
   format gbpm best12. ;
   format mp best12. ;
   format ogbpm best12. ;
   format dgbpm best12. ;
   format oreb best12. ;
   format dreb best12. ;
   format treb best12. ;
   format ast best12. ;
   format stl best12. ;
   format blk best12. ;
   format pts best12. ;
   format VAR65 $10. ;
   format VAR66 best12. ;
input
            player_name  $
            team  $
            conf  $
            GP
            Min_per
            Ortg
            usg
            eFG
            TS_per
            ORB_per
            DRB_per
            AST_per
            TO_per
            FTM
            FTA
            FT_per
            twoPM
            twoPA
            twoP_per
            TPM
            TPA
            TP_per
            blk_per
            stl_per
            ftr
            yr  $
            ht  $
            num
            porpag
            adjoe
            pfr
            year
            pid
            type  $
            Rec_Rank
            ast_tov
            rimmade
            rimmade_rimmiss
            midmade
            midmade_midmiss
            rimmade__rimmade_rimmiss_
            midmade__midmade_midmiss_
            dunksmade
            dunksmiss_dunksmade
            dunksmade__dunksmade_dunksmiss_
            pick  $
            drtg
            adrtg
            dporpag
            stops
            bpm
            obpm
            dbpm
            gbpm
            mp
            ogbpm
            dgbpm
            oreb
            dreb
            treb
            ast
            stl
            blk
            pts
            VAR65  $
            VAR66
;
if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */
run;

What did I do?

  • run PROC IMPORT
  • copy the data step code from the log to the program editor, and removing the empty columns and log line numbers in front (use the Alt key while selecting with the mouse)
  • study the log to find out which values cause the read to fail, it is "None" for the variable num
  • create an informat that reads "None" as missing, everything else with BEST32.
  • use that informat for variable num

You can now strip the code down to make it more readable (like combining the multiple INFORMAT and FORMAT statements to one each), or find a name for VAR65 and VAR66 if those are needed.

View solution in original post

8 REPLIES 8
Sajid01
Meteorite | Level 14

Hello
As we don't have your data we cannot reproduce the result.
Create a new variable say player_name2=TRIM(Player_name).
This should solve the issue.

SASJedi
SAS Super FREQ
For more specific help, you'll need to include a sample of the CSV file (a few records at the least) and a description of what you WANT the result to look like...
Check out my Jedi SAS Tricks for SAS Users
Primohunter
Obsidian | Level 7

Well here is my dataset.

I wanna use some of its stats

Primohunter_0-1643809491687.png

 

for a correlation analysis (PROC CANCORR)  and i d like to fix my dataset in order to do so... The missing values ( . , NA, ) are messing with how the program reads the lines.

 

Im also attaching the csv file

Kurt_Bremser
Super User

This works:

proc format;
invalue nonecheck
  'None' = .
  other = [best32.]
;
run;

data WORK.CBPLAYERS    ;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile '~/CollegeBasketballPlayers2022.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
   informat player_name $19. ;
   informat team $22. ;
   informat conf $4. ;
   informat GP best32. ;
   informat Min_per best32. ;
   informat Ortg best32. ;
   informat usg best32. ;
   informat eFG best32. ;
   informat TS_per best32. ;
   informat ORB_per best32. ;
   informat DRB_per best32. ;
   informat AST_per best32. ;
   informat TO_per best32. ;
   informat FTM best32. ;
   informat FTA best32. ;
   informat FT_per best32. ;
   informat twoPM best32. ;
   informat twoPA best32. ;
   informat twoP_per best32. ;
   informat TPM best32. ;
   informat TPA best32. ;
   informat TP_per best32. ;
   informat blk_per best32. ;
   informat stl_per best32. ;
   informat ftr best32. ;
   informat yr $2. ;
   informat ht $6. ;
   informat num nonecheck. ;
   informat porpag best32. ;
   informat adjoe best32. ;
   informat pfr best32. ;
   informat year best32. ;
   informat pid best32. ;
   informat type $3. ;
   informat Rec_Rank best32. ;
   informat ast_tov best32. ;
   informat rimmade best32. ;
   informat rimmade_rimmiss best32. ;
   informat midmade best32. ;
   informat midmade_midmiss best32. ;
   informat rimmade__rimmade_rimmiss_ best32. ;
   informat midmade__midmade_midmiss_ best32. ;
   informat dunksmade best32. ;
   informat dunksmiss_dunksmade best32. ;
   informat dunksmade__dunksmade_dunksmiss_ best32. ;
   informat pick $1. ;
   informat drtg best32. ;
   informat adrtg best32. ;
   informat dporpag best32. ;
   informat stops best32. ;
   informat bpm best32. ;
   informat obpm best32. ;
   informat dbpm best32. ;
   informat gbpm best32. ;
   informat mp best32. ;
   informat ogbpm best32. ;
   informat dgbpm best32. ;
   informat oreb best32. ;
   informat dreb best32. ;
   informat treb best32. ;
   informat ast best32. ;
   informat stl best32. ;
   informat blk best32. ;
   informat pts best32. ;
   informat VAR65 $10. ;
   informat VAR66 best32. ;
   format player_name $19. ;
   format team $22. ;
   format conf $4. ;
   format GP best12. ;
   format Min_per best12. ;
   format Ortg best12. ;
   format usg best12. ;
   format eFG best12. ;
   format TS_per best12. ;
   format ORB_per best12. ;
   format DRB_per best12. ;
   format AST_per best12. ;
   format TO_per best12. ;
   format FTM best12. ;
   format FTA best12. ;
   format FT_per best12. ;
   format twoPM best12. ;
   format twoPA best12. ;
   format twoP_per best12. ;
   format TPM best12. ;
   format TPA best12. ;
   format TP_per best12. ;
   format blk_per best12. ;
   format stl_per best12. ;
   format ftr best12. ;
   format yr $2. ;
   format ht $6. ;
   format num best12. ;
   format porpag best12. ;
   format adjoe best12. ;
   format pfr best12. ;
   format year best12. ;
   format pid best12. ;
   format type $3. ;
   format Rec_Rank best12. ;
   format ast_tov best12. ;
   format rimmade best12. ;
   format rimmade_rimmiss best12. ;
   format midmade best12. ;
   format midmade_midmiss best12. ;
   format rimmade__rimmade_rimmiss_ best12. ;
   format midmade__midmade_midmiss_ best12. ;
   format dunksmade best12. ;
   format dunksmiss_dunksmade best12. ;
   format dunksmade__dunksmade_dunksmiss_ best12. ;
   format pick $1. ;
   format drtg best12. ;
   format adrtg best12. ;
   format dporpag best12. ;
   format stops best12. ;
   format bpm best12. ;
   format obpm best12. ;
   format dbpm best12. ;
   format gbpm best12. ;
   format mp best12. ;
   format ogbpm best12. ;
   format dgbpm best12. ;
   format oreb best12. ;
   format dreb best12. ;
   format treb best12. ;
   format ast best12. ;
   format stl best12. ;
   format blk best12. ;
   format pts best12. ;
   format VAR65 $10. ;
   format VAR66 best12. ;
input
            player_name  $
            team  $
            conf  $
            GP
            Min_per
            Ortg
            usg
            eFG
            TS_per
            ORB_per
            DRB_per
            AST_per
            TO_per
            FTM
            FTA
            FT_per
            twoPM
            twoPA
            twoP_per
            TPM
            TPA
            TP_per
            blk_per
            stl_per
            ftr
            yr  $
            ht  $
            num
            porpag
            adjoe
            pfr
            year
            pid
            type  $
            Rec_Rank
            ast_tov
            rimmade
            rimmade_rimmiss
            midmade
            midmade_midmiss
            rimmade__rimmade_rimmiss_
            midmade__midmade_midmiss_
            dunksmade
            dunksmiss_dunksmade
            dunksmade__dunksmade_dunksmiss_
            pick  $
            drtg
            adrtg
            dporpag
            stops
            bpm
            obpm
            dbpm
            gbpm
            mp
            ogbpm
            dgbpm
            oreb
            dreb
            treb
            ast
            stl
            blk
            pts
            VAR65  $
            VAR66
;
if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */
run;

What did I do?

  • run PROC IMPORT
  • copy the data step code from the log to the program editor, and removing the empty columns and log line numbers in front (use the Alt key while selecting with the mouse)
  • study the log to find out which values cause the read to fail, it is "None" for the variable num
  • create an informat that reads "None" as missing, everything else with BEST32.
  • use that informat for variable num

You can now strip the code down to make it more readable (like combining the multiple INFORMAT and FORMAT statements to one each), or find a name for VAR65 and VAR66 if those are needed.

Tom
Super User Tom
Super User

This is another good dataset to use with the %CSV2DS() macro

Download the macro from github and try it on your CSV file.

%csv2ds('C:\downloads\CollegeBasketballPlayers2022.csv'
,out=bb2022,replace=1
);

Seems to work well.  Here is the data step code it generated.

740   +data bb2022;
741   +  infile 'C:\downloads\CollegeBasketballPlayers2022.csv' dlm=',' dsd
742   +    truncover firstobs=2
743   +  ;
744   +  length player_name $28 team $22 conf $4 GP 8 Min_per 8 Ortg 8 usg 8
745   +    eFG 8 TS_per 8 ORB_per 8 DRB_per 8 AST_per 8 TO_per 8 FTM 8 FTA 8
746   +    FT_per 8 twoPM 8 twoPA 8 twoP_per 8 TPM 8 TPA 8 TP_per 8 blk_per 8
747   +    stl_per 8 ftr 8 yr $2 ht $6 num $4 porpag 8 adjoe 8 pfr 8 year 8 pid 8
748   +    type $3 Rec_Rank 8 ast_tov 8 rimmade 8 rimmade_rimmiss 8 midmade 8
749   +    midmade_midmiss 8 rimmade_rimmade_rimmiss 8 midmade_midmade_midmiss 8
750   +    dunksmade 8 dunksmiss_dunksmade 8 dunksmade_dunksmade_dunksmiss 8
751   +    pick $1 drtg 8 adrtg 8 dporpag 8 stops 8 bpm 8 obpm 8 dbpm 8 gbpm 8
752   +    mp 8 ogbpm 8 dgbpm 8 oreb 8 dreb 8 treb 8 ast 8 stl 8 blk 8 pts 8
753   +    VAR65 $10 VAR66 8
754   +  ;
755   +  label Rec_Rank='Rec Rank' ast_tov='ast/tov'
756   +    rimmade_rimmiss='rimmade+rimmiss' midmade_midmiss='midmade+midmiss'
757   +    rimmade_rimmade_rimmiss='rimmade/(rimmade+rimmiss)'
758   +    midmade_midmade_midmiss='midmade/(midmade+midmiss)'
759   +    dunksmiss_dunksmade='dunksmiss+dunksmade'
760   +    dunksmade_dunksmade_dunksmiss='dunksmade/(dunksmade+dunksmiss)'
761   +  ;
762   +  input player_name -- VAR66 ;
763   +run;

A couple of things to notice right away is that many of the column headers are not good variable names.  And two of the columns don't even have any column headers so the macro had make up names for them.

 

You mentioned that text values like NONE or NA were causing trouble.  Let's use the _VALUES_ view that the macro creates to see where those appear.

proc freq data=_values_;
  where upcase(short) in ('NA','NONE');
  tables short*varnum/list;
run;

So it looks like None appears in two of the columns.

The FREQ Procedure

                                            Cumulative    Cumulative
short    varnum    Frequency     Percent     Frequency      Percent
--------------------------------------------------------------------
None         27          11       11.70            11        11.70
None         28          83       88.30            94       100.00

So let's look at what analysis the macro did for those variables in the _TYPES_ dataset.

proc print data=_types_;
  where varnum in (27 28);
run;

Results

Screenshot 2022-02-02 104202.jpg

So the variable NUM looks like it has numbers as most of the non-blank values are integers.

But the variable HT looks messed up.

Looks like someone opened the CSV in EXCEL and EXCEL converted the strings with height in feet and inches into dates. So you probably have at least one BB player that is seven feet one inches tall.

To fix that you should go back to the original CSV file before it got messed up by EXCEL.

Primohunter
Obsidian | Level 7
never opened again the csv with the excel. in fact the default program to open csv file is notepad.
Tom
Super User Tom
Super User

Just goes to show how commonly Excel corrupts data. It even made into something someone published on-line.

Here how you could fix HT and NUM using custom informats.

The next to last column looks like the players position. I have no idea what that last column is supposed to be. I just called it RANKING.

proc format;
  invalue none (default=32 upcase) 'NONE'=. other=[32.];
  invalue $ht 
    'None'  =' '
    '8-May' ='5- 8'
    '9-May' ='5- 9'
    '2-May' ='5- 2'
    '6-May' ='5- 6'
    '7-May' ='5- 7'
    '10-May'='5-10'
    '11-May'='5-11'
    'Jun-00'='6- 0'
    '1-Jun' ='6- 1'
    '2-Jun' ='6- 2'
    '3-Jun' ='6- 3'
    '4-Jun' ='6- 4'
    '5-Jun' ='6- 5'
    '6-Jun' ='6- 6'
    '7-Jun' ='6- 7'
    '8-Jun' ='6- 8'
    '9-Jun' ='6- 9'
    '10-Jun'='6-10'
    '11-Jun'='6-11'
    'Jul-00'='7- 0'
    '1-Jul' ='7- 1'
    '2-Jul' ='7- 2'
    '3-Jul' ='7- 3'
    '4-Jul' ='7- 4'
  ;
run;

data bb2022;
  infile 'c:\downloads\bb2022.zip' zip member='CollegeBasketballPlayers2022.csv'
    dlm=',' dsd truncover firstobs=2
  ;
  length player_name $28 team $22 conf $4 GP 8 Min_per 8 Ortg 8 usg 8
    eFG 8 TS_per 8 ORB_per 8 DRB_per 8 AST_per 8 TO_per 8 FTM 8 FTA 8
    FT_per 8 twoPM 8 twoPA 8 twoP_per 8 TPM 8 TPA 8 TP_per 8 blk_per 8
    stl_per 8 ftr 8 yr $2 ht $6 num 8 porpag 8 adjoe 8 pfr 8 year 8 pid 8
    type $3 Rec_Rank 8 ast_tov 8 rimmade 8 rimmade_rimmiss 8 midmade 8
    midmade_midmiss 8 rimmade_rimmade_rimmiss 8 midmade_midmade_midmiss 8
    dunksmade 8 dunksmiss_dunksmade 8 dunksmade_dunksmade_dunksmiss 8
    pick $1 drtg 8 adrtg 8 dporpag 8 stops 8 bpm 8 obpm 8 dbpm 8 gbpm 8
    mp 8 ogbpm 8 dgbpm 8 oreb 8 dreb 8 treb 8 ast 8 stl 8 blk 8 pts 8
    position $10 ranking 8
  ;
  informat ht $ht. num none.;
  label
    Rec_Rank='Rec Rank' 
    ast_tov='ast/tov'
    rimmade_rimmiss='rimmade+rimmiss' 
    midmade_midmiss='midmade+midmiss'
    rimmade_rimmade_rimmiss='rimmade/(rimmade+rimmiss)'
    midmade_midmade_midmiss='midmade/(midmade+midmiss)'
    dunksmiss_dunksmade='dunksmiss+dunksmade'
    dunksmade_dunksmade_dunksmiss='dunksmade/(dunksmade+dunksmiss)'
  ;
  input player_name -- ranking ;
run;

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 8 replies
  • 710 views
  • 7 likes
  • 5 in conversation