Hello,
i have a dataset with 60 variables. i wanna use in my code only 10-11 of them. But i cant input only those because of some missing values on a variable i m not gonna use. So i dropped the DATA want;
INFILE 'mypath.csv'; way of importing the data and then i moved to PROC IMPORT where i get a result without an error ONLY if use the guessingrows=max but then the variable Player Name gets to 385 length
and im afraid something else is also screwed up in the dataset.
My Code so far is
Any help doin this firstly with the INFILE method and then the import one?
This works:
proc format;
invalue nonecheck
'None' = .
other = [best32.]
;
run;
data WORK.CBPLAYERS ;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile '~/CollegeBasketballPlayers2022.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat player_name $19. ;
informat team $22. ;
informat conf $4. ;
informat GP best32. ;
informat Min_per best32. ;
informat Ortg best32. ;
informat usg best32. ;
informat eFG best32. ;
informat TS_per best32. ;
informat ORB_per best32. ;
informat DRB_per best32. ;
informat AST_per best32. ;
informat TO_per best32. ;
informat FTM best32. ;
informat FTA best32. ;
informat FT_per best32. ;
informat twoPM best32. ;
informat twoPA best32. ;
informat twoP_per best32. ;
informat TPM best32. ;
informat TPA best32. ;
informat TP_per best32. ;
informat blk_per best32. ;
informat stl_per best32. ;
informat ftr best32. ;
informat yr $2. ;
informat ht $6. ;
informat num nonecheck. ;
informat porpag best32. ;
informat adjoe best32. ;
informat pfr best32. ;
informat year best32. ;
informat pid best32. ;
informat type $3. ;
informat Rec_Rank best32. ;
informat ast_tov best32. ;
informat rimmade best32. ;
informat rimmade_rimmiss best32. ;
informat midmade best32. ;
informat midmade_midmiss best32. ;
informat rimmade__rimmade_rimmiss_ best32. ;
informat midmade__midmade_midmiss_ best32. ;
informat dunksmade best32. ;
informat dunksmiss_dunksmade best32. ;
informat dunksmade__dunksmade_dunksmiss_ best32. ;
informat pick $1. ;
informat drtg best32. ;
informat adrtg best32. ;
informat dporpag best32. ;
informat stops best32. ;
informat bpm best32. ;
informat obpm best32. ;
informat dbpm best32. ;
informat gbpm best32. ;
informat mp best32. ;
informat ogbpm best32. ;
informat dgbpm best32. ;
informat oreb best32. ;
informat dreb best32. ;
informat treb best32. ;
informat ast best32. ;
informat stl best32. ;
informat blk best32. ;
informat pts best32. ;
informat VAR65 $10. ;
informat VAR66 best32. ;
format player_name $19. ;
format team $22. ;
format conf $4. ;
format GP best12. ;
format Min_per best12. ;
format Ortg best12. ;
format usg best12. ;
format eFG best12. ;
format TS_per best12. ;
format ORB_per best12. ;
format DRB_per best12. ;
format AST_per best12. ;
format TO_per best12. ;
format FTM best12. ;
format FTA best12. ;
format FT_per best12. ;
format twoPM best12. ;
format twoPA best12. ;
format twoP_per best12. ;
format TPM best12. ;
format TPA best12. ;
format TP_per best12. ;
format blk_per best12. ;
format stl_per best12. ;
format ftr best12. ;
format yr $2. ;
format ht $6. ;
format num best12. ;
format porpag best12. ;
format adjoe best12. ;
format pfr best12. ;
format year best12. ;
format pid best12. ;
format type $3. ;
format Rec_Rank best12. ;
format ast_tov best12. ;
format rimmade best12. ;
format rimmade_rimmiss best12. ;
format midmade best12. ;
format midmade_midmiss best12. ;
format rimmade__rimmade_rimmiss_ best12. ;
format midmade__midmade_midmiss_ best12. ;
format dunksmade best12. ;
format dunksmiss_dunksmade best12. ;
format dunksmade__dunksmade_dunksmiss_ best12. ;
format pick $1. ;
format drtg best12. ;
format adrtg best12. ;
format dporpag best12. ;
format stops best12. ;
format bpm best12. ;
format obpm best12. ;
format dbpm best12. ;
format gbpm best12. ;
format mp best12. ;
format ogbpm best12. ;
format dgbpm best12. ;
format oreb best12. ;
format dreb best12. ;
format treb best12. ;
format ast best12. ;
format stl best12. ;
format blk best12. ;
format pts best12. ;
format VAR65 $10. ;
format VAR66 best12. ;
input
player_name $
team $
conf $
GP
Min_per
Ortg
usg
eFG
TS_per
ORB_per
DRB_per
AST_per
TO_per
FTM
FTA
FT_per
twoPM
twoPA
twoP_per
TPM
TPA
TP_per
blk_per
stl_per
ftr
yr $
ht $
num
porpag
adjoe
pfr
year
pid
type $
Rec_Rank
ast_tov
rimmade
rimmade_rimmiss
midmade
midmade_midmiss
rimmade__rimmade_rimmiss_
midmade__midmade_midmiss_
dunksmade
dunksmiss_dunksmade
dunksmade__dunksmade_dunksmiss_
pick $
drtg
adrtg
dporpag
stops
bpm
obpm
dbpm
gbpm
mp
ogbpm
dgbpm
oreb
dreb
treb
ast
stl
blk
pts
VAR65 $
VAR66
;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
What did I do?
You can now strip the code down to make it more readable (like combining the multiple INFORMAT and FORMAT statements to one each), or find a name for VAR65 and VAR66 if those are needed.
Hello
As we don't have your data we cannot reproduce the result.
Create a new variable say player_name2=TRIM(Player_name).
This should solve the issue.
Well here is my dataset.
I wanna use some of its stats
for a correlation analysis (PROC CANCORR) and i d like to fix my dataset in order to do so... The missing values ( . , NA, ) are messing with how the program reads the lines.
Im also attaching the csv file
This works:
proc format;
invalue nonecheck
'None' = .
other = [best32.]
;
run;
data WORK.CBPLAYERS ;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile '~/CollegeBasketballPlayers2022.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat player_name $19. ;
informat team $22. ;
informat conf $4. ;
informat GP best32. ;
informat Min_per best32. ;
informat Ortg best32. ;
informat usg best32. ;
informat eFG best32. ;
informat TS_per best32. ;
informat ORB_per best32. ;
informat DRB_per best32. ;
informat AST_per best32. ;
informat TO_per best32. ;
informat FTM best32. ;
informat FTA best32. ;
informat FT_per best32. ;
informat twoPM best32. ;
informat twoPA best32. ;
informat twoP_per best32. ;
informat TPM best32. ;
informat TPA best32. ;
informat TP_per best32. ;
informat blk_per best32. ;
informat stl_per best32. ;
informat ftr best32. ;
informat yr $2. ;
informat ht $6. ;
informat num nonecheck. ;
informat porpag best32. ;
informat adjoe best32. ;
informat pfr best32. ;
informat year best32. ;
informat pid best32. ;
informat type $3. ;
informat Rec_Rank best32. ;
informat ast_tov best32. ;
informat rimmade best32. ;
informat rimmade_rimmiss best32. ;
informat midmade best32. ;
informat midmade_midmiss best32. ;
informat rimmade__rimmade_rimmiss_ best32. ;
informat midmade__midmade_midmiss_ best32. ;
informat dunksmade best32. ;
informat dunksmiss_dunksmade best32. ;
informat dunksmade__dunksmade_dunksmiss_ best32. ;
informat pick $1. ;
informat drtg best32. ;
informat adrtg best32. ;
informat dporpag best32. ;
informat stops best32. ;
informat bpm best32. ;
informat obpm best32. ;
informat dbpm best32. ;
informat gbpm best32. ;
informat mp best32. ;
informat ogbpm best32. ;
informat dgbpm best32. ;
informat oreb best32. ;
informat dreb best32. ;
informat treb best32. ;
informat ast best32. ;
informat stl best32. ;
informat blk best32. ;
informat pts best32. ;
informat VAR65 $10. ;
informat VAR66 best32. ;
format player_name $19. ;
format team $22. ;
format conf $4. ;
format GP best12. ;
format Min_per best12. ;
format Ortg best12. ;
format usg best12. ;
format eFG best12. ;
format TS_per best12. ;
format ORB_per best12. ;
format DRB_per best12. ;
format AST_per best12. ;
format TO_per best12. ;
format FTM best12. ;
format FTA best12. ;
format FT_per best12. ;
format twoPM best12. ;
format twoPA best12. ;
format twoP_per best12. ;
format TPM best12. ;
format TPA best12. ;
format TP_per best12. ;
format blk_per best12. ;
format stl_per best12. ;
format ftr best12. ;
format yr $2. ;
format ht $6. ;
format num best12. ;
format porpag best12. ;
format adjoe best12. ;
format pfr best12. ;
format year best12. ;
format pid best12. ;
format type $3. ;
format Rec_Rank best12. ;
format ast_tov best12. ;
format rimmade best12. ;
format rimmade_rimmiss best12. ;
format midmade best12. ;
format midmade_midmiss best12. ;
format rimmade__rimmade_rimmiss_ best12. ;
format midmade__midmade_midmiss_ best12. ;
format dunksmade best12. ;
format dunksmiss_dunksmade best12. ;
format dunksmade__dunksmade_dunksmiss_ best12. ;
format pick $1. ;
format drtg best12. ;
format adrtg best12. ;
format dporpag best12. ;
format stops best12. ;
format bpm best12. ;
format obpm best12. ;
format dbpm best12. ;
format gbpm best12. ;
format mp best12. ;
format ogbpm best12. ;
format dgbpm best12. ;
format oreb best12. ;
format dreb best12. ;
format treb best12. ;
format ast best12. ;
format stl best12. ;
format blk best12. ;
format pts best12. ;
format VAR65 $10. ;
format VAR66 best12. ;
input
player_name $
team $
conf $
GP
Min_per
Ortg
usg
eFG
TS_per
ORB_per
DRB_per
AST_per
TO_per
FTM
FTA
FT_per
twoPM
twoPA
twoP_per
TPM
TPA
TP_per
blk_per
stl_per
ftr
yr $
ht $
num
porpag
adjoe
pfr
year
pid
type $
Rec_Rank
ast_tov
rimmade
rimmade_rimmiss
midmade
midmade_midmiss
rimmade__rimmade_rimmiss_
midmade__midmade_midmiss_
dunksmade
dunksmiss_dunksmade
dunksmade__dunksmade_dunksmiss_
pick $
drtg
adrtg
dporpag
stops
bpm
obpm
dbpm
gbpm
mp
ogbpm
dgbpm
oreb
dreb
treb
ast
stl
blk
pts
VAR65 $
VAR66
;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
What did I do?
You can now strip the code down to make it more readable (like combining the multiple INFORMAT and FORMAT statements to one each), or find a name for VAR65 and VAR66 if those are needed.
This is another good dataset to use with the %CSV2DS() macro
Download the macro from github and try it on your CSV file.
%csv2ds('C:\downloads\CollegeBasketballPlayers2022.csv'
,out=bb2022,replace=1
);
Seems to work well. Here is the data step code it generated.
740 +data bb2022; 741 + infile 'C:\downloads\CollegeBasketballPlayers2022.csv' dlm=',' dsd 742 + truncover firstobs=2 743 + ; 744 + length player_name $28 team $22 conf $4 GP 8 Min_per 8 Ortg 8 usg 8 745 + eFG 8 TS_per 8 ORB_per 8 DRB_per 8 AST_per 8 TO_per 8 FTM 8 FTA 8 746 + FT_per 8 twoPM 8 twoPA 8 twoP_per 8 TPM 8 TPA 8 TP_per 8 blk_per 8 747 + stl_per 8 ftr 8 yr $2 ht $6 num $4 porpag 8 adjoe 8 pfr 8 year 8 pid 8 748 + type $3 Rec_Rank 8 ast_tov 8 rimmade 8 rimmade_rimmiss 8 midmade 8 749 + midmade_midmiss 8 rimmade_rimmade_rimmiss 8 midmade_midmade_midmiss 8 750 + dunksmade 8 dunksmiss_dunksmade 8 dunksmade_dunksmade_dunksmiss 8 751 + pick $1 drtg 8 adrtg 8 dporpag 8 stops 8 bpm 8 obpm 8 dbpm 8 gbpm 8 752 + mp 8 ogbpm 8 dgbpm 8 oreb 8 dreb 8 treb 8 ast 8 stl 8 blk 8 pts 8 753 + VAR65 $10 VAR66 8 754 + ; 755 + label Rec_Rank='Rec Rank' ast_tov='ast/tov' 756 + rimmade_rimmiss='rimmade+rimmiss' midmade_midmiss='midmade+midmiss' 757 + rimmade_rimmade_rimmiss='rimmade/(rimmade+rimmiss)' 758 + midmade_midmade_midmiss='midmade/(midmade+midmiss)' 759 + dunksmiss_dunksmade='dunksmiss+dunksmade' 760 + dunksmade_dunksmade_dunksmiss='dunksmade/(dunksmade+dunksmiss)' 761 + ; 762 + input player_name -- VAR66 ; 763 +run;
A couple of things to notice right away is that many of the column headers are not good variable names. And two of the columns don't even have any column headers so the macro had make up names for them.
You mentioned that text values like NONE or NA were causing trouble. Let's use the _VALUES_ view that the macro creates to see where those appear.
proc freq data=_values_;
where upcase(short) in ('NA','NONE');
tables short*varnum/list;
run;
So it looks like None appears in two of the columns.
The FREQ Procedure Cumulative Cumulative short varnum Frequency Percent Frequency Percent -------------------------------------------------------------------- None 27 11 11.70 11 11.70 None 28 83 88.30 94 100.00
So let's look at what analysis the macro did for those variables in the _TYPES_ dataset.
proc print data=_types_;
where varnum in (27 28);
run;
Results
So the variable NUM looks like it has numbers as most of the non-blank values are integers.
But the variable HT looks messed up.
Looks like someone opened the CSV in EXCEL and EXCEL converted the strings with height in feet and inches into dates. So you probably have at least one BB player that is seven feet one inches tall.
To fix that you should go back to the original CSV file before it got messed up by EXCEL.
Just goes to show how commonly Excel corrupts data. It even made into something someone published on-line.
Here how you could fix HT and NUM using custom informats.
The next to last column looks like the players position. I have no idea what that last column is supposed to be. I just called it RANKING.
proc format;
invalue none (default=32 upcase) 'NONE'=. other=[32.];
invalue $ht
'None' =' '
'8-May' ='5- 8'
'9-May' ='5- 9'
'2-May' ='5- 2'
'6-May' ='5- 6'
'7-May' ='5- 7'
'10-May'='5-10'
'11-May'='5-11'
'Jun-00'='6- 0'
'1-Jun' ='6- 1'
'2-Jun' ='6- 2'
'3-Jun' ='6- 3'
'4-Jun' ='6- 4'
'5-Jun' ='6- 5'
'6-Jun' ='6- 6'
'7-Jun' ='6- 7'
'8-Jun' ='6- 8'
'9-Jun' ='6- 9'
'10-Jun'='6-10'
'11-Jun'='6-11'
'Jul-00'='7- 0'
'1-Jul' ='7- 1'
'2-Jul' ='7- 2'
'3-Jul' ='7- 3'
'4-Jul' ='7- 4'
;
run;
data bb2022;
infile 'c:\downloads\bb2022.zip' zip member='CollegeBasketballPlayers2022.csv'
dlm=',' dsd truncover firstobs=2
;
length player_name $28 team $22 conf $4 GP 8 Min_per 8 Ortg 8 usg 8
eFG 8 TS_per 8 ORB_per 8 DRB_per 8 AST_per 8 TO_per 8 FTM 8 FTA 8
FT_per 8 twoPM 8 twoPA 8 twoP_per 8 TPM 8 TPA 8 TP_per 8 blk_per 8
stl_per 8 ftr 8 yr $2 ht $6 num 8 porpag 8 adjoe 8 pfr 8 year 8 pid 8
type $3 Rec_Rank 8 ast_tov 8 rimmade 8 rimmade_rimmiss 8 midmade 8
midmade_midmiss 8 rimmade_rimmade_rimmiss 8 midmade_midmade_midmiss 8
dunksmade 8 dunksmiss_dunksmade 8 dunksmade_dunksmade_dunksmiss 8
pick $1 drtg 8 adrtg 8 dporpag 8 stops 8 bpm 8 obpm 8 dbpm 8 gbpm 8
mp 8 ogbpm 8 dgbpm 8 oreb 8 dreb 8 treb 8 ast 8 stl 8 blk 8 pts 8
position $10 ranking 8
;
informat ht $ht. num none.;
label
Rec_Rank='Rec Rank'
ast_tov='ast/tov'
rimmade_rimmiss='rimmade+rimmiss'
midmade_midmiss='midmade+midmiss'
rimmade_rimmade_rimmiss='rimmade/(rimmade+rimmiss)'
midmade_midmade_midmiss='midmade/(midmade+midmiss)'
dunksmiss_dunksmade='dunksmiss+dunksmade'
dunksmade_dunksmade_dunksmiss='dunksmade/(dunksmade+dunksmiss)'
;
input player_name -- ranking ;
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.