- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I was tasked to find a data set for an assignment. While trying to read in the csv, I wanted to take a look at it before I proceeded with my relatively simple assignment. Here is the code that I first used:
data basketball;
infile "C:\Users\wonha\Desktop\SMU Class Material\6308 SAS II and Databases\mens-professional-basketball\basketball_player_allstar.csv" dlm=","
firstobs=2 dsd;
input ID $ L_Name $ F_Name $ Season $ Conference $ League $ Games Minutes Points Off_Rebounds Def_Rebounds Rebounds Assists Steals Blocks Turnovers
Fouls FG_Attempt FG_Made FT_Attempt FT_Made Threes_Attempt Threes_Made;
proc print data=basketball (Firstobs=1 obs=5);
run;
Here is what appears in the results viewer:
I thought that I should then use an informat for the ID, first name, and last name variables, so I use the $12. option on each.
data basketball;
infile "C:\Users\wonha\Desktop\SMU Class Material\6308 SAS II and Databases\mens-professional-basketball\basketball_player_allstar.csv" dlm=","
firstobs=2 dsd;
input ID $12. L_Name $12. F_Name $12. Season $ Conference $ League $ Games Minutes Points Off_Rebounds Def_Rebounds Rebounds Assists Steals Blocks Turnovers
Fouls FG_Attempt FG_Made FT_Attempt FT_Made Threes_Attempt Threes_Made;
proc print data=basketball (Firstobs=1 obs=5);
run;
Which produced the second set of results: .
I took a look at my csv file in a text file and found that each variable and value were in double quotation marks. Here is a short portion of the txt file of the dataset:
"player_id","last_name","first_name","season_id","conference","league_id","games_played","minutes","points","o_rebounds","d_rebounds","rebounds","assists","steals","blocks","turnovers","personal_fouls","fg_attempted","fg_made","ft_attempted","ft_made","three_attempted","three_made"
"abdulka01","Abdul-Jabbar","Kareem","1978","West","NBA","1","28","11",,,"8","3",,,,,"12","5","2","1",,
"abdulka01","Abdul-Jabbar","Kareem","1969","East","NBA","1","18","10",,,"11","4",,,,,"8","4","2","2",,
"abdulka01","Abdul-Jabbar","Kareem","1988","West","NBA","1","13","4",,,"3","0",,,,,"6","1","2","2",,
"abdulka01","Abdul-Jabbar","Kareem","1987","West","NBA","1","14","10",,,"4","0",,,,,"9","4","2","2",,
"abdulka01","Abdul-Jabbar","Kareem","1986","West","NBA","1","27","10",,,"8","3",,,,,"9","4","2","2",,
"abdulka01","Abdul-Jabbar","Kareem","1985","West","NBA","1","32","21",,,"7","2",,,,,"15","9","4","3",,
"abdulka01","Abdul-Jabbar","Kareem","1984","West","NBA","1","23","11",,,"6","1",,,,,"10","5","2","1",,
"abdulka01","Abdul-Jabbar","Kareem","1983","West","NBA","1","37","25",,,"13","2",,,,,"19","11","4","3",,
"abdulka01","Abdul-Jabbar","Kareem","1982","West","NBA","1","32","20",,,"6","5",,,,,"12","9","3","2",,
"abdulka01","Abdul-Jabbar","Kareem","1980","West","NBA","1","23","15",,,"6","4",,,,,"9","6","3","3",,
"abdulka01","Abdul-Jabbar","Kareem","1979","West","NBA","1","30","17",,,"16","9",,,,,"17","6","6","5",,
"abdulka01","Abdul-Jabbar","Kareem","1976","West","NBA","1","23","21",,,"4","2",,,,,"14","8","6","5",,
"abdulka01","Abdul-Jabbar","Kareem","1975","West","NBA","1","36","22",,,"5","3",,,,,"16","9","4","4",,
"abdulka01","Abdul-Jabbar","Kareem","1974","West","NBA","1","19","7",,,"10","3",,,,,"10","3","2","1",,
"abdulka01","Abdul-Jabbar","Kareem","1973","West","NBA","1","23","14",,,"8","6",,,,,"11","7","0","0",,
"abdulka01","Abdul-Jabbar","Kareem","1972","West","NBA","1","98",,,,,,,,,,,,,,,
"abdulka01","Abdul-Jabbar","Kareem","1971","West","NBA","1","19","12",,,"7","2",,,,,"10","5","2","2",,
"abdulka01","Abdul-Jabbar","Kareem","1970","West","NBA","1","30","19",,,"14","1",,,,,"16","8","4","3",,
"abdulka01","Abdul-Jabbar","Kareem","1981","West","NBA","1","22","2","1","2","3","1","3","2","1","3","10","1","0","0","0","0"
"abdursh01","A-Rahim","Shareef","2001","East","NBA","1","21","9","1","5","6","0","0","0","0","2","4","4","0","0","1","1"
I have tried various informat and format options, but get about the same as picture above. Is there any way to remove all of the quotation marks while trying to read in the .csv file?
Thank you.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Replace
input ID $12. L_Name:$12. F_Name $12.
with
input ID :$12. L_Name :$12. F_Name :$12.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Replace
input ID $12. L_Name:$12. F_Name $12.
with
input ID :$12. L_Name :$12. F_Name :$12.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This is actually standard in a CSV, at least for text variables. I'm surprised the numerics are all in quotes but it can still be dealt with. The DSD option handles the quotes, you don't need do explicitly deal with it.
You should apply your informants ahead of the INPUT statement to get it to work correctly, rather than on the INPUT statement.
informat ID L_Name F_Name $12.;
input ..... ;
Give that a shot first. If that doesn't work, you may want to read everything in as a character and then convert to numeric.
@wgonzalez wrote:
I was tasked to find a data set for an assignment. While trying to read in the csv, I wanted to take a look at it before I proceeded with my relatively simple assignment. Here is the code that I first used:
data basketball;
infile "C:\Users\wonha\Desktop\SMU Class Material\6308 SAS II and Databases\mens-professional-basketball\basketball_player_allstar.csv" dlm=","
firstobs=2 dsd;
input ID $ L_Name $ F_Name $ Season $ Conference $ League $ Games Minutes Points Off_Rebounds Def_Rebounds Rebounds Assists Steals Blocks Turnovers
Fouls FG_Attempt FG_Made FT_Attempt FT_Made Threes_Attempt Threes_Made;
proc print data=basketball (Firstobs=1 obs=5);
run;
Here is what appears in the results viewer:
I thought that I should then use an informat for the ID, first name, and last name variables, so I use the $12. option on each.
data basketball;
infile "C:\Users\wonha\Desktop\SMU Class Material\6308 SAS II and Databases\mens-professional-basketball\basketball_player_allstar.csv" dlm=","
firstobs=2 dsd;
input ID $12. L_Name $12. F_Name $12. Season $ Conference $ League $ Games Minutes Points Off_Rebounds Def_Rebounds Rebounds Assists Steals Blocks Turnovers
Fouls FG_Attempt FG_Made FT_Attempt FT_Made Threes_Attempt Threes_Made;
proc print data=basketball (Firstobs=1 obs=5);
run;
Which produced the second set of results:
.
I took a look at my csv file in a text file and found that each variable and value were in double quotation marks. Here is a short portion of the txt file of the dataset:
"player_id","last_name","first_name","season_id","conference","league_id","games_played","minutes","points","o_rebounds","d_rebounds","rebounds","assists","steals","blocks","turnovers","personal_fouls","fg_attempted","fg_made","ft_attempted","ft_made","three_attempted","three_made"
"abdulka01","Abdul-Jabbar","Kareem","1978","West","NBA","1","28","11",,,"8","3",,,,,"12","5","2","1",,
"abdulka01","Abdul-Jabbar","Kareem","1969","East","NBA","1","18","10",,,"11","4",,,,,"8","4","2","2",,
"abdulka01","Abdul-Jabbar","Kareem","1988","West","NBA","1","13","4",,,"3","0",,,,,"6","1","2","2",,
"abdulka01","Abdul-Jabbar","Kareem","1987","West","NBA","1","14","10",,,"4","0",,,,,"9","4","2","2",,
"abdulka01","Abdul-Jabbar","Kareem","1986","West","NBA","1","27","10",,,"8","3",,,,,"9","4","2","2",,
"abdulka01","Abdul-Jabbar","Kareem","1985","West","NBA","1","32","21",,,"7","2",,,,,"15","9","4","3",,
"abdulka01","Abdul-Jabbar","Kareem","1984","West","NBA","1","23","11",,,"6","1",,,,,"10","5","2","1",,
"abdulka01","Abdul-Jabbar","Kareem","1983","West","NBA","1","37","25",,,"13","2",,,,,"19","11","4","3",,
"abdulka01","Abdul-Jabbar","Kareem","1982","West","NBA","1","32","20",,,"6","5",,,,,"12","9","3","2",,
"abdulka01","Abdul-Jabbar","Kareem","1980","West","NBA","1","23","15",,,"6","4",,,,,"9","6","3","3",,
"abdulka01","Abdul-Jabbar","Kareem","1979","West","NBA","1","30","17",,,"16","9",,,,,"17","6","6","5",,
"abdulka01","Abdul-Jabbar","Kareem","1976","West","NBA","1","23","21",,,"4","2",,,,,"14","8","6","5",,
"abdulka01","Abdul-Jabbar","Kareem","1975","West","NBA","1","36","22",,,"5","3",,,,,"16","9","4","4",,
"abdulka01","Abdul-Jabbar","Kareem","1974","West","NBA","1","19","7",,,"10","3",,,,,"10","3","2","1",,
"abdulka01","Abdul-Jabbar","Kareem","1973","West","NBA","1","23","14",,,"8","6",,,,,"11","7","0","0",,
"abdulka01","Abdul-Jabbar","Kareem","1972","West","NBA","1","98",,,,,,,,,,,,,,,
"abdulka01","Abdul-Jabbar","Kareem","1971","West","NBA","1","19","12",,,"7","2",,,,,"10","5","2","2",,
"abdulka01","Abdul-Jabbar","Kareem","1970","West","NBA","1","30","19",,,"14","1",,,,,"16","8","4","3",,
"abdulka01","Abdul-Jabbar","Kareem","1981","West","NBA","1","22","2","1","2","3","1","3","2","1","3","10","1","0","0","0","0"
"abdursh01","A-Rahim","Shareef","2001","East","NBA","1","21","9","1","5","6","0","0","0","0","2","4","4","0","0","1","1"
I have tried various informat and format options, but get about the same as picture above. Is there any way to remove all of the quotation marks while trying to read in the .csv file?
Thank you.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Two things.
The DSD option on the INFILE statement will automatically remove the optional quotes around values. It will also properly handle any null values.
Second when reading delimited data you need to use what SAS calls LIST mode input style instead of FORMATTED input style. So either totally remove the informats from the INPUT statement (or add the colon modifier in front of the informat specifications). If you slip into formatted input mode then you might accidentally read past the delimiter and mess up the parsing of the input line.
It is usually best to define you variables before you use them (otherwise SAS will guess how to define them based on how you first use them). For example you can just use a LENGTH statement to define the type and length of each variable. Numeric variables should be length 8 since SAS stores numbers using 8 byte floating point. Make the character variables long enough for the maximum number of character you expect them to hold.
infile "...." dlm="," firstobs=2 dsd truncover;
length ID $12 L_Name $12 F_Name $12 Season $12 Conference $12 League $12
Games Minutes Points Off_Rebounds Def_Rebounds Rebounds Assists Steals Blocks
Turnovers Fouls FG_Attempt FG_Made FT_Attempt FT_Made Threes_Attempt Threes_Made 8
;
input id -- Threes_Made ;