- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have a dataset I am working on for a class project in which I am looking for a character variable that is 50 in length. Here is the code:
DATA CoImpt.Race_Ethnicity; INFILE "&CourseRoot/CDPHE Study/Data/1_Source/ct_race_ethn.csv" FIRSTOBS=2 DSD MISSOVER; INPUT Census_tract :$50. Population_density N_White Pctn_White N_Africanamerican Pctn_Africanamerican N_AIAN Pctn_AIAN N_Asian Pctn_Asian N_NH_OPI Pctn_NH_OPI N_Other Pctn_Other N_Hispanic_Latino Pctn_Hispanic_Latino N_NHL Pctn_NHL a b ; PROC CONTENTS DATA=CoImpt.Race_Ethnicity; RUN;
When I run proc contents, I only get 48 length. My data looks like this:
"Census Tract 9790, Fremont County, Colorado",3.4,3366.00,97.00,4.00,0.10,19.00,0.50,115.00,3.30,0.00,0.00,50.00,1.40,157.00,4.50,3176.00,91.50,295.00,8.50
And the quotation marks do not show up in the output data. Is SAS excluding them? How do I get SAS to include them? End goal is variable census_tract with a length of 50. Thanks.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
First thing: End your data step with a RUN; statement.
Your code should create the variable with a length of 50. Are you sure you didn't have some error in your code and Proc Contents lists a table version that you created in an earlier run with different code?
Below sample code based on what you've shared creates the variable with a length of 50. That SAS removes the quotes is imho a good thing as that's just how in a .csv strings get stored so that you can have a comma within such a string.
/* create csv file under WORK path */
%let csv_file=%sysfunc(pathname(work))/ct_race_ethn.csv;
data _null_;
file "&csv_file";
infile datalines;
input;
put _infile_;
datalines;
header header header
"Census Tract 9790, Fremont County, Colorado",3.4,3366.00,97.00,4.00,0.10,19.00,0.50,115.00,3.30,0.00,0.00,50.00,1.40,157.00,4.50,3176.00,91.50,295.00,8.50
;
/* read csv into SAS dataset */
data work.Race_Ethnicity;
infile "&csv_file" firstobs=2 dsd missover;
input
Census_tract :$50.
Population_density
;
run;
proc contents data=work.race_ethnicity;
run;
proc print data=work.race_ethnicity;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
First thing: End your data step with a RUN; statement.
Your code should create the variable with a length of 50. Are you sure you didn't have some error in your code and Proc Contents lists a table version that you created in an earlier run with different code?
Below sample code based on what you've shared creates the variable with a length of 50. That SAS removes the quotes is imho a good thing as that's just how in a .csv strings get stored so that you can have a comma within such a string.
/* create csv file under WORK path */
%let csv_file=%sysfunc(pathname(work))/ct_race_ethn.csv;
data _null_;
file "&csv_file";
infile datalines;
input;
put _infile_;
datalines;
header header header
"Census Tract 9790, Fremont County, Colorado",3.4,3366.00,97.00,4.00,0.10,19.00,0.50,115.00,3.30,0.00,0.00,50.00,1.40,157.00,4.50,3176.00,91.50,295.00,8.50
;
/* read csv into SAS dataset */
data work.Race_Ethnicity;
infile "&csv_file" firstobs=2 dsd missover;
input
Census_tract :$50.
Population_density
;
run;
proc contents data=work.race_ethnicity;
run;
proc print data=work.race_ethnicity;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I closed SAS and reopened it, and the code ran as it was supposed to. Not exactly sure what happened, but thank you for guiding me in the right direction!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Is there a way to do this without using data lines? I have well over 1,000 lines of data.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Looks like SAS is doing exactly what you asked. And you asked it to do the right thing.
Since the first place you reference Census_tract is in the INPUT statement SAS will use the WIDTH of the INFORMAT used in the INPUT to help it GUESS what LENGTH to define the variable.
Since you use the DSD option in the INFILE statement the quotes that had to be added around the value
Census Tract 9790, Fremont County, Colorado
so that the line can be properly parsed (because the value contains the delimiter character) will be removed so they do not accidentally become part of the value.
The LENGTH() function returns the length of the string excluding the trailing spaces that are added to pad the value to the full storage length.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Why did you use $50. informat if you wanted the variable to defined as length $48?
Did you let PROC IMPORT guess how to read the file? PROC IMPORT will overestimate the length needed when any of the values have had quotes added to protect embedded delimiters or quotes.
Let's run an experiment so we can see the issue. Let's make a CSV file and add quotes around one of the variables. Then let PROC IMPORT try to GUESS how to read it.
filename csv temp;
data _null_;
file csv dsd ;
set sashelp.class(obs=3);
if _n_=1 then put 'Name,Age,Sex';
put name ~ age sex;
run;
proc import file=csv out=test replace dbms=csv;
run;
So the CSV file look like this:
RULE: ----+----1----+----2----+----3----+----4- 1 Name,Age,Sex 12 2 "Alfred",14,M 13 3 "Alice",13,F 12 4 "Barbara",13,F 14
The longest value of Name is Barbara which is only 7 characters long. But with the quotes it takes 9 bytes in the line of text.
And the code generated by PROC IMPORT sets the length of the variable $9 instead of $7.
153 proc import file=csv out=test replace dbms=csv; 154 run; 155 /********************************************************************** 156 * PRODUCT: SAS 157 * VERSION: 9.4 158 * CREATOR: External File Interface 159 * DATE: 26SEP23 160 * DESC: Generated SAS Datastep Code 161 * TEMPLATE SOURCE: (None Specified.) 162 ***********************************************************************/ 163 data WORK.TEST ; 164 %let _EFIERR_ = 0; /* set the ERROR detection macro variable */ 165 infile CSV delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ; 166 informat Name $9. ; 167 informat Age best32. ; 168 informat Sex $1. ; 169 format Name $9. ; 170 format Age best12. ; 171 format Sex $1. ; 172 input 173 Name $ 174 Age 175 Sex $ 176 ; 177 if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */ 178 run; NOTE: The infile CSV is: (system-specific pathname), (system-specific file attributes) NOTE: 3 records were read from the infile (system-specific pathname). The minimum record length was 12. The maximum record length was 14. NOTE: The data set WORK.TEST has 3 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds 3 rows created in WORK.TEST from CSV. NOTE: WORK.TEST data set was successfully created. NOTE: The data set WORK.TEST has 3 observations and 3 variables. NOTE: PROCEDURE IMPORT used (Total process time): real time 0.05 seconds cpu time 0.03 seconds
If you want a tool that will not overestimate the length needed for character variable use %CSVDS() instead of PROC IMPORT.
filename url url "https://raw.githubusercontent.com/sasutils/macros/master/csv2ds.sas";
%include url;
filename url url "https://raw.githubusercontent.com/sasutils/macros/master/parmv.sas";
%include url;
filename url;
%csv2ds(csv,replace=Y);
Generated code:
NOTE: %INCLUDE (level 1) file _CODE_ is (system-specific pathname). 1029 +data fromcsv; 1030 + infile CSV dlm=',' dsd truncover firstobs=2 ; 1031 + length Name $7 Age 8 Sex $1 ; 1032 + input Name -- Sex ; 1033 +run; NOTE: The infile CSV is: (system-specific pathname), (system-specific file attributes) NOTE: 3 records were read from the infile (system-specific pathname). The minimum record length was 12. The maximum record length was 14. NOTE: The data set WORK.FROMCSV has 3 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.00 seconds