I have a large CSV file with a lot of variables (100+). I have read them in with a data step, but all of the missing values are reading in as empty strings.
For example:
proc means data=mydata nmiss;
run;
returns 0 for every variable, which is not the case. How can I read in these empty strings (which may be 1 or more spaces) in as missing in the data step for every variable? I have tried a few things, but nothing even comes close enough to mention. I have seen options to replace missing values with strings, but no documentation to go the other way. Do I need to address each variable individually, or is there a quicker way?
Run the following and post the log and output please.
*Formats for missing for character and numeric variables;
proc format;
value $ missfmt ' '="Missing"
other="Not Missing"
;
value nmissfmt . ="Missing"
other="Not Missing"
;
run;
*Run frequency for tables;
ods select none;
ods table onewayfreqs=temp;
proc freq data=capstone.houses;
format _character_ $missfmt.;
format _numeric_ nmissfmt.;
table _all_;
run;
ods select all;
*Format output;
data want;
length variable $32. variable_value $50.;
set temp;
Variable=scan(table, 2);
Variable_Value=strip(trim(vvaluex(variable)));
keep variable variable_value frequency percent cum:;
label variable='Variable'
variable_value='Variable Value';
run;
*Display;
proc print data=want label;
run;
Your probably need to show some of the lines of the CSV file (as TEXT, not photos and definitely NOT after it has been manipulated by Excel or some other spreadsheet type program). Also show the data step you used to read the file. Make sure to use the Insert Code or Insert SAS Code button to get a pop-up window to paste the lines of data or code, respectively.
Your statement seems confused. If the value in the CSV file for a field is all spaces (or nothing at all) then a normal input statement will treat that as missing, whether you are reading it into a numeric or character variable. Since SAS stores character variables as fixed length they are always padded with spaces. But SAS code will treat a character variable that is all spaces as missing.
Note that PROC MEANS will only work on numeric variables (it is impossible to find the mean of a string).
You can use the NLEVELS option of PROC FREQ to see how many distinct values (including missing) .
proc freq data=mydata nlevels;
tables _all_ / noprint;
run;
@AForrest wrote:
I have a large CSV file with a lot of variables (100+). I have read them in with a data step, but all of the missing values are reading in as empty strings.
[removed....] Do I need to address each variable individually, or is there a quicker way?
Unfortunately this means that the data was not imported in correctly so you need to fix the data step. Start there. Post your code and examples of the issue and we can help you fix the code. Rather than fix it after the fact, fix it at the source.
Here is my first data step to read in the CSV file.
data library.houses; infile "/folders/myfolders/capstone/EXTR_ResBldg.csv" dsd delimiter="," missover firstobs=2; input Major :6. Minor :4. BldgNum:3. LivingUnits :1. Address :$60. HouseNum :$5. Fraction :$3. DirPrefix :$2. StName :$25. StType :$6. DirSuffix :$2. ZipCode :$10. Stories :3. BldgGrade :2. BldgGradeVar :2. SqFtFirstFl :5. SqFtHalfFl :5. SqFtSecondFl :5. SqFtUpperFl :5. SqFtUnfinishedFull :5. SqFtUnfinishedHalf :5. SqFtTotalLiving :5. SqFtTotalBsmt :5. SqFtFinishedBsmt :5. FinBsmtGrade :2. SqFtGarageBsmt :5. SqFtGarageAtt :5. DaylightBsmt :$1. SqFtOpenPorch :5. SqFtEnclPorch :5. SqFtDeck :5. HeatSystem :2. HeatSource :2. PctBrickStone :3. ViewUtil :$1. Bedrooms :2. HalfBath :2. ThreeQtrBath :2. FullBath :2. FirePlSingle :2. FireplMult :2. FireplFreestanding :2. FireplAddl :2. YearBuilt :$4. YearRenovated :$4. PctComplete :3. Obsolescence :3. PctNetCond :3. Condition :1. AddlCost :6.; run;
There are technically three CSV files that I am joining together later in the process, but the steps to read in each one are essentially the same. I am reading in the CSV files as is to the library, and then in a separate file, I use different data steps to bring the dataset into WORK. I did this to avoid extra processing time because each CSV file is so large and it only needs to be done once. The second data steps are relatively straightforward. Here is a quick snip of some examples of fully empty cells that are not reading as empty from the first data step. .
Can you show the lines of the CSV file that correspond to the photograph of whatever (spreadsheet? dataset viewer?) that you posted.
Also show the lines from the SAS log for the data step you posted. As text, not another photograph.
From the photograph I am assuming that your issue is not about missing values at all, but about proper alignment in reading the data. It looks in some lines the values you expected to appear in say the 10th variable are being read into the 8th or 9th variable instead?
The notes from the SAS log will help tell if the issue is embedded line breaks. So instead of one line with 100 values you have two lines. One where there are only say 70 values and another where the first value 70 values are just not there.
Sure! Here is the log:
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK; 72 73 data capstone.houses; 74 infile "/folders/myfolders/capstone/EXTR_ResBldg.csv" dsd delimiter="," 75 missover firstobs=2; 76 input Major :6. Minor :4. BldgNum:3. LivingUnits :1. Address :$60. 77 HouseNum :$5. Fraction :$3. DirPrefix :$2. StName :$25. StType :$6. 78 DirSuffix :$2. ZipCode :$10. Stories :3. BldgGrade :2. BldgGradeVar :2. 79 SqFtFirstFl :5. SqFtHalfFl :5. SqFtSecondFl :5. SqFtUpperFl :5. 80 SqFtUnfinishedFull :5. SqFtUnfinishedHalf :5. SqFtTotalLiving :5. 81 SqFtTotalBsmt :5. SqFtFinishedBsmt :5. FinBsmtGrade :2. SqFtGarageBsmt :5. 82 SqFtGarageAtt :5. DaylightBsmt :$1. SqFtOpenPorch :5. SqFtEnclPorch :5. 83 SqFtDeck :5. HeatSystem :2. HeatSource :2. PctBrickStone :3. ViewUtil :$1. 84 Bedrooms :2. HalfBath :2. ThreeQtrBath :2. FullBath :2. FirePlSingle :2. 85 FireplMult :2. FireplFreestanding :2. FireplAddl :2. YearBuilt :$4. 86 YearRenovated :$4. PctComplete :3. Obsolescence :3. PctNetCond :3. 87 Condition :1. AddlCost :6.; 88 run; NOTE: The infile "/folders/myfolders/capstone/EXTR_ResBldg.csv" is: Filename=/folders/myfolders/capstone/EXTR_ResBldg.csv, Owner Name=root,Group Name=vboxsf, Access Permission=-rwxrwx---, Last Modified=23Apr2020:15:33:03, File Size (bytes)=149112131 NOTE: 514596 records were read from the infile "/folders/myfolders/capstone/EXTR_ResBldg.csv". The minimum record length was 242. The maximum record length was 321. NOTE: The data set CAPSTONE.HOUSES has 514596 observations and 50 variables. NOTE: DATA statement used (Total process time): real time 15.90 seconds cpu time 8.84 seconds 89 90 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK; 102
Here are the first three lines of the source file.
"Major","Minor","BldgNbr","NbrLivingUnits","Address","BuildingNumber","Fraction","DirectionPrefix","StreetName","StreetType","DirectionSuffix","ZipCode","Stories","BldgGrade","BldgGradeVar","SqFt1stFloor","SqFtHalfFloor","SqFt2ndFloor","SqFtUpperFloor","SqFtUnfinFull","SqFtUnfinHalf","SqFtTotLiving","SqFtTotBasement","SqFtFinBasement","FinBasementGrade","SqFtGarageBasement","SqFtGarageAttached","DaylightBasement","SqFtOpenPorch","SqFtEnclosedPorch","SqFtDeck","HeatSystem","HeatSource","BrickStone","ViewUtilization","Bedrooms","BathHalfCount","Bath3qtrCount","BathFullCount","FpSingleStory","FpMultiStory","FpFreestanding","FpAdditional","YrBuilt","YrRenovated","PcntComplete","Obsolescence","PcntNetCondition","Condition","AddnlCost" "042504","9127","1","1","7709 30TH AVE NE 98115","7709 "," "," ","30TH ","AVE ","NE","98115","1","8","0","1380","0","0","0","0","0","2180","1380","800","7","0","0"," ","0","0","370","5","1","0"," ","4","0","2","1","0","1","0","1","1956","0","0","0","0","3","0" "042504","9140","1","1","6540 27TH AVE NE 98115","6540 "," "," ","27TH ","AVE ","NE","98115","1","7","0","1030","0","0","0","0","0","1450","1030","420","6","240","0","N","0","0","0","5","1","100","N","3","0","1","1","0","1","0","1","1950","0","0","0","0","3","0"
Here is the results from the log:
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK; 72 73 74 data capstone.houses; 75 infile "/folders/myfolders/capstone/EXTR_ResBldg.csv" dsd delimiter="," 76 missover firstobs=2; 77 input Major :6. Minor :4. BldgNum:3. LivingUnits :1. Address :$60. 78 HouseNum :$5. Fraction :$3. DirPrefix :$2. StName :$25. StType :$6. 79 DirSuffix :$2. ZipCode :$10. Stories :3. BldgGrade :2. BldgGradeVar :2. 80 SqFtFirstFl :5. SqFtHalfFl :5. SqFtSecondFl :5. SqFtUpperFl :5. 81 SqFtUnfinishedFull :5. SqFtUnfinishedHalf :5. SqFtTotalLiving :5. 82 SqFtTotalBsmt :5. SqFtFinishedBsmt :5. FinBsmtGrade :2. SqFtGarageBsmt :5. 83 SqFtGarageAtt :5. DaylightBsmt :$1. SqFtOpenPorch :5. SqFtEnclPorch :5. 84 SqFtDeck :5. HeatSystem :2. HeatSource :2. PctBrickStone :3. ViewUtil :$1. 85 Bedrooms :2. HalfBath :2. ThreeQtrBath :2. FullBath :2. FirePlSingle :2. 86 FireplMult :2. FireplFreestanding :2. FireplAddl :2. YearBuilt :$4. 87 YearRenovated :$4. PctComplete :3. Obsolescence :3. PctNetCond :3. 88 Condition :1. AddlCost :6.; 89 run; NOTE: The infile "/folders/myfolders/capstone/EXTR_ResBldg.csv" is: Filename=/folders/myfolders/capstone/EXTR_ResBldg.csv, Owner Name=root,Group Name=vboxsf, Access Permission=-rwxrwx---, Last Modified=23Apr2020:15:33:03, File Size (bytes)=149112131 NOTE: 514596 records were read from the infile "/folders/myfolders/capstone/EXTR_ResBldg.csv". The minimum record length was 242. The maximum record length was 321. NOTE: The data set CAPSTONE.HOUSES has 514596 observations and 50 variables. NOTE: DATA statement used (Total process time): real time 10.32 seconds cpu time 6.97 seconds 90 91 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK; 103
I can see why it might look like the variables are not aligned correctly, but that is not the case. In the screenshot, it is just a case of mutual exclusivity. As far as the data itself, I think my previous post may have been removed because it contains address data. I will add a slightly scrubbed version in a separate comment.
Here are the first few rows of a scrubbed version of the CSV file:
"Major","Minor","BldgNbr","NbrLivingUnits","Address","BuildingNumber","Fraction","DirectionPrefix","StreetName","StreetType","DirectionSuffix","ZipCode","Stories","BldgGrade","BldgGradeVar","SqFt9stFloor","SqFtHalfFloor","SqFt9ndFloor","SqFtUpperFloor","SqFtUnfinFull","SqFtUnfinHalf","SqFtTotLiving","SqFtTotBasement","SqFtFinBasement","FinBasementGrade","SqFtGarageBasement","SqFtGarageAttached","DaylightBasement","SqFtOpenPorch","SqFtEnclosedPorch","SqFtDeck","HeatSystem","HeatSource","BrickStone","ViewUtilization","Bedrooms","BathHalfCount","Bath9qtrCount","BathFullCount","FpSingleStory","FpMultiStory","FpFreestanding","FpAdditional","YrBuilt","YrRenovated","PcntComplete","Obsolescence","PcntNetCondition","Condition","AddnlCost" "099909","9999","9","9","9909 90TH AVE NE 99999","9909 "," "," ","90TH ","AVE ","NE","99999","9","9","0","9990","0","0","0","0","0","9990","9990","900","9","0","0"," ","0","0","990","9","9","0"," ","9","0","9","9","0","9","0","9","9999","0","0","0","0","9","0" "099909","9990","9","9","9990 99TH AVE NE 99999","9990 "," "," ","99TH ","AVE ","NE","99999","9","9","0","9090","0","0","0","0","0","9990","9090","990","9","990","0","N","0","0","0","9","9","900","N","9","0","9","9","0","9","0","9","9990","0","0","0","0","9","0" "099909","9099","9","9","9909 999ST AVE NE 99099","9909 "," "," ","999ST ","AVE ","NE","99099","9","9","0","9990","0","9990","0","0","0","9900","0","0","0","0","990"," ","990","0","0","9","9","0"," ","9","9","0","9","0","9","0","0","9999","0","0","0","0","9","0" "099909","9090","9","9","99999 NE 99TH ST ","99999"," ","NE","99TH ","ST "," ","","9","9","0","9990","0","9990","0","0","0","9090","0","0","0","0","990"," ","990","0","0","9","9","0"," ","9","9","0","9","9","0","0","0","9099","0","0","0","0","9","0"
I don't see anything wrong with those lines.
Can you explain what your issue is? Does one of those lines have the issue? For which variable? What does the SAS variable have? What should it have?
Each one has exactly 50 values.
67 data _null_; 68 input; 69 nwords=countw(_infile_,',','mq'); 70 put (_n_ nwords) (=); 71 cards4; _N_=1 nwords=50 _N_=2 nwords=50 _N_=3 nwords=50 _N_=4 nwords=50 _N_=5 nwords=50
But they do have a lot more characters on each line than they need to show that information. You only need quotes when the value contains the delimiter (comma in this case) or quotes. You don't need any trailing spaces.
If I read them in field by field and write them back out I get a file list this:
Major,Minor,BldgNbr,NbrLivingUnits,Address,BuildingNumber,Fraction,DirectionPrefix,StreetName,StreetType,DirectionSuffix,ZipCode,Stories,BldgGrade,BldgGradeVar,SqFt9stFloor,SqFtHalfFloor,SqFt9ndFloor,SqFtUpperFloor,SqFtUnfinFull,SqFtUnfinHalf,SqFtTotLiving 099909,9999,9,9,9909 90TH AVE NE 99999,9909,,,90TH,AVE,NE,99999,9,9,0,9990,0,0,0,0,0,9990,9990,900,9,0,0,,0,0,990,9,9,0,,9,0,9,9,0,9,0,9,9999,0,0,0,0,9,0 099909,9990,9,9,9990 99TH AVE NE 99999,9990,,,99TH,AVE,NE,99999,9,9,0,9090,0,0,0,0,0,9990,9090,990,9,990,0,N,0,0,0,9,9,900,N,9,0,9,9,0,9,0,9,9990,0,0,0,0,9,0 099909,9099,9,9,9909 999ST AVE NE 99099,9909,,,999ST,AVE,NE,99099,9,9,0,9990,0,9990,0,0,0,9900,0,0,0,0,990,,990,0,0,9,9,0,,9,9,0,9,0,9,0,0,9999,0,0,0,0,9,0 099909,9090,9,9,99999 NE 99TH ST,99999,,NE,99TH,ST,,,9,9,0,9990,0,9990,0,0,0,9090,0,0,0,0,990,,990,0,0,9,9,0,,9,9,0,9,9,0,0,0,9099,0,0,0,0,9,0
I am reading in the file OK. I did not add the quotes, they were from the raw CSV file that I am reading in. I am getting variable lengths from the data owner, so I assume that at some point in the 500000+ rows, there is a variable that length.The problem is that variables that only have empty strings are not showing as a missing value and I need them to. How can I strip these so that I get the missing values and I can identify? Is the issue the quotes? Should I remove them with a text editor before reading in?
The problem is that variables that only have empty strings are not showing as a missing value and I need them to.
You keep saying that, but you have not defined what it means.
Create a small example with just 3 or 4 variables and 4 or 5 lines of data. Like this:
data test;
infile cards dsd truncover;
input a :$20. b :$20. x y ;
cards;
"No missing","B",1,2
"B missing",,3,4
"X missing","B3",,5
"Y missing","B4",6,
;
proc print;
run;
Obs a b x y 1 No missing B 1 2 2 B missing 3 4 3 X missing B3 . 5 4 Y missing B4 6 .
Can you create a simple example that has one of these empty strings that are not considered missing?
For the numeric values, I can run
proc means data=work.test nmiss; run;
and it returns 1 for x and y, which is the desired output for those variables, but it does not address b, because b is a character value. I need to run proc freq instead, but so much of the data is unique, and I only want the part of the output that has Frequency Missing = X. How can I modify the below statement so it works like nmiss on the proc means?
proc freq data=test; run;
Well that is a different question. And a common one.
Here is a link to an example question with a solution.
https://communities.sas.com/t5/SAS-Programming/Proc-Freq-Missing-Values-all-Fields/td-p/387784
Run the following and post the log and output please.
*Formats for missing for character and numeric variables;
proc format;
value $ missfmt ' '="Missing"
other="Not Missing"
;
value nmissfmt . ="Missing"
other="Not Missing"
;
run;
*Run frequency for tables;
ods select none;
ods table onewayfreqs=temp;
proc freq data=capstone.houses;
format _character_ $missfmt.;
format _numeric_ nmissfmt.;
table _all_;
run;
ods select all;
*Format output;
data want;
length variable $32. variable_value $50.;
set temp;
Variable=scan(table, 2);
Variable_Value=strip(trim(vvaluex(variable)));
keep variable variable_value frequency percent cum:;
label variable='Variable'
variable_value='Variable Value';
run;
*Display;
proc print data=want label;
run;
Here are the results of the code:
*Run frequency for tables;
ods select none;
ods table onewayfreqs=temp;
proc freq data=capstone.houses;
format _character_ $missfmt.;
format _numeric_ nmissfmt.;
table _all_;
run;
ods select all;
*Format output;
data want;
length variable $32. variable_value $50.;
set temp;
Variable=scan(table, 2);
Variable_Value=strip(trim(vvaluex(variable)));
keep variable variable_value frequency percent cum:;
label variable='Variable'
variable_value='Variable Value';
run;
*Display;
proc print data=want label;
run;
This is just what I needed. Thank you.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.