BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
AForrest
Fluorite | Level 6

I have a large CSV file with a lot of variables (100+). I have read them in with a data step, but all of the missing values are reading in as empty strings. 

For example:

proc means data=mydata nmiss;
run;

returns 0 for every variable, which is not the case. How can I read in these empty strings (which may be 1 or more spaces) in as missing in the data step for every variable? I have tried a few things, but nothing even comes close enough to mention. I have seen options to replace missing values with strings, but no documentation to go the other way. Do I need to address each variable individually, or is there a quicker way?

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

Run the following and post the log and output please. 

 

	*Formats for missing for character and numeric variables;
	proc format;
value $ missfmt ' '="Missing"
other="Not Missing"
;
value nmissfmt . ="Missing"
other="Not Missing"
;
run;

*Run frequency for tables;
ods select none;
ods table onewayfreqs=temp;
proc freq data=capstone.houses;
format _character_ $missfmt.;
format _numeric_ nmissfmt.;
table _all_;
run;
ods select all;

*Format output;
data want;
length variable $32. variable_value $50.;
set temp;
Variable=scan(table, 2);

Variable_Value=strip(trim(vvaluex(variable)));

keep variable variable_value frequency percent cum:;
label variable='Variable' 
	variable_value='Variable Value';
run;

*Display;
proc print data=want label;
run;

View solution in original post

15 REPLIES 15
Tom
Super User Tom
Super User

Your probably need to show some of the lines of the CSV file (as TEXT, not photos and definitely NOT after it has been manipulated by Excel or some other spreadsheet type program).  Also show the data step you used to read the file.  Make sure to use the Insert Code or Insert SAS Code button to get a pop-up window to paste the lines of data or code, respectively.

 

Your statement seems confused.  If the value in the CSV file for a field is all spaces (or nothing at all) then a normal input statement will treat that as missing, whether you are reading it into a numeric or character variable.   Since SAS stores character variables as fixed length they are always padded with spaces.  But SAS code will treat a character variable that is all spaces as missing.

 

Note that PROC MEANS will only work on numeric variables (it is impossible to find the mean of a string).

 

You can use the NLEVELS option of PROC FREQ to see how many distinct values (including missing) .

proc freq data=mydata nlevels;
  tables _all_ / noprint;
run;
Reeza
Super User

@AForrest wrote:

I have a large CSV file with a lot of variables (100+). I have read them in with a data step, but all of the missing values are reading in as empty strings. 

[removed....] Do I need to address each variable individually, or is there a quicker way?


Unfortunately this means that the data was not imported in correctly so you need to fix the data step. Start there. Post your code and examples of the issue and we can help you fix the code. Rather than fix it after the fact, fix it at the source. 

AForrest
Fluorite | Level 6

Here is my first data step to read in the CSV file.

data library.houses;
	infile "/folders/myfolders/capstone/EXTR_ResBldg.csv" dsd delimiter="," 
		missover firstobs=2;
	input Major :6. Minor :4. BldgNum:3. LivingUnits :1. Address :$60. 
		HouseNum :$5. Fraction :$3. DirPrefix :$2. StName :$25. StType :$6. 
		DirSuffix :$2. ZipCode :$10. Stories :3. BldgGrade :2. BldgGradeVar :2. 
		SqFtFirstFl :5. SqFtHalfFl :5. SqFtSecondFl :5. SqFtUpperFl :5. 
		SqFtUnfinishedFull :5. SqFtUnfinishedHalf :5. SqFtTotalLiving :5. 
		SqFtTotalBsmt :5. SqFtFinishedBsmt :5. FinBsmtGrade :2. SqFtGarageBsmt :5. 
		SqFtGarageAtt :5. DaylightBsmt :$1. SqFtOpenPorch :5. SqFtEnclPorch :5. 
		SqFtDeck :5. HeatSystem :2. HeatSource :2. PctBrickStone :3. ViewUtil :$1. 
		Bedrooms :2. HalfBath :2. ThreeQtrBath :2. FullBath :2. FirePlSingle :2. 
		FireplMult :2. FireplFreestanding :2. FireplAddl :2. YearBuilt :$4. 
		YearRenovated :$4. PctComplete :3. Obsolescence :3. PctNetCond :3. 
		Condition :1. AddlCost :6.;
run;


There are technically three CSV files that I am joining together later in the process, but the steps to read in each one are essentially the same. I am reading in the CSV files as is to the library, and then in a separate file, I use different data steps to bring the dataset into WORK. I did this to avoid extra processing time because each CSV file is so large and it only needs to be done once. The second data steps are relatively straightforward. Here is a quick snip of some examples of fully empty cells that are not reading as empty from the first data step. . 

 

AForrest_0-1588433677197.png

 

Tom
Super User Tom
Super User

Can you show the lines of the CSV file that correspond to the photograph of whatever (spreadsheet? dataset viewer?) that you posted.

Also show the lines from the SAS log for the data step you posted.  As text, not another photograph.

 

From the photograph I am assuming that your issue is not about missing values at all, but about proper alignment in reading the data.  It looks  in some lines the values you expected to appear in say the 10th variable are being read into the 8th or 9th variable instead?  

The notes from the SAS log will help tell if the issue is embedded line breaks.  So instead of one line with 100 values you have two lines. One where there are only say 70 values and another where the first value 70 values are just not there.

AForrest
Fluorite | Level 6

Sure! Here is the log:

1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 72         
 73         data capstone.houses;
 74         infile "/folders/myfolders/capstone/EXTR_ResBldg.csv" dsd delimiter=","
 75         missover firstobs=2;
 76         input Major :6. Minor :4. BldgNum:3. LivingUnits :1. Address :$60.
 77         HouseNum :$5. Fraction :$3. DirPrefix :$2. StName :$25. StType :$6.
 78         DirSuffix :$2. ZipCode :$10. Stories :3. BldgGrade :2. BldgGradeVar :2.
 79         SqFtFirstFl :5. SqFtHalfFl :5. SqFtSecondFl :5. SqFtUpperFl :5.
 80         SqFtUnfinishedFull :5. SqFtUnfinishedHalf :5. SqFtTotalLiving :5.
 81         SqFtTotalBsmt :5. SqFtFinishedBsmt :5. FinBsmtGrade :2. SqFtGarageBsmt :5.
 82         SqFtGarageAtt :5. DaylightBsmt :$1. SqFtOpenPorch :5. SqFtEnclPorch :5.
 83         SqFtDeck :5. HeatSystem :2. HeatSource :2. PctBrickStone :3. ViewUtil :$1.
 84         Bedrooms :2. HalfBath :2. ThreeQtrBath :2. FullBath :2. FirePlSingle :2.
 85         FireplMult :2. FireplFreestanding :2. FireplAddl :2. YearBuilt :$4.
 86         YearRenovated :$4. PctComplete :3. Obsolescence :3. PctNetCond :3.
 87         Condition :1. AddlCost :6.;
 88         run;
 
 NOTE: The infile "/folders/myfolders/capstone/EXTR_ResBldg.csv" is:
       Filename=/folders/myfolders/capstone/EXTR_ResBldg.csv,
       Owner Name=root,Group Name=vboxsf,
       Access Permission=-rwxrwx---,
       Last Modified=23Apr2020:15:33:03,
       File Size (bytes)=149112131
 
 NOTE: 514596 records were read from the infile "/folders/myfolders/capstone/EXTR_ResBldg.csv".
       The minimum record length was 242.
       The maximum record length was 321.
 NOTE: The data set CAPSTONE.HOUSES has 514596 observations and 50 variables.
 NOTE: DATA statement used (Total process time):
       real time           15.90 seconds
       cpu time            8.84 seconds
       
 
 89         
 90         OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 102        

Here are the first three lines of the source file. 

"Major","Minor","BldgNbr","NbrLivingUnits","Address","BuildingNumber","Fraction","DirectionPrefix","StreetName","StreetType","DirectionSuffix","ZipCode","Stories","BldgGrade","BldgGradeVar","SqFt1stFloor","SqFtHalfFloor","SqFt2ndFloor","SqFtUpperFloor","SqFtUnfinFull","SqFtUnfinHalf","SqFtTotLiving","SqFtTotBasement","SqFtFinBasement","FinBasementGrade","SqFtGarageBasement","SqFtGarageAttached","DaylightBasement","SqFtOpenPorch","SqFtEnclosedPorch","SqFtDeck","HeatSystem","HeatSource","BrickStone","ViewUtilization","Bedrooms","BathHalfCount","Bath3qtrCount","BathFullCount","FpSingleStory","FpMultiStory","FpFreestanding","FpAdditional","YrBuilt","YrRenovated","PcntComplete","Obsolescence","PcntNetCondition","Condition","AddnlCost"
"042504","9127","1","1","7709   30TH AVE NE  98115","7709 ","   ","  ","30TH                     ","AVE   ","NE","98115","1","8","0","1380","0","0","0","0","0","2180","1380","800","7","0","0"," ","0","0","370","5","1","0"," ","4","0","2","1","0","1","0","1","1956","0","0","0","0","3","0"
"042504","9140","1","1","6540   27TH AVE NE  98115","6540 ","   ","  ","27TH                     ","AVE   ","NE","98115","1","7","0","1030","0","0","0","0","0","1450","1030","420","6","240","0","N","0","0","0","5","1","100","N","3","0","1","1","0","1","0","1","1950","0","0","0","0","3","0"

 

 

 

AForrest
Fluorite | Level 6

Here is the results from the log:

 

1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 72         
 73         
 74         data capstone.houses;
 75         infile "/folders/myfolders/capstone/EXTR_ResBldg.csv" dsd delimiter=","
 76         missover firstobs=2;
 77         input Major :6. Minor :4. BldgNum:3. LivingUnits :1. Address :$60.
 78         HouseNum :$5. Fraction :$3. DirPrefix :$2. StName :$25. StType :$6.
 79         DirSuffix :$2. ZipCode :$10. Stories :3. BldgGrade :2. BldgGradeVar :2.
 80         SqFtFirstFl :5. SqFtHalfFl :5. SqFtSecondFl :5. SqFtUpperFl :5.
 81         SqFtUnfinishedFull :5. SqFtUnfinishedHalf :5. SqFtTotalLiving :5.
 82         SqFtTotalBsmt :5. SqFtFinishedBsmt :5. FinBsmtGrade :2. SqFtGarageBsmt :5.
 83         SqFtGarageAtt :5. DaylightBsmt :$1. SqFtOpenPorch :5. SqFtEnclPorch :5.
 84         SqFtDeck :5. HeatSystem :2. HeatSource :2. PctBrickStone :3. ViewUtil :$1.
 85         Bedrooms :2. HalfBath :2. ThreeQtrBath :2. FullBath :2. FirePlSingle :2.
 86         FireplMult :2. FireplFreestanding :2. FireplAddl :2. YearBuilt :$4.
 87         YearRenovated :$4. PctComplete :3. Obsolescence :3. PctNetCond :3.
 88         Condition :1. AddlCost :6.;
 89         run;
 
 NOTE: The infile "/folders/myfolders/capstone/EXTR_ResBldg.csv" is:
       Filename=/folders/myfolders/capstone/EXTR_ResBldg.csv,
       Owner Name=root,Group Name=vboxsf,
       Access Permission=-rwxrwx---,
       Last Modified=23Apr2020:15:33:03,
       File Size (bytes)=149112131
 
 NOTE: 514596 records were read from the infile "/folders/myfolders/capstone/EXTR_ResBldg.csv".
       The minimum record length was 242.
       The maximum record length was 321.
 NOTE: The data set CAPSTONE.HOUSES has 514596 observations and 50 variables.
 NOTE: DATA statement used (Total process time):
       real time           10.32 seconds
       cpu time            6.97 seconds
       
 
 90         
 91         OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 103   

I can see why it might look like the variables are not aligned correctly, but that is not the case. In the screenshot, it is just a case of mutual exclusivity. As far as the data itself, I think my previous post may have been removed because it contains address data. I will add a slightly scrubbed version in a separate comment. 

AForrest
Fluorite | Level 6

Here are the first few rows of a scrubbed version of the CSV file:

"Major","Minor","BldgNbr","NbrLivingUnits","Address","BuildingNumber","Fraction","DirectionPrefix","StreetName","StreetType","DirectionSuffix","ZipCode","Stories","BldgGrade","BldgGradeVar","SqFt9stFloor","SqFtHalfFloor","SqFt9ndFloor","SqFtUpperFloor","SqFtUnfinFull","SqFtUnfinHalf","SqFtTotLiving","SqFtTotBasement","SqFtFinBasement","FinBasementGrade","SqFtGarageBasement","SqFtGarageAttached","DaylightBasement","SqFtOpenPorch","SqFtEnclosedPorch","SqFtDeck","HeatSystem","HeatSource","BrickStone","ViewUtilization","Bedrooms","BathHalfCount","Bath9qtrCount","BathFullCount","FpSingleStory","FpMultiStory","FpFreestanding","FpAdditional","YrBuilt","YrRenovated","PcntComplete","Obsolescence","PcntNetCondition","Condition","AddnlCost"
"099909","9999","9","9","9909   90TH AVE NE  99999","9909 ","   ","  ","90TH                     ","AVE   ","NE","99999","9","9","0","9990","0","0","0","0","0","9990","9990","900","9","0","0"," ","0","0","990","9","9","0"," ","9","0","9","9","0","9","0","9","9999","0","0","0","0","9","0"
"099909","9990","9","9","9990   99TH AVE NE  99999","9990 ","   ","  ","99TH                     ","AVE   ","NE","99999","9","9","0","9090","0","0","0","0","0","9990","9090","990","9","990","0","N","0","0","0","9","9","900","N","9","0","9","9","0","9","0","9","9990","0","0","0","0","9","0"
"099909","9099","9","9","9909   999ST AVE NE  99099","9909 ","   ","  ","999ST                    ","AVE   ","NE","99099","9","9","0","9990","0","9990","0","0","0","9900","0","0","0","0","990"," ","990","0","0","9","9","0"," ","9","9","0","9","0","9","0","0","9999","0","0","0","0","9","0"
"099909","9090","9","9","99999  NE 99TH ST   ","99999","   ","NE","99TH                     ","ST    ","  ","","9","9","0","9990","0","9990","0","0","0","9090","0","0","0","0","990"," ","990","0","0","9","9","0"," ","9","9","0","9","9","0","0","0","9099","0","0","0","0","9","0"
Tom
Super User Tom
Super User

I don't see anything wrong with those lines.

Can you explain what your issue is?  Does one of those lines have the issue?  For which variable? What does the SAS variable have?  What should it have?

 

Each one has exactly 50 values.

67    data _null_;
68      input;
69      nwords=countw(_infile_,',','mq');
70      put (_n_ nwords) (=);
71    cards4;

_N_=1 nwords=50
_N_=2 nwords=50
_N_=3 nwords=50
_N_=4 nwords=50
_N_=5 nwords=50

But they do have a lot more characters on each line than they need to show that information.  You only need quotes when the value contains the delimiter (comma in this case) or quotes.  You don't need any trailing spaces.

 

If I read them in field by field and write them back out I get a file list this:

Major,Minor,BldgNbr,NbrLivingUnits,Address,BuildingNumber,Fraction,DirectionPrefix,StreetName,StreetType,DirectionSuffix,ZipCode,Stories,BldgGrade,BldgGradeVar,SqFt9stFloor,SqFtHalfFloor,SqFt9ndFloor,SqFtUpperFloor,SqFtUnfinFull,SqFtUnfinHalf,SqFtTotLiving
099909,9999,9,9,9909   90TH AVE NE  99999,9909,,,90TH,AVE,NE,99999,9,9,0,9990,0,0,0,0,0,9990,9990,900,9,0,0,,0,0,990,9,9,0,,9,0,9,9,0,9,0,9,9999,0,0,0,0,9,0
099909,9990,9,9,9990   99TH AVE NE  99999,9990,,,99TH,AVE,NE,99999,9,9,0,9090,0,0,0,0,0,9990,9090,990,9,990,0,N,0,0,0,9,9,900,N,9,0,9,9,0,9,0,9,9990,0,0,0,0,9,0
099909,9099,9,9,9909   999ST AVE NE  99099,9909,,,999ST,AVE,NE,99099,9,9,0,9990,0,9990,0,0,0,9900,0,0,0,0,990,,990,0,0,9,9,0,,9,9,0,9,0,9,0,0,9999,0,0,0,0,9,0
099909,9090,9,9,99999  NE 99TH ST,99999,,NE,99TH,ST,,,9,9,0,9990,0,9990,0,0,0,9090,0,0,0,0,990,,990,0,0,9,9,0,,9,9,0,9,9,0,0,0,9099,0,0,0,0,9,0
AForrest
Fluorite | Level 6

I am reading in the file OK. I did not add the quotes, they were from the raw CSV file that I am reading in. I am getting variable lengths from the data owner, so I assume that at some point in the 500000+ rows, there is a variable that length.The problem is that variables that only have empty strings are not showing as a missing value and I need them to. How can I strip these so that I get the missing values and I can identify? Is the issue the quotes? Should I remove them with a text editor before reading in?  

Tom
Super User Tom
Super User

The problem is that variables that only have empty strings are not showing as a missing value and I need them to.

You keep saying that, but you have not defined what it means.

 

Create a small example with just 3 or 4 variables and 4 or 5 lines of data. Like this:

data test;
  infile cards dsd  truncover;
  input a :$20. b :$20. x y ;
cards;
"No missing","B",1,2
"B missing",,3,4
"X missing","B3",,5
"Y missing","B4",6,
;
proc print;
run;
Obs        a         b     x    y

 1     No missing    B     1    2
 2     B missing           3    4
 3     X missing     B3    .    5
 4     Y missing     B4    6    .

Can you create a simple example that has one of these empty strings that are not considered missing?

AForrest
Fluorite | Level 6

For the numeric values, I can run 

 

proc means data=work.test nmiss;
run;

and it returns 1 for x and y, which is the desired output for those variables, but it does not address b, because b is a character value. I need to run proc freq instead, but so much of the data is unique, and I only want the part of the output that has Frequency Missing = X. How can I modify the below statement so it works like nmiss on the proc means? 

proc freq data=test;
run;
Tom
Super User Tom
Super User

Well that is a different question. And a common one.

Here is a link to an example question with a solution.

https://communities.sas.com/t5/SAS-Programming/Proc-Freq-Missing-Values-all-Fields/td-p/387784

 

Reeza
Super User

Run the following and post the log and output please. 

 

	*Formats for missing for character and numeric variables;
	proc format;
value $ missfmt ' '="Missing"
other="Not Missing"
;
value nmissfmt . ="Missing"
other="Not Missing"
;
run;

*Run frequency for tables;
ods select none;
ods table onewayfreqs=temp;
proc freq data=capstone.houses;
format _character_ $missfmt.;
format _numeric_ nmissfmt.;
table _all_;
run;
ods select all;

*Format output;
data want;
length variable $32. variable_value $50.;
set temp;
Variable=scan(table, 2);

Variable_Value=strip(trim(vvaluex(variable)));

keep variable variable_value frequency percent cum:;
label variable='Variable' 
	variable_value='Variable Value';
run;

*Display;
proc print data=want label;
run;
AForrest
Fluorite | Level 6

Here are the results of the code:

*Run frequency for tables;
ods select none;
ods table onewayfreqs=temp;
proc freq data=capstone.houses;
format _character_ $missfmt.;
format _numeric_ nmissfmt.;
table _all_;
run;
ods select all;

*Format output;
data want;
length variable $32. variable_value $50.;
set temp;
Variable=scan(table, 2);

Variable_Value=strip(trim(vvaluex(variable)));

keep variable variable_value frequency percent cum:;
label variable='Variable' 
	variable_value='Variable Value';
run;

*Display;
proc print data=want label;
run;

AForrest_0-1588543080487.png

This is just what I needed. Thank you. 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 15 replies
  • 10478 views
  • 2 likes
  • 3 in conversation