Hey.
I have a data set with character variables. The values for the variable are numbers, ' ' (empty, as character missings should be), but there are also other kinds of missing values (eg. '...', '.. .' ' .')
I would like to make a format which for these different values take changes the values to : number --> 'numbers' and let all observations which are not numbers stay the same.
That will make it possible for me to find the different kind of missing values.
I hope you understand and have a good idea to how to create this format.
If you have questions, just ask
Best regards, Silas Skovsbo, Aarhus University, institute of economics.
If I correctly understand what you are trying to accomplish, here is another approach:
data test;
input (x y) ($);
cards;
1z400 this1
4 45
7 .
8isnt 0900
;
proc sql noprint;
select catt("_",name,"=input(",name,",?? 32.);",
"if not missing(_",name,") then do;",
name,"='numb';end;") into :missings
separated by " "
from dictionary.columns
where libname="WORK" and
memname="TEST"
;
quit;
data want (drop=_:);
set test;
&missings.
run;
proc freq data=want;
tables _all_;
run;
I'm not quite following you.
A brief example with example data, and desired output would help.
An example:
For two character variables, X and Y:
X Y
1 22
2 ....
3 . . .
. 32
.. 11
. . 78
Then I want a format which gives me:
X Y
numb numb
numb ....
numb . . .
. numb
.. numb
. . numb
Is this what you need?
data have;
input x $ ;
cards;
a
4
5
6
.
..
.
...
.
4
4
;
proc format;
value $ fmt 0-99999 = "Number"
;
data want;
set have;
format x $fmt.;
run;
proc freq data = want;table x;run;
So in this way, if you do a frequency you get what you want.
(given that I understood what you need)
Let me know if you have any questions.
Good luck!
Thank you!
That solved the problem
so easy when you know the code!
If your variable is character then the ranges you propose will include many non numbers.
For example 1ABC would be in the range '0' - '9999999' and is not a valid number.
Agreed.
BUT
My understanding was that silasskovsbo wanted to focus on the missing data (basically wanted to find out how many ways the missing data was coded in).
If that's not indeed correct then the PROC FORMAT needs to be modified accordingly.
Cheers!
I would like to suggest using the NOTDIGIT function which would be easier than creating a format.
Thank you for your suggestions.
As Tom mentions, it is a problem if a value as '1ABC' is in the interval '0'-'9999999999' since I want only pure numbers to be formatted, so I am able to find all the observation which differ and therefor is missing.
How can I then format all values which is a number, but not all other kinds of values, eg. '1ABC', '...', ' .' ?
Nondigit function?
I'd like to suggest that the logic from https://communities.sas.com/message/145163#145163 makes sense in this context. Let SAS figure out if it can read it as a number, otherwise it remains unchanged. Note the two exceptions; if the string is a single period, SAS will convert it to character missing (all blank), and of course if the string is "NUMBER", it needs to be converted so it won't be lumped in with the actual numbers.
Tom
data have;
length X Y $80;
infile datalines dlm='09'x;
input X Y; /* tab delimited, to retain blanks in the values */
datalines;
1 22
2 ....
NUMBER 51
3 . . .
. 32
12 NUMBER
.. 11
. . 78
run;
data want;
length X2 Y2 $80;
set have;
if X = "" then
X = "MISSING";
if X = "NUMBER" then
X = "'NUMBER'";
if Y = "" then
Y = "MISSING";
if Y = "NUMBER" then
Y = "'NUMBER'";
if input(X, ??16.) > . then
X2 = "NUMBER";
else X2 = X;
if input(Y, ??16.) > . then
Y2 = "NUMBER";
else Y2 = Y;
run;
proc freq data = want;
table X2 Y2;
run;
ok,all, I think here is a simpler code using the not alpha numeric function.
/*NOTALNUM
Purpose: To determine the position of the first character in a string that is not an alphanumeric (any upper- or lowercase letter or a number). If none is found, the function returns a 0. With the use of an optional parameter, this function can begin searching at any position in the string and can also search from right to left, if desired.
*/
data have;
input x $ ;
cards;
1a
1234g
3h
j6
4
5
6
.
..
.
...
.
4
4
;
data want;
length value $10.;
set have;
temp = NOTALNUM(x);
if temp = 1 then value = x;
else if temp = 2 then value = "Number";
else value = "Non-Number";
run;
If I correctly understand what you are trying to accomplish, here is another approach:
data test;
input (x y) ($);
cards;
1z400 this1
4 45
7 .
8isnt 0900
;
proc sql noprint;
select catt("_",name,"=input(",name,",?? 32.);",
"if not missing(_",name,") then do;",
name,"='numb';end;") into :missings
separated by " "
from dictionary.columns
where libname="WORK" and
memname="TEST"
;
quit;
data want (drop=_:);
set test;
&missings.
run;
proc freq data=want;
tables _all_;
run;
Thank you for all the great commentaries. Unfortunate it is still not really doing.
The program needs to automatically do it on all character variables, since there are a huge amount of variables.
I have tried your program, Arthur. But something seems to go wrong?? for the '&missing.' it says statement is not valid??
I think Art is on the right track here.
Within a simple datastep a statement like
if not notdigit(compress(your_char_var,' .')) and
not missing(input(your_char_var,?? 8.)) then
your_char_var = compress("'"!!your_char_var!!"'") ;
should do. (Note that the ! exclamation marks are the concatenation operators in this germany based installation. Note also my use of the compress function's second argument to cater for numbers with a decimal point. Scrap that if your numbers of interest are all integers.)
This was my test setup (where I assumed that scientific notation is not used and where it appears in your data purely coincidental) :
data x ;
input orig_var $char8. @1 num_read ?? 8. ; /* read both as char and num */
miss_char = missing(orig_var) ;
miss_num = missing(input(orig_var,?? 8.)) ;
not_digit = notdigit(compress(orig_var,' .')) ; /* used to identify scientific notation */
if not miss_num and not not_digit then
orig_var = compress("'"!!orig_var!!"'") ;
cards ;
a
1a
12
123
1e5
8 8
123.45
.
..
;
run ;
proc print data=x ;
run ;
Thank you. This returns a table showing the type of each observation.
But I need a format that does the below: (convert from have --> want)
have
1abc
bc2
...
1234
want
1abc
bc2
...
number
-----------------------------------
By this I will be able to find all the data in a data set with character variables which is not numbers.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.