I have realised some strange problems with a command which should be quite simple, proc format.
The problem is as follows.
My data is like
A$ B
. 3
4
e .
I have as seen above both missing data for the character variable, A$, shich is given by "." and " "
I want to use proc format to change the format so the data looks like this
A$ B
. other
other
other .
It is because I then can make a proc freq which gives me
. | other | |
---|---|---|
Somebody who can help me?
Best regards, Silas, Aarhus University
Not sure why you would want to do that, but one way you could do it is by using something like:
proc format;
value $cmiss
'b.'='.'
' '=' '
other='other'
;
value nmiss
.=.
other='other'
;
run;
options missing=' ';
data have;
infile cards truncover;
input @;
_infile_=tranwrd(_infile_,'.','b.');
format a $cmiss. b nmiss.;
input a $2. b;
cards;
. 3
4
e
;
Hey Arthur.
Thanks for the fast answer, i see it works for with your code, but unfortunate it still not works with my original.
I have a data set with the variables YEAR and AKASSE. AKASSE does have codes " ", "." and other numbers.
When I only use the proc format of yours I only get " " and "other" in my proc freq results and "." are included in "other".
Do you understand?
So how can I use the proc format properly on the original data set?
Again thanks!
I don't understand what your data look like. You will probably have to provide an example in the form of a datastep.
If you are saying that you have both blank and . values for both numeric and character variables, then you might be able to achieve what you want with something like:
proc format;
value $cmiss
'.b'='.'
' '=' '
other='other'
;
value nmiss
.b=' '
.=.
other='other'
;
run;
options missing=' ';
data have;
infile cards truncover;
input @;
_infile_=tranwrd(_infile_,'.','.b');
format a $cmiss. b nmiss.;
input a $2. b;
cards;
. 3
4
e
f .
;
I have a dataset as shown in the my first question, just with a lot more observations and with a variable YEAR which shall not be affected by the format.
then I have tried this code inspired by you:
proc format;
value $cmiss
'b.'='.'
' '=' '
other='other'
;
value nmiss
.=.
other='other'
;
run;
proc freq data=XXX;
tables year*A / nocol nopercent missing;
format _numeric_ nmiss. _character_ $cmiss. year;
run;
but this just gives me that A only consist of " " and "other", the "." is included in "other".
Do you understand? I would like to know how to continue after the proc format step given that I have a data set with the name XXXX and the three variables YEAR, A, B?
Best //
Btw the reason I need to do is because when the danish statstics publics new data there is the problem that for character-variables it happens that missing data for some years are given "." and for other years are given " ". So at the institute of Economy in Aarhus we need to find out what years missing data is given the one or the other value.
Does the following do what you need?:
proc format;
value $cmiss
'.b'='.'
' '=' '
other='other'
;
value nmiss
.b=' '
.=.
other='other'
;
run;
options missing='.';
data xxx;
infile cards truncover;
input @;
_infile_=tranwrd(_infile_,'.','.b');
input year a $2. b;
cards;
2009 . 3
2010 4
2011 e
2012 f .
;
proc freq data=XXX;
tables year*A / nocol nopercent missing;
format _numeric_ nmiss. _character_ $cmiss. year;
run;
Your requirements are nearly clear, but ...
If A and B fall into the "other" category, do you need them broken down by year? If not, you might consider a new variable entirely:
if B=. then B_new = year;
else b_new=B;
Then run the PROC FREQ on B_new.
For A, it's a little trickier, since you have to make sure A has a length of at least 4. Assuming it does:
if A in (' ', '.') then A_new = put(year,4.);
else A_new = A;
Then run the PROC FREQ on A_new.
Depends on what you are trying to find out. And if you can afford to run multiple PROC FREQs, one for each variable, skip all the complexity:
proc freq data=have;
tables year;
where B=.;
run;
proc freq data=have;
tables year;
where A in (' ', '.');
run;
Of course, all of this is dependent on whether I'm interpreting the problem correctly or not.
Good luck.
It works fine for me (see below).
Do you want the blanks and periods to be distinct categories?
Did you remember to use the MISSING option in the PROC FREQ code?
Are you sure that the periods are getting into the SAS dataset? If you are reading from a text file SAS normally converts '.' to ' '. You can prevent this by using $CHAR informat instead of $ informat.
data test;
a=' '; output;
a='.'; output;
a='X'; output;
run;
proc format ;
value $a ' '=' ' '.'='.' other='Z';
run;
proc freq ;
tables a / missing ;
format a $a.;
run;
The FREQ Procedure
Cumulative Cumulative
a Frequency Percent Frequency Percent
------------------------------------------------------
1 33.33 1 33.33
. 1 33.33 2 66.67
Z 1 33.33 3 100.00
Hey Tom.
Your program does also work for me with the data set you make.
But unfortunate - and I do not understand why - it still does not work when I use the program on my original data set.
To show my problem more specific below I have uploaded my problemativ results:
This is the result when I do proc freq on original data
Then I use your format and I then have this result:
Which is clearly wrong.
Do you have any idea on how it fails. I need three categories 1: '.' 2: ' ' 3: 'other'
And it works perfect for me when I use your data set?
Thank again! Best / Silas
Thanks for help everybody... Problem seems to be solved now! :smileygrin:
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.